Amazon Kinesis Data Firehose is the easiest way to capture and stream data into a data lake built on Amazon S3. This data can be anything—from AWS service logs like AWS CloudTrail log files, Amazon VPC Flow Logs, Application Load Balancer logs, and others. It can also be IoT events, game events, and much more. To efficiently query this data, a time-consuming ETL (extract, transform, and load) process is required to massage and convert the data to an optimal file format, which increases the time to insight. This situation is less than ideal, especially for real-time data that loses its value over time.
To solve this common challenge, Kinesis Data Firehose can now save data to Amazon S3 in Apache Parquet or Apache ORC format. These are optimized columnar formats that are highly recommended for best performance and cost-savings when querying data in S3. This feature directly benefits you if you use Amazon Athena, Amazon Redshift, AWS Glue, Amazon EMR, or any other big data tools that are available from the AWS Partner Network and through the open-source community.
Amazon Connect is a simple-to-use, cloud-based contact center service that makes it easy for any business to provide a great customer experience at a lower cost than common alternatives. Its open platform design enables easy integration with other systems. One of those systems is Amazon Kinesis—in particular, Kinesis Data Streams and Kinesis Data Firehose.
What’s really exciting is that you can now save events from Amazon Connect to S3 in Apache Parquet format. You can then perform analytics using Amazon Athena and Amazon Redshift Spectrum in real time, taking advantage of this key performance and cost optimization. Of course, Amazon Connect is only one example. This new capability opens the door for a great deal of opportunity, especially as organizations continue to build their data lakes.
Amazon Connect includes an array of analytics views in the Administrator dashboard. But you might want to run other types of analysis. In this post, I describe how to set up a data stream from Amazon Connect through Kinesis Data Streams and Kinesis Data Firehose and out to S3, and then perform analytics using Athena and Amazon Redshift Spectrum. I focus primarily on the Kinesis Data Firehose support for Parquet and its integration with the AWS Glue Data Catalog, Amazon Athena, and Amazon Redshift.
Solution overview
Here is how the solution is laid out: agent events flow from Amazon Connect into a Kinesis data stream, Kinesis Data Firehose reads from that stream, converts the events to Parquet, and delivers them to S3, and Athena and Amazon Redshift Spectrum then query the results in place.
The following sections walk you through each of these steps to set up the pipeline.
1. Define the schema
When Kinesis Data Firehose processes incoming events and converts the data to Parquet, it needs to know which schema to apply. This matters because incoming events often contain only some of the expected fields, depending on which values each producer sends. A typical approach is to normalize the schema during a batch ETL job so that you end up with a consistent schema that can easily be understood and queried, but batch processing introduces latency. To avoid that delay, Kinesis Data Firehose requires the schema to be defined in advance.
To see the available columns and structures, see Amazon Connect Agent Event Streams. For the purpose of simplicity, I opted to make all the columns of type String rather than create the nested structures. But you can definitely do that if you want.
The simplest way to define the schema is to create a table in the Amazon Athena console. Open the Athena console and run a CREATE EXTERNAL TABLE statement for the agent event fields, substituting your own S3 bucket and prefix in the LOCATION clause for where your event data will be stored. A Data Catalog database is a logical container that holds the different tables that you can create. The default database should already exist. If it doesn’t, you can create it or use another database that you’ve already created.
That’s all you have to do to prepare the schema for Kinesis Data Firehose.
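If you prefer to define the schema programmatically rather than through the Athena console, a minimal boto3 sketch might look like the following. The database, table, and column names are taken from the queries later in this post; the S3 location and the remaining agent event fields are assumptions you would adjust for your environment.

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# All columns are defined as strings, matching the simplified schema described above.
# Add the remaining agent event fields from the Amazon Connect documentation as needed.
columns = [
    {"Name": "eventid", "Type": "string"},
    {"Name": "eventtype", "Type": "string"},
    {"Name": "eventtimestamp", "Type": "string"},
    {"Name": "currentagentsnapshot", "Type": "string"},
    {"Name": "previousagentsnapshot", "Type": "string"},
]

glue.create_table(
    DatabaseName="default",
    TableInput={
        "Name": "kfhconnectblog",
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": columns,
            "Location": "s3://your-bucket/kfhconnectblog/",  # replace with your bucket and prefix
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)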
2. Define the data streams
Next, you need to define the Kinesis data streams that will be used to stream the Amazon Connect events. Open the Kinesis Data Streams console and create two streams. You can configure them with only one shard each because you don’t have a lot of data right now.
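As a rough sketch, creating the two streams with boto3 might look like this (the stream names are placeholders; use whatever naming fits your environment):

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
stream_names = ["kfhconnect-agent-events", "kfhconnect-ctr-events"]  # placeholder names

# One shard each is plenty for this walkthrough's data volume.
for name in stream_names:
    kinesis.create_stream(StreamName=name, ShardCount=1)

# Stream creation is asynchronous; wait until both streams are ACTIVE.
waiter = kinesis.get_waiter("stream_exists")
for name in stream_names:
    waiter.wait(StreamName=name)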
3. Define the Kinesis Data Firehose delivery stream for Parquet
Let’s configure the Kinesis Data Firehose delivery stream using the data stream as the source and Amazon S3 as the destination. Start by opening the Kinesis Data Firehose console and creating a new delivery stream. Give it a name, and associate it with the Kinesis data stream that you created in Step 2.
As shown in the following screenshot, enable Record format conversion (1) and choose Apache Parquet (2). As you can see, Apache ORC is also supported. Scroll down and provide the AWS Glue Data Catalog database name (3) and table names (4) that you created in Step 1. Choose Next.
To make things easier, the output S3 bucket and prefix fields are automatically populated using the values that you defined in the LOCATION parameter of the create table statement from Step 1. Pretty cool. Additionally, you have the option to save the raw events into another location as defined in the Source record S3 backup section. Don’t forget to add a trailing forward slash (/) to that prefix so that Kinesis Data Firehose creates the date partitions inside it.
On the next page, in the S3 buffer conditions section, there is a note about configuring a large buffer size. The Parquet file format is highly efficient in how it stores and compresses data. Increasing the buffer size allows you to pack more rows into each output file, which is preferred and gives you the most benefit from Parquet.
Compression using Snappy is automatically enabled for both Parquet and ORC. You can modify the compression algorithm by using the Kinesis Data Firehose API and update the OutputFormatConfiguration.
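For example, switching the Parquet compression from the default Snappy to GZIP might look like the following sketch. The delivery stream name is a placeholder, and the version and destination IDs must be read from the current stream description:

import boto3

firehose = boto3.client("firehose", region_name="us-east-1")
stream_name = "kfhconnect-to-s3"  # placeholder delivery stream name

description = firehose.describe_delivery_stream(DeliveryStreamName=stream_name)
stream = description["DeliveryStreamDescription"]

firehose.update_destination(
    DeliveryStreamName=stream_name,
    CurrentDeliveryStreamVersionId=stream["VersionId"],
    DestinationId=stream["Destinations"][0]["DestinationId"],
    ExtendedS3DestinationUpdate={
        "DataFormatConversionConfiguration": {
            "OutputFormatConfiguration": {
                # Change the Parquet compression codec from SNAPPY to GZIP.
                "Serializer": {"ParquetSerDe": {"Compression": "GZIP"}}
            }
        }
    },
)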
Be sure to also enable Amazon CloudWatch Logs so that you can debug any issues that you might run into.
Lastly, finalize the creation of the Firehose delivery stream, and continue on to the next section.
4. Set up the Amazon Connect contact center
After setting up the Kinesis pipeline, you now need to set up a simple contact center in Amazon Connect. The Getting Started page provides clear instructions on how to set up your environment, acquire a phone number, and create an agent to accept calls.
After setting up the contact center, in the Amazon Connect console, choose your Instance Alias, and then choose Data Streaming. Under Agent Event, choose the Kinesis data stream that you created in Step 2, and then choose Save.
At this point, your pipeline is complete. Agent events from Amazon Connect are generated as agents go about their day. Events are sent via Kinesis Data Streams to Kinesis Data Firehose, which converts the event data from JSON to Parquet and stores it in S3. Athena and Amazon Redshift Spectrum can simply query the data without any additional work.
So let’s generate some data. Go back into the Administrator console for your Amazon Connect contact center, and create an agent to handle incoming calls. In this example, I creatively named mine Agent One. After the agent is created, Agent One can get to work, logging in to their console and setting their availability to Available so that they are ready to receive calls.
To make the data a bit more interesting, I also created a second agent, Agent Two. I then made some incoming and outgoing calls and caused some failures to occur, so I now have enough data available to analyze.
5. Analyze the data with Athena
Let’s open the Athena console and run some queries. One thing you’ll notice is that when we created the schema for the dataset, we defined some of the fields as Strings even though in the documentation they were complex structures. The reason for doing that was simply to show some of the flexibility of Athena to be able to parse JSON data. However, you can define nested structures in your table schema so that Kinesis Data Firehose applies the appropriate schema to the Parquet file.
Let’s run the first query to see which agents have logged into the system.
The query might look complex, but it’s fairly straightforward:
WITH dataset AS (
SELECT
from_iso8601_timestamp(eventtimestamp) AS event_ts,
eventtype,
-- CURRENT STATE
json_extract_scalar(
currentagentsnapshot,
'$.agentstatus.name') AS current_status,
from_iso8601_timestamp(
json_extract_scalar(
currentagentsnapshot,
'$.agentstatus.starttimestamp')) AS current_starttimestamp,
json_extract_scalar(
currentagentsnapshot,
'$.configuration.firstname') AS current_firstname,
json_extract_scalar(
currentagentsnapshot,
'$.configuration.lastname') AS current_lastname,
json_extract_scalar(
currentagentsnapshot,
'$.configuration.username') AS current_username,
json_extract_scalar(
currentagentsnapshot,
'$.configuration.routingprofile.defaultoutboundqueue.name') AS current_outboundqueue,
json_extract_scalar(
currentagentsnapshot,
'$.configuration.routingprofile.inboundqueues[0].name') as current_inboundqueue,
-- PREVIOUS STATE
json_extract_scalar(
previousagentsnapshot,
'$.agentstatus.name') as prev_status,
from_iso8601_timestamp(
json_extract_scalar(
previousagentsnapshot,
'$.agentstatus.starttimestamp')) as prev_starttimestamp,
json_extract_scalar(
previousagentsnapshot,
'$.configuration.firstname') as prev_firstname,
json_extract_scalar(
previousagentsnapshot,
'$.configuration.lastname') as prev_lastname,
json_extract_scalar(
previousagentsnapshot,
'$.configuration.username') as prev_username,
json_extract_scalar(
previousagentsnapshot,
'$.configuration.routingprofile.defaultoutboundqueue.name') as prev_outboundqueue,
json_extract_scalar(
previousagentsnapshot,
'$.configuration.routingprofile.inboundqueues[0].name') as prev_inboundqueue
from kfhconnectblog
where eventtype <> 'HEART_BEAT'
)
SELECT
current_status as status,
current_username as username,
event_ts
FROM dataset
WHERE eventtype = 'LOGIN' AND current_username <> ''
ORDER BY event_ts DESC
The query output looks something like this:
Here is another query that shows the sessions each of the agents engaged in. It tells us whether the calls were incoming or outgoing, whether they were completed, and whether any calls were missed or failed.
WITH src AS (
SELECT
eventid,
json_extract_scalar(currentagentsnapshot, '$.configuration.username') as username,
cast(json_extract(currentagentsnapshot, '$.contacts') AS ARRAY(JSON)) as c,
cast(json_extract(previousagentsnapshot, '$.contacts') AS ARRAY(JSON)) as p
from kfhconnectblog
),
src2 AS (
SELECT *
FROM src CROSS JOIN UNNEST (c, p) AS contacts(c_item, p_item)
),
dataset AS (
SELECT
eventid,
username,
json_extract_scalar(c_item, '$.contactid') as c_contactid,
json_extract_scalar(c_item, '$.channel') as c_channel,
json_extract_scalar(c_item, '$.initiationmethod') as c_direction,
json_extract_scalar(c_item, '$.queue.name') as c_queue,
json_extract_scalar(c_item, '$.state') as c_state,
from_iso8601_timestamp(json_extract_scalar(c_item, '$.statestarttimestamp')) as c_ts,
json_extract_scalar(p_item, '$.contactid') as p_contactid,
json_extract_scalar(p_item, '$.channel') as p_channel,
json_extract_scalar(p_item, '$.initiationmethod') as p_direction,
json_extract_scalar(p_item, '$.queue.name') as p_queue,
json_extract_scalar(p_item, '$.state') as p_state,
from_iso8601_timestamp(json_extract_scalar(p_item, '$.statestarttimestamp')) as p_ts
FROM src2
)
SELECT
username,
c_channel as channel,
c_direction as direction,
p_state as prev_state,
c_state as current_state,
c_ts as current_ts,
c_contactid as id
FROM dataset
WHERE c_contactid = p_contactid
ORDER BY id DESC, current_ts ASC
The query output looks similar to the following:
6. Analyze the data with Amazon Redshift Spectrum
With Amazon Redshift Spectrum, you can query data directly in S3 using your existing Amazon Redshift data warehouse cluster. Because the data is already in Parquet format, Redshift Spectrum gets the same great benefits that Athena does.
Here is a simple query to show querying the same data from Amazon Redshift. Note that to do this, you need to first create an external schema in Amazon Redshift that points to the AWS Glue Data Catalog.
SELECT
eventtype,
json_extract_path_text(currentagentsnapshot,'agentstatus','name') AS current_status,
json_extract_path_text(currentagentsnapshot, 'configuration','firstname') AS current_firstname,
json_extract_path_text(currentagentsnapshot, 'configuration','lastname') AS current_lastname,
json_extract_path_text(
currentagentsnapshot,
'configuration','routingprofile','defaultoutboundqueue','name') AS current_outboundqueue
FROM default_schema.kfhconnectblog
The following shows the query output:
Summary
In this post, I showed you how to use Kinesis Data Firehose to ingest and convert data to a columnar file format, enabling real-time analysis using Athena and Amazon Redshift. This feature provides a level of optimization in both cost and performance that you need when storing and analyzing large amounts of data, and it is equally important if you are investing in building data lakes on AWS.
Roy Hasson is a Global Business Development Manager for AWS Analytics. He works with customers around the globe to design solutions to meet their data processing, analytics, and business intelligence needs. Roy is a big Manchester United fan, cheering his team on and hanging out with his family.
A customer has been successfully creating and running multiple Amazon Elasticsearch Service (Amazon ES) domains to support their business users’ search needs across products, orders, support documentation, and a growing suite of similar needs. The service has become heavily used across the organization. This led to some domains running at 100% capacity during peak times, while others began to run low on storage space. Because of this increased usage, the technical teams were in danger of missing their service level agreements. They contacted me for help.
This post shows how you can set up automated alarms to warn when domains need attention.
Solution overview
Amazon ES is a fully managed service that delivers Elasticsearch’s easy-to-use APIs and real-time analytics capabilities along with the availability, scalability, and security that production workloads require. The service offers built-in integrations with a number of other components and AWS services, enabling customers to go from raw data to actionable insights quickly and securely.
One of these other integrated services is Amazon CloudWatch. CloudWatch is a monitoring service for AWS Cloud resources and the applications that you run on AWS. You can use CloudWatch to collect and track metrics, collect and monitor log files, set alarms, and automatically react to changes in your AWS resources.
CloudWatch collects metrics for Amazon ES. You can use these metrics to monitor the state of your Amazon ES domains, and set alarms to notify you about high utilization of system resources. For more information, see Amazon Elasticsearch Service Metrics and Dimensions.
While the metrics are automatically collected, the missing piece is how to set alarms on these metrics at appropriate levels for each of your domains. This post includes sample Python code to evaluate the current state of your Amazon ES environment, and to set up alarms according to AWS recommendations and best practices.
There are two components to the sample solution:
es-check-cwalarms.py: This Python script checks the CloudWatch alarms that have been set, for all Amazon ES domains in a given account and region.
es-create-cwalarms.py: This Python script sets up a set of CloudWatch alarms for a single given domain.
The sample code can also be found in the amazon-es-check-cw-alarms GitHub repo. The scripts are easy to extend or combine, as described in the section “Extensions and Adaptations”.
Assessing the current state
The first script, es-check-cwalarms.py, is used to give an overview of the configurations and alarm settings for all the Amazon ES domains in the given region. The script takes the following parameters:
python es-checkcwalarms.py -h
usage: es-checkcwalarms.py [-h] [-e ESPREFIX] [-n NOTIFY] [-f FREE] [-p PROFILE] [-r REGION]
Checks a set of recommended CloudWatch alarms for Amazon Elasticsearch Service domains (optionally, those beginning with a given prefix).
optional arguments:
-h, --help show this help message and exit
-e ESPREFIX, --esprefix ESPREFIX Only check Amazon Elasticsearch Service domains that begin with this prefix.
-n NOTIFY, --notify NOTIFY List of CloudWatch alarm actions; e.g. ['arn:aws:sns:xxxx']
-f FREE, --free FREE Minimum free storage (MB) on which to alarm
-p PROFILE, --profile PROFILE IAM profile name to use
-r REGION, --region REGION AWS region for the domain. Default: us-east-1
The script first identifies all the domains in the given region (or, optionally, limits them to the subset that begins with a given prefix). It then starts running a set of checks against each one.
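The published scripts are more thorough, but a minimal sketch of the kinds of calls involved might look like this. It assumes the standard AWS/ES metric dimensions of DomainName and ClientId (the account ID) and checks only a single metric:

import boto3

region = "us-east-1"
es = boto3.client("es", region_name=region)
cloudwatch = boto3.client("cloudwatch", region_name=region)
account_id = boto3.client("sts").get_caller_identity()["Account"]

for domain in es.list_domain_names()["DomainNames"]:
    name = domain["DomainName"]
    status = es.describe_elasticsearch_domain(DomainName=name)["DomainStatus"]
    print(name, status["ElasticsearchVersion"], status["ElasticsearchClusterConfig"])

    # Amazon ES publishes metrics with DomainName and ClientId (account ID) dimensions.
    alarms = cloudwatch.describe_alarms_for_metric(
        MetricName="ClusterStatus.red",
        Namespace="AWS/ES",
        Dimensions=[
            {"Name": "DomainName", "Value": name},
            {"Name": "ClientId", "Value": account_id},
        ],
    )
    if not alarms["MetricAlarms"]:
        print(name, "WARNING: Missing alarm for ClusterStatus.red")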
The script can be run from the command line or set up as a scheduled Lambda function. For example, for one customer, it was deemed appropriate to regularly run the script to check that alarms were correctly set for all domains. In addition, because configuration changes—cluster size increases to accommodate larger workloads being a common change—might require updates to alarms, this approach allowed the automatic identification of alarms no longer appropriately set as the domain configurations changed.
The following is the output for one domain in my account.
Starting checks for Elasticsearch domain iotfleet , version is 53
Iotfleet Automated snapshot hour (UTC): 0
Iotfleet Instance configuration: 1 instances; type:m3.medium.elasticsearch
Iotfleet Instance storage definition is: 4 GB; free storage calced to: 819.2 MB
iotfleet Desired free storage set to (in MB): 819.2
iotfleet WARNING: Not using VPC Endpoint
iotfleet WARNING: Does not have Zone Awareness enabled
iotfleet WARNING: Instance count is ODD. Best practice is for an even number of data nodes and zone awareness.
iotfleet WARNING: Does not have Dedicated Masters.
iotfleet WARNING: Neither index nor search slow logs are enabled.
iotfleet WARNING: EBS not in use. Using instance storage only.
iotfleet Alarm ok; definition matches. Test-Elasticsearch-iotfleet-ClusterStatus.yellow-Alarm ClusterStatus.yellow
iotfleet Alarm ok; definition matches. Test-Elasticsearch-iotfleet-ClusterStatus.red-Alarm ClusterStatus.red
iotfleet Alarm ok; definition matches. Test-Elasticsearch-iotfleet-CPUUtilization-Alarm CPUUtilization
iotfleet Alarm ok; definition matches. Test-Elasticsearch-iotfleet-JVMMemoryPressure-Alarm JVMMemoryPressure
iotfleet WARNING: Missing alarm!! ('ClusterIndexWritesBlocked', 'Maximum', 60, 5, 'GreaterThanOrEqualToThreshold', 1.0)
iotfleet Alarm ok; definition matches. Test-Elasticsearch-iotfleet-AutomatedSnapshotFailure-Alarm AutomatedSnapshotFailure
iotfleet Alarm: Threshold does not match: Test-Elasticsearch-iotfleet-FreeStorageSpace-Alarm Should be: 819.2 ; is 3000.0
The output messages fall into the following categories:
System overview, Informational: The Amazon ES version and configuration, including instance type and number, storage, automated snapshot hour, etc.
Free storage: A calculation for the appropriate amount of free storage, based on the recommended 20% of total storage.
Warnings: best practices that are not being followed for this domain. (For more about this, read on.)
Alarms: An assessment of the CloudWatch alarms currently set for this domain, against a recommended set.
The script contains an array of recommended CloudWatch alarms, based on best practices for these metrics and statistics. Using the array allows alarm parameters (such as free space) to be updated within the code based on current domain statistics and configurations.
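To give a sense of the structure, a sketch of such an array is shown below. The tuple layout matches the one echoed in the script output above; the specific statistics and thresholds here are illustrative rather than the exact values in the published scripts:

# (metric name, statistic, period in seconds, evaluation periods, comparison operator, threshold)
# esFreespace stands in for the per-domain free-storage value calculated from total storage.
esFreespace = 2048.0
esAlarms = [
    ("ClusterStatus.yellow", "Maximum", 60, 5, "GreaterThanOrEqualToThreshold", 1.0),
    ("ClusterStatus.red", "Maximum", 60, 5, "GreaterThanOrEqualToThreshold", 1.0),
    ("CPUUtilization", "Average", 300, 3, "GreaterThanOrEqualToThreshold", 80.0),
    ("JVMMemoryPressure", "Maximum", 300, 3, "GreaterThanOrEqualToThreshold", 85.0),
    ("ClusterIndexWritesBlocked", "Maximum", 60, 5, "GreaterThanOrEqualToThreshold", 1.0),
    ("AutomatedSnapshotFailure", "Maximum", 60, 1, "GreaterThanOrEqualToThreshold", 1.0),
    ("FreeStorageSpace", "Minimum", 300, 1, "LessThanOrEqualToThreshold", esFreespace),
]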
For a given domain, the script checks if each alarm has been set. If the alarm is set, it checks whether the values match those in the array esAlarms. In the output above, you can see three different situations being reported:
Alarm ok; definition matches. The alarm set for the domain matches the settings in the array.
Alarm: Threshold does not match. An alarm exists, but the threshold value at which the alarm is triggered does not match.
WARNING: Missing alarm!! The recommended alarm is missing.
All in all, the list above shows that this domain does not have a configuration that adheres to best practices, nor does it have all the recommended alarms.
Setting up alarms
Now that you know that the domains in their current state are missing critical alarms, you can correct the situation.
To demonstrate the script, set up a new domain named “ver”, in us-west-2. Specify 1 node, and a 10-GB EBS disk. Also, create an SNS topic in us-west-2 with a name of “sendnotification”, which sends you an email.
Run the second script, es-create-cwalarms.py, from the command line. This script creates (or updates) the desired CloudWatch alarms for the specified Amazon ES domain, “ver”.
python es-create-cwalarms.py -r us-west-2 -e test -c ver -n "['arn:aws:sns:us-west-2:xxxxxxxxxx:sendnotification']"
EBS enabled: True type: gp2 size (GB): 10 No Iops 10240 total storage (MB)
Desired free storage set to (in MB): 2048.0
Creating Test-Elasticsearch-ver-ClusterStatus.yellow-Alarm
Creating Test-Elasticsearch-ver-ClusterStatus.red-Alarm
Creating Test-Elasticsearch-ver-CPUUtilization-Alarm
Creating Test-Elasticsearch-ver-JVMMemoryPressure-Alarm
Creating Test-Elasticsearch-ver-FreeStorageSpace-Alarm
Creating Test-Elasticsearch-ver-ClusterIndexWritesBlocked-Alarm
Creating Test-Elasticsearch-ver-AutomatedSnapshotFailure-Alarm
Successfully finished creating alarms!
As with the first script, this script contains an array of recommended CloudWatch alarms, based on best practices for these metrics and statistics. This approach allows you to add or modify alarms based on your use case (more on that below).
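Under the hood, each alarm boils down to a single CloudWatch put_metric_alarm call. The following sketch creates one of the alarms shown in the output above; the account ID is a placeholder, and the dimensions assume the standard AWS/ES conventions:

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

cloudwatch.put_metric_alarm(
    AlarmName="Test-Elasticsearch-ver-ClusterIndexWritesBlocked-Alarm",
    Namespace="AWS/ES",
    MetricName="ClusterIndexWritesBlocked",
    Dimensions=[
        {"Name": "DomainName", "Value": "ver"},
        {"Name": "ClientId", "Value": "123456789012"},  # your AWS account ID
    ],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=5,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    Threshold=1.0,
    AlarmActions=["arn:aws:sns:us-west-2:123456789012:sendnotification"],
)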
After running the script, navigate to Alarms on the CloudWatch console. You can see the set of alarms set up on your domain.
Because the “ver” domain has only a single node, cluster status is yellow, and that alarm is in an “ALARM” state. It’s already sent a notification that the alarm has been triggered.
In most cases, the alarm triggers due to an increased workload. The likely action is to reconfigure the system to handle the increased workload, rather than reducing the incoming workload. Reconfiguring any backend store—a category of systems that includes Elasticsearch—is best performed when the system is quiescent or lightly loaded. Reconfigurations such as setting zone awareness or modifying the disk type cause Amazon ES to enter a “processing” state, potentially disrupting client access.
Other changes, such as increasing the number of data nodes, may cause Elasticsearch to begin moving shards, potentially impacting search performance on these shards while this is happening. These actions should be considered in the context of your production usage. For the same reason I also do not recommend running a script that resets all domains to match best practices.
Avoid the need to reconfigure during heavy workload by setting alarms at a level that allows a considered approach to making the needed changes. For example, if you identify that each weekly peak is increasing, you can reconfigure during a weekly quiet period.
While Elasticsearch can be reconfigured without being quiesced, it is not a best practice to automatically scale it up and down based on usage patterns. Unlike some other AWS services, I recommend against setting a CloudWatch action that automatically reconfigures the system when alarms are triggered.
There are other situations where the planned reconfiguration approach may not work, such as low or zero free disk space causing the domain to reject writes. If the business is dependent on the domain continuing to accept incoming writes and deleting data is not an option, the team may choose to reconfigure immediately.
Extensions and adaptations
You may wish to modify the best practices encoded in the scripts for your own environment or workloads. It’s always better to avoid situations where alerts are generated but routinely ignored. All alerts should trigger a review and one or more actions, either immediately or at a planned date. The following is a list of common situations where you may wish to set different alarms for different domains:
Dev/test vs. production: You may have a different set of configuration rules and alarms for your dev and test environments than for production. For example, you may require zone awareness and dedicated masters for your production environment, but not for your development domains. Or, you may not have any alarms set in dev. For test environments that mirror your potential peak load, test to ensure that the alarms are appropriately triggered.
Differing workloads or SLAs for different domains: You may have one domain with a requirement for superfast search performance, and another domain with a heavy ingest load that tolerates slower search response. Your reaction to slow response for these two workloads is likely to be different, so perhaps the thresholds for these two domains should be set at a different level. In this case, you might add a “max CPU utilization” alarm at 100% for 1 minute for the fast search domain, while the other domain only triggers an alarm when the average has been higher than 60% for 5 minutes. You might also add a “free space” rule with a higher threshold to reflect the need for more space for the heavy ingest load if there is danger that it could fill the available disk quickly.
“Normal” alarms versus “emergency” alarms: If, for example, free disk space drops to 25% of total capacity, an alarm is triggered that indicates action should be taken as soon as possible, such as cleaning up old indexes or reconfiguring at the next quiet period for this domain. However, if free space drops below a critical level (20% free space), action must be taken immediately in order to prevent Amazon ES from setting the domain to read-only. Similarly, if the “ClusterIndexWritesBlocked” alarm triggers, the domain has already stopped accepting writes, so immediate action is needed. In this case, you may wish to set “laddered” alarms, where one threshold causes an alarm to be triggered to review the current workload for a planned reconfiguration, but a different threshold raises a “DefCon 3” alarm signaling that immediate action is required.
The sample scripts provided here are a starting point, intended for you to adapt to your own environment and needs.
Running the scripts one time can identify how far your current state is from your desired state, and create an initial set of alarms. Regularly re-running these scripts can capture changes in your environment over time and adjust your alarms as your environment and configurations change. One customer has set them up to run nightly, and to automatically create and update alarms to match their preferred settings.
Removing unwanted alarms
Each CloudWatch alarm costs approximately $0.10 per month. You can remove unwanted alarms in the CloudWatch console, under Alarms. If you set up a “ver” domain above, remember to remove it to avoid continuing charges.
Conclusion
Setting CloudWatch alarms appropriately for your Amazon ES domains can help you avoid suboptimal performance and allow you to respond to workload growth or configuration issues well before they become urgent. This post gives you a starting point for doing so. The additional sleep you’ll get knowing you don’t need to be concerned about Elasticsearch domain performance will allow you to focus on building creative solutions for your business and solving problems for your customers.
Dr. Veronika Megler is a senior consultant at Amazon Web Services. She works with our customers to implement innovative big data, AI and ML projects, helping them accelerate their time-to-value when using AWS.
The Twelve-Factor App methodology is a set of twelve best practices for building modern, cloud-native applications. With guidance on things like configuration, deployment, runtime, and communication between services, the Twelve-Factor model prescribes best practices that apply to a diverse range of use cases, from web applications and APIs to data processing applications. Although serverless computing and AWS Lambda have changed how application development is done, the Twelve-Factor best practices remain relevant and applicable in a serverless world.
In this post, I directly apply and compare the Twelve-Factor methodology to serverless application development with Lambda and Amazon API Gateway.
The Twelve Factors
As you’ll see, many of these factors are not only directly applicable to serverless applications, but in fact a default mechanism or capability of the AWS serverless platform. Other factors don’t fit, and I talk about how these factors may not apply at all in a serverless approach.
1. Codebase
A general software development best practice is to have all of your code in revision control. This is no different with serverless applications.
For a single serverless application, your code should be stored in a single repository in which a single deployable artifact is generated and from which it is deployed. This single code base should also represent the code used in all of your application environments (development, staging, production, etc.). What might be different for serverless applications is the bounds for what constitutes a “single application.”
Here are two guidelines to help you understand the scope of an application:
If events are shared (such as a common Amazon API Gateway API), then the Lambda function code for those events should be put in the same repository.
Otherwise, break functions along event sources into their own repositories.
Following these two guidelines helps you keep your serverless applications scoped to a single purpose and helps prevent complexity in your code base.
2. Dependencies
Code that needs to be used by multiple functions should be packaged into its own library and included inside your deployment package. Going back to the previous factor on codebase, if you find that you often need to include special processing or business logic, the best solution may be to create a purposeful library yourself. Every language that Lambda supports has a model for dependencies and libraries, such as npm for Node.js or pip for Python, which you can use.
3. Config
Both Lambda and API Gateway allow you to set configuration information, using the environment in which each service runs.
In Lambda, these are called environment variables and are key-value pairs that can be set at deployment or when updating the function configuration. Lambda then makes these key-value pairs available to your Lambda function code using standard APIs supported by the language, like process.env for Node.js functions. For more information, see Programming Model, which contains examples for each supported language.
Lambda also allows you to encrypt these key-value pairs using KMS, such that they can be used to store secrets such as API keys or passwords for databases. You can also use them to help define application environment specifics, such as differences between testing or production environments where you might have unique databases or endpoints with which your Lambda function needs to interface. You could also use these for setting A/B testing flags or to enable or disable certain function logic.
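As a minimal sketch, a Python function might read its configuration like this (the variable names are hypothetical):

import os

# Hypothetical configuration values supplied through the function's environment variables.
TABLE_NAME = os.environ["TABLE_NAME"]
ENABLE_BETA_FEATURE = os.environ.get("ENABLE_BETA_FEATURE", "false") == "true"

def handler(event, context):
    # Branch on a flag that lives in configuration rather than in code.
    if ENABLE_BETA_FEATURE:
        return {"table": TABLE_NAME, "path": "beta"}
    return {"table": TABLE_NAME, "path": "stable"}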
For API Gateway, these configuration variables are called stage variables. Like environment variables in Lambda, these are key-value pairs that are available for API Gateway to consume or pass to your API’s backend service. Stage variables can be useful to send requests to different backend environments based on the URL from which your API is accessed. For example, a single configuration could support both beta.yourapi.com vs. prod.yourapi.com. You could also use stage variables to pass information to a Lambda function that causes it to perform different logic.
4. Backing services
Because Lambda doesn’t allow you to run another service as part of your function execution, this factor is basically the default model for Lambda. Typically, you reference any database or data store as an external resource via HTTP endpoint or DNS name. These connection strings are ideally passed in via the configuration information, as previously covered.
5. Build, release, run
The separation of build, release, and run stages follows the development best practices of continuous integration and delivery. AWS recommends that you have a CI/CD process no matter what type of application you are building. For serverless applications, this is no different. For more information, see the Building CI/CD Pipelines for Serverless Applications (SRV302) re:Invent 2017 session.
An example minimal pipeline (from presentation linked above)
6. Processes
This is inherent in how Lambda is designed so there is nothing more to consider. Lambda functions should always be treated as being stateless, despite the ability to potentially store some information locally between execution environment re-use. This is because there is no guaranteed affinity to any execution environment, and the potential for an execution environment to go away between invocations exists. You should always store any stateful information in a database, cache, or separate data store via a backing service.
7. Port binding
This factor also does not apply to Lambda, as execution environments do not expose any direct networking to your functions. Instead of a port, Lambda functions are invoked via one or more triggering services or AWS APIs for Lambda. There are currently three different invocation models: synchronous (push), asynchronous (event), and stream-based (poll).
8. Concurrency
Lambda was built with massive concurrency and scale in mind. A recent post on this blog, titled Managing AWS Lambda Function Concurrency, explained that for a serverless application, “the unit of scale is a concurrent execution” and that these are consumed by your functions.
Lambda automatically scales to meet the demands of invocations sent at your function. This is in contrast to a traditional compute model using physical hosts, virtual machines, or containers that you self-manage. With Lambda, you do not need to manage overall capacity or apply scaling policies.
Each AWS account has an overall AccountLimit value that is fixed at any point in time, but can be easily increased as needed. As of May 2017, the default limit is 1000 concurrent executions per AWS Region. You can also set and manage a reserved concurrency limit, which provides a limit to how much concurrency a function can have. It also reserves concurrency capacity for a given function out of the total available for an account.
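For example, reserving concurrency for a function is a single API call; the function name and value below are hypothetical:

import boto3

lambda_client = boto3.client("lambda")

# Reserve 100 concurrent executions for this function, carving them out of the
# account-level pool and capping the function at that level.
lambda_client.put_function_concurrency(
    FunctionName="my-api-backend",
    ReservedConcurrentExecutions=100,
)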
9. Disposability
Shutdown doesn’t apply to Lambda, because Lambda is intrinsically event-driven. Invocations are tied directly to incoming events or triggers.
However, speed at startup does matter. Initial function execution latency, or what are called “cold starts,” can occur when there isn’t a “warmed” compute resource ready to execute against your application invocations. The AWS Lambda Execution Model topic explains:
“It takes time to set up an execution context and do the necessary “bootstrapping”, which adds some latency each time the Lambda function is invoked. You typically see this latency when a Lambda function is invoked for the first time or after it has been updated because AWS Lambda tries to reuse the execution context for subsequent invocations of the Lambda function.”
The Best Practices topic covers a number of issues around how to think about performance of your functions. This includes where to place certain logic, how to re-use execution environments, and how by configuring your function for more memory you also get a proportional increase in CPU available to your function. With AWS X-Ray, you can gather some insight as to what your function is doing during an execution and make adjustments accordingly.
10. Dev/prod parity
Along with continuous integration and delivery, the practice of having independent application environments is a solid best practice no matter the development approach. Being able to safely test applications in a non-production environment is key to development success. Products within the AWS Serverless Platform do not charge for idle time, which greatly reduces the cost of running multiple environments. You can also use the AWS Serverless Application Model (AWS SAM), to manage the configuration of your separate environments.
SAM allows you to model your serverless applications in greatly simplified AWS CloudFormation syntax. With SAM, you can use CloudFormation’s capabilities—such as Parameters and Mappings—to build dynamic templates. Along with Lambda’s environment variables and API Gateway’s stage variables, those templates give you the ability to deploy multiple environments from a single template, such as testing, staging, and production. Whenever the non-production environments are not in use, your costs for Lambda and API Gateway would be zero. For more information, see the AWS Lambda Applications with AWS Serverless Application Model 2017 AWS online tech talk.
11. Logs
In a typical non-serverless application environment, you might be concerned with log files, logging daemons, and centralization of the data represented in them. Thankfully, this is not a concern for serverless applications, as most of the services in the platform handle this for you.
With Lambda, you can just output logs to the console via the native capabilities of the language in which your function is written. For more information about Go, see Logging (Go). Similar documentation pages exist for other languages. Those messages output by your code are captured and centralized in Amazon CloudWatch Logs. For more information, see Accessing Amazon CloudWatch Logs for AWS Lambda.
API Gateway provides two different methods for getting log information:
Execution logs: These include errors or execution traces (such as request or response parameter values or payloads), data used by custom authorizers, whether API keys are required, whether usage plans are enabled, and so on.
Access logs: These provide the ability to log who has accessed your API and how the caller accessed it. You can even customize the format of these logs as desired.
Capturing logs and being able to search and view them is one thing, but CloudWatch Logs also gives you the ability to treat a log message as an event and take action on them via subscription filters in the service. With subscription filters, you could send a log message matching a certain pattern to a Lambda function, and have it take action based on that. Say, for example, that you want to respond to certain error messages or usage patterns that violate certain rules. You could do that with CloudWatch Logs, subscription filters, and Lambda. Another important capability of CloudWatch Logs is the ability to “pivot” log information into a metric in CloudWatch. With this, you could take a data point from a log entry, create a metric, and then an alarm on a metric to show a breached threshold.
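As a sketch of that pattern, the following subscribes a hypothetical error-handling function to a log group (the destination function must separately grant CloudWatch Logs permission to invoke it):

import boto3

logs = boto3.client("logs")

# Forward log events containing "ERROR" to a Lambda function for automated handling.
logs.put_subscription_filter(
    logGroupName="/aws/lambda/my-api-backend",  # hypothetical source log group
    filterName="forward-errors",
    filterPattern="ERROR",
    destinationArn="arn:aws:lambda:us-east-1:123456789012:function:error-responder",
)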
12. Admin processes
This is another factor that doesn’t directly apply to Lambda due to its design. Typically, you would have your functions scoped down to single or limited use cases and have individual functions for different components of your application. Even if they share a common invoking resource, such as an API Gateway endpoint and stage, you would still separate the individual API resources and actions to their own Lambda functions.
The Seven-and-a-Half–Factor model and you
As we’ve seen, Twelve-Factor application design can still be applied to serverless applications, taking into account some small differences! The following diagram highlights the factors and how applicable or not they are to serverless applications:
NOTE: Disposability only half applies, as discussed in this post.
Conclusion
If you’ve been building applications for cloud infrastructure over the past few years, the Twelve-Factor methodology should seem familiar and straightforward. If you are new to this space, however, you should know that the general best practices and default working patterns for serverless applications overlap heavily with what I’ve discussed here.
It shouldn’t require much work to adhere rather closely to the Twelve-Factor model. When you’re building serverless applications, following the applicable points listed here helps you simplify development and any operational work involved (though already minimized by services such as Lambda and API Gateway). The Twelve-Factor methodology also isn’t all-or-nothing. You can apply only the practices that work best for you and your applications, and still benefit.
AWS Fargate is a new technology that works with Amazon Elastic Container Service (ECS) to run containers without having to manage servers or clusters. What does this mean? With Fargate, you no longer need to provision or manage a single virtual machine; you can just create tasks and run them directly!
Fargate uses the same API actions as ECS, so you can use the ECS console, the AWS CLI, or the ECS CLI. I recommend running through the first-run experience for Fargate even if you’re familiar with ECS. It creates all of the one-time setup requirements, such as the necessary IAM roles. If you’re using a CLI, make sure to upgrade to the latest version.
In this blog, you will see how to migrate ECS containers from running on Amazon EC2 to Fargate.
Getting started
Note: Anything with code blocks is a change in the task definition file. Screen captures are from the console. Additionally, Fargate is currently available in the us-east-1 (N. Virginia) region.
Launch type
When you create tasks (grouping of containers) and clusters (grouping of tasks), you now have two launch type options: EC2 and Fargate. The default launch type, EC2, is ECS as you knew it before the announcement of Fargate. You need to specify Fargate as the launch type when running a Fargate task.
Even though Fargate abstracts away virtual machines, tasks still must be launched into a cluster. With Fargate, clusters are a logical infrastructure and permissions boundary that allow you to isolate and manage groups of tasks. ECS also supports heterogeneous clusters that are made up of tasks running on both EC2 and Fargate launch types.
The new, optional requiresCompatibilities parameter, with FARGATE in the list, ensures that your task definition only passes validation if you include Fargate-compatible parameters. Tasks can be flagged as compatible with EC2, Fargate, or both.
"requiresCompatibilities": [
"FARGATE"
]
Networking
"networkMode": "awsvpc"
In November, we announced the addition of task networking with the network mode awsvpc. By default, ECS uses the bridge network mode. Fargate requires using the awsvpc network mode.
In bridge mode, all of your tasks running on the same instance share that instance’s elastic network interface (a virtual network interface), IP address, and security groups.
The awsvpc mode provides this networking support to your tasks natively. You now get the same VPC networking and security controls at the task level that were previously only available with EC2 instances. Each task gets its own elastic networking interface and IP address so that multiple applications or copies of a single application can run on the same port number without any conflicts.
The awsvpc mode also provides a separation of responsibility for tasks. You can get complete control of task placement within your own VPCs, subnets, and the security policies associated with them, even though the underlying infrastructure is managed by Fargate. Also, you can assign different security groups to each task, which gives you more fine-grained security. You can give an application only the permissions it needs.
"portMappings": [
{
"containerPort": "3000"
}
]
What else has to change? First, you only specify a containerPort value, not a hostPort value, as there is no host to manage. Your container port is the port that you access on your elastic network interface IP address. Therefore, your container ports in a single task definition file need to be unique.
Additionally, links are not allowed as they are a property of the “bridge” network mode (and are now a legacy feature of Docker). Instead, containers share a network namespace and communicate with each other over the localhost interface. They can be referenced using localhost and the container port (for example, localhost:3000).
CPU and memory
When launching a task with the EC2 launch type, task performance is influenced by the instance types that you select for your cluster combined with your task definition. If you pick larger instances, your applications make use of the extra resources if there is no contention.
In Fargate, there are no instances to size, so we created task-level resources, which define the maximum amount of memory and CPU that your task can consume.
memory can be defined in MB with just the number, or in GB, for example, “1024” or “1gb”.
cpu can be defined as the number or in vCPUs, for example, “256” or “.25vcpu”.
vCPUs are virtual CPUs. You can look at the memory and vCPUs for instance types to get an idea of what you may have used before.
The memory and CPU options available with Fargate are:
CPU 256 (.25 vCPU): 0.5GB, 1GB, or 2GB of memory
CPU 512 (.5 vCPU): 1GB, 2GB, 3GB, or 4GB of memory
CPU 1024 (1 vCPU): 2GB, 3GB, 4GB, 5GB, 6GB, 7GB, or 8GB of memory
CPU 2048 (2 vCPU): between 4GB and 16GB of memory in 1GB increments
CPU 4096 (4 vCPU): between 8GB and 30GB of memory in 1GB increments
IAM roles
Because Fargate uses awsvpc mode, you need an Amazon ECS service-linked IAM role named AWSServiceRoleForECS. It provides Fargate with the needed permissions, such as the permission to attach an elastic network interface to your task. After you create your service-linked IAM role, you can delete the remaining roles in your services.
With the EC2 launch type, an instance role gives the agent the ability to pull, publish, talk to ECS, and so on. With Fargate, the task execution IAM role is only needed if you’re pulling from Amazon ECR or publishing data to Amazon CloudWatch Logs.
Volumes
Fargate currently supports non-persistent, empty data volumes for containers. When you define your container, you no longer use the host field and only specify a name.
Load balancers
For awsvpc mode, and therefore for Fargate, use the IP target type instead of the instance target type. You define this in the Amazon EC2 service when creating a load balancer.
Tip: If you are using an Application Load Balancer, make sure that your tasks are launched in the same VPC and Availability Zones as your load balancer.
Let’s migrate a task definition!
Consider an example NGINX task definition of the kind you’re used to if you created one before Fargate was announced, which is what you would run now with the EC2 launch type. Migrating it to Fargate means adding the settings discussed above: requiresCompatibilities set to FARGATE, the awsvpc network mode, task-level cpu and memory, an execution role if you pull from ECR or write to CloudWatch Logs, and port mappings that specify only a containerPort.
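As a rough sketch of where you end up, registering the migrated definition with boto3 might look like the following; the family name, image, and role ARN are placeholders:

import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

ecs.register_task_definition(
    family="nginx-fargate",
    requiresCompatibilities=["FARGATE"],  # validate the definition as Fargate-compatible
    networkMode="awsvpc",                 # required for Fargate
    cpu="256",                            # .25 vCPU
    memory="512",                         # 0.5 GB, a valid pairing for 256 CPU units
    # Needed only if you pull from Amazon ECR or send logs to CloudWatch Logs.
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
    containerDefinitions=[
        {
            "name": "nginx",
            "image": "nginx:latest",
            "essential": True,
            # Only containerPort is specified; there is no host port to manage.
            "portMappings": [{"containerPort": 80}],
        }
    ],
)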
Head to the AWS Samples GitHub repo for more. We have several sample task definitions you can try for both the EC2 and Fargate launch types. Contributions are very welcome too :).
When managing your AWS resources, you often need to grant one AWS service access to another to accomplish tasks. For example, you could use an AWS Lambda function to resize, watermark, and postprocess images, for which you would need to store the associated metadata in Amazon DynamoDB. You also could use Lambda, Amazon S3, and Amazon CloudFront to build a serverless website that uses a DynamoDB table as a session store, with Lambda updating the information in the table. In both these examples, you need to grant Lambda functions permissions to write to DynamoDB.
In this post, I demonstrate how to create an AWS Identity and Access Management (IAM) policy that will be attached to an IAM role. The role is then used to grant a Lambda function access to a DynamoDB table. By using an IAM policy and role to control access, I don’t need to embed credentials in code and can tightly control which services the Lambda function can access. The policy also includes permissions to allow the Lambda function to write log files to Amazon CloudWatch Logs. This allows me to view utilization statistics for my Lambda functions and to have access to additional logging for troubleshooting issues.
Solution overview
The following architecture diagram presents an overview of the solution in this post.
The architecture of this post’s solution uses a Lambda function (1 in the preceding diagram) to make read API calls such as GET or SCAN and write API calls such as PUT or UPDATE to a DynamoDB table (2). The Lambda function also writes log files to CloudWatch Logs (3). The Lambda function uses an IAM role (4) that has an IAM policy attached (5) that grants access to DynamoDB and CloudWatch.
Overview of the AWS services used in this post
I use the following AWS services in this post’s solution:
IAM – For securely controlling access to AWS services. With IAM, you can centrally manage users, security credentials such as access keys, and permissions that control which AWS resources users and applications can access.
DynamoDB – A fast and flexible NoSQL database service for all applications that need consistent, single-digit-millisecond latency at any scale.
Lambda – Run code without provisioning or managing servers. You pay only for the compute time you consume—there is no charge when your code is not running.
CloudWatch Logs – For monitoring, storing, and accessing log files generated by AWS resources, including Lambda.
IAM access policies
I have authored an IAM access policy with JSON to grant the required permissions to the DynamoDB table and CloudWatch Logs. I will attach this policy to a role, and this role will then be attached to a Lambda function, which will then have the required access to DynamoDB and CloudWatch Logs.
I will walk through this policy, and explain its elements and how to create the policy in the IAM console.
The following policy grants a Lambda function read and write access to a DynamoDB table and allows it to write log files to CloudWatch Logs. This policy is called MyLambdaPolicy. The next sections walk through the elements of its JSON document (the AWS account ID used in the examples is a placeholder value that you would replace with your own account ID).
The first element in this policy is the Version, which defines the policy language version. At the time of this post’s publication, the most recent version of the policy language is 2012-10-17.
The next element in this first policy is a Statement. This is the main section of the policy and includes multiple elements. This first statement is to Allow access to DynamoDB, and in this example, the elements I use are:
An Effect element – Specifies whether the statement results in an Allow or an explicit Deny. By default, access to resources is implicitly denied. In this example, I have used Allow because I want to allow the actions.
An Action element – Describes the specific actions for this statement. Each AWS service has its own set of actions that describe tasks that you can perform with that service. I have used the DynamoDB actions that I want to allow. For the definitions of all available actions for DynamoDB, see the DynamoDB API Reference.
A Resource element – Specifies the object or objects for this statement using Amazon Resource Names (ARNs). You use an ARN to uniquely identify an AWS resource. All Resource elements start with arn:aws and then define the object or objects for the statement. I use this to specify the DynamoDB table to which I want to allow access. To build the Resource element for DynamoDB, I have to specify:
The AWS service (dynamodb)
The AWS Region (eu-west-1)
The AWS account ID (123456789012)
The table (table/SampleTable)
The complete Resource element of the first statement is: arn:aws:dynamodb:eu-west-1:123456789012:table/SampleTable
In this policy, I created a second statement to allow access to CloudWatch Logs so that the Lambda function can write log files for troubleshooting and analysis. I have used the same elements as for the DynamoDB statement, but have changed the following values:
For the Action element, I used the CloudWatch actions that I want to allow. Definitions of all the available actions for CloudWatch are provided in the CloudWatch API Reference.
For the Resource element, I specified the AWS account to which I want to allow my Lambda function to write its log files. As in the preceding example for DynamoDB, you have to use the ARN for CloudWatch Logs to specify where access should be granted. To build the Resource element for CloudWatch Logs, I have to specify:
The AWS service (logs)
The AWS region (eu-west-1)
The AWS account ID (123456789012)
All log groups in this account (*)
The complete Resource element of the second statement is: arn:aws:logs:eu-west-1:123456789012:*
Create the IAM policy in your account
Before you can apply MyLambdaPolicy to a Lambda function, you have to create the policy in your own account and then apply it to an IAM role.
To create an IAM policy:
Navigate to the IAM console and choose Policies in the navigation pane. Choose Create policy.
Because I have already written the policy in JSON, you don’t need to use the Visual Editor, so you can choose the JSON tab and enter the JSON policy document described earlier in this post (remember to replace the placeholder account ID with your own account ID). Choose Review policy.
Name the policy MyLambdaPolicy and give it a description that will help you remember the policy’s purpose. You also can view a summary of the policy’s permissions. Choose Create policy.
You have created the IAM policy that you will apply to the Lambda function.
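If you prefer to create the policy programmatically instead of in the console, a sketch with boto3 might look like this. The resource ARNs match the ones described above; the specific action lists are illustrative assumptions that you should scope to exactly what your function needs:

import boto3
import json

iam = boto3.client("iam")

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            # Illustrative read and write actions; adjust to your function's needs.
            "Action": ["dynamodb:GetItem", "dynamodb:Scan", "dynamodb:PutItem", "dynamodb:UpdateItem"],
            "Resource": "arn:aws:dynamodb:eu-west-1:123456789012:table/SampleTable",
        },
        {
            "Effect": "Allow",
            "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
            "Resource": "arn:aws:logs:eu-west-1:123456789012:*",
        },
    ],
}

iam.create_policy(
    PolicyName="MyLambdaPolicy",
    PolicyDocument=json.dumps(policy_document),
    Description="Read/write access to SampleTable and write access to CloudWatch Logs",
)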
Attach the IAM policy to an IAM role
To apply MyLambdaPolicy to a Lambda function, you first have to attach the policy to an IAM role.
To create a new role and attach MyLambdaPolicy to the role:
Navigate to the IAM console and choose Roles in the navigation pane. Choose Create role.
Choose AWS service and then choose Lambda. Choose Next: Permissions.
On the Attach permissions policies page, type MyLambdaPolicy in the Search box. Choose MyLambdaPolicy from the list of returned search results, and then choose Next: Review.
On the Review page, type MyLambdaRole in the Role name box and an appropriate description, and then choose Create role.
You have attached the policy you created earlier to a new IAM role, which in turn can be used by a Lambda function.
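The same role setup can also be scripted. Here is a sketch, assuming the policy from the previous section already exists in your account (the account ID is a placeholder):

import boto3
import json

iam = boto3.client("iam")

# Trust policy that lets the Lambda service assume the role.
assume_role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

iam.create_role(
    RoleName="MyLambdaRole",
    AssumeRolePolicyDocument=json.dumps(assume_role_policy),
    Description="Execution role for MyLambdaFunction",
)

iam.attach_role_policy(
    RoleName="MyLambdaRole",
    PolicyArn="arn:aws:iam::123456789012:policy/MyLambdaPolicy",
)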
Apply the IAM role to a Lambda function
You have created an IAM role that has an attached IAM policy that grants both read and write access to DynamoDB and write access to CloudWatch Logs. The next step is to apply the IAM role to a Lambda function.
To apply the IAM role to a Lambda function:
Navigate to the Lambda console and choose Create function.
On the Create function page under Author from scratch, name the function MyLambdaFunction, and choose the runtime you want to use based on your application requirements. Lambda currently supports Node.js, Java, Python, Go, and .NET Core. From the Role dropdown, choose Choose an existing role, and from the Existing role dropdown, choose MyLambdaRole. Then choose Create function.
MyLambdaFunction now has access to CloudWatch Logs and DynamoDB. You can choose either of these services to see the details of which permissions the function has.
If you have any comments about this blog post, submit them in the “Comments” section below. If you have any questions about the services used, start a new thread in the applicable AWS forum: IAM, Lambda, DynamoDB, or CloudWatch.
AWS has updated its certifications against the ISO 9001, ISO 27001, ISO 27017, and ISO 27018 standards, adding 29 services this cycle and bringing the total to 67 services now under ISO compliance.
AWS maintains certifications through extensive audits of its controls to ensure that information security risks that affect the confidentiality, integrity, and availability of company and customer information are appropriately managed.
You can download copies of the AWS ISO certificates that contain AWS’s in-scope services and Regions, and use these certificates to jump-start your own certification efforts.
Today we are launching our 18th AWS Region, our fourth in Europe. Located in the Paris area, the new Region lets AWS customers better serve end users in and around France.
The Paris Region will benefit from three AWS Direct Connect locations. Telehouse Voltaire is available today. AWS Direct Connect will also become available at Equinix Paris in early 2018, followed by Interxion Paris.
All AWS infrastructure regions around the world are designed, built, and regularly audited to meet the most rigorous compliance standards and to provide high levels of security for all AWS customers. These include ISO 27001, ISO 27017, ISO 27018, SOC 1 (Formerly SAS 70), SOC 2 and SOC 3 Security & Availability, PCI DSS Level 1, and many more. This means customers benefit from all the best practices of AWS policies, architecture, and operational processes built to satisfy the needs of even the most security sensitive customers.
AWS is certified under the EU-US Privacy Shield, and the AWS Data Processing Addendum (DPA) is GDPR-ready and available now to all AWS customers to help them prepare for May 25, 2018 when the GDPR becomes enforceable. The current AWS DPA, as well as the AWS GDPR DPA, allows customers to transfer personal data to countries outside the European Economic Area (EEA) in compliance with European Union (EU) data protection laws. AWS also adheres to the Cloud Infrastructure Service Providers in Europe (CISPE) Code of Conduct. The CISPE Code of Conduct helps customers ensure that AWS is using appropriate data protection standards to protect their data, consistent with the GDPR. In addition, AWS offers a wide range of services and features to help customers meet the requirements of the GDPR, including services for access controls, monitoring, logging, and encryption.
From Our Customers Many AWS customers are preparing to use this new Region. Here’s a small sample:
Societe Generale, one of the largest banks in France and the world, has accelerated their digital transformation while working with AWS. They developed SG Research, an application that makes reports from Societe Generale’s analysts available to corporate customers in order to improve the decision-making process for investments. The new AWS Region will reduce latency between applications running in the cloud and in their French data centers.
SNCF is the national railway company of France. Their mobile app, powered by AWS, delivers real-time traffic information to 14 million riders. Extreme weather, traffic events, holidays, and engineering works can cause usage to peak at hundreds of thousands of users per second. They are planning to use machine learning and big data to add predictive features to the app.
Radio France, the French public radio broadcaster, offers seven national networks, and uses AWS to accelerate its innovation and stay competitive.
Les Restos du Coeur is a French charity that provides assistance to the needy, delivering food packages and helping with their social and economic reintegration into French society. Les Restos du Coeur is using AWS for its CRM system to track the assistance given to each of their beneficiaries and the impact this is having on their lives.
AlloResto by JustEat (a leader in the French FoodTech industry) is using AWS to scale during traffic peaks and to accelerate their innovation process.
AWS Consulting and Technology Partners We are already working with a wide variety of consulting, technology, managed service, and Direct Connect partners in France. Here’s a partial list:
AWS in France We have been investing in Europe, with a focus on France, for the last 11 years. We have also been developing documentation and training programs to help our customers to improve their skills and to accelerate their journey to the AWS Cloud.
As part of our commitment to AWS customers in France, we plan to train more than 25,000 people in the coming years, helping them develop highly sought after cloud skills. They will have access to AWS training resources in France via AWS Academy, AWSome days, AWS Educate, and webinars, all delivered in French by AWS Technical Trainers and AWS Certified Trainers.
Use it Today The EU (Paris) Region is open for business now and you can start using it today!
In late September, during the annual Splunk .conf, Splunk and Amazon Web Services (AWS) jointly announced that Amazon Kinesis Data Firehose now supports Splunk Enterprise and Splunk Cloud as a delivery destination. This native integration between Splunk Enterprise, Splunk Cloud, and Amazon Kinesis Data Firehose is designed to make AWS data ingestion setup seamless, while offering a secure and fault-tolerant delivery mechanism. We want to enable customers to monitor and analyze machine data from any source and use it to deliver operational intelligence and optimize IT, security, and business performance.
With Kinesis Data Firehose, customers can use a fully managed, reliable, and scalable data streaming solution to Splunk. In this post, we tell you a bit more about the Kinesis Data Firehose and Splunk integration. We also show you how to ingest large amounts of data into Splunk using Kinesis Data Firehose.
Push vs. Pull data ingestion
Presently, customers use a combination of two ingestion patterns, primarily based on data source and volume, in addition to existing company infrastructure and expertise:
Pull-based approach: Dedicated pollers, such as the Splunk Add-on for AWS, periodically pull data from AWS sources into Splunk.
Push-based approach: Streaming data directly from AWS to Splunk HTTP Event Collector (HEC) by using AWS Lambda. Examples of applicable data sources include CloudWatch Logs and Amazon Kinesis Data Streams.
The pull-based approach offers data delivery guarantees such as retries and checkpointing out of the box. However, it requires more operational effort to manage and orchestrate the dedicated pollers, which commonly run on Amazon EC2 instances. With this setup, you pay for the infrastructure even when it’s idle.
On the other hand, the push-based approach offers a low-latency scalable data pipeline made up of serverless resources like AWS Lambda sending directly to Splunk indexers (by using Splunk HEC). This approach translates into lower operational complexity and cost. However, if you need guaranteed data delivery then you have to design your solution to handle issues such as a Splunk connection failure or Lambda execution failure. To do so, you might use, for example, AWS Lambda Dead Letter Queues.
How about getting the best of both worlds?
Let’s go over the new integration’s end-to-end solution and examine how Kinesis Data Firehose and Splunk together expand the push-based approach into a native AWS solution for applicable data sources.
By using a managed service like Kinesis Data Firehose for data ingestion into Splunk, we provide out-of-the-box reliability and scalability. One of the pain points of the old approach was the overhead of managing the data collection nodes (Splunk heavy forwarders). With the new Kinesis Data Firehose to Splunk integration, there are no forwarders to manage or set up. Data producers (1) are configured through the AWS Management Console to drop data into Kinesis Data Firehose.
You can also create your own data producers. For example, you can drop data into a Firehose delivery stream by using Amazon Kinesis Agent, or by using the Firehose API (PutRecord(), PutRecordBatch()), or by writing to a Kinesis Data Stream configured to be the data source of a Firehose delivery stream. For more details, refer to Sending Data to an Amazon Kinesis Data Firehose Delivery Stream.
You might need to transform the data before it goes into Splunk for analysis. For example, you might want to enrich it or filter or anonymize sensitive data. You can do so using AWS Lambda. In this scenario, Kinesis Data Firehose buffers data from the incoming source data, sends it to the specified Lambda function (2), and then rebuffers the transformed data to the Splunk Cluster. Kinesis Data Firehose provides the Lambda blueprints that you can use to create a Lambda function for data transformation.
Systems fail all the time. Let’s see how this integration handles outside failures to guarantee data durability. In cases when Kinesis Data Firehose can’t deliver data to the Splunk Cluster, data is automatically backed up to an S3 bucket. You can configure this feature while creating the Firehose delivery stream (3). You can choose to back up all data or only the data that’s failed during delivery to Splunk.
In addition to using S3 for data backup, this Firehose integration with Splunk supports Splunk Indexer Acknowledgments to guarantee event delivery. This feature is configured on Splunk’s HTTP Event Collector (HEC) (4). It ensures that HEC returns an acknowledgment to Kinesis Data Firehose only after data has been indexed and is available in the Splunk cluster (5).
Now let’s look at a hands-on exercise that shows how to forward VPC flow logs to Splunk.
How-to guide
To process VPC flow logs, we implement the following architecture.
Amazon Virtual Private Cloud (Amazon VPC) delivers flow log files into an Amazon CloudWatch Logs group. Using a CloudWatch Logs subscription filter, we set up real-time delivery of CloudWatch Logs to a Kinesis Data Firehose stream.
Data coming from CloudWatch Logs is compressed with gzip compression. To work with this compression, we need to configure a Lambda-based data transformation in Kinesis Data Firehose to decompress the data and deposit it back into the stream. Firehose then delivers the raw logs to the Splunk HTTP Event Collector (HEC).
If delivery to the Splunk HEC fails, Firehose deposits the logs into an Amazon S3 bucket. You can then ingest the events from S3 using an alternate mechanism such as a Lambda function.
When data reaches Splunk (Enterprise or Cloud), Splunk parsing configurations (packaged in the Splunk Add-on for Kinesis Data Firehose) extract and parse all fields. They make data ready for querying and visualization using Splunk Enterprise and Splunk Cloud.
Walkthrough
Install the Splunk Add-on for Amazon Kinesis Data Firehose
The Splunk Add-on for Amazon Kinesis Data Firehose enables Splunk (be it Splunk Enterprise, Splunk App for AWS, or Splunk Enterprise Security) to use data ingested from Amazon Kinesis Data Firehose. Install the Add-on on all the indexers with an HTTP Event Collector (HEC). The Add-on is available for download from Splunkbase.
HTTP Event Collector (HEC)
Before you can use Kinesis Data Firehose to deliver data to Splunk, set up the Splunk HEC to receive the data. From Splunk web, go to the Settings menu, choose Data Inputs, and choose HTTP Event Collector. Choose Global Settings, ensure All tokens is enabled, and then choose Save. Then choose New Token to create a new HEC endpoint and token. When you create a new token, make sure that Enable indexer acknowledgment is checked.
When prompted to select a source type, select aws:cloudwatch:vpcflow.
Create an S3 backsplash bucket
To provide for situations in which Kinesis Data Firehose can’t deliver data to the Splunk Cluster, we use an S3 bucket to back up the data. You can configure this feature to back up all data or only the data that’s failed during delivery to Splunk.
Note: Bucket names are unique. Thus, you can’t use tmak-backsplash-bucket.
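If you prefer the AWS CLI, a command like the following creates the backup bucket; the bucket name shown here is only a placeholder, so substitute your own globally unique name:
$ aws s3 mb s3://my-firehose-backsplash-bucket --region us-east-1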
Create an IAM role for the Lambda transform function
Firehose triggers an AWS Lambda function that transforms the data in the delivery stream. Let’s first create a role for the Lambda function called LambdaBasicRole.
Note: You can also set this role up when creating your Lambda function.
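The create-role command that follows references a trust policy file, TrustPolicyForLambda.json, which must allow Lambda to assume the role. If you don’t already have such a file, standard content like the following works:
$ cat > TrustPolicyForLambda.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "lambda.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF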
$ aws iam create-role --role-name LambdaBasicRole --assume-role-policy-document file://TrustPolicyForLambda.json
After the role is created, attach the managed Lambda basic execution policy to it.
$ aws iam attach-role-policy \
    --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole \
    --role-name LambdaBasicRole
Create a Firehose Stream
On the AWS console, open the Amazon Kinesis service, go to the Firehose console, and choose Create Delivery Stream.
In the next section, you can specify whether you want to use an inline Lambda function for transformation. Because incoming CloudWatch Logs are gzip compressed, choose Enabled for Record transformation, and then choose Create new.
From the list of the available blueprint functions, choose Kinesis Data Firehose CloudWatch Logs Processor. This function unzips the data and places it back into the Firehose stream in compliance with the record transformation output model.
Enter a name for the Lambda function, choose Choose an existing role, and then choose the role you created earlier. Then choose Create Function.
Go back to the Firehose Stream wizard, choose the Lambda function you just created, and then choose Next.
Select Splunk as the destination, and enter your Splunk HTTP Event Collector information.
Note: Amazon Kinesis Data Firehose requires the Splunk HTTP Event Collector (HEC) endpoint to be terminated with a valid CA-signed certificate matching the DNS hostname used to connect to your HEC endpoint. You receive delivery errors if you are using a self-signed certificate.
In this example, we only back up logs that fail during delivery.
To monitor your Firehose delivery stream, enable error logging. Doing this means that you can monitor record delivery errors.
Create an IAM role for the Firehose stream by choosing Create new, or Choose. Doing this brings you to a new screen. Choose Create a new IAM role, give the role a name, and then choose Allow.
If you look at the policy document, you can see that the role gives Kinesis Data Firehose permission to publish error logs to CloudWatch, execute your Lambda function, and put records into your S3 backup bucket.
You now get a chance to review and adjust the Firehose stream settings. When you are satisfied, choose Create Stream. You get a confirmation once the stream is created and active.
Create a VPC Flow Log
To send events from Amazon VPC, you need to set up a VPC flow log. If you already have a VPC flow log you want to use, you can skip to the “Publish CloudWatch to Kinesis Data Firehose” section.
On the AWS console, open the Amazon VPC service. Then choose VPC, Your VPC, and choose the VPC you want to send flow logs from. Choose Flow Logs, and then choose Create Flow Log. If you don’t have an IAM role that allows your VPC to publish logs to CloudWatch, choose Set Up Permissions and Create new role. Use the defaults when presented with the screen to create the new IAM role.
Once active, your VPC flow log should look like the following.
Publish CloudWatch to Kinesis Data Firehose
When you generate traffic to or from your VPC, the log group is created in Amazon CloudWatch. The new log group has no subscription filter, so set up a subscription filter. Setting this up establishes a real-time data feed from the log group to your Firehose delivery stream.
At present, you have to use the AWS Command Line Interface (AWS CLI) to create a CloudWatch Logs subscription to a Kinesis Data Firehose stream. However, you can use the AWS console to create subscriptions to Lambda and Amazon Elasticsearch Service.
To allow CloudWatch to publish to your Firehose stream, you need to give it permissions.
$ aws iam create-role --role-name CWLtoKinesisFirehoseRole --assume-role-policy-document file://TrustPolicyForCWLToFireHose.json
Here is the content for TrustPolicyForCWLToFireHose.json.
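A minimal version of this trust policy allows CloudWatch Logs in your Region to assume the role (shown here for us-east-1; adjust the Region to match your own):
$ cat > TrustPolicyForCWLToFireHose.json <<'EOF'
{
  "Statement": {
    "Effect": "Allow",
    "Principal": { "Service": "logs.us-east-1.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }
}
EOF
The role also needs a permissions policy that lets it put records into your delivery stream (for example, firehose:PutRecord and firehose:PutRecordBatch on the stream’s ARN). With the role in place, create the subscription filter; the log group name, delivery stream name, and account ID below are placeholders:
$ aws logs put-subscription-filter \
    --log-group-name "my-vpc-flow-logs" \
    --filter-name "VPCFlowLogsToFirehose" \
    --filter-pattern "" \
    --destination-arn "arn:aws:firehose:us-east-1:123456789012:deliverystream/my-splunk-delivery-stream" \
    --role-arn "arn:aws:iam::123456789012:role/CWLtoKinesisFirehoseRole"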
When you run the AWS CLI command preceding, you don’t get any acknowledgment. To validate that your CloudWatch Log Group is subscribed to your Firehose stream, check the CloudWatch console.
As soon as the subscription filter is created, the real-time log data from the log group goes into your Firehose delivery stream. Your stream then delivers it to your Splunk Enterprise or Splunk Cloud environment for querying and visualization. The screenshot following is from Splunk Enterprise.
In addition, you can monitor and view metrics associated with your delivery stream using the AWS console.
Conclusion
Although our walkthrough uses VPC Flow Logs, the pattern can be used in many other scenarios. These include ingesting data from AWS IoT, other CloudWatch logs and events, and Kinesis Data Streams or other data sources using the Kinesis Agent or Kinesis Producer Library. We also used the Lambda blueprint Kinesis Data Firehose CloudWatch Logs Processor to transform streaming records from Kinesis Data Firehose. However, you might need to use a different Lambda blueprint or disable record transformation entirely, depending on your use case. For an additional use case using Kinesis Data Firehose, check out the This Is My Architecture video, which discusses how to securely centralize cross-account data analytics using Kinesis and Splunk.
Tarik Makota is a solutions architect with the Amazon Web Services Partner Network. He provides technical guidance, design advice, and thought leadership to AWS’s most strategic software partners. His career includes work in an extremely broad range of software development and architecture roles across ERP, financial printing, benefit delivery and administration, and financial services. He holds an M.S. in Software Development and Management from Rochester Institute of Technology.
Roy Arsan is a solutions architect in the Splunk Partner Integrations team. He has a background in product development, cloud architecture, and building consumer and enterprise cloud applications. More recently, he has architected Splunk solutions on major cloud providers, including an AWS Quick Start for Splunk that enables AWS users to easily deploy distributed Splunk Enterprise straight from their AWS console. He’s also the co-author of the AWS Lambda blueprints for Splunk. He holds an M.S. in Computer Science Engineering from the University of Michigan.
Today we launched our 17th Region globally, and the second in China. The AWS China (Ningxia) Region, operated by Ningxia Western Cloud Data Technology Co. Ltd. (NWCD), is generally available now and provides customers another option to run applications and store data on AWS in China.
Operating Partner To comply with China’s legal and regulatory requirements, AWS has formed a strategic technology collaboration with NWCD to operate and provide services from the AWS China (Ningxia) Region. Founded in 2015, NWCD is a licensed datacenter and cloud services provider, based in Ningxia, China. NWCD joins Sinnet, the operator of the AWS China (Beijing) Region, as an AWS operating partner in China. Through these relationships, AWS provides its industry-leading technology, guidance, and expertise to NWCD and Sinnet, while NWCD and Sinnet operate and provide AWS cloud services to local customers. While the cloud services offered in both AWS China Regions are the same as those available in other AWS Regions, the AWS China Regions are different in that they are isolated from all other AWS Regions and operated by AWS’s Chinese partners separately from all other AWS Regions. Customers using the AWS China Regions enter into customer agreements with Sinnet and NWCD, rather than with AWS.
Use it Today The AWS China (Ningxia) Region, operated by NWCD, is open for business, and you can start using it now! Starting today, Chinese developers, startups, and enterprises, as well as government, education, and non-profit organizations, can leverage AWS to run their applications and store their data in the new AWS China (Ningxia) Region, operated by NWCD. Customers already using the AWS China (Beijing) Region, operated by Sinnet, can select the AWS China (Ningxia) Region directly from the AWS Management Console, while new customers can request an account at www.amazonaws.cn to begin using both AWS China Regions.
One of the key benefits of serverless applications is the ease in which they can scale to meet traffic demands or requests, with little to no need for capacity planning. In AWS Lambda, which is the core of the serverless platform at AWS, the unit of scale is a concurrent execution. This refers to the number of executions of your function code that are happening at any given time.
Thinking about concurrent executions as a unit of scale is a fairly unique concept. In this post, I dive deeper into this and talk about how you can make use of per function concurrency limits in Lambda.
Understanding concurrency in Lambda
Instead of diving right into the guts of how Lambda works, here’s an appetizing analogy: a magical pizza. Yes, a magical pizza!
This magical pizza has some unique properties:
It has a fixed maximum number of slices, such as 8.
Slices automatically re-appear after they are consumed.
When you take a slice from the pizza, it does not re-appear until it has been completely consumed.
One person can take multiple slices at a time.
You can easily ask to have the number of slices increased, but they remain fixed at any point in time otherwise.
Now that the magical pizza’s properties are defined, here’s a hypothetical situation of some friends sharing this pizza.
Shawn, Kate, Daniela, Chuck, Ian, and Avleen get together every Friday to share a pizza and catch up on their week. As there are just six of them, they can all easily enjoy a slice of pizza at a time. As they finish each slice, it re-appears in the pizza pan and they can take another slice. Given the magical properties of their pizza, they can continue to eat all they want, but with two very important constraints:
If any of them take too many slices at once, the others may not get as much as they want.
If they take too many slices, they might also eat too much and get sick.
One particular week, some of the friends are hungrier than the rest, taking two slices at a time instead of just one. If more than two of them try to take two pieces at a time, this can cause contention for pizza slices. Some of them would wait hungry for the slices to re-appear. They could ask for a pizza with more slices, but then run the same risk again later if more hungry friends join than planned for.
What can they do?
If the friends agree to accept a limit on the maximum number of slices they each eat concurrently, both of these issues are avoided. Some might have a maximum of 2 of the 8 slices, and others more or less, as long as they keep the total at or under eight slices being eaten at one time. This keeps anyone from going hungry or eating too much. The six friends can happily enjoy their magical pizza without worry!
Concurrency in Lambda
Concurrency in Lambda actually works similarly to the magical pizza model. Each AWS Account has an overall AccountLimit value that is fixed at any point in time, but can be easily increased as needed, just like the count of slices in the pizza. As of May 2017, the default limit is 1000 “slices” of concurrency per AWS Region.
Also like the magical pizza, each concurrency “slice” can only be consumed individually one at a time. After consumption, it becomes available to be consumed again. Services invoking Lambda functions can consume multiple slices of concurrency at the same time, just like the group of friends can take multiple slices of the pizza.
Let’s take our example of the six friends and bring it back to AWS services that commonly invoke Lambda:
Amazon S3
Amazon Kinesis
Amazon DynamoDB
Amazon Cognito
In a single account with the default concurrency limit of 1000 concurrent executions, any of these four services could invoke enough functions to consume the entire limit or some part of it. Just like with the pizza example, there is the possibility for two issues to pop up:
One or more of these services could invoke enough functions to consume a majority of the available concurrency capacity. This could cause others to be starved for it, causing failed invocations.
A service could consume too much concurrent capacity and cause a downstream service or database to be overwhelmed, which could cause failed executions.
For Lambda functions that are launched in a VPC, you have the potential to consume the available IP addresses in a subnet or the maximum number of elastic network interfaces to which your account has access. For more information, see Configuring a Lambda Function to Access Resources in an Amazon VPC. For information about elastic network interface limits, see the Network Interfaces section in the Amazon VPC Limits topic.
One way to solve both of these problems is applying a concurrency limit to the Lambda functions in an account.
Configuring per function concurrency limits
You can now set a concurrency limit on individual Lambda functions in an account. The concurrency limit that you set reserves a portion of your account level concurrency for a given function. All of your functions’ concurrent executions count against this account-level limit by default.
If you set a concurrency limit for a specific function, then that function’s concurrency limit allocation is deducted from the shared pool and assigned to that specific function. AWS also reserves 100 units of concurrency for all functions that don’t have a specified concurrency limit set. This helps to make sure that future functions have capacity to be consumed.
Going back to the example of the consuming services, you could set throttles for the functions as follows:
Amazon S3 function = 350
Amazon Kinesis function = 200
Amazon DynamoDB function = 200
Amazon Cognito function = 150
Total = 900
With the 100 reserved for all non-concurrency reserved functions, this totals the account limit of 1000.
Here’s how this works. To start, create a basic Lambda function that is invoked via Amazon API Gateway. This Lambda function returns a single “Hello World” statement with an added sleep time between 2 and 5 seconds. The sleep time simulates an API providing some sort of capability that can take a varied amount of time. The goal here is to show how an API that is under load can reach its concurrency limit, and what happens when it does.
To create the example function:
Under Basic settings, set Timeout to 10 seconds. While this function should only ever take up to 5-6 seconds (with the 5-second max sleep), this gives you a little bit of room if it takes longer.
Choose Save at the top right.
At this point, your function is configured for this example. Test it and confirm this in the console:
Choose Test.
Enter a name (it doesn’t matter for this example).
Choose Create.
In the console, choose Test again.
You should see output similar to the following:
Now configure API Gateway so that you have an HTTPS endpoint to test against.
In the Lambda console, choose Configuration.
Under Triggers, choose API Gateway.
Open the API Gateway icon now shown as attached to your Lambda function:
Under Configure triggers, leave the default values for API Name and Deployment stage. For Security, choose Open.
Choose Add, Save.
API Gateway is now configured to invoke Lambda at the Invoke URL shown under its configuration. You can take this URL and test it in any browser or command line, using tools such as “curl”:
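For example, a quick test with curl might look like the following; the URL here is a placeholder, so use the Invoke URL from your own API Gateway configuration:
curl https://abcdef1234.execute-api.us-east-1.amazonaws.com/prod/MyLambdaFunction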
Now start throwing some load against your API Gateway + Lambda function combo. Right now, your function is only limited by the total amount of concurrency available in an account. For this example account, you might have 850 unreserved concurrency out of a full account limit of 1000, due to having configured a few concurrency limits already (plus the 100 units of concurrency reserved for all functions without configured limits). You can find all of this information on the main Dashboard page of the Lambda console:
For generating load in this example, use an open source tool called “hey” (https://github.com/rakyll/hey), which works similarly to ApacheBench (ab). You test from an Amazon EC2 instance running the default Amazon Linux AMI from the EC2 console. For more help with configuring an EC2 instance, follow the steps in the Launch Instance Wizard.
After the EC2 instance is running, SSH into the host and run the following:
sudo yum install go
go get -u github.com/rakyll/hey
“hey” is easy to use. For these tests, specify a total number of requests (5,000) and a concurrency of 50 against the API Gateway URL as follows (replace the URL with your own):
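A command along these lines does the job; depending on your Go environment, the hey binary typically lands in ./go/bin (or $GOPATH/bin), and the URL shown is a placeholder for your own Invoke URL:
./go/bin/hey -n 5000 -c 50 https://abcdef1234.execute-api.us-east-1.amazonaws.com/prod/MyLambdaFunction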
You can see a helpful histogram and latency distribution. Remember that this Lambda function has a random sleep period in it and so isn’t entirely representational of a real-life workload. Those three 502s warrant digging deeper, but could be due to Lambda cold-start timing and the “second” variable being the maximum of 5, causing the Lambda functions to time out. AWS X-Ray and the Amazon CloudWatch logs generated by both API Gateway and Lambda could help you troubleshoot this.
Configuring a concurrency reservation
Now that you’ve established that you can generate this load against the function, I show you how to limit it and protect a backend resource from being overloaded by all of these requests.
In the console, choose Configure.
Under Concurrency, for Reserve concurrency, enter 25.
Click on Save in the top right corner.
You could also set this with the AWS CLI using the Lambda put-function-concurrency command or see your current concurrency configuration via Lambda get-function. Here’s an example command:
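The following is a sketch; the function name is a placeholder for the name you gave your own function:
# Reserve 25 concurrent executions for this function
aws lambda put-function-concurrency \
    --function-name MyLambdaFunction \
    --reserved-concurrent-executions 25
# Inspect the current setting (look for the "Concurrency" block in the output)
aws lambda get-function --function-name MyLambdaFunction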
Either way, you’ve set the Concurrency Reservation to 25 for this function. This acts as both a limit and a reservation in terms of making sure that you can execute 25 concurrent functions at all times. Going above this results in the throttling of the Lambda function. Depending on the invoking service, throttling can result in a number of different outcomes, as shown in the documentation on Throttling Behavior. This change has also reduced your unreserved account concurrency for other functions by 25.
Rerun the same load generation as before and see what happens. Previously, you tested at 50 concurrency, which worked just fine. By limiting the Lambda functions to 25 concurrency, you should see rate limiting kick in. Run the same test again:
While this test runs, refresh the Monitoring tab on your function detail page. You see the following warning message:
This is great! It means that your throttle is working as configured and you are now protecting your downstream resources from too much load from your Lambda function.
This looks fairly different from the last load test run. A large percentage of these requests failed fast due to the concurrency throttle failing them (those with the 0.724 seconds line). The timing shown here in the histogram represents the entire time it took to get a response between the EC2 instance and API Gateway calling Lambda and being rejected. It’s also important to note that this example was configured with an edge-optimized endpoint in API Gateway. You see under Status code distribution that 3076 of the 5000 requests failed with a 502, showing that the backend service from API Gateway and Lambda failed the request.
Other uses
Managing function concurrency can be useful in a few other ways beyond just limiting the impact on downstream services and providing a reservation of concurrency capacity. Here are two other uses:
Emergency kill switch
Cost controls
Emergency kill switch
On occasion, due to issues with applications I’ve managed in the past, I’ve had a need to disable a certain function or capability of an application. By setting the concurrency reservation and limit of a Lambda function to zero, you can do just that.
With the reservation set to zero, every invocation of the Lambda function results in being throttled. You could then work on the related parts of the infrastructure or application that aren’t working, and then reconfigure the concurrency limit to allow invocations again.
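As a sketch, flipping this kill switch and later removing it from the AWS CLI might look like the following; the function name is a placeholder:
# Throttle every invocation by reserving zero concurrency
aws lambda put-function-concurrency \
    --function-name MyLambdaFunction \
    --reserved-concurrent-executions 0
# Remove the reservation when you're ready to allow invocations again
aws lambda delete-function-concurrency --function-name MyLambdaFunction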
Cost controls
While I mentioned how you might want to use concurrency limits to control the downstream impact to services or databases that your Lambda function might call, another resource that you might be cautious about is money. Setting the concurrency throttle is another way to help control costs during development and testing of your application.
You might want to prevent against a function performing a recursive action too quickly or a development workload generating too high of a concurrency. You might also want to protect development resources connected to this function from generating too much cost, such as APIs that your Lambda function calls.
Conclusion
Concurrent executions as a unit of scale are a fairly unique characteristic of Lambda functions. Placing limits on how many concurrency “slices” your function can consume can prevent a single function from consuming all of the available concurrency in an account. Limits can also prevent a function from overwhelming a backend resource that isn’t as scalable.
Unlike monolithic applications or even microservices where there are mixed capabilities in a single service, Lambda functions encourage a sort of “nano-service” of small business logic directly related to the integration model connected to the function. I hope you’ve enjoyed this post and configure your concurrency limits today!
It was just about three years ago that AWS announced Amazon Elastic Container Service (Amazon ECS), to run and manage containers at scale on AWS. With Amazon ECS, you’ve been able to run your workloads at high scale and availability without having to worry about running your own cluster management and container orchestration software.
Today, AWS announced the availability of AWS Fargate – a technology that enables you to use containers as a fundamental compute primitive without having to manage the underlying instances. With Fargate, you don’t need to provision, configure, or scale virtual machines in your clusters to run containers. Fargate can be used with Amazon ECS today, with plans to support Amazon Elastic Container Service for Kubernetes (Amazon EKS) in the future.
Fargate has flexible configuration options so you can closely match your application needs, and it offers granular, per-second billing.
Amazon ECS with Fargate
Amazon ECS enables you to run containers at scale. This service also provides native integration into the AWS platform with VPC networking, load balancing, IAM, Amazon CloudWatch Logs, and CloudWatch metrics. These deep integrations make the Amazon ECS task a first-class object within the AWS platform.
To run tasks, you first need to stand up a cluster of instances, which involves picking the right types of instances and sizes, setting up Auto Scaling, and right-sizing the cluster for performance. With Fargate, you can leave all that behind and focus on defining your application and policies around permissions and scaling.
The same container management capabilities remain available so you can continue to scale your container deployments. With Fargate, the only entity to manage is the task. You don’t need to manage the instances or supporting software like Docker daemon or the Amazon ECS agent.
Fargate capabilities are available natively within Amazon ECS. This means that you don’t need to learn new API actions or primitives to run containers on Fargate.
With Amazon ECS, Fargate is a launch type option. You continue to define the applications the same way, by using task definitions. In contrast, the EC2 launch type gives you more control of your server clusters and provides a broader range of customization options.
For example, a RunTask command example is pasted below with the Fargate launch type:
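A sketch of such a command from the AWS CLI follows; the cluster name, task definition, subnet, and security group IDs are placeholders:
$ aws ecs run-task \
    --cluster my-cluster \
    --task-definition my-task-definition:1 \
    --count 1 \
    --launch-type FARGATE \
    --network-configuration "awsvpcConfiguration={subnets=[subnet-0123456789abcdef0],securityGroups=[sg-0123456789abcdef0],assignPublicIp=ENABLED}"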
Resource-based pricing and per second billing You pay by the task size and only for the time for which resources are consumed by the task. The price for CPU and memory is charged on a per-second basis. There is a one-minute minimum charge.
Flexible configuration options Fargate is available with 50 different combinations of CPU and memory to closely match your application needs. You can use anywhere from 2 GB per vCPU up to 8 GB per vCPU for various configurations. Match your workload requirements closely, whether they are general purpose, compute optimized, or memory optimized.
Networking All Fargate tasks run within your own VPC. Fargate supports the recently launched awsvpc networking mode and the elastic network interface for a task is visible in the subnet where the task is running. This provides the separation of responsibility so you retain full control of networking policies for your applications via VPC features like security groups, routing rules, and NACLs. Fargate also supports public IP addresses.
Load Balancing ECS Service Load Balancing for the Application Load Balancer and Network Load Balancer is supported. For the Fargate launch type, you specify the IP addresses of the Fargate tasks to register with the load balancers.
Permission tiers Even though there are no instances to manage with Fargate, you continue to group tasks into logical clusters. This allows you to manage who can run or view services within the cluster. The task IAM role is still applicable. Additionally, there is a new Task Execution Role that grants Amazon ECS permissions to perform operations such as pushing logs to CloudWatch Logs or pulling images from Amazon Elastic Container Registry (Amazon ECR).
Container Registry Support Fargate provides seamless authentication to help pull images from Amazon ECR via the Task Execution Role. Similarly, if you are using a public repository like DockerHub, you can continue to do so.
Amazon ECS CLI The Amazon ECS CLI provides high-level commands to simplify creating and running Amazon ECS clusters, tasks, and services. The latest version of the CLI now supports running tasks and services with Fargate.
EC2 and Fargate Launch Type Compatibility All Amazon ECS clusters are heterogeneous – you can run both Fargate and EC2 launch type tasks in the same cluster. This enables teams working on different applications to choose their own cadence of moving to Fargate, or to select a launch type that meets their requirements without breaking the existing model. You can make an existing ECS task definition compatible with the Fargate launch type and run it as a Fargate service, and vice versa. Choosing a launch type is not a one-way door!
Logging and Visibility With Fargate, you can send the application logs to CloudWatch logs. Service metrics (CPU and Memory utilization) are available as part of CloudWatch metrics. AWS partners for visibility, monitoring and application performance management including Datadog, Aquasec, Splunk, Twistlock, and New Relic also support Fargate tasks.
Conclusion
Fargate enables you to run containers without having to manage the underlying infrastructure. Today, Fargate is available for Amazon ECS, with support for Amazon EKS planned for 2018. Visit the Fargate product page to learn more, or get started in the AWS Console.
CodeBuild is a fully managed build service that compiles source code, runs tests, and produces software packages that are ready to deploy. As part of the build process, developers often require access to resources that should be isolated from the public Internet. Now CodeBuild builds can be optionally configured to have VPC connectivity and access these resources directly.
Accessing Resources in a VPC
You can configure builds to have access to a VPC when you create a CodeBuild project or you can update an existing CodeBuild project with VPC configuration attributes. Here’s how it looks in the console:
To configure VPC connectivity: select a VPC, one or more subnets within that VPC, and one or more VPC security groups that CodeBuild should apply when attaching to your VPC. Once configured, commands running as part of your build will be able to access resources in your VPC without transiting across the public Internet.
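If you script your project configuration, the same settings can be applied with the AWS CLI; the project name and the VPC, subnet, and security group IDs below are placeholders:
$ aws codebuild update-project \
    --name my-build-project \
    --vpc-config "vpcId=vpc-0123456789abcdef0,subnets=[subnet-0123456789abcdef0],securityGroupIds=[sg-0123456789abcdef0]"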
Use Cases
The availability of VPC connectivity from CodeBuild builds unlocks many potential uses. For example, you can:
Run integration tests from your build against data in an Amazon RDS instance that’s isolated on a private subnet.
Query data in an ElastiCache cluster directly from tests.
Interact with internal web services hosted on Amazon EC2, Amazon ECS, or services that use internal Elastic Load Balancing.
Retrieve dependencies from self-hosted, internal artifact repositories such as PyPI for Python, Maven for Java, npm for Node.js, and so on.
Access objects in an Amazon S3 bucket configured to allow access only through a VPC endpoint.
Query external web services that require fixed IP addresses through the Elastic IP address of the NAT gateway associated with your subnet(s).
… and more! Your builds can now access any resource that’s hosted in your VPC without any compromise on network isolation.
Internet Connectivity
CodeBuild requires access to resources on the public Internet to successfully execute builds. At a minimum, it must be able to reach your source repository system (such as AWS CodeCommit, GitHub, or Bitbucket), Amazon Simple Storage Service (Amazon S3) to deliver build artifacts, and Amazon CloudWatch Logs to stream logs from the build process. The interface attached to your VPC will not be assigned a public IP address, so to enable Internet access from your builds, you need to set up a managed NAT gateway or NAT instance for the subnets you configure. You must also ensure that your security groups allow outbound access to these services.
IP Address Space
Each running build will be assigned an IP address from one of the subnets in your VPC that you designate for CodeBuild to use. As CodeBuild scales to meet your build volume, ensure that you select subnets with enough address space to accommodate your expected number of concurrent builds.
Service Role Permissions
CodeBuild requires new permissions in order to manage network interfaces on your VPCs. If you create a service role for your new projects, these permissions will be included in that role’s policy automatically. For existing service roles, you can edit the policy document to include the additional actions. For the full policy document to apply to your service role, see Advanced Setup in the CodeBuild documentation.
For more information, see VPC Support in the CodeBuild documentation. We hope you find the ability to access internal resources on a VPC useful in your build processes! If you have any questions or feedback, feel free to reach out to us through the AWS CodeBuild forum or leave a comment!
The AWS US East/West Region has received a Provisional Authority to Operate (P-ATO) from the Joint Authorization Board (JAB) at the Federal Risk and Authorization Management Program (FedRAMP) Moderate baseline.
Though AWS has maintained an AWS US East/West Region Agency-ATO since early 2013, this announcement represents AWS’s carefully deliberated move to the JAB for the centralized maintenance of our P-ATO for 10 services already authorized. This also includes the addition of 10 new services to our FedRAMP program (see the complete list of services below). This doubles the number of FedRAMP Moderate services available to our customers to enable increased use of the cloud and support modernized IT missions. Our public sector customers now can leverage this FedRAMP P-ATO as a baseline for their own authorizations and look to the JAB for centralized Continuous Monitoring reporting and updates. In a significant enhancement for our partners that build their solutions on the AWS US East/West Region, they can now achieve FedRAMP JAB P-ATOs of their own for their Platform as a Service (PaaS) and Software as a Service (SaaS) offerings.
In line with FedRAMP security requirements, our independent FedRAMP assessment was completed in partnership with a FedRAMP accredited Third Party Assessment Organization (3PAO) on our technical, management, and operational security controls to validate that they meet or exceed FedRAMP’s Moderate baseline requirements. Effective immediately, you can begin leveraging this P-ATO for the following 20 services in the AWS US East/West Region:
Amazon Aurora (MySQL)*
Amazon CloudWatch Logs*
Amazon DynamoDB
Amazon Elastic Block Store
Amazon Elastic Compute Cloud
Amazon EMR*
Amazon Glacier*
Amazon Kinesis Streams*
Amazon RDS (MySQL, Oracle, Postgres*)
Amazon Redshift
Amazon Simple Notification Service*
Amazon Simple Queue Service*
Amazon Simple Storage Service
Amazon Simple Workflow Service*
Amazon Virtual Private Cloud
AWS CloudFormation*
AWS CloudTrail*
AWS Identity and Access Management
AWS Key Management Service
Elastic Load Balancing
* Services with first-time FedRAMP Moderate authorizations
We continue to work with the FedRAMP Project Management Office (PMO), other regulatory and compliance bodies, and our customers and partners to ensure that we are raising the bar on our customers’ security and compliance needs.
There have been so many launches and cloud innovations that you simply may not believe it. To catch up on some of these service launches and features, this post is a round-up of some cool releases that happened this summer and through the end of September.
The launches and features I want to share with you today are:
AWS IAM for Authenticating Database Users for RDS MySQL and Amazon Aurora
Amazon SES Reputation Dashboard
Amazon SES Open and Click Tracking Metrics
Serverless Image Handler by the Solutions Builder Team
AWS Ops Automator by the Solutions Builder Team
Let’s dive in, shall we!
AWS IAM for Authenticating Database Users for RDS MySQL and Amazon Aurora
Wished you could manage access to your Amazon RDS database instances and clusters using AWS IAM? Well, wish no longer. Amazon RDS has launched the ability for you to use IAM to manage database access for Amazon RDS for MySQL and Amazon Aurora DB.
What I like most about this new service feature is that it’s very easy to get started. To enable database user authentication using IAM, you select the Enable IAM DB Authentication check box when creating, modifying, or restoring your DB instance or cluster. You can enable IAM access using the RDS console, the AWS CLI, or the Amazon RDS API.
After configuring the database for IAM authentication, client applications authenticate to the database engine by providing temporary security credentials generated by the IAM Security Token Service. These credentials can be used instead of providing a password to the database engine.
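As a sketch, enabling the feature on an existing instance and then generating a temporary authentication token from the AWS CLI might look like the following; the instance identifier, endpoint, port, and database user name are placeholders:
# Turn on IAM database authentication for an existing instance
$ aws rds modify-db-instance \
    --db-instance-identifier mydbinstance \
    --enable-iam-database-authentication \
    --apply-immediately
# Generate a short-lived token to use in place of a password
$ aws rds generate-db-auth-token \
    --hostname mydbinstance.abcdefghijkl.us-east-1.rds.amazonaws.com \
    --port 3306 \
    --username mydbuser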
You can learn more about using IAM to provide targeted permissions and authentication to MySQL and Aurora by reviewing the Amazon RDS user guide.
Amazon SES Reputation Dashboard
To help Amazon Simple Email Service customers follow best practice guidelines for sending email, I am thrilled to announce that we launched the Reputation Dashboard to provide comprehensive reporting on email sending health. To aid in proactively managing the emails being sent, customers now have visibility into overall account health, sending metrics, and compliance or enforcement status.
The Reputation Dashboard will provide the following information:
Account status: A description of your account health status.
Healthy – No issues currently impacting your account.
Probation – Your account is on probation; the issues causing probation must be resolved to prevent suspension.
Pending end of probation decision – Your account is on probation. An Amazon SES team member must review your account before further action is taken.
Shutdown – Your account has been shut down. No email will be able to be sent using Amazon SES.
Pending shutdown – Your account is on probation and issues causing probation are unresolved.
Bounce Rate: Percentage of emails sent that have bounced and bounce rate status messages.
Complaint Rate: Percentage of emails sent that recipients have reported as spam and complaint rate status messages.
Notifications: Messages about other account reputation issues.
Amazon SES Open and Click Tracking Metrics
Another exciting feature recently added to Amazon SES is support for Email Open and Click Tracking Metrics. With the Email Open and Click Tracking Metrics feature, SES customers can now track when an email they’ve sent has been opened and when links within the email have been clicked. Using this SES feature allows you to better track email campaign engagement and effectiveness.
How does this work?
When using the email open tracking feature, SES adds a transparent, miniature image to the emails that you choose to track. When the email is opened, the mail client loads the tracking image, which triggers an open tracking event with Amazon SES. For email click (link) tracking, links in the email and/or email templates are replaced with a custom link. When the custom link is clicked, a click event is recorded in SES and the custom link redirects the email user to the link destination of the original email.
You can take advantage of the new open tracking and click tracking features by creating a new configuration set or altering an existing configuration set within SES. After choosing Amazon SNS, Amazon CloudWatch, or Amazon Kinesis Data Firehose as the AWS service to receive the open and click metrics, you only need to specify that configuration set for any emails you want to track.
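As a sketch, setting this up from the AWS CLI might look like the following; the configuration set name and SNS topic ARN are placeholders, and you could point the event destination at CloudWatch or Kinesis Data Firehose instead:
$ aws ses create-configuration-set --configuration-set Name=email-tracking
$ aws ses create-configuration-set-event-destination \
    --configuration-set-name email-tracking \
    --event-destination '{"Name":"open-click-events","Enabled":true,"MatchingEventTypes":["open","click"],"SNSDestination":{"TopicARN":"arn:aws:sns:us-east-1:123456789012:ses-tracking-events"}}'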
The AWS Solution Builder team has been hard at work helping to make it easier for you all to find answers to common architectural questions to aid in building and running applications on AWS. You can find these solutions on the AWS Answers page. Two new solutions released earlier this fall on AWS Answers are Serverless Image Handler and the AWS Ops Automator. Serverless Image Handler was developed to provide a solution to help customers dynamically process, manipulate, and optimize the handling of images on the AWS Cloud. The solution combines Amazon CloudFront for caching, AWS Lambda to dynamically retrieve images and make image modifications, and an Amazon S3 bucket to store images. Additionally, the Serverless Image Handler leverages the open source image-processing suite, Thumbor, for additional image manipulation, processing, and optimization.
The AWS Ops Automator solution helps you automate manual tasks, such as snapshot scheduling, using time-based or event-based triggers. It provides a framework for automated tasks and includes task audit trails, logging, resource selection, scaling, concurrency handling, task completion handling, and API request retries. The solution includes the following AWS services:
AWS CloudFormation: a template that launches the core framework of microservices and the solution-generated task configurations
Amazon DynamoDB: a table that stores task configuration data defining the event triggers and resources, and saves the results of the actions and any errors
Amazon CloudWatch Logs: provides logging to track warning and error messages
Amazon SNS: a topic that sends the solution’s logging information to a subscribed email address
Our Health Customer Stories page lists just a few of the many customers that are building and running healthcare and life sciences applications that run on AWS. Customers like Verge Health, Care Cloud, and Orion Health trust AWS with Protected Health Information (PHI) and Personally Identifying Information (PII) as part of their efforts to comply with HIPAA and HITECH.
Sixteen More Services In my last HIPAA Eligibility Update I shared the news that we added eight additional services to our list of HIPAA eligible services. Today I am happy to let you know that we have added another sixteen services to the list, bringing the total up to 46. Here are the newest additions, along with some short descriptions and links to some of my blog posts to jog your memory:
Amazon RDS for MariaDB – This service lets you set up scalable, managed MariaDB instances in minutes, and offers high performance, high availability, and a simplified security model that makes it easy for you to encrypt data at rest and in transit. Read Amazon RDS Update – MariaDB is Now Available to learn more.
AWS Batch – This service lets you run large-scale batch computing jobs on AWS. You don’t need to install or maintain specialized batch software or build your own server clusters. Read AWS Batch – Run Batch Computing Jobs on AWS to learn more.
AWS Key Management Service – This service makes it easy for you to create and control the encryption keys used to encrypt your data. It uses HSMs to protect your keys, and is integrated with AWS CloudTrail in order to provide you with a log of all key usage. Read New AWS Key Management Service (KMS) to learn more.
AWS Snowball Edge – This is a data transfer device with 100 terabytes of on-board storage as well as compute capabilities. You can use it to move large amounts of data into or out of AWS, as a temporary storage tier, or to support workloads in remote or offline locations. To learn more, read AWS Snowball Edge – More Storage, Local Endpoints, Lambda Functions.
AWS Snowmobile – This is an exabyte-scale data transfer service. Pulled by a semi-trailer truck, each Snowmobile packs 100 petabytes of storage into a ruggedized 45-foot long shipping container. Read AWS Snowmobile – Move Exabytes of Data to the Cloud in Weeks to learn more (and to see some of my finest LEGO work).
Success in the popular music industry is typically measured in terms of the number of Top 10 hits artists have to their credit. The music industry is a highly competitive multi-billion dollar business, and record labels incur various costs in exchange for a percentage of the profits from sales and concert tickets.
Predicting the success of an artist’s release in the popular music industry can be difficult. One release may be extremely popular, resulting in widespread play on TV, radio and social media, while another single may turn out quite unpopular, and therefore unprofitable. Record labels need to be selective in their decision making, and predictive analytics can help them with decision making around the type of songs and artists they need to promote.
In this walkthrough, you leverage H2O.ai, Amazon Athena, and RStudio to make predictions on whether a song might make it to the Top 10 Billboard charts. You explore the GLM, GBM, and deep learning modeling techniques using H2O’s rapid, distributed and easy-to-use open source parallel processing engine. RStudio is a popular IDE, licensed either commercially or under AGPLv3, for working with R. This is ideal if you don’t want to connect to a server via SSH and use code editors such as vi to do analytics. RStudio is available in a desktop version, or a server version that allows you to access R via a web browser. RStudio’s Notebooks feature is used to demonstrate the execution of code and output. In addition, this post showcases how you can leverage Athena for query and interactive analysis during the modeling phase. A working knowledge of statistics and machine learning would be helpful to interpret the analysis being performed in this post.
Walkthrough
Your goal is to predict whether a song will make it to the Top 10 Billboard charts. For this purpose, you will be using multiple modeling techniques―namely GLM, GBM and deep learning―and choose the model that is the best fit.
This solution involves the following steps:
Install and configure RStudio with Athena
Log in to RStudio
Install R packages
Connect to Athena
Create a dataset
Create models
Install and configure RStudio with Athena
Use the following AWS CloudFormation stack to install, configure, and connect RStudio on an Amazon EC2 instance with Athena.
Launching this stack creates all required resources and prerequisites:
Amazon EC2 instance with Amazon Linux (minimum size of t2.large is recommended)
Provisioning of the EC2 instance in an existing VPC and public subnet
Installation of Java 8
Assignment of an IAM role to the EC2 instance with the required permissions for accessing Athena and Amazon S3
Security group allowing access to the RStudio and SSH ports from the internet (I recommend restricting access to these ports)
S3 staging bucket required for Athena (referenced within RStudio as ATHENABUCKET)
RStudio username and password
Setup logs in Amazon CloudWatch Logs (if needed for additional troubleshooting)
Amazon EC2 Systems Manager agent, which makes it easy to manage and patch the instance
All AWS resources are created in the US-East-1 Region. To avoid cross-region data transfer fees, launch the CloudFormation stack in the same region. To check the availability of Athena in other regions, see Region Table.
Log in to RStudio
The instance security group has been automatically configured to allow incoming connections on the RStudio port 8787 from any source internet address. You can edit the security group to restrict source IP access. If you have trouble connecting, ensure that port 8787 isn’t blocked by subnet network ACLS or by your outgoing proxy/firewall.
In the CloudFormation stack, choose Outputs, Value, and then open the RStudio URL. You might need to wait for a few minutes until the instance has been launched.
Log in to RStudio with the username and password you provided during setup.
Install R packages
Next, install the required R packages from the RStudio console. You can download the R notebook file containing just the code.
#install pacman – a handy package manager for managing installs
if("pacman" %in% rownames(installed.packages()) == FALSE)
{install.packages("pacman")}
library(pacman)
p_load(h2o,rJava,RJDBC,awsjavasdk)
h2o.init(nthreads = -1)
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 2 hours 42 minutes
## H2O cluster version: 3.10.4.6
## H2O cluster version age: 4 months and 4 days !!!
## H2O cluster name: H2O_started_from_R_rstudio_hjx881
## H2O cluster total nodes: 1
## H2O cluster total memory: 3.30 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## R Version: R version 3.3.3 (2017-03-06)
## Warning in h2o.clusterInfo():
## Your H2O cluster version is too old (4 months and 4 days)!
## Please download and install the latest version from http://h2o.ai/download/
#install aws sdk if not present (pre-requisite for using Athena with an IAM role)
if (!aws_sdk_present()) {
install_aws_sdk()
}
load_sdk()
## NULL
Connect to Athena
Next, establish a connection to Athena from RStudio, using an IAM role associated with your EC2 instance. Use ATHENABUCKET to specify the S3 staging directory.
URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.1.jar'
fil <- basename(URL)
#download the file into current working directory
if (!file.exists(fil)) download.file(URL, fil)
#verify that the file has been downloaded successfully
list.files()
## [1] "AthenaJDBC41-1.0.1.jar"
drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")
con <- jdbcConnection <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
s3_staging_dir=Sys.getenv("ATHENABUCKET"),
aws_credentials_provider_class="com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
Verify the connection. The results returned depend on your specific Athena setup.
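For example, you can run a lightweight query over the connection and list the databases that Athena exposes (a minimal check; the databases returned depend on your own setup):
#quick sanity check: list the databases visible to this Athena connection
dbGetQuery(con, "show databases")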
For this analysis, you use a sample dataset combining information from Billboard and Wikipedia with Echo Nest data in the Million Songs Dataset. Upload this dataset into your own S3 bucket. The following list describes the fields used in this dataset.
year: Year that the song was released
songtitle: Title of the song
artistname: Name of the song artist
songid: Unique identifier for the song
artistid: Unique identifier for the song artist
timesignature: Variable estimating the time signature of the song
timesignature_confidence: Confidence in the estimate for timesignature
loudness: Continuous variable indicating the average amplitude of the audio in decibels
tempo: Variable indicating the estimated beats per minute of the song
tempo_confidence: Confidence in the estimate for tempo
key: Variable with twelve levels indicating the estimated key of the song (C, C#, …, B)
key_confidence: Confidence in the estimate for key
energy: Variable that represents the overall acoustic energy of the song, using a mix of features such as loudness
pitch: Continuous variable that indicates the pitch of the song
timbre_0_min through timbre_11_min: Variables that indicate the minimum values over all segments for each of the twelve values in the timbre vector
timbre_0_max through timbre_11_max: Variables that indicate the maximum values over all segments for each of the twelve values in the timbre vector
top10: Indicator for whether the song made it to the Top 10 of the Billboard charts (1 if it was in the Top 10, and 0 if not)
Create an Athena table based on the dataset
In the Athena console, select the default database, sampledb, or create a new database.
Run the following create table statement.
create external table if not exists billboard
(
year int,
songtitle string,
artistname string,
songID string,
artistID string,
timesignature int,
timesignature_confidence double,
loudness double,
tempo double,
tempo_confidence double,
key int,
key_confidence double,
energy double,
pitch double,
timbre_0_min double,
timbre_0_max double,
timbre_1_min double,
timbre_1_max double,
timbre_2_min double,
timbre_2_max double,
timbre_3_min double,
timbre_3_max double,
timbre_4_min double,
timbre_4_max double,
timbre_5_min double,
timbre_5_max double,
timbre_6_min double,
timbre_6_max double,
timbre_7_min double,
timbre_7_max double,
timbre_8_min double,
timbre_8_max double,
timbre_9_min double,
timbre_9_max double,
timbre_10_min double,
timbre_10_max double,
timbre_11_min double,
timbre_11_max double,
Top10 int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://aws-bigdata-blog/artifacts/predict-billboard/data'
;
Inspect the table definition for the ‘billboard’ table that you have created. If you chose a database other than sampledb, replace that value with your choice.
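One way to do this (a quick check; substitute your database name if it isn’t sampledb) is to ask Athena for the table’s DDL:
#inspect the definition of the billboard table
dbGetQuery(con, "show create table sampledb.billboard")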
Next, run a sample query to obtain a list of all songs from Janet Jackson that made it to the Billboard Top 10 charts.
dbGetQuery(con, " SELECT songtitle,artistname,top10 FROM sampledb.billboard WHERE lower(artistname) = 'janet jackson' AND top10 = 1")
## songtitle artistname top10
## 1 Runaway Janet Jackson 1
## 2 Because Of Love Janet Jackson 1
## 3 Again Janet Jackson 1
## 4 If Janet Jackson 1
## 5 Love Will Never Do (Without You) Janet Jackson 1
## 6 Black Cat Janet Jackson 1
## 7 Come Back To Me Janet Jackson 1
## 8 Alright Janet Jackson 1
## 9 Escapade Janet Jackson 1
## 10 Rhythm Nation Janet Jackson 1
Determine how many songs in this dataset are specifically from the year 2010.
dbGetQuery(con, " SELECT count(*) FROM sampledb.billboard WHERE year = 2010")
## _col0
## 1 373
The sample dataset provides certain song properties of interest that can be analyzed to gauge the impact to the song’s overall popularity. Look at one such property, timesignature, and determine the value that is the most frequent among songs in the database. Timesignature is a measure of the number of beats and the type of note involved.
Running the query directly may result in an error, as shown in the commented lines below. This error is a result of trying to retrieve a large result set over a JDBC connection, which can cause out-of-memory issues at the client level. To address this, reduce the fetch size and run again.
#t<-dbGetQuery(con, " SELECT timesignature FROM sampledb.billboard")
#Note: Running the preceding query results in the following error:
#Error in .jcall(rp, "I", "fetch", stride, block): java.sql.SQLException: The requested #fetchSize is more than the allowed value in Athena. Please reduce the fetchSize and try #again. Refer to the Athena documentation for valid fetchSize values.
# Use the dbSendQuery function, reduce the fetch size, and run again
r <- dbSendQuery(con, " SELECT timesignature FROM sampledb.billboard")
dftimesignature<- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
table(dftimesignature)
## dftimesignature
## 0 1 3 4 5 7
## 10 143 503 6787 112 19
nrow(dftimesignature)
## [1] 7574
From the results, observe that 6787 songs have a timesignature of 4.
Next, determine the song with the highest tempo.
dbGetQuery(con, " SELECT songtitle,artistname,tempo FROM sampledb.billboard WHERE tempo = (SELECT max(tempo) FROM sampledb.billboard) ")
## songtitle artistname tempo
## 1 Wanna Be Startin' Somethin' Michael Jackson 244.307
Create the training dataset
Your model needs to be trained such that it can learn and make accurate predictions. Split the data into training and test datasets, and create the training dataset first. This dataset contains all observations from the year 2009 and earlier. You may face the same JDBC connection issue pointed out earlier, so this query uses a fetch size.
#BillboardTrain <- dbGetQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
#Running the preceding query results in the following error:-
#Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", : Unable to retrieve #JDBC result set for SELECT * FROM sampledb.billboard WHERE year <= 2009 (Internal error)
#Follow the same approach as before to address this issue.
r <- dbSendQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
BillboardTrain <- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
BillboardTrain[1:2,c(1:3,6:10)]
## year songtitle artistname timesignature
## 1 2009 The Awkward Goodbye Athlete 3
## 2 2009 Rubik's Cube Athlete 3
## timesignature_confidence loudness tempo tempo_confidence
## 1 0.732 -6.320 89.614 0.652
## 2 0.906 -9.541 117.742 0.542
nrow(BillboardTrain)
## [1] 7201
Create the test dataset
BillboardTest <- dbGetQuery(con, "SELECT * FROM sampledb.billboard where year = 2010")
BillboardTest[1:2,c(1:3,11:15)]
## year songtitle artistname key
## 1 2010 This Is the House That Doubt Built A Day to Remember 11
## 2 2010 Sticks & Bricks A Day to Remember 10
## key_confidence energy pitch timbre_0_min
## 1 0.453 0.9666556 0.024 0.002
## 2 0.469 0.9847095 0.025 0.000
nrow(BillboardTest)
## [1] 373
Convert the training and test datasets into H2O dataframes
You need to designate the independent and dependent variables prior to applying your modeling algorithms. Because you’re trying to predict the ‘top10’ field, this would be your dependent variable and everything else would be independent.
Create your first model using GLM. Because GLM works best with numeric data, you create your model by dropping non-numeric variables. You only use the variables in the dataset that describe the numerical attributes of the song in the logistic regression model. You won’t use these variables: “year”, “songtitle”, “artistname”, “songid”, or “artistid”.
Create Model 1 with the training dataset, using GLM as the modeling algorithm and H2O’s built-in h2o.glm function.
modelh1 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
##
|
| | 0%
|
|===== | 8%
|
|=================================================================| 100%
Measure the performance of Model 1, using H2O’s built-in performance function.
h2o.performance(model=modelh1,newdata=test.h2o)
## H2OBinomialMetrics: glm
##
## MSE: 0.09924684
## RMSE: 0.3150347
## LogLoss: 0.3220267
## Mean Per-Class Error: 0.2380168
## AUC: 0.8431394
## Gini: 0.6862787
## R^2: 0.254663
## Null Deviance: 326.0801
## Residual Deviance: 240.2319
## AIC: 308.2319
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 255 59 0.187898 =59/314
## 1 17 42 0.288136 =17/59
## Totals 272 101 0.203753 =76/373
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.192772 0.525000 100
## 2 max f2 0.124912 0.650510 155
## 3 max f0point5 0.416258 0.612903 23
## 4 max accuracy 0.416258 0.879357 23
## 5 max precision 0.813396 1.000000 0
## 6 max recall 0.037579 1.000000 282
## 7 max specificity 0.813396 1.000000 0
## 8 max absolute_mcc 0.416258 0.455251 23
## 9 max min_per_class_accuracy 0.161402 0.738854 125
## 10 max mean_per_class_accuracy 0.124912 0.765006 155
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `
h2o.auc(h2o.performance(modelh1,test.h2o))
## [1] 0.8431394
The AUC metric provides insight into how well the classifier is able to separate the two classes. In this case, the value of 0.8431394 indicates that the classification is good. (A value of 0.5 indicates a worthless test, while a value of 1.0 indicates a perfect test.)
Next, inspect the coefficients of the variables in the dataset.
Typically, songs with heavier instrumentation tend to be louder (have higher values in the variable “loudness”) and more energetic (have higher values in the variable “energy”). This knowledge is helpful for interpreting the modeling results.
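One way to pull the coefficients is through H2O’s coefficient accessor for GLM models (a sketch; you can also read them from modelh1@model$coefficients_table):
#view the GLM coefficient estimates for Model 1
h2o.coef(modelh1)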
You can make the following observations from the results:
The coefficient estimates for the confidence values associated with the time signature, key, and tempo variables are positive. This suggests that higher confidence leads to a higher predicted probability of a Top 10 hit.
The coefficient estimate for loudness is positive, meaning that mainstream listeners prefer louder songs with heavier instrumentation.
The coefficient estimate for energy is negative, meaning that mainstream listeners prefer songs that are less energetic, which are those songs with light instrumentation.
These coefficients lead to contradictory conclusions for Model 1. This could be due to multicollinearity issues. Inspect the correlation between the variables “loudness” and “energy” in the training set.
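A quick way to check this is with base R on the BillboardTrain data frame created earlier:
#correlation between loudness and energy in the training set
cor(BillboardTrain$loudness, BillboardTrain$energy)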
This number indicates that these two variables are highly correlated, and Model 1 does indeed suffer from multicollinearity. Typically, a value between 0.5 and 1.0 or between -1.0 and -0.5 indicates strong correlation, and a value between -0.1 and 0.1 indicates weak correlation. To avoid this correlation issue, omit one of these two variables and re-create the models.
You build two variations of the original model:
Model 2, in which you keep “energy” and omit “loudness”
Model 3, in which you keep “loudness” and omit “energy”
You compare these two models and choose the model with a better fit for this use case.
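A sketch of how these two variations can be built, reusing the h2o.glm call from Model 1 and dropping one predictor at a time from the x.indep list defined earlier (the object names modelh2 and modelh3 are illustrative):
#Model 2: keep energy, omit loudness
modelh2 <- h2o.glm(y = y.dep, x = setdiff(x.indep, "loudness"),
                   training_frame = train.h2o, family = "binomial")
#Model 3: keep loudness, omit energy
modelh3 <- h2o.glm(y = y.dep, x = setdiff(x.indep, "energy"),
                   training_frame = train.h2o, family = "binomial")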
Inspecting the coefficient of the variable energy, Model 2 suggests that songs with high energy levels tend to be more popular. This is in line with expectations.
However, when the variables are ordered by significance, as H2O does in its output, energy turns out not to be significant in this model.
You can conclude that Model 2 is not ideal for this use case, as energy is not significant.
From Model 3’s confusion matrix, the model correctly predicts that 33 songs will be Top 10 hits (true positives). However, it has 26 false positives (songs that the model predicted would be Top 10 hits, but that did not make the Top 10).
Loudness has a positive coefficient estimate, meaning that this model predicts that songs with heavier instrumentation tend to be more popular. This is the same conclusion from Model 2.
Loudness is significant in this model.
Overall, Model 3 predicts a higher number of top 10 hits with an accuracy rate that is acceptable. To choose the best fit for production runs, record labels should consider the following factors:
Desired model accuracy at a given threshold
Number of correct predictions for top10 hits
Tolerable number of false positives or false negatives
Next, make predictions using Model 3 on the test dataset.
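A sketch of the prediction step, assuming modelh3 from the sketch above holds Model 3:
#score the test set with Model 3
predict.h2o <- h2o.predict(modelh3, test.h2o)
#probabilities and predicted classes for the first few observations
head(predict.h2o)
#count how many songs the model predicts as Top 10 hits
table(as.vector(predict.h2o$predict))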
The first set of output results specifies the probabilities associated with each predicted observation. For example, observation 1 is 96.54739% likely to not be a Top 10 hit, and 3.4526052% likely to be a Top 10 hit (predict=1 indicates a Top 10 hit and predict=0 indicates not a Top 10 hit). The second set of results lists the actual predictions made. From the third set of results, this model predicts that 61 songs will be Top 10 hits.
Compute the baseline accuracy, by assuming that the baseline predicts the most frequent outcome, which is that most songs are not Top 10 hits.
table(BillboardTest$top10)
##
## 0 1
## 314 59
Now observe that the baseline model would get 314 observations correct, and 59 wrong, for an accuracy of 314/(314+59) = 0.8418231.
It seems that Model 3, with an accuracy of 0.8552, provides you with a small improvement over the baseline model. But is this model useful for record labels?
View the two models from an investment perspective:
A production company is interested in investing in songs that are more likely to make it to the Top 10. The company’s objective is to minimize the risk of financial losses attributed to investing in songs that end up unpopular.
How many songs does Model 3 correctly predict as a Top 10 hit in 2010? Looking at the confusion matrix, you see that it predicts 33 Top 10 hits correctly at an optimal threshold, which is more than half of the 59 songs that actually made the Top 10 in 2010.
It will be more useful to the record label if you can provide the production company with a list of songs that are highly likely to end up in the Top 10.
The baseline model is not useful, as it simply does not label any song as a hit.
Considering the three models built so far, you can conclude that Model 3 proves to be the best investment choice for the record label.
GBM model
H2O provides you with the ability to explore other learning models, such as GBM and deep learning. Explore building a model using the GBM technique, using the built-in h2o.gbm function.
Before you do this, you need to convert the target variable to a factor for multinomial classification techniques.
train.h2o$top10=as.factor(train.h2o$top10)
gbm.modelh <- h2o.gbm(y=y.dep, x=x.indep, training_frame = train.h2o, ntrees = 500, max_depth = 4, learn_rate = 0.01, seed = 1122,distribution="multinomial")
##
|
| | 0%
|
|=== | 5%
|
|===== | 7%
|
|====== | 9%
|
|======= | 10%
|
|====================== | 33%
|
|===================================== | 56%
|
|==================================================== | 79%
|
|================================================================ | 98%
|
|=================================================================| 100%
perf.gbmh<-h2o.performance(gbm.modelh,test.h2o)
perf.gbmh
## H2OBinomialMetrics: gbm
##
## MSE: 0.09860778
## RMSE: 0.3140188
## LogLoss: 0.3206876
## Mean Per-Class Error: 0.2120263
## AUC: 0.8630573
## Gini: 0.7261146
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 266 48 0.152866 =48/314
## 1 16 43 0.271186 =16/59
## Totals 282 91 0.171582 =64/373
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.189757 0.573333 90
## 2 max f2 0.130895 0.693717 145
## 3 max f0point5 0.327346 0.598802 26
## 4 max accuracy 0.442757 0.876676 14
## 5 max precision 0.802184 1.000000 0
## 6 max recall 0.049990 1.000000 284
## 7 max specificity 0.802184 1.000000 0
## 8 max absolute_mcc 0.169135 0.496486 104
## 9 max min_per_class_accuracy 0.169135 0.796610 104
## 10 max mean_per_class_accuracy 0.169135 0.805948 104
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `
h2o.sensitivity(perf.gbmh,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501205344484314. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.1355932
h2o.auc(perf.gbmh)
## [1] 0.8630573
This model correctly predicts 43 top 10 hits, which is 10 more than the number predicted by Model 3. Moreover, the AUC metric is higher than the one obtained from Model 3.
As seen above, H2O’s API provides the ability to obtain key statistical measures required to analyze the models easily, using several built-in functions. The record label can experiment with different parameters to arrive at the model that predicts the maximum number of Top 10 hits at the desired level of accuracy and threshold.
H2O also allows you to experiment with deep learning models. Deep learning models have the ability to learn features implicitly, but can be more expensive computationally.
Now, create a deep learning model with the h2o.deeplearning function, using the same training and test datasets created before. The time taken to run this model depends on the type of EC2 instance chosen for this purpose. For models that require more computation, consider using accelerated computing instances such as the P2 instance type.
system.time(
dlearning.modelh <- h2o.deeplearning(y = y.dep,
x = x.indep,
training_frame = train.h2o,
epoch = 250,
hidden = c(250,250),
activation = "Rectifier",
seed = 1122,
distribution="multinomial"
)
)
##
|
| | 0%
|
|=== | 4%
|
|===== | 8%
|
|======== | 12%
|
|========== | 16%
|
|============= | 20%
|
|================ | 24%
|
|================== | 28%
|
|===================== | 32%
|
|======================= | 36%
|
|========================== | 40%
|
|============================= | 44%
|
|=============================== | 48%
|
|================================== | 52%
|
|==================================== | 56%
|
|======================================= | 60%
|
|========================================== | 64%
|
|============================================ | 68%
|
|=============================================== | 72%
|
|================================================= | 76%
|
|==================================================== | 80%
|
|======================================================= | 84%
|
|========================================================= | 88%
|
|============================================================ | 92%
|
|============================================================== | 96%
|
|=================================================================| 100%
## user system elapsed
## 1.216 0.020 166.508
perf.dl<-h2o.performance(model=dlearning.modelh,newdata=test.h2o)
perf.dl
## H2OBinomialMetrics: deeplearning
##
## MSE: 0.1678359
## RMSE: 0.4096778
## LogLoss: 1.86509
## Mean Per-Class Error: 0.3433013
## AUC: 0.7568822
## Gini: 0.5137644
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 290 24 0.076433 =24/314
## 1 36 23 0.610169 =36/59
## Totals 326 47 0.160858 =60/373
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.826267 0.433962 46
## 2 max f2 0.000000 0.588235 239
## 3 max f0point5 0.999929 0.511811 16
## 4 max accuracy 0.999999 0.865952 10
## 5 max precision 1.000000 1.000000 0
## 6 max recall 0.000000 1.000000 326
## 7 max specificity 1.000000 1.000000 0
## 8 max absolute_mcc 0.999929 0.363219 16
## 9 max min_per_class_accuracy 0.000004 0.662420 145
## 10 max mean_per_class_accuracy 0.000000 0.685334 224
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.sensitivity(perf.dl,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.496293348880151. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.3898305
h2o.auc(perf.dl)
## [1] 0.7568822
The AUC metric for this model is 0.7568822, which is lower than what you got from the earlier models. I recommend further experimentation using different hyperparameters, such as the learning rate, the number of epochs, or the number of hidden layers.
H2O’s built-in functions provide many key statistical measures that can help measure model performance. Here are some of these key terms.
Sensitivity: Measures the proportion of positives that have been correctly identified. It is also called the true positive rate, or recall.
Specificity: Measures the proportion of negatives that have been correctly identified. It is also called the true negative rate.
Threshold: Cutoff point that maximizes specificity and sensitivity. While the model may not provide the highest prediction at this point, it would not be biased towards positives or negatives.
Precision: The fraction of retrieved instances that are relevant; in other words, how many of the items classified as positive actually are positive.
AUC: Provides insight into how well the classifier is able to separate the two classes. The implicit goal is to deal with situations where the sample distribution is highly skewed, with a tendency to overfit to a single class. A common grading scale is: 0.9 – 1.0 = excellent (A); 0.8 – 0.9 = good (B); 0.7 – 0.8 = fair (C); 0.6 – 0.7 = poor (D); 0.5 – 0.6 = fail (F).
Here’s a summary of the metrics generated from H2O’s built-in functions for the three models that produced useful results.
Accuracy (max): Model 3 = 0.882038 (t=0.435479); GBM model = 0.876676 (t=0.442757); deep learning model = 0.865952 (t=0.999999)
Precision (max): Model 3 = 1.0 (t=0.821606); GBM model = 1.0 (t=0.802184); deep learning model = 1.0 (t=1.0)
Recall (max): Model 3 = 1.0; GBM model = 1.0; deep learning model = 1.0 (t=0)
Specificity (max): Model 3 = 1.0; GBM model = 1.0; deep learning model = 1.0 (t=1)
Sensitivity: Model 3 = 0.2033898; GBM model = 0.1355932; deep learning model = 0.3898305 (t=0.5)
AUC: Model 3 = 0.8492389; GBM model = 0.8630573; deep learning model = 0.756882
Note: ‘t’ denotes the threshold at which the metric was measured.
Your options at this point could be narrowed down to Model 3 and the GBM model, based on the AUC and accuracy metrics observed earlier. If the slightly lower accuracy of the GBM model is deemed acceptable, the record label can choose to go to production with the GBM model, as it can predict a higher number of Top 10 hits. The AUC metric for the GBM model is also higher than that of Model 3.
Record labels can experiment with different learning techniques and parameters before arriving at a model that proves to be the best fit for their business. Because deep learning models can be computationally expensive, record labels can choose more powerful EC2 instances on AWS to run their experiments faster.
Conclusion
In this post, I showed how the popular music industry can use analytics to predict the type of songs that make the Top 10 Billboard charts. By running H2O’s scalable machine learning platform on AWS, data scientists can easily experiment with multiple modeling techniques and interactively query the data using Amazon Athena, without having to manage the underlying infrastructure. This helps record labels make critical decisions on the type of artists and songs to promote in a timely fashion, thereby increasing sales and revenue.
If you have questions or suggestions, please comment below.
Gopal Wunnava is a Partner Solution Architect with the AWS GSI Team. He works with partners and customers on big data engagements, and is passionate about building analytical solutions that drive business capabilities and decision making. In his spare time, he loves all things sports and movies related and is fond of old classics like Asterix, Obelix comics and Hitchcock movies.
Bob Strahan, a Senior Consultant with AWS Professional Services, contributed to this post.
As part of the AWS Shared Responsibility Model, you are responsible for monitoring and managing your resources at the operating system and application level. When you monitor your application servers, for example, you can measure, visualize, react to, and improve the security of those servers. You probably already do this on premises or in other environments, and you can adapt your existing processes, tools, and methodologies for use in the AWS Cloud. For more details about best practices for monitoring your AWS resources, see the “Manage Security Monitoring, Alerting, Audit Trail, and Incident Response” section in the AWS Security Best Practices whitepaper.
This blog post focuses on how to log and create alarms on invalid Secure Shell (SSH) access attempts. Implementing live monitoring and session recording facilitates the identification of unauthorized activity and can help confirm that remote users access only those systems they are authorized to use. With SSH log information in hand (such as invalid access type, bad private keys, and remote IP addresses), you can take proactive actions to protect your servers. For example, you can use an AWS Lambda function to adjust your server’s security rules when an alarm is triggered that indicates an invalid SSH access attempt.
In this post, I demonstrate how to use Amazon CloudWatch Logs to monitor SSH access to your application servers (Amazon EC2 Linux instances) so that you can monitor rejected SSH connection requests and take action. I also show how to configure CloudWatch Logs to send SSH access logs from application servers that reside in a public subnet. Last, I demonstrate how to visualize how many attempts are made to SSH into your application servers with bad private keys and invalid user names. Using these techniques and tools can help you improve the security of your application servers.
AWS services and terminology I use in this post
In this post, I use the following AWS services and terminology:
Amazon CloudWatch – A monitoring service for the resources and applications you run on the AWS Cloud. You can use CloudWatch to collect and track metrics, collect and monitor log files, set alarms, and automatically react to changes in your AWS resources.
CloudWatch namespaces – Containers for metrics. Metrics in different namespaces are isolated from each other so that metrics from different applications are not mistakenly aggregated into the same statistics. You also can create custom metrics for which you must specify namespaces as containers.
CloudWatch Logs – A feature of CloudWatch that allows you to monitor, store, and access your log files from EC2 instances, AWS CloudTrail, and other sources. Additionally, you can use CloudWatch Logs to monitor applications and systems by using log data and create alarms. For example, you can choose to search for a phrase in logs and then create an alarm if the phrase you are looking for is found in the log more than 5 times in the last 10 minutes. You can then take action on these alarms, if necessary.
Log stream – A log stream represents the sequence of events coming from an application instance or resource that you are monitoring. In this post, I use the EC2 instance ID as the log stream identifier so that I can easily map log entries to the instances that produced them.
Log group – In CloudWatch Logs, a group of log streams that share the same retention time, monitoring, and access control settings. Each log stream must belong to one log group.
Metric – A specific term or value that you can monitor and extract from log events.
Metric filter – A metric filter describes how CloudWatch Logs extracts information from logs and transforms it into CloudWatch metrics. It defines the terms and patterns to look for in log data as the data is sent to CloudWatch Logs. Metric filters are assigned to log groups, and all metric filters assigned to a given log group are applied to its log streams (see the following diagram for more details).
SSH logs – Reside on EC2 instances and capture all SSH activities. The logs include successful attempts as well as unsuccessful attempts. Debian Linux SSH logs reside in /var/log/auth.log, and stock CentOS SSH logs are written to /var/log/secure. This blog post uses an Amazon Linux AMI, which also logs SSH sessions to /var/log/secure.
AWS Identity and Access Management (IAM) – IAM enables you to securely control access to AWS services and resources for your users. In the solution in this post, you create an IAM policy and configure an EC2 instance that assumes a role. The IAM policy allows the EC2 instance to create log events and save them in an Amazon S3 bucket (in other words, CloudWatch Logs log files are saved in the S3 bucket).
CloudWatch dashboards – Amazon CloudWatch dashboards are customizable home pages in the CloudWatch console that you can use to monitor your resources in a single view, even those resources that are spread across different regions. You can use CloudWatch dashboards to create customized views of the metrics and alarms for your AWS resources.
Architectural overview
The following diagram depicts the services and flow of information between the different AWS services used in this post’s solution.
Here is how the process works, as illustrated and numbered in the preceding diagram:
A CloudWatch Logs agent runs on each EC2 instance. The agents are configured to send SSH logs from the EC2 instance to a log stream identified by an instance ID.
Log streams are aggregated into a log group. As a result, one log group contains all the logs you want to analyze from one or more instances.
You apply metric filters to a log group in order to search for specific keywords. When the metric filter finds specific keywords, the filter counts the occurrences of the keywords in a time-based sliding window. If the occurrence of a keyword exceeds the CloudWatch alarm threshold, an alarm is triggered.
An IAM policy defines a role that gives the EC2 servers permission to create logs in a log group and send log events (new log entries) from EC2 to log groups. This role is then assumed by the application servers.
CloudWatch alarms notify users when a specified threshold has been crossed. For example, you can set an alarm to trigger when more than 2 failed SSH connections happen in a 5-minute period.
The CloudWatch dashboard is used to visualize data and alarms from the monitoring process.
Deploy and test the solution
1. Deploy the solution by using CloudFormation
Now that I have explained how the solution works, I will show how to use AWS CloudFormation to create a stack with the desired solution configuration. CloudFormation allows you to create a stack of resources in your AWS account.
Sign in to the AWS Management Console, choose CloudFormation, choose Create Stack, choose Specify an Amazon S3 template URL and paste the following link in the box: https://s3.amazonaws.com/awsiammedia/public/sample/MonitorSSHActivities/CloudWatchLogs_ssh.yaml
Choose Launch to deploy the stack.
On the Specify Details page, enter the Stack name. Then enter the KeyName, which is the SSH key pair for the region you use. I use this key pair later in this post; if you don’t have a key pair for your current region, follow these instructions to create one. The OperatorEmail is the CloudWatch alarm recipient email address (this field is mandatory to launch the stack), which is the email address to which SSH activity alarms will be sent. You can use the SSHLocation box to limit the IP address range that can access your instances; the default is 0.0.0.0/0, which means that any IP address can access the instance. After specifying these variables, click Next.
On the Options page, tag your instance, and click Next. Tags allow you to assign metadata to AWS resources. For example, you can tag a project’s resources and then use the tag to manage, search for, and filter resources. For more information about tagging, see Tagging Your Amazon EC2 Resources.
Wait until the CloudFormation template shows CREATE_COMPLETE, as shown in the following screenshot. This means your stack was created successfully.
After the stack is created successfully, you have two distinct application servers running, each with a CloudWatch agent. These servers represent a fleet of servers in your infrastructure. Choose the Outputs tab to see more details about the resources, such as the public IP addresses of the servers. You will need to use these IP addresses later in this post in order to trigger log events and alarms.
The CloudWatch log agent on each server is installed at startup and configured to stream SSH log entries from /var/log/secure to CloudWatch via a log stream. CloudWatch aggregates the log streams (ssh.log) from the application servers and saves them in a CloudWatch Logs log group. Each log stream is identified by an instance-ID, as shown in the following screenshot.
The application servers assume a role that gives them permissions to create CloudWatch Logs log files and events. CloudFormation also configures two metrics: ssh/InvalidUser and ssh/Disconnect. The ssh/InvalidUser metric sends an alarm when there are more than 2 SSH attempts into any server that include an invalid user name. Similarly, the ssh/Disconnect metric creates an alarm when more than 10 SSH disconnect requests come from users within 5 minutes.
To review the metrics created by CloudFormation, choose Metrics in the CloudWatch console. A new SSH custom namespace has been created, which contains the two metrics described in the previous paragraph.
You should now have two application servers running and two custom CloudWatch metrics and alarms configured. Now, it’s time to generate log events, trigger alarms, and test the configurations.
2. Test SSH metrics and alarms
Now, let’s try to trigger an alarm by trying to SSH with an invalid user name into one of the servers. Use the key pair you specified when launching the stack and connect to one of the Linux instances from a terminal window (replace the placeholder values in the following command).
ssh -i <the-key-pair-you-specified-in-the-CloudFormation-template> ec2-user@<ec2 DNS or IP address>
Now, exit the session and try to sign in as bad-user, as shown in the following command.
ssh -i <the-key-pair-you-specified-in-the-CloudFormation-template> bad-user@<ec2 DNS or IP address>
Because the alarm triggers after two or more unsuccessful SSH login attempts with an invalid user name in 1 minute, run the preceding command a few times. The server’s log captures the bad SSH login attempts, and after a minute, you should see InvalidUserAlarm in the CloudWatch console, as shown in the following screenshot. Choose Alarms to see more details. The alarm should disappear after another minute if there are no more SSH login attempts.
You can also view the history of your alarms by choosing the History tab. CloudWatch metrics are saved for 15 months.
When the CloudFormation stack launches, a topic-registration email is sent to the email address you specified in the template. After you accept the topic registration, you will receive an alarm email with details about the alarm. The email looks like what is shown in the following screenshot.
3. Understanding CloudWatch metric filters and their transformation
The CloudFormation template includes two alarms, InvalidUserAlarm and SSHReceiveddisconnectAlarm, and two metric filters. As I mentioned previously, the metric filters define the pattern you want to match in a CloudWatch Logs log group. When a pattern is found, it is transformed into a CloudWatch metric as defined in the MetricTransformations section of the metric filter.
The InvalidUser metric filter is outlined below. Each pattern match, denoted by FilterPattern, is counted as one metric value as specified in the MetricValue parameter in the MetricTransformations section. The CloudWatch alarm associated with this metric filter is triggered when the metric value crosses a specified threshold.
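A simplified, hypothetical sketch of such a resource follows; the logical names, the exact filter pattern, and the metric name in the actual CloudWatchLogs_ssh.yaml template may differ.
InvalidUserMetricFilter:
  Type: AWS::Logs::MetricFilter
  Properties:
    LogGroupName: !Ref SSHLogGroup        # log group receiving /var/log/secure (hypothetical name)
    FilterPattern: '"Invalid user"'       # count log lines that report an invalid SSH user name
    MetricTransformations:
      - MetricNamespace: SSH
        MetricName: InvalidUser
        MetricValue: '1'                  # each matching log line adds 1 to the metric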
When a CloudWatch alarm is triggered, the service sends an email to an Amazon SNS topic with more details about the alarm type, trigger parameters, and status.
4. Create a CloudWatch metric filter to identify attempts to SSH into your servers with bad private keys
You can create additional metric filters in CloudWatch Logs to provide better visibility into the SSH activity on your servers. Let’s assume you want to know if there are too many attempts to SSH into your servers with bad private keys. If an attempt is made with a bad private key, a line like the following is logged in the SSH log file.
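On an Amazon Linux instance, the entry looks similar to the following illustrative example (the timestamp, host name, process ID, and source IP address will differ in your log):
Apr 10 06:50:49 ip-172-31-10-11 sshd[2716]: Connection closed by 203.0.113.12 [preauth]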
You can produce this log line by modifying the pem file you are using (a pem file holds your private key). In a terminal window, create a corrupted copy of your private key by running the following command in the same directory where your key resides.
$ cat <valid-pem-file> | sed 's/./A/25; s/./A/26' > bad-keys.pem
This command changes the characters at positions 25 and 26 on each line to the character A and writes the result to bad-keys.pem, keeping the original pem file intact. Alternatively, you can open <valid-keys>.pem with nano or any other editor, change a character, save the file as bad-keys.pem, and exit the file.
Now, try to use bad-keys.pem to access one of the application servers.
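For example (substituting your instance address):
$ ssh -i bad-keys.pem ec2-user@<ec2 DNS or IP address>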
The SSH attempt should fail because you are using a bad private key.
Permission denied (publickey).
Now, let’s look at the server’s ssh.log file from the CloudWatch Logs console and analyze the error log messages. I need to understand the log format in order to configure a new filter. To review the logs, choose Logs in the navigation pane, and select the log group that was created by CloudFormation (it starts with the name you specified when launching the CloudFormation template).
In particular, notice the Connection closed line, shown earlier, that is logged when you try to SSH with a bad private key.
Let’s add a metric filter to capture this line so that we can use this metric later when we build an SSH Dashboard. Copy the following line to the Filter events search box at the top of the console screen and press Enter.
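For example, a space-delimited pattern along these lines matches the entry (a sketch; the field names are arbitrary labels, and you may need to adjust the fields to match your exact log format):
[month, day, time, host, process, msg1 = Connection, msg2 = closed, ...]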
You can now see only the lines that match the pattern you specified. These are the lines you want to count and transform into metrics. Each string in the message is represented by a word in the filter. In our example, we are looking for a pattern where the sixth word is Connection and the seventh word is closed. Other words in the log line are not important in this context. The following image depicts the mapping between a string in a log file and a metric filter.
To create the metric filter, choose Logs in the navigation pane of the CloudWatch console. Choose the log groups to which you want to apply the new metric filter and then choose Create Metric Filter. Choose Next.
Paste the filter pattern we used previously (the sixth word equals Connection and the seventh word equals closed) in the Filter Pattern box. For Select Log Data to Test, select the log stream of the server you tried to sign in to with the bad private key, and then click Test Pattern. You should see the results that are shown in the following screenshot. When completed, click Assign Metric.
Type SSH for the Metric Namespace and sshClosedConnection-InvalidKeysFilter for Filter Name. Choose Create Filter to see your new metric filter listed. You can use the newly created metric filter to graph the metrics and set alarms. The alarms can be used to inform your administrator via email of any event you specify. In addition, metrics can be used to generate SNS notification to trigger an AWS Lambda function in order to take proactive actions, such as blocking suspicious IP addresses in a security group.
Choose Create Alarm next to Filter Name and follow the instructions to create a CloudWatch alarm.
Back at the Metrics view, you should now have three SSH metric filters under Custom Namespaces. Note that it can take a few minutes for the number of SSH metrics to update from two to three.
5. Create a graph by using a CloudWatch dashboard
After you have configured the metrics, you can display SSH metrics in a graph. CloudWatch dashboards allow you to create reusable graphs of AWS resources and custom metrics so that you can quickly monitor your operational status and identify issues at a glance. Metrics data is kept for a period of two weeks.
In the CloudWatch console, choose Dashboards in the navigation pane, and then choose Create dashboard to create a new graph in a dashboard. Name your dashboard SSH-Dashboard and choose Create dashboard. Choose Line Graph from the graph options and choose Configure.
In the Add metric graph window under Custom Namespace, choose SSH > Metrics with no dimensions. Select all three metrics you have configured (the CloudFormation template configured two metrics and you manually added one more metric).
By default, the metrics are displayed on the graph as an average. However, you configured metrics that are based on summary metrics (for example, the total number of alarms in two minutes). To change the default, choose the Graphed metrics tab, and change the statistic from Average to Sum, as shown in the following screenshot. Also, change the time period from 5 minutes to 1 minute.
Your graphed metrics should look like the following screenshot. When you have provided all the necessary information, choose Create Widget.
You can rename the graph and add static text to give the console more context. To add a text widget, choose Widget and select text. Then edit the widget with markdown language. Your dashboard may then look like the following screenshot.
The consolidated metrics graph displays the number of SSH attempts with bad private keys, invalid user names, and too many disconnects.
Conclusion
In this blog post, I demonstrated how to automate the deployment of the CloudWatch Logs agent, create filters and alarms, and write, test, and apply metrics on the fly from the AWS Management Console. You can then visualize the metrics with the AWS Management Console. The solution described in this post gives you monitoring and alarming capabilities that can help you understand the status of and potential risks to your instances and applications. You can easily aggregate logs from many servers and different applications, create alarms, and display logs’ metrics on a graph.
If you have comments about this post, submit them in the “Comments” section below. If you have questions about the solution in this post, start a new thread on the CloudWatch forum.
Twelve more AWS services have obtained Payment Card Industry Data Security Standard (PCI DSS) compliance, giving you more options, flexibility, and functionality to process and store sensitive payment card data in the AWS Cloud. The services were audited by Coalfire to ensure that they meet strict PCI DSS standards.
AWS now offers 42 services that meet PCI DSS standards, putting administrators in better control of their frameworks and making workloads more efficient and cost effective.
I am an AWS Professional Services consultant, which has me working directly with AWS customers on a daily basis. One of my customers recently asked me to provide a solution to help them fulfill their security requirements by having the flow log data from VPC Flow Logs sent to a central AWS account. This is a common requirement in some companies so that logs can be available for in-depth analysis. In addition, my customers regularly request a simple, scalable, and serverless solution that doesn’t require them to create and maintain custom code.
In this blog post, I demonstrate how to configure your AWS accounts to send flow log data from VPC Flow Logs to an Amazon S3 bucket located in a central AWS account by using only fully managed AWS services. The benefit of using fully managed services is that you can lower or even completely eliminate operational costs because AWS manages the resources and scales the resources automatically.
Solution overview
The solution in this post uses VPC Flow Logs, which is configured in a source account to send flow logs to an Amazon CloudWatch Logs log group. To receive the logs from multiple accounts, this solution uses a CloudWatch Logs destination in the central account. Finally, the solution utilizes fully managed Amazon Kinesis Firehose, which delivers streaming data to scalable and durable S3 object storage automatically without the need to write custom applications or manage resources. When the logs are processed and stored in an S3 bucket, these can be tiered into a lower cost, long-term storage solution (such as Amazon Glacier) automatically to help meet any company-specific or industry-specific requirements for data retention.
The following diagram illustrates the process and components of the solution described in this post.
As numbered in the preceding diagram, these are the high-level steps for implementing this solution:
Enable VPC Flow Logs to send data to the CloudWatch Logs log group in the source accounts.
In the central account, create the S3 bucket, the Kinesis Firehose delivery stream that writes to it, and the CloudWatch Logs destination that points to the delivery stream (in this post, a CloudFormation template creates these resources).
Finally, in the source accounts, set a subscription filter on the CloudWatch Logs log group to send data to the CloudWatch Logs destination.
Configure the solution by using AWS CloudFormation and the AWS CLI
Now that I have explained the solution, its benefits, and the components involved, I will show how to configure a source account by using the AWS CLI and the central account using a CloudFormation template. To implement this, you need two separate AWS accounts. If you need to set up a new account, navigate to the AWS home page, choose Create an AWS Account, and follow the instructions. Alternatively, you can use AWS Organizations to create your accounts. See AWS Organizations – Policy-Based Management for Multiple AWS Accounts for more details and some step-by-step instructions about how to use this service.
Note your source and target account IDs, source account VPC ID number, and target account region where you create your resources with the CloudFormation template. You will need these values as part of implementing this solution.
Gather the required configuration information
To find your AWS account ID number in the AWS Management Console, choose Support in the navigation bar, and then choose Support Center. Your currently signed-in, 12-digit account ID number appears in the upper-right corner under the Support menu. Sign in to both the source and target accounts and take note of their respective account numbers.
To find your VPC ID number in the AWS Management Console, sign in to your source account and choose Services. In the search box, type VPC and then choose VPC Isolated Cloud Resources.
In the VPC console, click Your VPCs, as shown in the following screenshot.
Note your VPC ID.
Finally, take a note of the region in which you are creating your resources in the target account. The region name is shown in the region selector in the navigation bar of the AWS Management Console (see the following screenshot), and the region is shown in your browser’s address bar. Sign in to your target account, choose Services, and type CloudFormation. In the following screenshot, the region name is Ireland and the region is eu-west-1.
Create the resources in the central account
In this example, I use a CloudFormation template to create all related resources in the central account. You can use CloudFormation to create and manage a collection of AWS resources called a stack. CloudFormation also takes care of the resource provisioning for you.
The provided CloudFormation template creates an S3 bucket that will store all the VPC Flow Logs, IAM roles with associated IAM policies used by Kinesis Firehose and the CloudWatch Logs destination, a Kinesis Firehose delivery stream, and a CloudWatch Logs destination.
To configure the resources in the central account:
Sign in to the AWS Management Console, navigate to CloudFormation, and then choose Create Stack. On the Select Template page, type the URL of the CloudFormation template (https://s3.amazonaws.com/awsiammedia/public/sample/VPCFlowLogsCentralAccount/targetaccount.template) for the target account and choose Next.
The CloudFormation template defines a parameter, paramDestinationPolicy, which sets the IAM policy on the CloudWatch Logs destination. This policy governs which AWS accounts can create subscription filters against this destination.
Change the Principal to your SourceAccountID, and in the Resource section, change TargetAccountRegion and TargetAccountID to the values you noted in the previous section.
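After those substitutions, the policy value has roughly the following shape, with placeholders shown in angle brackets (a sketch; the destination name must match the destination that the template creates):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "<SourceAccountID>" },
      "Action": "logs:PutSubscriptionFilter",
      "Resource": "arn:aws:logs:<TargetAccountRegion>:<TargetAccountID>:destination:<DestinationName>"
    }
  ]
}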
After you have updated the policy, choose Next. On the Options page, choose Next.
On the Review page, scroll down and choose the I acknowledge that AWS CloudFormation might create IAM resources with custom names check box. Finally, choose Create to create the stack.
After you are done creating the stack, verify the status of the resources. Choose the check box next to the Stack Name and choose the Resources tab, where you can see a list of all the resources created with this template and their status.
This completes the configuration of the central account.
Configure a source account
Now that you have created all the necessary resources in the central account, I will show you how to configure a source account by using the AWS CLI, a tool for managing your AWS resources. For more information about installing and configuring the AWS CLI, see Configuring the AWS CLI.
To configure a source account:
Create the IAM role that will grant VPC Flow Logs the permissions to send data to the CloudWatch Logs log group. To start, create a trust policy in a file named TrustPolicyForCWL.json by using the following policy document.
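The trust policy simply allows the VPC Flow Logs service (vpc-flow-logs.amazonaws.com) to assume the role:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "vpc-flow-logs.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}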
Use the following command to create the IAM role, specifying the trust policy file you just created. Note the returned Arn value because that will also be passed to VPC Flow Logs later.
aws iam create-role \
--role-name PublishFlowLogs \
--assume-role-policy-document file://~/TrustPolicyForCWL.json
Now, create a permissions policy to define which actions VPC Flow Logs can perform on the source account. Start by creating the permissions policy in a file named PermissionsForVPCFlowLogs.json. The following set of permissions (in the Action element) is the minimum requirement for VPC Flow Logs to be able to send data to the CloudWatch Logs log group.
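The policy document looks like the following. After saving it, attach it to the role as an inline policy; the policy name in the put-role-policy command below is illustrative.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "logs:DescribeLogGroups",
        "logs:DescribeLogStreams"
      ],
      "Resource": "*"
    }
  ]
}
aws iam put-role-policy \
--role-name PublishFlowLogs \
--policy-name PermissionsForVPCFlowLogs \
--policy-document file://~/PermissionsForVPCFlowLogs.json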
Now that you have created an IAM role with required permissions and a CloudWatch Logs log group, change the placeholder values of the role in the following command and run it to enable VPC Flow Logs.
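A sketch of the command, with placeholders in angle brackets (the log group name vpc-flow-logs is illustrative; create the log group first with aws logs create-log-group if it doesn’t exist yet):
aws logs create-log-group --log-group-name vpc-flow-logs

aws ec2 create-flow-logs \
--resource-type VPC \
--resource-ids <your-vpc-id> \
--traffic-type ALL \
--log-group-name vpc-flow-logs \
--deliver-logs-permission-arn <ARN-of-the-PublishFlowLogs-role>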
Finally, change the destination arn in the following command to reflect your targetaccountregion and targetaccountID, and run the command to subscribe your CloudWatch Logs log group to CloudWatch Logs in the central account.
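A sketch of the command (the filter name is arbitrary, an empty filter pattern forwards all flow log events, and the destination name must match the destination created in the central account):
aws logs put-subscription-filter \
--log-group-name vpc-flow-logs \
--filter-name "FlowLogsToCentralAccount" \
--filter-pattern "" \
--destination-arn "arn:aws:logs:<targetaccountregion>:<targetaccountID>:destination:<DestinationName>"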
This completes the configuration of this central logging solution. Within a few minutes, you should start seeing compressed logs sent from your source account to the S3 bucket in your central account (see the following screenshot). You can process and analyze these logs with the analytics tool of your choice. In addition, you can tier the data into a lower cost, long-term storage solution such as Amazon Glacier and archive the data to meet the data retention requirements of your company.
Summary
In this post, I demonstrated how to configure AWS accounts to send VPC Flow Logs to an S3 bucket located in a central AWS account by using only fully managed AWS services. If you have comments about this post, submit them in the “Comments” section below. If you have questions about implementing this solution, start a new thread on the CloudWatch forum.
Nowadays, it’s common for a web server to be fronted by a global content delivery service, like Amazon CloudFront. This type of front end accelerates delivery of websites, APIs, media content, and other web assets to provide a better experience to users across the globe.
The insights gained by analyzing Amazon CloudFront access logs help improve website availability through bot detection and mitigation, optimizing web content based on the devices and browsers used to view your webpages, reducing perceived latency by caching popular objects closer to their viewers, and so on. This results in a significant improvement in the overall perceived experience for the user.
This blog post provides a way to build a serverless architecture to generate some of these insights. To do so, we analyze Amazon CloudFront access logs both at rest and in transit through the stream. This serverless architecture uses Amazon Athena to analyze large volumes of CloudFront access logs (on the scale of terabytes per day), and Amazon Kinesis Analytics for streaming analysis.
The analytic queries in this blog post focus on three common use cases:
Detection of common bots using the user agent string
Calculation of current bandwidth usage per Amazon CloudFront distribution per edge location
Determination of the current top 50 viewers
However, you can easily extend the architecture described here to power dashboards for monitoring and reporting, and to trigger alarms based on deeper insights gained by processing and analyzing the logs. Some examples are dashboards for cache performance, usage and viewer patterns, and so on.
The following diagram shows this architecture.
Prerequisites
Before you set up this architecture, install the AWS Command Line Interface (AWS CLI) tool on your local machine, if you don’t have it already.
Setup summary
The following steps are involved in setting up the serverless architecture on the AWS platform:
Create an Amazon S3 bucket for your Amazon CloudFront access logs to be delivered to and stored in.
Create a second Amazon S3 bucket to receive processed logs and store the partitioned data for interactive analysis.
Create an Amazon Kinesis Firehose delivery stream to batch, compress, and deliver the preprocessed logs for analysis.
Create an AWS Lambda function to preprocess the logs for analysis.
Configure Amazon S3 event notification on the CloudFront access logs bucket, which contains the raw logs, to trigger the Lambda preprocessing function.
Create an Amazon DynamoDB table to look up partition details, such as partition specification and partition location.
Create an Amazon Athena table for interactive analysis.
Create a second AWS Lambda function to add new partitions to the Athena table based on the log delivered to the processed logs bucket.
Configure Amazon S3 event notification on the processed logs bucket to trigger the Lambda partitioning function.
Configure Amazon Kinesis Analytics application for analysis of the logs directly from the stream.
ETL and preprocessing
In this section, we parse the CloudFront access logs as they are delivered, which occurs multiple times in an hour. We filter out commented records and use the user agent string to decipher the browser name, the name of the operating system, and whether the request has been made by a bot. For more details on how to decipher the preceding information based on the user agent string, see the user-agents 1.1.0 package on PyPI.
We use the Lambda preprocessing function to perform these tasks on individual rows of the access log. On successful completion, the rows are pushed to an Amazon Kinesis Firehose delivery stream to be persistently stored in an Amazon S3 bucket, the processed logs bucket.
To create a Firehose delivery stream with a new or existing S3 bucket as the destination, follow the steps described in Create a Firehose Delivery Stream to Amazon S3 in the Amazon Kinesis Firehose documentation. Keep most of the default settings, but select an AWS Identity and Access Management (IAM) role that has write access to your S3 bucket and specify GZIP compression. Name the delivery stream CloudFrontLogsToS3.
Another prerequisite for this setup is to create an IAM role that provides the necessary permissions for our AWS Lambda function to get the data from S3, process it, and deliver it to the CloudFrontLogsToS3 delivery stream.
Let’s use the AWS CLI to create the IAM role using the following three steps:
Create the IAM policy (lambda-exec-policy) for the Lambda execution role to use.
Create the Lambda execution role (lambda-cflogs-exec-role) and assign the service to use this role.
Attach the policy created in step 1 to the Lambda execution role.
To download the policy document to your local machine, type the following command.
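The download and creation commands for this role aren't reproduced in full in the text above, so the following is a minimal sketch of that sequence, covering steps 1 and 2 from the preceding list. It assumes you stage the policy files (lambda-exec-policy.json and assume-lambda-policy.json) yourself; the artifact bucket in the copy command is a placeholder.
aws s3 cp s3://<artifacts-bucket>/lambda-exec-policy.json ./lambda-exec-policy.json
aws iam create-policy --policy-name lambda-exec-policy --policy-document file://<path>/lambda-exec-policy.json
aws iam create-role --role-name lambda-cflogs-exec-role --assume-role-policy-document file://<path>/assume-lambda-policy.json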
To attach the policy (lambda-exec-policy) created to the AWS Lambda execution role (lambda-cflogs-exec-role), type the following command.
aws iam attach-role-policy --role-name lambda-cflogs-exec-role --policy-arn arn:aws:iam::<your-account-id>:policy/lambda-exec-policy
Now that we have created the CloudFrontLogsToS3 Firehose delivery stream and the lambda-cflogs-exec-role IAM role for Lambda, the next step is to create a Lambda preprocessing function.
This Lambda preprocessing function parses the CloudFront access logs delivered into the S3 bucket and performs a few transformation and mapping operations on the data. The Lambda function adds descriptive information, such as the browser and the operating system that were used to make this request based on the user agent string found in the logs. The Lambda function also adds information about the web distribution to support scenarios where CloudFront access logs are delivered to a centralized S3 bucket from multiple distributions. With the solution in this blog post, you can get insights across distributions and their edge locations.
Use the Lambda Management Console to create a new Lambda function with a Python 2.7 runtime and the s3-get-object-python blueprint. Open the console, and on the Configure triggers page, choose the name of the S3 bucket where the CloudFront access logs are delivered. Choose Put for Event type. For Prefix, type the name of the prefix, if any, for the folder where CloudFront access logs are delivered, for example cloudfront-logs/. To invoke Lambda to retrieve the logs from the S3 bucket as they are delivered, select Enable trigger.
Choose Next and provide a function name to identify this Lambda preprocessing function.
For Code entry type, choose Upload a file from Amazon S3. For S3 link URL, type the S3 URL of the deployment package, for example https://s3.amazonaws.com/<your-bucket>/preprocessing-lambda/pre-data.zip. In the Environment variables section, also create an environment variable with the key KINESIS_FIREHOSE_STREAM and, as its value, the name of the Firehose delivery stream, CloudFrontLogsToS3.
Choose lambda-cflogs-exec-role as the IAM role for the Lambda function, and type prep-data.lambda_handler for the value for Handler.
Choose Next, and then choose Create Lambda.
Table creation in Amazon Athena
In this step, we build the Athena table. Use the Athena console in the same AWS Region as the rest of this setup, and create the table using the query editor.
A popular website with millions of requests each day routed using Amazon CloudFront can generate a large volume of logs, on the order of a few terabytes a day. We strongly recommend that you partition your data to effectively restrict the amount of data scanned by each query. Partitioning significantly improves query performance and substantially reduces cost. The Lambda partitioning function adds the partition information to the Athena table for the data delivered to the preprocessed logs bucket.
Before delivering the preprocessed Amazon CloudFront logs file into the preprocessed logs bucket, Amazon Kinesis Firehose adds a UTC time prefix in the format YYYY/MM/DD/HH. This approach supports multilevel partitioning of the data by year, month, date, and hour. You can invoke the Lambda partitioning function every time a new processed Amazon CloudFront log is delivered to the preprocessed logs bucket. To do so, configure the Lambda partitioning function to be triggered by an S3 Put event.
For a website with millions of requests, a large number of preprocessed log files can be delivered multiple times in an hour, for example, as often as one each second. To avoid querying the Athena table for partition information every time a preprocessed log file is delivered, you can create an Amazon DynamoDB table for fast lookup.
Based on the year, month, date, and hour in the prefix of the delivered log, the Lambda partitioning function checks whether the partition specification exists in the Amazon DynamoDB table. If it doesn't, the partition specification is added to the table using an atomic operation, and then the Athena table is updated.
Type the following command to create the Amazon DynamoDB table.
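A command along the following lines creates the table. The table name matches the DDB_TABLE_NAME example used later in this post, and the provisioned capacity values are only starting points.
aws dynamodb create-table \
    --table-name athenapartitiondetails \
    --attribute-definitions AttributeName=PartitionSpec,AttributeType=S \
    --key-schema AttributeName=PartitionSpec,KeyType=HASH \
    --provisioned-throughput ReadCapacityUnits=100,WriteCapacityUnits=100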
PartitionSpec is the hash key and is a representation of the partition signature—for example, year=”2017”; month=”05”; day=”15”; hour=”10”.
Depending on the rate at which the processed log files are delivered to the processed log bucket, you might have to increase the ReadCapacityUnits and WriteCapacityUnits values, if these are throttled.
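For example, a command like the following raises the provisioned capacity later if you see throttling; the values shown are illustrative.
aws dynamodb update-table \
    --table-name athenapartitiondetails \
    --provisioned-throughput ReadCapacityUnits=200,WriteCapacityUnits=200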
The other attributes besides PartitionSpec are the following:
PartitionPath – The S3 path associated with the partition.
PartitionType – The type of partition used (Hour, Month, Date, Year, or ALL). In this case, ALL is used.
The next step is to create the IAM role that provides permissions for the Lambda partitioning function. The function requires permissions to do the following:
Look up and write partition information to DynamoDB.
Alter the Athena table with new partition information.
Perform Amazon CloudWatch logs operations.
Perform Amazon S3 operations.
To download the policy document to your local machine, type the following command.
Let’s use the AWS CLI to create the IAM role using the following three steps:
Create the IAM policy (lambda-partition-exec-policy) used by the Lambda execution role.
Create the Lambda execution role (lambda-partition-execution-role) and assign the service to use this role.
Attach the policy created in step 1 to the Lambda execution role.
To create the IAM policy used by the Lambda execution role, type the following command.
aws iam create-policy --policy-name lambda-partition-exec-policy --policy-document file://<path>/lambda-partition-function-execution-policy.json
To create the Lambda execution role and assign the service to use this role, type the following command.
aws iam create-role --role-name lambda-partition-execution-role --assume-role-policy-document file://<path>/assume-lambda-policy.json
To attach the policy (lambda-partition-exec-policy) created to the AWS Lambda execution role (lambda-partition-execution-role), type the following command.
aws iam attach-role-policy --role-name lambda-partition-execution-role --policy-arn arn:aws:iam::<your-account-id>:policy/lambda-partition-exec-policy
Following is the lambda-partition-function-execution-policy.json file, which is the IAM policy used by the Lambda execution role.
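The contents of the policy file aren't reproduced in the post text, so the following heredoc is a minimal sketch of an equivalent file based on the four permission areas listed above (DynamoDB lookup and write, Athena table alteration, CloudWatch Logs, and S3). The action scoping, table name, and bucket placeholders are assumptions; tighten them for your account before use.
cat > lambda-partition-function-execution-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "CloudWatchLogsAccess",
      "Effect": "Allow",
      "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
      "Resource": "arn:aws:logs:*:*:*"
    },
    {
      "Sid": "DynamoDBPartitionLookup",
      "Effect": "Allow",
      "Action": ["dynamodb:GetItem", "dynamodb:PutItem"],
      "Resource": "arn:aws:dynamodb:*:<your-account-id>:table/athenapartitiondetails"
    },
    {
      "Sid": "AthenaQueryAccess",
      "Effect": "Allow",
      "Action": ["athena:StartQueryExecution", "athena:GetQueryExecution", "athena:GetQueryResults"],
      "Resource": "*"
    },
    {
      "Sid": "S3Access",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:GetBucketLocation", "s3:ListBucket"],
      "Resource": ["arn:aws:s3:::<processed-logs-bucket>", "arn:aws:s3:::<processed-logs-bucket>/*"]
    }
  ]
}
EOF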
From the AWS Management Console, create a new Lambda function with Java8 as the runtime. Select the Blank Function blueprint.
On the Configure triggers page, choose the name of the S3 bucket where the preprocessed logs are delivered. Choose Put for the Event Type. For Prefix, type the name of the prefix folder, if any, where preprocessed logs are delivered by Firehose (for example, out/). For Suffix, type the suffix of the compression format in which the Firehose delivery stream (CloudFrontLogsToS3) delivers the preprocessed logs (for example, gz). To invoke Lambda to retrieve the logs from the S3 bucket as they are delivered, select Enable Trigger.
Choose Next and provide a function name to identify this Lambda partitioning function.
Choose Java8 for Runtime for the AWS Lambda function. Choose Upload a .ZIP or .JAR file for the Code entry type, and choose Upload to upload the downloaded aws-lambda-athena-1.0.0.jar file.
Next, create the following environment variables for the Lambda function (a CLI sketch for setting these values follows the list):
TABLE_NAME – The name of the Athena table (for example, cf_logs).
PARTITION_TYPE – The partition to create for the logs delivered to the subfolders of the S3 bucket: Hour, Date, Month, or Year. Set this to ALL to partition by year, month, date, and hour.
DDB_TABLE_NAME – The name of the DynamoDB table holding partition information (for example, athenapartitiondetails).
ATHENA_REGION – The current AWS Region for the Athena table to construct the JDBC connection string.
S3_STAGING_DIR – The Amazon S3 location where your query output is written. The JDBC driver asks Athena to read the results and provide rows of data back to the user (for example, s3://<bucketname>/<folder>/).
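If you prefer to set these variables from the AWS CLI after the function exists, a sketch of the equivalent call follows. The function name and Region are placeholders, and the other values match the examples above.
aws lambda update-function-configuration \
    --function-name <partitioning-function-name> \
    --environment "Variables={TABLE_NAME=cf_logs,PARTITION_TYPE=ALL,DDB_TABLE_NAME=athenapartitiondetails,ATHENA_REGION=<aws-region>,S3_STAGING_DIR=s3://<bucketname>/<folder>/}"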
To configure the function handler and IAM, for Handler copy and paste the name of the handler: com.amazonaws.services.lambda.CreateAthenaPartitionsBasedOnS3EventWithDDB::handleRequest. Choose the existing IAM role, lambda-partition-execution-role.
Choose Next and then Create Lambda.
Interactive analysis using Amazon Athena
In this section, we analyze the historical data that has been collected in the processed logs bucket and registered as partitions in the Amazon Athena table.
Scenario 1 is robot traffic by edge location.
SELECT COUNT(*) AS ct, requestip, location FROM cf_logs
WHERE isbot='True'
GROUP BY requestip, location
ORDER BY ct DESC;
Scenario 2 is total bytes transferred per distribution for each edge location for your website.
SELECT distribution, location, SUM(bytes) as totalBytes
FROM cf_logs
GROUP BY location, distribution;
Scenario 3 is the top 50 viewers of your website.
SELECT requestip, COUNT(*) AS ct FROM cf_logs
GROUP BY requestip
ORDER BY ct DESC
LIMIT 50;
Streaming analysis using Amazon Kinesis Analytics
In this section, you deploy a stream processing application using Amazon Kinesis Analytics to analyze the preprocessed Amazon CloudFront log stream. This application reads directly from the Amazon Kinesis Firehose delivery stream as the preprocessed logs are delivered to the processed logs bucket. The streaming queries in this section focus on gaining the following insights:
The IP addresses of bots currently sending requests to your website, identified by Amazon CloudFront edge location. The query also includes the total bytes transferred as part of the responses.
The total bytes served per distribution per edge location for your website.
The top 50 viewers of your website.
Before we create the Amazon Kinesis Analytics application, we need to create an IAM role that gives the analytics application permission to access the Amazon Kinesis Firehose delivery stream.
To download the firehose-access-policy.json file, type the following command.
Let’s use the AWS CLI to create the IAM role using the following three steps:
Create the IAM policy (firehose-access-policy) for the Analytics execution role to use.
Create the Analytics execution role (ka-execution-role) and assign the service to use this role.
Attach the policy created in step 1 to the Analytics execution role.
Following is the firehose-access-policy.json file, which is the IAM policy used by Kinesis Analytics to read from the Firehose delivery stream.
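The contents of firehose-access-policy.json aren't reproduced in the post text, so the following heredoc is a minimal sketch of the read permissions that a Kinesis Analytics application needs on its Firehose input. The Region and account ID are placeholders, and the action list is an assumption to adjust as needed.
cat > firehose-access-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadFirehoseInput",
      "Effect": "Allow",
      "Action": ["firehose:DescribeDeliveryStream", "firehose:Get*"],
      "Resource": "arn:aws:firehose:<aws-region>:<your-account-id>:deliverystream/CloudFrontLogsToS3"
    }
  ]
}
EOF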
To create the IAM policy used by the Analytics execution role, type the following command.
aws iam create-policy --policy-name firehose-access-policy --policy-document file://<path>/firehose-access-policy.json
To create the Analytics execution role (ka-execution-role) and assign the service to use this role, type the following command. (The trust policy file referenced here must allow the kinesisanalytics.amazonaws.com service to assume the role; its file name is a placeholder.)
aws iam create-role --role-name ka-execution-role --assume-role-policy-document file://<path>/assume-kinesis-analytics-policy.json
To attach the policy (firehose-access-policy) created to the Analytics execution role (ka-execution-role), type the following command.
aws iam attach-role-policy --role-name ka-execution-role --policy-arn arn:aws:iam::<your-account-id>:policy/firehose-access-policy
To deploy the Analytics application, first download the configuration file and then modify ResourceARN and RoleARN for the Amazon Kinesis Firehose input configuration.
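Once ResourceARN and RoleARN are updated, you can create the application with a command along these lines; the configuration file name is a placeholder for whatever you saved the downloaded file as.
aws kinesisanalytics create-application --cli-input-json file://<path>/kinesis-analytics-app-config.json
You can then add the SQL from the scenarios below as the application code, for example through the SQL editor in the Kinesis Analytics console.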
Scenario 1 is a query for detecting bots that are sending requests to your website.
-- Create output stream, which can be used to send to a destination
CREATE OR REPLACE STREAM "BOT_DETECTION" (requesttime TIME, destribution VARCHAR(16), requestip VARCHAR(64), edgelocation VARCHAR(64), totalBytes BIGINT);
-- Create pump to insert into output
CREATE OR REPLACE PUMP "BOT_DETECTION_PUMP" AS INSERT INTO "BOT_DETECTION"
--
SELECT STREAM
STEP("CF_LOG_STREAM_001"."request_time" BY INTERVAL '1' SECOND) as requesttime,
"distribution_name" as distribution,
"request_ip" as requestip,
"edge_location" as edgelocation,
SUM("bytes") as totalBytes
FROM "CF_LOG_STREAM_001"
WHERE "is_bot" = true
GROUP BY "request_ip", "edge_location", "distribution_name",
STEP("CF_LOG_STREAM_001"."request_time" BY INTERVAL '1' SECOND),
STEP("CF_LOG_STREAM_001".ROWTIME BY INTERVAL '1' SECOND);
Scenario 2 is a query for total bytes transferred per distribution for each edge location for your website.
-- Create output stream, which can be used to send to a destination
CREATE OR REPLACE STREAM "BYTES_TRANSFFERED" (requesttime TIME, destribution VARCHAR(16), edgelocation VARCHAR(64), totalBytes BIGINT);
-- Create pump to insert into output
CREATE OR REPLACE PUMP "BYTES_TRANSFFERED_PUMP" AS INSERT INTO "BYTES_TRANSFFERED"
-- Bytes Transffered per second per web destribution by edge location
SELECT STREAM
STEP("CF_LOG_STREAM_001"."request_time" BY INTERVAL '1' SECOND) as requesttime,
"distribution_name" as distribution,
"edge_location" as edgelocation,
SUM("bytes") as totalBytes
FROM "CF_LOG_STREAM_001"
GROUP BY "distribution_name", "edge_location", "request_date",
STEP("CF_LOG_STREAM_001"."request_time" BY INTERVAL '1' SECOND),
STEP("CF_LOG_STREAM_001".ROWTIME BY INTERVAL '1' SECOND);
Scenario 3 is a query for the top 50 viewers for your website.
-- Create output stream, which can be used to send to a destination
CREATE OR REPLACE STREAM "TOP_TALKERS" (requestip VARCHAR(64), requestcount DOUBLE);
-- Create pump to insert into output
CREATE OR REPLACE PUMP "TOP_TALKERS_PUMP" AS INSERT INTO "TOP_TALKERS"
-- Top 50 talkers
SELECT STREAM ITEM as requestip, ITEM_COUNT as requestcount FROM TABLE(TOP_K_ITEMS_TUMBLING(
CURSOR(SELECT STREAM * FROM "CF_LOG_STREAM_001"),
'request_ip', -- name of column in single quotes
50, -- number of top items
60 -- tumbling window size in seconds
)
);
Conclusion
Following the steps in this blog post, you just built an end-to-end serverless architecture to analyze Amazon CloudFront access logs. You analyzed these logs in both interactive and streaming mode, using Amazon Athena and Amazon Kinesis Analytics, respectively.
By partitioning the Athena table for the logs delivered to a centralized bucket, this architecture is optimized for performance and cost when analyzing large volumes of logs for popular websites that receive millions of requests. Here, we have focused on just three common use cases for analysis, sharing the analytic queries as part of the post. However, you can extend this architecture to gain deeper insights, generate usage reports, reduce latency, and increase availability. This way, you can provide a better experience on websites fronted by Amazon CloudFront.
In this blog post, we focused on building serverless architecture to analyze Amazon CloudFront access logs. Our plan is to extend the solution to provide rich visualization as part of our next blog post.
About the Authors
Rajeev Srinivasan is a Senior Solution Architect for AWS. He works closely with our customers to provide big data and NoSQL solutions leveraging the AWS platform, and he enjoys coding. In his spare time, he enjoys riding his motorcycle and reading books.
Sai Sriparasa is a consultant with AWS Professional Services. He works with our customers to provide strategic and tactical big data solutions with an emphasis on automation, operations & security on AWS. In his spare time, he follows sports and current affairs.