Tag Archives: Amazon CloudWatch Logs

Analyze Apache Parquet optimized data using Amazon Kinesis Data Firehose, Amazon Athena, and Amazon Redshift

Post Syndicated from Roy Hasson original https://aws.amazon.com/blogs/big-data/analyzing-apache-parquet-optimized-data-using-amazon-kinesis-data-firehose-amazon-athena-and-amazon-redshift/

Amazon Kinesis Data Firehose is the easiest way to capture and stream data into a data lake built on Amazon S3. This data can be anything—from AWS service logs like AWS CloudTrail log files, Amazon VPC Flow Logs, Application Load Balancer logs, and others. It can also be IoT events, game events, and much more. To efficiently query this data, a time-consuming ETL (extract, transform, and load) process is required to massage and convert the data to an optimal file format, which increases the time to insight. This situation is less than ideal, especially for real-time data that loses its value over time.

To solve this common challenge, Kinesis Data Firehose can now save data to Amazon S3 in Apache Parquet or Apache ORC format. These are optimized columnar formats that are highly recommended for best performance and cost-savings when querying data in S3. This feature directly benefits you if you use Amazon Athena, Amazon Redshift, AWS Glue, Amazon EMR, or any other big data tools that are available from the AWS Partner Network and through the open-source community.

Amazon Connect is a simple-to-use, cloud-based contact center service that makes it easy for any business to provide a great customer experience at a lower cost than common alternatives. Its open platform design enables easy integration with other systems. One of those systems is Amazon Kinesis—in particular, Kinesis Data Streams and Kinesis Data Firehose.

What’s really exciting is that you can now save events from Amazon Connect to S3 in Apache Parquet format. You can then perform analytics using Amazon Athena and Amazon Redshift Spectrum in real time, taking advantage of this key performance and cost optimization. Of course, Amazon Connect is only one example. This new capability opens the door for a great deal of opportunity, especially as organizations continue to build their data lakes.

Amazon Connect includes an array of analytics views in the Administrator dashboard. But you might want to run other types of analysis. In this post, I describe how to set up a data stream from Amazon Connect through Kinesis Data Streams and Kinesis Data Firehose and out to S3, and then perform analytics using Athena and Amazon Redshift Spectrum. I focus primarily on the Kinesis Data Firehose support for Parquet and its integration with the AWS Glue Data Catalog, Amazon Athena, and Amazon Redshift.

Solution overview

Here is how the solution is laid out:

 

 

The following sections walk you through each of these steps to set up the pipeline.

1. Define the schema

When Kinesis Data Firehose processes incoming events and converts the data to Parquet, it needs to know which schema to apply. The reason is that many times, incoming events contain all or some of the expected fields based on which values the producers are advertising. A typical process is to normalize the schema during a batch ETL job so that you end up with a consistent schema that can easily be understood and queried. Doing this introduces latency due to the nature of the batch process. To overcome this issue, Kinesis Data Firehose requires the schema to be defined in advance.

To see the available columns and structures, see Amazon Connect Agent Event Streams. For the purpose of simplicity, I opted to make all the columns of type String rather than create the nested structures. But you can definitely do that if you want.

The simplest way to define the schema is to create a table in the Amazon Athena console. Open the Athena console, and paste the following create table statement, substituting your own S3 bucket and prefix for where your event data will be stored. A Data Catalog database is a logical container that holds the different tables that you can create. The default database name shown here should already exist. If it doesn’t, you can create it or use another database that you’ve already created.

CREATE EXTERNAL TABLE default.kfhconnectblog (
  awsaccountid string,
  agentarn string,
  currentagentsnapshot string,
  eventid string,
  eventtimestamp string,
  eventtype string,
  instancearn string,
  previousagentsnapshot string,
  version string
)
STORED AS parquet
LOCATION 's3://your_bucket/kfhconnectblog/'
TBLPROPERTIES ("parquet.compression"="SNAPPY")

That’s all you have to do to prepare the schema for Kinesis Data Firehose.

2. Define the data streams

Next, you need to define the Kinesis data streams that will be used to stream the Amazon Connect events.  Open the Kinesis Data Streams console and create two streams.  You can configure them with only one shard each because you don’t have a lot of data right now.

3. Define the Kinesis Data Firehose delivery stream for Parquet

Let’s configure the Data Firehose delivery stream using the data stream as the source and Amazon S3 as the output. Start by opening the Kinesis Data Firehose console and creating a new data delivery stream. Give it a name, and associate it with the Kinesis data stream that you created in Step 2.

As shown in the following screenshot, enable Record format conversion (1) and choose Apache Parquet (2). As you can see, Apache ORC is also supported. Scroll down and provide the AWS Glue Data Catalog database name (3) and table names (4) that you created in Step 1. Choose Next.

To make things easier, the output S3 bucket and prefix fields are automatically populated using the values that you defined in the LOCATION parameter of the create table statement from Step 1. Pretty cool. Additionally, you have the option to save the raw events into another location as defined in the Source record S3 backup section. Don’t forget to add a trailing forward slash “ / “ so that Data Firehose creates the date partitions inside that prefix.

On the next page, in the S3 buffer conditions section, there is a note about configuring a large buffer size. The Parquet file format is highly efficient in how it stores and compresses data. Increasing the buffer size allows you to pack more rows into each output file, which is preferred and gives you the most benefit from Parquet.

Compression using Snappy is automatically enabled for both Parquet and ORC. You can modify the compression algorithm by using the Kinesis Data Firehose API and update the OutputFormatConfiguration.

Be sure to also enable Amazon CloudWatch Logs so that you can debug any issues that you might run into.

Lastly, finalize the creation of the Firehose delivery stream, and continue on to the next section.

4. Set up the Amazon Connect contact center

After setting up the Kinesis pipeline, you now need to set up a simple contact center in Amazon Connect. The Getting Started page provides clear instructions on how to set up your environment, acquire a phone number, and create an agent to accept calls.

After setting up the contact center, in the Amazon Connect console, choose your Instance Alias, and then choose Data Streaming. Under Agent Event, choose the Kinesis data stream that you created in Step 2, and then choose Save.

At this point, your pipeline is complete.  Agent events from Amazon Connect are generated as agents go about their day. Events are sent via Kinesis Data Streams to Kinesis Data Firehose, which converts the event data from JSON to Parquet and stores it in S3. Athena and Amazon Redshift Spectrum can simply query the data without any additional work.

So let’s generate some data. Go back into the Administrator console for your Amazon Connect contact center, and create an agent to handle incoming calls. In this example, I creatively named mine Agent One. After it is created, Agent One can get to work and log into their console and set their availability to Available so that they are ready to receive calls.

To make the data a bit more interesting, I also created a second agent, Agent Two. I then made some incoming and outgoing calls and caused some failures to occur, so I now have enough data available to analyze.

5. Analyze the data with Athena

Let’s open the Athena console and run some queries. One thing you’ll notice is that when we created the schema for the dataset, we defined some of the fields as Strings even though in the documentation they were complex structures.  The reason for doing that was simply to show some of the flexibility of Athena to be able to parse JSON data. However, you can define nested structures in your table schema so that Kinesis Data Firehose applies the appropriate schema to the Parquet file.

Let’s run the first query to see which agents have logged into the system.

The query might look complex, but it’s fairly straightforward:

WITH dataset AS (
  SELECT 
    from_iso8601_timestamp(eventtimestamp) AS event_ts,
    eventtype,
    -- CURRENT STATE
    json_extract_scalar(
      currentagentsnapshot,
      '$.agentstatus.name') AS current_status,
    from_iso8601_timestamp(
      json_extract_scalar(
        currentagentsnapshot,
        '$.agentstatus.starttimestamp')) AS current_starttimestamp,
    json_extract_scalar(
      currentagentsnapshot, 
      '$.configuration.firstname') AS current_firstname,
    json_extract_scalar(
      currentagentsnapshot,
      '$.configuration.lastname') AS current_lastname,
    json_extract_scalar(
      currentagentsnapshot, 
      '$.configuration.username') AS current_username,
    json_extract_scalar(
      currentagentsnapshot, 
      '$.configuration.routingprofile.defaultoutboundqueue.name') AS               current_outboundqueue,
    json_extract_scalar(
      currentagentsnapshot, 
      '$.configuration.routingprofile.inboundqueues[0].name') as current_inboundqueue,
    -- PREVIOUS STATE
    json_extract_scalar(
      previousagentsnapshot, 
      '$.agentstatus.name') as prev_status,
    from_iso8601_timestamp(
      json_extract_scalar(
        previousagentsnapshot, 
       '$.agentstatus.starttimestamp')) as prev_starttimestamp,
    json_extract_scalar(
      previousagentsnapshot, 
      '$.configuration.firstname') as prev_firstname,
    json_extract_scalar(
      previousagentsnapshot, 
      '$.configuration.lastname') as prev_lastname,
    json_extract_scalar(
      previousagentsnapshot, 
      '$.configuration.username') as prev_username,
    json_extract_scalar(
      previousagentsnapshot, 
      '$.configuration.routingprofile.defaultoutboundqueue.name') as current_outboundqueue,
    json_extract_scalar(
      previousagentsnapshot, 
      '$.configuration.routingprofile.inboundqueues[0].name') as prev_inboundqueue
  from kfhconnectblog
  where eventtype <> 'HEART_BEAT'
)
SELECT
  current_status as status,
  current_username as username,
  event_ts
FROM dataset
WHERE eventtype = 'LOGIN' AND current_username <> ''
ORDER BY event_ts DESC

The query output looks something like this:

Here is another query that shows the sessions each of the agents engaged with. It tells us where they were incoming or outgoing, if they were completed, and where there were missed or failed calls.

WITH src AS (
  SELECT
     eventid,
     json_extract_scalar(currentagentsnapshot, '$.configuration.username') as username,
     cast(json_extract(currentagentsnapshot, '$.contacts') AS ARRAY(JSON)) as c,
     cast(json_extract(previousagentsnapshot, '$.contacts') AS ARRAY(JSON)) as p
  from kfhconnectblog
),
src2 AS (
  SELECT *
  FROM src CROSS JOIN UNNEST (c, p) AS contacts(c_item, p_item)
),
dataset AS (
SELECT 
  eventid,
  username,
  json_extract_scalar(c_item, '$.contactid') as c_contactid,
  json_extract_scalar(c_item, '$.channel') as c_channel,
  json_extract_scalar(c_item, '$.initiationmethod') as c_direction,
  json_extract_scalar(c_item, '$.queue.name') as c_queue,
  json_extract_scalar(c_item, '$.state') as c_state,
  from_iso8601_timestamp(json_extract_scalar(c_item, '$.statestarttimestamp')) as c_ts,
  
  json_extract_scalar(p_item, '$.contactid') as p_contactid,
  json_extract_scalar(p_item, '$.channel') as p_channel,
  json_extract_scalar(p_item, '$.initiationmethod') as p_direction,
  json_extract_scalar(p_item, '$.queue.name') as p_queue,
  json_extract_scalar(p_item, '$.state') as p_state,
  from_iso8601_timestamp(json_extract_scalar(p_item, '$.statestarttimestamp')) as p_ts
FROM src2
)
SELECT 
  username,
  c_channel as channel,
  c_direction as direction,
  p_state as prev_state,
  c_state as current_state,
  c_ts as current_ts,
  c_contactid as id
FROM dataset
WHERE c_contactid = p_contactid
ORDER BY id DESC, current_ts ASC

The query output looks similar to the following:

6. Analyze the data with Amazon Redshift Spectrum

With Amazon Redshift Spectrum, you can query data directly in S3 using your existing Amazon Redshift data warehouse cluster. Because the data is already in Parquet format, Redshift Spectrum gets the same great benefits that Athena does.

Here is a simple query to show querying the same data from Amazon Redshift. Note that to do this, you need to first create an external schema in Amazon Redshift that points to the AWS Glue Data Catalog.

SELECT 
  eventtype,
  json_extract_path_text(currentagentsnapshot,'agentstatus','name') AS current_status,
  json_extract_path_text(currentagentsnapshot, 'configuration','firstname') AS current_firstname,
  json_extract_path_text(currentagentsnapshot, 'configuration','lastname') AS current_lastname,
  json_extract_path_text(
    currentagentsnapshot,
    'configuration','routingprofile','defaultoutboundqueue','name') AS current_outboundqueue,
FROM default_schema.kfhconnectblog

The following shows the query output:

Summary

In this post, I showed you how to use Kinesis Data Firehose to ingest and convert data to columnar file format, enabling real-time analysis using Athena and Amazon Redshift. This great feature enables a level of optimization in both cost and performance that you need when storing and analyzing large amounts of data. This feature is equally important if you are investing in building data lakes on AWS.

 


Additional Reading

If you found this post useful, be sure to check out Analyzing VPC Flow Logs with Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight and Work with partitioned data in AWS Glue.


About the Author

Roy Hasson is a Global Business Development Manager for AWS Analytics. He works with customers around the globe to design solutions to meet their data processing, analytics and business intelligence needs. Roy is big Manchester United fan cheering his team on and hanging out with his family.

 

 

 

Improve the Operational Efficiency of Amazon Elasticsearch Service Domains with Automated Alarms Using Amazon CloudWatch

Post Syndicated from Veronika Megler original https://aws.amazon.com/blogs/big-data/improve-the-operational-efficiency-of-amazon-elasticsearch-service-domains-with-automated-alarms-using-amazon-cloudwatch/

A customer has been successfully creating and running multiple Amazon Elasticsearch Service (Amazon ES) domains to support their business users’ search needs across products, orders, support documentation, and a growing suite of similar needs. The service has become heavily used across the organization.  This led to some domains running at 100% capacity during peak times, while others began to run low on storage space. Because of this increased usage, the technical teams were in danger of missing their service level agreements.  They contacted me for help.

This post shows how you can set up automated alarms to warn when domains need attention.

Solution overview

Amazon ES is a fully managed service that delivers Elasticsearch’s easy-to-use APIs and real-time analytics capabilities along with the availability, scalability, and security that production workloads require.  The service offers built-in integrations with a number of other components and AWS services, enabling customers to go from raw data to actionable insights quickly and securely.

One of these other integrated services is Amazon CloudWatch. CloudWatch is a monitoring service for AWS Cloud resources and the applications that you run on AWS. You can use CloudWatch to collect and track metrics, collect and monitor log files, set alarms, and automatically react to changes in your AWS resources.

CloudWatch collects metrics for Amazon ES. You can use these metrics to monitor the state of your Amazon ES domains, and set alarms to notify you about high utilization of system resources.  For more information, see Amazon Elasticsearch Service Metrics and Dimensions.

While the metrics are automatically collected, the missing piece is how to set alarms on these metrics at appropriate levels for each of your domains. This post includes sample Python code to evaluate the current state of your Amazon ES environment, and to set up alarms according to AWS recommendations and best practices.

There are two components to the sample solution:

  • es-check-cwalarms.py: This Python script checks the CloudWatch alarms that have been set, for all Amazon ES domains in a given account and region.
  • es-create-cwalarms.py: This Python script sets up a set of CloudWatch alarms for a single given domain.

The sample code can also be found in the amazon-es-check-cw-alarms GitHub repo. The scripts are easy to extend or combine, as described in the section “Extensions and Adaptations”.

Assessing the current state

The first script, es-check-cwalarms.py, is used to give an overview of the configurations and alarm settings for all the Amazon ES domains in the given region. The script takes the following parameters:

python es-checkcwalarms.py -h
usage: es-checkcwalarms.py [-h] [-e ESPREFIX] [-n NOTIFY] [-f FREE][-p PROFILE] [-r REGION]
Checks a set of recommended CloudWatch alarms for Amazon Elasticsearch Service domains (optionally, those beginning with a given prefix).
optional arguments:
  -h, --help   		show this help message and exit
  -e ESPREFIX, --esprefix ESPREFIX	Only check Amazon Elasticsearch Service domains that begin with this prefix.
  -n NOTIFY, --notify NOTIFY    List of CloudWatch alarm actions; e.g. ['arn:aws:sns:xxxx']
  -f FREE, --free FREE  Minimum free storage (MB) on which to alarm
  -p PROFILE, --profile PROFILE     IAM profile name to use
  -r REGION, --region REGION       AWS region for the domain. Default: us-east-1

The script first identifies all the domains in the given region (or, optionally, limits them to the subset that begins with a given prefix). It then starts running a set of checks against each one.

The script can be run from the command line or set up as a scheduled Lambda function. For example, for one customer, it was deemed appropriate to regularly run the script to check that alarms were correctly set for all domains. In addition, because configuration changes—cluster size increases to accommodate larger workloads being a common change—might require updates to alarms, this approach allowed the automatic identification of alarms no longer appropriately set as the domain configurations changed.

The output shown below is the output for one domain in my account.

Starting checks for Elasticsearch domain iotfleet , version is 53
Iotfleet Automated snapshot hour (UTC): 0
Iotfleet Instance configuration: 1 instances; type:m3.medium.elasticsearch
Iotfleet Instance storage definition is: 4 GB; free storage calced to: 819.2 MB
iotfleet Desired free storage set to (in MB): 819.2
iotfleet WARNING: Not using VPC Endpoint
iotfleet WARNING: Does not have Zone Awareness enabled
iotfleet WARNING: Instance count is ODD. Best practice is for an even number of data nodes and zone awareness.
iotfleet WARNING: Does not have Dedicated Masters.
iotfleet WARNING: Neither index nor search slow logs are enabled.
iotfleet WARNING: EBS not in use. Using instance storage only.
iotfleet Alarm ok; definition matches. Test-Elasticsearch-iotfleet-ClusterStatus.yellow-Alarm ClusterStatus.yellow
iotfleet Alarm ok; definition matches. Test-Elasticsearch-iotfleet-ClusterStatus.red-Alarm ClusterStatus.red
iotfleet Alarm ok; definition matches. Test-Elasticsearch-iotfleet-CPUUtilization-Alarm CPUUtilization
iotfleet Alarm ok; definition matches. Test-Elasticsearch-iotfleet-JVMMemoryPressure-Alarm JVMMemoryPressure
iotfleet WARNING: Missing alarm!! ('ClusterIndexWritesBlocked', 'Maximum', 60, 5, 'GreaterThanOrEqualToThreshold', 1.0)
iotfleet Alarm ok; definition matches. Test-Elasticsearch-iotfleet-AutomatedSnapshotFailure-Alarm AutomatedSnapshotFailure
iotfleet Alarm: Threshold does not match: Test-Elasticsearch-iotfleet-FreeStorageSpace-Alarm Should be:  819.2 ; is 3000.0

The output messages fall into the following categories:

  • System overview, Informational: The Amazon ES version and configuration, including instance type and number, storage, automated snapshot hour, etc.
  • Free storage: A calculation for the appropriate amount of free storage, based on the recommended 20% of total storage.
  • Warnings: best practices that are not being followed for this domain. (For more about this, read on.)
  • Alarms: An assessment of the CloudWatch alarms currently set for this domain, against a recommended set.

The script contains an array of recommended CloudWatch alarms, based on best practices for these metrics and statistics. Using the array allows alarm parameters (such as free space) to be updated within the code based on current domain statistics and configurations.

For a given domain, the script checks if each alarm has been set. If the alarm is set, it checks whether the values match those in the array esAlarms. In the output above, you can see three different situations being reported:

  • Alarm ok; definition matches. The alarm set for the domain matches the settings in the array.
  • Alarm: Threshold does not match. An alarm exists, but the threshold value at which the alarm is triggered does not match.
  • WARNING: Missing alarm!! The recommended alarm is missing.

All in all, the list above shows that this domain does not have a configuration that adheres to best practices, nor does it have all the recommended alarms.

Setting up alarms

Now that you know that the domains in their current state are missing critical alarms, you can correct the situation.

To demonstrate the script, set up a new domain named “ver”, in us-west-2. Specify 1 node, and a 10-GB EBS disk. Also, create an SNS topic in us-west-2 with a name of “sendnotification”, which sends you an email.

Run the second script, es-create-cwalarms.py, from the command line. This script creates (or updates) the desired CloudWatch alarms for the specified Amazon ES domain, “ver”.

python es-create-cwalarms.py -r us-west-2 -e test -c ver -n "['arn:aws:sns:us-west-2:xxxxxxxxxx:sendnotification']"
EBS enabled: True type: gp2 size (GB): 10 No Iops 10240  total storage (MB)
Desired free storage set to (in MB): 2048.0
Creating  Test-Elasticsearch-ver-ClusterStatus.yellow-Alarm
Creating  Test-Elasticsearch-ver-ClusterStatus.red-Alarm
Creating  Test-Elasticsearch-ver-CPUUtilization-Alarm
Creating  Test-Elasticsearch-ver-JVMMemoryPressure-Alarm
Creating  Test-Elasticsearch-ver-FreeStorageSpace-Alarm
Creating  Test-Elasticsearch-ver-ClusterIndexWritesBlocked-Alarm
Creating  Test-Elasticsearch-ver-AutomatedSnapshotFailure-Alarm
Successfully finished creating alarms!

As with the first script, this script contains an array of recommended CloudWatch alarms, based on best practices for these metrics and statistics. This approach allows you to add or modify alarms based on your use case (more on that below).

After running the script, navigate to Alarms on the CloudWatch console. You can see the set of alarms set up on your domain.

Because the “ver” domain has only a single node, cluster status is yellow, and that alarm is in an “ALARM” state. It’s already sent a notification that the alarm has been triggered.

What to do when an alarm triggers

After alarms are set up, you need to identify the correct action to take for each alarm, which depends on the alarm triggered. For ideas, guidance, and additional pointers to supporting documentation, see Get Started with Amazon Elasticsearch Service: Set CloudWatch Alarms on Key Metrics. For information about common errors and recovery actions to take, see Handling AWS Service Errors.

In most cases, the alarm triggers due to an increased workload. The likely action is to reconfigure the system to handle the increased workload, rather than reducing the incoming workload. Reconfiguring any backend store—a category of systems that includes Elasticsearch—is best performed when the system is quiescent or lightly loaded. Reconfigurations such as setting zone awareness or modifying the disk type cause Amazon ES to enter a “processing” state, potentially disrupting client access.

Other changes, such as increasing the number of data nodes, may cause Elasticsearch to begin moving shards, potentially impacting search performance on these shards while this is happening. These actions should be considered in the context of your production usage. For the same reason I also do not recommend running a script that resets all domains to match best practices.

Avoid the need to reconfigure during heavy workload by setting alarms at a level that allows a considered approach to making the needed changes. For example, if you identify that each weekly peak is increasing, you can reconfigure during a weekly quiet period.

While Elasticsearch can be reconfigured without being quiesced, it is not a best practice to automatically scale it up and down based on usage patterns. Unlike some other AWS services, I recommend against setting a CloudWatch action that automatically reconfigures the system when alarms are triggered.

There are other situations where the planned reconfiguration approach may not work, such as low or zero free disk space causing the domain to reject writes. If the business is dependent on the domain continuing to accept incoming writes and deleting data is not an option, the team may choose to reconfigure immediately.

Extensions and adaptations

You may wish to modify the best practices encoded in the scripts for your own environment or workloads. It’s always better to avoid situations where alerts are generated but routinely ignored. All alerts should trigger a review and one or more actions, either immediately or at a planned date. The following is a list of common situations where you may wish to set different alarms for different domains:

  • Dev/test vs. production
    You may have a different set of configuration rules and alarms for your dev environment configurations than for test. For example, you may require zone awareness and dedicated masters for your production environment, but not for your development domains. Or, you may not have any alarms set in dev. For test environments that mirror your potential peak load, test to ensure that the alarms are appropriately triggered.
  • Differing workloads or SLAs for different domains
    You may have one domain with a requirement for superfast search performance, and another domain with a heavy ingest load that tolerates slower search response. Your reaction to slow response for these two workloads is likely to be different, so perhaps the thresholds for these two domains should be set at a different level. In this case, you might add a “max CPU utilization” alarm at 100% for 1 minute for the fast search domain, while the other domain only triggers an alarm when the average has been higher than 60% for 5 minutes. You might also add a “free space” rule with a higher threshold to reflect the need for more space for the heavy ingest load if there is danger that it could fill the available disk quickly.
  • “Normal” alarms versus “emergency” alarms
    If, for example, free disk space drops to 25% of total capacity, an alarm is triggered that indicates action should be taken as soon as possible, such as cleaning up old indexes or reconfiguring at the next quiet period for this domain. However, if free space drops below a critical level (20% free space), action must be taken immediately in order to prevent Amazon ES from setting the domain to read-only. Similarly, if the “ClusterIndexWritesBlocked” alarm triggers, the domain has already stopped accepting writes, so immediate action is needed. In this case, you may wish to set “laddered” alarms, where one threshold causes an alarm to be triggered to review the current workload for a planned reconfiguration, but a different threshold raises a “DefCon 3” alarm that immediate action is required.

The sample scripts provided here are a starting point, intended for you to adapt to your own environment and needs.

Running the scripts one time can identify how far your current state is from your desired state, and create an initial set of alarms. Regularly re-running these scripts can capture changes in your environment over time and adjusting your alarms for changes in your environment and configurations. One customer has set them up to run nightly, and to automatically create and update alarms to match their preferred settings.

Removing unwanted alarms

Each CloudWatch alarm costs approximately $0.10 per month. You can remove unwanted alarms in the CloudWatch console, under Alarms. If you set up a “ver” domain above, remember to remove it to avoid continuing charges.

Conclusion

Setting CloudWatch alarms appropriately for your Amazon ES domains can help you avoid suboptimal performance and allow you to respond to workload growth or configuration issues well before they become urgent. This post gives you a starting point for doing so. The additional sleep you’ll get knowing you don’t need to be concerned about Elasticsearch domain performance will allow you to focus on building creative solutions for your business and solving problems for your customers.

Enjoy!


Additional Reading

If you found this post useful, be sure to check out Analyzing Amazon Elasticsearch Service Slow Logs Using Amazon CloudWatch Logs Streaming and Kibana and Get Started with Amazon Elasticsearch Service: How Many Shards Do I Need?

 


About the Author

Dr. Veronika Megler is a senior consultant at Amazon Web Services. She works with our customers to implement innovative big data, AI and ML projects, helping them accelerate their time-to-value when using AWS.

 

 

 

Central Logging in Multi-Account Environments

Post Syndicated from matouk original https://aws.amazon.com/blogs/architecture/central-logging-in-multi-account-environments/

Centralized logging is often required in large enterprise environments for a number of reasons, ranging from compliance and security to analytics and application-specific needs.

I’ve seen that in a multi-account environment, whether the accounts belong to the same line of business or multiple business units, collecting logs in a central, dedicated logging account is an established best practice. It helps security teams detect malicious activities both in real-time and during incident response. It provides protection to log data in case it is accidentally or intentionally deleted. It also helps application teams correlate and analyze log data across multiple application tiers.

This blog post provides a solution and building blocks to stream Amazon CloudWatch log data across accounts. In a multi-account environment this repeatable solution could be deployed multiple times to stream all relevant Amazon CloudWatch log data from all accounts to a centralized logging account.

Solution Summary 

The solution uses Amazon Kinesis Data Streams and a log destination to set up an endpoint in the logging account to receive streamed logs and uses Amazon Kinesis Data Firehose to deliver log data to the Amazon Simple Storage Solution (S3) bucket. Application accounts will subscribe to stream all (or part) of their Amazon CloudWatch logs to a defined destination in the logging account via subscription filters.

Below is a diagram illustrating how the various services work together.


In logging an account, a Kinesis Data Stream is created to receive streamed log data and a log destination is created to facilitate remote streaming, configured to use the Kinesis Data Stream as its target.

The Amazon Kinesis Data Firehose stream is created to deliver log data from the data stream to S3. The delivery stream uses a generic AWS Lambda function for data validation and transformation.

In each application account, a subscription filter is created between each Amazon CloudWatch log group and the destination created for this log group in the logging account.

The following steps are involved in setting up the central-logging solution:

  1. Create an Amazon S3 bucket for your central logging in the logging account
  2. Create an AWS Lambda function for log data transformation and decoding in logging account
  3. Create a central logging stack as a logging-account destination ready to receive streamed logs and deliver them to S3
  4. Create a subscription in application accounts to deliver logs from a specific CloudWatch log group to the logging account destination
  5. Create Amazon Athena tables to query and analyze log data in your logging account

Creating a log destination in your logging account

In this section, we will setup the logging account side of the solution, providing detail on the list above. The example I use is for the us-east-1 region, however any region where required services are available could be used.

It’s important to note that your logging-account destination and application-account subscription must be in the same region. You can deploy the solution multiple times to create destinations in all required regions if application accounts use multiple regions.

Step 1: Create an S3 bucket

Use the CloudFormation template below to create S3 bucket in logging account. This template also configures the bucket to archive log data to Glacier after 60 days.


{
  "AWSTemplateFormatVersion":"2010-09-09",
  "Description": "CF Template to create S3 bucket for central logging",
  "Parameters":{

    "BucketName":{
      "Type":"String",
      "Default":"",
      "Description":"Central logging bucket name"
    }
  },
  "Resources":{
                        
   "CentralLoggingBucket" : {
      "Type" : "AWS::S3::Bucket",
      "Properties" : {
        "BucketName" : {"Ref": "BucketName"},
        "LifecycleConfiguration": {
            "Rules": [
                {
                  "Id": "ArchiveToGlacier",
                  "Prefix": "",
                  "Status": "Enabled",
                  "Transitions":[{
                      "TransitionInDays": "60",
                      "StorageClass": "GLACIER"
                  }]
                }
            ]
        }
      }
    }

  },
  "Outputs":{
    "CentralLogBucket":{
    	"Description" : "Central log bucket",
    	"Value" : {"Ref": "BucketName"} ,
    	"Export" : { "Name" : "CentralLogBucketName"}
    }
  }
} 

To create your central-logging bucket do the following:

  1. Save the template file to your local developer machine as “central-log-bucket.json”
  2. From the CloudFormation console, select “create new stack” and import the file “central-log-bucket.json”
  3. Fill in the parameters and complete stack creation steps (as indicated in the screenshot below)
  4. Verify the bucket has been created successfully and take a note of the bucket name

Step 2: Create data processing Lambda function

Use the template below to create a Lambda function in your logging account that will be used by Amazon Firehose for data transformation during the delivery process to S3. This function is based on the AWS Lambda kinesis-firehose-cloudwatch-logs-processor blueprint.

The function could be created manually from the blueprint or using the cloudformation template below. To find the blueprint navigate to Lambda -> Create -> Function -> Blueprints

This function will unzip the event message, parse it and verify that it is a valid CloudWatch log event. Additional processing can be added if needed. As this function is generic, it could be reused by all log-delivery streams.

{
  "AWSTemplateFormatVersion":"2010-09-09",
  "Description": "Create cloudwatch data processing lambda function",
  "Resources":{
      
    "LambdaRole": {
        "Type": "AWS::IAM::Role",
        "Properties": {
            "AssumeRolePolicyDocument": {
                "Version": "2012-10-17",
                "Statement": [
                    {
                        "Effect": "Allow",
                        "Principal": {
                            "Service": "lambda.amazonaws.com"
                        },
                        "Action": "sts:AssumeRole"
                    }
                ]
            },
            "Path": "/",
            "Policies": [
                {
                    "PolicyName": "firehoseCloudWatchDataProcessing",
                    "PolicyDocument": {
                        "Version": "2012-10-17",
                        "Statement": [
                            {
                                "Effect": "Allow",
                                "Action": [
                                    "logs:CreateLogGroup",
                                    "logs:CreateLogStream",
                                    "logs:PutLogEvents"
                                ],
                                "Resource": "arn:aws:logs:*:*:*"
                            }
                        ]
                    }
                }
            ]
        }
    },
      
    "FirehoseDataProcessingFunction": {
        "Type": "AWS::Lambda::Function",
        "Properties": {
            "Handler": "index.handler",
            "Role": {"Fn::GetAtt": ["LambdaRole","Arn"]},
            "Description": "Firehose cloudwatch data processing",
            "Code": {
                "ZipFile" : { "Fn::Join" : ["\n", [
                  "'use strict';",
                  "const zlib = require('zlib');",
                  "function transformLogEvent(logEvent) {",
                  "       return Promise.resolve(`${logEvent.message}\n`);",
                  "}",
                  "exports.handler = (event, context, callback) => {",
                  "    Promise.all(event.records.map(r => {",
                  "        const buffer = new Buffer(r.data, 'base64');",
                  "        const decompressed = zlib.gunzipSync(buffer);",
                  "        const data = JSON.parse(decompressed);",
                  "        if (data.messageType !== 'DATA_MESSAGE') {",
                  "            return Promise.resolve({",
                  "                recordId: r.recordId,",
                  "                result: 'ProcessingFailed',",
                  "            });",
                  "         } else {",
                  "            const promises = data.logEvents.map(transformLogEvent);",
                  "            return Promise.all(promises).then(transformed => {",
                  "                const payload = transformed.reduce((a, v) => a + v, '');",
                  "                const encoded = new Buffer(payload).toString('base64');",
                  "                console.log('---------------payloadv2:'+JSON.stringify(payload, null, 2));",
                  "                return {",
                  "                    recordId: r.recordId,",
                  "                    result: 'Ok',",
                  "                    data: encoded,",
                  "                };",
                  "           });",
                  "        }",
                  "    })).then(recs => callback(null, { records: recs }));",
                    "};"

                ]]}
            },
            "Runtime": "nodejs6.10",
            "Timeout": "60"
        }
    }

  },
  "Outputs":{
   "Function" : {
      "Description": "Function ARN",
      "Value": {"Fn::GetAtt": ["FirehoseDataProcessingFunction","Arn"]},
      "Export" : { "Name" : {"Fn::Sub": "${AWS::StackName}-Function" }}
    }
  }
}

To create the function follow the steps below:

  1. Save the template file as “central-logging-lambda.json”
  2. Login to logging account and, from the CloudFormation console, select “create new stack”
  3. Import the file “central-logging-lambda.json” and click next
  4. Follow the steps to create the stack and verify successful creation
  5. Take a note of Lambda function arn from the output section

Step 3: Create log destination in logging account

Log destination is used as the target of a subscription from application accounts, log destination can be shared between multiple subscriptions however according to the architecture suggested in this solution all logs streamed to the same destination will be stored in the same S3 location, if you would like to store log data in different hierarchy or in a completely different bucket you need to create separate destinations.

As noted previously, your destination and subscription have to be in the same region

Use the template below to create destination stack in logging account.

{
  "AWSTemplateFormatVersion":"2010-09-09",
  "Description": "Create log destination and required resources",
  "Parameters":{

    "LogBucketName":{
      "Type":"String",
      "Default":"central-log-do-not-delete",
      "Description":"Destination logging bucket"
    },
    "LogS3Location":{
      "Type":"String",
      "Default":"<BU>/<ENV>/<SOURCE_ACCOUNT>/<LOG_TYPE>/",
      "Description":"S3 location for the logs streamed to this destination; example marketing/prod/999999999999/flow-logs/"
    },
    "ProcessingLambdaARN":{
      "Type":"String",
      "Default":"",
      "Description":"CloudWatch logs data processing function"
    },
    "SourceAccount":{
      "Type":"String",
      "Default":"",
      "Description":"Source application account number"
    }
  },
    
  "Resources":{
    "MyStream": {
      "Type": "AWS::Kinesis::Stream",
      "Properties": {
        "Name": {"Fn::Join" : [ "", [{ "Ref" : "AWS::StackName" },"-Stream"] ]},
        "RetentionPeriodHours" : 48,
        "ShardCount": 1,
        "Tags": [
          {
            "Key": "Solution",
            "Value": "CentralLogging"
          }
       ]
      }
    },
    "LogRole" : {
      "Type"  : "AWS::IAM::Role",
      "Properties" : {
          "AssumeRolePolicyDocument" : {
              "Statement" : [ {
                  "Effect" : "Allow",
                  "Principal" : {
                      "Service" : [ {"Fn::Join": [ "", [ "logs.", { "Ref": "AWS::Region" }, ".amazonaws.com" ] ]} ]
                  },
                  "Action" : [ "sts:AssumeRole" ]
              } ]
          },         
          "Path" : "/service-role/"
      }
    },
      
    "LogRolePolicy" : {
        "Type" : "AWS::IAM::Policy",
        "Properties" : {
            "PolicyName" : {"Fn::Join" : [ "", [{ "Ref" : "AWS::StackName" },"-LogPolicy"] ]},
            "PolicyDocument" : {
              "Version": "2012-10-17",
              "Statement": [
                {
                  "Effect": "Allow",
                  "Action": ["kinesis:PutRecord"],
                  "Resource": [{ "Fn::GetAtt" : ["MyStream", "Arn"] }]
                },
                {
                  "Effect": "Allow",
                  "Action": ["iam:PassRole"],
                  "Resource": [{ "Fn::GetAtt" : ["LogRole", "Arn"] }]
                }
              ]
            },
            "Roles" : [ { "Ref" : "LogRole" } ]
        }
    },
      
    "LogDestination" : {
      "Type" : "AWS::Logs::Destination",
      "DependsOn" : ["MyStream","LogRole","LogRolePolicy"],
      "Properties" : {
        "DestinationName": {"Fn::Join" : [ "", [{ "Ref" : "AWS::StackName" },"-Destination"] ]},
        "RoleArn": { "Fn::GetAtt" : ["LogRole", "Arn"] },
        "TargetArn": { "Fn::GetAtt" : ["MyStream", "Arn"] },
        "DestinationPolicy": { "Fn::Join" : ["",[
		
				"{\"Version\" : \"2012-10-17\",\"Statement\" : [{\"Effect\" : \"Allow\",",
                " \"Principal\" : {\"AWS\" : \"", {"Ref":"SourceAccount"} ,"\"},",
                "\"Action\" : \"logs:PutSubscriptionFilter\",",
                " \"Resource\" : \"", 
                {"Fn::Join": [ "", [ "arn:aws:logs:", { "Ref": "AWS::Region" }, ":" ,{ "Ref": "AWS::AccountId" }, ":destination:",{ "Ref" : "AWS::StackName" },"-Destination" ] ]}  ,"\"}]}"

			]]}
          
          
      }
    },
      
    "S3deliveryStream": {
      "DependsOn": ["S3deliveryRole", "S3deliveryPolicy"],
      "Type": "AWS::KinesisFirehose::DeliveryStream",
      "Properties": {
        "DeliveryStreamName": {"Fn::Join" : [ "", [{ "Ref" : "AWS::StackName" },"-DeliveryStream"] ]},
        "DeliveryStreamType": "KinesisStreamAsSource",
        "KinesisStreamSourceConfiguration": {
            "KinesisStreamARN": { "Fn::GetAtt" : ["MyStream", "Arn"] },
            "RoleARN": {"Fn::GetAtt" : ["S3deliveryRole", "Arn"] }
        },
        "ExtendedS3DestinationConfiguration": {
          "BucketARN": {"Fn::Join" : [ "", ["arn:aws:s3:::",{"Ref":"LogBucketName"}] ]},
          "BufferingHints": {
            "IntervalInSeconds": "60",
            "SizeInMBs": "50"
          },
          "CompressionFormat": "UNCOMPRESSED",
          "Prefix": {"Ref": "LogS3Location"},
          "RoleARN": {"Fn::GetAtt" : ["S3deliveryRole", "Arn"] },
          "ProcessingConfiguration" : {
              "Enabled": "true",
              "Processors": [
              {
                "Parameters": [ 
                { 
                    "ParameterName": "LambdaArn",
                    "ParameterValue": {"Ref":"ProcessingLambdaARN"}
                }],
                "Type": "Lambda"
              }]
          }
        }

      }
    },
      
    "S3deliveryRole": {
      "Type": "AWS::IAM::Role",
      "Properties": {
        "AssumeRolePolicyDocument": {
          "Version": "2012-10-17",
          "Statement": [
            {
              "Sid": "",
              "Effect": "Allow",
              "Principal": {
                "Service": "firehose.amazonaws.com"
              },
              "Action": "sts:AssumeRole",
              "Condition": {
                "StringEquals": {
                  "sts:ExternalId": {"Ref":"AWS::AccountId"}
                }
              }
            }
          ]
        }
      }
    },
      
    "S3deliveryPolicy": {
      "Type": "AWS::IAM::Policy",
      "Properties": {
        "PolicyName": {"Fn::Join" : [ "", [{ "Ref" : "AWS::StackName" },"-FirehosePolicy"] ]},
        "PolicyDocument": {
          "Version": "2012-10-17",
          "Statement": [
            {
              "Effect": "Allow",
              "Action": [
                "s3:AbortMultipartUpload",
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads",
                "s3:PutObject"
              ],
              "Resource": [
                {"Fn::Join": ["", [ {"Fn::Join" : [ "", ["arn:aws:s3:::",{"Ref":"LogBucketName"}] ]}]]},
                {"Fn::Join": ["", [ {"Fn::Join" : [ "", ["arn:aws:s3:::",{"Ref":"LogBucketName"}] ]}, "*"]]}
              ]
            },
            {
              "Effect": "Allow",
              "Action": [
                "lambda:InvokeFunction",
                "lambda:GetFunctionConfiguration",
                "logs:PutLogEvents",
                "kinesis:DescribeStream",
                "kinesis:GetShardIterator",
                "kinesis:GetRecords",
                "kms:Decrypt"
              ],
              "Resource": "*"
            }
          ]
        },
        "Roles": [{"Ref": "S3deliveryRole"}]
      }
    }

  },
  "Outputs":{
      
   "Destination" : {
      "Description": "Destination",
      "Value": {"Fn::Join": [ "", [ "arn:aws:logs:", { "Ref": "AWS::Region" }, ":" ,{ "Ref": "AWS::AccountId" }, ":destination:",{ "Ref" : "AWS::StackName" },"-Destination" ] ]},
      "Export" : { "Name" : {"Fn::Sub": "${AWS::StackName}-Destination" }}
    }

  }
} 

To create log your destination and all required resources, follow these steps:

  1. Save your template as “central-logging-destination.json”
  2. Login to your logging account and, from the CloudFormation console, select “create new stack”
  3. Import the file “central-logging-destination.json” and click next
  4. Fill in the parameters to configure the log destination and click Next
  5. Follow the default steps to create the stack and verify successful creation
    1. Bucket name is the same as in the “create central logging bucket” step
    2. LogS3Location is the directory hierarchy for saving log data that will be delivered to this destination
    3. ProcessingLambdaARN is as created in “create data processing Lambda function” step
    4. SourceAccount is the application account number where the subscription will be created
  6. Take a note of destination ARN as it appears in outputs section as you did above.

Step 4: Create the log subscription in your application account

In this section, we will create the subscription filter in one of the application accounts to stream logs from the CloudWatch log group to the log destination that was created in your logging account.

Create log subscription filter

The subscription filter is created between the CloudWatch log group and a destination endpoint. Asubscription could be filtered to send part (or all) of the logs in the log group. For example,you can create a subscription filter to stream only flow logs with status REJECT.

Use the CloudFormation template below to create subscription filter. Subscription filter and log destination must be in the same region.

{
  "AWSTemplateFormatVersion":"2010-09-09",
  "Description": "Create log subscription filter for a specific Log Group",
  "Parameters":{

    "DestinationARN":{
      "Type":"String",
      "Default":"",
      "Description":"ARN of logs destination"
    },
    "LogGroupName":{
      "Type":"String",
      "Default":"",
      "Description":"Name of LogGroup to forward logs from"
    },
    "FilterPattern":{
      "Type":"String",
      "Default":"",
      "Description":"Filter pattern to filter events to be sent to log destination; Leave empty to send all logs"
    }
  },
    
  "Resources":{
    "SubscriptionFilter" : {
      "Type" : "AWS::Logs::SubscriptionFilter",
      "Properties" : {
        "LogGroupName" : { "Ref" : "LogGroupName" },
        "FilterPattern" : { "Ref" : "FilterPattern" },
        "DestinationArn" : { "Ref" : "DestinationARN" }
      }
    }
  }
}

To create a subscription filter for one of CloudWatch log groups in your application account, follow the steps below:

  1. Save the template as “central-logging-subscription.json”
  2. Login to your application account and, from the CloudFormation console, select “create new stack”
  3. Select the file “central-logging-subscription.json” and click next
  4. Fill in the parameters as appropriate to your environment as you did above
    a.  DestinationARN is the value of obtained in “create log destination in logging account” step
    b.  FilterPatterns is the filter value for log data to be streamed to your logging account (leave empty to stream all logs in the selected log group)
    c.  LogGroupName is the log group as it appears under CloudWatch Logs
  5. Verify successful creation of the subscription

This completes the deployment process in both the logging- and application-account side. After a few minutes, log data will be streamed to the central-logging destination defined in your logging account.

Step 5: Analyzing log data

Once log data is centralized, it opens the door to run analytics on the consolidated data for business or security reasons. One of the powerful services that AWS offers is Amazon Athena.

Amazon Athena allows you to query data in S3 using standard SQL.

Follow the steps below to create a simple table and run queries on the flow logs data that has been collected from your application accounts

  1. Login to your logging account and from the Amazon Athena console, use the DDL below in your query  editor to create a new table

CREATE EXTERNAL TABLE IF NOT EXISTS prod_vpc_flow_logs (

Version INT,

Account STRING,

InterfaceId STRING,

SourceAddress STRING,

DestinationAddress STRING,

SourcePort INT,

DestinationPort INT,

Protocol INT,

Packets INT,

Bytes INT,

StartTime INT,

EndTime INT,

Action STRING,

LogStatus STRING

)

ROW FORMAT SERDE ‘org.apache.hadoop.hive.serde2.RegexSerDe’

WITH SERDEPROPERTIES (

“input.regex” = “^([^ ]+)\\s+([0-9]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([0-9]+)\\s+([0-9]+)\\s+([^ ]+)\\s+([^ ]+)$”)

LOCATION ‘s3://central-logging-company-do-not-delete/’;

2. Click ”run query” and verify a successful run/ This creates the table “prod_vpc_flow_logs”

3. You can then run queries against the table data as below:

Conclusion

By following the steps I’ve outlined, you will build a central logging solution to stream CloudWatch logs from one application account to a central logging account. This solution is repeatable and could be deployed multiple times for multiple accounts and logging requirements.

 

About the Author

Mahmoud Matouk is a Senior Cloud Infrastructure Architect. He works with our customers to help accelerate migration and cloud adoption at the enterprise level.

 

Migrating Your Amazon ECS Containers to AWS Fargate

Post Syndicated from Tiffany Jernigan original https://aws.amazon.com/blogs/compute/migrating-your-amazon-ecs-containers-to-aws-fargate/

AWS Fargate is a new technology that works with Amazon Elastic Container Service (ECS) to run containers without having to manage servers or clusters. What does this mean? With Fargate, you no longer need to provision or manage a single virtual machine; you can just create tasks and run them directly!

Fargate uses the same API actions as ECS, so you can use the ECS console, the AWS CLI, or the ECS CLI. I recommend running through the first-run experience for Fargate even if you’re familiar with ECS. It creates all of the one-time setup requirements, such as the necessary IAM roles. If you’re using a CLI, make sure to upgrade to the latest version

In this blog, you will see how to migrate ECS containers from running on Amazon EC2 to Fargate.

Getting started

Note: Anything with code blocks is a change in the task definition file. Screen captures are from the console. Additionally, Fargate is currently available in the us-east-1 (N. Virginia) region.

Launch type

When you create tasks (grouping of containers) and clusters (grouping of tasks), you now have two launch type options: EC2 and Fargate. The default launch type, EC2, is ECS as you knew it before the announcement of Fargate. You need to specify Fargate as the launch type when running a Fargate task.

Even though Fargate abstracts away virtual machines, tasks still must be launched into a cluster. With Fargate, clusters are a logical infrastructure and permissions boundary that allow you to isolate and manage groups of tasks. ECS also supports heterogeneous clusters that are made up of tasks running on both EC2 and Fargate launch types.

The optional, new requiresCompatibilities parameter with FARGATE in the field ensures that your task definition only passes validation if you include Fargate-compatible parameters. Tasks can be flagged as compatible with EC2, Fargate, or both.

"requiresCompatibilities": [
    "FARGATE"
]

Networking

"networkMode": "awsvpc"

In November, we announced the addition of task networking with the network mode awsvpc. By default, ECS uses the bridge network mode. Fargate requires using the awsvpc network mode.

In bridge mode, all of your tasks running on the same instance share the instance’s elastic network interface, which is a virtual network interface, IP address, and security groups.

The awsvpc mode provides this networking support to your tasks natively. You now get the same VPC networking and security controls at the task level that were previously only available with EC2 instances. Each task gets its own elastic networking interface and IP address so that multiple applications or copies of a single application can run on the same port number without any conflicts.

The awsvpc mode also provides a separation of responsibility for tasks. You can get complete control of task placement within your own VPCs, subnets, and the security policies associated with them, even though the underlying infrastructure is managed by Fargate. Also, you can assign different security groups to each task, which gives you more fine-grained security. You can give an application only the permissions it needs.

"portMappings": [
    {
        "containerPort": "3000"
    }
 ]

What else has to change? First, you only specify a containerPort value, not a hostPort value, as there is no host to manage. Your container port is the port that you access on your elastic network interface IP address. Therefore, your container ports in a single task definition file need to be unique.

"environment": [
    {
        "name": "WORDPRESS_DB_HOST",
        "value": "127.0.0.1:3306"
    }
 ]

Additionally, links are not allowed as they are a property of the “bridge” network mode (and are now a legacy feature of Docker). Instead, containers share a network namespace and communicate with each other over the localhost interface. They can be referenced using the following:

localhost/127.0.0.1:<some_port_number>

CPU and memory

"memory": "1024",
 "cpu": "256"

"memory": "1gb",
 "cpu": ".25vcpu"

When launching a task with the EC2 launch type, task performance is influenced by the instance types that you select for your cluster combined with your task definition. If you pick larger instances, your applications make use of the extra resources if there is no contention.

In Fargate, you needed a way to get additional resource information so we created task-level resources. Task-level resources define the maximum amount of memory and cpu that your task can consume.

  • memory can be defined in MB with just the number, or in GB, for example, “1024” or “1gb”.
  • cpu can be defined as the number or in vCPUs, for example, “256” or “.25vcpu”.
    • vCPUs are virtual CPUs. You can look at the memory and vCPUs for instance types to get an idea of what you may have used before.

The memory and CPU options available with Fargate are:

CPU Memory
256 (.25 vCPU) 0.5GB, 1GB, 2GB
512 (.5 vCPU) 1GB, 2GB, 3GB, 4GB
1024 (1 vCPU) 2GB, 3GB, 4GB, 5GB, 6GB, 7GB, 8GB
2048 (2 vCPU) Between 4GB and 16GB in 1GB increments
4096 (4 vCPU) Between 8GB and 30GB in 1GB increments

IAM roles

Because Fargate uses awsvpc mode, you need an Amazon ECS service-linked IAM role named AWSServiceRoleForECS. It provides Fargate with the needed permissions, such as the permission to attach an elastic network interface to your task. After you create your service-linked IAM role, you can delete the remaining roles in your services.

"executionRoleArn": "arn:aws:iam::<your_account_id>:role/ecsTaskExecutionRole"

With the EC2 launch type, an instance role gives the agent the ability to pull, publish, talk to ECS, and so on. With Fargate, the task execution IAM role is only needed if you’re pulling from Amazon ECR or publishing data to Amazon CloudWatch Logs.

The Fargate first-run experience tutorial in the console automatically creates these roles for you.

Volumes

Fargate currently supports non-persistent, empty data volumes for containers. When you define your container, you no longer use the host field and only specify a name.

Load balancers

For awsvpc mode, and therefore for Fargate, use the IP target type instead of the instance target type. You define this in the Amazon EC2 service when creating a load balancer.

If you’re using a Classic Load Balancer, change it to an Application Load Balancer or a Network Load Balancer.

Tip: If you are using an Application Load Balancer, make sure that your tasks are launched in the same VPC and Availability Zones as your load balancer.

Let’s migrate a task definition!

Here is an example NGINX task definition. This type of task definition is what you’re used to if you created one before Fargate was announced. It’s what you would run now with the EC2 launch type.

{
    "containerDefinitions": [
        {
            "name": "nginx",
            "image": "nginx",
            "memory": "512",
            "cpu": "100",
            "essential": true,
            "portMappings": [
                {
                    "hostPort": "80",
                    "containerPort": "80",
                    "protocol": "tcp"
                }
            ],
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "/ecs/",
                    "awslogs-region": "us-east-1",
                    "awslogs-stream-prefix": "ecs"
                }
            }
        }
    ],
    "family": "nginx-ec2"
}

OK, so now what do you need to do to change it to run with the Fargate launch type?

  • Add FARGATE for requiredCompatibilities (not required, but a good safety check for your task definition).
  • Use awsvpc as the network mode.
  • Just specify the containerPort (the hostPortvalue is the same).
  • Add a task executionRoleARN value to allow logging to CloudWatch.
  • Provide cpu and memory limits for the task.
{
    "requiresCompatibilities": [
        "FARGATE"
    ],
    "containerDefinitions": [
        {
            "name": "nginx",
            "image": "nginx",
            "memory": "512",
            "cpu": "100",
            "essential": true,
            "portMappings": [
                {
                    "containerPort": "80",
                    "protocol": "tcp"
                }
            ],
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "/ecs/",
                    "awslogs-region": "us-east-1",
                    "awslogs-stream-prefix": "ecs"
                }
            }
        }
    ],
    "networkMode": "awsvpc",
    "executionRoleArn": "arn:aws:iam::<your_account_id>:role/ecsTaskExecutionRole",
    "family": "nginx-fargate",
    "memory": "512",
    "cpu": "256"
}

Are there more examples?

Yep! Head to the AWS Samples GitHub repo. We have several sample task definitions you can try for both the EC2 and Fargate launch types. Contributions are very welcome too :).

 

tiffany jernigan
@tiffanyfayj

AWS Updated Its ISO Certifications and Now Has 67 Services Under ISO Compliance

Post Syndicated from Chad Woolf original https://aws.amazon.com/blogs/security/aws-updated-its-iso-certifications-and-now-has-67-services-under-iso-compliance/

ISO logo

AWS has updated its certifications against ISO 9001, ISO 27001, ISO 27017, and ISO 27018 standards, bringing the total to 67 services now under ISO compliance. We added the following 29 services this cycle:

Amazon Aurora Amazon S3 Transfer Acceleration AWS [email protected]
Amazon Cloud Directory Amazon SageMaker AWS Managed Services
Amazon CloudWatch Logs Amazon Simple Notification Service AWS OpsWorks Stacks
Amazon Cognito Auto Scaling AWS Shield
Amazon Connect AWS Batch AWS Snowball Edge
Amazon Elastic Container Registry AWS CodeBuild AWS Snowmobile
Amazon Inspector AWS CodeCommit AWS Step Functions
Amazon Kinesis Data Streams AWS CodeDeploy AWS Systems Manager (formerly Amazon EC2 Systems Manager)
Amazon Macie AWS CodePipeline AWS X-Ray
Amazon QuickSight AWS IoT Core

For the complete list of services under ISO compliance, see AWS Services in Scope by Compliance Program.

AWS maintains certifications through extensive audits of its controls to ensure that information security risks that affect the confidentiality, integrity, and availability of company and customer information are appropriately managed.

You can download copies of the AWS ISO certificates that contain AWS’s in-scope services and Regions, and use these certificates to jump-start your own certification efforts:

AWS does not increase service costs in any AWS Region as a result of updating its certifications.

To learn more about compliance in the AWS Cloud, see AWS Cloud Compliance.

– Chad

Now Open AWS EU (Paris) Region

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/now-open-aws-eu-paris-region/

Today we are launching our 18th AWS Region, our fourth in Europe. Located in the Paris area, AWS customers can use this Region to better serve customers in and around France.

The Details
The new EU (Paris) Region provides a broad suite of AWS services including Amazon API Gateway, Amazon Aurora, Amazon CloudFront, Amazon CloudWatch, CloudWatch Events, Amazon CloudWatch Logs, Amazon DynamoDB, Amazon Elastic Compute Cloud (EC2), EC2 Container Registry, Amazon ECS, Amazon Elastic Block Store (EBS), Amazon EMR, Amazon ElastiCache, Amazon Elasticsearch Service, Amazon Glacier, Amazon Kinesis Streams, Polly, Amazon Redshift, Amazon Relational Database Service (RDS), Amazon Route 53, Amazon Simple Notification Service (SNS), Amazon Simple Queue Service (SQS), Amazon Simple Storage Service (S3), Amazon Simple Workflow Service (SWF), Amazon Virtual Private Cloud, Auto Scaling, AWS Certificate Manager (ACM), AWS CloudFormation, AWS CloudTrail, AWS CodeDeploy, AWS Config, AWS Database Migration Service, AWS Direct Connect, AWS Elastic Beanstalk, AWS Identity and Access Management (IAM), AWS Key Management Service (KMS), AWS Lambda, AWS Marketplace, AWS OpsWorks Stacks, AWS Personal Health Dashboard, AWS Server Migration Service, AWS Service Catalog, AWS Shield Standard, AWS Snowball, AWS Snowball Edge, AWS Snowmobile, AWS Storage Gateway, AWS Support (including AWS Trusted Advisor), Elastic Load Balancing, and VM Import.

The Paris Region supports all sizes of C5, M5, R4, T2, D2, I3, and X1 instances.

There are also four edge locations for Amazon Route 53 and Amazon CloudFront: three in Paris and one in Marseille, all with AWS WAF and AWS Shield. Check out the AWS Global Infrastructure page to learn more about current and future AWS Regions.

The Paris Region will benefit from three AWS Direct Connect locations. Telehouse Voltaire is available today. AWS Direct Connect will also become available at Equinix Paris in early 2018, followed by Interxion Paris.

All AWS infrastructure regions around the world are designed, built, and regularly audited to meet the most rigorous compliance standards and to provide high levels of security for all AWS customers. These include ISO 27001, ISO 27017, ISO 27018, SOC 1 (Formerly SAS 70), SOC 2 and SOC 3 Security & Availability, PCI DSS Level 1, and many more. This means customers benefit from all the best practices of AWS policies, architecture, and operational processes built to satisfy the needs of even the most security sensitive customers.

AWS is certified under the EU-US Privacy Shield, and the AWS Data Processing Addendum (DPA) is GDPR-ready and available now to all AWS customers to help them prepare for May 25, 2018 when the GDPR becomes enforceable. The current AWS DPA, as well as the AWS GDPR DPA, allows customers to transfer personal data to countries outside the European Economic Area (EEA) in compliance with European Union (EU) data protection laws. AWS also adheres to the Cloud Infrastructure Service Providers in Europe (CISPE) Code of Conduct. The CISPE Code of Conduct helps customers ensure that AWS is using appropriate data protection standards to protect their data, consistent with the GDPR. In addition, AWS offers a wide range of services and features to help customers meet the requirements of the GDPR, including services for access controls, monitoring, logging, and encryption.

From Our Customers
Many AWS customers are preparing to use this new Region. Here’s a small sample:

Societe Generale, one of the largest banks in France and the world, has accelerated their digital transformation while working with AWS. They developed SG Research, an application that makes reports from Societe Generale’s analysts available to corporate customers in order to improve the decision-making process for investments. The new AWS Region will reduce latency between applications running in the cloud and in their French data centers.

SNCF is the national railway company of France. Their mobile app, powered by AWS, delivers real-time traffic information to 14 million riders. Extreme weather, traffic events, holidays, and engineering works can cause usage to peak at hundreds of thousands of users per second. They are planning to use machine learning and big data to add predictive features to the app.

Radio France, the French public radio broadcaster, offers seven national networks, and uses AWS to accelerate its innovation and stay competitive.

Les Restos du Coeur, a French charity that provides assistance to the needy, delivering food packages and participating in their social and economic integration back into French society. Les Restos du Coeur is using AWS for its CRM system to track the assistance given to each of their beneficiaries and the impact this is having on their lives.

AlloResto by JustEat (a leader in the French FoodTech industry), is using AWS to to scale during traffic peaks and to accelerate their innovation process.

AWS Consulting and Technology Partners
We are already working with a wide variety of consulting, technology, managed service, and Direct Connect partners in France. Here’s a partial list:

AWS Premier Consulting PartnersAccenture, Capgemini, Claranet, CloudReach, DXC, and Edifixio.

AWS Consulting PartnersABC Systemes, Atos International SAS, CoreExpert, Cycloid, Devoteam, LINKBYNET, Oxalide, Ozones, Scaleo Information Systems, and Sopra Steria.

AWS Technology PartnersAxway, Commerce Guys, MicroStrategy, Sage, Software AG, Splunk, Tibco, and Zerolight.

AWS in France
We have been investing in Europe, with a focus on France, for the last 11 years. We have also been developing documentation and training programs to help our customers to improve their skills and to accelerate their journey to the AWS Cloud.

As part of our commitment to AWS customers in France, we plan to train more than 25,000 people in the coming years, helping them develop highly sought after cloud skills. They will have access to AWS training resources in France via AWS Academy, AWSome days, AWS Educate, and webinars, all delivered in French by AWS Technical Trainers and AWS Certified Trainers.

Use it Today
The EU (Paris) Region is open for business now and you can start using it today!

Jeff;

 

Power data ingestion into Splunk using Amazon Kinesis Data Firehose

Post Syndicated from Tarik Makota original https://aws.amazon.com/blogs/big-data/power-data-ingestion-into-splunk-using-amazon-kinesis-data-firehose/

In late September, during the annual Splunk .conf, Splunk and Amazon Web Services (AWS) jointly announced that Amazon Kinesis Data Firehose now supports Splunk Enterprise and Splunk Cloud as a delivery destination. This native integration between Splunk Enterprise, Splunk Cloud, and Amazon Kinesis Data Firehose is designed to make AWS data ingestion setup seamless, while offering a secure and fault-tolerant delivery mechanism. We want to enable customers to monitor and analyze machine data from any source and use it to deliver operational intelligence and optimize IT, security, and business performance.

With Kinesis Data Firehose, customers can use a fully managed, reliable, and scalable data streaming solution to Splunk. In this post, we tell you a bit more about the Kinesis Data Firehose and Splunk integration. We also show you how to ingest large amounts of data into Splunk using Kinesis Data Firehose.

Push vs. Pull data ingestion

Presently, customers use a combination of two ingestion patterns, primarily based on data source and volume, in addition to existing company infrastructure and expertise:

  1. Pull-based approach: Using dedicated pollers running the popular Splunk Add-on for AWS to pull data from various AWS services such as Amazon CloudWatch or Amazon S3.
  2. Push-based approach: Streaming data directly from AWS to Splunk HTTP Event Collector (HEC) by using AWS Lambda. Examples of applicable data sources include CloudWatch Logs and Amazon Kinesis Data Streams.

The pull-based approach offers data delivery guarantees such as retries and checkpointing out of the box. However, it requires more ops to manage and orchestrate the dedicated pollers, which are commonly running on Amazon EC2 instances. With this setup, you pay for the infrastructure even when it’s idle.

On the other hand, the push-based approach offers a low-latency scalable data pipeline made up of serverless resources like AWS Lambda sending directly to Splunk indexers (by using Splunk HEC). This approach translates into lower operational complexity and cost. However, if you need guaranteed data delivery then you have to design your solution to handle issues such as a Splunk connection failure or Lambda execution failure. To do so, you might use, for example, AWS Lambda Dead Letter Queues.

How about getting the best of both worlds?

Let’s go over the new integration’s end-to-end solution and examine how Kinesis Data Firehose and Splunk together expand the push-based approach into a native AWS solution for applicable data sources.

By using a managed service like Kinesis Data Firehose for data ingestion into Splunk, we provide out-of-the-box reliability and scalability. One of the pain points of the old approach was the overhead of managing the data collection nodes (Splunk heavy forwarders). With the new Kinesis Data Firehose to Splunk integration, there are no forwarders to manage or set up. Data producers (1) are configured through the AWS Management Console to drop data into Kinesis Data Firehose.

You can also create your own data producers. For example, you can drop data into a Firehose delivery stream by using Amazon Kinesis Agent, or by using the Firehose API (PutRecord(), PutRecordBatch()), or by writing to a Kinesis Data Stream configured to be the data source of a Firehose delivery stream. For more details, refer to Sending Data to an Amazon Kinesis Data Firehose Delivery Stream.

You might need to transform the data before it goes into Splunk for analysis. For example, you might want to enrich it or filter or anonymize sensitive data. You can do so using AWS Lambda. In this scenario, Kinesis Data Firehose buffers data from the incoming source data, sends it to the specified Lambda function (2), and then rebuffers the transformed data to the Splunk Cluster. Kinesis Data Firehose provides the Lambda blueprints that you can use to create a Lambda function for data transformation.

Systems fail all the time. Let’s see how this integration handles outside failures to guarantee data durability. In cases when Kinesis Data Firehose can’t deliver data to the Splunk Cluster, data is automatically backed up to an S3 bucket. You can configure this feature while creating the Firehose delivery stream (3). You can choose to back up all data or only the data that’s failed during delivery to Splunk.

In addition to using S3 for data backup, this Firehose integration with Splunk supports Splunk Indexer Acknowledgments to guarantee event delivery. This feature is configured on Splunk’s HTTP Event Collector (HEC) (4). It ensures that HEC returns an acknowledgment to Kinesis Data Firehose only after data has been indexed and is available in the Splunk cluster (5).

Now let’s look at a hands-on exercise that shows how to forward VPC flow logs to Splunk.

How-to guide

To process VPC flow logs, we implement the following architecture.

Amazon Virtual Private Cloud (Amazon VPC) delivers flow log files into an Amazon CloudWatch Logs group. Using a CloudWatch Logs subscription filter, we set up real-time delivery of CloudWatch Logs to an Kinesis Data Firehose stream.

Data coming from CloudWatch Logs is compressed with gzip compression. To work with this compression, we need to configure a Lambda-based data transformation in Kinesis Data Firehose to decompress the data and deposit it back into the stream. Firehose then delivers the raw logs to the Splunk Http Event Collector (HEC).

If delivery to the Splunk HEC fails, Firehose deposits the logs into an Amazon S3 bucket. You can then ingest the events from S3 using an alternate mechanism such as a Lambda function.

When data reaches Splunk (Enterprise or Cloud), Splunk parsing configurations (packaged in the Splunk Add-on for Kinesis Data Firehose) extract and parse all fields. They make data ready for querying and visualization using Splunk Enterprise and Splunk Cloud.

Walkthrough

Install the Splunk Add-on for Amazon Kinesis Data Firehose

The Splunk Add-on for Amazon Kinesis Data Firehose enables Splunk (be it Splunk Enterprise, Splunk App for AWS, or Splunk Enterprise Security) to use data ingested from Amazon Kinesis Data Firehose. Install the Add-on on all the indexers with an HTTP Event Collector (HEC). The Add-on is available for download from Splunkbase.

HTTP Event Collector (HEC)

Before you can use Kinesis Data Firehose to deliver data to Splunk, set up the Splunk HEC to receive the data. From Splunk web, go to the Setting menu, choose Data Inputs, and choose HTTP Event Collector. Choose Global Settings, ensure All tokens is enabled, and then choose Save. Then choose New Token to create a new HEC endpoint and token. When you create a new token, make sure that Enable indexer acknowledgment is checked.

When prompted to select a source type, select aws:cloudwatch:vpcflow.

Create an S3 backsplash bucket

To provide for situations in which Kinesis Data Firehose can’t deliver data to the Splunk Cluster, we use an S3 bucket to back up the data. You can configure this feature to back up all data or only the data that’s failed during delivery to Splunk.

Note: Bucket names are unique. Thus, you can’t use tmak-backsplash-bucket.

aws s3 create-bucket --bucket tmak-backsplash-bucket --create-bucket-configuration LocationConstraint=ap-northeast-1

Create an IAM role for the Lambda transform function

Firehose triggers an AWS Lambda function that transforms the data in the delivery stream. Let’s first create a role for the Lambda function called LambdaBasicRole.

Note: You can also set this role up when creating your Lambda function.

$ aws iam create-role --role-name LambdaBasicRole --assume-role-policy-document file://TrustPolicyForLambda.json

Here is TrustPolicyForLambda.json.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "lambda.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

 

After the role is created, attach the managed Lambda basic execution policy to it.

$ aws iam attach-role-policy 
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole 
  --role-name LambdaBasicRole

 

Create a Firehose Stream

On the AWS console, open the Amazon Kinesis service, go to the Firehose console, and choose Create Delivery Stream.

In the next section, you can specify whether you want to use an inline Lambda function for transformation. Because incoming CloudWatch Logs are gzip compressed, choose Enabled for Record transformation, and then choose Create new.

From the list of the available blueprint functions, choose Kinesis Data Firehose CloudWatch Logs Processor. This function unzips data and place it back into the Firehose stream in compliance with the record transformation output model.

Enter a name for the Lambda function, choose Choose an existing role, and then choose the role you created earlier. Then choose Create Function.

Go back to the Firehose Stream wizard, choose the Lambda function you just created, and then choose Next.

Select Splunk as the destination, and enter your Splunk Http Event Collector information.

Note: Amazon Kinesis Data Firehose requires the Splunk HTTP Event Collector (HEC) endpoint to be terminated with a valid CA-signed certificate matching the DNS hostname used to connect to your HEC endpoint. You receive delivery errors if you are using a self-signed certificate.

In this example, we only back up logs that fail during delivery.

To monitor your Firehose delivery stream, enable error logging. Doing this means that you can monitor record delivery errors.

Create an IAM role for the Firehose stream by choosing Create new, or Choose. Doing this brings you to a new screen. Choose Create a new IAM role, give the role a name, and then choose Allow.

If you look at the policy document, you can see that the role gives Kinesis Data Firehose permission to publish error logs to CloudWatch, execute your Lambda function, and put records into your S3 backup bucket.

You now get a chance to review and adjust the Firehose stream settings. When you are satisfied, choose Create Stream. You get a confirmation once the stream is created and active.

Create a VPC Flow Log

To send events from Amazon VPC, you need to set up a VPC flow log. If you already have a VPC flow log you want to use, you can skip to the “Publish CloudWatch to Kinesis Data Firehose” section.

On the AWS console, open the Amazon VPC service. Then choose VPC, Your VPC, and choose the VPC you want to send flow logs from. Choose Flow Logs, and then choose Create Flow Log. If you don’t have an IAM role that allows your VPC to publish logs to CloudWatch, choose Set Up Permissions and Create new role. Use the defaults when presented with the screen to create the new IAM role.

Once active, your VPC flow log should look like the following.

Publish CloudWatch to Kinesis Data Firehose

When you generate traffic to or from your VPC, the log group is created in Amazon CloudWatch. The new log group has no subscription filter, so set up a subscription filter. Setting this up establishes a real-time data feed from the log group to your Firehose delivery stream.

At present, you have to use the AWS Command Line Interface (AWS CLI) to create a CloudWatch Logs subscription to a Kinesis Data Firehose stream. However, you can use the AWS console to create subscriptions to Lambda and Amazon Elasticsearch Service.

To allow CloudWatch to publish to your Firehose stream, you need to give it permissions.

$ aws iam create-role --role-name CWLtoKinesisFirehoseRole --assume-role-policy-document file://TrustPolicyForCWLToFireHose.json


Here is the content for TrustPolicyForCWLToFireHose.json.

{
  "Statement": {
    "Effect": "Allow",
    "Principal": { "Service": "logs.us-east-1.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }
}

 

Attach the policy to the newly created role.

$ aws iam put-role-policy 
    --role-name CWLtoKinesisFirehoseRole 
    --policy-name Permissions-Policy-For-CWL 
    --policy-document file://PermissionPolicyForCWLToFireHose.json

Here is the content for PermissionPolicyForCWLToFireHose.json.

{
    "Statement":[
      {
        "Effect":"Allow",
        "Action":["firehose:*"],
        "Resource":["arn:aws:firehose:us-east-1:YOUR-AWS-ACCT-NUM:deliverystream/ FirehoseSplunkDeliveryStream"]
      },
      {
        "Effect":"Allow",
        "Action":["iam:PassRole"],
        "Resource":["arn:aws:iam::YOUR-AWS-ACCT-NUM:role/CWLtoKinesisFirehoseRole"]
      }
    ]
}

Finally, create a subscription filter.

$ aws logs put-subscription-filter 
   --log-group-name " /vpc/flowlog/FirehoseSplunkDemo" 
   --filter-name "Destination" 
   --filter-pattern "" 
   --destination-arn "arn:aws:firehose:us-east-1:YOUR-AWS-ACCT-NUM:deliverystream/FirehoseSplunkDeliveryStream" 
   --role-arn "arn:aws:iam::YOUR-AWS-ACCT-NUM:role/CWLtoKinesisFirehoseRole"

When you run the AWS CLI command preceding, you don’t get any acknowledgment. To validate that your CloudWatch Log Group is subscribed to your Firehose stream, check the CloudWatch console.

As soon as the subscription filter is created, the real-time log data from the log group goes into your Firehose delivery stream. Your stream then delivers it to your Splunk Enterprise or Splunk Cloud environment for querying and visualization. The screenshot following is from Splunk Enterprise.

In addition, you can monitor and view metrics associated with your delivery stream using the AWS console.

Conclusion

Although our walkthrough uses VPC Flow Logs, the pattern can be used in many other scenarios. These include ingesting data from AWS IoT, other CloudWatch logs and events, Kinesis Streams or other data sources using the Kinesis Agent or Kinesis Producer Library. We also used Lambda blueprint Kinesis Data Firehose CloudWatch Logs Processor to transform streaming records from Kinesis Data Firehose. However, you might need to use a different Lambda blueprint or disable record transformation entirely depending on your use case. For an additional use case using Kinesis Data Firehose, check out This is My Architecture Video, which discusses how to securely centralize cross-account data analytics using Kinesis and Splunk.

 


Additional Reading

If you found this post useful, be sure to check out Integrating Splunk with Amazon Kinesis Streams and Using Amazon EMR and Hunk for Rapid Response Log Analysis and Review.


About the Authors

Tarik Makota is a solutions architect with the Amazon Web Services Partner Network. He provides technical guidance, design advice and thought leadership to AWS’ most strategic software partners. His career includes work in an extremely broad software development and architecture roles across ERP, financial printing, benefit delivery and administration and financial services. He holds an M.S. in Software Development and Management from Rochester Institute of Technology.

 

 

 

Roy Arsan is a solutions architect in the Splunk Partner Integrations team. He has a background in product development, cloud architecture, and building consumer and enterprise cloud applications. More recently, he has architected Splunk solutions on major cloud providers, including an AWS Quick Start for Splunk that enables AWS users to easily deploy distributed Splunk Enterprise straight from their AWS console. He’s also the co-author of the AWS Lambda blueprints for Splunk. He holds an M.S. in Computer Science Engineering from the University of Michigan.

 

 

 

Now Open – AWS China (Ningxia) Region

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/now-open-aws-china-ningxia-region/

Today we launched our 17th Region globally, and the second in China. The AWS China (Ningxia) Region, operated by Ningxia Western Cloud Data Technology Co. Ltd. (NWCD), is generally available now and provides customers another option to run applications and store data on AWS in China.

The Details
At launch, the new China (Ningxia) Region, operated by NWCD, supports Auto Scaling, AWS Config, AWS CloudFormation, AWS CloudTrail, Amazon CloudWatch, CloudWatch Events, Amazon CloudWatch Logs, AWS CodeDeploy, AWS Direct Connect, Amazon DynamoDB, Amazon Elastic Compute Cloud (EC2), Amazon Elastic Block Store (EBS), Amazon EC2 Systems Manager, AWS Elastic Beanstalk, Amazon ElastiCache, Amazon Elasticsearch Service, Elastic Load Balancing, Amazon EMR, Amazon Glacier, AWS Identity and Access Management (IAM), Amazon Kinesis Streams, Amazon Redshift, Amazon Relational Database Service (RDS), Amazon Simple Storage Service (S3), Amazon Simple Notification Service (SNS), Amazon Simple Queue Service (SQS), AWS Support API, AWS Trusted Advisor, Amazon Simple Workflow Service (SWF), Amazon Virtual Private Cloud, and VM Import. Visit the AWS China Products page for additional information on these services.

The Region supports all sizes of C4, D2, M4, T2, R4, I3, and X1 instances.

Check out the AWS Global Infrastructure page to learn more about current and future AWS Regions.

Operating Partner
To comply with China’s legal and regulatory requirements, AWS has formed a strategic technology collaboration with NWCD to operate and provide services from the AWS China (Ningxia) Region. Founded in 2015, NWCD is a licensed datacenter and cloud services provider, based in Ningxia, China. NWCD joins Sinnet, the operator of the AWS China China (Beijing) Region, as an AWS operating partner in China. Through these relationships, AWS provides its industry-leading technology, guidance, and expertise to NWCD and Sinnet, while NWCD and Sinnet operate and provide AWS cloud services to local customers. While the cloud services offered in both AWS China Regions are the same as those available in other AWS Regions, the AWS China Regions are different in that they are isolated from all other AWS Regions and operated by AWS’s Chinese partners separately from all other AWS Regions. Customers using the AWS China Regions enter into customer agreements with Sinnet and NWCD, rather than with AWS.

Use it Today
The AWS China (Ningxia) Region, operated by NWCD, is open for business, and you can start using it now! Starting today, Chinese developers, startups, and enterprises, as well as government, education, and non-profit organizations, can leverage AWS to run their applications and store their data in the new AWS China (Ningxia) Region, operated by NWCD. Customers already using the AWS China (Beijing) Region, operated by Sinnet, can select the AWS China (Ningxia) Region directly from the AWS Management Console, while new customers can request an account at www.amazonaws.cn to begin using both AWS China Regions.

Jeff;

 

 

Managing AWS Lambda Function Concurrency

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/managing-aws-lambda-function-concurrency/

One of the key benefits of serverless applications is the ease in which they can scale to meet traffic demands or requests, with little to no need for capacity planning. In AWS Lambda, which is the core of the serverless platform at AWS, the unit of scale is a concurrent execution. This refers to the number of executions of your function code that are happening at any given time.

Thinking about concurrent executions as a unit of scale is a fairly unique concept. In this post, I dive deeper into this and talk about how you can make use of per function concurrency limits in Lambda.

Understanding concurrency in Lambda

Instead of diving right into the guts of how Lambda works, here’s an appetizing analogy: a magical pizza.
Yes, a magical pizza!

This magical pizza has some unique properties:

  • It has a fixed maximum number of slices, such as 8.
  • Slices automatically re-appear after they are consumed.
  • When you take a slice from the pizza, it does not re-appear until it has been completely consumed.
  • One person can take multiple slices at a time.
  • You can easily ask to have the number of slices increased, but they remain fixed at any point in time otherwise.

Now that the magical pizza’s properties are defined, here’s a hypothetical situation of some friends sharing this pizza.

Shawn, Kate, Daniela, Chuck, Ian and Avleen get together every Friday to share a pizza and catch up on their week. As there is just six of them, they can easily all enjoy a slice of pizza at a time. As they finish each slice, it re-appears in the pizza pan and they can take another slice again. Given the magical properties of their pizza, they can continue to eat all they want, but with two very important constraints:

  • If any of them take too many slices at once, the others may not get as much as they want.
  • If they take too many slices, they might also eat too much and get sick.

One particular week, some of the friends are hungrier than the rest, taking two slices at a time instead of just one. If more than two of them try to take two pieces at a time, this can cause contention for pizza slices. Some of them would wait hungry for the slices to re-appear. They could ask for a pizza with more slices, but then run the same risk again later if more hungry friends join than planned for.

What can they do?

If the friends agreed to accept a limit for the maximum number of slices they each eat concurrently, both of these issues are avoided. Some could have a maximum of 2 of the 8 slices, or other concurrency limits that were more or less. Just so long as they kept it at or under eight total slices to be eaten at one time. This would keep any from going hungry or eating too much. The six friends can happily enjoy their magical pizza without worry!

Concurrency in Lambda

Concurrency in Lambda actually works similarly to the magical pizza model. Each AWS Account has an overall AccountLimit value that is fixed at any point in time, but can be easily increased as needed, just like the count of slices in the pizza. As of May 2017, the default limit is 1000 “slices” of concurrency per AWS Region.

Also like the magical pizza, each concurrency “slice” can only be consumed individually one at a time. After consumption, it becomes available to be consumed again. Services invoking Lambda functions can consume multiple slices of concurrency at the same time, just like the group of friends can take multiple slices of the pizza.

Let’s take our example of the six friends and bring it back to AWS services that commonly invoke Lambda:

  • Amazon S3
  • Amazon Kinesis
  • Amazon DynamoDB
  • Amazon Cognito

In a single account with the default concurrency limit of 1000 concurrent executions, any of these four services could invoke enough functions to consume the entire limit or some part of it. Just like with the pizza example, there is the possibility for two issues to pop up:

  • One or more of these services could invoke enough functions to consume a majority of the available concurrency capacity. This could cause others to be starved for it, causing failed invocations.
  • A service could consume too much concurrent capacity and cause a downstream service or database to be overwhelmed, which could cause failed executions.

For Lambda functions that are launched in a VPC, you have the potential to consume the available IP addresses in a subnet or the maximum number of elastic network interfaces to which your account has access. For more information, see Configuring a Lambda Function to Access Resources in an Amazon VPC. For information about elastic network interface limits, see Network Interfaces section in the Amazon VPC Limits topic.

One way to solve both of these problems is applying a concurrency limit to the Lambda functions in an account.

Configuring per function concurrency limits

You can now set a concurrency limit on individual Lambda functions in an account. The concurrency limit that you set reserves a portion of your account level concurrency for a given function. All of your functions’ concurrent executions count against this account-level limit by default.

If you set a concurrency limit for a specific function, then that function’s concurrency limit allocation is deducted from the shared pool and assigned to that specific function. AWS also reserves 100 units of concurrency for all functions that don’t have a specified concurrency limit set. This helps to make sure that future functions have capacity to be consumed.

Going back to the example of the consuming services, you could set throttles for the functions as follows:

Amazon S3 function = 350
Amazon Kinesis function = 200
Amazon DynamoDB function = 200
Amazon Cognito function = 150
Total = 900

With the 100 reserved for all non-concurrency reserved functions, this totals the account limit of 1000.

Here’s how this works. To start, create a basic Lambda function that is invoked via Amazon API Gateway. This Lambda function returns a single “Hello World” statement with an added sleep time between 2 and 5 seconds. The sleep time simulates an API providing some sort of capability that can take a varied amount of time. The goal here is to show how an API that is underloaded can reach its concurrency limit, and what happens when it does.
To create the example function

  1. Open the Lambda console.
  2. Choose Create Function.
  3. For Author from scratch, enter the following values:
    1. For Name, enter a value (such as concurrencyBlog01).
    2. For Runtime, choose Python 3.6.
    3. For Role, choose Create new role from template and enter a name aligned with this function, such as concurrencyBlogRole.
  4. Choose Create function.
  5. The function is created with some basic example code. Replace that code with the following:

import time
from random import randint
seconds = randint(2, 5)

def lambda_handler(event, context):
time.sleep(seconds)
return {"statusCode": 200,
"body": ("Hello world, slept " + str(seconds) + " seconds"),
"headers":
{
"Access-Control-Allow-Headers": "Content-Type,X-Amz-Date,Authorization,X-Api-Key,X-Amz-Security-Token",
"Access-Control-Allow-Methods": "GET,OPTIONS",
}}

  1. Under Basic settings, set Timeout to 10 seconds. While this function should only ever take up to 5-6 seconds (with the 5-second max sleep), this gives you a little bit of room if it takes longer.

  1. Choose Save at the top right.

At this point, your function is configured for this example. Test it and confirm this in the console:

  1. Choose Test.
  2. Enter a name (it doesn’t matter for this example).
  3. Choose Create.
  4. In the console, choose Test again.
  5. You should see output similar to the following:

Now configure API Gateway so that you have an HTTPS endpoint to test against.

  1. In the Lambda console, choose Configuration.
  2. Under Triggers, choose API Gateway.
  3. Open the API Gateway icon now shown as attached to your Lambda function:

  1. Under Configure triggers, leave the default values for API Name and Deployment stage. For Security, choose Open.
  2. Choose Add, Save.

API Gateway is now configured to invoke Lambda at the Invoke URL shown under its configuration. You can take this URL and test it in any browser or command line, using tools such as “curl”:


$ curl https://ofixul557l.execute-api.us-east-1.amazonaws.com/prod/concurrencyBlog01
Hello world, slept 2 seconds

Throwing load at the function

Now start throwing some load against your API Gateway + Lambda function combo. Right now, your function is only limited by the total amount of concurrency available in an account. For this example account, you might have 850 unreserved concurrency out of a full account limit of 1000 due to having configured a few concurrency limits already (also the 100 concurrency saved for all functions without configured limits). You can find all of this information on the main Dashboard page of the Lambda console:

For generating load in this example, use an open source tool called “hey” (https://github.com/rakyll/hey), which works similarly to ApacheBench (ab). You test from an Amazon EC2 instance running the default Amazon Linux AMI from the EC2 console. For more help with configuring an EC2 instance, follow the steps in the Launch Instance Wizard.

After the EC2 instance is running, SSH into the host and run the following:


sudo yum install go
go get -u github.com/rakyll/hey

“hey” is easy to use. For these tests, specify a total number of tests (5,000) and a concurrency of 50 against the API Gateway URL as follows(replace the URL here with your own):


$ ./go/bin/hey -n 5000 -c 50 https://ofixul557l.execute-api.us-east-1.amazonaws.com/prod/concurrencyBlog01

The output from “hey” tells you interesting bits of information:


$ ./go/bin/hey -n 5000 -c 50 https://ofixul557l.execute-api.us-east-1.amazonaws.com/prod/concurrencyBlog01

Summary:
Total: 381.9978 secs
Slowest: 9.4765 secs
Fastest: 0.0438 secs
Average: 3.2153 secs
Requests/sec: 13.0891
Total data: 140024 bytes
Size/request: 28 bytes

Response time histogram:
0.044 [1] |
0.987 [2] |
1.930 [0] |
2.874 [1803] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
3.817 [1518] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
4.760 [719] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
5.703 [917] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
6.647 [13] |
7.590 [14] |
8.533 [9] |
9.477 [4] |

Latency distribution:
10% in 2.0224 secs
25% in 2.0267 secs
50% in 3.0251 secs
75% in 4.0269 secs
90% in 5.0279 secs
95% in 5.0414 secs
99% in 5.1871 secs

Details (average, fastest, slowest):
DNS+dialup: 0.0003 secs, 0.0000 secs, 0.0332 secs
DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0046 secs
req write: 0.0000 secs, 0.0000 secs, 0.0005 secs
resp wait: 3.2149 secs, 0.0438 secs, 9.4472 secs
resp read: 0.0000 secs, 0.0000 secs, 0.0004 secs

Status code distribution:
[200] 4997 responses
[502] 3 responses

You can see a helpful histogram and latency distribution. Remember that this Lambda function has a random sleep period in it and so isn’t entirely representational of a real-life workload. Those three 502s warrant digging deeper, but could be due to Lambda cold-start timing and the “second” variable being the maximum of 5, causing the Lambda functions to time out. AWS X-Ray and the Amazon CloudWatch logs generated by both API Gateway and Lambda could help you troubleshoot this.

Configuring a concurrency reservation

Now that you’ve established that you can generate this load against the function, I show you how to limit it and protect a backend resource from being overloaded by all of these requests.

  1. In the console, choose Configure.
  2. Under Concurrency, for Reserve concurrency, enter 25.

  1. Click on Save in the top right corner.

You could also set this with the AWS CLI using the Lambda put-function-concurrency command or see your current concurrency configuration via Lambda get-function. Here’s an example command:


$ aws lambda get-function --function-name concurrencyBlog01 --output json --query Concurrency
{
"ReservedConcurrentExecutions": 25
}

Either way, you’ve set the Concurrency Reservation to 25 for this function. This acts as both a limit and a reservation in terms of making sure that you can execute 25 concurrent functions at all times. Going above this results in the throttling of the Lambda function. Depending on the invoking service, throttling can result in a number of different outcomes, as shown in the documentation on Throttling Behavior. This change has also reduced your unreserved account concurrency for other functions by 25.

Rerun the same load generation as before and see what happens. Previously, you tested at 50 concurrency, which worked just fine. By limiting the Lambda functions to 25 concurrency, you should see rate limiting kick in. Run the same test again:


$ ./go/bin/hey -n 5000 -c 50 https://ofixul557l.execute-api.us-east-1.amazonaws.com/prod/concurrencyBlog01

While this test runs, refresh the Monitoring tab on your function detail page. You see the following warning message:

This is great! It means that your throttle is working as configured and you are now protecting your downstream resources from too much load from your Lambda function.

Here is the output from a new “hey” command:


$ ./go/bin/hey -n 5000 -c 50 https://ofixul557l.execute-api.us-east-1.amazonaws.com/prod/concurrencyBlog01
Summary:
Total: 379.9922 secs
Slowest: 7.1486 secs
Fastest: 0.0102 secs
Average: 1.1897 secs
Requests/sec: 13.1582
Total data: 164608 bytes
Size/request: 32 bytes

Response time histogram:
0.010 [1] |
0.724 [3075] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
1.438 [0] |
2.152 [811] |∎∎∎∎∎∎∎∎∎∎∎
2.866 [11] |
3.579 [566] |∎∎∎∎∎∎∎
4.293 [214] |∎∎∎
5.007 [1] |
5.721 [315] |∎∎∎∎
6.435 [4] |
7.149 [2] |

Latency distribution:
10% in 0.0130 secs
25% in 0.0147 secs
50% in 0.0205 secs
75% in 2.0344 secs
90% in 4.0229 secs
95% in 5.0248 secs
99% in 5.0629 secs

Details (average, fastest, slowest):
DNS+dialup: 0.0004 secs, 0.0000 secs, 0.0537 secs
DNS-lookup: 0.0002 secs, 0.0000 secs, 0.0184 secs
req write: 0.0000 secs, 0.0000 secs, 0.0016 secs
resp wait: 1.1892 secs, 0.0101 secs, 7.1038 secs
resp read: 0.0000 secs, 0.0000 secs, 0.0005 secs

Status code distribution:
[502] 3076 responses
[200] 1924 responses

This looks fairly different from the last load test run. A large percentage of these requests failed fast due to the concurrency throttle failing them (those with the 0.724 seconds line). The timing shown here in the histogram represents the entire time it took to get a response between the EC2 instance and API Gateway calling Lambda and being rejected. It’s also important to note that this example was configured with an edge-optimized endpoint in API Gateway. You see under Status code distribution that 3076 of the 5000 requests failed with a 502, showing that the backend service from API Gateway and Lambda failed the request.

Other uses

Managing function concurrency can be useful in a few other ways beyond just limiting the impact on downstream services and providing a reservation of concurrency capacity. Here are two other uses:

  • Emergency kill switch
  • Cost controls

Emergency kill switch

On occasion, due to issues with applications I’ve managed in the past, I’ve had a need to disable a certain function or capability of an application. By setting the concurrency reservation and limit of a Lambda function to zero, you can do just that.

With the reservation set to zero every invocation of a Lambda function results in being throttled. You could then work on the related parts of the infrastructure or application that aren’t working, and then reconfigure the concurrency limit to allow invocations again.

Cost controls

While I mentioned how you might want to use concurrency limits to control the downstream impact to services or databases that your Lambda function might call, another resource that you might be cautious about is money. Setting the concurrency throttle is another way to help control costs during development and testing of your application.

You might want to prevent against a function performing a recursive action too quickly or a development workload generating too high of a concurrency. You might also want to protect development resources connected to this function from generating too much cost, such as APIs that your Lambda function calls.

Conclusion

Concurrent executions as a unit of scale are a fairly unique characteristic about Lambda functions. Placing limits on how many concurrency “slices” that your function can consume can prevent a single function from consuming all of the available concurrency in an account. Limits can also prevent a function from overwhelming a backend resource that isn’t as scalable.

Unlike monolithic applications or even microservices where there are mixed capabilities in a single service, Lambda functions encourage a sort of “nano-service” of small business logic directly related to the integration model connected to the function. I hope you’ve enjoyed this post and configure your concurrency limits today!

AWS Fargate: A Product Overview

Post Syndicated from Deepak Dayama original https://aws.amazon.com/blogs/compute/aws-fargate-a-product-overview/

It was just about three years ago that AWS announced Amazon Elastic Container Service (Amazon ECS), to run and manage containers at scale on AWS. With Amazon ECS, you’ve been able to run your workloads at high scale and availability without having to worry about running your own cluster management and container orchestration software.

Today, AWS announced the availability of AWS Fargate – a technology that enables you to use containers as a fundamental compute primitive without having to manage the underlying instances. With Fargate, you don’t need to provision, configure, or scale virtual machines in your clusters to run containers. Fargate can be used with Amazon ECS today, with plans to support Amazon Elastic Container Service for Kubernetes (Amazon EKS) in the future.

Fargate has flexible configuration options so you can closely match your application needs and granular, per-second billing.

Amazon ECS with Fargate

Amazon ECS enables you to run containers at scale. This service also provides native integration into the AWS platform with VPC networking, load balancing, IAM, Amazon CloudWatch Logs, and CloudWatch metrics. These deep integrations make the Amazon ECS task a first-class object within the AWS platform.

To run tasks, you first need to stand up a cluster of instances, which involves picking the right types of instances and sizes, setting up Auto Scaling, and right-sizing the cluster for performance. With Fargate, you can leave all that behind and focus on defining your application and policies around permissions and scaling.

The same container management capabilities remain available so you can continue to scale your container deployments. With Fargate, the only entity to manage is the task. You don’t need to manage the instances or supporting software like Docker daemon or the Amazon ECS agent.

Fargate capabilities are available natively within Amazon ECS. This means that you don’t need to learn new API actions or primitives to run containers on Fargate.

Using Amazon ECS, Fargate is a launch type option. You continue to define the applications the same way by using task definitions. In contrast, the EC2 launch type gives you more control of your server clusters and provides a broader range of customization options.

For example, a RunTask command example is pasted below with the Fargate launch type:

ecs run-task --launch-type FARGATE --cluster fargate-test --task-definition nginx --network-configuration
"awsvpcConfiguration={subnets=[subnet-b563fcd3]}"

Key features of Fargate

Resource-based pricing and per second billing
You pay by the task size and only for the time for which resources are consumed by the task. The price for CPU and memory is charged on a per-second basis. There is a one-minute minimum charge.

Flexible configurations options
Fargate is available with 50 different combinations of CPU and memory to closely match your application needs. You can use 2 GB per vCPU anywhere up to 8 GB per vCPU for various configurations. Match your workload requirements closely, whether they are general purpose, compute, or memory optimized.

Networking
All Fargate tasks run within your own VPC. Fargate supports the recently launched awsvpc networking mode and the elastic network interface for a task is visible in the subnet where the task is running. This provides the separation of responsibility so you retain full control of networking policies for your applications via VPC features like security groups, routing rules, and NACLs. Fargate also supports public IP addresses.

Load Balancing
ECS Service Load Balancing  for the Application Load Balancer and Network Load Balancer is supported. For the Fargate launch type, you specify the IP addresses of the Fargate tasks to register with the load balancers.

Permission tiers
Even though there are no instances to manage with Fargate, you continue to group tasks into logical clusters. This allows you to manage who can run or view services within the cluster. The task IAM role is still applicable. Additionally, there is a new Task Execution Role that grants Amazon ECS permissions to perform operations such as pushing logs to CloudWatch Logs or pulling image from Amazon Elastic Container Registry (Amazon ECR).

Container Registry Support
Fargate provides seamless authentication to help pull images from Amazon ECR via the Task Execution Role. Similarly, if you are using a public repository like DockerHub, you can continue to do so.

Amazon ECS CLI
The Amazon ECS CLI provides high-level commands to help simplify to create and run Amazon ECS clusters, tasks, and services. The latest version of the CLI now supports running tasks and services with Fargate.

EC2 and Fargate Launch Type Compatibility
All Amazon ECS clusters are heterogeneous – you can run both Fargate and Amazon ECS tasks in the same cluster. This enables teams working on different applications to choose their own cadence of moving to Fargate, or to select a launch type that meets their requirements without breaking the existing model. You can make an existing ECS task definition compatible with the Fargate launch type and run it as a Fargate service, and vice versa. Choosing a launch type is not a one-way door!

Logging and Visibility
With Fargate, you can send the application logs to CloudWatch logs. Service metrics (CPU and Memory utilization) are available as part of CloudWatch metrics. AWS partners for visibility, monitoring and application performance management including Datadog, Aquasec, Splunk, Twistlock, and New Relic also support Fargate tasks.

Conclusion

Fargate enables you to run containers without having to manage the underlying infrastructure. Today, Fargate is availabe for Amazon ECS, and in 2018, Amazon EKS. Visit the Fargate product page to learn more, or get started in the AWS Console.

–Deepak Dayama

Access Resources in a VPC from AWS CodeBuild Builds

Post Syndicated from John Pignata original https://aws.amazon.com/blogs/devops/access-resources-in-a-vpc-from-aws-codebuild-builds/

John Pignata, Startup Solutions Architect, Amazon Web Services

In this blog post we’re going to discuss a new AWS CodeBuild feature that is available starting today. CodeBuild builds can now access resources in a VPC directly without these resources being exposed to the public internet. These resources include Amazon Relational Database Service (Amazon RDS) databases, Amazon ElastiCache clusters, internal services running on Amazon Elastic Compute Cloud (Amazon EC2), and Amazon EC2 Container Service (Amazon ECS), or any service endpoints that are only reachable from within a specific VPC.

CodeBuild is a fully managed build service that compiles source code, runs tests, and produces software packages that are ready to deploy. As part of the build process, developers often require access to resources that should be isolated from the public Internet. Now CodeBuild builds can be optionally configured to have VPC connectivity and access these resources directly.

Accessing Resources in a VPC

You can configure builds to have access to a VPC when you create a CodeBuild project or you can update an existing CodeBuild project with VPC configuration attributes. Here’s how it looks in the console:

 

To configure VPC connectivity: select a VPC, one or more subnets within that VPC, and one or more VPC security groups that CodeBuild should apply when attaching to your VPC. Once configured, commands running as part of your build will be able to access resources in your VPC without transiting across the public Internet.

Use Cases

The availability of VPC connectivity from CodeBuild builds unlocks many potential uses. For example, you can:

  • Run integration tests from your build against data in an Amazon RDS instance that’s isolated on a private subnet.
  • Query data in an ElastiCache cluster directly from tests.
  • Interact with internal web services hosted on Amazon EC2, Amazon ECS, or services that use internal Elastic Load Balancing.
  • Retrieve dependencies from self-hosted, internal artifact repositories such as PyPI for Python, Maven for Java, npm for Node.js, and so on.
  • Access objects in an Amazon S3 bucket configured to allow access only through a VPC endpoint.
  • Query external web services that require fixed IP addresses through the Elastic IP address of the NAT gateway associated with your subnet(s).

… and more! Your builds can now access any resource that’s hosted in your VPC without any compromise on network isolation.

Internet Connectivity

CodeBuild requires access to resources on the public Internet to successfully execute builds. At a minimum, it must be able to reach your source repository system (such as AWS CodeCommit, GitHub, Bitbucket), Amazon Simple Storage Service (Amazon S3) to deliver build artifacts, and Amazon CloudWatch Logs to stream logs from the build process. The interface attached to your VPC will not be assigned a public IP address so to enable Internet access from your builds, you will need to set up a managed NAT Gateway or NAT instance for the subnets you configure. You must also ensure your security groups allow outbound access to these services.

IP Address Space

Each running build will be assigned an IP address from one of the subnets in your VPC that you designate for CodeBuild to use. As CodeBuild scales to meet your build volume, ensure that you select subnets with enough address space to accommodate your expected number of concurrent builds.

Service Role Permissions

CodeBuild requires new permissions in order to manage network interfaces on your VPCs. If you create a service role for your new projects, these permissions will be included in that role’s policy automatically. For existing service roles, you can edit the policy document to include the additional actions. For the full policy document to apply to your service role, see Advanced Setup in the CodeBuild documentation.

For more information, see VPC Support in the CodeBuild documentation. We hope you find the ability to access internal resources on a VPC useful in your build processes! If you have any questions or feedback, feel free to reach out to us through the AWS CodeBuild forum or leave a comment!

AWS Achieves FedRAMP JAB Moderate Provisional Authorization for 20 Services in the AWS US East/West Region

Post Syndicated from Chris Gile original https://aws.amazon.com/blogs/security/aws-achieves-fedramp-jab-moderate-authorization-for-20-services-in-us-eastwest/

The AWS US East/West Region has received a Provisional Authority to Operate (P-ATO) from the Joint Authorization Board (JAB) at the Federal Risk and Authorization Management Program (FedRAMP) Moderate baseline.

Though AWS has maintained an AWS US East/West Region Agency-ATO since early 2013, this announcement represents AWS’s carefully deliberated move to the JAB for the centralized maintenance of our P-ATO for 10 services already authorized. This also includes the addition of 10 new services to our FedRAMP program (see the complete list of services below). This doubles the number of FedRAMP Moderate services available to our customers to enable increased use of the cloud and support modernized IT missions. Our public sector customers now can leverage this FedRAMP P-ATO as a baseline for their own authorizations and look to the JAB for centralized Continuous Monitoring reporting and updates. In a significant enhancement for our partners that build their solutions on the AWS US East/West Region, they can now achieve FedRAMP JAB P-ATOs of their own for their Platform as a Service (PaaS) and Software as a Service (SaaS) offerings.

In line with FedRAMP security requirements, our independent FedRAMP assessment was completed in partnership with a FedRAMP accredited Third Party Assessment Organization (3PAO) on our technical, management, and operational security controls to validate that they meet or exceed FedRAMP’s Moderate baseline requirements. Effective immediately, you can begin leveraging this P-ATO for the following 20 services in the AWS US East/West Region:

  • Amazon Aurora (MySQL)*
  • Amazon CloudWatch Logs*
  • Amazon DynamoDB
  • Amazon Elastic Block Store
  • Amazon Elastic Compute Cloud
  • Amazon EMR*
  • Amazon Glacier*
  • Amazon Kinesis Streams*
  • Amazon RDS (MySQL, Oracle, Postgres*)
  • Amazon Redshift
  • Amazon Simple Notification Service*
  • Amazon Simple Queue Service*
  • Amazon Simple Storage Service
  • Amazon Simple Workflow Service*
  • Amazon Virtual Private Cloud
  • AWS CloudFormation*
  • AWS CloudTrail*
  • AWS Identity and Access Management
  • AWS Key Management Service
  • Elastic Load Balancing

* Services with first-time FedRAMP Moderate authorizations

We continue to work with the FedRAMP Project Management Office (PMO), other regulatory and compliance bodies, and our customers and partners to ensure that we are raising the bar on our customers’ security and compliance needs.

To learn more about how AWS helps customers meet their security and compliance requirements, see the AWS Compliance website. To learn about what other public sector customers are doing on AWS, see our Government, Education, and Nonprofits Case Studies and Customer Success Stories. To review the public posting of our FedRAMP authorizations, see the FedRAMP Marketplace.

– Chris Gile, Senior Manager, AWS Public Sector Risk and Compliance

Just in Case You Missed It: Catching Up on Some Recent AWS Launches

Post Syndicated from Tara Walker original https://aws.amazon.com/blogs/aws/just-in-case-you-missed-it-catching-up-on-some-recent-aws-launches/

So many launches and cloud innovations, that you simply may not believe.  In order to catch up on some service launches and features, this post will be a round-up of some cool releases that happened this summer and through the end of September.

The launches and features I want to share with you today are:

  • AWS IAM for Authenticating Database Users for RDS MySQL and Amazon Aurora
  • Amazon SES Reputation Dashboard
  • Amazon SES Open and Click Tracking Metrics
  • Serverless Image Handler by the Solutions Builder Team
  • AWS Ops Automator by the Solutions Builder Team

Let’s dive in, shall we!

AWS IAM for Authenticating Database Users for RDS MySQL and Amazon Aurora

Wished you could manage access to your Amazon RDS database instances and clusters using AWS IAM? Well, wish no longer. Amazon RDS has launched the ability for you to use IAM to manage database access for Amazon RDS for MySQL and Amazon Aurora DB.

What I like most about this new service feature is, it’s very easy to get started.  To enable database user authentication using IAM, you would select a checkbox Enable IAM DB Authentication when creating, modifying, or restoring your DB instance or cluster. You can enable IAM access using the RDS console, the AWS CLI, and/or the Amazon RDS API.

After configuring the database for IAM authentication, client applications authenticate to the database engine by providing temporary security credentials generated by the IAM Security Token Service. These credentials can be used instead of providing a password to the database engine.

You can learn more about using IAM to provide targeted permissions and authentication to MySQL and Aurora by reviewing the Amazon RDS user guide.

Amazon SES Reputation Dashboard

In order to aid Amazon Simple Email Service customers’ in utilizing best practice guidelines for sending email, I am thrilled to announce we launched the Reputation Dashboard to provide comprehensive reporting on email sending health. To aid in proactively managing emails being sent, customers now have visibility into overall account health, sending metrics, and compliance or enforcement status.

The Reputation Dashboard will provide the following information:

  • Account status: A description of your account health status.
    • Healthy – No issues currently impacting your account.
    • Probation – Account is on probation; Issues causing probation must be resolved to prevent suspension
    • Pending end of probation decision – Your account is on probation. Amazon SES team member must review your account prior to action.
    • Shutdown – Your account has been shut down. No email will be able to be sent using Amazon SES.
    • Pending shutdown – Your account is on probation and issues causing probation are unresolved.
  • Bounce Rate: Percentage of emails sent that have bounced and bounce rate status messages.
  • Complaint Rate: Percentage of emails sent that recipients have reported as spam and complaint rate status messages.
  • Notifications: Messages about other account reputation issues.

Amazon SES Open and Click Tracking Metrics

Another exciting feature recently added to Amazon SES is support for Email Open and Click Tracking Metrics. With Email Open and Click Tracking Metrics feature, SES customers can now track when email they’ve sent has been opened and track when links within the email have been clicked.  Using this SES feature will allow you to better track email campaign engagement and effectiveness.

How does this work?

When using the email open tracking feature, SES will add a transparent, miniature image into the emails that you choose to track. When the email is opened, the mail application client will load the aforementioned tracking which triggers an open track event with Amazon SES. For the email click (link) tracking, links in email and/or email templates are replaced with a custom link.  When the custom link is clicked, a click event is recorded in SES and the custom link will redirect the email user to the link destination of the original email.

You can take advantage of the new open tracking and click tracking features by creating a new configuration set or altering an existing configuration set within SES. After choosing either; Amazon SNS, Amazon CloudWatch, or Amazon Kinesis Firehose as the AWS service to receive the open and click metrics, you would only need to select a new configuration set to successfully enable these new features for any emails you want to send.

AWS Solutions: Serverless Image Handler & AWS Ops Automator

The AWS Solution Builder team has been hard at work helping to make it easier for you all to find answers to common architectural questions to aid in building and running applications on AWS. You can find these solutions on the AWS Answers page. Two new solutions released earlier this fall on AWS Answers are  Serverless Image Handler and the AWS Ops Automator.
Serverless Image Handler was developed to provide a solution to help customers dynamically process, manipulate, and optimize the handling of images on the AWS Cloud. The solution combines Amazon CloudFront for caching, AWS Lambda to dynamically retrieve images and make image modifications, and Amazon S3 bucket to store images. Additionally, the Serverless Image Handler leverages the open source image-processing suite, Thumbor, for additional image manipulation, processing, and optimization.

AWS Ops Automator solution helps you to automate manual tasks using time-based or event-based triggers to automatically such as snapshot scheduling by providing a framework for automated tasks and includes task audit trails, logging, resource selection, scaling, concurrency handling, task completion handing, and API request retries. The solution includes the following AWS services:

  • AWS CloudFormation: a templates to launches the core framework of microservices and solution generated task configurations
  • Amazon DynamoDB: a table which stores task configuration data to defines the event triggers, resources, and saves the results of the action and the errors.
  • Amazon CloudWatch Logs: provides logging to track warning and error messages
  • Amazon SNS: topic to send messages to a subscribed email address to which to send the logging information from the solution

Have fun exploring and coding.

Tara

AWS HIPAA Eligibility Update (October 2017) – Sixteen Additional Services

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-hipaa-eligibility-post-update-october-2017-sixteen-additional-services/

Our Health Customer Stories page lists just a few of the many customers that are building and running healthcare and life sciences applications that run on AWS. Customers like Verge Health, Care Cloud, and Orion Health trust AWS with Protected Health Information (PHI) and Personally Identifying Information (PII) as part of their efforts to comply with HIPAA and HITECH.

Sixteen More Services
In my last HIPAA Eligibility Update I shared the news that we added eight additional services to our list of HIPAA eligible services. Today I am happy to let you know that we have added another sixteen services to the list, bringing the total up to 46. Here are the newest additions, along with some short descriptions and links to some of my blog posts to jog your memory:

Amazon Aurora with PostgreSQL Compatibility – This brand-new addition to Amazon Aurora allows you to encrypt your relational databases using keys that you create and manage through AWS Key Management Service (KMS). When you enable encryption for an Amazon Aurora database, the underlying storage is encrypted, as are automated backups, read replicas, and snapshots. Read New – Encryption at Rest for Amazon Aurora to learn more.

Amazon CloudWatch Logs – You can use the logs to monitor and troubleshoot your systems and applications. You can monitor your existing system, application, and custom log files in near real-time, watching for specific phrases, values, or patterns. Log data can be stored durably and at low cost, for as long as needed. To learn more, read Store and Monitor OS & Application Log Files with Amazon CloudWatch and Improvements to CloudWatch Logs and Dashboards.

Amazon Connect – This self-service, cloud-based contact center makes it easy for you to deliver better customer service at a lower cost. You can use the visual designer to set up your contact flows, manage agents, and track performance, all without specialized skills. Read Amazon Connect – Customer Contact Center in the Cloud and New – Amazon Connect and Amazon Lex Integration to learn more.

Amazon ElastiCache for Redis – This service lets you deploy, operate, and scale an in-memory data store or cache that you can use to improve the performance of your applications. Each ElastiCache for Redis cluster publishes key performance metrics to Amazon CloudWatch. To learn more, read Caching in the Cloud with Amazon ElastiCache and Amazon ElastiCache – Now With a Dash of Redis.

Amazon Kinesis Streams – This service allows you to build applications that process or analyze streaming data such as website clickstreams, financial transactions, social media feeds, and location-tracking events. To learn more, read Amazon Kinesis – Real-Time Processing of Streaming Big Data and New: Server-Side Encryption for Amazon Kinesis Streams.

Amazon RDS for MariaDB – This service lets you set up scalable, managed MariaDB instances in minutes, and offers high performance, high availability, and a simplified security model that makes it easy for you to encrypt data at rest and in transit. Read Amazon RDS Update – MariaDB is Now Available to learn more.

Amazon RDS SQL Server – This service lets you set up scalable, managed Microsoft SQL Server instances in minutes, and also offers high performance, high availability, and a simplified security model. To learn more, read Amazon RDS for SQL Server and .NET support for AWS Elastic Beanstalk and Amazon RDS for Microsoft SQL Server – Transparent Data Encryption (TDE) to learn more.

Amazon Route 53 – This is a highly available Domain Name Server. It translates names like www.example.com into IP addresses. To learn more, read Moving Ahead with Amazon Route 53.

AWS Batch – This service lets you run large-scale batch computing jobs on AWS. You don’t need to install or maintain specialized batch software or build your own server clusters. Read AWS Batch – Run Batch Computing Jobs on AWS to learn more.

AWS CloudHSM – A cloud-based Hardware Security Module (HSM) for key storage and management at cloud scale. Designed for sensitive workloads, CloudHSM lets you manage your own keys using FIPS 140-2 Level 3 validated HSMs. To learn more, read AWS CloudHSM – Secure Key Storage and Cryptographic Operations and AWS CloudHSM Update – Cost Effective Hardware Key Management at Cloud Scale for Sensitive & Regulated Workloads.

AWS Key Management Service – This service makes it easy for you to create and control the encryption keys used to encrypt your data. It uses HSMs to protect your keys, and is integrated with AWS CloudTrail in order to provide you with a log of all key usage. Read New AWS Key Management Service (KMS) to learn more.

AWS Lambda – This service lets you run event-driven application or backend code without thinking about or managing servers. To learn more, read AWS Lambda – Run Code in the Cloud, AWS Lambda – A Look Back at 2016, and AWS Lambda – In Full Production with New Features for Mobile Devs.

[email protected] – You can use this new feature of AWS Lambda to run Node.js functions across the global network of AWS locations without having to provision or manager servers, in order to deliver rich, personalized content to your users with low latency. Read [email protected] – Intelligent Processing of HTTP Requests at the Edge to learn more.

AWS Snowball Edge – This is a data transfer device with 100 terabytes of on-board storage as well as compute capabilities. You can use it to move large amounts of data into or out of AWS, as a temporary storage tier, or to support workloads in remote or offline locations. To learn more, read AWS Snowball Edge – More Storage, Local Endpoints, Lambda Functions.

AWS Snowmobile – This is an exabyte-scale data transfer service. Pulled by a semi-trailer truck, each Snowmobile packs 100 petabytes of storage into a ruggedized 45-foot long shipping container. Read AWS Snowmobile – Move Exabytes of Data to the Cloud in Weeks to learn more (and to see some of my finest LEGO work).

AWS Storage Gateway – This hybrid storage service lets your on-premises applications use AWS cloud storage (Amazon Simple Storage Service (S3), Amazon Glacier, and Amazon Elastic File System) in a simple and seamless way, with storage for volumes, files, and virtual tapes. To learn more, read The AWS Storage Gateway – Integrate Your Existing On-Premises Applications with AWS Cloud Storage and File Interface to AWS Storage Gateway.

And there you go! Check out my earlier post for a list of resources that will help you to build applications that comply with HIPAA and HITECH.

Jeff;

 

Predict Billboard Top 10 Hits Using RStudio, H2O and Amazon Athena

Post Syndicated from Gopal Wunnava original https://aws.amazon.com/blogs/big-data/predict-billboard-top-10-hits-using-rstudio-h2o-and-amazon-athena/

Success in the popular music industry is typically measured in terms of the number of Top 10 hits artists have to their credit. The music industry is a highly competitive multi-billion dollar business, and record labels incur various costs in exchange for a percentage of the profits from sales and concert tickets.

Predicting the success of an artist’s release in the popular music industry can be difficult. One release may be extremely popular, resulting in widespread play on TV, radio and social media, while another single may turn out quite unpopular, and therefore unprofitable. Record labels need to be selective in their decision making, and predictive analytics can help them with decision making around the type of songs and artists they need to promote.

In this walkthrough, you leverage H2O.ai, Amazon Athena, and RStudio to make predictions on whether a song might make it to the Top 10 Billboard charts. You explore the GLM, GBM, and deep learning modeling techniques using H2O’s rapid, distributed and easy-to-use open source parallel processing engine. RStudio is a popular IDE, licensed either commercially or under AGPLv3, for working with R. This is ideal if you don’t want to connect to a server via SSH and use code editors such as vi to do analytics. RStudio is available in a desktop version, or a server version that allows you to access R via a web browser. RStudio’s Notebooks feature is used to demonstrate the execution of code and output. In addition, this post showcases how you can leverage Athena for query and interactive analysis during the modeling phase. A working knowledge of statistics and machine learning would be helpful to interpret the analysis being performed in this post.

Walkthrough

Your goal is to predict whether a song will make it to the Top 10 Billboard charts. For this purpose, you will be using multiple modeling techniques―namely GLM, GBM and deep learning―and choose the model that is the best fit.

This solution involves the following steps:

  • Install and configure RStudio with Athena
  • Log in to RStudio
  • Install R packages
  • Connect to Athena
  • Create a dataset
  • Create models

Install and configure RStudio with Athena

Use the following AWS CloudFormation stack to install, configure, and connect RStudio on an Amazon EC2 instance with Athena.

Launching this stack creates all required resources and prerequisites:

  • Amazon EC2 instance with Amazon Linux (minimum size of t2.large is recommended)
  • Provisioning of the EC2 instance in an existing VPC and public subnet
  • Installation of Java 8
  • Assignment of an IAM role to the EC2 instance with the required permissions for accessing Athena and Amazon S3
  • Security group allowing access to the RStudio and SSH ports from the internet (I recommend restricting access to these ports)
  • S3 staging bucket required for Athena (referenced within RStudio as ATHENABUCKET)
  • RStudio username and password
  • Setup logs in Amazon CloudWatch Logs (if needed for additional troubleshooting)
  • Amazon EC2 Systems Manager agent, which makes it easy to manage and patch

All AWS resources are created in the US-East-1 Region. To avoid cross-region data transfer fees, launch the CloudFormation stack in the same region. To check the availability of Athena in other regions, see Region Table.

Log in to RStudio

The instance security group has been automatically configured to allow incoming connections on the RStudio port 8787 from any source internet address. You can edit the security group to restrict source IP access. If you have trouble connecting, ensure that port 8787 isn’t blocked by subnet network ACLS or by your outgoing proxy/firewall.

  1. In the CloudFormation stack, choose Outputs, Value, and then open the RStudio URL. You might need to wait for a few minutes until the instance has been launched.
  2. Log in to RStudio with the and password you provided during setup.

Install R packages

Next, install the required R packages from the RStudio console. You can download the R notebook file containing just the code.

#install pacman – a handy package manager for managing installs
if("pacman" %in% rownames(installed.packages()) == FALSE)
{install.packages("pacman")}  
library(pacman)
p_load(h2o,rJava,RJDBC,awsjavasdk)
h2o.init(nthreads = -1)
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 hours 42 minutes 
##     H2O cluster version:        3.10.4.6 
##     H2O cluster version age:    4 months and 4 days !!! 
##     H2O cluster name:           H2O_started_from_R_rstudio_hjx881 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.30 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 3.3.3 (2017-03-06)
## Warning in h2o.clusterInfo(): 
## Your H2O cluster version is too old (4 months and 4 days)!
## Please download and install the latest version from http://h2o.ai/download/
#install aws sdk if not present (pre-requisite for using Athena with an IAM role)
if (!aws_sdk_present()) {
  install_aws_sdk()
}

load_sdk()
## NULL

Connect to Athena

Next, establish a connection to Athena from RStudio, using an IAM role associated with your EC2 instance. Use ATHENABUCKET to specify the S3 staging directory.

URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.1.jar'
fil <- basename(URL)
#download the file into current working directory
if (!file.exists(fil)) download.file(URL, fil)
#verify that the file has been downloaded successfully
list.files()
## [1] "AthenaJDBC41-1.0.1.jar"
drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")

con <- jdbcConnection <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
                                   s3_staging_dir=Sys.getenv("ATHENABUCKET"),
                                   aws_credentials_provider_class="com.amazonaws.auth.DefaultAWSCredentialsProviderChain")

Verify the connection. The results returned depend on your specific Athena setup.

con
## <JDBCConnection>
dbListTables(con)
##  [1] "gdelt"               "wikistats"           "elb_logs_raw_native"
##  [4] "twitter"             "twitter2"            "usermovieratings"   
##  [7] "eventcodes"          "events"              "billboard"          
## [10] "billboardtop10"      "elb_logs"            "gdelthist"          
## [13] "gdeltmaster"         "twitter"             "twitter3"

Create a dataset

For this analysis, you use a sample dataset combining information from Billboard and Wikipedia with Echo Nest data in the Million Songs Dataset. Upload this dataset into your own S3 bucket. The table below provides a description of the fields used in this dataset.

Field Description
year Year that song was released
songtitle Title of the song
artistname Name of the song artist
songid Unique identifier for the song
artistid Unique identifier for the song artist
timesignature Variable estimating the time signature of the song
timesignature_confidence Confidence in the estimate for the timesignature
loudness Continuous variable indicating the average amplitude of the audio in decibels
tempo Variable indicating the estimated beats per minute of the song
tempo_confidence Confidence in the estimate for tempo
key Variable with twelve levels indicating the estimated key of the song (C, C#, B)
key_confidence Confidence in the estimate for key
energy Variable that represents the overall acoustic energy of the song, using a mix of features such as loudness
pitch Continuous variable that indicates the pitch of the song
timbre_0_min thru timbre_11_min Variables that indicate the minimum values over all segments for each of the twelve values in the timbre vector
timbre_0_max thru timbre_11_max Variables that indicate the maximum values over all segments for each of the twelve values in the timbre vector
top10 Indicator for whether or not the song made it to the Top 10 of the Billboard charts (1 if it was in the top 10, and 0 if not)

Create an Athena table based on the dataset

In the Athena console, select the default database, sampled, or create a new database.

Run the following create table statement.

create external table if not exists billboard
(
year int,
songtitle string,
artistname string,
songID string,
artistID string,
timesignature int,
timesignature_confidence double,
loudness double,
tempo double,
tempo_confidence double,
key int,
key_confidence double,
energy double,
pitch double,
timbre_0_min double,
timbre_0_max double,
timbre_1_min double,
timbre_1_max double,
timbre_2_min double,
timbre_2_max double,
timbre_3_min double,
timbre_3_max double,
timbre_4_min double,
timbre_4_max double,
timbre_5_min double,
timbre_5_max double,
timbre_6_min double,
timbre_6_max double,
timbre_7_min double,
timbre_7_max double,
timbre_8_min double,
timbre_8_max double,
timbre_9_min double,
timbre_9_max double,
timbre_10_min double,
timbre_10_max double,
timbre_11_min double,
timbre_11_max double,
Top10 int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://aws-bigdata-blog/artifacts/predict-billboard/data'
;

Inspect the table definition for the ‘billboard’ table that you have created. If you chose a database other than sampledb, replace that value with your choice.

dbGetQuery(con, "show create table sampledb.billboard")
##                                      createtab_stmt
## 1       CREATE EXTERNAL TABLE `sampledb.billboard`(
## 2                                       `year` int,
## 3                               `songtitle` string,
## 4                              `artistname` string,
## 5                                  `songid` string,
## 6                                `artistid` string,
## 7                              `timesignature` int,
## 8                `timesignature_confidence` double,
## 9                                `loudness` double,
## 10                                  `tempo` double,
## 11                       `tempo_confidence` double,
## 12                                       `key` int,
## 13                         `key_confidence` double,
## 14                                 `energy` double,
## 15                                  `pitch` double,
## 16                           `timbre_0_min` double,
## 17                           `timbre_0_max` double,
## 18                           `timbre_1_min` double,
## 19                           `timbre_1_max` double,
## 20                           `timbre_2_min` double,
## 21                           `timbre_2_max` double,
## 22                           `timbre_3_min` double,
## 23                           `timbre_3_max` double,
## 24                           `timbre_4_min` double,
## 25                           `timbre_4_max` double,
## 26                           `timbre_5_min` double,
## 27                           `timbre_5_max` double,
## 28                           `timbre_6_min` double,
## 29                           `timbre_6_max` double,
## 30                           `timbre_7_min` double,
## 31                           `timbre_7_max` double,
## 32                           `timbre_8_min` double,
## 33                           `timbre_8_max` double,
## 34                           `timbre_9_min` double,
## 35                           `timbre_9_max` double,
## 36                          `timbre_10_min` double,
## 37                          `timbre_10_max` double,
## 38                          `timbre_11_min` double,
## 39                          `timbre_11_max` double,
## 40                                     `top10` int)
## 41                             ROW FORMAT DELIMITED 
## 42                         FIELDS TERMINATED BY ',' 
## 43                            STORED AS INPUTFORMAT 
## 44       'org.apache.hadoop.mapred.TextInputFormat' 
## 45                                     OUTPUTFORMAT 
## 46  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
## 47                                        LOCATION
## 48    's3://aws-bigdata-blog/artifacts/predict-billboard/data'
## 49                                  TBLPROPERTIES (
## 50            'transient_lastDdlTime'='1505484133')

Run a sample query

Next, run a sample query to obtain a list of all songs from Janet Jackson that made it to the Billboard Top 10 charts.

dbGetQuery(con, " SELECT songtitle,artistname,top10   FROM sampledb.billboard WHERE lower(artistname) =     'janet jackson' AND top10 = 1")
##                       songtitle    artistname top10
## 1                       Runaway Janet Jackson     1
## 2               Because Of Love Janet Jackson     1
## 3                         Again Janet Jackson     1
## 4                            If Janet Jackson     1
## 5  Love Will Never Do (Without You) Janet Jackson 1
## 6                     Black Cat Janet Jackson     1
## 7               Come Back To Me Janet Jackson     1
## 8                       Alright Janet Jackson     1
## 9                      Escapade Janet Jackson     1
## 10                Rhythm Nation Janet Jackson     1

Determine how many songs in this dataset are specifically from the year 2010.

dbGetQuery(con, " SELECT count(*)   FROM sampledb.billboard WHERE year = 2010")
##   _col0
## 1   373

The sample dataset provides certain song properties of interest that can be analyzed to gauge the impact to the song’s overall popularity. Look at one such property, timesignature, and determine the value that is the most frequent among songs in the database. Timesignature is a measure of the number of beats and the type of note involved.

Running the query directly may result in an error, as shown in the commented lines below. This error is a result of trying to retrieve a large result set over a JDBC connection, which can cause out-of-memory issues at the client level. To address this, reduce the fetch size and run again.

#t<-dbGetQuery(con, " SELECT timesignature FROM sampledb.billboard")
#Note:  Running the preceding query results in the following error: 
#Error in .jcall(rp, "I", "fetch", stride, block): java.sql.SQLException: The requested #fetchSize is more than the allowed value in Athena. Please reduce the fetchSize and try #again. Refer to the Athena documentation for valid fetchSize values.
# Use the dbSendQuery function, reduce the fetch size, and run again
r <- dbSendQuery(con, " SELECT timesignature     FROM sampledb.billboard")
dftimesignature<- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
table(dftimesignature)
## dftimesignature
##    0    1    3    4    5    7 
##   10  143  503 6787  112   19
nrow(dftimesignature)
## [1] 7574

From the results, observe that 6787 songs have a timesignature of 4.

Next, determine the song with the highest tempo.

dbGetQuery(con, " SELECT songtitle,artistname,tempo   FROM sampledb.billboard WHERE tempo = (SELECT max(tempo) FROM sampledb.billboard) ")
##                   songtitle      artistname   tempo
## 1 Wanna Be Startin' Somethin' Michael Jackson 244.307

Create the training dataset

Your model needs to be trained such that it can learn and make accurate predictions. Split the data into training and test datasets, and create the training dataset first.  This dataset contains all observations from the year 2009 and earlier. You may face the same JDBC connection issue pointed out earlier, so this query uses a fetch size.

#BillboardTrain <- dbGetQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
#Running the preceding query results in the following error:-
#Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", : Unable to retrieve #JDBC result set for SELECT * FROM sampledb.billboard WHERE year <= 2009 (Internal error)
#Follow the same approach as before to address this issue.

r <- dbSendQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
BillboardTrain <- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
BillboardTrain[1:2,c(1:3,6:10)]
##   year           songtitle artistname timesignature
## 1 2009 The Awkward Goodbye    Athlete             3
## 2 2009        Rubik's Cube    Athlete             3
##   timesignature_confidence loudness   tempo tempo_confidence
## 1                    0.732   -6.320  89.614   0.652
## 2                    0.906   -9.541 117.742   0.542
nrow(BillboardTrain)
## [1] 7201

Create the test dataset

BillboardTest <- dbGetQuery(con, "SELECT * FROM sampledb.billboard where year = 2010")
BillboardTest[1:2,c(1:3,11:15)]
##   year              songtitle        artistname key
## 1 2010 This Is the House That Doubt Built A Day to Remember  11
## 2 2010        Sticks & Bricks A Day to Remember  10
##   key_confidence    energy pitch timbre_0_min
## 1          0.453 0.9666556 0.024        0.002
## 2          0.469 0.9847095 0.025        0.000
nrow(BillboardTest)
## [1] 373

Convert the training and test datasets into H2O dataframes

train.h2o <- as.h2o(BillboardTrain)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
test.h2o <- as.h2o(BillboardTest)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

Inspect the column names in your H2O dataframes.

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"

Create models

You need to designate the independent and dependent variables prior to applying your modeling algorithms. Because you’re trying to predict the ‘top10’ field, this would be your dependent variable and everything else would be independent.

Create your first model using GLM. Because GLM works best with numeric data, you create your model by dropping non-numeric variables. You only use the variables in the dataset that describe the numerical attributes of the song in the logistic regression model. You won’t use these variables:  “year”, “songtitle”, “artistname”, “songid”, or “artistid”.

y.dep <- 39
x.indep <- c(6:38)
x.indep
##  [1]  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
## [24] 29 30 31 32 33 34 35 36 37 38

Create Model 1: All numeric variables

Create Model 1 with the training dataset, using GLM as the modeling algorithm and H2O’s built-in h2o.glm function.

modelh1 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |=================================================================| 100%

Measure the performance of Model 1, using H2O’s built-in performance function.

h2o.performance(model=modelh1,newdata=test.h2o)
## H2OBinomialMetrics: glm
## 
## MSE:  0.09924684
## RMSE:  0.3150347
## LogLoss:  0.3220267
## Mean Per-Class Error:  0.2380168
## AUC:  0.8431394
## Gini:  0.6862787
## R^2:  0.254663
## Null Deviance:  326.0801
## Residual Deviance:  240.2319
## AIC:  308.2319
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0   1    Error     Rate
## 0      255  59 0.187898  =59/314
## 1       17  42 0.288136   =17/59
## Totals 272 101 0.203753  =76/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.192772 0.525000 100
## 2                       max f2  0.124912 0.650510 155
## 3                 max f0point5  0.416258 0.612903  23
## 4                 max accuracy  0.416258 0.879357  23
## 5                max precision  0.813396 1.000000   0
## 6                   max recall  0.037579 1.000000 282
## 7              max specificity  0.813396 1.000000   0
## 8             max absolute_mcc  0.416258 0.455251  23
## 9   max min_per_class_accuracy  0.161402 0.738854 125
## 10 max mean_per_class_accuracy  0.124912 0.765006 155
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or ` 
h2o.auc(h2o.performance(modelh1,test.h2o)) 
## [1] 0.8431394

The AUC metric provides insight into how well the classifier is able to separate the two classes. In this case, the value of 0.8431394 indicates that the classification is good. (A value of 0.5 indicates a worthless test, while a value of 1.0 indicates a perfect test.)

Next, inspect the coefficients of the variables in the dataset.

dfmodelh1 <- as.data.frame(h2o.varimp(modelh1))
dfmodelh1
##                       names coefficients sign
## 1              timbre_0_max  1.290938663  NEG
## 2                  loudness  1.262941934  POS
## 3                     pitch  0.616995941  NEG
## 4              timbre_1_min  0.422323735  POS
## 5              timbre_6_min  0.349016024  NEG
## 6                    energy  0.348092062  NEG
## 7             timbre_11_min  0.307331997  NEG
## 8              timbre_3_max  0.302225619  NEG
## 9             timbre_11_max  0.243632060  POS
## 10             timbre_4_min  0.224233951  POS
## 11             timbre_4_max  0.204134342  POS
## 12             timbre_5_min  0.199149324  NEG
## 13             timbre_0_min  0.195147119  POS
## 14 timesignature_confidence  0.179973904  POS
## 15         tempo_confidence  0.144242598  POS
## 16            timbre_10_max  0.137644568  POS
## 17             timbre_7_min  0.126995955  NEG
## 18            timbre_10_min  0.123851179  POS
## 19             timbre_7_max  0.100031481  NEG
## 20             timbre_2_min  0.096127636  NEG
## 21           key_confidence  0.083115820  POS
## 22             timbre_6_max  0.073712419  POS
## 23            timesignature  0.067241917  POS
## 24             timbre_8_min  0.061301881  POS
## 25             timbre_8_max  0.060041698  POS
## 26                      key  0.056158445  POS
## 27             timbre_3_min  0.050825116  POS
## 28             timbre_9_max  0.033733561  POS
## 29             timbre_2_max  0.030939072  POS
## 30             timbre_9_min  0.020708113  POS
## 31             timbre_1_max  0.014228818  NEG
## 32                    tempo  0.008199861  POS
## 33             timbre_5_max  0.004837870  POS
## 34                                    NA <NA>

Typically, songs with heavier instrumentation tend to be louder (have higher values in the variable “loudness”) and more energetic (have higher values in the variable “energy”). This knowledge is helpful for interpreting the modeling results.

You can make the following observations from the results:

  • The coefficient estimates for the confidence values associated with the time signature, key, and tempo variables are positive. This suggests that higher confidence leads to a higher predicted probability of a Top 10 hit.
  • The coefficient estimate for loudness is positive, meaning that mainstream listeners prefer louder songs with heavier instrumentation.
  • The coefficient estimate for energy is negative, meaning that mainstream listeners prefer songs that are less energetic, which are those songs with light instrumentation.

These coefficients lead to contradictory conclusions for Model 1. This could be due to multicollinearity issues. Inspect the correlation between the variables “loudness” and “energy” in the training set.

cor(train.h2o$loudness,train.h2o$energy)
## [1] 0.7399067

This number indicates that these two variables are highly correlated, and Model 1 does indeed suffer from multicollinearity. Typically, you associate a value of -1.0 to -0.5 or 1.0 to 0.5 to indicate strong correlation, and a value of 0.1 to 0.1 to indicate weak correlation. To avoid this correlation issue, omit one of these two variables and re-create the models.

You build two variations of the original model:

  • Model 2, in which you keep “energy” and omit “loudness”
  • Model 3, in which you keep “loudness” and omit “energy”

You compare these two models and choose the model with a better fit for this use case.

Create Model 2: Keep energy and omit loudness

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"
y.dep <- 39
x.indep <- c(6:7,9:38)
x.indep
##  [1]  6  7  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh2 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=======                                                          |  10%
  |                                                                       
  |=================================================================| 100%

Measure the performance of Model 2.

h2o.performance(model=modelh2,newdata=test.h2o)
## H2OBinomialMetrics: glm
## 
## MSE:  0.09922606
## RMSE:  0.3150017
## LogLoss:  0.3228213
## Mean Per-Class Error:  0.2490554
## AUC:  0.8431933
## Gini:  0.6863867
## R^2:  0.2548191
## Null Deviance:  326.0801
## Residual Deviance:  240.8247
## AIC:  306.8247
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      280 34 0.108280  =34/314
## 1       23 36 0.389831   =23/59
## Totals 303 70 0.152815  =57/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.254391 0.558140  69
## 2                       max f2  0.113031 0.647208 157
## 3                 max f0point5  0.413999 0.596026  22
## 4                 max accuracy  0.446250 0.876676  18
## 5                max precision  0.811739 1.000000   0
## 6                   max recall  0.037682 1.000000 283
## 7              max specificity  0.811739 1.000000   0
## 8             max absolute_mcc  0.254391 0.469060  69
## 9   max min_per_class_accuracy  0.141051 0.716561 131
## 10 max mean_per_class_accuracy  0.113031 0.761821 157
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
dfmodelh2 <- as.data.frame(h2o.varimp(modelh2))
dfmodelh2
##                       names coefficients sign
## 1                     pitch  0.700331511  NEG
## 2              timbre_1_min  0.510270513  POS
## 3              timbre_0_max  0.402059546  NEG
## 4              timbre_6_min  0.333316236  NEG
## 5             timbre_11_min  0.331647383  NEG
## 6              timbre_3_max  0.252425901  NEG
## 7             timbre_11_max  0.227500308  POS
## 8              timbre_4_max  0.210663865  POS
## 9              timbre_0_min  0.208516163  POS
## 10             timbre_5_min  0.202748055  NEG
## 11             timbre_4_min  0.197246582  POS
## 12            timbre_10_max  0.172729619  POS
## 13         tempo_confidence  0.167523934  POS
## 14 timesignature_confidence  0.167398830  POS
## 15             timbre_7_min  0.142450727  NEG
## 16             timbre_8_max  0.093377516  POS
## 17            timbre_10_min  0.090333426  POS
## 18            timesignature  0.085851625  POS
## 19             timbre_7_max  0.083948442  NEG
## 20           key_confidence  0.079657073  POS
## 21             timbre_6_max  0.076426046  POS
## 22             timbre_2_min  0.071957831  NEG
## 23             timbre_9_max  0.071393189  POS
## 24             timbre_8_min  0.070225578  POS
## 25                      key  0.061394702  POS
## 26             timbre_3_min  0.048384697  POS
## 27             timbre_1_max  0.044721121  NEG
## 28                   energy  0.039698433  POS
## 29             timbre_5_max  0.039469064  POS
## 30             timbre_2_max  0.018461133  POS
## 31                    tempo  0.013279926  POS
## 32             timbre_9_min  0.005282143  NEG
## 33                                    NA <NA>

h2o.auc(h2o.performance(modelh2,test.h2o)) 
## [1] 0.8431933

You can make the following observations:

  • The AUC metric is 0.8431933.
  • Inspecting the coefficient of the variable energy, Model 2 suggests that songs with high energy levels tend to be more popular. This is as per expectation.
  • As H2O orders variables by significance, the variable energy is not significant in this model.

You can conclude that Model 2 is not ideal for this use , as energy is not significant.

CreateModel 3: Keep loudness but omit energy

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"
y.dep <- 39
x.indep <- c(6:12,14:38)
x.indep
##  [1]  6  7  8  9 10 11 12 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh3 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |=================================================================| 100%
perfh3<-h2o.performance(model=modelh3,newdata=test.h2o)
perfh3
## H2OBinomialMetrics: glm
## 
## MSE:  0.0978859
## RMSE:  0.3128672
## LogLoss:  0.3178367
## Mean Per-Class Error:  0.264925
## AUC:  0.8492389
## Gini:  0.6984778
## R^2:  0.2648836
## Null Deviance:  326.0801
## Residual Deviance:  237.1062
## AIC:  303.1062
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      286 28 0.089172  =28/314
## 1       26 33 0.440678   =26/59
## Totals 312 61 0.144772  =54/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.273799 0.550000  60
## 2                       max f2  0.125503 0.663265 155
## 3                 max f0point5  0.435479 0.628931  24
## 4                 max accuracy  0.435479 0.882038  24
## 5                max precision  0.821606 1.000000   0
## 6                   max recall  0.038328 1.000000 280
## 7              max specificity  0.821606 1.000000   0
## 8             max absolute_mcc  0.435479 0.471426  24
## 9   max min_per_class_accuracy  0.173693 0.745763 120
## 10 max mean_per_class_accuracy  0.125503 0.775073 155
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
dfmodelh3 <- as.data.frame(h2o.varimp(modelh3))
dfmodelh3
##                       names coefficients sign
## 1              timbre_0_max 1.216621e+00  NEG
## 2                  loudness 9.780973e-01  POS
## 3                     pitch 7.249788e-01  NEG
## 4              timbre_1_min 3.891197e-01  POS
## 5              timbre_6_min 3.689193e-01  NEG
## 6             timbre_11_min 3.086673e-01  NEG
## 7              timbre_3_max 3.025593e-01  NEG
## 8             timbre_11_max 2.459081e-01  POS
## 9              timbre_4_min 2.379749e-01  POS
## 10             timbre_4_max 2.157627e-01  POS
## 11             timbre_0_min 1.859531e-01  POS
## 12             timbre_5_min 1.846128e-01  NEG
## 13 timesignature_confidence 1.729658e-01  POS
## 14             timbre_7_min 1.431871e-01  NEG
## 15            timbre_10_max 1.366703e-01  POS
## 16            timbre_10_min 1.215954e-01  POS
## 17         tempo_confidence 1.183698e-01  POS
## 18             timbre_2_min 1.019149e-01  NEG
## 19           key_confidence 9.109701e-02  POS
## 20             timbre_7_max 8.987908e-02  NEG
## 21             timbre_6_max 6.935132e-02  POS
## 22             timbre_8_max 6.878241e-02  POS
## 23            timesignature 6.120105e-02  POS
## 24                      key 5.814805e-02  POS
## 25             timbre_8_min 5.759228e-02  POS
## 26             timbre_1_max 2.930285e-02  NEG
## 27             timbre_9_max 2.843755e-02  POS
## 28             timbre_3_min 2.380245e-02  POS
## 29             timbre_2_max 1.917035e-02  POS
## 30             timbre_5_max 1.715813e-02  POS
## 31                    tempo 1.364418e-02  NEG
## 32             timbre_9_min 8.463143e-05  NEG
## 33                                    NA <NA>
h2o.sensitivity(perfh3,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501855569251422. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.2033898
h2o.auc(perfh3)
## [1] 0.8492389

You can make the following observations:

  • The AUC metric is 0.8492389.
  • From the confusion matrix, the model correctly predicts that 33 songs will be top 10 hits (true positives). However, it has 26 false positives (songs that the model predicted would be Top 10 hits, but ended up not being Top 10 hits).
  • Loudness has a positive coefficient estimate, meaning that this model predicts that songs with heavier instrumentation tend to be more popular. This is the same conclusion from Model 2.
  • Loudness is significant in this model.

Overall, Model 3 predicts a higher number of top 10 hits with an accuracy rate that is acceptable. To choose the best fit for production runs, record labels should consider the following factors:

  • Desired model accuracy at a given threshold
  • Number of correct predictions for top10 hits
  • Tolerable number of false positives or false negatives

Next, make predictions using Model 3 on the test dataset.

predict.regh <- h2o.predict(modelh3, test.h2o)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
print(predict.regh)
##   predict        p0          p1
## 1       0 0.9654739 0.034526052
## 2       0 0.9654748 0.034525236
## 3       0 0.9635547 0.036445318
## 4       0 0.9343579 0.065642149
## 5       0 0.9978334 0.002166601
## 6       0 0.9779949 0.022005078
## 
## [373 rows x 3 columns]
predict.regh$predict
##   predict
## 1       0
## 2       0
## 3       0
## 4       0
## 5       0
## 6       0
## 
## [373 rows x 1 column]
dpr<-as.data.frame(predict.regh)
#Rename the predicted column 
colnames(dpr)[colnames(dpr) == 'predict'] <- 'predict_top10'
table(dpr$predict_top10)
## 
##   0   1 
## 312  61

The first set of output results specifies the probabilities associated with each predicted observation.  For example, observation 1 is 96.54739% likely to not be a Top 10 hit, and 3.4526052% likely to be a Top 10 hit (predict=1 indicates Top 10 hit and predict=0 indicates not a Top 10 hit).  The second set of results list the actual predictions made.  From the third set of results, this model predicts that 61 songs will be top 10 hits.

Compute the baseline accuracy, by assuming that the baseline predicts the most frequent outcome, which is that most songs are not Top 10 hits.

table(BillboardTest$top10)
## 
##   0   1 
## 314  59

Now observe that the baseline model would get 314 observations correct, and 59 wrong, for an accuracy of 314/(314+59) = 0.8418231.

It seems that Model 3, with an accuracy of 0.8552, provides you with a small improvement over the baseline model. But is this model useful for record labels?

View the two models from an investment perspective:

  • A production company is interested in investing in songs that are more likely to make it to the Top 10. The company’s objective is to minimize the risk of financial losses attributed to investing in songs that end up unpopular.
  • How many songs does Model 3 correctly predict as a Top 10 hit in 2010? Looking at the confusion matrix, you see that it predicts 33 top 10 hits correctly at an optimal threshold, which is more than half the number
  • It will be more useful to the record label if you can provide the production company with a list of songs that are highly likely to end up in the Top 10.
  • The baseline model is not useful, as it simply does not label any song as a hit.

Considering the three models built so far, you can conclude that Model 3 proves to be the best investment choice for the record label.

GBM model

H2O provides you with the ability to explore other learning models, such as GBM and deep learning. Explore building a model using the GBM technique, using the built-in h2o.gbm function.

Before you do this, you need to convert the target variable to a factor for multinomial classification techniques.

train.h2o$top10=as.factor(train.h2o$top10)
gbm.modelh <- h2o.gbm(y=y.dep, x=x.indep, training_frame = train.h2o, ntrees = 500, max_depth = 4, learn_rate = 0.01, seed = 1122,distribution="multinomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |===                                                              |   5%
  |                                                                       
  |=====                                                            |   7%
  |                                                                       
  |======                                                           |   9%
  |                                                                       
  |=======                                                          |  10%
  |                                                                       
  |======================                                           |  33%
  |                                                                       
  |=====================================                            |  56%
  |                                                                       
  |====================================================             |  79%
  |                                                                       
  |================================================================ |  98%
  |                                                                       
  |=================================================================| 100%
perf.gbmh<-h2o.performance(gbm.modelh,test.h2o)
perf.gbmh
## H2OBinomialMetrics: gbm
## 
## MSE:  0.09860778
## RMSE:  0.3140188
## LogLoss:  0.3206876
## Mean Per-Class Error:  0.2120263
## AUC:  0.8630573
## Gini:  0.7261146
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      266 48 0.152866  =48/314
## 1       16 43 0.271186   =16/59
## Totals 282 91 0.171582  =64/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.189757 0.573333  90
## 2                     max f2  0.130895 0.693717 145
## 3               max f0point5  0.327346 0.598802  26
## 4               max accuracy  0.442757 0.876676  14
## 5              max precision  0.802184 1.000000   0
## 6                 max recall  0.049990 1.000000 284
## 7            max specificity  0.802184 1.000000   0
## 8           max absolute_mcc  0.169135 0.496486 104
## 9 max min_per_class_accuracy  0.169135 0.796610 104
## 10 max mean_per_class_accuracy  0.169135 0.805948 104
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `
h2o.sensitivity(perf.gbmh,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501205344484314. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.1355932
h2o.auc(perf.gbmh)
## [1] 0.8630573

This model correctly predicts 43 top 10 hits, which is 10 more than the number predicted by Model 3. Moreover, the AUC metric is higher than the one obtained from Model 3.

As seen above, H2O’s API provides the ability to obtain key statistical measures required to analyze the models easily, using several built-in functions. The record label can experiment with different parameters to arrive at the model that predicts the maximum number of Top 10 hits at the desired level of accuracy and threshold.

H2O also allows you to experiment with deep learning models. Deep learning models have the ability to learn features implicitly, but can be more expensive computationally.

Now, create a deep learning model with the h2o.deeplearning function, using the same training and test datasets created before. The time taken to run this model depends on the type of EC2 instance chosen for this purpose.  For models that require more computation, consider using accelerated computing instances such as the P2 instance type.

system.time(
  dlearning.modelh <- h2o.deeplearning(y = y.dep,
                                      x = x.indep,
                                      training_frame = train.h2o,
                                      epoch = 250,
                                      hidden = c(250,250),
                                      activation = "Rectifier",
                                      seed = 1122,
                                      distribution="multinomial"
  )
)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |===                                                              |   4%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |==========                                                       |  16%
  |                                                                       
  |=============                                                    |  20%
  |                                                                       
  |================                                                 |  24%
  |                                                                       
  |==================                                               |  28%
  |                                                                       
  |=====================                                            |  32%
  |                                                                       
  |=======================                                          |  36%
  |                                                                       
  |==========================                                       |  40%
  |                                                                       
  |=============================                                    |  44%
  |                                                                       
  |===============================                                  |  48%
  |                                                                       
  |==================================                               |  52%
  |                                                                       
  |====================================                             |  56%
  |                                                                       
  |=======================================                          |  60%
  |                                                                       
  |==========================================                       |  64%
  |                                                                       
  |============================================                     |  68%
  |                                                                       
  |===============================================                  |  72%
  |                                                                       
  |=================================================                |  76%
  |                                                                       
  |====================================================             |  80%
  |                                                                       
  |=======================================================          |  84%
  |                                                                       
  |=========================================================        |  88%
  |                                                                       
  |============================================================     |  92%
  |                                                                       
  |==============================================================   |  96%
  |                                                                       
  |=================================================================| 100%
##    user  system elapsed 
##   1.216   0.020 166.508
perf.dl<-h2o.performance(model=dlearning.modelh,newdata=test.h2o)
perf.dl
## H2OBinomialMetrics: deeplearning
## 
## MSE:  0.1678359
## RMSE:  0.4096778
## LogLoss:  1.86509
## Mean Per-Class Error:  0.3433013
## AUC:  0.7568822
## Gini:  0.5137644
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      290 24 0.076433  =24/314
## 1       36 23 0.610169   =36/59
## Totals 326 47 0.160858  =60/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.826267 0.433962  46
## 2                     max f2  0.000000 0.588235 239
## 3               max f0point5  0.999929 0.511811  16
## 4               max accuracy  0.999999 0.865952  10
## 5              max precision  1.000000 1.000000   0
## 6                 max recall  0.000000 1.000000 326
## 7            max specificity  1.000000 1.000000   0
## 8           max absolute_mcc  0.999929 0.363219  16
## 9 max min_per_class_accuracy  0.000004 0.662420 145
## 10 max mean_per_class_accuracy  0.000000 0.685334 224
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.sensitivity(perf.dl,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.496293348880151. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.3898305
h2o.auc(perf.dl)
## [1] 0.7568822

The AUC metric for this model is 0.7568822, which is less than what you got from the earlier models. I recommend further experimentation using different hyper parameters, such as the learning rate, epoch or the number of hidden layers.

H2O’s built-in functions provide many key statistical measures that can help measure model performance. Here are some of these key terms.

Metric Description
Sensitivity Measures the proportion of positives that have been correctly identified. It is also called the true positive rate, or recall.
Specificity Measures the proportion of negatives that have been correctly identified. It is also called the true negative rate.
Threshold Cutoff point that maximizes specificity and sensitivity. While the model may not provide the highest prediction at this point, it would not be biased towards positives or negatives.
Precision The fraction of the documents retrieved that are relevant to the information needed, for example, how many of the positively classified are relevant
AUC

Provides insight into how well the classifier is able to separate the two classes. The implicit goal is to deal with situations where the sample distribution is highly skewed, with a tendency to overfit to a single class.

0.90 – 1 = excellent (A)

0.8 – 0.9 = good (B)

0.7 – 0.8 = fair (C)

.6 – 0.7 = poor (D)

0.5 – 0.5 = fail (F)

Here’s a summary of the metrics generated from H2O’s built-in functions for the three models that produced useful results.

Metric Model 3 GBM Model Deep Learning Model

Accuracy

(max)

0.882038

(t=0.435479)

0.876676

(t=0.442757)

0.865952

(t=0.999999)

Precision

(max)

1.0

(t=0.821606)

1.0

(t=0802184)

1.0

(t=1.0)

Recall

(max)

1.0 1.0

1.0

(t=0)

Specificity

(max)

1.0 1.0

1.0

(t=1)

Sensitivity

 

0.2033898 0.1355932

0.3898305

(t=0.5)

AUC 0.8492389 0.8630573 0.756882

Note: ‘t’ denotes threshold.

Your options at this point could be narrowed down to Model 3 and the GBM model, based on the AUC and accuracy metrics observed earlier.  If the slightly lower accuracy of the GBM model is deemed acceptable, the record label can choose to go to production with the GBM model, as it can predict a higher number of Top 10 hits.  The AUC metric for the GBM model is also higher than that of Model 3.

Record labels can experiment with different learning techniques and parameters before arriving at a model that proves to be the best fit for their business. Because deep learning models can be computationally expensive, record labels can choose more powerful EC2 instances on AWS to run their experiments faster.

Conclusion

In this post, I showed how the popular music industry can use analytics to predict the type of songs that make the Top 10 Billboard charts. By running H2O’s scalable machine learning platform on AWS, data scientists can easily experiment with multiple modeling techniques and interactively query the data using Amazon Athena, without having to manage the underlying infrastructure. This helps record labels make critical decisions on the type of artists and songs to promote in a timely fashion, thereby increasing sales and revenue.

If you have questions or suggestions, please comment below.


Additional Reading

Learn how to build and explore a simple geospita simple GEOINT application using SparkR.


About the Authors

gopalGopal Wunnava is a Partner Solution Architect with the AWS GSI Team. He works with partners and customers on big data engagements, and is passionate about building analytical solutions that drive business capabilities and decision making. In his spare time, he loves all things sports and movies related and is fond of old classics like Asterix, Obelix comics and Hitchcock movies.

 

 

Bob Strahan, a Senior Consultant with AWS Professional Services, contributed to this post.

 

 

Simplify Your Jenkins Builds with AWS CodeBuild

Post Syndicated from Paul Roberts original https://aws.amazon.com/blogs/devops/simplify-your-jenkins-builds-with-aws-codebuild/

Jeff Bezos famously said, “There’s a lot of undifferentiated heavy lifting that stands between your idea and that success.” He went on to say, “…70% of your time, energy, and dollars go into the undifferentiated heavy lifting and only 30% of your energy, time, and dollars gets to go into the core kernel of your idea.”

If you subscribe to this maxim, you should not be spending valuable time focusing on operational issues related to maintaining the Jenkins build infrastructure. Companies such as Riot Games have over 1.25 million builds per year and have written several lengthy blog posts about their experiences designing a complex, custom Docker-powered Jenkins build farm. Dealing with Jenkins slaves at scale is a job in itself and Riot has engineers focused on managing the build infrastructure.

Typical Jenkins Build Farm

 

As with all technology, the Jenkins build farm architectures have evolved. Today, instead of manually building your own container infrastructure, there are Jenkins Docker plugins available to help reduce the operational burden of maintaining these environments. There is also a community-contributed Amazon EC2 Container Service (Amazon ECS) plugin that helps remove some of the overhead, but you still need to configure and manage the overall Amazon ECS environment.

There are various ways to create and manage your Jenkins build farm, but there has to be a way that significantly reduces your operational overhead.

Introducing AWS CodeBuild

AWS CodeBuild is a fully managed build service that removes the undifferentiated heavy lifting of provisioning, managing, and scaling your own build servers. With CodeBuild, there is no software to install, patch, or update. CodeBuild scales up automatically to meet the needs of your development teams. In addition, CodeBuild is an on-demand service where you pay as you go. You are charged based only on the number of minutes it takes to complete your build.

One AWS customer, Recruiterbox, helps companies hire simply and predictably through their software platform. Two years ago, they began feeling the operational pain of maintaining their own Jenkins build farms. They briefly considered moving to Amazon ECS, but chose an even easier path forward instead. Recuiterbox transitioned to using Jenkins with CodeBuild and are very happy with the results. You can read more about their journey here.

Solution Overview: Jenkins and CodeBuild

To remove the heavy lifting from managing your Jenkins build farm, AWS has developed a Jenkins AWS CodeBuild plugin. After the plugin has been enabled, a developer can configure a Jenkins project to pick up new commits from their chosen source code repository and automatically run the associated builds. After the build is successful, it will create an artifact that is stored inside an S3 bucket that you have configured. If an error is detected somewhere, CodeBuild will capture the output and send it to Amazon CloudWatch logs. In addition to storing the logs on CloudWatch, Jenkins also captures the error so you do not have to go hunting for log files for your build.

 

AWS CodeBuild with Jenkins Plugin

 

The following example uses AWS CodeCommit (Git) as the source control management (SCM) and Amazon S3 for build artifact storage. Logs are stored in CloudWatch. A development pipeline that uses Jenkins with CodeBuild plugin architecture looks something like this:

 

AWS CodeBuild Diagram

Initial Solution Setup

To keep this blog post succinct, I assume that you are using the following components on AWS already and have applied the appropriate IAM policies:

·         AWS CodeCommit repo.

·         Amazon S3 bucket for CodeBuild artifacts.

·         SNS notification for text messaging of the Jenkins admin password.

·         IAM user’s key and secret.

·         A role that has a policy with these permissions. Be sure to edit the ARNs with your region, account, and resource name. Use this role in the AWS CloudFormation template referred to later in this post.

 

Jenkins Installation with CodeBuild Plugin Enabled

To make the integration with Jenkins as frictionless as possible, I have created an AWS CloudFormation template here: https://s3.amazonaws.com/proberts-public/jenkins.yaml. Download the template, sign in the AWS CloudFormation console, and then use the template to create a stack.

 

CloudFormation Inputs

Jenkins Project Configuration

After the stack is complete, log in to the Jenkins EC2 instance using the user name “admin” and the password sent to your mobile device. Now that you have logged in to Jenkins, you need to create your first project. Start with a Freestyle project and configure the parameters based on your CodeBuild and CodeCommit settings.

 

AWS CodeBuild Plugin Configuration in Jenkins

 

Additional Jenkins AWS CodeBuild Plugin Configuration

 

After you have configured the Jenkins project appropriately you should be able to check your build status on the Jenkins polling log under your project settings:

 

Jenkins Polling Log

 

Now that Jenkins is polling CodeCommit, you can check the CodeBuild dashboard under your Jenkins project to confirm your build was successful:

Jenkins AWS CodeBuild Dashboard

Wrapping Up

In a matter of minutes, you have been able to provision Jenkins with the AWS CodeBuild plugin. This will greatly simplify your build infrastructure management. Now kick back and relax while CodeBuild does all the heavy lifting!


About the Author

Paul Roberts is a Strategic Solutions Architect for Amazon Web Services. When he is not working on Serverless, DevOps, or Artificial Intelligence, he is often found in Lake Tahoe exploring the various mountain ranges with his family.

AWS Adds 12 More Services to Its PCI DSS Compliance Program

Post Syndicated from Sara Duffer original https://aws.amazon.com/blogs/security/aws-adds-12-more-services-to-its-pci-dss-compliance-program/

Twelve more AWS services have obtained Payment Card Industry Data Security Standard (PCI DSS) compliance, giving you more options, flexibility, and functionality to process and store sensitive payment card data in the AWS Cloud. The services were audited by Coalfire to ensure that they meet strict PCI DSS standards.

The newly compliant AWS services are:

AWS now offers 42 services that meet PCI DSS standards, putting administrators in better control of their frameworks and making workloads more efficient and cost effective.

For more information about the AWS PCI DSS compliance program, see Compliance Resources, AWS Services in Scope by Compliance Program, and PCI DSS Compliance.

– Sara