Tag Archives: AWS CloudFormation

AWS CloudFormation at AWS re:Invent 2015: Breakout Session Recap, Videos, and Slides

Post Syndicated from George Huang original http://blogs.aws.amazon.com/application-management/post/Tx1ZYD0M87D4NW0/AWS-CloudFormation-at-AWS-re-Invent-2015-Breakout-Session-Recap-Videos-and-Slide

The AWS CloudFormation team and others presented many updates and best practices during several AWS re:Invent 2015 sessions in October. We wanted to take this opportunity to show you where our presentation slides and videos are located, and to highlight a few product updates and best practices that we shared at this year’s re:Invent.

DVO304 – AWS CloudFormation Best Practices: slides and video

ARC307 – Infrastructure as Code: slides and video

DVO303 – Scaling Infrastructure Operations with AWS: slides and video

ARC401 – Cloud First: New Architecture for New Infrastructure: slides and video

DVO310 – Benefit from DevOps When Moving to AWS for Windows: slides and video

DVO401 – Deep Dive into Blue/Green Deployments on AWS: slides and video

SEC312 – Reliable Design and Deployment of Security and Compliance: slides and video

AWS CloudFormation Designer

We introduced CloudFormation Designer in early October. During our re:Invent session DVO304 (AWS CloudFormation Best Practices), we gave an overview of CloudFormation Designer and then did a live demo and walkthrough of its key features and use cases.

AWS CloudFormation Designer is a new tool that lets you visually edit your CloudFormation templates as a diagram. It provides a drag-and-drop interface for adding resources to templates, and CloudFormation Designer automatically modifies the underlying JSON when you add or remove resources. You can also use the integrated text editor to view or specify template details, such as resource property values and input parameters.

To learn more about this feature:

Watch the CloudFormation Designer portion of our re:Invent talk to see a demo

View slides 3–13 of our re:Invent talk to learn more about CloudFormation Designer

Updated resource support in CloudFormation

In the same session, we also talked about the five new resources that CloudFormation can provision, which we introduced in October. To stay up to date on CloudFormation resource support, see the list of all currently supported AWS resources.

Other topics covered in our “AWS CloudFormation Best Practices” breakout session

Using Cost Explorer to budget and estimate a stack’s cost

Collecting audit logs using the CloudTrail integration with CloudFormation

CloudFormation advanced language features

How to extend CloudFormation to resources that are not yet supported by CloudFormation

Security and user-access best practices

Best practices for writing CloudFormation templates that will be shared with teams or users who have different environments or use different AWS regions

Please reach out to us on the AWS CloudFormation forum if you have feedback or questions.

Persist Streaming Data to Amazon S3 using Amazon Kinesis Firehose and AWS Lambda

Post Syndicated from Derek Graeber original https://blogs.aws.amazon.com/bigdata/post/Tx2MUQB5PRWU36K/Persist-Streaming-Data-to-Amazon-S3-using-Amazon-Kinesis-Firehose-and-AWS-Lambda

Derek Graeber is a Senior Consultant in Big Data Analytics for AWS Professional Services

Streaming data analytics is becoming main-stream (pun intended) in large enterprises as the technology stacks have become more user-friendly to implement. For example, Spark Streaming connected to an Amazon Kinesis stream is a typical model for real-time analytics.

But one area that cannot and should not be overlooked is the need to persist streaming data (unchanged) in a reliable and durable fashion – and to do it with ease.  This blog post walks you through a simple and effective way to persist data to Amazon S3 from Amazon Kinesis Streams using AWS Lambda and Amazon Kinesis Firehose, a new managed service from AWS.

Here’s a real use case:  Hearst Publishing is a global media company behind well-known brands such as Cosmopolitan, Elle, Esquire, Seventeen, and Car and Driver, as well as television and cable entities such as A&E Networks and Esquire Network.

Hearst has embarked on a big data journey and needs to collect pertinent data from more than 200 digital sites in real time. This data gives invaluable insight into how their sites are used and indicates the most relevant trending topics based on content. Using these data points, both historical and real-time, Hearst can become much more agile in managing the content available to site users by giving key analytical data to content owners.

Hearst chose to use a well-respected cast of characters for an ETL process of streaming data: Streams, Spark on Amazon EMR, and S3. They also realized the need to store the unchanged data right from Streams in parallel to EMR-Spark.  In line with the important big data ethic “never throw data away”, all data pulled from Streams was persisted to S3 for historical reasons and so it can be re-processed either by a different consuming team or re-analyzed with a modified processing scheme in Spark. The Amazon Kinesis Client Library (KCL) and Amazon Kinesis Connector codebase provided a consistent and highly configurable way to get data from Streams to S3:

The KCL has built-in checkpointing for Streams (whether the iterator type is TRIM_HORIZON or LATEST).

The KCL integrates very easily with the Amazon Kinesis connectors.

The Connectors framework provides a way to transform, buffer, filter, and emit Amazon Kinesis records to S3 (among other supported AWS services) with ease. We can buffer data and write to S3 based on thresholds for number of records, time since last flush, or actual data buffer size.

These features make the KCL–Connector (KCL-C) very powerful and useful; it’s a very popular implementation. The KCL-C setup runs on an EC2 instance or fleet of instances and is easily managed with AWS CloudFormation and Auto Scaling.  The KCL has become the proven way to manage getting data off Streams.  The figure below shows a sample architecture with KCL.

Sample architecture with KCL.

Hearst, evaluating their AWS ecosystem, wanted to move as much as possible to AWS-provided services. With a lean development team and a focus on data science, there was an interest in not having to monitor EC2 instances. Thus the question was raised: “How can we keep the reliability of KCL-C for our data intact without having to keep tabs on EC2 instances? Can’t AWS provide a service to do this so we can focus on data science?”

In short, a perfect use case for Firehose and Lambda unfolded.  Looking at the needs of the process, reliability was critical along with the ability to buffer (aggregate) data into larger file sizes and persist to S3. The figure below illustrates a sample architecture with Firehose.

Sample architecture with Firehose.

For this post, the code is in Java, but the same approach also works in JavaScript. The code is available in its entirety in the AWS Big Data Blog repository on GitHub. Assume all services are set up in the same region. For more information, see the Amazon Kinesis Firehose Getting Started Guide.

Set up the S3 and Streams services

You need to set up a stream (representing the raw data coming in) and an S3 bucket where the data should reside. For more information, see Step 1: Create a Stream and Create a Bucket.
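The post creates these resources in the console; if you would rather script this step, a minimal sketch using the AWS SDK for Java (v1) might look like the following. The stream name, shard count, and bucket name are placeholders, not values from the original post.

// Hypothetical setup script: create the source Kinesis stream and the destination S3 bucket.
// Credentials and region are assumed to come from the default provider chain.
import com.amazonaws.services.kinesis.AmazonKinesisClient;
import com.amazonaws.services.kinesis.model.CreateStreamRequest;
import com.amazonaws.services.s3.AmazonS3Client;

public class CreateBlogResources {
    public static void main(String[] args) {
        AmazonKinesisClient kinesis = new AmazonKinesisClient();
        kinesis.createStream(new CreateStreamRequest()
                .withStreamName("blogstream")  // placeholder stream name
                .withShardCount(1));           // a single shard is enough for this walkthrough

        AmazonS3Client s3 = new AmazonS3Client();
        s3.createBucket("your-target-bucket"); // placeholder bucket name
    }
}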

Review the Lambda function with Firehose

This is where the fun happens. Take a look at the code. If you pulled the GitHub repository, it is located in the Java class com.amazonaws.proserv.lambda.KinesisToFirehose.

public class KinesisToFirehose {
    private String firehoseEndpointURL = "https://firehose.us-east-1.amazonaws.com";
    private String deliveryStreamName = "blogfirehose";
    private String deliveryStreamRoleARN = "arn:aws:iam::<AWS Acct Id>:role/firehose_blog_role";
    private String targetBucketARN = "arn:aws:s3:::dgraeberaws-blogs";
    private String targetPrefix = "blogoutput/";
    private int intervalInSec = 60;
    private int buffSizeInMB = 2;

    private AmazonKinesisFirehoseClient firehoseClient = new AmazonKinesisFirehoseClient();
    private LambdaLogger logger;

    public void kinesisHandler(KinesisEvent event, Context context) {
        logger = context.getLogger();
        // setup() creates the Firehose delivery stream if it does not already exist.
        setup();
        for (KinesisEvent.KinesisEventRecord rec : event.getRecords()) {
            logger.log("Got message ");
            // Append a newline so each Kinesis record lands on its own line in the S3 output.
            String msg = new String(rec.getKinesis().getData().array()) + "\n";
            Record deliveryStreamRecord = new Record().withData(ByteBuffer.wrap(msg.getBytes()));

            PutRecordRequest putRecordRequest = new PutRecordRequest()
                    .withDeliveryStreamName(deliveryStreamName)
                    .withRecord(deliveryStreamRecord);

            logger.log("Putting message");
            firehoseClient.putRecord(putRecordRequest);
            logger.log("Successful Put");
        }
    }
}

The following private instance variables should be configured with your particular naming conventions:

firehoseEndpointURL – The AWS endpoint where the Firehose delivery stream is hosted.  Typically, you keep the Lambda function and delivery stream in the same region.

deliveryStreamName – The actual name of the Firehose delivery stream that you are using.

deliveryStreamRoleARN – The AWS ARN of the role which you want the Firehose delivery stream to use when writing to S3.  You will create this role via the console later in this post.

targetBucketARN – The AWS ARN of the bucket to which you want Firehose to write.

targetPrefix – When writing to the S3 bucket and segmenting the object key with a prefix, add the segment in this variable. (At the time of this post, if you want a ‘/’ separator, you need to add it in this variable, for example, ‘somesegment/’.)

intervalInSec – A time-based buffer threshold. Firehose delivers the buffered data to S3 after this many seconds have passed since the last write.

buffSizeInMB – A size-based buffer threshold. Firehose delivers the buffered data to S3 once this much data has accumulated since the last write.

This Lambda function is configured to create the Firehose delivery stream if it does not already exist. In this post, you create the delivery stream manually from the console, so be careful to set the private instance variables (above) in the Lambda function to match the delivery stream you create.
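The setup() call in the handler above is not shown in the snippet. As a rough, hypothetical sketch (not the author’s exact implementation), creating the delivery stream programmatically with the AWS SDK for Java (v1) could look like the method below, added to the same class and reusing the instance variables defined earlier (with the relevant com.amazonaws.services.kinesisfirehose.model classes imported).

    // Hypothetical sketch of setup(): create the delivery stream only if it does not exist yet.
    private void setup() {
        firehoseClient.setEndpoint(firehoseEndpointURL);
        try {
            firehoseClient.describeDeliveryStream(new DescribeDeliveryStreamRequest()
                    .withDeliveryStreamName(deliveryStreamName));
        } catch (ResourceNotFoundException e) {
            // Delivery stream is missing; create it with the S3 destination and buffering hints.
            S3DestinationConfiguration s3Config = new S3DestinationConfiguration()
                    .withBucketARN(targetBucketARN)
                    .withRoleARN(deliveryStreamRoleARN)
                    .withPrefix(targetPrefix)
                    .withBufferingHints(new BufferingHints()
                            .withIntervalInSeconds(intervalInSec)
                            .withSizeInMBs(buffSizeInMB));
            firehoseClient.createDeliveryStream(new CreateDeliveryStreamRequest()
                    .withDeliveryStreamName(deliveryStreamName)
                    .withS3DestinationConfiguration(s3Config));
        }
    }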

Create the Firehose delivery stream

Now, you can create the Firehose delivery stream using the console.  For more information, see Amazon Kinesis Firehose Getting Started Guide.

Create the Firehose delivery system

In this post, I assume that you do not already have a role that grants Firehose delivery stream access, so you can create one now. In the IAM role list, choose Create new Firehose delivery IAM role.

Create new Firehose delivery IAM role

For reference, the policy associated with the role is similar to the one below:

Permission policy (Firehose role)

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "StmtDemo1",
      "Effect": "Allow",
      "Action": [
        "s3:AbortMultipartUpload",
        "s3:GetBucketLocation",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:ListBucketMultipartUploads",
        "s3:PutObject"
      ],
      "Resource": ["arn:aws:s3:::*"]
    },
    {
      "Sid": "StmtDemo2",
      "Effect": "Allow",
      "Action": ["kms:Decrypt", "kms:Encrypt"],
      "Resource": ["*"]
    }
  ]
}

Trust policy (Firehose role)

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "StmtDemo3",
      "Effect": "Allow",
      "Principal": {"Service": "firehose.amazonaws.com"},
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "YOURACCTID"
        }
      }
    }
  ]
}

Finish configuring the Firehose delivery stream with the indicated configuration. For this post, set limits of 2 MB and 60 seconds for the buffer size and buffer interval, respectively.

Finishing configuring the Firehose stream

NOTE: For this post, you will not be compressing or encrypting the data when writing to S3 from Firehose.  Your actual implementation may vary.

Confirm and create the Firehose delivery stream

To summarize the configuration, you are:

Defining a name for the Firehose delivery stream

Defining the targeted S3 bucket for output

Adding an S3 prefix to the bucket

Defining the buffer thresholds – in this case, they are 60 seconds and 2 MB (whichever comes first)

Not compressing or encrypting the output data

Create the Lambda JAR distribution

Verify that your instance variables match between your Lambda function and your newly created Firehose delivery stream.  Create the JAR file that Lambda will need.  Because this is a Java project with Maven, execute the mvn clean package task from your project root directory.  Lambda runs Java 8, so you need to compile against the Java 8 JDK.

Create the Lambda function

Now that you have the Lambda code ready to run, create the function itself.  You can do this via CLI or console.  For more information, see Getting Started: Authoring AWS Lambda Code in Java.

When you create the Lambda role that has a Lambda trust relationship, make sure that the policy has access to both Firehose and Streams. Here is an example:

Permissions policy (Lambda role)

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["logs:*"],
      "Resource": "arn:aws:logs:*:*:*"
    },
    {
      "Effect": "Allow",
      "Action": ["kinesis:*", "firehose:*"],
      "Resource": ["arn:aws:kinesis:*:*:*", "arn:aws:firehose:*:*:*"]
    }
  ]
}

Trust policy (Lambda role)

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "StmtDemo4",
      "Effect": "Allow",
      "Principal": {"Service": "lambda.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }
  ]
}

I won’t cover creating a Lambda function in depth in this post, but here are the highlights:

Use the newly created .jar file. The Handler value should be com.amazonaws.proserv.lambda.KinesisToFirehose::kinesisHandler.

Use the role you just created, which has policy access to Firehose and Streams and the Lambda trust relationship (directly above).

Use the defaults for Memory and Timeout.

After the upload, the Lambda function is in place; all you need to do is set the listener.

On the Event Sources tab under your new Lambda function, add an event source that is the Amazon Kinesis stream you created earlier.  Select a Streams input, add your stream name, and leave the defaults. You are now connected.

Populate streams and verify results

The only thing left to do is add data to the stream and watch the S3 bucket fill up. In the Java project from Git, a helper class pumps dummy messages to Streams (com.amazonaws.proserv.PopulateKinesisData).  If you are running it from your local repository, add your access key information to the resources/AwsCredentials.properties file.  If you are running it from EC2, make sure the role on the instance has Streams permissions. 
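If you would rather write your own quick test producer than use the helper class, a minimal sketch with the AWS SDK for Java (v1) might look like the following; the stream name and message payload are placeholders, not values from the original project.

// Hypothetical test producer: push a few dummy records onto the source stream.
import java.nio.ByteBuffer;
import com.amazonaws.services.kinesis.AmazonKinesisClient;
import com.amazonaws.services.kinesis.model.PutRecordRequest;

public class DummyProducer {
    public static void main(String[] args) {
        AmazonKinesisClient kinesis = new AmazonKinesisClient();
        for (int i = 0; i < 100; i++) {
            String payload = "{\"messageId\":" + i + ",\"body\":\"hello firehose\"}";
            kinesis.putRecord(new PutRecordRequest()
                    .withStreamName("blogstream")          // placeholder stream name
                    .withPartitionKey(Integer.toString(i)) // spread records across shards
                    .withData(ByteBuffer.wrap(payload.getBytes())));
        }
    }
}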

After you start adding messages to the stream and the thresholds are hit (2 MB or 60 seconds, whichever comes first), you will see your targeted S3 bucket begin to populate under the prefix that you designated, with object keys that encode the year, month, day, and hour in which Firehose wrote each output file (prefix/yyyy/mm/dd/hr/*).
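If you prefer to verify the output programmatically instead of browsing the console, a small sketch like this (again assuming the AWS SDK for Java v1; the bucket name and prefix are placeholders) lists whatever Firehose has delivered so far:

// Hypothetical verification helper: list the objects Firehose has written under the prefix.
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;

public class VerifyFirehoseOutput {
    public static void main(String[] args) {
        AmazonS3Client s3 = new AmazonS3Client();
        ObjectListing listing = s3.listObjects("your-target-bucket", "blogoutput/"); // placeholders
        for (S3ObjectSummary summary : listing.getObjectSummaries()) {
            System.out.println(summary.getKey() + " (" + summary.getSize() + " bytes)");
        }
    }
}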

Conclusion

In this post, I have shown you how to create a reliable way to persist data from Streams to Amazon S3 using the new managed service Firehose.  Firehose removes the need to manage compute servers and builds on some of the most-used tenets of streaming data persistence:

Aggregate data based on thresholds.

Persist data to a durable repository (in this case, S3).

The Hearst Publishing use case provided a way to reliably persist data from Streams to S3 with an aggregated output that modeled their current scheme – all with a native AWS service. As the data source was Streams, the Firehose service could run in parallel to the existing real-time data processing scheme with no impact.

If you have questions or suggestions, please  leave a comment below.

—————

Related:

How Expedia Implemented Near Real-time Analysis of Interdependent Datasets

 


AWS OpsWorks at re:Invent 2015

Post Syndicated from Daniel Huesch original http://blogs.aws.amazon.com/application-management/post/Tx3B33V56JTM4B2/AWS-OpsWorks-at-re-Invent-2015

re:Invent 2015 is right around the corner. Here’s an overview of the AWS OpsWorks breakout sessions and bootcamp.

DVO301 – AWS OpsWorks Under the Hood

AWS OpsWorks helps you deploy and operate applications of all shapes and sizes. With OpsWorks, you can create your application stack with layers that define the building blocks of your application: load balancers, application servers, databases, etc. But did you know that you can also use OpsWorks to run commands or scripts on your instances? Whether you need to perform a specific task or install a new software package, AWS OpsWorks gives you the tools to install and configure your instances consistently and help them evolve in an automated and predictable fashion. In this session, we explain how lifecycle events work, how to create custom layers and a runtime system for your operational tooling, and how to develop and test locally.

DVO310 – Benefit from DevOps When Moving to AWS for Windows

In this session, we discuss DevOps patterns of success that favor automation and drive consistency from the start of your cloud journey. We explore two key concepts that you need to understand when moving to AWS: pushing and running code. We look at Windows-specific features of services like AWS CodeDeploy, AWS CloudFormation, AWS OpsWorks, and AWS Elastic Beanstalk, and supporting technologies like Chef, PowerShell, and Visual Studio. We also share customer stories about fleets of Microsoft Windows Server that successfully operate at scale in AWS.

Taking AWS Operations to the Next Level Bootcamp

This full-day bootcamp is designed to teach solutions architects, SysOps administrators, and other technical end users how to leverage AWS CloudFormation, AWS OpsWorks, and AWS Service Catalog to automate provisioning and configuring AWS infrastructure resources and applications. In this bootcamp, we build and deploy an end-to-end automation system that provides hands-off failure recovery for key systems.

re:Invent is a great opportunity to talk with AWS teams. As in previous years, you will find OpsWorks team members at the Application Management booth. Drop by and ask for a demo!

Didn’t register before the conference sold out? All sessions will be recorded and posted on YouTube after the conference and all slide decks will be posted on SlideShare.net.


Analyze Data with Presto and Airpal on Amazon EMR

Post Syndicated from Songzhi Liu original https://blogs.aws.amazon.com/bigdata/post/Tx1BF2DN6KRFI27/Analyze-Data-with-Presto-and-Airpal-on-Amazon-EMR

Songzhi Liu is a Professional Services Consultant with AWS

You can now launch Presto version 0.119 on Amazon EMR, allowing you to easily spin up a managed EMR cluster with the Presto query engine and run interactive analysis on data stored in Amazon S3. You can integrate with Spot instances, publish logs to an S3 bucket, and use EMR’s configure API to configure Presto. In this post, I’ll show you how to set up a Presto cluster and use Airpal to process data stored in S3.

What is Presto?

Presto is a distributed SQL query engine optimized for ad hoc analysis. It supports the ANSI SQL standard, including complex queries, aggregations, joins, and window functions. Presto can run on multiple data sources, including Amazon S3. Presto’s execution framework is fundamentally different from that of Hive/MapReduce: Presto has a custom query and execution engine where the stages of execution are pipelined, similar to a directed acyclic graph (DAG), and all processing occurs in memory to reduce disk I/O. This pipelined execution model can run multiple stages in parallel and streams data from one stage to another as the data becomes available. This reduces end-to-end latency and makes Presto a great tool for ad hoc data exploration over large datasets.

What is Airpal?

Airpal is a web-based query execution tool open-sourced by Airbnb that leverages Presto to facilitate data analysis. Airpal has many helpful features. For example, you can highlight syntax, export results to CSV for download, view query history, save queries, use a Table Finder to search for appropriate tables, and use Table Explorer to visualize the schema of a table. We have created an AWS CloudFormation script that makes it easy to set up Airpal on an Amazon EC2 instance on AWS.

For this blog post, we will use Wikimedia’s page count data, which is publicly available at ‘s3://support.elasticmapreduce/training/dataset/wikistats/’. This data is in text file format. We will also convert the table to Parquet and ORC.

Spin up an EMR cluster with Hive and Presto installed

First, log in to the AWS console and navigate to the EMR console. Choose EMR-4.1.0 and Presto-Sandbox. Make sure you provide SSH keys so that you can log into the cluster.

Note: Write down the DNS name after creation is complete. You’ll need this for the next step.

 

Use AWS CloudFormation to deploy the Airpal server

Make sure you have a valid Key Pair for the region in which you want to deploy Airpal.

Navigate to AWS CloudFormation, click Create New Stack, name your stack, and choose Specify an Amazon S3 template URL.

Use the template in https://s3-external-1.amazonaws.com/emr.presto.airpal/scripts/deploy_airpal_env.json

Click Next and configure the parameters.

Important parameters you should configure:

PrestoCoordinatorURL – Use the DNS name you noted earlier, in the format http://<DNS Name of the cluster>:<Port of Presto>. The default port for the Presto installation is 8889.

Example: http://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:8889

AirpalPort – The port on which the Airpal server should run. The default is 8193. Adjust this according to your firewall settings and make sure it’s not blocked.

S3BootstrapBuckets – The name of the S3 bucket that holds the bootstrap scripts. There is no need to change the default value of emr.presto.airpal.

InstallAirpal – The path to the installation script for the Airpal server. There is no need to change the default value of scripts/install_airpal.sh.

StartAirpal – The path to the startup script for the Airpal server. There is no need to change the default value of scripts/start_airpal.sh.

MyKeyPairName – A valid key pair that you have in this region. You’ll use this to log in to the master node.

Click Next and add a tag to the stack if needed. Select the check box for IAM policy and click Create.

Wait 5–10 minutes after the stack status changes to CREATE_COMPLETE. (The server configuration takes longer than the stack creation.)

Navigate to the EC2 console, select the Airpal server instance, and note its public IP address.

Open a browser and go to <PublicIP>:<Airpal Port> to reach Airpal. Make sure that port 8889 is allowed on the master security group for your EMR cluster.

Log in to the master node and run Hive scripts

Presto ships with several connectors. To query data from Amazon S3, you use the Hive connector. Presto uses Hive only to create the metadata; its execution engine is different from Hive’s. By default, when you install Presto on your cluster, EMR installs Hive as well. The metadata is stored in a database such as MySQL and is accessed through the Hive metastore service, which is also installed.

The dataset, around 7 GB, contains hit data for Wikipedia pages. The schema is as follows:

Language of the page

Title of the page

Number of hits

Retrieved page size

Define the schema

To define the schema:

Log in to the master node using the following command in the terminal:

ssh -i YourKeyPair.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com

Replace YourKeyPair.pem with the location and name of your pem file. Replace ec2-xx-xx-xx-xx.compute-1.amazonaws.com with the public DNS name of your EMR cluster.

Type “hive” in the command line to enter Hive interactive mode and run the following commands:

CREATE EXTERNAL TABLE wikistats (
    language STRING,
    page_title STRING,
    hits BIGINT,
    retrived_size BIGINT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
LINES TERMINATED BY '\n'
LOCATION 's3://support.elasticmapreduce/training/datasets/wikistats/';

Now you have created a “wikistats” table over the delimited text files. You can also store this table in the Parquet format using the following command:

CREATE EXTERNAL TABLE wikistats_parq (
    language STRING,
    page_title STRING,
    hits BIGINT,
    retrived_size BIGINT
)
STORED AS PARQUET
LOCATION 's3://emr.presto.airpal/wikistats/parquet';

You can store it in the compressed ORC format using the following command:

CREATE EXTERNAL TABLE wikistats_orc (
    language STRING,
    page_title STRING,
    hits BIGINT,
    retrived_size BIGINT
)
STORED AS ORC
LOCATION 's3://emr.presto.airpal/wikistats/orc';

Now we have three tables holding the same data in three different formats.

Try Presto in Airpal

Open a browser and go to ‘http://<ip address of the ec2 instance>:8193’.

You will use Presto queries to answer the questions below. Paste the following queries into the Airpal query editor.

What is the most frequently viewed page with page_title that contains “Amazon”?

SELECT language,
       page_title,
       SUM(hits) AS hits
FROM default.wikistats
WHERE language = 'en'
  AND page_title LIKE '%Amazon%'
GROUP BY language,
         page_title
ORDER BY hits DESC
LIMIT 10;

 

On average, what page is hit most in English?

SELECT language, page_title, AVG(hits) AS avg_hits
FROM default.wikistats
WHERE language = 'en'
GROUP BY language, page_title
ORDER BY avg_hits DESC
LIMIT 10;

Try wikistats_orc and wikistats_parq with the same query. Do you see any difference in performance?

Go back to Airpal and view the results. The top records are Main_Page, Special:, 404_error, and so on, which we don’t really care about. These entries are noise, so filter them out in your query:

SELECT language, page_title, AVG(hits) AS avg_hits
FROM default.wikistats
WHERE language = 'en'
  AND page_title NOT IN ('Main_Page', '404_error/')
  AND page_title NOT LIKE '%Special%'
  AND page_title NOT LIKE '%index%'
  AND page_title NOT LIKE '%Search%'
  AND NOT regexp_like(page_title, '%20')
GROUP BY language, page_title
ORDER BY avg_hits DESC
LIMIT 10;

Using the Presto CLI

You can also use the Presto CLI directly on the EMR cluster to query the data.

Log in to the master node using the following command in the terminal:

ssh -i YourKeyPair.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com

Replace YourKeyPair.pem with the location and name of your pem file. Replace ec2-xx-xx-xx-xx.compute-1.amazonaws.com with the public DNS name of your EMR cluster.

Assuming you already defined the schema using Hive, start the Presto-CLI.

Run the following command:

$ presto-cli --catalog hive --schema default

Check to see if the table is still there.

Try the same query you tried earlier.

SELECT language, page_title, AVG(hits) AS avg_hits
FROM default.wikistats
WHERE language = 'en'
  AND page_title NOT IN ('Main_Page', '404_error/')
  AND page_title NOT LIKE '%Special%'
  AND page_title NOT LIKE '%index%'
  AND page_title NOT LIKE '%Search%'
  AND NOT regexp_like(page_title, '%20')
GROUP BY language, page_title
ORDER BY avg_hits DESC
LIMIT 10;

As you can see, you can also execute the query from the Presto CLI.
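Beyond Airpal and the Presto CLI, you can also query the cluster from application code. As a rough sketch, assuming the Presto JDBC driver is on your classpath and that port 8889 on the master node is reachable from where the code runs, a Java client might look like this; the host name is a placeholder:

// Hypothetical JDBC client for the Presto coordinator on the EMR master node.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PrestoJdbcExample {
    public static void main(String[] args) throws Exception {
        // Placeholder host; use your cluster's master public DNS name.
        String url = "jdbc:presto://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:8889/hive/default";
        try (Connection conn = DriverManager.getConnection(url, "hadoop", null);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT language, page_title, SUM(hits) AS hits "
                   + "FROM wikistats WHERE language = 'en' "
                   + "GROUP BY language, page_title ORDER BY hits DESC LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("page_title") + "\t" + rs.getLong("hits"));
            }
        }
    }
}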

Summary

Presto is a distributed SQL query engine optimized for ad hoc analysis and data exploration. It supports the ANSI SQL standard, including complex queries, aggregations, joins, and window functions. In this post, I’ve shown you how easy it is to set up an EMR cluster with Presto 0.119, create metadata using Hive, and use either the Presto CLI or Airpal to run interactive queries.

If you have questions or suggestions, please leave a comment below.

——————————————-

Related

Large-Scale Machine Learning with Spark on Amazon EMR

——————————————–

Love to work on open source? Check out EMR’s careers page.

 


Extending Seven Bridges Genomics with Amazon Redshift and R

Post Syndicated from Christopher Crosbie original https://blogs.aws.amazon.com/bigdata/post/TxB9H9MGP4JBBQ/Extending-Seven-Bridges-Genomics-with-Amazon-Redshift-and-R

Christopher Crosbie is a Healthcare and Life Science Solutions Architect with Amazon Web Services

The article was co-authored by Zeynep Onder, Scientist, Seven Bridges Genomics, an AWS Advanced Technology Partner.

“ACTGCTTCGACTCGGGTCCA“

That is probably not a coding language readily understood by many reading this blog post, but it is a programming framework that defines all life on the planet. These letters are known as base pairs in a DNA sequence and represent four chemicals found in all organisms. When put into a specific order, these DNA sequences contain the instructions that kick off processes which eventually render all the characteristics and traits (also known as phenotypes) we see in nature.

Sounds simple enough. Just store the code, perform some complex decoding algorithms, and you are on your way to a great scientific discovery. Right? Well, not quite. Genomics analysis is one of the biggest data problems out there.

Here’s why: You and I have around 20,000 – 25,000 distinct sequences of DNA (genes) that create proteins and thus contain instructions for every process from development to regeneration. This is out of the 3.2 billion individual letters (bases) in each of us. Thousands of other organisms have also been decoded and stored in databases because comparing genes across species can help us understand the way these genes actually function. Algorithms such as BLAST that can search DNA sequences from more than 260,000 organisms containing over 190 billion bases are now commonplace in bioinformatics. It has also been estimated that the total amount of DNA base pairs on Earth is somewhere in the range of 5.0 x 10^37, or 50 trillion trillion trillion DNA base pairs. WOW! Forget about clickstream and server logs—nature has given us the ultimate Big Data problem.

Scientists, developers, and technologists from a variety of industries have all chosen AWS as their infrastructure platform to meet the challenges of data volume, data variety, and data velocity. It should be no surprise that the field of genomics has also converged on AWS tooling for meeting their big data needs. From storing and processing raw data from the machines that read it (sequencers) to research centers that want to build collaborative analysis environments, the AWS cloud has been an enabler for scientists making real discoveries in this field.

For blog readers who have not yet encountered this life sciences data, “genomics” is probably a term you have heard a lot in the past year – from President Obama’s precision medicine initiative to many mainstream news articles to potentially as part of your own healthcare. That is why for this Big Data Blog post, we’ll provide non-bioinformatics readers with a starting point for gaining an understanding of what is often seen as a clandestine technology, while at the same time offering those in the bioinformatics field a fresh look at ways the cloud can enhance their research.

We will walk through an end-to-end genomics analysis using the platform from Seven Bridges Genomics, an AWS Advanced Technology Partner that offers a solution for data scientists looking to analyze large-scale, next-generation sequencing (NGS) data securely. We will also demonstrate how easy it is to extend this platform to use the additional analytics capabilities offered by AWS. First, we will identify the genetic “variants” in an individual—what makes them unique—and then conduct a comparison of variation across a group of people.

Seven Bridges Genomics platform

The Seven Bridges platform is a secure, end-to-end solution in itself. It manages all aspects of analysis, from data storage and logistics (via Amazon S3), to user tracking, permissions, logging pipeline executions to support reproducible analysis, and visualizing results.

We show you how to use the visual interface of the platform. We like to think of this as an integrated development environment for bioinformatics, but each step in this workflow can also be performed via API. The Seven Bridges platform also offers an easy way to take the output of your analysis to other applications so you can extend its functionality. In this case, we will do additional work using Amazon Redshift and R.

You can follow along with this blog post by signing up for a free Seven Bridges trial account that comes with $100 of free computation and storage credits. This is enough for you to perform variant calling on about 50 whole exomes. In other words, you can modify and re-run the steps in this post 50 different ways before having to pay a dime.

Setting up our Seven Bridges environment

First, we will start with sequencing the 180,000 regions of the genome that make up protein-coding genes, also known as whole exome DNA sequencing. Our starting point will be a file in the FASTQ format, a text-based format for storing the biological sequences of A, C, T, and Gs.

We’ll use publicly available data from the 1000 Genomes project as an example, but you could just as easily upload your own FASTQ files. We also show you how to quickly modify the analysis to modify tool parameters or start with data in other formats. We walk you through the main points here, but the Seven Bridges Quick Start guide or this video also provide a tutorial.

The first step to run an analysis on the Seven Bridges platform is to create a project. Each project corresponds to a distinct scientific investigation, serving as a container for its data, analysis pipelines, results and collaborators. Think of this as our own private island where we can invite friends (collaborators) to join us. We maintain fine-grained control over what our friends can do on our island, from seeing files to running executions.

 Adding genomics data

Next, we need to add the data we need to analyze to the project. There are several ways to add files to the platform; we can add them using the graphical interface, the command line interface, or via FTP/HTTP. Here, we analyze publicly available 1000 Genome Project data that is available in the Public Files hosted on the Seven Bridges platform (which are also available on a public S3 bucket). The Seven Bridges Public Files repository contains reference files, data sets, and other frequently used genomic data that you might find useful. For this analysis, we’ll add two FASTQ files to our project. Because the two files have been sequenced together on the same lane, we set their Lane/Slide metadata same so that the pipeline can process them together.

Sequencing analysis typically works by comparing a unique FASTQ file to a complete reference genome. The Seven Bridges platform has the most common reference files pre­loaded so we won’t need to worry about this now. Of course, additional reference files can be added if we want to further customize our analysis.

Pipelines abound: Customizable conduits for GATK and beyond

Now that the data is ready, we need a pipeline to analyze them. The Seven Bridges platform comes pre­loaded with a broad range of gold standard bioinformatics pipelines that allow us to execute analyses immediately according to community best practices. We can use these as templates to immediately and reproducibly perform complex analyses routines without having to install any tools or manage data flow between them. Using the SDK, we can also put virtually any custom tool on the platform, and create completely custom pipelines.

In this example, we use the Whole Exome Analysis ­ BWA + GATK 2.3.9­Lite (with Metrics) as our starting point. In this pipeline, we first align FASTQ files to a reference using a software package known as the Burrows-Wheeler Aligner (BWA). Next, alignments are refined according to GATK best practices before we start to make the determination between what makes our FASTQ file different from the reference file. This step of looking for differences between our file and the reference file is known as variant calling and typically produces a  Variant Call Format (VCF) file.

While public pipelines are set based on best practices, we can tweak them to suit our needs using the pipeline editor.

Analyzing the source code of a person

Next, we run the analysis. Choosing Run on the pipeline brings us to the task execution page where we can select the files to analyze, and set any modifiable parameters (executions can also be started via the API). We also need to add any reference and annotation files required by our pipeline. Because we are using a Seven Bridges optimized pipeline as the starting point, the recommended reference files can all be selected and added with one click rather than hunting these down from public repositories.

After we choose Run, our optimized execution is performed and we receive an email in an hour or so, letting us know that it has been completed.

After the execution completes, we can return to the Tasks tab to see the inputs, parameters, and output files of our analysis. Everything is kept in one place so that we can return in months or years and always know the precise analysis that was performed. The pipeline we executed returns both aligned bam (alignment) files and VCFs, so if we decide to change the variant calling in the future, we don’t need to re­run the alignment.

Extending from individual analysis to population studies in AWS

The walkthrough thus far ran an analysis on an individual level, but what if we wanted to expand our analysis to an entire population?  AWS and Seven Bridges together offer a wealth of tools that are a perfect fit for this type of study. Let’s start with Amazon Redshift and the open source language R to see just a couple examples of how this might work.

First, we select all the resulting VCF files from the file browser and select get links. This generates signed URLs pointing to our file’s location in Amazon S3. Note that these links expire in 24 hours, so we want to use them right away.

As links will vary based on the 1000 genomes file picked and individual accounts, the rest of this post uses the chromosome 1 VCF file stored on Amazon S3 as an example.

Launching Bioconductor

A very common way to get started analyzing genomics data is by using the R software package, Bioconductor. Bioconductor is built for working with genomics data and has built-in functionality for interpreting many of the common genomics file types. It is also very simple to get started with Bioconductor in the cloud because an AWS CloudFormation template is provided and maintained for quickly provisioning an environment that is ready to go.

To launch a Bioconductor environment, choose AWS CloudFormation from the AWS Management Console, create a new stack, and enter the following value for Specify an Amazon S3 template URL under Template:

https://s3.amazonaws.com/bioc-cloudformation-templates/start_ssh_instance.json

We must be in the us-east-1 region for this template to function correctly; be aware that we must choose an instance type that supports an ‘HVM instance store’ so we avoid the micro instance type.

After the CloudFormation stack has been launched, we should see a status of “CREATE_COMPLETE” and an Outputs tab should be exposed which contains the following important pieces of data:

URL – The URL to click to begin working with Bioconductor and R.

Username – User name to gain entry to the website provided by the URL.

Password –  Password to gain entry to the website provided by the URL.

SSH Command – The command we can copy/paste into a terminal with SSH installed.

Before we jump into R, let’s add a couple items to our server that will come in handy later as we interact with AWS services from R. Connect to the instance using the SSH command provided by the outputs of the CloudFormation template. After we are in the instance, we run the following command:

sudo apt-get -y install libpq-dev

While logged into the instance, we can also use this opportunity to configure the AWS CLI. For the purposes of this post we’ll use the keys of a user who has read and write access to an Amazon S3 bucket.

Load the VCF into the R cloud

We are now ready to get started with R and Bioconductor by clicking the URL provided in the CloudFormation output. After we log in, we should be able to interact with a fully configured and pre-loaded RStudio environment.

First we need to pull the relevant VCF files into our Bioconductor environment. In this sample code, we use the VCF files provided in the 1000 Genomes project hosted on AWS but the links could be switched with those found in the file browser of the Seven Bridges Genomics platform.

#RCurl to retrieve data from S3
library(RCurl)

#The VariantAnnotation package
#from Bioconductor will give us additional functionality for working with #VCF files.
library(VariantAnnotation)

#No need to install these packages as they are part of the pre-configured
# Bioconductor install

#VCF files are composed of two parts. The data and the index stored in the #tbi file.
#The VariantAnnotation package relies on having both files side by side for #performing fast data lookups

#pull the binaries of the data file into R’s memory
bin_vcf <- getBinaryURL
("http://1000genomes.s3.amazonaws.com/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz", ssl.verifypeer=FALSE)

#pull the binaries of the index file into R’s memory
bin_vcf_tabix <- getBinaryURL("http://1000genomes.s3.amazonaws.com/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi",
ssl.verifypeer=FALSE)

#Write both files to the working directory of R.
#use getwd() to see the current directory and use setwd() to change location.
con <- file("imported_vcf.gz", open = "wb")
writeBin(bin_vcf, con)
close(con)

con <- file("imported_vcf.gz.tbi", open = "wb")
writeBin(bin_vcf_tabix, con)
close(con)

#Create strings specifying the file name location that can be used by R
vcf_file <- paste(getwd(), "imported_vcf.gz", sep="/")
vcf.tbi <- paste(getwd(), "imported_vcf.gz.tbi", sep="/")

#use functionality of VariantAnnotation to review the header
#of the VCF file.
info(scanVcfHeader(vcf_file))

Extracting Relevant Variant Data

As a trivial example to show the type of manipulation a bioinformatisis may want to do with Bioconductor before sending the data off for other AWS Analytics services for further analysis, lets filter by a range within the chromosome 1 vcf file and then extract the variations in genes that were actually found.

#filter by a region in the chromosome
loc <- GRanges(1, IRanges(232164611, 233164611))
params <- ScanVcfParam(which=loc)

#read the VCF file into an R object of type VCF.
#the “hg19” is the genome identifier
# and tells bioconductor to use the 19th version of the reference file
#from the human genome project.
vcf <- readVcf(TabixFile(vcf_file), "hg19", params)

#use fixed to pull out the REF, ALT, QUAL and FILTER fields of the VCF and parse
#them into a dataframe.
#Using this as an easy to interpret example.
variant_df <- fixed(vcf)

#to get an idea of what is contained in this dataset, run
(variant_df)

The results of this code should produce a sample similar to the following:

Using Bioconductor and R to send variants to Amazon Redshift

This appears to be a very structured data set that we may want to collect on an ongoing basis from multiple VCFs to perform large population studies later. This is a scenario that aligns perfectly with the AWS service. For more information about how Amazon Redshift relates to R as well as details on how to connect to Amazon Redshift from R, see Connecting R with Amazon Redshift.

To move this dataset effectively into Amazon Redshift, we want to use the COPY command of Amazon Redshift and take advantage of the wide bandwidth that S3 provides. Therefore, the first step is moving our fixed variant data frame into S3.

#move into S3
variant_df <- fixed(vcf)

#export the dataframe to a CSV
write.csv(variant_df, file = "my_variants.csv")

#this step requires that AWS configure was previously run on the
#EC2 instance with user keys that have read/write on YOUR-BUCKET
system("aws s3 mv my_variants.csv s3://YOUR-BUCKET/my_variants.csv", intern = TRUE)

Now that this data frame is available to us in Amazon S3, we can open a connection to Amazon Redshift and load the filtered variant data.

#on first run,
#we need to install the RPostgreSQL package.
#install.packages("RPostgreSQL")
#open a connection to the Amazon Redshift cluster
library(RPostgreSQL)
con <- dbConnect(dbDriver("PostgreSQL"),
host="YOUR-redshift-cluster1.xx.region.amazonaws.com",
user= "USER_NAME",
password="PASSWORD",
dbname="DATABASE_NAME",
port = 5439 )
#Send a query to create
#the table which will hold our variants
dbGetQuery(con,
"CREATE TABLE SAMPLE_VARIANTS
( ROW_ID int, REF varchar(50),
ALT varchar(50), QUAL int,
FILTER varchar(5) )"
)

#Use the Amazon Redshift COPY
#command to load the table from S3
dbGetQuery(con,
“ copy SAMPLE_VARIANTS from
‘s3://YOUR-BUCKET/my_variants.csv’
credentials ‘aws_access_key_id=YOUR_KEY;
aws_secret_access_key=YOUR_SECRET_KEY’ csv
IGNOREHEADER 1 maxerror as 250;" )

Running an analysis against Amazon Redshift from R

Now that the data has been loaded into Amazon Redshift, we can continue to add additional rows from more VCFs that are relevant to the region of the chromosome we wish to study, without having to worry about scaling since Amazon Redshift can easily query petabytes of data. We now also have the ability to write simple SQL against our dataset to find answers to our questions. For example, we can quickly ask what the top gene variations are in this chromosome region with the following statement:

top_variations <- dbGetQuery(con,
"select ref || ‘->’ || alt as variation, count(*) as variations_in_vcf from SAMPLE_VARIANTS group by ref,alt order by variations_in_vcf DESC"
)

The most common variation found in our chromosome range is that a C becomes a T. Most of our top changes are a single letter change, which is also known as a single-nucleotide polymorphism or SNP (pronounced “snip”).

In addition to massive scale-out capabilities, there’s another advantage to collecting relevant results in a single Amazon Redshift table: we are storing the specific variant data for the chromosome region we want to study in an easy-to-interpret and simple-to-query format. This makes it easy to cut our massive datasets with SQL and hone in on the relevant cohorts or regions of data we want to explore further. After we have used SQL to focus our data, we can return to R and run a more sophisticated analysis.

As a spurious example of what this might look like, let’s dig into our SNP example above. Simple counts let us know that the SNPs are certainly the common form of variation in the set. But what if we wanted to see if there were base pairs changes in the SNPs that tended to cluster together? We could examine this by running a correspondence analysis, which is a graphical method of exploring the relationship between variables in a contingency table. In this pseudo-analysis, we treat our ref and alt variables as categorical variables and produce a visualization that displays how our ref and alts move together in chi-squared space.

single_changes = "select ref , alt,
count(*) as variations_in_vcf from SAMPLE_VARIANTS
WHERE len(ref) = 1 and len(alt) = 1
group by ref,alt
order by variations_in_vcf DESC"
count_single_variations <- dbGetQuery(con,single_changes)
library(ca)
count_single_variations
correspondance_table <- with(count_single_variations, table(ref,alt)) # create a 2 way table
prop.table(correspondance_table, 1) # row percentages
prop.table(correspondance_table, 2) # column percentages
fit <- ca(correspondance_table)
print(fit) #basic results
plot(fit, mass = TRUE, contrib = "absolute", map =
"rowgreen", arrows = c(FALSE, TRUE)) # asymmetric map

Conclusion

We are in the midst of a fundamental change in the accessibility and usability of data. Massive datasets once thought to only be archival in nature or impossible to interpret are being opened up for examination due to the processing power of the cloud. The genes that control the makeup of all living creatures are no exception and are finally falling under the same scrutiny.

We hope that this walkthrough has given the non-bioinformatics reader a quick peek under the hood of this fascinating field. We also hope it has provided insight into what it means to begin to understand the data in your DNA. Armed with this knowledge and an improved genomics vocabulary, you can be part of the conversation surrounding the precision medicine revolution.

Additionally, we hope the seasoned bioinformatics reader has been exposed to new tools that could accelerate research and has been empowered to explore the capabilities of Amazon Redshift and the Seven Bridges Genomics platform. This post has just scratched the surface. There are over 30 tested, peer-reviewed pipelines and 400 open source applications available for analyzing multidimensional, next-generation sequencing data integrated into the Seven Bridges Genomics platform.

If you have questions or suggestions, please leave a comment below.

————————-

Related

Connecting R with Amazon Redshift

Extending Seven Bridges Genomics with Amazon Redshift and R

Post Syndicated from Christopher Crosbie original https://blogs.aws.amazon.com/bigdata/post/TxB9H9MGP4JBBQ/Extending-Seven-Bridges-Genomics-with-Amazon-Redshift-and-R

Christopher Crosbie is a Healthcare and Life Science Solutions Architect with Amazon Web Services

The article was co-authored by Zeynep Onder, Scientist, Seven Bridges Genomics, an AWS Advanced Technology Partner.

“ACTGCTTCGACTCGGGTCCA“

That is probably not a coding language readily understood by many reading this blog post, but it is a programming framework that defines all life on the planet. These letters are known as base pairs in a DNA sequence and represent four chemicals found in all organisms. When put into a specific order, these DNA sequences contain the instructions that kick off processes which eventually render all the characteristics and traits (also known as phenotypes) we see in nature.

Sounds simple enough. Just store the code, perform some complex decoding algorithms, and you are on your way to a great scientific discovery. Right? Well, not quite. Genomics analysis is one of the biggest data problems out there.

Here’s why: You and I have around 20,000 – 25,000 distinct sequences of DNA (genes) that create proteins and thus contain instructions for every process from development to regeneration. This is out of the 3.2 billion individual letters (bases) in each of us. Thousands of other organisms have also been decoded and stored in databases because comparing genes across species can help us understand the way these genes actually function. Algorithms such as BLAST that can search DNA sequences from more than 260,000 organisms containing over 190 billion bases are now commonplace in bioinformatics. It has also been estimated that the total number of DNA base pairs on Earth is somewhere in the range of 5.0 x 10^37, or 50 trillion trillion trillion. WOW! Forget about clickstream and server logs—nature has given us the ultimate Big Data problem.

Scientists, developers, and technologists from a variety of industries have all chosen AWS as their infrastructure platform to meet the challenges of data volume, data variety, and data velocity. It should be no surprise that the field of genomics has also converged on AWS tooling for meeting their big data needs. From storing and processing raw data from the machines that read it (sequencers) to research centers that want to build collaborative analysis environments, the AWS cloud has been an enabler for scientists making real discoveries in this field.

For blog readers who have not yet encountered this life sciences data, “genomics” is probably a term you have heard a lot in the past year – from President Obama’s precision medicine initiative to many mainstream news articles to potentially as part of your own healthcare. That is why for this Big Data Blog post, we’ll provide non-bioinformatics readers with a starting point for gaining an understanding of what is often seen as a clandestine technology, while at the same time offering those in the bioinformatics field a fresh look at ways the cloud can enhance their research.

We will walk through an end-to-end genomics analysis using the Seven Bridges Genomics platform, an AWS advanced technology partner that offers a solution for data scientists looking to analyze large-scale, next-generation, sequencing (NGS) data securely. We will also demonstrate how easy it is to extend this platform to use the additional analytics capabilities offered by AWS. First, we will identify the genetic “variants” in an individual—what makes them unique—and then conduct a comparison of variation across a group of people.

Seven Bridges Genomics platform

The Seven Bridges platform is a secure, end-to-end solution in itself. It manages all aspects of analysis, from data storage and logistics (via Amazon S3), to user tracking, permissions, logging pipeline executions to support reproducible analysis, and visualizing results.

We show you how to use the visual interface of the platform. We like to think of this as an integrated development environment for bioinformatics, but each step in this workflow can also be performed via API. The Seven Bridges platform also offers an easy way to take the output of your analysis to other applications so you can extend its functionality. In this case, we will do additional work using Amazon Redshift and R.

You can follow along with this blog post by signing up for a free Seven Bridges trial account that comes with $100 of free computation and storage credits. This is enough for you to perform variant calling on about 50 whole exomes. In other words, you can modify and re-run the steps in this post 50 different ways before having to pay a dime.

Setting up our Seven Bridges environment

First, we will start with sequencing the 180,000 regions of the genome that make up protein-coding genes, also known as whole exome DNA sequencing. Our starting point will be a file in the FASTQ format, a text-based format for storing the biological sequences of A, C, T, and Gs.

We’ll use publicly available data from the 1000 Genomes project as an example, but you could just as easily upload your own FASTQ files. We also show you how to quickly modify the analysis to modify tool parameters or start with data in other formats. We walk you through the main points here, but the Seven Bridges Quick Start guide or this video also provide a tutorial.

The first step to run an analysis on the Seven Bridges platform is to create a project. Each project corresponds to a distinct scientific investigation, serving as a container for its data, analysis pipelines, results and collaborators. Think of this as our own private island where we can invite friends (collaborators) to join us. We maintain fine-grained control over what our friends can do on our island, from seeing files to running executions.

 Adding genomics data

Next, we need to add the data we want to analyze to the project. There are several ways to add files to the platform: we can add them using the graphical interface, the command line interface, or via FTP/HTTP. Here, we analyze publicly available 1000 Genomes Project data that is available in the Public Files hosted on the Seven Bridges platform (which are also available in a public S3 bucket). The Seven Bridges Public Files repository contains reference files, data sets, and other frequently used genomic data that you might find useful. For this analysis, we’ll add two FASTQ files to our project. Because the two files have been sequenced together on the same lane, we set their Lane/Slide metadata to the same value so that the pipeline can process them together.

Sequencing analysis typically works by comparing a unique FASTQ file to a complete reference genome. The Seven Bridges platform has the most common reference files preloaded, so we won’t need to worry about this now. Of course, additional reference files can be added if we want to further customize our analysis.

Pipelines abound: Customizable conduits for GATK and beyond

Now that the data is ready, we need a pipeline to analyze them. The Seven Bridges platform comes preloaded with a broad range of gold-standard bioinformatics pipelines that allow us to execute analyses immediately according to community best practices. We can use these as templates to immediately and reproducibly perform complex analysis routines without having to install any tools or manage data flow between them. Using the SDK, we can also put virtually any custom tool on the platform and create completely custom pipelines.

In this example, we use the Whole Exome Analysis - BWA + GATK 2.3.9-Lite (with Metrics) pipeline as our starting point. In this pipeline, we first align FASTQ files to a reference using a software package known as the Burrows-Wheeler Aligner (BWA). Next, alignments are refined according to GATK best practices before we determine what makes our FASTQ file different from the reference file. This step of looking for differences between our file and the reference file is known as variant calling and typically produces a Variant Call Format (VCF) file.

While public pipelines are set based on best practices, we can tweak them to suit our needs using the pipeline editor.

Analyzing the source code of a person

Next, we run the analysis. Choosing Run on the pipeline brings us to the task execution page where we can select the files to analyze, and set any modifiable parameters (executions can also be started via the API). We also need to add any reference and annotation files required by our pipeline. Because we are using a Seven Bridges optimized pipeline as the starting point, the recommended reference files can all be selected and added with one click rather than hunting these down from public repositories.

After we choose Run, our optimized execution is performed and we receive an email in an hour or so, letting us know that it has been completed.

After the execution completes, we can return to the Tasks tab to see the inputs, parameters, and output files of our analysis. Everything is kept in one place so that we can return in months or years and always know the precise analysis that was performed. The pipeline we executed returns both aligned BAM (alignment) files and VCFs, so if we decide to change the variant calling in the future, we don’t need to re-run the alignment.

Extending from individual analysis to population studies in AWS

The walkthrough thus far ran an analysis on an individual level, but what if we wanted to expand our analysis to an entire population?  AWS and Seven Bridges together offer a wealth of tools that are a perfect fit for this type of study. Let’s start with Amazon Redshift and the open source language R to see just a couple examples of how this might work.

First, we select all the resulting VCF files from the file browser and select get links. This generates signed URLs pointing to our file’s location in Amazon S3. Note that these links expire in 24 hours, so we want to use them right away.

As links will vary based on the 1000 genomes file picked and individual accounts, the rest of this post uses the chromosome 1 VCF file stored on Amazon S3 as an example.

Launching Bioconductor

A very common way to get started analyzing genomics data is by using the R software package, Bioconductor. Bioconductor is built for working with genomics data and has built-in functionality for interpreting many of the common genomics file types. It is also very simple to get started with Bioconductor in the cloud because an AWS CloudFormation template is provided and maintained for quickly provisioning an environment that is ready to go.

To launch a Bioconductor environment, choose AWS CloudFormation from the AWS Management Console, create a new stack, and enter the following value for Specify an Amazon S3 template URL under Template:

https://s3.amazonaws.com/bioc-cloudformation-templates/start_ssh_instance.json

We must be in the us-east-1 region for this template to function correctly. Also be aware that we must choose an instance type that supports an HVM instance store, so avoid the micro instance types.
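
If you prefer to script this step, the same stack can be launched from the AWS CLI. In this sketch the stack name is arbitrary, and any parameters the template requires (an EC2 key pair, the instance type, and so on) would be supplied with --parameters:

aws cloudformation create-stack \
    --stack-name bioconductor-rstudio \
    --region us-east-1 \
    --template-url https://s3.amazonaws.com/bioc-cloudformation-templates/start_ssh_instance.json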

After the CloudFormation stack has been launched, we should see a status of CREATE_COMPLETE, and an Outputs tab should appear that contains the following important pieces of data:

URL – The URL to click to begin working with Bioconductor and R.

Username – User name to gain entry to the website provided by the URL.

Password –  Password to gain entry to the website provided by the URL.

SSH Command – The command we can copy/paste into a terminal with SSH installed.

Before we jump into R, let’s add a couple items to our server that will come in handy later as we interact with AWS services from R. Connect to the instance using the SSH command provided by the outputs of the CloudFormation template. After we are in the instance, we run the following command:

sudo apt-get -y install libpq-dev

While logged into the instance, we can also use this opportunity to configure the AWS CLI. For the purposes of this post we’ll use the keys of a user who has read and write access to an Amazon S3 bucket.
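
If the CLI hasn’t been configured on the instance before, running aws configure and entering that user’s keys is all that’s needed for the rest of this walkthrough (the region and output format shown below are just sensible defaults, not requirements):

aws configure
# AWS Access Key ID [None]: <access key of the S3 read/write user>
# AWS Secret Access Key [None]: <corresponding secret key>
# Default region name [None]: us-east-1
# Default output format [None]: json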

Load the VCF into the R cloud

We are now ready to get started with R and Bioconductor by clicking the URL provided in the CloudFormation output. After we log in, we should be able to interact with a fully configured and pre-loaded RStudio environment.

First we need to pull the relevant VCF files into our Bioconductor environment. In this sample code, we use the VCF files provided in the 1000 Genomes project hosted on AWS but the links could be switched with those found in the file browser of the Seven Bridges Genomics platform.

#RCurl to retrieve data from S3
library(RCurl)

#The VariantAnnotation package from Bioconductor will give us
#additional functionality for working with VCF files.
library(VariantAnnotation)

#No need to install these packages as they are part of the pre-configured
#Bioconductor install

#VCF files are composed of two parts: the data and the index stored in the .tbi file.
#The VariantAnnotation package relies on having both files side by side
#for performing fast data lookups.

#pull the binaries of the data file into R's memory
bin_vcf <- getBinaryURL("http://1000genomes.s3.amazonaws.com/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz",
ssl.verifypeer=FALSE)

#pull the binaries of the index file into R’s memory
bin_vcf_tabix <- getBinaryURL("http://1000genomes.s3.amazonaws.com/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi",
ssl.verifypeer=FALSE)

#Write both files to the working directory of R.
#use getwd() to see the current directory and use setwd() to change location.
con <- file("imported_vcf.gz", open = "wb")
writeBin(bin_vcf, con)
close(con)

con <- file("imported_vcf.gz.tbi", open = "wb")
writeBin(bin_vcf_tabix, con)
close(con)

#Create strings specifying the file name location that can be used by R
vcf_file <- paste(getwd(), "imported_vcf.gz", sep="/")
vcf.tbi <- paste(getwd(), "imported_vcf.gz.tbi", sep="/")

#use functionality of VariantAnnotation to review the header
#of the VCF file.
info(scanVcfHeader(vcf_file))

Extracting Relevant Variant Data

As a trivial example of the kind of manipulation a bioinformatician might want to do with Bioconductor before sending the data off to other AWS analytics services for further analysis, let’s filter by a range within the chromosome 1 VCF file and then extract the variations in genes that were actually found.

#filter by a region in the chromosome
loc <- GRanges(1, IRanges(232164611, 233164611))
params <- ScanVcfParam(which=loc)

#read the VCF file into an R object of type VCF.
#the “hg19” is the genome identifier
# and tells bioconductor to use the 19th version of the reference file
#from the human genome project.
vcf <- readVcf(TabixFile(vcf_file), "hg19", params)

#use fixed to pull out the REF, ALT, QUAL and FILTER fields of the VCF and parse
#them into a dataframe.
#Using this as an easy to interpret example.
variant_df <- fixed(vcf)

#to get an idea of what is contained in this dataset, run
(variant_df)

The results of this code should produce a sample similar to the following:

Using Bioconductor and R to send variants to Amazon Redshift

This appears to be a very structured data set that we may want to collect on an ongoing basis from multiple VCFs to perform large population studies later. This is a scenario that aligns perfectly with Amazon Redshift. For more information about how Amazon Redshift relates to R, as well as details on how to connect to Amazon Redshift from R, see Connecting R with Amazon Redshift.

To move this dataset efficiently into Amazon Redshift, we want to use the Amazon Redshift COPY command and take advantage of the high bandwidth that Amazon S3 provides. Therefore, the first step is moving our fixed variant data frame into S3.

#move into S3
variant_df <- fixed(vcf)

#export the dataframe to a CSV
write.csv(variant_df, file = "my_variants.csv")

#this step requires that aws configure was previously run on the
#EC2 instance with user keys that have read/write on YOUR-BUCKET
system("aws s3 mv my_variants.csv s3://YOUR-BUCKET/my_variants.csv", intern = TRUE)

Now that this data frame is available to us in Amazon S3, we can open a connection to Amazon Redshift and load the filtered variant data.

#on first run,
#we need to install the RPostgreSQL package.
#install.packages("RPostgreSQL")
#open a connection to the Amazon Redshift cluster
library(RPostgreSQL)
con <- dbConnect(dbDriver("PostgreSQL"),
host="YOUR-redshift-cluster1.xx.region.amazonaws.com",
user= "USER_NAME",
password="PASSWORD",
dbname="DATABASE_NAME",
port = 5439 )
#Send a query to create
#the table which will hold our variants
dbGetQuery(con,
"CREATE TABLE SAMPLE_VARIANTS
( ROW_ID int, REF varchar(50),
ALT varchar(50), QUAL int,
FILTER varchar(5) )"
)

#Use the Amazon Redshift COPY
#command to load the table from S3
dbGetQuery(con,
"copy SAMPLE_VARIANTS from
's3://YOUR-BUCKET/my_variants.csv'
credentials 'aws_access_key_id=YOUR_KEY;aws_secret_access_key=YOUR_SECRET_KEY'
csv IGNOREHEADER 1 maxerror as 250;" )
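
Before moving on, a quick sanity check confirms that the COPY actually loaded rows into the table (the count depends on the chromosome region that was filtered earlier):

#confirm the load worked
dbGetQuery(con, "select count(*) from SAMPLE_VARIANTS")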

Running an analysis against Amazon Redshift from R

Now that the data has been loaded into Amazon Redshift, we can continue to add additional rows from more VCFs that are relevant to the region of the chromosome we wish to study, without having to worry about scaling since Amazon Redshift can easily query petabytes of data. We now also have the ability to write simple SQL against our dataset to find answers to our questions. For example, we can quickly ask what the top gene variations are in this chromosome region with the following statement:

top_variations <- dbGetQuery(con,
"select ref || ‘->’ || alt as variation, count(*) as variations_in_vcf from SAMPLE_VARIANTS group by ref,alt order by variations_in_vcf DESC"
)
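
Printing the first few rows of the result shows which substitutions dominate; head() is all that is needed here, and the exact counts will vary with the region selected:

#peek at the most frequent variations
head(top_variations)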

The most common variation found in our chromosome range is that a C becomes a T. Most of our top changes are a single letter change, which is also known as a single-nucleotide polymorphism or SNP (pronounced “snip”).

In addition to massive scale-out capabilities, there’s another advantage to collecting relevant results in a single Amazon Redshift table: we are storing the specific variant data for the chromosome region we want to study in an easy-to-interpret and simple-to-query format. This makes it easy to cut our massive datasets with SQL and home in on the relevant cohorts or regions of data we want to explore further. After we have used SQL to focus our data, we can return to R and run a more sophisticated analysis.

As a spurious example of what this might look like, let’s dig into our SNP example above. Simple counts let us know that SNPs are certainly the most common form of variation in the set. But what if we wanted to see if there were base-pair changes in the SNPs that tended to cluster together? We could examine this by running a correspondence analysis, which is a graphical method of exploring the relationship between variables in a contingency table. In this pseudo-analysis, we treat our ref and alt variables as categorical variables and produce a visualization that displays how our refs and alts move together in chi-squared space.

single_changes = "select ref , alt,
count(*) as variations_in_vcf from SAMPLE_VARIANTS
WHERE len(ref) = 1 and len(alt) = 1
group by ref,alt
order by variations_in_vcf DESC"
count_single_variations <- dbGetQuery(con,single_changes)
library(ca)
count_single_variations
correspondance_table <- with(count_single_variations, table(ref,alt)) # create a 2 way table
prop.table(correspondance_table, 1) # row percentages
prop.table(correspondance_table, 2) # column percentages
fit <- ca(correspondance_table)
print(fit) #basic results
plot(fit, mass = TRUE, contrib = "absolute", map =
"rowgreen", arrows = c(FALSE, TRUE)) # asymmetric map

Conclusion

We are in the midst of a fundamental change in the accessibility and usability of data. Massive datasets once thought to only be archival in nature or impossible to interpret are being opened up for examination due to the processing power of the cloud. The genes that control the makeup of all living creatures are no exception and are finally falling under the same scrutiny.

We hope that this walkthrough has given the non-bioinformatics reader a quick peek under the hood of this fascinating field. We also hope it has provided insight into what it means to begin to understand the data in your DNA. Armed with this knowledge and an improved genomics vocabulary, you can be part of the conversation surrounding the precision medicine revolution.

Additionally, we hope the seasoned bioinformatics reader has been exposed to new tools that could accelerate research and has been empowered to explore the capabilities of Amazon Redshift and the Seven Bridges Genomics platform. This post has just scratched the surface: the Seven Bridges Genomics platform integrates over 30 tested, peer-reviewed pipelines and 400 open source applications for analyzing multidimensional, next-generation sequencing data.

If you have questions or suggestions, please leave a comment below.

————————-

Related

Connecting R with Amazon Redshift

Faster Auto Scaling in AWS CloudFormation Stacks with Lambda-backed Custom Resources

Post Syndicated from Tom Maddox original http://blogs.aws.amazon.com/application-management/post/Tx38Z5CAM5WWRXW/Faster-Auto-Scaling-in-AWS-CloudFormation-Stacks-with-Lambda-backed-Custom-Resou

Many organizations use AWS CloudFormation (CloudFormation) stacks to facilitate blue/green deployments, routinely launching replacement AWS resources with updated packages for code releases, security patching, and change management. To facilitate blue/green deployments with CloudFormation, you typically pass code version identifiers (e.g., a commit hash) to new application stacks as template parameters. Application servers in an Auto Scaling group reference the parameters to fetch and install the correct versions of code.
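
For reference, such a parameter can be declared at the top of the template. The sketch below uses the name CodeVersionIdentifier because that is what the user data script later in this post references; the description text is our own. Declaring it as a plain string keeps the template agnostic about whether the identifier is a commit hash, a build number, or a package version.

"Parameters": {
  "CodeVersionIdentifier": {
    "Type": "String",
    "Description": "Commit hash (or package version) of the application code to deploy"
  }
}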

 

Fetching code every time your application scales can impede bringing new application servers online. Organizations often compensate for reduced scaling agility by setting lower server utilization targets, which has a knock-on effect on cost, or by creating pre-built custom Amazon Machine Images (AMIs) for use in the deployment pipeline. Custom AMIs with pre-installed code can be referenced with new instance launches as part of an Auto Scaling group launch configuration. These application servers are ready faster than if code had to be fetched in the traditional way. However, hosting this type of application deployment pipeline often requires additional servers and adds the overhead of managing the AMIs.

 

In this post, we’ll look at how you can use CloudFormation custom resources with AWS Lambda (Lambda) to create and manage AMIs during stack creation and termination.

 

The following diagram shows how you can use a Lambda function that creates an AMI and returns a success code and the resulting AMI ID.

 

Visualization of AMIManager Custom Resource creation process

 

To orchestrate this process, you bootstrap a reference instance with a user data script, use wait conditions to trigger an AMI capture, and finally create an Auto Scaling group launch configuration that references the newly created AMI. The reference instance that is used to capture the AMI can then be terminated, or it can be repurposed for administrative access or for performing scheduled tasks. Here’s how this looks in a CloudFormation template:

 

"Resources": {
"WaitHandlePendingAMI" : {
"Type" : "AWS::CloudFormation::WaitConditionHandle"
},
"WaitConditionPendingAMI" : {
"Type" : "AWS::CloudFormation::WaitCondition",
"Properties" : {
"Handle" : { "Ref" : "WaitHandlePendingAMI" },
"Timeout" : "7200"
}
},

"WaitHandleAMIComplete" : {
"Type" : "AWS::CloudFormation::WaitConditionHandle"
},
"WaitConditionAMIComplete" : {
"Type" : "AWS::CloudFormation::WaitCondition",
"Properties" : {
"Handle" : { "Ref" : "WaitHandleAMIComplete" },
"Timeout" : "7200"
}
},

"AdminServer" : {
"Type" : "AWS::EC2::Instance",
"Properties" : {

"UserData": { "Fn::Base64": { "Fn::Join": [ "", [
"#!/bin/bashn",
"yum update -yn",
"",
"echo -e "n### Fetching and Installing Code…"n",
"export CODE_VERSION="", {"Ref": "CodeVersionIdentifier"}, ""n",
"# Insert application deployment code here!n",
"",
"echo -e "n### Signal for AMI capture"n",
"history -cn",
"/opt/aws/bin/cfn-signal -e 0 -i waitingforami ‘", { "Ref" : "WaitHandlePendingAMI" }, "’ n",
"",
"echo -e "n### Waiting for AMI to be available"n",
"aws ec2 wait image-available",
" –filters Name=tag:cloudformation:amimanager:stack-name,Values=", { "Ref" : "AWS::StackName" },
" –region ", {"Ref": "AWS::Region"}
"",
"/opt/aws/bin/cfn-signal -e $0 -i waitedforami ‘", { "Ref" : "WaitHandleAMIComplete" }, "’ n"
"",
"# Continue with re-purposing or shutting down instance…n"
] ] } }
}
},

"AMI": {
"Type": "Custom::AMI",
"DependsOn" : "WaitConditionPendingAMI",
"Properties": {
"ServiceToken": "arn:aws:lambda:REGION:ACCOUNTID:function:AMIManager",
"StackName": { "Ref" : "AWS::StackName" },
"Region" : { "Ref" : "AWS::Region" },
"InstanceId" : { "Ref" : "AdminServer" }
}
},

"AutoScalingGroup" : {
"Type" : "AWS::AutoScaling::AutoScalingGroup",
"Properties" : {

"LaunchConfigurationName" : { "Ref" : "LaunchConfiguration" }
}
},

"LaunchConfiguration": {
"Type": "AWS::AutoScaling::LaunchConfiguration",
"DependsOn" : "WaitConditionAMIComplete",
"Properties": {

"ImageId": { "Fn::GetAtt" : [ "AMI", "ImageId" ] }
}
}
}

With this approach, you don’t have to run and maintain additional servers for creating custom AMIs, and the AMIs can be deleted when the stack terminates. The following figure shows that as CloudFormation deletes the stacks, it also deletes the AMIs when the Delete signal is sent to the Lambda-backed custom resource.

Visualization of AMIManager Custom Resource deletion process

Let’s look at the Lambda function that facilitates AMI creation and deletion:

/**
* A Lambda function that takes an AWS CloudFormation stack name and instance id
* and returns the AMI ID.
**/

exports.handler = function (event, context) {

console.log("REQUEST RECEIVED:n", JSON.stringify(event));

var stackName = event.ResourceProperties.StackName;
var instanceId = event.ResourceProperties.InstanceId;
var instanceRegion = event.ResourceProperties.Region;

var responseStatus = "FAILED";
var responseData = {};

var AWS = require("aws-sdk");
var ec2 = new AWS.EC2({region: instanceRegion});

if (event.RequestType == "Delete") {
console.log("REQUEST TYPE:", "delete");
if (stackName && instanceRegion) {
var params = {
Filters: [
{
Name: 'tag:cloudformation:amimanager:stack-name',
Values: [ stackName ]
},
{
Name: 'tag:cloudformation:amimanager:stack-id',
Values: [ event.StackId ]
},
{
Name: 'tag:cloudformation:amimanager:logical-id',
Values: [ event.LogicalResourceId ]
}
]
};
ec2.describeImages(params, function (err, data) {
if (err) {
responseData = {Error: "DescribeImages call failed"};
console.log(responseData.Error + ":\n", err);
sendResponse(event, context, responseStatus, responseData);
} else if (data.Images.length === 0) {
sendResponse(event, context, "SUCCESS", {Info: "Nothing to delete"});
} else {
var imageId = data.Images[0].ImageId;
console.log("DELETING:", data.Images[0]);
ec2.deregisterImage({ImageId: imageId}, function (err, data) {
if (err) {
responseData = {Error: "DeregisterImage call failed"};
console.log(responseData.Error + ":\n", err);
} else {
responseStatus = "SUCCESS";
responseData.ImageId = imageId;
}
sendResponse(event, context, "SUCCESS");
});
}
});
} else {
responseData = {Error: "StackName or InstanceRegion not specified"};
console.log(responseData.Error);
sendResponse(event, context, responseStatus, responseData);
}
return;
}

console.log("REQUEST TYPE:", "create");
if (stackName && instanceId && instanceRegion) {
ec2.createImage(
{
InstanceId: instanceId,
Name: stackName + '-' + instanceId,
NoReboot: true
}, function (err, data) {
if (err) {
responseData = {Error: "CreateImage call failed"};
console.log(responseData.Error + ":\n", err);
sendResponse(event, context, responseStatus, responseData);
} else {
var imageId = data.ImageId;
console.log('SUCCESS: ', "ImageId - " + imageId);

var params = {
Resources: [imageId],
Tags: [
{
Key: 'cloudformation:amimanager:stack-name',
Value: stackName
},
{
Key: 'cloudformation:amimanager:stack-id',
Value: event.StackId
},
{
Key: 'cloudformation:amimanager:logical-id',
Value: event.LogicalResourceId
}
]
};
ec2.createTags(params, function (err, data) {
if (err) {
responseData = {Error: "Create tags call failed"};
console.log(responseData.Error + ":\n", err);
} else {
responseStatus = "SUCCESS";
responseData.ImageId = imageId;
}
sendResponse(event, context, responseStatus, responseData);
});
}
}
);
} else {
responseData = {Error: "StackName, InstanceId or InstanceRegion not specified"};
console.log(responseData.Error);
sendResponse(event, context, responseStatus, responseData);
}
};

//Sends response to the Amazon S3 pre-signed URL
function sendResponse(event, context, responseStatus, responseData) {
var responseBody = JSON.stringify({
Status: responseStatus,
Reason: "See the details in CloudWatch Log Stream: " + context.logStreamName,
PhysicalResourceId: context.logStreamName,
StackId: event.StackId,
RequestId: event.RequestId,
LogicalResourceId: event.LogicalResourceId,
Data: responseData
});

console.log("RESPONSE BODY:n", responseBody);

var https = require("https");
var url = require("url");

var parsedUrl = url.parse(event.ResponseURL);
var options = {
hostname: parsedUrl.hostname,
port: 443,
path: parsedUrl.path,
method: "PUT",
headers: {
"content-type": "",
"content-length": responseBody.length
}
};

var request = https.request(options, function (response) {
console.log("STATUS: " + response.statusCode);
console.log("HEADERS: " + JSON.stringify(response.headers));
// Tell AWS Lambda that the function execution is done
context.done();
});

request.on("error", function (error) {
console.log("sendResponse Error:n", error);
// Tell AWS Lambda that the function execution is done
context.done();
});

// Write data to request body
request.write(responseBody);
request.end();
}

This Lambda function calls the Amazon EC2 DescribeImages, DeregisterImage, CreateImage, and CreateTags APIs, and logs data to Amazon CloudWatch Logs (CloudWatch Logs) for monitoring and debugging. To support this, we recommend that you create the following AWS Identity and Access Management (IAM) policy for the function’s IAM execution role:

{
"Version": "2012-10-17",
"Statement": [
{
"Action": [
"ec2:CreateImage",
"ec2:DeregisterImage",
"ec2:DescribeImages",
"ec2:CreateTags"
],
"Effect": "Allow",
"Resource": "*"
},
{
"Action": [
"logs:*"
],
"Effect": "Allow",
"Resource": "arn:aws:logs:*:*:*"
}
]
}
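
Before wiring the function into a stack, it can be exercised locally with a synthetic event shaped like the ones CloudFormation sends. The sketch below is only a test harness: the file name amimanager.js, the placeholder ResponseURL, and the instance ID are assumptions, the aws-sdk module must be installed locally, and the stubbed context object simply provides the logStreamName and done() members the function uses.

// Hypothetical local test: load the handler (saved here as amimanager.js)
// and invoke it with a fake "Create" request.
var handler = require("./amimanager").handler;

var fakeEvent = {
    RequestType: "Create",
    StackId: "arn:aws:cloudformation:us-east-1:123456789012:stack/test-stack/guid",
    RequestId: "test-request-id",
    LogicalResourceId: "AMI",
    ResponseURL: "https://example.com/placeholder-presigned-url",
    ResourceProperties: {
        StackName: "test-stack",
        InstanceId: "i-0123456789abcdef0",
        Region: "us-east-1"
    }
};

var fakeContext = {
    logStreamName: "local-test",
    done: function () { console.log("handler finished"); }
};

handler(fakeEvent, fakeContext);

With valid credentials and a real instance ID, this creates and tags an AMI just as a stack launch would; with the placeholder values it simply logs the failed API call and the response it would have posted back to CloudFormation.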

During testing, the Lambda function didn’t exceed the minimum Lambda memory allocation of 128 MB. Typically, create operations took 4.5 seconds, and delete operations took 25 seconds. At Lambda’s current pricing of $0.00001667 per GB-second, each stack’s launch and terminate cycle incurs custom AMI creation costs of just $0.000988. This is much less expensive than managing an independent code release application. Within the AWS Free Tier, using Lambda as described allows you to perform more than 9,000 custom AMI create and delete operations each month for free!

Faster Auto Scaling in AWS CloudFormation Stacks with Lambda-backed Custom Resources

Post Syndicated from Tom Maddox original http://blogs.aws.amazon.com/application-management/post/Tx38Z5CAM5WWRXW/Faster-Auto-Scaling-in-AWS-CloudFormation-Stacks-with-Lambda-backed-Custom-Resou

Many organizations use AWS CloudFormation (CloudFormation) stacks to facilitate blue/green deployments, routinely launching replacement AWS resources with updated packages for code releases, security patching, and change management. To facilitate blue/green deployments with CloudFormation, you typically pass code version identifiers (e.g., a commit hash) to new application stacks as template parameters. Application servers in an Auto Scaling group reference the parameters to fetch and install the correct versions of code.

 

Fetching code every time your application scales can impede bringing new application servers online. Organizations often compensate for reduced scaling agility by setting lower server utilization targets, which has a knock-on effect on cost, or by creating pre-built custom Amazon Machine Images (AMIs) for use in the deployment pipeline. Custom AMIs with pre-installed code can be referenced with new instance launches as part of an Auto Scaling group launch configuration. These application servers are ready faster than if code had to be fetched in the traditional way. However, hosting this type of application deployment pipeline often requires additional servers and adds the overhead of managing the AMIs.

 

In this post, we’ll look at how you can use CloudFormation custom resources with AWS Lambda (Lambda) to create and manage AMIs during stack creation and termination.

 

The following diagram shows how you can use a Lambda function that creates an AMI and returns a success code and the resulting AMI ID.

 

Visualization of AMIManager Custom Resource creation process

 

To orchestrate this process, you bootstrap a reference instance with a user data script, use wait conditions to trigger an AMI capture, and finally create an Auto Scaling group launch configuration that references the newly created AMI. The reference instance that is used to capture the AMI can then be terminated, or it can be repurposed for administrative access or for performing scheduled tasks. Here’s how this looks in a CloudFormation template:

 

"Resources": {
"WaitHandlePendingAMI" : {
"Type" : "AWS::CloudFormation::WaitConditionHandle"
},
"WaitConditionPendingAMI" : {
"Type" : "AWS::CloudFormation::WaitCondition",
"Properties" : {
"Handle" : { "Ref" : "WaitHandlePendingAMI" },
"Timeout" : "7200"
}
},

"WaitHandleAMIComplete" : {
"Type" : "AWS::CloudFormation::WaitConditionHandle"
},
"WaitConditionAMIComplete" : {
"Type" : "AWS::CloudFormation::WaitCondition",
"Properties" : {
"Handle" : { "Ref" : "WaitHandleAMIComplete" },
"Timeout" : "7200"
}
},

"AdminServer" : {
"Type" : "AWS::EC2::Instance",
"Properties" : {

"UserData": { "Fn::Base64": { "Fn::Join": [ "", [
"#!/bin/bashn",
"yum update -yn",
"",
"echo -e "n### Fetching and Installing Code…"n",
"export CODE_VERSION="", {"Ref": "CodeVersionIdentifier"}, ""n",
"# Insert application deployment code here!n",
"",
"echo -e "n### Signal for AMI capture"n",
"history -cn",
"/opt/aws/bin/cfn-signal -e 0 -i waitingforami ‘", { "Ref" : "WaitHandlePendingAMI" }, "’ n",
"",
"echo -e "n### Waiting for AMI to be available"n",
"aws ec2 wait image-available",
" –filters Name=tag:cloudformation:amimanager:stack-name,Values=", { "Ref" : "AWS::StackName" },
" –region ", {"Ref": "AWS::Region"}
"",
"/opt/aws/bin/cfn-signal -e $0 -i waitedforami ‘", { "Ref" : "WaitHandleAMIComplete" }, "’ n"
"",
"# Continue with re-purposing or shutting down instance…n"
] ] } }
}
},

"AMI": {
"Type": "Custom::AMI",
"DependsOn" : "WaitConditionPendingAMI",
"Properties": {
"ServiceToken": "arn:aws:lambda:REGION:ACCOUNTID:function:AMIManager",
"StackName": { "Ref" : "AWS::StackName" },
"Region" : { "Ref" : "AWS::Region" },
"InstanceId" : { "Ref" : "AdminServer" }
}
},

"AutoScalingGroup" : {
"Type" : "AWS::AutoScaling::AutoScalingGroup",
"Properties" : {

"LaunchConfigurationName" : { "Ref" : "LaunchConfiguration" }
}
},

"LaunchConfiguration": {
"Type": "AWS::AutoScaling::LaunchConfiguration",
"DependsOn" : "WaitConditionAMIComplete",
"Properties": {

"ImageId": { "Fn::GetAtt" : [ "AMI", "ImageId" ] }
}
}
}

With this approach, you don’t have to run and maintain additional servers for creating custom AMIs, and the AMIs can be deleted when the stack terminates. The following figure shows that as CloudFormation deletes the stacks, it also deletes the AMIs when the Delete signal is sent to the Lambda-backed custom resource.

Visualization of AMIManager Custom Resource deletion process

Let’s look at the Lambda function that facilitates AMI creation and deletion:

/**
* A Lambda function that takes an AWS CloudFormation stack name and instance id
* and returns the AMI ID.
**/

exports.handler = function (event, context) {

console.log("REQUEST RECEIVED:n", JSON.stringify(event));

var stackName = event.ResourceProperties.StackName;
var instanceId = event.ResourceProperties.InstanceId;
var instanceRegion = event.ResourceProperties.Region;

var responseStatus = "FAILED";
var responseData = {};

var AWS = require("aws-sdk");
var ec2 = new AWS.EC2({region: instanceRegion});

if (event.RequestType == "Delete") {
console.log("REQUEST TYPE:", "delete");
if (stackName && instanceRegion) {
var params = {
Filters: [
{
Name: ‘tag:cloudformation:amimanager:stack-name’,
Values: [ stackName ]
},
{
Name: ‘tag:cloudformation:amimanager:stack-id’,
Values: [ event.StackId ]
},
{
Name: ‘tag:cloudformation:amimanager:logical-id’,
Values: [ event.LogicalResourceId ]
}
]
};
ec2.describeImages(params, function (err, data) {
if (err) {
responseData = {Error: "DescribeImages call failed"};
console.log(responseData.Error + ":n", err);
sendResponse(event, context, responseStatus, responseData);
} else if (data.Images.length === 0) {
sendResponse(event, context, "SUCCESS", {Info: "Nothing to delete"});
} else {
var imageId = data.Images[0].ImageId;
console.log("DELETING:", data.Images[0]);
ec2.deregisterImage({ImageId: imageId}, function (err, data) {
if (err) {
responseData = {Error: "DeregisterImage call failed"};
console.log(responseData.Error + ":n", err);
} else {
responseStatus = "SUCCESS";
responseData.ImageId = imageId;
}
sendResponse(event, context, "SUCCESS");
});
}
});
} else {
responseData = {Error: "StackName or InstanceRegion not specified"};
console.log(responseData.Error);
sendResponse(event, context, responseStatus, responseData);
}
return;
}

console.log("REQUEST TYPE:", "create");
if (stackName && instanceId && instanceRegion) {
ec2.createImage(
{
InstanceId: instanceId,
Name: stackName + ‘-‘ + instanceId,
NoReboot: true
}, function (err, data) {
if (err) {
responseData = {Error: "CreateImage call failed"};
console.log(responseData.Error + ":n", err);
sendResponse(event, context, responseStatus, responseData);
} else {
var imageId = data.ImageId;
console.log(‘SUCCESS: ‘, "ImageId – " + imageId);

var params = {
Resources: [imageId],
Tags: [
{
Key: ‘cloudformation:amimanager:stack-name’,
Value: stackName
},
{
Key: ‘cloudformation:amimanager:stack-id’,
Value: event.StackId
},
{
Key: ‘cloudformation:amimanager:logical-id’,
Value: event.LogicalResourceId
}
]
};
ec2.createTags(params, function (err, data) {
if (err) {
responseData = {Error: "Create tags call failed"};
console.log(responseData.Error + ":n", err);
} else {
responseStatus = "SUCCESS";
responseData.ImageId = imageId;
}
sendResponse(event, context, responseStatus, responseData);
});
}
}
);
} else {
responseData = {Error: "StackName, InstanceId or InstanceRegion not specified"};
console.log(responseData.Error);
sendResponse(event, context, responseStatus, responseData);
}
};

// Sends a response to the Amazon S3 pre-signed URL so that CloudFormation
// knows whether the custom resource operation succeeded or failed.
function sendResponse(event, context, responseStatus, responseData) {
    var responseBody = JSON.stringify({
        Status: responseStatus,
        Reason: "See the details in CloudWatch Log Stream: " + context.logStreamName,
        PhysicalResourceId: context.logStreamName,
        StackId: event.StackId,
        RequestId: event.RequestId,
        LogicalResourceId: event.LogicalResourceId,
        Data: responseData
    });

    console.log("RESPONSE BODY:\n", responseBody);

    var https = require("https");
    var url = require("url");

    var parsedUrl = url.parse(event.ResponseURL);
    var options = {
        hostname: parsedUrl.hostname,
        port: 443,
        path: parsedUrl.path,
        method: "PUT",
        headers: {
            "content-type": "",
            "content-length": responseBody.length
        }
    };

    var request = https.request(options, function (response) {
        console.log("STATUS: " + response.statusCode);
        console.log("HEADERS: " + JSON.stringify(response.headers));
        // Tell AWS Lambda that the function execution is done
        context.done();
    });

    request.on("error", function (error) {
        console.log("sendResponse Error:\n", error);
        // Tell AWS Lambda that the function execution is done
        context.done();
    });

    // Write data to request body
    request.write(responseBody);
    request.end();
}

This Lambda function calls the Amazon EC2 DescribeImages, DeregisterImage, CreateImage, and CreateTags APIs, and logs data to Amazon CloudWatch Logs for monitoring and debugging. To support this, we recommend that you attach the following AWS Identity and Access Management (IAM) policy to the function’s IAM execution role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "ec2:CreateImage",
                "ec2:DeregisterImage",
                "ec2:DescribeImages",
                "ec2:CreateTags"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Action": [
                "logs:*"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:logs:*:*:*"
        }
    ]
}

During testing, the Lambda function didn’t exceed the minimum Lambda memory allocation of 128 MB. Typically, create operations took 4.5 seconds, and delete operations took 25 seconds. At Lambda’s current pricing of $0.00001667 per GB-second, each stack’s launch and terminate cycle incurs custom AMI creation costs of just $0.000988. This is much less expensive than managing an independent code release application. Within the AWS Free Tier, using Lambda as described allows you to perform more than 9,000 custom AMI create and delete operations each month for free!
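
To put the function in context: a stack typically invokes it through a Lambda-backed custom resource whose properties supply the StackName, InstanceId, and Region values read above. A minimal sketch of such a declaration follows; the Custom::AMIManager type name and the AMIManagerFunction and WebServerInstance logical IDs are illustrative, not part of the original post:

"AMI": {
    "Type": "Custom::AMIManager",
    "Properties": {
        "ServiceToken": { "Fn::GetAtt": [ "AMIManagerFunction", "Arn" ] },
        "StackName":    { "Ref": "AWS::StackName" },
        "InstanceId":   { "Ref": "WebServerInstance" },
        "Region":       { "Ref": "AWS::Region" }
    }
}

Because the function returns the new image ID in the response Data, other resources in the same template could reference it with { "Fn::GetAtt": [ "AMI", "ImageId" ] }.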

Use AWS CodeDeploy to Deploy to Amazon EC2 Instances Behind an Elastic Load Balancer

Post Syndicated from Thomas Schmitt original http://blogs.aws.amazon.com/application-management/post/Tx39X8HM93NXU47/Use-AWS-CodeDeploy-to-Deploy-to-Amazon-EC2-Instances-Behind-an-Elastic-Load-Bala

AWS CodeDeploy is a new service that makes it easy to deploy application updates to Amazon EC2 instances. CodeDeploy is targeted at customers who manage their EC2 instances directly, instead of those who use an application management service like AWS Elastic Beanstalk or AWS OpsWorks that have their own built-in deployment features. CodeDeploy allows developers and administrators to centrally control and track their application deployments across their different development, testing, and production environments.

 

Let’s assume you have an application architecture designed for high availability that includes an Elastic Load Balancing load balancer in front of multiple application servers belonging to an Auto Scaling group. Elastic Load Balancing enables you to distribute incoming traffic over multiple servers, and Auto Scaling allows you to scale your EC2 capacity up or down automatically according to your needs. In this blog post, we will show how you can use CodeDeploy to avoid downtime when updating the code running on your application servers in such an environment. We will use the CodeDeploy rolling updates feature so that a minimum capacity is always available to serve traffic, and use a simple script to take each EC2 instance out of the load balancer while we deploy new code on it.

 

So let’s get started. We are going to:

Set up the environment described above

Create your artifact bundle, which includes the deployment scripts, and upload it to Amazon S3

Create an AWS CodeDeploy application and a deployment group

Start the zero-downtime deployment

Monitor your deployment

 

1. Set up the environment

Let’s get started by setting up some AWS resources.

 

To simplify the setup process, you can use a sample AWS CloudFormation template that sets up the following resources for you:

An Auto Scaling group and its launch configuration. By default, the Auto Scaling group launches three Amazon EC2 instances. The AWS CloudFormation template installs Apache on each of these instances to run a sample website. It also installs the AWS CodeDeploy Agent, which performs the deployments on the instance. The template creates a service role that grants AWS CodeDeploy access to add deployment lifecycle event hooks to your Auto Scaling group so that it can kick off a deployment whenever Auto Scaling launches a new Amazon EC2 instance.

The Auto Scaling group spins up Amazon EC2 instances and monitors their health for you. The Auto Scaling Group spans all Availability Zones within the region for fault tolerance. 

An Elastic Load Balancing load balancer, which distributes the traffic across all of the Amazon EC2 instances in the Auto Scaling group.

 

Simply execute the following command using the AWS Command Line Interface (AWS CLI), or you can create an AWS CloudFormation stack with the AWS Management Console by using the value of the --template-url option shown here:

 

aws cloudformation create-stack \
    --stack-name "CodeDeploySampleELBIntegrationStack" \
    --template-url "http://s3.amazonaws.com/aws-codedeploy-us-east-1/templates/latest/CodeDeploy_SampleCF_ELB_Integration.json" \
    --capabilities "CAPABILITY_IAM" \
    --parameters "ParameterKey=KeyName,ParameterValue=<my-key-pair>"

 

Note: AWS CloudFormation will change your AWS account’s security configuration by adding two roles. These roles will enable AWS CodeDeploy to perform actions on your AWS account’s behalf. These actions include identifying Amazon EC2 instances by their tags or Auto Scaling group names and for deploying applications from Amazon S3 buckets to instances. For more information, see the AWS CodeDeploy service role and IAM instance profile documentation.

 

2. Create your artifact bundle, which includes the deployment scripts, and upload it to Amazon S3

You can use the following sample artifact bundle in Amazon S3, which includes everything you need: the Application Specification (AppSpec) file, deployment scripts, and a sample web page:

 

http://s3.amazonaws.com/aws-codedeploy-us-east-1/samples/latest/SampleApp_ELB_Integration.zip

 

This artifact bundle contains the deployment artifacts and a set of scripts that call the Auto Scaling EnterStandby and ExitStandby APIs to register and deregister an Amazon EC2 instance with the load balancer.

 

The installation scripts and deployment artifacts are bundled together with a CodeDeploy AppSpec file. The AppSpec file must be placed in the root of your archive and describes where to copy the application and how to execute installation scripts. 

 

Here is the appspec.yml file from the sample artifact bundle:

 

version: 0.0
os: linux
files:
  - source: /html
    destination: /var/www/html
hooks:
  BeforeInstall:
    - location: scripts/deregister_from_elb.sh
      timeout: 400
    - location: scripts/stop_server.sh
      timeout: 120
      runas: root
  ApplicationStart:
    - location: scripts/start_server.sh
      timeout: 120
      runas: root
    - location: scripts/register_with_elb.sh
      timeout: 120

 

The commands defined in the AppSpec file are executed in the following order (see the AWS CodeDeploy AppSpec File Reference for more details):

BeforeInstall deployment lifecycle event
First, it deregisters the instance from the load balancer (deregister_from_elb.sh). I have increased the timeout for the deregistration script beyond 300 seconds, which is how long the load balancer waits by default for all connections to close when connection draining is enabled.
After that it stops the Apache Web Server (stop_server.sh).

Install deployment lifecycle event
The next step of the host agent is to copy the HTML pages defined in the 'files' section from the '/html' folder in the archive to '/var/www/html' on the server.

ApplicationStart deployment lifecycle event
It starts the Apache Web Server (start_server.sh).
It then registers the instance with the load balancer (register_with_elb.sh).

In case you are wondering why I used BeforeInstall instead of the ApplicationStop deployment lifecycle event: ApplicationStop always executes the scripts from the previously deployed bundle, and on the very first deployment with AWS CodeDeploy there is no previous bundle, so the instance would not get deregistered from the load balancer.

 

 

Here’s what the deregister script does, step by step:

The script gets the instance ID (and AWS region) from the Amazon EC2 metadata service.

It checks if the instance is part of an Auto Scaling group.

After that the script deregisters the instance from the load balancer by putting the instance into standby mode in the Auto Scaling group.

The script keeps polling the Auto Scaling API every second until the instance is in standby mode, which means it has been deregistered from the load balancer.

The deregistration might take a while if connection draining is enabled. The server has to finish processing the ongoing requests first before we can continue with the deployment.

 

For example, the following is the section of the deregister_from_elb.sh sample script that removes the Amazon EC2 instance from the load balancer:

 

# Get this instance's ID
INSTANCE_ID=$(get_instance_id)
if [ $? != 0 -o -z "$INSTANCE_ID" ]; then
    error_exit "Unable to get this instance's ID; cannot continue."
fi

msg "Checking if instance $INSTANCE_ID is part of an AutoScaling group"
asg=$(autoscaling_group_name $INSTANCE_ID)
if [ $? == 0 -a -n "$asg" ]; then
    msg "Found AutoScaling group for instance $INSTANCE_ID: $asg"

    msg "Attempting to put instance into Standby"
    autoscaling_enter_standby $INSTANCE_ID $asg
    if [ $? != 0 ]; then
        error_exit "Failed to move instance into standby"
    else
        msg "Instance is in standby"
        exit 0
    fi
fi

 

The 'autoscaling_enter_standby' function is defined in the common_functions.sh sample script as follows:

 

autoscaling_enter_standby() {
    local instance_id=$1
    local asg_name=$2

    msg "Putting instance $instance_id into Standby"
    $AWS_CLI autoscaling enter-standby \
        --instance-ids $instance_id \
        --auto-scaling-group-name $asg_name \
        --should-decrement-desired-capacity
    if [ $? != 0 ]; then
        msg "Failed to put instance $instance_id into standby for ASG $asg_name."
        return 1
    fi

    msg "Waiting for move to standby to finish."
    wait_for_state "autoscaling" $instance_id "Standby"
    if [ $? != 0 ]; then
        local wait_timeout=$(($WAITER_INTERVAL * $WAITER_ATTEMPTS))
        msg "$instance_id did not make it to standby after $wait_timeout seconds"
        return 1
    fi

    return 0
}
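
The wait_for_state helper used above also lives in common_functions.sh. Its exact implementation is not reproduced in this post, but conceptually it is a polling loop along the lines of the following simplified sketch (the shipped script does more, such as handling ELB state; the variable names follow the ones used above):

# Simplified sketch: poll the Auto Scaling API until the instance reaches
# the expected lifecycle state, or give up after WAITER_ATTEMPTS tries.
wait_for_state() {
    local service=$1 instance_id=$2 expected_state=$3
    local attempts=0
    while [ $attempts -lt $WAITER_ATTEMPTS ]; do
        state=$($AWS_CLI autoscaling describe-auto-scaling-instances \
            --instance-ids $instance_id \
            --query 'AutoScalingInstances[0].LifecycleState' \
            --output text)
        if [ "$state" == "$expected_state" ]; then
            return 0
        fi
        sleep $WAITER_INTERVAL
        attempts=$((attempts + 1))
    done
    return 1
}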

 

The register_with_elb.sh sample script works in a similar way. It calls 'autoscaling_exit_standby' from the common_functions.sh sample script to put the instance back in service in the load balancer.
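
For completeness, an exit function along the same lines might look like this (a sketch rather than the exact code shipped in the sample bundle):

# Sketch: move the instance out of Standby so the load balancer
# starts sending it traffic again.
autoscaling_exit_standby() {
    local instance_id=$1
    local asg_name=$2

    msg "Moving instance $instance_id out of Standby"
    $AWS_CLI autoscaling exit-standby \
        --instance-ids $instance_id \
        --auto-scaling-group-name $asg_name
    if [ $? != 0 ]; then
        msg "Failed to move instance $instance_id out of standby for ASG $asg_name."
        return 1
    fi

    msg "Waiting for instance to go back InService."
    wait_for_state "autoscaling" $instance_id "InService"
}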

 

The register and deregister scripts are executed on each Amazon EC2 instance in your fleet. The instances must have access to the AutoScaling API to put themselves into standby mode and back in service. Your Amazon EC2 instance role needs the following permissions:

 

{
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "autoscaling:Describe*",
                "autoscaling:EnterStandby",
                "autoscaling:ExitStandby",
                "cloudformation:Describe*",
                "cloudformation:GetTemplate",
                "s3:Get*"
            ],
            "Resource": "*"
        }
    ]
}

 

If you use the provided CloudFormation template, an IAM instance role with the necessary permissions is automatically created for you.

 

For more details on how to create a deployment archive, see Prepare a Revision for AWS CodeDeploy.

 

3. Create an AWS CodeDeploy application and a deployment group

The next step is to create the AWS CodeDeploy resources and configure the roll-out strategy. The following commands tell AWS CodeDeploy where to deploy your artifact bundle (all instances of the given Auto Scaling group) and how to deploy it (OneAtATime). The 'OneAtATime' deployment configuration is the safest way to deploy, because only one instance of the Auto Scaling group is updated at a time.

 

# Create a new AWS CodeDeploy application.
aws deploy create-application --application-name "SampleELBWebApp"

# Get the AWS CodeDeploy service role ARN and Auto Scaling group name
# from the AWS CloudFormation template.
output_parameters=$(aws cloudformation describe-stacks \
    --stack-name "CodeDeploySampleELBIntegrationStack" \
    --output text \
    --query 'Stacks[0].Outputs[*].OutputValue')
service_role_arn=$(echo $output_parameters | awk '{print $2}')
autoscaling_group_name=$(echo $output_parameters | awk '{print $3}')

# Create an AWS CodeDeploy deployment group that uses
# the Auto Scaling group created by the AWS CloudFormation template.
# Set up the deployment group so that it deploys to
# only one instance at a time.
aws deploy create-deployment-group \
    --application-name "SampleELBWebApp" \
    --deployment-group-name "SampleELBDeploymentGroup" \
    --auto-scaling-groups "$autoscaling_group_name" \
    --service-role-arn "$service_role_arn" \
    --deployment-config-name "CodeDeployDefault.OneAtATime"

 

4. Start the zero-downtime deployment

Now you are ready to start your rolling, zero-downtime deployment. 

 

aws deploy create-deployment \
    --application-name "SampleELBWebApp" \
    --s3-location "bucket=aws-codedeploy-us-east-1,key=samples/latest/SampleApp_ELB_Integration.zip,bundleType=zip" \
    --deployment-group-name "SampleELBDeploymentGroup"

 

5. Monitor your deployment

You can see how your instances are taken out of service and back into service with the following command:

 

watch -n1 aws autoscaling describe-scaling-activities \
    --auto-scaling-group-name "$autoscaling_group_name" \
    --query 'Activities[*].Description'

 

Every 1.0s: aws autoscaling describe-scaling-activities […]
[
"Moving EC2 instance out of Standby: i-d308b93c",
"Moving EC2 instance to Standby: i-d308b93c",
"Moving EC2 instance out of Standby: i-a9695458",
"Moving EC2 instance to Standby: i-a9695458",
"Moving EC2 instance out of Standby: i-2478cade",
"Moving EC2 instance to Standby: i-2478cade",
"Launching a new EC2 instance: i-d308b93c",
"Launching a new EC2 instance: i-a9695458",
"Launching a new EC2 instance: i-2478cade"
]

 

The URL output parameter of the AWS CloudFormation stack contains the link to the website so that you are able to watch it change. The following command returns the URL of the load balancer:

 

# Get the URL output parameter of the AWS CloudFormation template.
aws cloudformation describe-stacks \
    --stack-name "CodeDeploySampleELBIntegrationStack" \
    --output text \
    --query 'Stacks[0].Outputs[?OutputKey==`URL`].OutputValue'
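
If you want to watch the page change while the rolling deployment runs, one informal approach (not part of the sample) is to poll the site with curl:

# Fetch the load balancer URL once, then poll the sample site every second.
URL=$(aws cloudformation describe-stacks \
    --stack-name "CodeDeploySampleELBIntegrationStack" \
    --output text \
    --query 'Stacks[0].Outputs[?OutputKey==`URL`].OutputValue')
watch -n1 "curl -s $URL"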

 

There are a few other points to consider in order to achieve zero-downtime deployments:

Graceful shutdown of your application
You do not want to kill a process that is still handling requests. Make sure that running threads have enough time to finish their work before you shut down your application.

Connection draining
The AWS CloudFormation template sets up an Elastic Load Balancing load balancer with connection draining enabled. The load balancer does not send any new requests to the instance when the instance is deregistering, and it waits until any in-flight requests have finished executing. (For more information, see Enable or Disable Connection Draining for Your Load Balancer.)

Sanity test
It is important to check that the instance is healthy and the application is running before the instance is added back to the load balancer after the deployment.

Backward-compatible changes (for example, database changes)
Both application versions must work side by side until the deployment finishes, because only a part of the fleet is updated at the same time.

Warming of the caches and service
This ensures that no request suffers degraded performance after the deployment.

 

This example should help you get started with improving your deployment process. I hope that this post makes it easier to achieve zero-downtime deployments with AWS CodeDeploy and lets you ship your changes continuously in order to provide a great customer experience.

 

Scheduling Automatic Deletion of Application Environments

Post Syndicated from dchetan original http://blogs.aws.amazon.com/application-management/post/Tx2MMLQMSTZYNMJ/Scheduling-Automatic-Deletion-of-Application-Environments

Have you ever set up a temporary application environment and wished you could schedule automatic deletion of the environment rather than remembering to clean it up after you are done? If the answer is yes, then this blog post is for you.

Here is an example of setting up an AWS CloudFormation stack with a configurable TTL (time-to-live). When the TTL is up, deletion of the stack is triggered automatically. You can use this idea regardless of whether you have a single Amazon EC2 instance in the stack or a complex application environment. You can even use this idea in combination with other deployment and management services such as AWS Elastic Beanstalk or AWS OpsWorks, as long as your environment is modeled inside an AWS CloudFormation stack.

In this example, first I set up a sample application on an EC2 instance and then configure a 'TTL':

Configuring the TTL is simple. Just schedule execution of a one-line shell script, deletestack.sh, using the 'at' command. The shell script uses the AWS Command Line Interface to call aws cloudformation delete-stack:
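
The script and the scheduling command appeared as screenshots in the original post. A minimal sketch of what they might look like is shown here; the stack name, file path, and delay are placeholders:

# deletestack.sh -- one-liner that tears down the stack this instance belongs to.
aws cloudformation delete-stack --stack-name "MyTemporaryStack" --region us-east-1

# Schedule the script to run when the TTL expires, for example in 4 hours.
at -f /home/ec2-user/deletestack.sh now + 4 hours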

Notice that the EC2 instance requires permissions to delete all of the stack resources. The permissions are granted to the EC2 instance via an IAM role. Also notice that, for the stack deletion to succeed, the IAM role needs to be the last resource in the order of deletion. You can ensure that the role is deleted last by making the other resources dependent on the role. Finally, as a best practice, you should grant the least possible privilege to the role. You can do this by using a finer-grained policy document for the IAM role:
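
The finer-grained policy was also shown only as an image in the original post. Purely as an illustration, a scoped-down policy for a stack containing little more than an EC2 instance and its role might look something like the following; the exact actions and ARNs depend entirely on what your stack creates:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [ "cloudformation:DeleteStack" ],
            "Resource": "arn:aws:cloudformation:us-east-1:123456789012:stack/MyTemporaryStack/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ec2:TerminateInstances",
                "ec2:Describe*"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "iam:DeleteRole",
                "iam:DeleteRolePolicy",
                "iam:RemoveRoleFromInstanceProfile",
                "iam:DeleteInstanceProfile"
            ],
            "Resource": "*"
        }
    ]
}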

 

You can try the full sample template here: 

 

— Chetan Dandekar, Senior Product Manager, Amazon Web Services.

Use a CreationPolicy to Wait for On-Instance Configurations

Post Syndicated from Elliot Yamaguchi original http://blogs.aws.amazon.com/application-management/post/Tx3CISOFP98TS56/Use-a-CreationPolicy-to-Wait-for-On-Instance-Configurations

When you provision an Amazon EC2 instance in an AWS CloudFormation stack, you might specify additional actions to configure the instance, such as install software packages or bootstrap applications. Normally, CloudFormation proceeds with stack creation after the instance has been successfully created. However, you can use a CreationPolicy so that CloudFormation proceeds with stack creation only after your configuration actions are done. That way you’ll know your applications are ready to go after stack creation succeeds.

A CreationPolicy instructs CloudFormation to wait on an instance until CloudFormation receives the specified number of signals. This policy takes effect only when CloudFormation creates the instance. Here’s what a creation policy looks like:

"AutoScalingGroup": {
    "Type": "AWS::AutoScaling::AutoScalingGroup",
    "Properties": {

    },
    "CreationPolicy": {
        "ResourceSignal": {
            "Count": "3",
            "Timeout": "PT5M"
        }
    }
}

A CreationPolicy must be associated with a resource, such as an EC2 instance or an Auto Scaling group. This association is how CloudFormation knows what resource to wait on. In the example policy, the CreationPolicy is associated with an Auto Scaling group. CloudFormation waits on the Auto Scaling group until CloudFormation receives three signals within five minutes. Because the Auto Scaling group’s desired capacity is set to three, the signal count is set to three (one for each instance).

If three signals are not received within five minutes, CloudFormation immediately stops the stack creation and labels the Auto Scaling group as failed to create, so make sure you specify a timeout period that gives your instances and applications enough time to deploy.

Signaling a Resource

You can easily send signals from the instances that you’re provisioning. On those instances, you should be using the cfn-init helper script in the EC2 user data script to deploy applications. After the cfn-init script, just add a command to run the cfn-signal helper script, as in the following example:

"UserData": {
    "Fn::Base64": {
        "Fn::Join": [ "", [
            "/opt/aws/bin/cfn-init ",

            "/opt/aws/bin/cfn-signal -e $? ",
            " --stack ", { "Ref": "AWS::StackName" },
            " --resource AutoScalingGroup ",
            " --region ", { "Ref": "AWS::Region" }, "\n"
        ] ]
    }
}

When you signal CloudFormation, you need to let it know which stack and which resource you’re signaling. In the example, the cfn-signal command specifies the stack that is provisioning the instance, the logical ID of the resource (AutoScalingGroup), and the region in which the stack is being created.

With the CreationPolicy attribute and the cfn-signal helper script, you can ensure that your stacks are created successfully only when your applications are successfully deployed. For more information, you can view a complete sample template in the AWS CloudFormation User Guide.

Best Practices for Deploying Applications on AWS CloudFormation Stacks

Post Syndicated from dchetan original http://blogs.aws.amazon.com/application-management/post/Tx1ES7KM3RG4LNO/Best-Practices-for-Deploying-Applications-on-AWS-CloudFormation-Stacks

With AWS CloudFormation, you can provision the full breadth of AWS resources including Amazon EC2 instances. You provision the EC2 instances to run applications that drive your business. Here are some best practices for deploying and updating those applications on EC2 instances provisioned inside CloudFormation stacks:

Use AWS::CloudFormation::Init

Use IAM roles to securely download software and data

Use Amazon CloudWatch logs for debugging

Use cfn-hup for updates

Use custom AMIs to minimize application boot times

 

Best Practice 1: Use AWS::CloudFormation::Init

When you include an EC2 instance in a CloudFormation template, use the AWS::CloudFormation::Init section to specify what application packages you want downloaded on the instance, where to download them from, where to install them, what services to start, and what commands to run after the EC2 instance is up and running. You can do the same when you specify an Auto Scaling launch configuration. Here’s a fill-in-the-blanks example:

"MyInstance": {
    "Type": "AWS::EC2::Instance",
    "Metadata": {
        "AWS::CloudFormation::Init": {
            "webapp-config": {
                "packages": {}, "sources": {}, "files": {},
                "groups": {}, "users": {},
                "commands": {}, "services": {}
            }
        }
    }
}

Why use AWS::CloudFormation::Init? For several reasons.

First of all, it is declarative. You just specify the desired configuration and let CloudFormation figure out the steps to get to the desired configuration. For example, in the "sources" section you just specify the remote location to download an application tarball from and a directory on the instance where you want to install the application source. CloudFormation takes care of the precise steps to download the tarball, retry on any errors, and extract the source files after the tarball is downloaded.

The same declarative specification is supported for packages or files to be downloaded, users or groups to be created, and commands or services to be executed. If you need to invoke a script, you can simply download that script by using the "files" section and execute the script using the "commands" section.

Configurations defined in AWS::CloudFormation::Init can be grouped into units of deployments, which can be reused, ordered, and executed across instance reboots. For details and examples, see Configsets.
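
As a rough illustration of that grouping (the config names and package used here are made up), configsets name ordered groups of configs that cfn-init can run together:

"AWS::CloudFormation::Init": {
    "configSets": {
        "webapp": [ "install", "configure" ]
    },
    "install": {
        "packages": { "yum": { "httpd": [] } }
    },
    "configure": {
        "services": { "sysvinit": { "httpd": { "enabled": "true", "ensureRunning": "true" } } }
    }
}

On the instance, cfn-init is then invoked with the -c option (for example, -c webapp) to run the configs in that set in order.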

Unlike an application specification coded in an EC2 user data script, the application configuration specified in AWS::CloudFormation::Init is updatable. This is handy, for example, when you want to install a new version of a package without recreating a running instance. AWS::CloudFormation::Init also supports securely downloading application packages and other data.

More on the benefits. First, let’s take a quick look at the sequence of how AWS::CloudFormation::Init works:

You specify application configuration using the AWS::CloudFormation::Init section for an EC2 instance in your CloudFormation template.

You kick-off a CloudFormation stack creation using the template.

The AWS CloudFormation service starts creating a stack, including the EC2 instance.

After the EC2 instance is up and running, a CloudFormation helper script, cfn-init, is executed on the instance to configure the instance in accordance with your AWS::CloudFormation::Init template specification.*

Another CloudFormation helper script, cfn-signal, is executed on the instance to let the remote AWS CloudFormation service know the result (success/failure) of the configuration.* You can optionally have the CloudFormation service hold off on marking the EC2 instance state and the stack state “CREATE_COMPLETE” until the CloudFormation service hears a success signal for the instance. The holding-off period is specified in the template using a CreationPolicy.

*You can download the CloudFormation helper scripts for both Linux and Windows. These come preinstalled on the Linux and Windows AMIs provided by Amazon. You need to specify the commands to trigger cfn-init and cfn-signal in the EC2 user data script. Once an instance is up and running, the EC2 user data script is executed automatically for most Linux distributions and Windows.

Refer to this article for details and try these sample templates to see AWS::CloudFormation::Init in action.

 

Best Practice 2: Use IAM roles to securely download software and data

You might want to store the application packages and data at secure locations and allow only authenticated downloads when you are configuring the EC2 instances to run the applications. Use the AWS::CloudFormation::Authentication section to specify credentials for downloading the application packages and data specified in the AWS::CloudFormation::Init section. Although AWS::CloudFormation::Authentication supports several types of authentication, we recommend using an IAM role. For an end-to-end example refer to an earlier blog post “Authenticated File Downloads with CloudFormation.”
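
To give a sense of the shape of this in a template (the bucket name and the InstanceRole logical ID are illustrative), the Authentication section points at the instance's IAM role and pairs with a source entry in AWS::CloudFormation::Init:

"Metadata": {
    "AWS::CloudFormation::Authentication": {
        "S3AccessCreds": {
            "type": "S3",
            "roleName": { "Ref": "InstanceRole" },
            "buckets": [ "my-private-app-bucket" ]
        }
    },
    "AWS::CloudFormation::Init": {
        "config": {
            "sources": {
                "/opt/webapp": "https://my-private-app-bucket.s3.amazonaws.com/webapp.tar.gz"
            }
        }
    }
}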

 

Best Practice 3: Use CloudWatch Logs for Debugging

When you are configuring an instance using AWS::CloudFormation::Init, configuration logs are stored on the instance in the cfn-init.log file and other cfn-*.log files. These logs are helpful for debugging configuration errors. In the past, you had to SSH or RDP into EC2 instances to retrieve these log files. However, with the advent of Amazon CloudWatch Logs, you no longer have to log on to the instances. You can simply stream those logs to CloudWatch and view them in the AWS Management Console. Refer to an earlier blog post “View CloudFormation Logs in the Console” to find out how.

 

Best Practice 4: Use cfn-hup for updates

Once your application stack is up and running, chances are that you will update the application, apply an OS patch, or perform some other configuration update in a stack’s lifecycle. You just update the AWS::CloudFormation::Init section in your template (for example, specify a newer version of an application package), and call UpdateStack. When you do, CloudFormation updates the instance metadata in accordance with the updated template. Then the cfn-hup daemon running on the instance detects the updated metadata and reruns cfn-init to update the instance in accordance with the updated configuration. cfn-hup is one of the CloudFormation helper scripts available on both Linux and Windows.

Look for cfn-hup in some of our sample templates to find out how to configure cfn-hup.
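
The pattern in those samples generally boils down to writing two small configuration files through the files section and keeping the cfn-hup service running. A condensed version looks roughly like the following; the MyInstance logical ID is a placeholder for your own resource:

"files": {
    "/etc/cfn/cfn-hup.conf": {
        "content": { "Fn::Join": [ "", [
            "[main]\n",
            "stack=", { "Ref": "AWS::StackId" }, "\n",
            "region=", { "Ref": "AWS::Region" }, "\n"
        ] ] },
        "mode": "000400", "owner": "root", "group": "root"
    },
    "/etc/cfn/hooks.d/cfn-auto-reloader.conf": {
        "content": { "Fn::Join": [ "", [
            "[cfn-auto-reloader-hook]\n",
            "triggers=post.update\n",
            "path=Resources.MyInstance.Metadata.AWS::CloudFormation::Init\n",
            "action=/opt/aws/bin/cfn-init -s ", { "Ref": "AWS::StackName" },
            " -r MyInstance --region ", { "Ref": "AWS::Region" }, "\n"
        ] ] }
    }
},
"services": {
    "sysvinit": {
        "cfn-hup": {
            "enabled": "true", "ensureRunning": "true",
            "files": [ "/etc/cfn/cfn-hup.conf", "/etc/cfn/hooks.d/cfn-auto-reloader.conf" ]
        }
    }
}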

 

Best Practice 5: Minimize application boot time with custom AMIs

When you are using CloudFormation, you can use anything from AMIs, user data scripts, and AWS::CloudFormation::Init to third-party configuration tools to configure EC2 instances and bootstrap applications. On one hand, AWS::CloudFormation::Init and the other configuration tools provide a great deal of flexibility and control over instance configuration. On the other hand, AMIs offer the fastest application boot times, because your desired configuration and application can be preinstalled when you create a custom AMI.

Some CloudFormation customers optimize their usage by selecting an instance configuration and application bootstrapping method based on the environment. They employ AWS::CloudFormation::Init and other tools for flexibility and control in development and test environments. When they have a desired configuration developed and tested, they create a custom AMI and use that custom AMI in their CloudFormation stacks in production. The result is faster application boot times.

This optimization requires you to maintain two different configuration methods and requires you to keep track of which AMI corresponds to which version-controlled configuration for future reference and updates. As such, this might be worth looking into only if you have a business need to boot up several homogeneous instances, on-demand, in the absolute shortest time possible.

 

These best practices are based on the real-world experience of our customers. Reach out to us at @AWSCloudFormer to let us know your feedback on these best practices and additional best practices that you may want to share.

 

— Chetan Dandekar, Senior Product Manager, Amazon Web Services.

 

Best Practices for Deploying Applications on AWS CloudFormation Stacks

Post Syndicated from dchetan original http://blogs.aws.amazon.com/application-management/post/Tx1ES7KM3RG4LNO/Best-Practices-for-Deploying-Applications-on-AWS-CloudFormation-Stacks

With AWS CloudFormation, you can provision the full breadth of AWS resources including Amazon EC2 instances. You provision the EC2 instances to run applications that drive your business. Here are some best practices for deploying and updating those applications on EC2 instances provisioned inside CloudFormation stacks:

Use AWS::CloudFormation::Init

Use IAM roles to securely download software and data

Use Amazon CloudWatch logs for debugging

Use cfn-hup for updates

Use custom AMIs to minimize application boot times

 

Best Practice 1: Use AWS::CloudFormation::Init

When you include an EC2 instance in a CloudFormation template, use the AWS::CloudFormation::Init section to specify what application packages you want downloaded on the instance, where to download them from, where to install them, what services to start, and what commands to run after the EC2 instance is up and running. You can do the same when you specify an Auto Scaling launch configuration. Here’s a fill-in-the-blanks example:

"MyInstance": {

"Type": "AWS::EC2::Instance",

"Metadata" : {

"AWS::CloudFormation::Init" : {

"webapp-config": {

"packages" : {}, "sources" : {}, "files" : {},

"groups" : {}, "users" : {},

"commands" : {}, "services" : {}

Why use AWS::CloudFormation::Init? For several reasons.

First of all, it is declarative. You just specify the desired configuration and let CloudFormation figure out the steps to get to the desired configuration. For example, in the "sources" section you just specify the remote location to download an application tarball from and a directory on the instance where you want to install the application source. CloudFormation takes care of the precise steps to download the tarball, retry on any errors, and extract the source files after the tarball is downloaded.

The same declarative specification is supported for packages or files to be downloaded, users or groups to be created, and commands or services to be executed. If you need to invoke a script, you can simply download that script by using the "files" section and execute the script using the "commands" section.

Configurations defined in AWS::CloudFormation::Init can be grouped into units of deployments, which can be reused, ordered, and executed across instance reboots. For details and examples, see Configsets.

Unlike the application specification coded in an EC2 user data script, the application configuration specified in AWS::CloudFormation::Init is updatable. This is handy, for example, when you want to install a new version of a package, without recreating a running instance.  AWS::CloudFormation::Init supports securely downloading application packages and other data.

More on the benefits. First, let’s take a quick look at the sequence of how AWS::CloudFormation::Init works:

You specify application configuration using the AWS::CloudFormation::Init section for an EC2 instance in your CloudFormation template.

You kick-off a CloudFormation stack creation using the template.

The AWS CloudFormation service starts creating a stack, including the EC2 instance.

After the EC2 instance is up and running, a CloudFormation helper script, cfn-init, is executed on the instance to configure the instance in accordance with your AWS::CloudFormation::Init template specification.*

Another CloudFormation helper script, cfn-signal, is executed on the instance to let the remote AWS CloudFormation service know the result (success/failure) of the configuration.* You can optionally have the CloudFormation service hold off on marking the EC2 instance state and the stack state “CREATE_COMPLETE” until the CloudFormation service hears a success signal for the instance. The holding-off period is specified in the template using a CreationPolicy.

*You can download the CloudFormation helper scripts for both Linux and Windows. These come preinstalled on the Linux and Windows AMIs provided by Amazon. You need to specify the commands to trigger cfn-init and cfn-signal in the EC2 user data script. Once an instance is up and running, the EC2 user data script is executed automatically for most Linux distributions and Windows.

Refer to this article for details and try these sample templates to see AWS::CloudFormation::Init in action.

 

Best Practice 2: Use IAM roles to securely download software and data

You might want to store the application packages and data at secure locations and allow only authenticated downloads when you are configuring the EC2 instances to run the applications. Use the AWS::CloudFormation::Authentication section to specify credentials for downloading the application packages and data specified in the AWS::CloudFormation::Init section. Although AWS::CloudFormation::Authentication supports several types of authentication, we recommend using an IAM role. For an end-to-end example refer to an earlier blog post “Authenticated File Downloads with CloudFormation.”

 

Best Practice 3: Use CloudWatch Logs for Debugging

When you are configuring an instance using AWS::CloudFormation::Init, configuration logs are stored on the instance in the cfn-init.log file and other cfn-*.log files. These logs are helpful for debugging configuration errors. In the past, you had to SSH or RDP into EC2 instances to retrieve these log files. However, with the advent of Amazon CloudWatch Logs, you no longer have to log on to the instances. You can simply stream those logs to CloudWatch and view them in the AWS Management Console. Refer to an earlier blog post “View CloudFormation Logs in the Console” to find out how.

 

Best Practice 4: Use cfn-hup for updates

Once your application stack is up and running, chances are that you will update the application, apply an OS patch, or perform some other configuration update in a stack’s lifecycle. You just update the AWS::CloudFormation::Init section in your template (for example, specify a newer version of an application package), and call UpdateStack. When you do, CloudFormation updates the instance metadata in accordance with the updated template. Then the cfn-hup daemon running on the instance detects the updated metadata and reruns cfn-init to update the instance in accordance with the updated configuration. cfn-hup is one of the CloudFormation helper scripts available on both Linux and Windows.

Look for cfn-hup in some of our sample templates to find out how to configure cfn-hup.

 

Best Practice 5: Minimize application boot time with custom AMIs

When you are using CloudFormation, you can use anything from AMIs, user data scripts, AWS::CloudFormation::Init, or third-party configuration tools to configure EC2 instances and bootstrap applications. On one hand, AWS::CloudFormation::Init and the other configuration tools provide a great deal of flexibility and control over instance configuration. On the other hand, AMIs offers the fastest application boot times, since your desired configuration and application can be preinstalled while creating a custom AMI.

Some CloudFormation customers optimize their usage by selecting an instance configuration and application bootstrapping method based on the environment. They employ AWS::CloudFormation::Init and other tools for flexibility and control in development and test environments. When they have a desired configuration developed and tested, they create a custom AMI and use that custom AMI in their CloudFormation stacks in production. The result is faster application boot times.

This optimization requires you to maintain two different configuration methods and requires you to keep track of which AMI corresponds to which version-controlled configuration for future reference and updates. As such, this might be worth looking into only if you have a business need to boot up several homogeneous instances, on-demand, in the absolute shortest time possible.

 

These best practices are based on the real-world experience of our customers. Reach out to us at @AWSCloudFormer to let us know your feedback on these best practices and additional best practices that you may want to share.


— Chetan Dandekar, Senior Product Manager, Amazon Web Services.


Tracking the Cost of Your AWS CloudFormation Stack

Post Syndicated from Elliot Yamaguchi original http://blogs.aws.amazon.com/application-management/post/TxU37HX5POUOBY/Tracking-the-Cost-of-Your-AWS-CloudFormation-Stack

With cost allocation tagging and the AWS Cost Explorer, you can see the cost of operating each of your AWS CloudFormation stacks.

Here’s how it works.  AWS CloudFormation automatically tags each stack resource. For example, if you have a stack that creates an Amazon EC2 instance, AWS CloudFormation automatically tags the instance with the following key-value pairs:

aws:cloudformation:stack-name - The name of the stack, such as myTestStack.

aws:cloudformation:stack-id - The full stack ID, such as arn:aws:cloudformation:us-east-1:123456789012:stack/myTestStack/2ac98f30-5bdd-11e4-949b-50fa5262a838.

aws:cloudformation:logical-id - The logical ID of a resource that is defined in the stack template, such as myInstance.

To obtain the costs by stack, all you do is set up a billing report to include the AWS CloudFormation tags. Then you can filter your report in the AWS Cost Explorer to see the costs of items tagged with a specific stack name, stack ID, or logical ID. With Cost Explorer, you can see the costs associated with one or more stacks or view how much of a stack’s cost is from a particular service, such as Amazon EC2 or Amazon RDS.

Note: Before you can use the AWS billing tools, you need the permissions that are described in the Billing and Cost Management Permissions Reference.

Configuring billing reports

1. Go to the Billing Preferences page in the AWS Management Console.

2. Select Receive Billing Reports, and specify an existing Amazon S3 bucket to store your billing reports.

3. Click Verify to ensure that your bucket exists and has the required permissions. You can use the AWS sample bucket policy to set the appropriate permissions; copy and paste the sample policy into your bucket's policy (a sketch of what this policy typically looks like appears at the end of this section).

4. Select the Detailed billing report with resources and tags, and then click Save preferences.

With this report, you can view a detailed bill with the report tags that you have included. Later, you will add AWS CloudFormation tags so that you can view costs for each AWS CloudFormation stack.

Note: The current month’s data will be available for viewing in about 24 hours.
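
For reference, the AWS sample bucket policy mentioned in step 3 generally looks like the sketch below. The bucket name is a placeholder, and the principal shown is the account AWS has historically used to deliver billing reports; copy the exact policy text offered in the console rather than this sketch.

{
  "Version" : "2012-10-17",
  "Statement" : [
    {
      "Effect" : "Allow",
      "Principal" : { "AWS" : "arn:aws:iam::386209384616:root" },
      "Action" : [ "s3:GetBucketAcl", "s3:GetBucketPolicy" ],
      "Resource" : "arn:aws:s3:::my-billing-bucket"
    },
    {
      "Effect" : "Allow",
      "Principal" : { "AWS" : "arn:aws:iam::386209384616:root" },
      "Action" : [ "s3:PutObject" ],
      "Resource" : "arn:aws:s3:::my-billing-bucket/*"
    }
  ]
}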


Configuring cost allocation tags

1. Under the Report section, click Manage report tags.

2. Select the AWS CloudFormation tags, and then click Save.

Your billing report will now include an additional column for each of the three AWS CloudFormation tags. For example, if you created a stack named myTestStack, all resources in that stack will have the value myTestStack in the aws:cloudformation:stack-name column.

Analyzing costs in Cost Explorer

1. From your billing dashboard, click Cost Explorer, and then click Launch Cost Explorer.
Note: If you just enabled reporting, data will be available for viewing in about 24 hours.

2. Select the Tags filter to view billing information about a particular stack or resource.

3. Select an AWS CloudFormation tag key to refine the filter. Choose the aws:cloudformation:stack-id or aws:cloudformation:stack-name tag to view information about a particular stack, or the aws:cloudformation:logical-id tag to view information about a specific resource.

4. Select one or more values for the tag key that you selected, and then click Apply.

Cost Explorer then displays billing information for the selected stacks or resources, such as the Amazon EC2 and Amazon RDS costs for a particular stack.

With these few simple steps, you can start analyzing the costs of your stacks and resources. Also, to help you estimate costs before you create a stack, you can use the AWS Simple Monthly Calculator. When you use the AWS CloudFormation console to create a stack, the create stack wizard provides a link to the calculator.