Tag Archives: Guest Post

AWS CloudFormation Security Best Practices

Post Syndicated from George Huang original http://blogs.aws.amazon.com/application-management/post/Tx2UMVHOX7UP4V7/AWS-CloudFormation-Security-Best-Practices

The following is a guest post by Hubert Cheung, Solutions Architect.

AWS CloudFormation makes it easy for developers and systems administrators to create and manage a collection of related AWS resources by provisioning and updating them in an orderly and predictable way. Many of our customers use CloudFormation to control all of the resources in their AWS environments so that they can succinctly capture changes, perform version control, and manage costs in their infrastructure, among other activities.

Customers often ask us how to control permissions for CloudFormation stacks. In this post, we share some of the best security practices for CloudFormation, which include using AWS Identity and Access Management (IAM) policies, CloudFormation-specific IAM conditions, and CloudFormation stack policies. Because most CloudFormation deployments are executed from the AWS command line interface (CLI) and SDK, we focus on using the AWS CLI and SDK to show you how to implement the best practices.

Limiting Access to CloudFormation Stacks with IAM

With IAM, you can securely control access to AWS services and resources by using policies and users or roles. CloudFormation leverages IAM to provide fine-grained access control.

As a best practice, we recommend that you limit service and resource access through IAM policies by applying the principle of least privilege. The simplest way to do this is to limit specific API calls to CloudFormation. For example, you may not want specific IAM users or roles to update or delete CloudFormation stacks. The following sample policy allows access to all CloudFormation APIs, but denies access to the UpdateStack and DeleteStack APIs on your production stack:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudformation:*"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Deny",
      "Action": [
        "cloudformation:UpdateStack",
        "cloudformation:DeleteStack"
      ],
      "Resource": "arn:aws:cloudformation:us-east-1:123456789012:stack/MyProductionStack/*"
    }
  ]
}
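To attach a policy like this to an IAM user from the CLI, you could use the put-user-policy call (the user name, policy name, and file name below are placeholders):

aws iam put-user-policy --user-name MyDeveloper --policy-name DenyProdStackChanges --policy-document file://deny-prod-stack-changes.json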

We know that IAM policies often need to allow the creation of particular resources, but you may not want those resources to be created through CloudFormation. This is where CloudFormation’s support for IAM conditions comes in.

IAM Conditions for CloudFormation

There are three CloudFormation-specific IAM conditions that you can add to your IAM policies:

cloudformation:TemplateURL

cloudformation:ResourceTypes

cloudformation:StackPolicyURL

With these three conditions, you can ensure that API calls for stack actions, such as create or update, use a specific template or are limited to specific resources, and that your stacks use a stack policy, which prevents stack resources from unintentionally being updated or deleted during stack updates.

Condition: TemplateURL

The first condition, cloudformation:TemplateURL, lets you specify where the CloudFormation template for a stack action, such as create or update, must reside, and enforce that it be used. In an IAM policy, it looks like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": [
        "cloudformation:CreateStack",
        "cloudformation:UpdateStack"
      ],
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "cloudformation:TemplateURL": [
            "https://s3.amazonaws.com/cloudformation-templates-us-east-1/IAM_Users_Groups_and_Policies.template"
          ]
        }
      }
    },
    {
      "Effect": "Deny",
      "Action": [
        "cloudformation:CreateStack",
        "cloudformation:UpdateStack"
      ],
      "Resource": "*",
      "Condition": {
        "Null": {
          "cloudformation:TemplateURL": "true"
        }
      }
    }
  ]
}

The first statement ensures that for all CreateStack or UpdateStack API calls, users must use the specified template. The second ensures that all CreateStack or UpdateStack API calls must include the TemplateURL parameter. From the CLI, your calls need to include the --template-url parameter:

aws cloudformation create-stack --stack-name cloudformation-demo --template-url https://s3.amazonaws.com/cloudformation-templates-us-east-1/IAM_Users_Groups_and_Policies.template
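Because the policy covers both CreateStack and UpdateStack, an update against an existing stack is subject to the same requirement; for example:

aws cloudformation update-stack --stack-name cloudformation-demo --template-url https://s3.amazonaws.com/cloudformation-templates-us-east-1/IAM_Users_Groups_and_Policies.template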

Condition: ResourceTypes

CloudFormation also allows you to control the types of resources that are created or updated in templates with an IAM policy. The CloudFormation API accepts a ResourceTypes parameter. In your API call, you specify which types of resources can be created or updated. However, to use the new ResourceTypes parameter, you need to modify your IAM policies to enforce the use of this particular parameter by adding in conditions like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": [
        "cloudformation:CreateStack",
        "cloudformation:UpdateStack"
      ],
      "Resource": "*",
      "Condition": {
        "ForAllValues:StringLike": {
          "cloudformation:ResourceTypes": [
            "AWS::IAM::*"
          ]
        }
      }
    },
    {
      "Effect": "Deny",
      "Action": [
        "cloudformation:CreateStack",
        "cloudformation:UpdateStack"
      ],
      "Resource": "*",
      "Condition": {
        "Null": {
          "cloudformation:ResourceTypes": "true"
        }
      }
    }
  ]
}

From the CLI, your calls need to include a --resource-types parameter. A call to create your stack will look like this:

aws cloudformation create-stack --stack-name cloudformation-demo --template-url https://s3.amazonaws.com/cloudformation-templates-us-east-1/IAM_Users_Groups_and_Policies.template --resource-types="[AWS::IAM::Group, AWS::IAM::User]"

Depending on the shell, the resource types might need to be enclosed in quotation marks as follows; otherwise, you’ll get a “No JSON object could be decoded” error:

aws cloudformation create-stack --stack-name cloudformation-demo --template-url https://s3.amazonaws.com/cloudformation-templates-us-east-1/IAM_Users_Groups_and_Policies.template --resource-types='["AWS::IAM::Group", "AWS::IAM::User"]'

The ResourceTypes conditions ensure that CloudFormation creates or updates the right resource types and templates with your CLI or API calls. In the preceding examples, our IAM policy would have blocked the API calls because they included AWS::IAM resources. If our template included only AWS::EC2::Instance resources, the CLI command would look like this and would succeed:

aws cloudformation create-stack --stack-name cloudformation-demo --template-url https://s3.amazonaws.com/cloudformation-templates-us-east-1/IAM_Users_Groups_and_Policies.template --resource-types='["AWS::EC2::Instance"]'

The third condition is the StackPolicyURL condition. Before we explain how that works, we need to provide some additional context about stack policies.

Stack Policies

Often, the worst disruptions are caused by unintentional changes to resources. To help mitigate this risk, CloudFormation provides stack policies, which prevent stack resources from being unintentionally updated or deleted during stack updates. When used in conjunction with IAM, stack policies provide a second layer of defense against both unintentional and malicious changes to your stack resources.

The CloudFormation stack policy is a JSON document that defines what can be updated as part of a stack update operation. To set or update the policy, your IAM users or roles must first have the ability to call the cloudformation:SetStackPolicy action.

You apply the stack policy directly to the stack; note that this is not an IAM policy. By default, setting a stack policy protects all stack resources with an implicit Deny, so no updates are allowed unless you specify an explicit Allow. This means that if you want to restrict only a few resources, you must explicitly allow all other updates by including an Allow on the resource "*" and a Deny for the specific resources you want to protect.

For example, stack policies are often used to protect a production database because it contains live data. Depending on the field that’s changing, there are times when the entire database could be replaced during an update. In the following example, the stack policy explicitly denies attempts to update your production database:

{
  "Statement" : [
    {
      "Effect" : "Deny",
      "Action" : "Update:*",
      "Principal": "*",
      "Resource" : "LogicalResourceId/ProductionDB_logical_ID"
    },
    {
      "Effect" : "Allow",
      "Action" : "Update:*",
      "Principal": "*",
      "Resource" : "*"
    }
  ]
}
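To apply a policy like this to an existing stack from the CLI, you could pass it as a local file with set-stack-policy (the file name below is a placeholder):

aws cloudformation set-stack-policy --stack-name MyProductionStack --stack-policy-body file://deny-productiondb-updates.json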

You can generalize your stack policy to include all RDS DB instances or any given ResourceType. To achieve this, you use conditions. However, note that because we used a wildcard in our example, the condition must use the "StringLike" condition and not "StringEquals":

{
  "Statement" : [
    {
      "Effect" : "Deny",
      "Action" : "Update:*",
      "Principal": "*",
      "Resource" : "*",
      "Condition" : {
        "StringLike" : {
          "ResourceType" : ["AWS::RDS::DBInstance", "AWS::AutoScaling::*"]
        }
      }
    },
    {
      "Effect" : "Allow",
      "Action" : "Update:*",
      "Principal": "*",
      "Resource" : "*"
    }
  ]
}

For more information about stack policies, see Prevent Updates to Stack Resources.

Finally, let’s ensure that all of your stacks have an appropriate pre-defined stack policy. To address this, we return to IAM policies.

Condition: StackPolicyURL

From within your IAM policy, you can ensure that every CloudFormation stack has a stack policy associated with it upon creation with the StackPolicyURL condition:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": [
        "cloudformation:SetStackPolicy"
      ],
      "Resource": "*",
      "Condition": {
        "ForAnyValue:StringNotEquals": {
          "cloudformation:StackPolicyUrl": [
            "https://s3.amazonaws.com/samplebucket/sampleallowpolicy.json"
          ]
        }
      }
    },
    {
      "Effect": "Deny",
      "Action": [
        "cloudformation:CreateStack",
        "cloudformation:UpdateStack"
      ],
      "Resource": "*",
      "Condition": {
        "ForAnyValue:StringNotEquals": {
          "cloudformation:StackPolicyUrl": [
            "https://s3.amazonaws.com/samplebucket/sampledenypolicy.json"
          ]
        }
      }
    },
    {
      "Effect": "Deny",
      "Action": [
        "cloudformation:CreateStack",
        "cloudformation:UpdateStack",
        "cloudformation:SetStackPolicy"
      ],
      "Resource": "*",
      "Condition": {
        "Null": {
          "cloudformation:StackPolicyUrl": "true"
        }
      }
    }
  ]
}

This policy ensures that a specific stack policy URL must be provided any time SetStackPolicy is called. In this case, the URL is https://s3.amazonaws.com/samplebucket/sampleallowpolicy.json. Similarly, for any create or update stack operation, this policy ensures that the StackPolicyURL is set to the sampledenypolicy.json document in S3 and that a StackPolicyURL is always specified. From the CLI, a create-stack command would look like this:

aws cloudformation create-stack --stack-name cloudformation-demo --parameters ParameterKey=Password,ParameterValue=CloudFormationDemo --capabilities CAPABILITY_IAM --template-url https://s3.amazonaws.com/cloudformation-templates-us-east-1/IAM_Users_Groups_and_Policies.template --stack-policy-url https://s3-us-east-1.amazonaws.com/samplebucket/sampledenypolicy.json

Note that if you specify a new stack policy on a stack update, CloudFormation uses the existing stack policy: it uses the new policy only for subsequent updates. For example, if your current policy is set to deny all updates, you must run a SetStackPolicy command to change the stack policy to the one that allows updates. Then you can run an update command against the stack. To update the stack we just created, you can run this:

aws cloudformation set-stack-policy --stack-name cloudformation-demo --stack-policy-url https://s3-us-east-1.amazonaws.com/samplebucket/sampleallowpolicy.json

Then you can run the update:

aws cloudformation update-stack --stack-name cloudformation-demo --parameters ParameterKey=Password,ParameterValue=NewPassword --capabilities CAPABILITY_IAM --template-url https://s3.amazonaws.com/cloudformation-templates-us-east-1/IAM_Users_Groups_and_Policies.template --stack-policy-url https://s3-us-west-2.amazonaws.com/awshubfiles/sampledenypolicy.json

The IAM policy that we used ensures that a specific stack policy is applied to the stack any time a stack is updated or created.

Conclusion

CloudFormation provides a repeatable way to create and manage related AWS resources. By using a combination of IAM policies, users, and roles, CloudFormation-specific IAM conditions, and stack policies, you can ensure that your CloudFormation stacks are used as intended and minimize accidental resource updates or deletions.

You can learn more about this topic and other CloudFormation best practices in the recording of our re:Invent 2015 session, (DVO304) AWS CloudFormation Best Practices, and in our documentation.

How Coursera Manages Large-Scale ETL using AWS Data Pipeline and Dataduct

Post Syndicated from Coursera original https://blogs.aws.amazon.com/bigdata/post/Tx2Q3JGH427TL8Z/How-Coursera-Manages-Large-Scale-ETL-using-AWS-Data-Pipeline-and-Dataduct

This is a guest post by Sourabh Bajaj, a Software Engineer at Coursera. Coursera in their own words: "Coursera is an online educational startup with over 14 million learners across the globe. We offer more than 1000 courses from over 120 top universities."

 At Coursera, we use Amazon Redshift as our primary data warehouse because it provides a standard SQL interface and has fast and reliable performance. We use AWS Data Pipeline to extract, transform, and load (ETL) data into the warehouse. Data Pipeline provides fault tolerance, scheduling, resource management and an easy-to-extend API for our ETL.

Dataduct is a Python-based framework built on top of Data Pipeline that lets users create custom reusable components and patterns to be shared across multiple pipelines. This boosts developer productivity and simplifies ETL management. At Coursera, we run over 150 pipelines that pull data from 15 data sources such as Amazon RDS, Cassandra, log streams, and third-party APIs. We load over 300 tables every day into Amazon Redshift, processing several terabytes of data. Subsequent pipelines push data back into Cassandra to power our recommendations, search, and other data products.

The image below illustrates the data flow at Coursera.

Data flow at Coursera

In this post, I show you how to create pipelines that you can use to share custom reusable components and patterns across pipelines.

Creating a simple pipeline

You can start by creating a simple pipeline. The first example pulls user metadata from RDS and loads it into Amazon Redshift:

name: users
frequency: daily
load_time: 12:00
description: Users table from RDS database

steps:
- step_type: extract-rds
  sql: SELECT id, user_name, user_email FROM users
  host_name: maestro
  database: userDb

- step_type: create-load-redshift
  name: load_staging
  table_definition: tables/staging.maestro_users.sql

- step_type: upsert
  name: upsert_users
  source: tables/staging.maestro_users.sql
  destination: tables/prod.users.sql

# QA tests
- step_type: primary-key-check
  depends_on: upsert_users
  table_definition: tables/prod.users.sql

- step_type: count-check
  depends_on: upsert_users
  source_sql: SELECT id FROM users
  source_host: maestro
  destination_sql: SELECT user_id FROM users
  tolerance: 2.0

The figure below shows the pipeline steps as a DAG (directed acyclic graph).

Pipeline steps as a DAG

As you can see in the above pipeline, you define the pipeline structure in a YAML file. The YAML files are then saved in a version control system to track how these pipelines evolve over time. The following steps illustrate how reusable components are created for use in different pipelines.

Step 1: extract-rds. Takes in a SQL query and database name to read data. Credentials are passed through a configuration file that is shared across all pipelines.

Step 2: create-load-redshift. Creates a table if it doesn’t exist and loads data into the table using the COPY command in Amazon Redshift.

Step 3: upsert. Takes the data in the staging table and updates the production table with any values that need to be inserted or updated.

Step 4: primary-key-check. Checks for primary key violations in the production table.

Step 5: count-check. Compares the number of rows between the source database and Amazon Redshift.

The validation steps compare the data between the source system and warehouse for data quality issues that may arise during ETL, by checking for the following cases:

Primary key violations in the warehouse.

Dropped rows, by comparing the number of rows.

Corrupted rows, by comparing a sample set of rows.

Notice that there is a bootstrap and a teardown step for the pipeline. These steps are specified in the configuration. The bootstrap step can be used to install dependencies and upgrade to the latest version of the code so that you don’t have to redeploy each pipeline every time the pipeline code is updated. Similarly, the teardown step can be used for triggering SNS alerts on failure.  

Creating a more complex pipeline

Let’s consider a more complicated pipeline that pulls data from Cassandra to Amazon Redshift. In this example, you use Aegisthus to read and parse Cassandra backups and Scalding to transform and normalize the data into TSV files for loading into Amazon Redshift.

As you can see below, the Aegisthus and Scalding steps use ShellCommandActivity in Data Pipeline to run scripts and commands on the EMR cluster. The Aegisthus step takes in the Cassandra schema to be parsed. It then reads the data from backups in S3 and writes back to an output node. The Scalding step runs a series of MapReduce jobs to create different output nodes for each table to update in Amazon Redshift. This example uses shell command activity to extend the simple case I discussed in the first pipeline.

name: votes
frequency: daily
load_time: 04:00

emr_cluster_config:
  num_instances: 20
  instance_size: m1.xlarge

steps:
- step_type: aegisthus
  cql_schema_definition: cassandra_tables/vote.vote_kvs_timestamp.cql

- step_type: scalding
  name: vote_emr
  job_name: org.coursera.etl.vote.VotesETLJob
  output_node:
  - questions

- step_type: create-load-redshift
  name: load_discussion_question_votes
  input_node: questions
  depends_on: vote_emr
  table_definition: tables/staging.cassandra_discussion_question_votes.sql

- step_type: reload
  name: reload_discussion_question_votes
  source: tables/staging.cassandra_discussion_question_votes.sql
  destination: tables/prod.discussion_question_votes.sql

Building your own pipelines

To learn more about Dataduct and to get started developing your own pipelines, read the Dataduct documentation. You can install the library using pip and start by running the example pipelines provided with the code on GitHub.
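For example, a minimal setup might look like this (the GitHub repository location is an assumption; check the Dataduct documentation for the current source):

pip install dataduct
git clone https://github.com/coursera/dataduct.git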

Conclusion

Dataduct helps us easily maintain the pipelines and allows product developers to create pipelines with minimal involvement from the Data Infrastructure team. AWS Data Pipeline is flexible and provides great support for resource management, fault tolerance and scheduling. Using Data Pipeline to monitor pipeline run times, query times, retries, and failures can reveal inefficiencies in pipelines. Leverage shared code across pipelines as much as you can; your goal should be to redeploy the pipeline only when the steps in the pipeline change.

Dataduct is open source under the Apache 2.0 License, so please contribute ideas, issues, and pull requests to make the project better. I will be speaking at re:Invent 2015, so get in touch with me at @sb2nov if you want to chat more about Coursera or Dataduct.

If you have questions or suggestions, please leave a comment below.

Sourabh Bajaj is not an Amazon employee and does not represent Amazon.

———————–

Related

Introducing On-Demand Pipeline Execution in AWS Data Pipeline

 

Using BlueTalon with Amazon EMR

Post Syndicated from BlueTalon original https://blogs.aws.amazon.com/bigdata/post/Tx12BHE57L19IQI/Using-BlueTalon-with-Amazon-EMR

This is a guest post by Pratik Verma, Founder and Chief Product Officer at BlueTalon. Leonid Fedotov, Senior Solution Architect at BlueTalon, also contributed to this post.

Amazon Elastic MapReduce (Amazon EMR) makes it easy to quickly and cost-effectively process vast amounts of data in the cloud. EMR gets used for log, financial, fraud, and bioinformatics analysis, as well as many other big data use cases. Often, the data used in these analyses, such as customer information, transaction history, and other proprietary data, is sensitive from a business perspective and may even be subject to regulatory compliance.

BlueTalon is a leading provider of data-centric security solutions for Hadoop, SQL, and big data environments, on-premises and in the cloud. BlueTalon keeps enterprises in control of their data by allowing them to give users access to the data they need, not a byte more. The BlueTalon solution works across AWS data services such as EMR, Amazon Redshift, and Amazon RDS.

In this blog post, we show how organizations can use BlueTalon to mitigate the risks associated with their use of sensitive data while taking full advantage of EMR.

BlueTalon provides capabilities for data-centric security:

Audits of user activity using a context-rich trail of queries users run that hit sensitive fields.

Precise control over data that is specific for each user identity or business role and specific for the data resource at the file, folder, table, column, row, cell, or partial-cell level.

Secure use of business data in policy decisions for real-world requirements, while maintaining complex access scenarios and relationships between users and data.

Using BlueTalon to enforce data security

BlueTalon’s data-centric security solution has three main components: a UI to create rules and visualize real-time audit, a Policy Engine to make fast run-time authorization decisions, and a collection of Enforcement Points that transparently enforce the decisions made by the Policy Engine.

In a typical Hadoop cluster, users specify computations using SQL queries in Hive, scripts in Pig, or MapReduce programs. For applications accessing data via Hive, the BlueTalon Hive enforcement point transparently proxies HiveServer2 at the network level and provides policy-protected data. The BlueTalon Policy Engine makes sophisticated, fine-grained policy decisions based on user and content criteria in-memory at run-time by re-engineering SQL requests for Hive. With the query modification technique, BlueTalon is able to ensure that end users get the same data, whether raw data is coming from local HDFS or Amazon S3, and that only policy-compliant data is pulled from storage by Hive.

For direct HDFS access, end users connect to and receive policy-protected data via the BlueTalon HDFS enforcement point, which transparently proxies the HDFS NameNode at the network level. The BlueTalon Policy Engine makes policy decisions based on user and content criteria in-memory at run time to provide folder- and file-level control on HDFS. With the enforcement point for HDFS, BlueTalon ensures that end users can’t get around its security by going directly to HDFS to obtain data that is not accessible via Hive.

Using enforcement points, BlueTalon provides the following access controls for your data:

Field protection: Fields can be denied without breaking the application. As an example, a blank value compatible with the id field is returned instead of revealing the id values as they are stored on disk.

Record protection: The result set can be filtered to return a subset of the data, even when the field used in the filter criteria is not in the result set. In this example, the user is able to see only the 2 records with the East Coast zip codes, compared to 10 records on disk.

Cell protection: A specific field value for a specific record can be protected. In this example, the user is able to see the birthdate value for ‘Joyce McDonald’ but not ‘Kelly Adams’. Here as well, the date field is compatible with the format expected by the application.

Partial cell protection: Even portions of a cell may be protected. In this example, the user sees the last four digits of a Social Security number, rather than the number being hidden entirely.

The BlueTalon Policy Engine integrates with Active Directory for authenticating end-user credentials and mapping identities to business roles. It enforces authorization so that Hive provides only policy-compliant data to end users.

Deploying BlueTalon with Amazon EMR

In the following sections, you’ll learn how to deploy BlueTalon with EMR and configure the policies. A typical deployment looks like the following:

Prerequisites

You need to contact [email protected] to obtain an evaluation copy, an Amazon EC2 Linux instance for installing BlueTalon, and an Amazon EMR cluster in the same VPC. BlueTalon recommends using an m3.large instance with CentOS.

To integrate BlueTalon with a directory, you can use a pre-existing directory in your VPC or launch a new Simple AD using AWS Directory Service. For more information, see Tutorial: Creating a Simple AD Directory.
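If you don’t already have a directory, a Simple AD can also be created from the CLI; the following is a generic sketch with placeholder values, not a BlueTalon-specific requirement:

aws ds create-directory --name corp.example.com --short-name corp --password '<directory-admin-password>' --size Small --vpc-settings VpcId=vpc-xxxxxxxx,SubnetIds=subnet-aaaaaaaa,subnet-bbbbbbbb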

Install the packages

On the EC2 instance, install the BlueTalon Policy Engine and Audit packages, available as rpm packages, using the yum commands:

> yum search bluetalon

bluetalon-audit.x86_64 : BlueTalon data security for Hadoop.
bluetalon-enforcementpoint.x86_64 : BlueTalon data security for Hadoop.
bluetalon-policy.x86_64 : BlueTalon data security for Hadoop.

> yum install bluetalon-audit -y

> yum install bluetalon-policy -y

Run the setup script

After the BlueTalon packages are installed, run the setup script to configure and turn on the run-time services and UI associated with the two packages.   

> bluetalon-audit-setup

Starting bt-audit-server service: [ OK ]
Starting bt-audit-zookeeper service: [ OK ]
Starting bt-audit-kafka service: [ OK ]
Starting bt-audit-activity-monitor service: [ OK ]

BlueTalon Audit Product is installed….
URL to access BlueTalon Audit UI
ec2-0-0-0-0.us-west-2.compute.amazonaws.com:8112/BlueTalonAudit

Default Username : btadminuser
Default Password : [email protected]

> bluetalon-policy-setup

Starting bt-postgresql service: [ OK ]
Starting bt-policy-engine service: [ OK ]
Starting bt-sql-hooks-vds service: [ OK ]
Starting bt-webserver service: [ OK ]
Starting bt-HeartBeatService service: [ OK ]

BlueTalon Data Security Product for Hadoop is installed….
You can create rules using the BlueTalon Policy UI
URL to access BlueTalon Policy UI
ec2-0-0-0-0.us-west-2.compute.amazonaws.com:8111/BlueTalonConfig

Default Username : btadminuser
Default Password : [email protected]

Connecting to the BlueTalon UI

After the run time services associated with the BlueTalon packages have started, you should be able to connect to the BlueTalon Policy Management and User Audit interfaces as displayed below.

 

Installing enforcement points

Install and configure the BlueTalon enforcement point packages for Hive and HDFS NameNode on the master node of the EMR cluster using the following commands:

> yum install bluetalon-enforcementpoint -y
> bluetalon-enforcementpoint-setup Hive 10011 HiveDS

Starting bt-enforcement-point-demods service: [ OK ]

The arguments to this command include:

Hive: The type of enforcement point to configure. Options include Hive, HDFS, and PostgreSQL.

10011: The port on which the enforcement point listens.

HiveDS: The name of the data domain in the BlueTalon UI to associate with this enforcement point.

This command configures a Hive enforcement point for the local HiveServer2 and creates an iptables entry to re-route HiveServer2 traffic to the BlueTalon enforcement point first.
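If you want to confirm the redirect on the master node, you can inspect the NAT table; this is a generic iptables check, not a BlueTalon-documented step:

> sudo iptables -t nat -L -n --line-numbers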

The following command restarts NameNode with the embedded BlueTalon enforcement point process:

> bluetalon-enforcementpoint-setup HDFS
Stopping NameNode process: [ OK ]
Starting NameNode process: [ OK ]

Adding data domains

Open the BlueTalon Policy Management UI using a browser and add Hive and HDFS as data domains so that BlueTalon can look up the data resources (databases, tables, columns, folders, files, etc.) to create data access rules. This requires connectivity information for HiveServer2 and NameNode.

For HiveServer2:

default: Database name associated with Hive warehouse. Typically, ‘default’.

10.0.0.1: Hostname of the machine where HiveServer2 is running. Typically, the DNS of the master node in Amazon EMR.

10000: Port that HiveServer2 is listening on. Typically, ‘10000’.

10011: Port on which the enforcement point listens. Typically, ‘10011’.

HiveDS: The name of the data domain in the BlueTalon UI to associate with this enforcement point.

No Login: Credentials for connecting to HiveServer2 if required.

For HDFS:

10.0.0.1: Hostname of the machine where NameNode is running.

8020: Port on which NameNode is listening. Typically, ‘8020’.

HDFSDS: The name of the data domain in the BlueTalon UI to associate with this enforcement point.

Adding user domains

Using the BlueTalon Policy Management UI, add the directory as a user domain so that BlueTalon can authenticate user credentials and look up the business roles to which a user belongs. For more information about obtaining connectivity information, see Viewing Directory Information.

10.0.0.1: Hostname of the machine where Active Directory is running.

389:  Port of the machine where Active Directory is running. Typically, ‘389’.

10011: Port that the Enforcement Point listens on. Typically, ‘10011’.

CN=hadoopadmin: Credentials for bind and query to Active Directory.

Creating rules for specifying data access

Using the BlueTalon UI, you can create rules specifying which users can access what data. This can be done using the Add Rule button on the Policy tab to open a tray. Two examples are shown below.

On the left is an example of a row-level rule that restricts access for user ‘admin1’ to records in the ‘people’ table for locations in West Coast zip codes only. On the right is an example of a masking rule on a sensitive field, ‘accounts.ssn’, which masks it completely.

Deploying policies

After the rules are created, deploy the policy to the BlueTalon Policy Engine using the Deploy button from the Deploy tab. After it’s deployed, the policy and rules become effective on the Policy Engine.

The screenshots below show the data protection with BlueTalon by making queries through the ‘beeline’ client.

With BlueTalon

beeline> !connect jdbc:hive2://<hostname of masternode>:10011/default

Without BlueTalon

beeline> !connect jdbc:hive2://<hostname of masternode>:10000/default

With BlueTalon, row-level protection reduces the count of records to 249.

Without BlueTalon, the count of records is 2499.

With BlueTalon protection, the ssn field is masked with ‘XXXX’.
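As a rough illustration of how such results can be checked (assuming the ‘people’ and ‘accounts’ tables from the policy examples above; adjust to your schema), you can run simple queries once connected through either port:

beeline> SELECT COUNT(*) FROM people;
beeline> SELECT ssn FROM accounts LIMIT 5;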

Auditing access

All access through the BlueTalon enforcement points is authorized against the BlueTalon Policy Engine and audited. The audit can be visualized in the BlueTalon User Audit UI.

Try BlueTalon with data available from AWS

Generate sample data with table ‘books_1’ using instructions from http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/query-impala-generate-data.html

Create a policy for a user ‘alice’ that allows a read on table ‘books_1’ for books with ‘price’ less than $30.00, masks the field ‘publisher’, and denies the ‘id’ field completely.

Run the query directly and through BlueTalon to see the effect of the policy rules created in BlueTalon.
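For example, assuming the default database and the ‘books_1’ table from the linked tutorial (column names other than id, publisher, and price depend on the generated schema), the comparison might look like this:

beeline> !connect jdbc:hive2://<hostname of masternode>:10000/default
beeline> SELECT id, publisher, price FROM books_1 LIMIT 10;
beeline> !connect jdbc:hive2://<hostname of masternode>:10011/default
beeline> SELECT id, publisher, price FROM books_1 LIMIT 10;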

Data as stored in Hive:

Result with BlueTalon protection:

Conclusion

BlueTalon enables organizations to efficiently protect access to data in HDFS or Amazon S3, allow users to get the data they need, and leverage the full potential of Hadoop in a secure manner.

If you have questions or suggestions, please leave a comment below.

——————————-

Related:

Using IPython Notebook to Analyze Data with EMR

Getting Started with Elasticsearch and Kibana on EMR

Strategies for Reducing your EMR Costs

—————————————————————-

Love to work on open source? Check out EMR’s careers page.

—————————————————————-

 

 

Using BlueTalon with Amazon EMR

Post Syndicated from BlueTalon original https://blogs.aws.amazon.com/bigdata/post/Tx12BHE57L19IQI/Using-BlueTalon-with-Amazon-EMR

This is a guest post by Pratik Verma, Founder and Chief Product Officer at BlueTalon. Leonid Fedotov, Senior Solution Architect at BlueTalon, also contributed to this post.

Amazon Elastic MapReduce (Amazon EMR) makes it easy to quickly and cost-effectively process vast amounts of data in the cloud. EMR gets used for log, financial, fraud, and bioinformatics analysis, as well as many other big data use cases. Often, the data used in these analyses, such as customer information, transaction history, and other proprietary data, is sensitive from a business perspective and may even be subject to regulatory compliance.

BlueTalon is a leading provider of data-centric security solutions for Hadoop, SQL and Big Data environments on-premises and in the cloud. BlueTalon keeps enterprises in control of their data by allowing them to give users access to the data they need, not a byte more. BlueTalon solution works across AWS data services like EMR, Redshift and RDS.

In this blog post, we show how organizations can use BlueTalon to mitigate the risks associated with their use of sensitive data while taking full advantage of EMR.

BlueTalon provides capabilities for data-centric security:

Audits of user activity using a context-rich trail of queries users run that hit sensitive fields.

Precise control over data that is specific for each user identity or business role and specific for the data resource at the file, folder, table, column, row, cell, or partial-cell level.

Secure use of business data in policy decisions for real-world requirements, while maintaining complex access scenarios and relationship between users and data.

Using BlueTalon to enforce data security

BlueTalon’s data-centric security solution has three main components: a UI to create rules and visualize real-time audit, a Policy Engine to make fast run-time authorization decisions, and a collection of Enforcement Points that transparently enforce the decisions made by the Policy Engine.

In a typical Hadoop cluster, users specify computations using SQL queries in Hive, scripts in Pig, or MapReduce programs. For applications accessing data via Hive, the BlueTalon Hive enforcement point transparently proxies HiveServer2 at the network level and provides policy-protected data. The BlueTalon Policy Engine makes sophisticated, fine-grained policy decisions based on user and content criteria in-memory at run-time by re-engineering SQL requests for Hive. With the query modification technique, BlueTalon is able to ensure that end users get the same data, whether raw data is coming from local HDFS or Amazon S3, and that only policy-compliant data is pulled from storage by Hive.

For direct HDFS access, end users connect to and receive policy-protected data via the BlueTalon HDFS enforcement point that transparently proxies HDFS NameNode at network level and the BlueTalon Policy Engine makes policy decisions based on user and content criteria in-memory at run-time to provide folder and file level control on HDFS. With the enforcement point for HDFS, BlueTalon ensures that end-users can’t get around its security by going to HDFS to obtain data not accessible via Hive.

Using enforcement points, BlueTalon provides the following access controls for your data:

Field protection: Fields can be denied without breaking the application. As an example, a blank value compatible with the id field is returned instead of revealing the id values as they are stored on disk.

Record protection: The result set can be filtered to return a subset of the data, even when the field used in the filter criteria is not in the result set. In this example, the user is able to see only the 2 records with the East Coast zip codes, compared to 10 records on disk.

Cell protection: A specific field value for a specific record can be protected. In this example, the user is able to see the birthdate value for ‘Joyce McDonald’ but not ‘Kelly Adams’. Here as well, the date field is compatible with the format expected by the application.

Partial cell protection: Even portions of a cell may be protected. In this example, the user sees the last four digits of a Social Security number, rather than the number being hidden entirely.

The BlueTalon Policy Engine integrates with Active Directory for authenticating end-user credentials and mapping identities to business roles. It enforces authorization so that Hive provides only policy-compliant data to end users.

Deploying BlueTalon with Amazon EMR

In the following sections, you’ll learn how to deploy BlueTalon with EMR and configure the policies. A typical deployment looks like the following:

Prerequisites

You need to contact [email protected] to obtain an evaluation copy, an Amazon EC2 Linux instance for installing BlueTalon, and an Amazon EMR cluster in the same VPC. BlueTalon recommends using an m3.large instance with CentOS.

To integrate BlueTalon with a directory, you can use a pre-existing directory in your VPC or launch a new Simple AD using AWS Directory Service. For more information, see Tutorial: Creating a Simple AD Directory.

Install the packages

On the EC2 instance, install the BlueTalon Policy Engine and Audit packages, available as rpm packages, using the yum commands:

> yum search bluetalon

bluetalon-audit.x86_64 : BlueTalon data security for Hadoop.
bluetalon-enforcementpoint.x86_64 : BlueTalon data security for Hadoop.
bluetalon-policy.x86_64 : BlueTalon data security for Hadoop.

> yum install bluetalon-audit –y

> yum install bluetalon-policy –y

Run the setup script

After the BlueTalon packages are installed, run the setup script to configure and turn on the run-time services and UI associated with the two packages.   

> bluetalon-audit-setup

Starting bt-audit-server service: [ OK ]
Starting bt-audit-zookeeper service: [ OK ]
Starting bt-audit-kafka service: [ OK ]
Starting bt-audit-activity-monitor service: [ OK ]

BlueTalon Audit Product is installed….
URL to access BlueTalon Audit UI
ec2-0-0-0-0.us-west-2.compute.amazonaws.com:8112/BlueTalonAudit

Default Username : btadminuser
Default Password : [email protected]

> bluetalon-policy-setup

Starting bt-postgresql service: [ OK ]
Starting bt-policy-engine service: [ OK ]
Starting bt-sql-hooks-vds service: [ OK ]
Starting bt-webserver service: [ OK ]
Starting bt-HeartBeatService service: [ OK ]

BlueTalon Data Security Product for Hadoop is installed….
You can create rules using the BlueTalon Policy UI
URL to access BlueTalon Policy UI
ec2-0-0-0-0.us-west-2.compute.amazonaws.com:8111/BlueTalonConfig

Default Username : btadminuser
Default Password : [email protected]

Connecting to the BlueTalon UI

After the run time services associated with the BlueTalon packages have started, you should be able to connect to the BlueTalon Policy Management and User Audit interfaces as displayed below.

 

Installing enforcement points

Install and configure the BlueTalon enforcement point packages for Hive and HDFS NameNode on the master node of the EMR cluster using the following commands:

> yum install bluetalon-enforcementpoint -y
> bluetalon-enforcementpoint-setup Hive 10011 HiveDS

Starting bt-enforcement-point-demods service: [ OK ]

The arguments to this command include:

Hive: The type of enforcement point to configure. Options include Hive, HDFS, and PostgreSQL.

10011: The port on which the enforcement point listens.

HiveDS: The name of the data domain in the BlueTalon UI to associate with this enforcement point.

This command configures a Hive enforcement point for the local HiveServer2 and creates an iptables entry to re-route HiveServer2 traffic to the BlueTalon enforcement point first.

The following command restarts NameNode with the embedded BlueTalon enforcement point process:

> bluetalon-enforcementpoint-setup HDFS
Stopping NameNode process: [ OK ]
Starting NameNode process: [ OK ]

Adding data domains

Open the BlueTalon Policy Management UI using a browser and add Hive and HDFS as data domains so that BlueTalon can look up the data resources (databases, tables, columns, folders, files, etc.) to create data access rules. This requires connectivity information for HiveServer2 and NameNode.

For HiveServer2:

default: Database name associated with Hive warehouse. Typically, ‘default’.

10.0.0.1: Hostname of the machine where HiveServer2 is running. Typically, the DNS of the master node in Amazon EMR.

10000: Port that HiveServer2 is listening on. Typically, ‘10000’.

10011: Port on which the enforcement point listens. Typically, ‘10011’.

HiveDS: The name of the data domain in the BlueTalon UI to associate with this enforcement point.

No Login: Credentials for connecting to HiveServer2 if required.

For HDFS:

10.0.0.1: Hostname of the machine where NameNode is running.

8020: Port on which NameNode is listening. Typically, ‘8020’.

HDFSDS: The name of the data domain in the BlueTalon UI to associate with this enforcement point.

Adding user domains

Using the BlueTalon Policy Management UI, add the directory as a user domain so that BlueTalon can authenticate user credentials and look up the business roles to which a user belongs. For more information about obtaining connectivity information, see Viewing Directory Information.

10.0.0.1: Hostname of the machine where Active Directory is running.

389:  Port of the machine where Active Directory is running. Typically, ‘389’.

10011: Port that the Enforcement Point listens on. Typically, ‘10011’.

CN=hadoopadmin: Credentials for bind and query to Active Directory.

Creating rules for specifying data access

Using the BlueTalon UI, you can create rules that specify which users can access what data. Click the Add Rule button on the Policy tab to open the rule tray. Two examples are shown below.

On the left is an example of a row-level rule that restricts access for user ‘admin1’ to records in the ‘people’ table for locations in West Coast zip codes only. On the right is an example of a masking rule on a sensitive field, ‘accounts.ssn’, which masks it completely.

Deploying policies

After the rules are created, deploy the policy to the BlueTalon Policy Engine using the Deploy button from the Deploy tab. After it’s deployed, the policy and rules become effective on the Policy Engine.

The following screenshots show BlueTalon’s data protection in action, using queries issued through the ‘beeline’ client.

With BlueTalon

beeline> !connect jdbc:hive2://<hostname of masternode>:10011/default

Without BlueTalon

beeline> !connect jdbc:hive2://<hostname of masternode>:10000/default

With BlueTalon, row-level protection reduces the result to 249 records.

Without BlueTalon, the same query returns 2499 records.

With BlueTalon protection, the ssn field is masked with ‘XXXX’.
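For reference, the comparison above can be reproduced with simple queries against the tables used in the rule examples; this is a sketch, with table and column names taken from the rules described earlier:

beeline> select count(*) from people;
beeline> select ssn from accounts limit 5;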

Auditing access

All access through the BlueTalon enforcement points is authorized against the BlueTalon Policy Engine and audited. The audit can be visualized in the BlueTalon User Audit UI.

Try BlueTalon with data available from AWS

Generate sample data containing the table ‘books_1’ by following the instructions at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/query-impala-generate-data.html

Create a policy for a user ‘alice’ that allows reads on the ‘books_1’ table only for books with a ‘price’ below $30.00, masks the ‘publisher’ field, and denies the ‘id’ field completely.

Run the query directly and through BlueTalon to see the effect of the policy rules created in BlueTalon.
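A representative query for this comparison might look like the following sketch, using the ‘books_1’ fields mentioned above:

beeline> select id, publisher, price from books_1 limit 10;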

Data as stored in Hive:

Result with BlueTalon protection:

Conclusion

BlueTalon enables organizations to efficiently protect access to data in HDFS or Amazon S3, give users the data they need, and leverage the full potential of Hadoop in a secure manner.

If you have questions or suggestions, please leave a comment below.

——————————-

Related:

Using IPython Notebook to Analyze Data with EMR

Getting Started with Elasticsearch and Kibana on EMR

Strategies for Reducing your EMR Costs

—————————————————————-

Love to work on open source? Check out EMR’s careers page.

—————————————————————-

 

 

Integrating Amazon Kinesis, Amazon S3 and Amazon Redshift with Cascading on Amazon EMR

Post Syndicated from Concurrent original https://blogs.aws.amazon.com/bigdata/post/Tx3FWOWOHSITOFC/Integrating-Amazon-Kinesis-Amazon-S3-and-Amazon-Redshift-with-Cascading-on-Amazo

This is a guest post by Ryan Desmond, Solutions Architect at Concurrent. Concurrent is an AWS Advanced Technology Partner.

With Amazon Kinesis, developers can quickly store, collate, and access large, distributed data streams such as access logs, clickstreams, and IoT data in real time. The question then becomes, how can we access and leverage this data in data applications running on Amazon EMR? There are many ways to solve this problem, but the solution should be simple, fault tolerant, and highly scalable. In this post I demonstrate a micro-batch system that delivers this solution by processing a simulated real-time data stream using Amazon Kinesis with Cascading on Amazon EMR. We will also join our Amazon Kinesis stream with data residing in a file in Amazon S3 and write our results to Amazon Redshift using cascading-jdbc-redshift, which leverages Amazon Redshift’s COPY command.

Cascading is a proven, highly extensible application development framework for building massively parallelized data applications on EMR. A few key features of Cascading include the ability to implement comprehensive TDD practices, application portability across platforms such as MapReduce and Apache Tez, and the ability to integrate with a variety of external systems using out-of-the-box integration adapters.

The result of combining Amazon Kinesis, Amazon EMR, Amazon Redshift and Cascading is an architecture that enables end-to-end data processing of streaming data sources.

Thankfully, Amazon has made this a relatively simple and straightforward process for us. Here are the steps we will take:

Review a sample Cascading application that joins data from Amazon Kinesis and S3, processes the data with a few operations on EMR and writes the results to Amazon Redshift.

Create an Amazon Kinesis stream.

Configure an Amazon DynamoDB table to manage data range values, or “windowing.”

Download the Amazon Kinesis publisher sample application.

Use the publisher sample application to populate our Amazon Kinesis stream with sample Apache web log data.

Create an Amazon Redshift cluster.

Use the AWS CLI to create an EMR cluster and run our Cascading application by adding it as a step.

First, let’s look at the directed acyclic graph (DAG) for the Cascading application we will run. You can explore the full application with Driven. If you are prompted to log in, use the username “guest” and the password “welcome.”

As you can see, we source data from Amazon Kinesis as well as S3. We perform several operations on the incoming data before joining the two streams and writing the output to Amazon Redshift.

I review the key components of this application below. You can also view the full source. To run this application on your local machine, clone the Cascading Tutorials repository and move it into the cascading-aws/part4 directory.

$ git clone https://github.com/Cascading/tutorials.git
$ cd tutorials/cascading-aws/part4/

Review and compile Cascading application code

Before we can build and run this application, we must pull the latest EmrKinesisCascading connector down from an EMR instance. At the time of publishing, this connector is only available on the EMR instances themselves at “/usr/share/aws/emr/kinesis/lib/”. Because this is a build dependency for our Cascading application, we need to pull this library down and install it in a local Maven repository.

Step 1: SCP library from EMR to your local machine     

If you do not have access to an EMR instance, follow these instructions to set up a simple EMR cluster. Be sure to enable SSH access to this cluster using your private pem file.

On your local machine, cd to the directory of your choice.

$ scp -i ~/.ssh/<your-private>.pem [email protected]<your-emr-ip>.compute-1.amazonaws.com:/usr/share/aws/emr/kinesis/lib/EmrKinesisCascading*

Step 2:  Install this library into a local Maven repository

Remain in the same directory as EmrKinesisCascading connector.

$ mvn install:install-file -Dfile=<EmrKinesisCascading-<version>.jar> -DgroupId=aws.kinesis -DartifactId=cascading-connector -Dversion=<EmrKinesisCascadingVersion> -Dpackaging=jar

Now that we have installed the EmrKinesisCascading connector in our local Maven repository, we can review the sample application.

First, let’s look at how we instantiate our KinesisHadoopScheme and KinesisHadoopTap. With Cascading, a “Tap” is used wherever you read or write data. There are roughly 30 supported Taps available for integration with the most widely used data stores/sources. We will also instantiate several Schemes which are used with Taps to specify the format (and types where necessary) of the incoming/outgoing data.

// instantiate incoming fields, in this case "data" to be used in the KinesisHadoopScheme
Fields columns = new Fields("data");
// instantiate KinesisHadoopScheme to be used with KinesisHadoopTap
KinesisHadoopScheme scheme = new KinesisHadoopScheme(columns);
// set noDateTimeout to true
scheme.setNoDataTimeout(1);
// apply our AWS access and secret keys – please see Disclaimer below regarding
// the use of AccessKeys and SecretKeys in production systems
scheme.setCredentials([ACCESS_KEY],[SECRET_KEY]);
// instantiate Kinesis Tap to read “AccessLogStream”
Tap kinesisSourceTap = new KinesisHadoopTap("AccessLogStream", scheme);

Now we create our Tap to read a file from S3 that will be joined to the Amazon KinesisStream, and we create the RedshiftTap that we will use to write our final output. For the S3 Tap we will use an Hfs tap which is fully compatible with S3. All you need to do is provide an S3 path instead of an Hfs path.

// instantiate S3 source Tap using Hfs. This Tap will source a comma-delimited file of IP addresses
// found at the location of s3InStr
Tap s3SourceTap = new Hfs( new TextDelimited( new Fields("ip"), "," ), s3InStr );
// instantiate S3 sink tap – comma delimited with fields “ip”, “count”. Using Redshift’s COPY
// command we can load this data from S3 very quickly
Tap sinkTap = new Hfs( new TextDelimited( new Fields ("ip","count"), "," ), s3OutStr, SinkMode.REPLACE );
// instantiate S3 trap tap to catch any bad data – this data will be written to S3 and you will
// be able to see if, and how many tuples are being trapped using Driven
Tap trapTap = new Hfs(new TextDelimited(true, "\t"), s3TrapStr, SinkMode.REPLACE);

Now that we have our necessary Taps, we can process the data. The Cascading processing model is based on a metaphor of data flowing through pipes. Pipes control the flow of data, applying operations to each Tuple or group of Tuples. Within these pipes, we will perform several operations using the Each and Every pipes. The operations we will use include RegexParser, Retain, Rename, HashJoin, Discard, GroupBy and Count.

I highlight a few of these operations below. As mentioned earlier, you can view the full source code for this sample application. 

RegexParser parser = new RegexParser(apacheFields, apacheRegex, allGroups);
// apply regex parser to each tuple in the Kinesis stream
processPipe = new Each(processPipe, columns, parser);
// retain only the field "ip" from Kinesis stream
processPipe = new Retain(processPipe, new Fields("ip"));
// in anticipation of the upcoming join rename S3 file field to avoid naming collision
joinPipe = new Rename( joinPipe, new Fields( "ip" ), new Fields( "userip" ) );
// rightJoin processPipe and joinPipe (IPs in S3 file) on ip (and renamed "userip")
Pipe outputPipe = new HashJoin(processPipe, new Fields("ip"), joinPipe, new Fields("userip"), new RightJoin());
// discard unnecessary "userip"
outputPipe = new Discard(outputPipe, new Fields("userip"));
// group all by "ip"
outputPipe = new GroupBy(outputPipe, new Fields("ip"));
// calculate the count of each group of IPs
outputPipe = new Every(outputPipe, new Count(new Fields("count")));

Now all we have to do is compose the flow by connecting our Taps to Pipes. Then we will connect and complete the flow.

// define the flow definition
FlowDef flowDef = FlowDef.flowDef()
.addSource( processPipe, kinesisSourceTap ) // connect processPipe to KinesisTap
.addSource( joinPipe, s3SourceTap ) // connect joinPipe to s3Tap
.addTailSink( outputPipe, sinkTap ) // connect outputPipe to S3 sink Tap
.setName( "Cascading-Kinesis-Flow" ) // name the flow
.addTrap( processPipe, trapTap ); // add the trap to catch any bad data in processPipe

// instantiate HadoopFlowConnector – other flowConnectors include:
// — Hadoop2Mr1FlowConnector
// — LocalFlowConnector
// — Hadoop2TezFlowConnector
// — Spark and Flink FlowConnectors under development
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
// attach the flow definition to the flow connector
Flow kinesisFlow = flowConnector.connect(flowDef);
// run the flow
kinesisFlow.complete();

Setting up Amazon Kinesis, DynamoDB, Amazon Redshift and CLI

Before we run the Cascading application, let’s take a moment to ensure that the necessary infrastructure is in place and the Amazon Kinesis stream is populated. For simplicity, we use this setup script which handles the following tasks:

Create an Amazon Kinesis stream

Create two DynamoDB tables required by EMR to process Amazon Kinesis streams

Create an Amazon Redshift cluster

Download the kinesis-log4j-appender-1.0.0.jar to publish sample data to Amazon Kinesis

Download and unpack the sample data

Start the Amazon Kinesis Publisher for One-Time Publishing

As you can see, this script interacts with several AWS services via the command line. If you have not done so already, you must install the AWS Command Line Interface (CLI). The AWS CLI is a unified tool for managing your AWS services.
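For reference, the stream-creation step performed by the script is equivalent to an AWS CLI call along these lines; the stream name matches the AccessLogStream read by the application, while the shard count shown here is an assumption:

> aws kinesis create-stream --stream-name AccessLogStream --shard-count 2
> aws kinesis describe-stream --stream-name AccessLogStream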

Now that we’ve installed the AWS CLI, we’ll add our AWS credentials to src/main/resource/AwsCredentials.properties. This is required for kinesis-log4j-appender to publish to our Amazon Kinesis stream. Once that is complete, let’s call the script from the root directory of our sample project.

# add preferred Redshift credentials to setup-script.sh
$ vi //tutorials/cascading-aws/part4/src/main/scripts/setup-script.sh
# add AWS keys for Kinesis publisher - required for kinesis-log4j-appender-1.0.0.jar
$ vi //tutorials/cascading-aws/part4/AwsCredentials.properties
# cd to root of sample project
$ cd //tutorials/cascading-aws/part4/
# call setup script
$ src/main/scripts/setup-script.sh

Tying everything together

Now that we have all the right pieces in all the right places, we just need to update our configuration.properties file located in the Cascading sample application source code at “src/main/resources” with our respective values.

Disclaimer: In the interest of simplicity for this tutorial, we set our AccessKey and SecretKey values manually. However, it is highly recommended that all systems take advantage of the identity and access management afforded by AWS IAM. In doing so, your AccessKey and SecretKey will be available in your instance profile and will not be required anywhere in your code. The result is a more robust and secure application. This requires that your IAM users and roles are properly configured and have the appropriate permissions in place for all AWS services that you need to access. AWS documentation provides more information on creating an IAM policy and using IAM to control access to Amazon Kinesis resources.

USER_AWS_ACCESS_KEY=
USER_AWS_SECRET_KEY=
REDSHIFT_JDBC_URL=
REDSHIFT_USER_NAME=
REDSHIFT_PASSWORD=

NOTE: At the time of publication, cascading-jdbc-redshift uses the compatible PostgreSQL driver. With that in mind, please replace “jdbc:redshift://” with “jdbc:postgresql://” in the Amazon Redshift JDBC URL you add to configuration.properties.
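For example, if your cluster endpoint were mycluster.abc123.us-east-1.redshift.amazonaws.com (a hypothetical endpoint; substitute your own, with 5439 as the default Redshift port and “dev” standing in for your database name), the entry would look like this:

REDSHIFT_JDBC_URL=jdbc:postgresql://mycluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev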

Now we’re ready to run our application! In order to do so, let’s go back into the sample source code and take a look at the shell script that we will use to simplify the final execution steps. This script has two primary functions. The first is to compile the sample application and push it to your S3 bucket, along with a data file that we will join against the Amazon Kinesis stream.

# clean and compile the application
gradle clean fatjar
# create the bucket or delete the contents if it exists
aws s3 mb s3://$BUCKET || aws s3 rm s3://$BUCKET --recursive
# push built jar file to S3
aws s3 cp $BUILD/$NAME s3://$BUCKET/$NAME
# push data file to S3 – we will join this file against the Kinesis stream
aws s3 cp $DATA_DIR s3://$BUCKET/$DATA_DIR

The second function of the script launches an EMR cluster and submits our Cascading application (now located in your S3 bucket) as a step to be run on the cluster.

Disclaimer: For scheduled, operational, production systems it is recommended that the following action be wrapped in an AWS Data Pipeline definition. In this tutorial, for simplicity, we will be using the CLI to create a small cluster for demonstrative purposes. After the sample data has been processed, or if the Amazon Kinesis stream is empty, this cluster will terminate automatically.

aws emr create-cluster \
--ami-version 3.8.0 \
--instance-type m1.xlarge \
--instance-count 1 \
--name "cascading-kinesis-example" \
--visible-to-all-users \
--enable-debugging \
--auto-terminate \
--no-termination-protected \
--log-uri s3n://$BUCKET/logs/ \
--service-role EMR_DefaultRole --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole \
--steps Type=CUSTOM_JAR,Name=KinesisTest1,ActionOnFailure=TERMINATE_CLUSTER,Jar=s3n://$BUCKET/$NAME,Args=$BUCKET

To run the sample Cascading application simply call this script and provide the following arguments:

S3_BUCKET                          // S3 bucket to hold application and data

For the auto-compile to work, you will need to call this script from the directory “/[PATH_TO]/tutorials/cascading-aws/part4/”. For example:

$ cd /[PATH_TO]/tutorials/cascading-aws/part4/
$ src/main/scripts/cascading-kinesis.sh <S3_BUCKET>

After calling cascading-kinesis.sh, you will see the Cascading application compile, then the application jar and the data file that we will join against the Amazon Kinesis stream will be transferred to S3. After that, an EMR cluster is created and our Cascading application jar is added as a step. You can verify that this cluster is booting up by visiting the EMR console.  When the application has completed you will find the final output in an Amazon Redshift table named CascadingResults.
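To spot-check the results, you can query the table with any PostgreSQL-compatible client; for example, using psql with placeholder connection values:

> psql -h <your-cluster-endpoint> -p 5439 -d <your-database> -U <your-user> -c "select * from CascadingResults limit 10;"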

If you are already using Driven, you will also see the application appear on the landing page after logging in. With Driven, you can visualize the status, progress and behavior of your applications in real-time, as well as over time.

Driven lets developers build higher-quality data applications and gives operators the ability to efficiently optimize and monitor these applications. If you’d like to explore Driven, you can take a tour or visit Driven’s website.

There you have it. With Cascading we now have a micro-batch system that sources streaming, real-time data from Amazon Kinesis, joins it with data in S3, processes this joined stream on EMR, loads the results into Amazon Redshift, and monitors the entire application lifecycle with Driven.

If you have questions or suggestions, please leave a comment below.

—————————–

Related:

Indexing Common Crawl Metadata on Amazon EMR Using Cascading and Elasticsearch

 

Integrating AWS CodeCommit with Review Board

Post Syndicated from Wade Matveyenko original http://blogs.aws.amazon.com/application-management/post/Tx35O95VQF5I0AT/Integrating-AWS-CodeCommit-with-Review-Board

Today we have a guest post from Jeff Nunn, a Solutions Architect at AWS, specializing in DevOps and Big Data solutions.

By now you’ve probably heard of AWS CodeCommit, a secure, highly scalable, managed source control service that hosts private Git repositories. AWS CodeCommit supports the standard functionality of Git, allowing it to work seamlessly with your existing Git-based tools. In addition, CodeCommit works with Git-based code review tools, allowing you and your team to better collaborate on projects. By the end of this post, you will have launched an EC2 instance, configured the AWS CLI, and integrated CodeCommit with a code review tool.

What is Code Review?

Code review (sometimes called "peer review") is the process of making source code available for other collaborators to review. The intention of code review is to catch bugs and errors and improve the quality of code before it becomes part of the product. Most code review systems provide contributors the ability to capture notes and comments about the changes to enable discussion of the change, which is useful when working with distributed teams.

We’ll show you how to integrate AWS CodeCommit into your development workflow using the Review Board code review system.

Getting Started

If you’re reading this, you most likely are familiar with Git and have it installed. To work with files or code in AWS CodeCommit repositories, you must install Git on your local machine, if you haven’t installed it already. AWS CodeCommit supports Git versions 1.7 and later.

In addition, you’ll need to have an AWS Identity and Access Management (IAM) user with an appropriate AWS CodeCommit permissions policy attached. Follow the instructions at Set Up Your IAM User Credentials for Git and AWS CodeCommit to give your user(s) access.

While this post covers integration with Review Board, you can take what you learn here and integrate with your favorite code review tools. We’ll soon publish integration methods for other tools, like Gerrit, Phabricator, and Crucible. When you have completed the above prerequisites, you are ready to continue.

Review Board

Review Board is a web-based collaborative code review tool for reviewing source code, documentation, and other text-based files. Let’s integrate Review Board with a CodeCommit repo. You can integrate CodeCommit with an existing Review Board server, or set up a new one. If you already have Review Board set up, you can skip down to Step 2: Setting Up the Review Board Server.

Step 1: Creating a Review Board Server

To set up a Review Board server, we turn to the AWS Marketplace. The AWS Marketplace has a rich ecosystem of Independent Software Vendors (ISVs) and partners that AWS works with, and there you will find many pre-built Amazon Machine Images (AMIs) to help save you time and effort when setting up software or application stacks.

We launch an EC2 instance based off a public Review Board AMI from Bitnami. From the EC2 console, click the Launch Instance button. From Choose an Amazon Machine Image (AMI), click the "AWS Marketplace" link, and then search for "review board hvm".

In the search results returned, select "Review Board powered by Bitnami (HVM)". While some products in the AWS Marketplace do have an additional cost to use them, you’ll notice that there is no additional cost to run Review Board from Bitnami. Click the Select button to choose this image, and you are taken to the "Choose Instance Type" step. By default, the Review Board AMI selects an m3.medium instance to launch into, but you can choose any instance type that fits your needs. Click the Review and Launch button to review the settings for your instance. Scroll to the bottom of the screen, click the Edit Tags link, and create a "Name" tag with a descriptive value:

Click the Review and Launch button again, and then click the Launch button. Verify that you have a key pair that will connect to your instance, and then click the Launch Instance button.

After a short time, your instance should successfully launch, and be in a "running" state:

Because we used Bitnami’s prebuilt AMI to do our install, the majority of the configuration is done for us, including the creation of an administrative user and password. To retrieve the password, select the instance, click the Actions button, and then click "Get System Log." You can find more information on this process at Bitnami’s website for retrieving AWS Marketplace credentials.

Scroll until you’re near the bottom of the log, and find the "Bitnami application password." You’ll need this to login to your Review Board server in Step 2.

Step 2: Setting Up The Review Board Server

SSH into your EC2 instance. If you’ve installed Review Board with the Bitnami AMI, you’ll need to log in as the "bitnami" user rather than the "ubuntu" user. Download and install the AWS CLI, if you haven’t done so already. This is a prerequisite for interacting with AWS CodeCommit from the command line. For more information, see Getting Set Up with the AWS Command Line Interface.

Note: Although the Review Board AMI comes with Python and pip ("pip" is a package manager for Python), you’ll need to re-install pip before installing the AWS CLI. From the command line, type:

> curl -O https://bootstrap.pypa.io/get-pip.py

and then:

> sudo python get-pip.py

Follow the instructions from the "Install the AWS CLI using pip" section, and then configure the command line to work with your AWS account. Be sure to specify "us-east-1" as your default region, as CodeCommit currently only works from this region.
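The configuration step is interactive; the prompts look like the following (the values shown are placeholders apart from the region):

> aws configure
AWS Access Key ID [None]: <your-access-key-id>
AWS Secret Access Key [None]: <your-secret-access-key>
Default region name [None]: us-east-1
Default output format [None]: json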

Configure the AWS CodeCommit Credential Helper

The approach that you take to set up your IAM user credentials for Git and AWS CodeCommit on your local machine depends on the connection protocol (HTTPS) and operating system (Windows, Linux, OS X, or Unix) that you intend to use.

For HTTPS, you allow Git to use a cryptographically-signed version of your IAM user credentials whenever Git needs to authenticate with AWS in order to interact with repositories in AWS CodeCommit. To do this, you install and configure on your local machine what we call a credential helper for Git. (Without this credential helper, you would need to manually sign and resubmit a cryptographic version of your IAM user credentials frequently whenever you need Git to authenticate with AWS. The credential helper automatically manages this process for you.)

Follow the steps in Set up the AWS CodeCommit credential helper for Git depending on your desired connection protocol and operating system.
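On Linux, OS X, or Unix with HTTPS, the credential helper setup typically comes down to two git config commands like these (shown here as a convenience; follow the linked steps for Windows or other variations):

> git config --global credential.helper '!aws codecommit credential-helper $@'
> git config --global credential.UseHttpPath true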

Create or Clone a CodeCommit Repository

Now that you have your Review Board server setup, you’ll need to add an AWS CodeCommit repository to connect to. If you have not yet created an AWS CodeCommit repository, follow the instructions here to create a new repository, and note the new AWS CodeCommit repository’s name.

If you have an existing AWS CodeCommit repository but you do not know its name, you get the name by following the instructions in View Repository Details.

Once you have your AWS CodeCommit repository name, you will create a local repo on the Review Board server. Change to a directory in which you will store the repository, and clone the repository. Cloning to your home directory is shown in the following example:

> cd ~
> git clone https://git-codecommit.us-east-1.amazonaws.com/v1/repos/MyDemoRepo my-demo-repo

Setting up a Repository in Review Board

Now that you have cloned your repo to your Review Board server, we need to configure Review Board to watch your repo. Visit the public DNS address of your EC2 instance and log in to the Review Board administration panel, which will look like http://ec2-dns-address-of-your-instance.amazonaws.com/admin/. The username is “user,” and the password is the password you saved from the log file in Step 1.

After logging in, you are taken to the admin dashboard. Create a new repository from the Manage section of the admin menu. Fill in the name, keep the “Hosting service” at “None”, and select “Git” as the “Repository type.” For the path, enter the path to your cloned repository, including the “.git” hidden folder, as seen in the example below:

 

Finally, click the Save button. Back in the Manage section of the admin menu, click the Repositories link to be taken to a dashboard of the repositories on your Review Board system. Next to your repository, click the RBTools Setup link. Follow the instructions to create a .reviewboardrc file in your repository:

Commit the file to your repository and push it to AWS CodeCommit with a "git push" command. RBTools will then be able to find the correct server and repository when any developer posts changes for review.
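That sequence might look like the following; the commit message is only illustrative:

> git add .reviewboardrc
> git commit -m "Add Review Board configuration"
> git push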

Step 3: Setting up Your Client(s)

Now that you have a Review Board server set up, you’ll need to configure AWS CodeCommit access and install RBTools on your client machines. For each client machine, configure the AWS CLI and AWS CodeCommit the same way you did on the Review Board server. Then install RBTools, which allows you to post your changes to the Review Board server and lets other collaborators comment on those changes. You may also want to create user accounts in the Review Board dashboard for each developer who will submit reviews. For the purposes of this demo, create at least one additional user to serve as a reviewer for your code review requests.
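On each client, the RBTools installation is a standard Python package install; for example:

> sudo pip install -U RBTools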

Step 4: Using Review Board in your AWS CodeCommit Workflow

You now have a Review Board server integrated with AWS CodeCommit, a client from which to send review requests, and an additional Review Board user to assign review requests to. Let’s take a look at a typical Git workflow with Review Board.

In many projects, feature work or bug fixes are first done in a Git branch, then moved into the main branch (sometimes called master or mainline) after a testing or review process.

Let’s go through a simple example to demonstrate branching, merging, and reviewing in an AWS CodeCommit and Review Board workflow. We’ll take a fictitious cookbook project, add recipes to it, and have a reviewer accept the changes before they are merged into the AWS CodeCommit project’s master branch.

Creating a Review Request

Create a branch in your project from which to add a new file:

> git checkout -b pies
Switched to a new branch pies

Now, add a new file to this branch (you could also modify an existing file, but for the sake of this demo, we will create a new one).

> echo "6 cups of sliced, peeled apples." > applepie.txt

You’ve now added the beginning of a new pie recipe to your cookbook. Ideally, you would now run unit tests to verify the validity of your work, or similarly validate that your code was functional and did not break other parts of your code base.

Add this recipe to your repo, and give it a meaningful commit message:

> git add .
> git commit -m "beginning apple pie recipe"
[pies 5d2a678] beginning apple pie recipe
1 file changed, 1 insertion(+)
create mode 100644 applepie.txt

You have added a new file to a branch in your project; let’s share it with your reviewers. We use rbt post, along with our branch name, to post it to the Review Board server for review. On your first post to Review Board, you will be asked for a username and password, and upon a successful post to the Review Board server, you are given a review request URL.

> rbt post pies
Review request #1 posted.

http://ec2-dns-address-of-your-instance.amazonaws.com:80/r/1/
http://ec2-dns-address-of-your-instance.amazonaws.com:80/r/1/diff/

We specified our branch name, "pies", in the "rbt post" command, which automatically chooses the latest commit in our commit history to send to the Review Board server. However, you can post any commit in your history by specifying the commit ID, which you can retrieve by issuing a "git log" command. For example, if you add additional pie recipes across several commits, you could choose a specific commit to send to the Review Board server.

> git log

commit 1d1bfc579bac494ae656eae9ce6ee23cae3f146b
Author: username <[email protected]>
Date: Mon May 11 10:37:12 2015 -0500

Blueberry pie

commit 468f20fc4272691a409ef21dc0d6eaab27c1ab35
Author: username <[email protected]>
Date: Mon May 11 10:35:22 2015 -0500

Cherry and chocolate pie recipes

> rbt post 468f20
Review request #2 posted.

http://ec2-dns-address-of-your-instance.amazonaws.com:80/r/2/
http://ec2-dns-address-of-your-instance.amazonaws.com:80/r/2/diff/

Now that we have sent our apple pie recipe (and any additional recipes you may have created) to the Review Board server for review, let’s log in to walk through the process of assigning them to be reviewed.

Log in to your Review Board account and visit your dashboard. Under the Outgoing menu on the left-hand side, you’ll see a count of "All" and "Open" requests. Click "All," and then click the "[Draft] beginning apple pie recipe" request.

Edit your description or add comments to any testing you have done, and then assign a reviewer by clicking the pencil icon next to "People" under the Reviewers section:

Finally, click the Publish button to publish your review request and assign it to one or more reviewers.

Reviewing a Change Request

Now that we have at least one review request assigned to a user, log in as that user on the Review Board server. On your dashboard under the Incoming section, click the "To Me" link. Find the "beginning apple pie recipe" request and click it to be taken to the review request details page. Click the View Diff button in the summary toolbar to view the changes made to this file. Since this was a new file, you will only see one change. Click the Review button in the summary toolbar to add your review comments to this request. When you are finished with your comments, click the Publish Review button.

As a reviewer, we are satisfied with the modifications to the file. We could check the "Ship It" box, or click the Ship It button in the summary toolbar after we publish the review. We have now indicated that the code is ready to be merged into the master branch.

Log in again as the user who submitted the request, and notice two new icons next to your request. The speech bubble indicates you have comments available to view, and the green check oval indicates that your code is ready to be merged into your master AWS CodeCommit branch.

View the comments from your reviewer, and notice that your code is ready to be shipped.

Merging Your Commit

There are several viable ways to merge a branch into a master branch, such as cherry-picking a single commit or bringing in all commits. We keep things simple here and merge the commit(s) from the pies branch into master. Then we push the updated code to the AWS CodeCommit Git repo.

> git checkout master
Switched to branch ‘master’
Your branch is up-to-date with ‘origin/master’.

> git merge pies
Updating 304d704..9ab13cf
Fast-forward
applepie.txt | 1 +
1 file changed, 1 insertion(+)
create mode 100644 applepie.txt

> git push

Counting objects: 2, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (2/2), 226 bytes | 0 bytes/s, done.
Total 2 (delta 1), reused 0 (delta 0)

remote:
To https://git-codecommit.us-east-1.amazonaws.com/v1/repos/MyDemoRepo

9ab13cf..a0e3119 master -> master

Conclusion

Congratulations! You integrated Review Board with AWS CodeCommit: you created a feature development branch, wrote code, submitted it for review, and merged it into your master branch after acceptance. You seamlessly combined a secure, highly scalable, managed service for your Git repositories with a code review tool, and now you can ship your code faster, cleaner, and with more confidence. In future posts, we’ll show you how to integrate AWS CodeCommit with other common Git tools.

Integrating AWS CodeCommit with Review Board

Post Syndicated from Wade Matveyenko original http://blogs.aws.amazon.com/application-management/post/Tx35O95VQF5I0AT/Integrating-AWS-CodeCommit-with-Review-Board

Today we have a guest post from Jeff Nunn, a Solutions Architect at AWS, specializing in DevOps and Big Data solutions.

By now you’ve probably heard of AWS CodeCommit–a secure, highly scalable, managed source control service that hosts private Git repositories. AWS CodeCommit supports the standard functionality of Git, allowing it to work seamlessly with your existing Git-based tools. In addition, CodeCommit works with Git-based code review tools, allowing you and your team to better collaborate on projects. By the end of this post, you will have launched an EC2 instance, configured the AWS CLI, and integrated CodeCommit with a code review tool.

What is Code Review?

Code review (sometimes called "peer review") is the process of making source code available for other collaborators to review. The intention of code review is to catch bugs and errors and improve the quality of code before it becomes part of the product. Most code review systems provide contributors the ability to capture notes and comments about the changes to enable discussion of the change, which is useful when working with distributed teams.

We’ll show you how to integrate AWS CodeCommit into development workflow using the Review Board code review system.

Getting Started

If you’re reading this, you most likely are familiar with Git and have it installed. To work with files or code in AWS CodeCommit repositories, you must install Git on your local machine, if you haven’t installed it already. AWS CodeCommit supports Git versions 1.7 and later.

In addition, you’ll need to have an AWS Identity and Access Management (IAM) user with an appropriate AWS CodeCommit permissions policy attached. Follow the instructions at Set Up Your IAM User Credentials for Git and AWS CodeCommit to give your user(s) access.

While this post covers integration with Review Board, you can take what you learn here and integrate with your favorite code review tools. We’ll soon publish integration methods for other tools, like Gerrit, Phabricator, and Crucible. When you have completed the above prerequisites, you are ready to continue.

Review Board

Review Board is a web-based collaborative code review tool for reviewing source code, documentation, and other text-based files. Let’s integrate Review Board with a CodeCommit repo. You can integrate CodeCommit with an existing Review Board server, or setup a new one. If you already have Review Board setup, you can skip down to Step 2: Setting up the Review Board Server.

Step 1: Creating a Review Board Server

To setup a Review Board server, we turn to the AWS Marketplace. The AWS Marketplace has a rich ecosystem of Independent Software Vendors (ISVs) and partners that AWS works with, and there you will find many pre-built Amazon Machine Images (AMIs) to help save you time and effort when setting up software or application stacks.

We launch an EC2 instance based off a public Review Board AMI from Bitnami. From the EC2 console, click the Launch Instance button. From Choose an Amazon Machine Image (AMI), click the "AWS Marketplace" link, and then search for "review board hvm".

In the search results returned, select "Review Board powered by Bitnami (HVM)". While some products in the AWS Marketplace do have an additional cost to use them, you’ll notice that there is no additional cost to run Review Board from Bitnami. Click the Select button to choose this image, and you are taken to the "Choose Instance Type" step. By default, the Review Board AMI selects an m3.medium instance to launch into, but you can choose any instance type that fits your needs. Click the Review and Launch button to review the settings for your instance. Scroll to the bottom of the screen, click the Edit Tags link, and create a "Name" tag with a descriptive value:

Click the Review and Launch button again, and then click the Launch button. Verify that you have a key pair that will connect to your instance, and then click the Launch Instance button.

After a short time, your instance should successfully launch, and be in a "running" state:

Because we used Bitnami’s prebuilt AMI to do our install, the majority of the configuration is done for us, including the creation of an administrative user and password. To retrieve the password, select the instance, click the Actions button, and then click "Get System Log." You can find more information on this process at Bitnami’s website for retrieving AWS Marketplace credentials.

Scroll until you’re near the bottom of the log, and find the "Bitnami application password." You’ll need this to login to your Review Board server in Step 2.
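
If you already have the AWS CLI configured on your local machine, you can retrieve the same console output from the command line instead. This is just an optional shortcut; the instance ID below is a placeholder, so substitute the ID of the instance you launched:

> aws ec2 get-console-output --instance-id i-1234567890abcdef0 --output text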

Step 2: Setting Up The Review Board Server

SSH into your EC2 instance. If you’ve installed Review Board with the Bitnami AMI, you’ll need to log in as the "bitnami" user rather than the "ubuntu" user. Download and install the AWS CLI, if you haven’t done so already. This is a prerequisite to enabling you to interact with AWS CodeCommit from the command line. For more information, see Getting Set Up with the AWS Command Line Interface.

Note: Although the Review Board AMI comes with Python and pip ("pip" is a package manager for Python), you’ll need to re-install pip before installing the AWS CLI. From the command line, type:

> curl -O https://bootstrap.pypa.io/get-pip.py

and then:

> sudo python get-pip.py

Follow the instructions from the "Install the AWS CLI using pip" section, and then configure the command line to work with your AWS account. Be sure to specify "us-east-1" as your default region, as CodeCommit currently only works from this region.
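
On most systems, those instructions boil down to the two commands below (a sketch; see the linked guide if your environment differs). When prompted by "aws configure", enter the access keys for your IAM user, "us-east-1" as the default region, and "json" as the output format:

> sudo pip install awscli
> aws configure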

Configure the AWS CodeCommit Credential Helper

The approach that you take to set up your IAM user credentials for Git and AWS CodeCommit on your local machine depends on the connection protocol (HTTPS) and operating system (Windows, Linux, OS X, or Unix) that you intend to use.

For HTTPS, you allow Git to use a cryptographically-signed version of your IAM user credentials whenever Git needs to authenticate with AWS in order to interact with repositories in AWS CodeCommit. To do this, you install and configure on your local machine what we call a credential helper for Git. (Without this credential helper, you would need to manually sign and resubmit a cryptographic version of your IAM user credentials frequently whenever you need Git to authenticate with AWS. The credential helper automatically manages this process for you.)

Follow the steps in Set up the AWS CodeCommit credential helper for Git depending on your desired connection protocol and operating system.
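
For reference, on Linux, OS X, or Unix the credential helper configuration typically comes down to the two commands below. This is only a sketch; the linked guide remains the authoritative source for your platform:

> git config --global credential.helper '!aws codecommit credential-helper $@'
> git config --global credential.useHttpPath true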

Create or Clone a CodeCommit Repository

Now that you have your Review Board server setup, you’ll need to add an AWS CodeCommit repository to connect to. If you have not yet created an AWS CodeCommit repository, follow the instructions here to create a new repository, and note the new AWS CodeCommit repository’s name.

If you have an existing AWS CodeCommit repository but you do not know its name, you get the name by following the instructions in View Repository Details.
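
You can also do both from the AWS CLI. The sketch below creates a repository named MyDemoRepo (the name used in the clone command later in this post) and then displays its details, including the clone URL:

> aws codecommit create-repository --repository-name MyDemoRepo --repository-description "Demo repository for Review Board"
> aws codecommit get-repository --repository-name MyDemoRepo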

Once you have your AWS CodeCommit repository name, you will create a local repo on the Review Board server. Change to a directory in which you will store the repository, and clone the repository. Cloning to your home directory is shown in the following example:

> cd ~
> git clone https://git-codecommit.us-east-1.amazonaws.com/v1/repos/MyDemoRepo my-demo-repo

Setting up a Repository in Review Board

Now that you have cloned your repo to your Review Board server, we need to configure Review Board to watch your repo. Visit the Public DNS address of your EC2 instance and log in to the Review Board administration panel, which will be at a URL like http://ec2-dns-address-of-your-instance.amazonaws.com/admin/. The username is “user,” and the password is the password you saved from the log file in Step 1.

After logging in, you are taken to the admin dashboard. Create a new repository from the Manage section of the admin menu. Fill in the name, keep the “Hosting service” at “None”, and select “Git” as the “Repository type.” For the path, enter the path to your cloned repository, including the “.git” hidden folder, as seen in the example below:

 

Finally, click the Save button. Back on the Manage section of the admin menu click the Repositories link to be taken to a dashboard of the repositories on your Review Board system. Next to your repository, click the RBTools Setup link. Follow the instructions to create a .reviewboardrc file in your repository:

Commit the file to your repository and push it to AWS CodeCommit with a "git push" command. RBTools will then be able to find the correct server and repository when any developer posts changes for review.
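
For reference, the generated .reviewboardrc is a small Python-syntax file whose values depend on your server address and the repository name you entered; it will look something like this sketch:

REVIEWBOARD_URL = "http://ec2-dns-address-of-your-instance.amazonaws.com"
REPOSITORY = "my-demo-repo"

Committing and pushing it then looks like:

> git add .reviewboardrc
> git commit -m "Add Review Board configuration"
> git push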

Step 3: Setting up Your Client(s)

Now that you have a Review Board server set up, you’ll need to configure AWS CodeCommit access and install RBTools on each client machine. Configure the AWS CLI and the AWS CodeCommit credential helper the same way you did on the Review Board server. Then install RBTools, which will allow you to post your changes to the Review Board server and let other collaborators comment on those changes. You may also want to create user accounts in the Review Board dashboard for each developer who will submit reviews. For the purposes of this demo, create at least one additional user to serve as a reviewer for your code review requests.
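
RBTools is distributed as a Python package, so on most client machines the installation is a single pip command (a sketch; check the RBTools documentation if your platform differs). The second command simply confirms the install:

> sudo pip install RBTools
> rbt --version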

Step 4: Using Review Board in your AWS CodeCommit Workflow

You now have a Review Board server integrated with AWS CodeCommit, a client from which to send review requests, and an additional Review Board user to assign review requests to. Let’s take a look at a typical Git workflow with Review Board.

In many projects, feature work or bug fixes are first done in a Git branch, then moved into the main branch (sometimes called master or mainline) after a testing or review process.

Let’s go through a simple example to demonstrate branching, merging, and reviewing in an AWS CodeCommit and Review Board workflow. We’ll take a fictitious cookbook project and add recipes to it, and have a reviewer accept your changes before you merge them into your AWS CodeCommit project’s master branch.

Creating a Review Request

Create a branch in your project from which to add a new file:

> git checkout -b pies
Switched to a new branch 'pies'

Now, add a new file to this branch (you could also modify an existing file, but for the sake of this demo, we will create a new one).

> echo "6 cups of sliced, peeled apples." > applepie.txt

You’ve now added the beginning of a new pie recipe to your cookbook. Ideally, you would now run unit tests to verify the validity of your work, or similarly validate that your code was functional and did not break other parts of your code base.

Add this recipe to your repo, and give it a meaningful commit message:

> git add .
> git commit -m "beginning apple pie recipe"
[pies 5d2a678] beginning apple pie recipe
1 file changed, 1 insertion(+)
create mode 100644 applepie.txt

You have added a new file to a branch in your project. Let’s share it with your reviewers. We use rbt post, along with our branch name, to post it to the Review Board server for review. On your first post to Review Board, you will be asked for a username and password, and upon a successful post to the Review Board server, you are given a review request URL.

> rbt post pies
Review request #1 posted.

http://ec2-dns-address-of-your-instance.amazonaws.com:80/r/1/
http://ec2-dns-address-of-your-instance.amazonaws.com:80/r/1/diff/

We specified our branch name, "pies", in the "rbt post" command, which automatically chooses the latest commit in our commit history to send to the Review Board server. However, you can post any commit in your history by specifying its commit ID, which you can retrieve by issuing a "git log" command. For example, if you add additional pie recipes across several commits, you could choose a specific commit to send to the Review Board server.

> git log

commit 1d1bfc579bac494ae656eae9ce6ee23cae3f146b
Author: username <[email protected]>
Date: Mon May 11 10:37:12 2015 -0500

Blueberry pie

commit 468f20fc4272691a409ef21dc0d6eaab27c1ab35
Author: username <[email protected]>
Date: Mon May 11 10:35:22 2015 -0500

Cherry and chocolate pie recipes

> rbt post 468f20
Review request #2 posted.

http://ec2-dns-address-of-your-instance.amazonaws.com:80/r/2/
http://ec2-dns-address-of-your-instance.amazonaws.com:80/r/2/diff/

Now that we have sent our apple pie recipe (and any additional recipes you may have created) to the Review Board server for review, let’s log in to walk through the process of assigning them to be reviewed.

Log in to your Review Board account and visit your dashboard. Under the Outgoing menu on the left-hand side, you’ll see a count of "All" and "Open" requests. Click "All," and then click the "[Draft] beginning apple pie recipe" request.

Edit your description or add comments to any testing you have done, and then assign a reviewer by clicking the pencil icon next to "People" under the Reviewers section:

Finally, click the Publish button to publish your review request and assign it to one or more reviewers.

Reviewing a Change Request

Now that we have at least one review request assigned to a user, log in as that user on the Review Board server. On your dashboard under the Incoming section, click the "To Me" link. Find the "beginning apple pie recipe" request and click it to be taken to the review request details page. Click the View Diff button in the summary toolbar to view the changes made to this file. Since this was a new file, you will only see one change. Click the Review button in the summary toolbar to add your review comments to this request. When you are finished with your comments, click the Publish Review button.

As a reviewer, we are satisfied with the modifications to the file. We could check the "Ship It" box, or click the Ship It button in the summary toolbar after we publish the review. We have now indicated that the code is ready to be merged into the master branch.

Log in again as the user who submitted the request, and notice two new icons next to your request. The speech bubble indicates you have comments available to view, and the green check oval indicates that your code is ready to be merged into your master AWS CodeCommit branch.

View the comments from your reviewer, and notice that your code is ready to be shipped.

Merging Your Commit

There are several viable ways to merge a branch into a master branch, such as cherry-picking a single commit from a branch or bringing in all commits. We keep things simple here and merge the commit(s) from the pies branch into master. Now, push the updated code up to the AWS CodeCommit Git repo.

> git checkout master
Switched to branch 'master'
Your branch is up-to-date with 'origin/master'.

> git merge pies
Updating 304d704..9ab13cf
Fast-forward
applepie.txt | 1 +
1 file changed, 1 insertion(+)
create mode 100644 applepie.txt

> git push

Counting objects: 2, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (2/2), 226 bytes | 0 bytes/s, done.
Total 2 (delta 1), reused 0 (delta 0)

remote:
To https://git-codecommit.us-east-1.amazonaws.com/v1/repos/MyDemoRepo

9ab13cf..a0e3119 master -> master

Conclusion

Congratulations! You integrated Review Board with AWS CodeCommit, created a feature development branch, wrote code, submitted it for review, and merged it into your master branch after acceptance. You combined a secure, highly scalable, managed service for your Git repositories with a code review tool, and now you can ship your code faster, cleaner, and with more confidence. In future posts, we’ll show you how to integrate AWS CodeCommit with other common Git tools.

Integrating AWS CodeCommit with Jenkins

Post Syndicated from Rob Brigham original http://blogs.aws.amazon.com/application-management/post/Tx1C8B98XN0AF2E/Integrating-AWS-CodeCommit-with-Jenkins

Today we have a guest post written by Emeka Igbokwe, a Solutions Architect at AWS.

This post walks you through the steps to set up Jenkins and AWS CodeCommit to support two simple continuous integration (CI) scenarios.

In the first scenario, you will make a change in your local Git repository, push the change to your AWS CodeCommit hosted repository, and have the change trigger a build in Jenkins.

In the second scenario, you will make a change on a development branch in your local Git repository and push the change to your AWS CodeCommit hosted repository. The push will trigger a merge from the development branch to the master branch and a build of the merged master branch; on a successful build, Jenkins will push the merged master branch back to the AWS CodeCommit hosted repository.

For the walkthrough, we will run the Jenkins server on an Amazon Linux Instance and configure your workstation to access the Git repository hosted by AWS CodeCommit.

Set Up IAM Permissions

AWS CodeCommit uses IAM permissions to control access to the Git repositories. 

For this walkthrough, you will create an IAM user, an IAM role, and a managed policy. You will attach the managed policy to the IAM user and the IAM role, granting both the user and role the permissions to push and pull changes to and from the Git repository hosted by AWS CodeCommit.  

You will associate the IAM role with the Amazon EC2 instance you launch to run Jenkins. (Jenkins uses the permissions granted by the IAM role to access the Git repositories.)  

Create an IAM user. Save the access key ID and the secret access key for the new user. 

Attach the managed policy named AWSCodeCommitPowerUser to the IAM user you created.

Create an Amazon EC2 service role named CodeCommitRole and attach the managed policy (AWSCodeCommitPowerUser) to it.
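
If you prefer to script these steps, the following AWS CLI sketch creates the same user and role. The user name and the trust policy file name are our own choices here, and note that the CLI, unlike the console, requires you to create and populate the instance profile yourself:

aws iam create-user --user-name JenkinsDemoUser
aws iam create-access-key --user-name JenkinsDemoUser
aws iam attach-user-policy --user-name JenkinsDemoUser --policy-arn arn:aws:iam::aws:policy/AWSCodeCommitPowerUser
aws iam create-role --role-name CodeCommitRole --assume-role-policy-document file://ec2-trust.json
aws iam attach-role-policy --role-name CodeCommitRole --policy-arn arn:aws:iam::aws:policy/AWSCodeCommitPowerUser
aws iam create-instance-profile --instance-profile-name CodeCommitRole
aws iam add-role-to-instance-profile --instance-profile-name CodeCommitRole --role-name CodeCommitRole

Here ec2-trust.json contains the standard EC2 trust policy:

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": { "Service": "ec2.amazonaws.com" },
"Action": "sts:AssumeRole"
}
]
}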

Set Up Your Development Environment

Install Git and the AWS CLI on your workstation.

Windows:

Install Git on Windows

Install the AWS CLI using the MSI Installer.

Linux or Mac:

Install Git on Linux or Mac.

Install the AWS CLI using the Bundled Installer.

After you install the AWS CLI, you must configure it using your IAM user credentials.

aws configure

Enter the AWS access key and AWS secret access key for the IAM user you created; enter us-east-1 for the region name; and enter json for the output format. 

AWS Access Key ID [None]: Type your target AWS access key ID here, and then press Enter
AWS Secret Access Key [None]: Type your target AWS secret access key here, and then press Enter
Default region name [None]: Type us-east-1 here, and then press Enter
Default output format [None]: Type json here, and then press Enter

Configure Git to use your IAM credentials and an HTTP path to access the repositories hosted by AWS CodeCommit.

git config --global credential.helper '!aws codecommit credential-helper $@'
git config --global credential.useHttpPath true

Create your central Git repository in AWS CodeCommit. 

aws codecommit create-repository --repository-name DemoRepo --repository-description "demonstration repository"

Set your user name and email address.

git config --global user.name "Your Name"
git config --global user.email "Your Email Address"

Create a local copy of the repository.

git clone https://git-codecommit.us-east-1.amazonaws.com/v1/repos/DemoRepo

Change directory to the local repository.

cd DemoRepo

In the editor of your choice, copy and paste the following into a file and save it as HelloWorld.java.

class HelloWorld {
public static void main(String[] args) {
System.out.println("Hello World!");
}
}

In the same directory where you created HelloWorld.java, run the following git commands to commit and push your change.

git add HelloWorld.java
git commit -m "Added HelloWord.java"
git push origin

Set Up the Jenkins Server

Create an instance using the Amazon Linux AMI. Make sure you associate the instance with the CodeCommitRole role and configure the security group associated with the instance to allow incoming traffic on ports 22 (SSH) and 8080 (Jenkins). You may further secure your server by restricting access to only the IP addresses of the developer machines connecting to Jenkins.
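
For example, restricting SSH and Jenkins access to a single workstation IP could be done from the AWS CLI with commands like the following sketch; the security group ID and IP address are placeholders, so substitute your own values:

aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 22 --cidr 203.0.113.10/32
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 8080 --cidr 203.0.113.10/32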

Use SSH to connect to the instance. Update the AWS CLI and install Jenkins, Git, and the Java JDK. 

sudo yum install -y git java-1.8.0-openjdk-devel
sudo yum update -y aws-cli

Add the Jenkins repository and install Jenkins.

sudo wget -O /etc/yum.repos.d/jenkins.repo http://pkg.jenkins-ci.org/redhat/jenkins.repo
sudo rpm --import https://jenkins-ci.org/redhat/jenkins-ci.org.key
sudo yum install -y jenkins

Configure the AWS CLI.

cd ~jenkins
sudo -u jenkins aws configure

Accept the defaults for the AWS access key and AWS secret access key; enter us-east-1 for the region name; and enter json for the output format. 

AWS Access Key ID [None]: Press Enter
AWS Secret Access Key [None]: Press Enter
Default region name [None]: Type us-east-1 here, and then press Enter
Default output format [None]: Type json here, and then press Enter

Configure Git to use IAM credentials and an HTTP path to access the repositories hosted by AWS CodeCommit.

sudo -u jenkins git config --global credential.helper '!aws codecommit credential-helper $@'
sudo -u jenkins git config --global credential.useHttpPath true
sudo -u jenkins git config --global user.email "[email protected]"
sudo -u jenkins git config --global user.name "MyJenkinsServer"

Start Jenkins. 

sudo service jenkins start
sudo chkconfig jenkins on

Configure global security.

Open the Jenkins home page (http://<public DNS name of EC2 instance>:8080) in your browser.

Select Manage Jenkins and Configure Global Security. 

Select the Enable Security check box.

Under Security Realm, select the Jenkins’ own user database radio button.

Clear the Allow users to sign up check box.

Under Authorization, select the Logged-in users can do anything radio button.

Configure the Git plugin.

Select Manage Jenkins and Manage Plugins. 

On the Available tab, use the Filter box to find Git Plugin.  

Select the Install check box next to Git Plugin.

Choose Download now and install after restart.

After Jenkins has restarted, add a project that will execute a build each time a change is pushed to the AWS CodeCommit hosted repository. 

Scenario 1: Set Up Project 

From the Jenkins home page, select New Item. 

Select Build a free-style software project.   

For the project name, enter "Demo".

For Source Code Management, choose Git.

For the repository URL, enter "https://git-codecommit.us-east-1.amazonaws.com/v1/repos/DemoRepo". 

For the Build Trigger, select Poll SCM with a schedule of */05 * * * *.

For the Build section, under Add Build Step, select Execute Shell, and in the Command text box, type javac HelloWorld.java.

Click Save.

Scenario 1: Update the Local Git Repository

Now that your development environment is configured and the Jenkins server is set up, modify the source in your local repository and push the change to the central repository hosted on AWS CodeCommit.

On your workstation, change directory to the local repository.

cd DemoRepo

Use the editor of your choice to modify HelloWorld.java with the content below, and then save the file in the DemoRepo directory.

class HelloWorld {
public static void main(String[] args) {
System.out.println("Scenario 1: Build Hello World using Jenkins");
}
}

Run the following git commands to commit and push your change.

git add HelloWorld.java
git commit -m "Modified HelloWord.java for scenario 1"
git push origin

Scenario 1: Monitor Build

After five minutes, go to the Jenkins home page. You should see a build.

In the Last Success column, click the build (shown here as #1). This will take you to the build output. Click Console Output to see the build details.

Scenario 2: Modify Project To Support "Pre-Build Branch Merging"

From the Jenkins home page, click on Demo in the Name column. 

Select "Configure" to modify project

Make sure the “Branch Specifier” field under Branches to build is blank.

For Additional Behaviors, add Merge before Build.

Set the name of the repository to origin.

Set the branch to merge to master.

Add the Post Build Action Git Publisher.

Select Push Only If Build Succeeds.

Select Merge Results.

Select Add Tag.

Set the tag to push to $GIT_COMMIT.

Select Create new tag.

Set the target remote name to origin.

Click Save.

Scenario 2: Update the Local Git Repository

Now that your development environment is configured and the Jenkins server is set up, modify the source in your local repository and push the change to the central repository hosted on AWS CodeCommit.

On your workstation, change directory to the local repository and create a branch where you will make your changes.

cd DemoRepo
git checkout -b MyDevBranch

Use the editor of your choice to modify HelloWorld.java with the content below, and then save the file in the DemoRepo directory.

class HelloWorld {
public static void main(String[] args) {
System.out.println("Build Hello World using Jenkins!");
}
}

Run the following git commands to commit and push your change.

git add HelloWorld.java
git commit -m "Modified HelloWord.java for sceanrio 2"
git push origin MyDevBranch

Scenario 2: Monitor Build

After five minutes, go to the Jenkins home page. You should see a build.

In the Last Success column, click the build (shown here as #2). This will take you to the build output. Click Console Output to see the build details.

Scenario 2:  Verify The Master Branch Is Updated

Create another local repository named DemoRepo2. Verify the master branch includes your changes.

cd ..
git clone https://git-codecommit.us-east-1.amazonaws.com/v1/repos/DemoRepo DemoRepo2
cd DemoRepo2

Use the editor of your choice to open HelloWorld.java. It should include the change you made in your local DemoRepo repository. 

We hope this helps to get you started using Jenkins with your AWS CodeCommit repositories.  Let us know if you have questions, or if there are other product integrations you are interested in.

 

Send ECS Container Logs to CloudWatch Logs for Centralized Monitoring

Post Syndicated from Chris Barclay original http://blogs.aws.amazon.com/application-management/post/TxFRDMTMILAA8X/Send-ECS-Container-Logs-to-CloudWatch-Logs-for-Centralized-Monitoring

My colleagues Brandon Chavis, Pierre Steckmeyer and Chad Schmutzer sent a nice guest post that demonstrates how to send your container logs to a central source for easy troubleshooting and alarming.

 

—–

Amazon EC2 Container Service (Amazon ECS) is a highly scalable, high performance container management service that supports Docker containers and allows you to easily run applications on a managed cluster of Amazon EC2 instances.

In this multipart blog post, we have chosen to take a universal struggle amongst IT professionals—log collection—and approach it from different angles to highlight possible architectural patterns that facilitate communication and data sharing between containers.

When building applications on ECS, it is a good practice to follow a microservices approach, which encourages the design of a single application component in a single container. This design improves flexibility and elasticity, while leading to a loosely coupled architecture for resilience and ease of maintenance. However, this architectural style makes it important to consider how your containers will communicate and share data with each other.

Why is it useful?

Application logs are useful for many reasons. They are the primary source of troubleshooting information. In the field of security, they are essential to forensics. Web server logs are often leveraged for analysis (at scale) in order to gain insight into usage, audience, and trends.

Centrally collecting container logs is a common problem that can be solved in a number of ways. The Docker community has offered solutions such as having working containers map a shared volume; having a log-collecting container; and getting logs from a container that logs to stdout/stderr and retrieving them with docker logs.

In this post, we present a solution using Amazon CloudWatch Logs. CloudWatch is a monitoring service for AWS cloud resources and the applications you run on AWS. CloudWatch Logs can be used to collect and monitor your logs for specific phrases, values, or patterns. For example, you could set an alarm on the number of errors that occur in your system logs or view graphs of web request latencies from your application logs. The additional advantages here are that you can look at a single pane of glass for all of your monitoring needs because such metrics as CPU, disk I/O, and network for your container instances are already available on CloudWatch.

Here is how we are going to do it

Our approach involves setting up a container whose sole purpose is logging. It runs rsyslog and the CloudWatch Logs agent, and we use Docker Links to communicate to other containers. With this strategy, it becomes easy to link existing application containers such as Apache and have discrete logs per task. This logging container is defined in each ECS task definition, which is a collection of containers running together on the same container instance. With our container log collection strategy, you do not have to modify your Docker image. Any log mechanism tweak is specified in the task definition.

 

Note: This blog provisions a new ECS cluster in order to test the following instructions. Also, please note that we are using the US East (N. Virginia) region throughout this exercise. If you would like to use a different AWS region, please make sure to update your configuration accordingly.

Linking to a CloudWatch logging container

We will create a container that can be deployed as a syslog host. It will accept standard syslog connections on 514/TCP to rsyslog through container links, and will also forward those logs to CloudWatch Logs via the CloudWatch Logs agent. The idea is that this container can be deployed as the logging component in your architecture (not limited to ECS; it could be used for any centralized logging).

As a proof of concept, we show you how to deploy a container running httpd, clone some static web content (for this example, we clone the ECS documentation), and have the httpd access and error logs sent to the rsyslog service running on the syslog container via container linking. We also send the Docker and ecs-agent logs from the EC2 instance the task is running on. The logs in turn are sent to CloudWatch Logs via the CloudWatch Logs agent.

Note: Be sure to replace your information throughout the document as necessary (for example: replace "my_docker_hub_repo" with the name of your own Docker Hub repository).

We also assume that all following requirements are in place in your AWS account:

A VPC exists for the account

There is an IAM user with permissions to launch EC2 instances and create IAM policies/roles

SSH keys have been generated

Git and Docker are installed on the image building host

The user owns a Docker Hub account and a repository ("my_docker_hub_repo" in this document)

Let’s get started.

Create the Docker image

The first step is to create the Docker image to use as a logging container. For this, all you need is a machine that has Git and Docker installed. You could use your own local machine or an EC2 instance.

Install Git and Docker. The following steps pertain to the Amazon Linux AMI but you should follow the Git and Docker installation instructions respective to your machine.

$ sudo yum update -y && sudo yum -y install git docker

Make sure that the Docker service is running:

$ sudo service docker start

Clone the GitHub repository containing the files you need:

$ git clone https://github.com/awslabs/ecs-cloudwatch-logs.git
$ cd ecs-cloudwatch-logs

You should now have a directory containing two .conf files and a Dockerfile. Feel free to read the content of these files and identify the mechanisms used.

Log in to Docker Hub:

$ sudo docker login

Build the container image (replace the my_docker_hub_repo with your repository name):

$ sudo docker build -t my_docker_hub_repo/cloudwatchlogs .

Push the image to your repo:

$ sudo docker push my_docker_hub_repo/cloudwatchlogs

Use the build-and-push time to dive deeper into what will live in this container. You can follow along by reading the Dockerfile. Here are a few things worth noting:

The first RUN updates the distribution and installs rsyslog, pip, and curl.

The second RUN downloads the AWS CloudWatch Logs agent.

The third RUN enables remote connections for rsyslog.

The fourth RUN removes the local6 and local7 facilities to prevent duplicate entries. If you don’t do this, you would see every single Apache log entry in /var/log/syslog.

The last RUN specifies which output files will receive the log entries on local6 and local7 (e.g., "if the facility is local6 and it is tagged with httpd, put those into this httpd-access.log file").

We use Supervisor to run more than one process in this container: rsyslog and the CloudWatch Logs agent.

We expose port 514 for rsyslog to collect log entries via the Docker link.

Create an ECS cluster

Now, create an ECS cluster. One way to do so is to use the Amazon ECS console first-run wizard. For now, though, all you need is an empty ECS cluster.

7. Navigate to the ECS console and choose Create cluster. Give it a unique name that you have not used before (such as "ECSCloudWatchLogs"), and choose Create.
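
If you would rather use the AWS CLI than the console for this step, a single command creates the cluster with the same name used above:

$ aws ecs create-cluster --cluster-name ECSCloudWatchLogs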

Create an IAM role

The next five steps set up a CloudWatch-enabled IAM role with EC2 permissions and spin up a new container instance with this role. All of this can be done manually via the console, or you can run a CloudFormation template. To use the CloudFormation template, navigate to the CloudFormation console, create a new stack by using this template, and go straight to step 14 (just specify the ECS cluster name used above, choose your preferred instance type, select the appropriate EC2 SSH key, and leave the rest unchanged). Otherwise, continue on to step 8.

8. Create an IAM policy for CloudWatch Logs and ECS: point your browser to the IAM console, choose Policies and then Create Policy. Choose Select next to Create Your Own Policy. Give your policy a name (e.g., ECSCloudWatchLogs) and paste the text below as the Policy Document value.

{
"Version": "2012-10-17",
"Statement": [
{
"Action": [
"logs:Create*",
"logs:PutLogEvents"
],
"Effect": "Allow",
"Resource": "arn:aws:logs:*:*:*"
},
{
"Action": [
"ecs:CreateCluster",
"ecs:DeregisterContainerInstance",
"ecs:DiscoverPollEndpoint",
"ecs:RegisterContainerInstance",
"ecs:Submit*",
"ecs:Poll"
],
"Effect": "Allow",
"Resource": "*"
}
]
}

9. Create a new IAM EC2 service role and attach the above policy to it. In IAM, choose Roles, Create New Role. Pick a name for the role (e.g., ECSCloudWatchLogs). Choose Role Type, Amazon EC2. Find and pick the policy you just created, click Next Step, and then Create Role.

Launch an EC2 instance and ECS cluster

10. Launch an instance with the Amazon ECS AMI and the above role in the US East (N. Virginia) region. On the EC2 console page, choose Launch Instance. Choose Community AMIs. In the search box, type "amazon-ecs-optimized" and choose Select for the latest version (2015.03.b). Select the appropriate instance type and choose Next.

11. Choose the appropriate Network value for your ECS cluster. Make sure that Auto-assign Public IP is enabled. Choose the IAM role that you just created (e.g., ECSCloudWatchLogs). Expand Advanced Details and in the User data field, add the following while substituting your_cluster_name for the appropriate name:

#!/bin/bash
echo ECS_CLUSTER=your_cluster_name >> /etc/ecs/ecs.config

12. Choose Next: Add Storage, then Next: Tag Instance. You can give your container instance a name on this page. Choose Next: Configure Security Group. On this page, you should make sure that both SSH and HTTP are open to at least your own IP address.

13. Choose Review and Launch, then Launch and Associate with the appropriate SSH key. Note the instance ID.

14. Ensure that your newly spun-up EC2 instance is part of your container instances (note that it may take up to a minute for the container instance to register with ECS). In the ECS console, select the appropriate cluster. Select the ECS Instances tab. You should see a container instance with the instance ID that you just noted after a minute.

15. On the left pane of the ECS console, choose Task Definitions, then Create new Task Definition. On the JSON tab, paste the code below, overwriting the default text. Make sure to replace "my_docker_hub_repo" with your own Docker Hub repo name and choose Create.

{
"volumes": [
{
"name": "ecs_instance_logs",
"host": {
"sourcePath": "/var/log"
}
}
],
"containerDefinitions": [
{
"environment": [],
"name": "cloudwatchlogs",
"image": "my_docker_hub_repo/cloudwatchlogs",
"cpu": 50,
"portMappings": [],
"memory": 64,
"essential": true,
"mountPoints": [
{
"sourceVolume": "ecs_instance_logs",
"containerPath": "/mnt/ecs_instance_logs",
"readOnly": true
}
]
},
{
"environment": [],
"name": "httpd",
"links": [
"cloudwatchlogs"
],
"image": "httpd",
"cpu": 50,
"portMappings": [
{
"containerPort": 80,
"hostPort": 80
}
],
"memory": 128,
"entryPoint": ["/bin/bash", "-c"],
"command": [
"apt-get update && apt-get -y install wget && echo ‘CustomLog "| /usr/bin/logger -t httpd -p local6.info -n cloudwatchlogs -P 514" "%v %h %l %u %t %r %>s %b %{Referer}i %{User-agent}i"’ >> /usr/local/apache2/conf/httpd.conf && echo ‘ErrorLogFormat "%v [%t] [%l] [pid %P] %F: %E: [client %a] %M"’ >> /usr/local/apache2/conf/httpd.conf && echo ‘ErrorLog "| /usr/bin/logger -t httpd -p local7.info -n cloudwatchlogs -P 514"’ >> /usr/local/apache2/conf/httpd.conf && echo ServerName `hostname` >> /usr/local/apache2/conf/httpd.conf && rm -rf /usr/local/apache2/htdocs/* && cd /usr/local/apache2/htdocs && wget -mkEpnp -nH –cut-dirs=4 http://docs.aws.amazon.com/AmazonECS/latest/developerguide/Welcome.html && /usr/local/bin/httpd-foreground"
],
"essential": true
}
],
"family": "cloudwatchlogs"
}
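
If you prefer the AWS CLI for this step, you could save the JSON above to a file (for example, cloudwatchlogs-task.json, a name we are choosing here) and register it with a command like:

$ aws ecs register-task-definition --cli-input-json file://cloudwatchlogs-task.json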

What are some highlights of this task definition?

The sourcePath value allows the CloudWatch Logs agent running in the log collection container to access the host-based Docker and ECS agent log files. You can change the retention period in CloudWatch Logs.

The cloudwatchlogs container is marked essential, which means that if log collection goes down, so should the application it is collecting from. Similarly, the web server is marked essential as well. You can easily change this behavior.

The command section is a bit lengthy. Let us break it down:

We first install wget so that we can later clone the ECS documentation for display on our web server.

We then write four lines to httpd.conf. These are the echo commands. They describe how httpd will generate log files and their format. Notice how we tag (-t httpd) these files with httpd and assign them a specific facility (-p localX.info). We also specify that logger is to send these entries to host -n cloudwatchlogs on port -p 514. This will be handled by linking. Hence, port 514 is left untouched on the machine and we could have as many of these logging containers running as we want.

%h %l %u %t %r %>s %b %{Referer}i %{User-agent}i should look fairly familiar to anyone who has looked into tweaking Apache logs. The initial %v is the server name and it will be replaced by the container ID. This is how we are able to discern what container the logs come from in CloudWatch Logs.

We remove the default httpd landing page with rm -rf.

We instead use wget to download a clone of the ECS documentation.

And, finally, we start httpd. Note that we redirect httpd log files in our task definition at the command level for the httpd image. Applying the same concept to another image would simply require you to know where your application maintains its log files.

Create a service

16. On the services tab in the ECS console, choose Create. Choose the task definition created in step 15, name the service and set the number of tasks to 1. Select Create service.
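
The equivalent AWS CLI command would look something like the sketch below. The service name is our own choice; the cluster and task definition names match the ones used earlier in this post:

$ aws ecs create-service --cluster ECSCloudWatchLogs --service-name cloudwatchlogs --task-definition cloudwatchlogs --desired-count 1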

17. The task will start running shortly. You can press the refresh icon on your service’s Tasks tab. After the status says "Running", choose the task and expand the httpd container. The container instance IP will be a hyperlink under the Network bindings section’s External link. When you select the link you should see a clone of the Amazon ECS documentation. You are viewing this thanks to the httpd container running on your ECS cluster.

18. Open the CloudWatch Logs console to view the new ECS log entries.
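
You can inspect and tune the same log groups from the AWS CLI. For example, to list your log groups and set a 14-day retention period on one of them (the log group name below is a placeholder; use the name you see in the console):

$ aws logs describe-log-groups
$ aws logs put-retention-policy --log-group-name your-log-group-name --retention-in-days 14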

Conclusion

If you have followed all of these steps, you should now have a two-container task running in your ECS cluster. One container serves web pages while the other collects the log activity from the web container and sends it to CloudWatch Logs. Such a setup can be replicated with any other application. All you need to do is specify a different container image and describe the expected log files in the command section.

Send ECS Container Logs to CloudWatch Logs for Centralized Monitoring

Post Syndicated from Chris Barclay original http://blogs.aws.amazon.com/application-management/post/TxFRDMTMILAA8X/Send-ECS-Container-Logs-to-CloudWatch-Logs-for-Centralized-Monitoring

My colleagues Brandon Chavis, Pierre Steckmeyer and Chad Schmutzer sent a nice guest post that demonstrates how to send your container logs to a central source for easy troubleshooting and alarming.

 

—–

Amazon EC2 Container Service (Amazon ECS) is a highly scalable, high performance container management service that supports Docker containers and allows you to easily run applications on a managed cluster of Amazon EC2 instances.

In this multipart blog post, we have chosen to take a universal struggle amongst IT professionals—log collection—and approach it from different angles to highlight possible architectural patterns that facilitate communication and data sharing between containers.

When building applications on ECS, it is a good practice to follow a micro services approach, which encourages the design of a single application component in a single container. This design improves flexibility and elasticity, while leading to a loosely coupled architecture for resilience and ease of maintenance. However, this architectural style makes it important to consider how your containers will communicate and share data with each other.

Why is it useful?

Application logs are useful for many reasons. They are the primary source of troubleshooting information. In the field of security, they are essential to forensics. Web server logs are often leveraged for analysis (at scale) in order to gain insight into usage, audience, and trends.

Centrally collecting container logs is a common problem that can be solved in a number of ways. The Docker community has offered solutions such as having working containers map a shared volume; having a log-collecting container; and getting logs from a container that logs to stdout/stderr and retrieving them with docker logs.

In this post, we present a solution using Amazon CloudWatch Logs. CloudWatch is a monitoring service for AWS cloud resources and the applications you run on AWS. CloudWatch Logs can be used to collect and monitor your logs for specific phrases, values, or patterns. For example, you could set an alarm on the number of errors that occur in your system logs or view graphs of web request latencies from your application logs. The additional advantages here are that you can look at a single pane of glass for all of your monitoring needs because such metrics as CPU, disk I/O, and network for your container instances are already available on CloudWatch.

Here is how we are going to do it

Our approach involves setting up a container whose sole purpose is logging. It runs rsyslog and the CloudWatch Logs agent, and we use Docker Links to communicate to other containers. With this strategy, it becomes easy to link existing application containers such as Apache and have discrete logs per task. This logging container is defined in each ECS task definition, which is a collection of containers running together on the same container instance. With our container log collection strategy, you do not have to modify your Docker image. Any log mechanism tweak is specified in the task definition.

 

Note: This blog provisions a new ECS cluster in order to test the following instructions. Also, please note that we are using the US East (N. Virginia) region throughout this exercise. If you would like to use a different AWS region, please make sure to update your configuration accordingly.

Linking to a CloudWatch logging container

We will create a container that can be deployed as a syslog host. It will accept standard syslog connections on 514/TCP to rsyslog through container links, and will also forward those logs to CloudWatch Logs via the CloudWatch Logs agent. The idea is that this container can be deployed as the logging component in your architecture (not limited to ECS; it could be used for any centralized logging).

As a proof of concept, we show you how to deploy a container running httpd, clone some static web content (for this example, we clone the ECS documentation), and have the httpd access and error logs sent to the rsyslog service running on the syslog container via container linking. We also send the Docker and ecs-agent logs from the EC2 instance the task is running on. The logs in turn are sent to CloudWatch Logs via the CloudWatch Logs agent.

Note: Be sure to replace your information througout the document as necessary (for example: replace "my_docker_hub_repo" with the name of your own Docker Hub repository).

We also assume that all following requirements are in place in your AWS account:

A VPC exists for the account

There is an IAM user with permissions to launch EC2 instances and create IAM policies/roles

SSH keys have been generated

Git and Docker are installed on the image building host

The user owns a Docker Hub account and a repository ("my_docker_hub_repo" in this document)

Let’s get started.

Create the Docker image

The first step is to create the Docker image to use as a logging container. For this, all you need is a machine that has Git and Docker installed. You could use your own local machine or an EC2 instance.

Install Git and Docker. The following steps pertain to the Amazon Linux AMI but you should follow the Git and Docker installation instructions respective to your machine.

$ sudo yum update -y && sudo yum -y install git docker

Make sure that the Docker service is running:

$ sudo service docker start

Clone the GitHub repository containing the files you need:

$ git clone https://github.com/awslabs/ecs-cloudwatch-logs.git
$ cd ecs-cloudwatch-logs
You should now have a directory containing two .conf files and a Dockerfile. Feel free to read the content of these files and identify the mechanisms used.
 

Log in to Docker Hub:

$ sudo docker login

Build the container image (replace the my_docker_hub_repo with your repository name):

$ sudo docker build -t my_docker_hub_repo/cloudwatchlogs .

Push the image to your repo:

$ sudo docker push my_docker_hub_repo/cloudwatchlogs

Use the build-and-push time to dive deeper into what will live in this container. You can follow along by reading the Dockerfile. Here are a few things worth noting:

The first RUN updates the distribution and installs rsyslog, pip, and curl.

The second RUN downloads the AWS CloudWatch Logs agent.

The third RUN enables remote conncetions for rsyslog.

The fourth RUN removes the local6 and local7 facilities to prevent duplicate entries. If you don’t do this, you would see every single apache log entry in /var/log/syslog.

The last RUN specifies which output files will receive the log entries on local6 and local7 (e.g., "if the facility is local6 and it is tagged with httpd, put those into this httpd-access.log file").

We use Supervisor to run more than one process in this container: rsyslog and the CloudWatch Logs agent.

We expose port 514 for rsyslog to collect log entries via the Docker link.

Create an ECS cluster

Now, create an ECS cluster. One way to do so could be to use the Amazon ECS console first run wizard. For now, though, all you need is an ECS cluster.

7. Navigate to the ECS console and choose Create cluster. Give it a unique name that you have not used before (such as "ECSCloudWatchLogs"), and choose Create.

Create an IAM role

The next five steps set up a CloudWatch-enabled IAM role with EC2 permissions and spin up a new container instance with this role. All of this can be done manually via the console or you can run a CloudFormation template. To use the CloudFormation template, navigate to CloudFormation console, create a new stack by using this template and go straight to step 14 (just specify the ECS cluster name used above, choose your prefered instance type and select the appropriate EC2 SSH key, and leave the rest unchanged). Otherwise, continue on to step 8.

8. Create an IAM policy for CloudWatch Logs and ECS: point your browser to the IAM console, choose Policies and then Create Policy. Choose Select next to Create Your Own Policy. Give your policy a name (e.g., ECSCloudWatchLogs) and paste the text below as the Policy Document value.

{
"Version": "2012-10-17",
"Statement": [
{
"Action": [
"logs:Create*",
"logs:PutLogEvents"
],
"Effect": "Allow",
"Resource": "arn:aws:logs:*:*:*"
},
{
"Action": [
"ecs:CreateCluster",
"ecs:DeregisterContainerInstance",
"ecs:DiscoverPollEndpoint",
"ecs:RegisterContainerInstance",
"ecs:Submit*",
"ecs:Poll"
],
"Effect": "Allow",
"Resource": "*"
}
]
}

9. Create a new IAM EC2 service role and attach the above policy to it. In IAM, choose Roles, Create New Role. Pick a name for the role (e.g., ECSCloudWatchLogs). Choose Role Type, Amazon EC2. Find and pick the policy you just created, click Next Step, and then Create Role.

Launch an EC2 instance and ECS cluster

10. Launch an instance with the Amazon ECS AMI and the above role in the US East (N. Virginia) region. On the EC2 console page, choose Launch Instance. Choose Community AMIs. In the search box, type "amazon-ecs-optimized" and choose Select for the latest version (2015.03.b). Select the appropriate instance type and choose Next.

11. Choose the appropriate Network value for your ECS cluster. Make sure that Auto-assign Public IP is enabled. Choose the IAM role that you just created (e.g., ECSCloudWatchLogs). Expand Advanced Details and in the User data field, add the following while substituting your_cluster_name for the appropriate name:

#!/bin/bash
echo ECS_CLUSTER=your_cluster_name >> /etc/ecs/ecs.config

12. Choose Next: Add Storage, then Next: Tag Instance. You can give your container instance a name on this page. Choose Next: Configure Security Group. On this page, you should make sure that both SSH and HTTP are open to at least your own IP address.

13. Choose Review and Launch, then Launch, and select the appropriate SSH key pair. Note the instance ID.

14. Ensure that your newly launched EC2 instance has registered with your cluster; this may take up to a minute. In the ECS console, select the appropriate cluster and open the ECS Instances tab. You should see a container instance whose EC2 instance ID matches the one you noted in step 13.
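
You can also verify registration from the instance itself: the ECS agent exposes a local introspection endpoint on every container instance, so after connecting over SSH you can run the following and check that the cluster name and container instance ARN in the response look right:

$ curl -s http://localhost:51678/v1/metadata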

15. On the left pane of the ECS console, choose Task Definitions, then Create new Task Definition. On the JSON tab, paste the code below, overwriting the default text. Make sure to replace "my_docker_hub_repo" with your own Docker Hub repo name and choose Create.

{
"volumes": [
{
"name": "ecs_instance_logs",
"host": {
"sourcePath": "/var/log"
}
}
],
"containerDefinitions": [
{
"environment": [],
"name": "cloudwatchlogs",
"image": "my_docker_hub_repo/cloudwatchlogs",
"cpu": 50,
"portMappings": [],
"memory": 64,
"essential": true,
"mountPoints": [
{
"sourceVolume": "ecs_instance_logs",
"containerPath": "/mnt/ecs_instance_logs",
"readOnly": true
}
]
},
{
"environment": [],
"name": "httpd",
"links": [
"cloudwatchlogs"
],
"image": "httpd",
"cpu": 50,
"portMappings": [
{
"containerPort": 80,
"hostPort": 80
}
],
"memory": 128,
"entryPoint": ["/bin/bash", "-c"],
"command": [
"apt-get update && apt-get -y install wget && echo ‘CustomLog "| /usr/bin/logger -t httpd -p local6.info -n cloudwatchlogs -P 514" "%v %h %l %u %t %r %>s %b %{Referer}i %{User-agent}i"’ >> /usr/local/apache2/conf/httpd.conf && echo ‘ErrorLogFormat "%v [%t] [%l] [pid %P] %F: %E: [client %a] %M"’ >> /usr/local/apache2/conf/httpd.conf && echo ‘ErrorLog "| /usr/bin/logger -t httpd -p local7.info -n cloudwatchlogs -P 514"’ >> /usr/local/apache2/conf/httpd.conf && echo ServerName `hostname` >> /usr/local/apache2/conf/httpd.conf && rm -rf /usr/local/apache2/htdocs/* && cd /usr/local/apache2/htdocs && wget -mkEpnp -nH –cut-dirs=4 http://docs.aws.amazon.com/AmazonECS/latest/developerguide/Welcome.html && /usr/local/bin/httpd-foreground"
],
"essential": true
}
],
"family": "cloudwatchlogs"
}
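
If you prefer the CLI to the console JSON tab, the same task definition can be registered from a file; the file name below is just an example:

$ aws ecs register-task-definition --cli-input-json file://cloudwatchlogs-task.json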

What are some highlights of this task definition?

The sourcePath value allows the CloudWatch Logs agent running in the log collection container to access the host-based Docker and ECS agent log files. You can change the retention period in CloudWatch Logs.

The cloudwatchlogs container is marked essential, which means that if log collection goes down, so should the application it is collecting from. Similarly, the web server is marked essential as well. You can easily change this behavior.

The command section is a bit lengthy. Let us break it down:

We first install wget so that we can later clone the ECS documentation for display on our web server.

We then write four lines to httpd.conf (the echo commands). They describe how httpd generates its log entries and in what format. Notice how we tag the entries with httpd (-t httpd) and assign them a specific facility (-p localX.info). We also tell logger to send the entries to the host cloudwatchlogs (-n) on port 514 (-P). The Docker link handles that hostname, so port 514 is left untouched on the machine and we could run as many of these logging containers as we want.

%h %l %u %t %r %>s %b %{Referer}i %{User-agent}i should look fairly familiar to anyone who has tweaked Apache logs. The initial %v is the server name, which the ServerName `hostname` directive sets to the container's hostname (the container ID). This is how we can tell which container the logs come from in CloudWatch Logs.

We remove the default httpd landing page with rm -rf.

We instead use wget to download a clone of the ECS documentation.

And, finally, we start httpd. Note that we redirect httpd log files in our task definition at the command level for the httpd image. Applying the same concept to another image would simply require you to know where your application maintains its log files.
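
Once the service created in the next section is running, you can confirm the whole logging path with a manual test entry. This sketch assumes you are on the container instance and that the httpd and cloudwatchlogs containers are both up; it reuses the exact logger invocation from the CustomLog directive, so the entry should appear in CloudWatch Logs shortly afterwards:

# Find the running httpd container and send a test entry over the Docker link
$ HTTPD_ID=$(sudo docker ps | grep httpd | awk '{print $1}' | head -n 1)
$ sudo docker exec $HTTPD_ID /usr/bin/logger -t httpd -p local6.info -n cloudwatchlogs -P 514 "manual test entry"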


Create a service

16. On the Services tab of your cluster in the ECS console, choose Create. Choose the task definition created in step 15, name the service, set the number of tasks to 1, and then choose Create service.

17. The task will start running shortly; you can press the refresh icon on your service's Tasks tab to follow its progress. Once the status says "Running", choose the task and expand the httpd container. The container instance IP appears as a hyperlink under External Link in the Network Bindings section. Select the link and you should see a clone of the Amazon ECS documentation, served by the httpd container running on your ECS cluster.

18. Open the CloudWatch Logs console to view the new log entries arriving from your container instance.

Conclusion

If you have followed all of these steps, you should now have a two-container task running in your ECS cluster: one container serves web pages while the other collects the log activity from the web container and sends it to CloudWatch Logs. You can replicate this setup with any other application; all you need to do is specify a different container image and describe the application's log files in the command section.

Using the Elastic Beanstalk (EB) CLI to create, manage, and share environment configuration

Post Syndicated from Abhishek Singh original http://blogs.aws.amazon.com/application-management/post/Tx1YHAJ5EELY54J/Using-the-Elastic-Beanstalk-EB-CLI-to-create-manage-and-share-environment-config

My colleague Nick Humrich wrote up the guest post below to share a powerful way to use the EB CLI to manage environment configurations — Abhishek

The AWS Elastic Beanstalk command line interface (EB CLI) makes it easier for developers to get started with Elastic Beanstalk by using command line tools. Last November, we released a revamped version of the EB CLI that added a number of new commands and made it even simpler to get started. As part of a recent update (2/17), we’ve added new commands to create, manage, and share environment configurations.

In this post, we will discuss how to create configurations, save them in templates, make a template the default for future deployments, and share templates by checking them in to version control. The remainder of this post will assume that you have EB CLI 3.x installed and have an application that you will be deploying to Elastic Beanstalk. If you haven’t installed the EB CLI, see Install the EB CLI using pip (Windows, Linux, OS X or Unix).

To begin, we will create an Elastic Beanstalk environment and deploy our application to it:

$ git clone https://github.com/awslabs/eb-python-flask.git
$ cd eb-python-flask
$ eb init -p python2.7 # Configures the current folder for EB

$ eb create dev-env # Creates the EB environment and pushes
# the contents of the app folder to the
# newly created environment

$ eb open # Opens the default browser to the
# current application’s URL

Creating and applying an environment configuration

Before we start editing our environment configuration, let’s take a snapshot of the current settings. We can then roll back to them should any problems arise.

You can save a configuration of a running environment by using the following command:

$ eb config save dev-env --cfg initial-configuration

The command saves a configuration locally and in Elastic Beanstalk. The output from EB CLI tells you the location of the local copy, the .elasticbeanstalk/saved_configs/ folder.
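
For example, after the save you should see something like the following in that folder (the EB CLI adds the .cfg.yml extension):

$ ls .elasticbeanstalk/saved_configs/
initial-configuration.cfg.yml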

Now that we have a snapshot, let’s make some changes to our environment so that we can see how to apply a configuration to an environment. Type the following command to set the environment variable “ENABLE_COOL_NEW_FEATURE” to “true”.

$ eb setenv ENABLE_COOL_NEW_FEATURE=true

If you want to revert to your previous configuration, you can do so by applying a saved configuration during an environment update.

$ eb config --cfg initial-configuration

This will work even if you don’t have a local copy because the saved configuration is also stored in Elastic Beanstalk.
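
If you are not sure which saved configurations already exist for the application in Elastic Beanstalk, you can list them:

$ eb config list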

Creating an environment template

Now that our cool new feature is working, we are ready to create a production environment. Let’s begin by saving a configuration with the following:

$ eb config save dev-env --cfg prod

Now, open this file in a text editor to modify/remove sections as necessary for your production environment. For this example, we just need the environment variables section. We will also add the line “FLASK_DEBUG: false” to turn debugging text off.

Note: AWSConfigurationTemplateVersion is a required field. Do not remove it from the configuration file.

The following is an example of what the file should now look like:

EnvironmentConfigurationMetadata:
  Description: Configuration created from the EB CLI using "eb config save".
  DateModified: '1427752586000'
  DateCreated: '1427752586000'
AWSConfigurationTemplateVersion: 1.1.0.0
OptionSettings:
  aws:elasticbeanstalk:application:environment:
    ENABLE_COOL_NEW_FEATURE: true
    FLASK_DEBUG: false

When you are done revising the configuration, you can upload it to Elastic Beanstalk by running the following command:

$ eb config put prod

This command also validates the saved configuration to make sure it doesn’t contain any errors.

Using an environment template to create an environment

Now that we have an environment template, we can create our new prod environment from the template.

$ eb create prod-env --cfg prod

Updating an existing environment using an environment template

You can also update the currently running environment to use the saved configuration by running the following command:

$ eb config dev-env --cfg prod

Alternatively, you can pipe the configuration into the config command:

$ cat prod.cfg.yml | eb config dev-env

Using default option settings with EB CLI

Specifying a template every time you create an environment can be annoying. When you save a template with the name “default,” the CLI will use it automatically for all new environments.

$ eb config save --cfg default

This will save a configuration called default locally and in Elastic Beanstalk. Open the saved configuration with a text editor and remove all sections that you do not want included as default settings for environments that apply the saved configuration. EB CLI will now use these settings automatically every time you run the eb create command.
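
For example, once a default template has been saved, a plain eb create picks it up without any extra flags (the environment name here is only an example):

$ eb config save dev-env --cfg default
$ eb create staging-env # the "default" saved configuration is applied automatically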

Checking configurations into version control

If you want to check in your saved configurations so that anyone with access to your code can use the same settings in their own environments, or if you want to track different versions of the saved configurations, move the file up one level from the .elasticbeanstalk/saved_configs/ folder into the .elasticbeanstalk/ directory. The file can then be checked in and will still work with the EB CLI. After you move the file, add and commit it.
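
For instance, to share the prod configuration created earlier, a minimal sequence (assuming your project is already a Git repository) would be:

$ mv .elasticbeanstalk/saved_configs/prod.cfg.yml .elasticbeanstalk/
$ git add .elasticbeanstalk/prod.cfg.yml
$ git commit -m "Share prod environment configuration"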


Set up a build pipeline with Jenkins and Amazon ECS

Post Syndicated from Chris Barclay original http://blogs.aws.amazon.com/application-management/post/Tx32RHFZHXY6ME1/Set-up-a-build-pipeline-with-Jenkins-and-Amazon-ECS

My colleague Daniele Stroppa sent a nice guest post that demonstrates how to use Jenkins to build Docker images for Amazon EC2 Container Service.

 

—–

 

In this walkthrough, we’ll show you how to set up and configure a build pipeline using Jenkins and the Amazon EC2 Container Service (ECS).

 

We’ll be using a sample Python application, available on GitHub. The repository contains a simple Dockerfile that uses a python base image and runs our application:

FROM python:2-onbuild
CMD [ "python", "./application.py" ]

This Dockerfile is used by the build pipeline to create a new Docker image upon pushing code to the repository. The built image will then be used to start a new service on an ECS cluster.

 

For the purpose of this walkthrough, fork the py-flask-signup-docker repository to your account.

 

Set up the build environment

For our build environment we’ll launch an Amazon EC2 instance using the Amazon Linux AMI and install and configure the required packages. Make sure that the security group you select for your instance allows traffic on ports TCP/22 and TCP/80.

 

Install and configure Jenkins, Docker and Nginx

Connect to your instance using your private key and switch to the root user. First, let’s update the repositories and install Docker, Nginx and Git.

# yum update -y
# yum install -y docker nginx git

To install Jenkins on Amazon Linux, we need to add the Jenkins repository and install Jenkins from there.

# wget -O /etc/yum.repos.d/jenkins.repo http://pkg.jenkins-ci.org/redhat/jenkins.repo
# rpm --import http://pkg.jenkins-ci.org/redhat/jenkins-ci.org.key
# yum install jenkins

As Jenkins typically uses port TCP/8080, we’ll configure Nginx as a proxy. Edit the Nginx config file (/etc/nginx/nginx.conf) and change the server configuration to look like this:

server {
listen 80;
server_name _;

location / {
proxy_pass http://127.0.0.1:8080;
}
}

We’ll be using Jenkins to build our Docker images, so we need to add the jenkins user to the docker group. A reboot may be required for the changes to take effect.

# usermod -a -G docker jenkins

Start the Docker, Jenkins and Nginx services and make sure they will be running after a reboot:

# service docker start
# service jenkins start
# service nginx start
# chkconfig docker on
# chkconfig jenkins on
# chkconfig nginx on

You can launch the Jenkins instance complete with all the required plugins with this CloudFormation template.

 

Point your browser to the public DNS name of your EC2 instance (e.g. ec2-54-163-4-211.compute-1.amazonaws.com) and you should be able to see the Jenkins home page:

 

 

The Jenkins installation is currently accessible through the Internet without any form of authentication. Before proceeding to the next step, let’s secure Jenkins. Select Manage Jenkins on the Jenkins home page, click Configure Global Security and then enable Jenkins security by selecting the Enable Security checkbox.

 

For the purpose of this walkthrough, select Jenkins’s Own User Database under Security realm and make sure to select the Allow users to sign up checkbox. Under Authorization, select Matrix-based security. Add a user (e.g. admin) and provide necessary privileges to this user.

 

 

After that's complete, save your changes. You will then be asked to provide a username and password to log in. Click Create an account, provide your username (i.e. admin), fill in the user details, and you will be able to log in to Jenkins securely.

 

Install and configure Jenkins plugins

The last step in setting up our build environment is to install and configure the Jenkins plugins required to build a Docker image and publish it to a Docker registry (DockerHub in our case). We’ll also need a plugin to interact with the code repository of our choice, GitHub in our case.

 

From the Jenkins dashboard select Manage Jenkins and click Manage Plugins. On the Available tab, search for and select the following plugins:

Docker Build and Publish plugin

dockerhub plugin

Github plugin

Then click the Install button. After the plugin installation is completed, select Manage Jenkins from the Jenkins dashboard and click Configure System. Look for the Docker Image Builder section and fill in your Docker registry (DockerHub) credentials:

 

 

Install and configure the Amazon ECS CLI

Now we are ready to set up and configure the ECS Command Line Interface (CLI). The sample application creates and uses an Amazon DynamoDB table to store signup information, so make sure that the IAM Role that you create for the EC2 instances allows the dynamodb:* action.

 

Follow the Setting Up with Amazon ECS guide to get ready to use ECS. If you haven’t done so yet, make sure to start at least one container instance in your account and create the Amazon ECS service role in the AWS IAM console.

 

Make sure that Jenkins is able to use the ECS CLI. Switch to the jenkins user and configure the AWS CLI, providing your credentials:

# sudo -su jenkins
> aws configure
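
Before moving on, it is worth confirming that the jenkins user's credentials work against ECS; any simple read-only call will do, for example:

> aws ecs list-clusters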

Login to Docker Hub
The jenkins user needs to log in to Docker Hub before running the first build:

# docker login

Create a task definition template

Create a task definition template for our application (note, you will replace the image name with your own repository):

{
"family": "flask-signup",
"containerDefinitions": [
{
"image": "your-repository/flask-signup:v_%BUILD_NUMBER%",
"name": "flask-signup",
"cpu": 10,
"memory": 256,
"essential": true,
"portMappings": [
{
"containerPort": 5000,
"hostPort": 80
}
]
}
]
}

Save your task definition template as flask-signup.json. Since the image specified in the task definition template will be built in the Jenkins job, at this point we will create a dummy task definition. Substitute the %BUILD_NUMBER% parameter in your task definition template with a non-existent value (0) and register it with ECS:

# sed -e "s;%BUILD_NUMBER%;0;g" flask-signup.json > flask-signup-v_0.json
# aws ecs register-task-definition --cli-input-json file://flask-signup-v_0.json
{
"taskDefinition": {
"volumes": [],
"taskDefinitionArn": "arn:aws:ecs:us-east-1:123456789012:task-definition/flask-signup:1",
"containerDefinitions": [
{
"name": "flask-signup",
"image": "your-repository/flask-signup:v_0",
"cpu": 10,
"portMappings": [
{
"containerPort": 5000,
"hostPort": 80
}
],
"memory": 256,
"essential": true
}
],
"family": "flask-signup",
"revision": 1
}
}

Make note of the family value (flask-signup), as it will be needed when configuring the Execute shell step in the Jenkins job.

 

Create the ECS IAM Role, an ELB and your service definition

Create a new IAM role (e.g. ecs-service-role), select the Amazon EC2 Container Service Role type and attach the AmazonEC2ContainerServiceRole policy. This allows ECS to create and manage AWS resources, such as an ELB, on your behalf. Create an Amazon Elastic Load Balancing (ELB) load balancer to be used in your service definition and note the ELB name (e.g. elb-flask-signup-1985465812). Create the flask-signup-service service, specifying the task definition (e.g. flask-signup) and the ELB name (e.g. elb-flask-signup-1985465812):

# aws ecs create-service --cluster default --service-name flask-signup-service --task-definition flask-signup --load-balancers loadBalancerName=elb-flask-signup-1985465812,containerName=flask-signup,containerPort=5000 --role ecs-service-role --desired-count 0
{
"service": {
"status": "ACTIVE",
"taskDefinition": "arn:aws:ecs:us-east-1:123456789012:task-definition/flask-signup:1",
"desiredCount": 0,
"serviceName": "flask-signup-service",
"clusterArn": "arn:aws:ecs:us-east-1:123456789012:cluster/default",
"serviceArn": "arn:aws:ecs:us-east-1:123456789012:service/flask-signup-service",
"runningCount": 0
}
}

Since we have not yet built a Docker image for our task, make sure to set the --desired-count flag to 0.

 

Configure the Jenkins build

On the Jenkins dashboard, click on New Item, select the Freestyle project job, add a name for the job, and click OK. Configure the Jenkins job:

Under GitHub Project, add the path of your GitHub repository – e.g. https://github.com/awslabs/py-flask-signup-docker. In addition to the application source code, the repository contains the Dockerfile used to build the image, as explained at the beginning of this walkthrough. 

Under Source Code Management provide the Repository URL for Git, e.g. https://github.com/awslabs/py-flask-signup-docker.

In the Build Triggers section, select Build when a change is pushed to GitHub.

In the Build section, add a Docker build and publish step to the job and configure it to publish to your Docker registry repository (e.g. DockerHub) and add a tag to identify the image (e.g. v_$BUILD_NUMBER). 

 

The Repository Name specifies the name of the Docker repository where the image will be published; this is composed of a user name (dstroppa) and an image name (flask-signup). In our case, the Dockerfile sits in the root path of our repository, so we won’t specify any path in the Directory Dockerfile is in field. Note, the repository name needs to be the same as what is used in the task definition template in flask-signup.json.

Add an Execute shell step and add the ECS CLI commands to start a new task on your ECS cluster.

The script for the Execute shell step will look like this:

#!/bin/bash
SERVICE_NAME="flask-signup-service"
IMAGE_VERSION="v_"${BUILD_NUMBER}
TASK_FAMILY="flask-signup"

# Create a new task definition for this build
sed -e "s;%BUILD_NUMBER%;${BUILD_NUMBER};g" flask-signup.json > flask-signup-v_${BUILD_NUMBER}.json
aws ecs register-task-definition --family flask-signup --cli-input-json file://flask-signup-v_${BUILD_NUMBER}.json

# Update the service with the new task definition and desired count
TASK_REVISION=`aws ecs describe-task-definition --task-definition flask-signup | egrep "revision" | tr "/" " " | awk '{print $2}' | sed 's/"$//'`
DESIRED_COUNT=`aws ecs describe-services --services ${SERVICE_NAME} | egrep "desiredCount" | tr "/" " " | awk '{print $2}' | sed 's/,$//'`
if [ ${DESIRED_COUNT} = "0" ]; then
DESIRED_COUNT="1"
fi

aws ecs update-service --cluster default --service ${SERVICE_NAME} --task-definition ${TASK_FAMILY}:${TASK_REVISION} --desired-count ${DESIRED_COUNT}
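
The egrep/awk/sed parsing above works against the default CLI output, but it is brittle. As an alternative sketch (not part of the original job), the same two lookups can be written with the CLI's built-in --query option and text output:

# Same lookups using --query (JMESPath) instead of text parsing
TASK_REVISION=$(aws ecs describe-task-definition --task-definition ${TASK_FAMILY} --query 'taskDefinition.revision' --output text)
DESIRED_COUNT=$(aws ecs describe-services --cluster default --services ${SERVICE_NAME} --query 'services[0].desiredCount' --output text)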

To trigger the build process on Jenkins upon pushing to the GitHub repository, we need to configure a service hook on GitHub. Go to the GitHub repository settings page, select Webhooks and Services and add a service hook for Jenkins (GitHub plugin). Add the Jenkins hook URL: http://<username>:<password>@<EC2-DNS-Name>/github-webhook/.

 

 

We have now configured the Jenkins job so that whenever a change is pushed to the GitHub repository, it triggers the build process on Jenkins.

 

Happy building

From your local repository, push the application code to GitHub:

# git add *
# git commit -m "Kicking off Jenkins build"
# git push origin master

This will trigger the Jenkins job. After the job is completed, point your browser to the public DNS name for your EC2 container instance and verify that the application is correctly running:

 

Conclusion

In this walkthrough we demonstrated how to use Jenkins to automate the deployment of an ECS service. See the documentation for further information on Amazon ECS.

 

 
