Thanks to Greg Eppel, Sr. Solutions Architect, Microsoft Platform for this great blog that describes how to create a custom CodeBuild build environment for the .NET Framework. — AWS CodeBuild is a fully managed build service that compiles source code, runs tests, and produces software packages that are ready to deploy. CodeBuild provides curated build environments for programming languages and runtimes such as Android, Go, Java, Node.js, PHP, Python, Ruby, and Docker. CodeBuild now supports builds for the Microsoft Windows Server platform, including a prepackaged build environment for .NET Core on Windows. If your application uses the .NET Framework, you will need to use a custom Docker image to create a custom build environment that includes the Microsoft proprietary Framework Class Libraries. For information about why this step is required, see our FAQs. In this post, I’ll show you how to create a custom build environment for .NET Framework applications and walk you through the steps to configure CodeBuild to use this environment.
Build environments are Docker images that include a complete file system with everything required to build and test your project. To use a custom build environment in a CodeBuild project, you build a container image for your platform that contains your build tools, push it to a Docker container registry such as Amazon Elastic Container Registry (Amazon ECR), and reference it in the project configuration. When it builds your application, CodeBuild retrieves the Docker image from the container registry specified in the project configuration and uses the environment to compile your source code, run your tests, and package your application.
Step 1: Launch EC2 Windows Server 2016 with Containers
In the Amazon EC2 console, in your region, launch an Amazon EC2 instance from a Microsoft Windows Server 2016 Base with Containers AMI.
Increase disk space on the boot volume to at least 50 GB to account for the larger size of containers required to install and run Visual Studio Build Tools.
Run the following command in that directory. This process can take a while; the time depends on the size of the EC2 instance you launched. In my tests, a t2.2xlarge takes less than 30 minutes to build the image and produces an image of approximately 15 GB.
docker build -t buildtools2017:latest -m 2GB .
Run the following command to test the container and start a command shell with all the developer environment variables:
docker run -it buildtools2017
In the Amazon ECR console, create a repository. For the repository name, type buildtools2017. Choose Next step and then complete the remaining steps.
Run the following command to generate a docker login command that authenticates your local Docker engine to the registry. Make sure you have permissions to the Amazon ECR registry before you run it.
aws ecr get-login
In the same command prompt window, copy and paste the following commands:
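For example, assuming the standard Amazon ECR registry URL format (replace <account-id> and <region> with your own values, and run the docker login command output by aws ecr get-login first), the tag and push commands look like this:

docker tag buildtools2017:latest <account-id>.dkr.ecr.<region>.amazonaws.com/buildtools2017:latest
docker push <account-id>.dkr.ecr.<region>.amazonaws.com/buildtools2017:latest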
In the CodeCommit console, create a repository named DotNetFrameworkSampleApp. On the Configure email notifications page, choose Skip.
Clone a .NET Framework Docker sample application from GitHub. The repository includes a sample ASP.NET Framework application that we’ll use to demonstrate our custom build environment. On the EC2 instance, open a command prompt and run the following commands:
Navigate to the CodeCommit repository and confirm that the files you just pushed are there.
Step 4: Configure build spec
To build your .NET Framework application with CodeBuild you use a build spec, which is a collection of build commands and related settings, in YAML format, that AWS CodeBuild can use to run a build. You can include a build spec as part of the source code or you can define a build spec when you create a build project. In this example, I include a build spec as part of the source code.
In the root directory of your source directory, create a YAML file named buildspec.yml.
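The buildspec used in the walkthrough isn’t reproduced here; a minimal sketch for a .NET Framework project, assuming the nuget and msbuild tools available in the Visual Studio Build Tools image and a hypothetical project file named SampleApp.csproj, looks like this:

version: 0.2
phases:
  build:
    commands:
      - nuget restore
      - msbuild SampleApp.csproj /p:Configuration=Release
artifacts:
  files:
    - '**/*'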
At this point, we have a Docker image with Visual Studio Build Tools installed and stored in the Amazon ECR registry. We also have a sample ASP.NET Framework application in a CodeCommit repository. Now we are going to set up CodeBuild to build the ASP.NET Framework application.
In the Amazon ECR console, choose the repository that was pushed earlier with the docker push command. On the Permissions tab, choose Add.
For Source Provider, choose AWS CodeCommit and then choose the DotNetFrameworkSampleApp repository you created earlier.
For Environment Image, choose Specify a Docker image.
For Environment type, choose Windows.
For Custom image type, choose Amazon ECR.
For Amazon ECR repository, choose the Docker image with the Visual Studio Build Tools installed, buildtools2017. Your configuration should look like the image below:
Choose Continue and then Save and Build to create your CodeBuild project and start your first build. You can monitor the status of the build in the console. You can also configure notifications that will notify subscribers whenever builds succeed, fail, go from one phase to another, or any combination of these events.
Summary
CodeBuild supports a number of platforms and languages out of the box. By using custom build environments, it can be extended to other runtimes. In this post, I showed you how to build a .NET Framework environment on a Windows container and demonstrated how to use it to build .NET Framework applications in CodeBuild.
We’re excited to see how customers extend and use CodeBuild to enable continuous integration and continuous delivery for their Windows applications. Feel free to share what you’ve learned extending CodeBuild for your own projects. Just leave questions or suggestions in the comments.
Well, we actually won’t show you how we create the magic in our big Oath consumer mail factory. Nevertheless, we wanted to share how interested developers can leverage some of the unique features we offer for our Yahoo and AOL Mail customers.
To drive experiences like our travel and shopping smart views or message threading, we tag qualifying mails with something we call DECOS and THREADID. While we won’t go into exactly how we use them internally, we wanted to share how they can be used and accessed through IMAP.
So let’s just look at a sample IMAP command chain. We’ll just assume that you are familiar with the IMAP protocol at this point and you know how to properly talk to an IMAP server.
So here’s how you would retrieve the DECOS and THREADID values for specific messages:
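A hypothetical sketch of such a session follows, with <DECOS-ITEM> and <THREADID-ITEM> standing in for the actual FETCH data item names exposed by the Yahoo/AOL IMAP servers (those names are not given here):

a1 LOGIN <username> <app-password>
a2 SELECT "INBOX"
a3 FETCH 1:10 (<DECOS-ITEM> <THREADID-ITEM>)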
As a serverless computing platform that supports the Java 8 runtime, AWS Lambda makes it easy to run any type of Java function simply by uploading a JAR file. To help define not only a Lambda serverless application but also Amazon API Gateway, Amazon DynamoDB, and other related services, the AWS Serverless Application Model (SAM) allows developers to use a simple AWS CloudFormation template.
AWS provides the AWS Toolkit for Eclipse that supports both Lambda and SAM. AWS also gives customers an easy way to create Lambda functions and SAM applications in Java using the AWS Command Line Interface (AWS CLI). After you build a JAR file, all you have to do is type the following commands:
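The commands themselves aren’t shown here; a typical SAM packaging and deployment with the AWS CLI, assuming your SAM template is named template.yaml, looks like this:

aws cloudformation package --template-file template.yaml --s3-bucket YOUR_BUCKET --output-template-file packaged.yaml
aws cloudformation deploy --template-file packaged.yaml --stack-name YOUR_STACK --capabilities CAPABILITY_IAM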
To consolidate these steps, customers can use a Maven archetype. An archetype is a predefined project template that makes getting started with function development exceptionally simple.
In this post, I introduce a Maven archetype that allows you to create a skeleton of AWS SAM for a Java function. Using this archetype, you can generate a sample Java code example and an accompanying SAM template to deploy it on AWS Lambda by a single Maven action.
Prerequisites
Make sure that the following software is installed on your workstation:
Java
Maven
AWS CLI
(Optional) AWS SAM CLI
Install Archetype
After you’ve set up those packages, install the archetype with the following commands:
git clone https://github.com/awslabs/aws-serverless-java-archetype
cd aws-serverless-java-archetype
mvn install
These are one-time operations, so you don’t need to run them for every new project. If you’d like, you can add the archetype to your company’s Maven repository so that other developers can use it later.
With those packages installed, you’re ready to develop your new Lambda Function.
Start a project
Now that you have the archetype, generate a customized project from it and run the code:
cd /path/to/project_home
# -DarchetypeRepository=local forces use of the local Maven repository.
# -DinteractiveMode=false runs in batch mode; omit it to be prompted for the
# remaining properties interactively instead of passing them on the command line.
mvn archetype:generate \
  -DarchetypeGroupId=com.amazonaws.serverless.archetypes \
  -DarchetypeArtifactId=aws-serverless-java-archetype \
  -DarchetypeVersion=1.0.0 \
  -DarchetypeRepository=local \
  -DinteractiveMode=false \
  -DgroupId=YOUR_GROUP_ID \
  -DartifactId=YOUR_ARTIFACT_ID \
  -Dversion=YOUR_VERSION \
  -DclassName=YOUR_CLASSNAME
You should have a directory called YOUR_ARTIFACT_ID that contains the files and folders shown below:
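The full listing isn’t reproduced here; based on the files referenced later in this post (template.yaml, event.json, and src/main/resources/log4j2.xml), the layout is roughly:

YOUR_ARTIFACT_ID/
  pom.xml
  template.yaml
  event.json
  src/
    main/
      java/        (contains YOUR_CLASSNAME.java and its request/response classes)
      resources/
        log4j2.xml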
The sample code is a working example. If you install the SAM CLI, you can invoke it with the commands below:
cd YOUR_ARTIFACT_ID
mvn -P invoke verify
[INFO] Scanning for projects...
[INFO]
[INFO] ---------------------------< com.riywo:foo >----------------------------
[INFO] Building foo 1.0
[INFO] --------------------------------[ jar ]---------------------------------
...
[INFO] --- maven-jar-plugin:3.0.2:jar (default-jar) @ foo ---
[INFO] Building jar: /private/tmp/foo/target/foo-1.0.jar
[INFO]
[INFO] --- maven-shade-plugin:3.1.0:shade (shade) @ foo ---
[INFO] Including com.amazonaws:aws-lambda-java-core:jar:1.2.0 in the shaded jar.
[INFO] Replacing /private/tmp/foo/target/lambda.jar with /private/tmp/foo/target/foo-1.0-shaded.jar
[INFO]
[INFO] --- exec-maven-plugin:1.6.0:exec (sam-local-invoke) @ foo ---
2018/04/06 16:34:35 Successfully parsed template.yaml
2018/04/06 16:34:35 Connected to Docker 1.37
2018/04/06 16:34:35 Fetching lambci/lambda:java8 image for java8 runtime...
java8: Pulling from lambci/lambda
Digest: sha256:14df0a5914d000e15753d739612a506ddb8fa89eaa28dcceff5497d9df2cf7aa
Status: Image is up to date for lambci/lambda:java8
2018/04/06 16:34:37 Invoking Package.Example::handleRequest (java8)
2018/04/06 16:34:37 Decompressing /tmp/foo/target/lambda.jar
2018/04/06 16:34:37 Mounting /private/var/folders/x5/ldp7c38545v9x5dg_zmkr5kxmpdprx/T/aws-sam-local-1523000077594231063 as /var/task:ro inside runtime container
START RequestId: a6ae19fe-b1b0-41e2-80bc-68a40d094d74 Version: $LATEST
Log output: Greeting is 'Hello Tim Wagner.'
END RequestId: a6ae19fe-b1b0-41e2-80bc-68a40d094d74
REPORT RequestId: a6ae19fe-b1b0-41e2-80bc-68a40d094d74 Duration: 96.60 ms Billed Duration: 100 ms Memory Size: 128 MB Max Memory Used: 7 MB
{"greetings":"Hello Tim Wagner."}
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 10.452 s
[INFO] Finished at: 2018-04-06T16:34:40+09:00
[INFO] ------------------------------------------------------------------------
This Maven goal invokes sam local invoke -e event.json, so you can see the sample output greeting Tim Wagner.
To deploy this application to AWS, you need an Amazon S3 bucket to upload your package. You can use the following command to create a bucket if you want:
aws s3 mb s3://YOUR_BUCKET --region YOUR_REGION
Now, you can deploy your application by just one command!
mvn deploy \
-DawsRegion=YOUR_REGION \
-Ds3Bucket=YOUR_BUCKET \
-DstackName=YOUR_STACK
[INFO] Scanning for projects...
[INFO]
[INFO] ---------------------------< com.riywo:foo >----------------------------
[INFO] Building foo 1.0
[INFO] --------------------------------[ jar ]---------------------------------
...
[INFO] --- exec-maven-plugin:1.6.0:exec (sam-package) @ foo ---
Uploading to aws-serverless-java/com.riywo:foo:1.0/924732f1f8e4705c87e26ef77b080b47 11657 / 11657.0 (100.00%)
Successfully packaged artifacts and wrote output template to file target/sam.yaml.
Execute the following command to deploy the packaged template
aws cloudformation deploy --template-file /private/tmp/foo/target/sam.yaml --stack-name <YOUR STACK NAME>
[INFO]
[INFO] --- maven-deploy-plugin:2.8.2:deploy (default-deploy) @ foo ---
[INFO] Skipping artifact deployment
[INFO]
[INFO] --- exec-maven-plugin:1.6.0:exec (sam-deploy) @ foo ---
Waiting for changeset to be created..
Waiting for stack create/update to complete
Successfully created/updated stack - archetype
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 37.176 s
[INFO] Finished at: 2018-04-06T16:41:02+09:00
[INFO] ------------------------------------------------------------------------
Maven automatically creates a shaded JAR file, uploads it to your S3 bucket, transforms template.yaml into a packaged template (target/sam.yaml), and creates or updates the CloudFormation stack.
To customize the process, modify the pom.xml file. For example, to avoid typing values for awsRegion, s3Bucket, or stackName, write them inside pom.xml and check it in to your VCS. Afterward, you and the rest of your team can deploy the function by typing just the following command:
mvn deploy
Options
The Lambda Java 8 runtime supports several handler types: POJO, simple type, and stream. The default for this archetype is the POJO style, which requires request and response classes, but these are generated by the archetype for you. If you want to use another handler type, use the handlerType property as shown below:
## POJO type (default)
mvn archetype:generate \
...
-DhandlerType=pojo
## Simple type - String
mvn archetype:generate \
...
-DhandlerType=simple
## Stream type
mvn archetype:generate \
...
-DhandlerType=stream
Also, the Lambda Java 8 runtime supports two logging options: Log4j 2 and LambdaLogger. This archetype generates a LambdaLogger implementation by default, but you can switch to Log4j 2 if you want:
If you use LambdaLogger, you can delete ./src/main/resources/log4j2.xml. See documentation for more details.
Conclusion
So, what’s next? Develop your Lambda function locally and type the following command: mvn deploy!
With this archetype code example, available in the GitHub repo, you should be able to deploy Lambda functions for Java 8 in a snap. If you have any questions or comments, please submit them below or leave them on GitHub.
The Internet of Things (IoT) has precipitated an influx of connected devices and data that can be mined to gain useful business insights. If you own an IoT device, you might want the data to be uploaded seamlessly from your connected devices to the cloud so that you can make use of cloud storage and processing power to perform sophisticated analysis of the data. To upload the data to the AWS Cloud, devices must pass authentication and authorization checks performed by the respective AWS services. The standard way of authenticating AWS requests is the Signature Version 4 algorithm, which requires the caller to have an access key ID and secret access key. Consequently, you would need to hardcode the access key ID and the secret access key on your devices. Alternatively, you can use the built-in X.509 certificate as the unique device identity to authenticate AWS requests.
AWS IoT has introduced the credentials provider feature that allows a caller to authenticate AWS requests by having an X.509 certificate. The credentials provider authenticates a caller using an X.509 certificate, and vends a temporary, limited-privilege security token. The token can be used to sign and authenticate any AWS request. Thus, the credentials provider relieves you from having to manage and periodically refresh the access key ID and secret access key remotely on your devices.
In the process of retrieving a security token, you use AWS IoT to create a thing (a representation of a specific device or logical entity), register a certificate, and create AWS IoT policies. You also configure an AWS Identity and Access Management (IAM) role and attach appropriate IAM policies to the role so that the credentials provider can assume the role on your behalf. You then make an HTTP-over-Transport Layer Security (TLS) mutual authentication request to the credentials provider, which uses your preconfigured thing, certificate, policies, and IAM role to authenticate and authorize the request and obtain a security token on your behalf. You can then use the token to sign any AWS request using Signature Version 4.
In this blog post, I explain the AWS IoT credentials provider design and then demonstrate the end-to-end process of retrieving a security token from AWS IoT and using the token to write a temperature and humidity record to a specific Amazon DynamoDB table.
Note: This post assumes you are familiar with AWS IoT and IAM to perform steps using the AWS CLI and OpenSSL. Make sure you are running the latest version of the AWS CLI.
Overview of the credentials provider workflow
The following numbered diagram illustrates the credentials provider workflow. The diagram is followed by explanations of the steps.
To explain the steps of the workflow as illustrated in the preceding diagram:
The AWS IoT device uses the AWS SDK or custom client to make an HTTPS request to the credentials provider for a security token. The request includes the device X.509 certificate for authentication.
The credentials provider forwards the request to the AWS IoT authentication and authorization module to verify the certificate and the permission to request the security token.
If the certificate is valid and has permission to request a security token, the AWS IoT authentication and authorization module returns success. Otherwise, it returns failure, which goes back to the device with the appropriate exception.
If the checks succeed, the credentials provider invokes AWS Security Token Service (AWS STS) to assume the preconfigured IAM role on your behalf. If assuming the role succeeds, AWS STS returns a temporary, limited-privilege security token to the credentials provider.
The credentials provider returns the security token to the device.
The AWS SDK on the device uses the security token to sign an AWS request with AWS Signature Version 4.
The requested service invokes IAM to validate the signature and authorize the request against access policies attached to the preconfigured IAM role.
If IAM validates the signature successfully and authorizes the request, the request goes through.
In another solution, you could configure an AWS Lambda rule that ingests your device data and sends it to another AWS service. However, in applications that require the uploading of large files such as videos or aggregated telemetry to the AWS Cloud, you may want your devices to be able to authenticate and send data directly to the AWS service of your choice. The credentials provider enables you to do that.
Outline of the steps to retrieve and use security token
Perform the following steps as part of this solution:
Create an AWS IoT thing: Start by creating a thing that corresponds to your home thermostat in the AWS IoT thing registry database. This allows you to authenticate the request as a thing and use thing attributes as policy variables in AWS IoT and IAM policies.
Register a certificate: Create and register a certificate with AWS IoT, and attach it to the thing for successful device authentication.
Create and configure an IAM role: Create an IAM role to be assumed by the service on behalf of your device. I illustrate how to configure a trust policy and an access policy so that AWS IoT has permission to assume the role, and the token has necessary permission to make requests to DynamoDB.
Create a role alias: Create a role alias in AWS IoT. A role alias is an alternate data model pointing to an IAM role. The credentials provider request must include a role alias name to indicate which IAM role to assume for obtaining a security token from AWS STS. You may update the role alias on the server to point to a different IAM role and thus make your device obtain a security token with different permissions.
Attach a policy: Create an authorization policy with AWS IoT and attach it to the certificate to control which device can assume which role aliases.
Request a security token: Make an HTTPS request to the credentials provider and retrieve a security token.
Use the security token to sign a request: Use the retrieved token to sign a request to DynamoDB and successfully write a temperature and humidity record from your home thermostat in a specific table. Thus, starting with an X.509 certificate on your home thermostat, you can successfully upload your thermostat record to DynamoDB and use it for further analysis. Before the availability of the credentials provider, you could not do this.
Deploy the solution
1. Create an AWS IoT thing
Register your home thermostat in the AWS IoT thing registry database by creating a thing type and a thing. You can use the AWS CLI with the following command to create a thing type. The thing type allows you to store description and configuration information that is common to a set of things.
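For example, assuming the thing name MyHomeThermostat, the thing type thermostat, and the Owner attribute Alice used later in this post, the commands look like this:

aws iot create-thing-type --thing-type-name thermostat
aws iot create-thing --thing-name MyHomeThermostat --thing-type-name thermostat --attribute-payload '{"attributes": {"Owner": "Alice"}}'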
2. Register a certificate
Now, you need to have a Certificate Authority (CA) certificate, sign a device certificate using the CA certificate, and register both certificates with AWS IoT before your device can authenticate to AWS IoT. If you do not already have a CA certificate, you can use OpenSSL to create a CA certificate, as described in Use Your Own Certificate. To register your CA certificate with AWS IoT, follow the steps on Registering Your CA Certificate.
You then have to create a device certificate signed by the CA certificate and register it with AWS IoT, which you can do by following the steps on Creating a Device Certificate Using Your CA Certificate. Save the certificate and the corresponding key pair; you will use them when you request a security token later. Also, remember the password you provide when you create the certificate.
Run the following command in the AWS CLI to attach the device certificate to your thing so that you can use thing attributes in policy variables.
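The command takes the thing name and the ARN of the device certificate you registered; for example:

aws iot attach-thing-principal --thing-name MyHomeThermostat --principal <certificate-arn>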
If the attach-thing-principal command succeeds, the output is empty.
3. Configure an IAM role
Next, configure an IAM role in your AWS account that will be assumed by the credentials provider on behalf of your device. You are required to associate two policies with the role: a trust policy that controls who can assume the role, and an access policy that controls which actions can be performed on which resources by assuming the role.
The following trust policy grants the credentials provider permission to assume the role. Put it in a text document and save the document with the name, trustpolicyforiot.json.
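A minimal sketch of the trust policy, assuming the credentials provider’s service principal credentials.iot.amazonaws.com, looks like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"Service": "credentials.iot.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }
  ]
}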
The following access policy allows DynamoDB operations on the table that has the same name as the thing you created in Step 1, MyHomeThermostat, by using credentials-iot:ThingName as a policy variable. I explain the use of thing attributes as policy variables after Step 5. Put the following policy in a text document and save the document with the name accesspolicyfordynamodb.json.
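A minimal sketch of the access policy, restricted to dynamodb:PutItem and using the ${credentials-iot:ThingName} policy variable, looks like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "dynamodb:PutItem",
      "Resource": "arn:aws:dynamodb:<your_region>:<your_aws_account_id>:table/${credentials-iot:ThingName}"
    }
  ]
}

You then create the role from the trust policy and a managed policy from the access policy; the role and policy names below match the ones used in the attach command that follows:

aws iam create-role --role-name dynamodb-access-role --assume-role-policy-document file://trustpolicyforiot.json
aws iam create-policy --policy-name accesspolicyfordynamodb --policy-document file://accesspolicyfordynamodb.json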
Finally, run the following command in the AWS CLI to attach the access policy to your role.
aws iam attach-role-policy --role-name dynamodb-access-role --policy-arn arn:aws:iam::<your_aws_account_id>:policy/accesspolicyfordynamodb
If the attach-role-policy command succeeds, the output is empty.
Configure the PassRole permissions
The IAM role that you have created must be passed to AWS IoT to create a role alias, as described in Step 4. The user who performs the operation requires iam:PassRole permission to authorize this action. You also should add permission for the iam:GetRole action to allow the user to retrieve information about the specified role. Create the following policy to grant iam:PassRole and iam:GetRole permissions. Name this policy, passrolepermission.json.
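A minimal sketch of this policy, scoped to the role created earlier, looks like the following, together with the command that creates the managed policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["iam:GetRole", "iam:PassRole"],
      "Resource": "arn:aws:iam::<your_aws_account_id>:role/dynamodb-access-role"
    }
  ]
}

aws iam create-policy --policy-name passrolepermission --policy-document file://passrolepermission.json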
Now, run the following command to attach the policy to the user.
aws iam attach-user-policy --policy-arn arn:aws:iam::<your_aws_account_id>:policy/passrolepermission --user-name <user_name>
If the attach-user-policy command succeeds, the output is empty.
4. Create a role alias
Now that you have configured the IAM role, you will create a role alias with AWS IoT. You must provide the following pieces of information when creating a role alias:
RoleAlias: This is the primary key of the role alias data model and hence a mandatory attribute. It is a string; the minimum length is 1 character, and the maximum length is 128 characters.
RoleArn: This is the Amazon Resource Name (ARN) of the IAM role you have created. This is also a mandatory attribute.
CredentialDurationSeconds: This is an optional attribute specifying the validity (in seconds) of the security token. The minimum value is 900 seconds (15 minutes), and the maximum value is 3,600 seconds (60 minutes); the default value is 3,600 seconds, if not specified.
Run the following command in the AWS CLI to create a role alias. Use the credentials of the user to whom you have given the iam:PassRole permission.
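Using the role alias name referenced later in this post, the command looks like this:

aws iot create-role-alias --role-alias Thermostat-dynamodb-access-role-alias --role-arn arn:aws:iam::<your_aws_account_id>:role/dynamodb-access-role --credential-duration-seconds 3600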
5. Attach a policy
You created and registered a certificate with AWS IoT earlier for successful authentication of your device. Now, you need to create and attach a policy to the certificate to authorize the request for the security token.
Let’s say you want to allow a thing to get credentials for the role alias, Thermostat-dynamodb-access-role-alias, with thing owner Alice, thing type thermostat, and the thing attached to a principal. The following policy, with thing attributes as policy variables, achieves these requirements. After this step, I explain more about using thing attributes as policy variables. Put the policy in a text document, and save it with the name, alicethermostatpolicy.json.
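A sketch of such a policy, using the iot:AssumeRoleWithCertificate action together with thing policy variables for the owner attribute, the thing type, and the attachment check, looks like the following, with the commands that create the policy and attach it to the certificate:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "iot:AssumeRoleWithCertificate",
      "Resource": "arn:aws:iot:<your_region>:<your_aws_account_id>:rolealias/Thermostat-dynamodb-access-role-alias",
      "Condition": {
        "StringEquals": {
          "iot:Connection.Thing.Attributes[Owner]": "Alice",
          "iot:Connection.Thing.ThingTypeName": "thermostat"
        },
        "Bool": {"iot:Connection.Thing.IsAttached": "true"}
      }
    }
  ]
}

aws iot create-policy --policy-name alicethermostatpolicy --policy-document file://alicethermostatpolicy.json
aws iot attach-policy --policy-name alicethermostatpolicy --target <certificate-arn>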
If the attach-policy command succeeds, the output is empty.
You have completed all the necessary steps to request an AWS security token from the credentials provider!
Using thing attributes as policy variables
Before I show how to request a security token, I want to explain more about how to use thing attributes as policy variables and the advantage of using them. As a prerequisite, a device must provide a thing name in the credentials provider request.
Thing substitution variables in AWS IoT policies
AWS IoT Simplified Permission Management allows you to associate a connection with a specific thing, and allow the thing name, thing type, and other thing attributes to be available as substitution variables in AWS IoT policies. You can write a generic AWS IoT policy as in alicethermostatpolicy.json in Step 5, attach it to multiple certificates, and authorize the connection as a thing. For example, you could attach alicethermostatpolicy.json to certificates corresponding to each of the thermostats you have that you want to assume the role alias, Thermostat-dynamodb-access-role-alias, and allow operations only on the table with the name that matches the thing name. For more information, see the full list of thing policy variables.
Thing substitution variables in IAM policies
You also can use the following three substitution variables in the IAM role’s access policy (I used credentials-iot:ThingName in accesspolicyfordynamodb.json in Step 3):
credentials-iot:ThingName
credentials-iot:ThingTypeName
credentials-iot:AwsCertificateId
When the device provides the thing name in the request, the credentials provider fetches these three variables from the database and adds them as context variables to the security token. When the device uses the token to access DynamoDB, the variables in the role’s access policy are replaced with the corresponding values in the security token. Note that you also can use credentials-iot:AwsCertificateId as a policy variable; AWS IoT returns certificateId during registration.
6. Request a security token
Make an HTTPS request to the credentials provider to fetch a security token. You have to supply the following information:
Certificate and key pair: Because this is an HTTP request over TLS mutual authentication, you have to provide the certificate and the corresponding key pair to your client while making the request. Use the same certificate and key pair that you used during certificate registration with AWS IoT.
RoleAlias: Provide the role alias (in this example, Thermostat-dynamodb-access-role-alias) to be assumed in the request.
ThingName: Provide the thing name that you created earlier in the AWS IoT thing registry database. This is passed as a header with the name, x-amzn-iot-thingname. Note that the thing name is mandatory only if you have thing attributes as policy variables in AWS IoT or IAM policies.
Run the following command in the AWS CLI to obtain your AWS account-specific endpoint for the credentials provider. See the DescribeEndpoint API documentation for further details.
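Using the credentials provider endpoint type, the command is:

aws iot describe-endpoint --endpoint-type iot:CredentialProvider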
Note that if you are on Mac OS X, you need to export your certificate to a .pfx or .p12 file before you can pass it in the https request. Use OpenSSL with the following command to convert the device certificate from .pem to .pfx format. Remember the password because you will need it subsequently in a curl command.
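With OpenSSL the conversion looks like this (the file names are placeholders for your device certificate and private key):

openssl pkcs12 -export -in device-certificate.pem -inkey device-private-key.pem -out device-certificate.pfx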
Now, make an HTTPS request to the credentials provider to fetch a security token. You may use your preferred HTTP client for the request. I use curl in the following examples.
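A sketch of the request, using the certificate and key from earlier, the x-amzn-iot-thingname header, and the role alias in the URL path, looks like this:

curl --cert device-certificate.pem --key device-private-key.pem \
  -H "x-amzn-iot-thingname: MyHomeThermostat" \
  https://<your-credentials-provider-endpoint>/role-aliases/Thermostat-dynamodb-access-role-alias/credentials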
This command returns a security token object that has an accessKeyId, a secretAccessKey, a sessionToken, and an expiration. The following is sample output of the curl command.
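The response is a JSON object shaped roughly like this, with the values redacted:

{
  "credentials": {
    "accessKeyId": "<access-key-id>",
    "secretAccessKey": "<secret-access-key>",
    "sessionToken": "<session-token>",
    "expiration": "<expiration-timestamp>"
  }
}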
7. Use the security token to sign a request
Create a DynamoDB table called MyHomeThermostat in your AWS account. You will have to choose the hash (partition key) and the range (sort key) while creating the table to uniquely identify a record. Make the hash the serial_number of the thermostat and the range the timestamp of the record. Create a text file with the following JSON to put a temperature and humidity record in the table. Name the file item.json.
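A sketch of the file in DynamoDB’s attribute-value JSON format, with sample readings, looks like this:

{
  "serial_number": {"S": "<thermostat-serial-number>"},
  "timestamp": {"S": "<record-timestamp>"},
  "temperature": {"N": "70"},
  "humidity": {"N": "40"}
}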
You can use the accessKeyId, secretAccessKey, and sessionToken retrieved from the output of the curl command to sign a request that writes the temperature and humidity record to the DynamoDB table. Use the following commands to accomplish this.
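One way to do this is to export the temporary credentials as environment variables and then call DynamoDB with the AWS CLI, which signs the request with Signature Version 4 for you:

export AWS_ACCESS_KEY_ID=<accessKeyId-from-the-response>
export AWS_SECRET_ACCESS_KEY=<secretAccessKey-from-the-response>
export AWS_SESSION_TOKEN=<sessionToken-from-the-response>
aws dynamodb put-item --table-name MyHomeThermostat --item file://item.json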
In this blog post, I demonstrated how to retrieve a security token by using an X.509 certificate and then writing an item to a DynamoDB table by using the security token. Similarly, you could run applications on surveillance cameras or sensor devices that exchange the X.509 certificate for an AWS security token and use the token to upload video streams to Amazon Kinesis or telemetry data to Amazon CloudWatch.
If you have comments about this blog post, submit them in the “Comments” section below. If you have questions about or issues implementing this solution, start a new thread on the AWS IoT forum.
Enterprises adopt containers because they recognize the benefits: speed, agility, portability, and high compute density. They understand how accelerating application delivery and deployment pipelines makes it possible to rapidly slipstream new features to customers. Although the benefits are indisputable, this acceleration raises concerns about security and corporate compliance with software governance. In this blog post, I provide a solution that shows how Layered Insight, the pioneer and global leader in container-native application protection, can be used with seamless application build and delivery pipelines like those available in AWS CodeBuild to address these concerns.
Layered Insight solutions
Layered Insight enables organizations to unify DevOps and SecOps by providing complete visibility and control of containerized applications. Using the industry’s first embedded security approach, Layered Insight solves the challenges of container performance and protection by providing accurate insight into container images, adaptive analysis of running containers, and automated enforcement of container behavior.
AWS CodeBuild
AWS CodeBuild is a fully managed build service that compiles source code, runs tests, and produces software packages that are ready to deploy. With CodeBuild, you don’t need to provision, manage, and scale your own build servers. CodeBuild scales continuously and processes multiple builds concurrently, so your builds are not left waiting in a queue. You can get started quickly by using prepackaged build environments, or you can create custom build environments that use your own build tools.
Problem Definition
Security and compliance concerns span the lifecycle of application containers. Common concerns include:
Visibility into the container images. You need to verify the software composition information of the container image to determine whether known vulnerabilities associated with any of the software packages and libraries are included in the container image.
Governance of container images is critical because only certain open source packages/libraries, of specific versions, should be included in the container images. You need mechanisms for blacklisting all container images that include a certain version of a software package/library, or for only allowing open source software that comes with a specific type of license (such as Apache, MIT, GPL, and so on). You need to be able to address challenges such as:
· Defining the process for image compliance policies at the enterprise, department, and group levels.
· Preventing the images that fail the compliance checks from being deployed in critical environments, such as staging, pre-prod, and production.
Visibility into running container instances is critical, including:
· CPU and memory utilization.
· Security of the build environment.
· All activities (system, network, storage, and application layer) of the application code running in each container instance.
Protection of running container instances that is:
· Zero-touch to the developers (not an SDK-based approach).
· Zero touch to the DevOps team and doesn’t limit the portability of the containerized application.
· This protection must retain the option to switch to a different container stack or orchestration layer, or even to a different Container as a Service (CaaS).
· And it must be a fully automated solution to SecOps, so that the SecOps team doesn’t have to manually analyze and define detailed blacklist and whitelist policies.
Solution Details
In AWS CodeCommit, we have three projects:
● “Democode” is a simple Java application, with one buildspec to build the app into a Docker container (run by the build-demo-image CodeBuild project), and another to instrument said container (the instrument-image CodeBuild project). The resulting container is stored in the ECR repo javatest as javatest:20180415-layered. This instrumented container is running in the AWS Fargate cluster demo-java-app and can be seen in the Layered Insight runtime console as the javatest application in us-east-1.
● “aws-codebuild-docker-images” is a clone of the official aws-codebuild-docker-images repo on GitHub. This CodeCommit project is used by the build-python-builder CodeBuild project to build the Python 3.3.6 CodeBuild image, which is stored in the codebuild-python ECR repo. We then manually instructed the Layered Insight console to instrument the image.
● “scan-java-image” contains just a buildspec.yml file. This file is used by the scan-java-image CodeBuild project to instruct Layered Assessment to perform a vulnerability scan of the javatest container image built previously, and then run the scan results through a compliance policy that states there should be no medium vulnerabilities. This build fails, but in this case that is a success: the scan completes successfully, but compliance fails because there are medium-level issues found in the scan.
This build is performed using the instrumented version of the Python 3.3.6 CodeBuild image, so the activity of the processes running within the build are recorded each time within the LI console.
Build container image
Create or use a CodeCommit project with your application. To build this image and store it in Amazon Elastic Container Registry (Amazon ECR), add a buildspec file to the project and create a CodeBuild project that builds the container image, as sketched below.
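A minimal sketch of such a buildspec, assuming the same javatest:20180415 image name used later in this post (replace <account-id> and <aws-region> with your own values), looks like this:

version: 0.2
phases:
  pre_build:
    commands:
      - $(aws ecr get-login --no-include-email --region $AWS_DEFAULT_REGION)
  build:
    commands:
      - docker build -t <account-id>.dkr.ecr.<aws-region>.amazonaws.com/javatest:20180415 .
  post_build:
    commands:
      - docker push <account-id>.dkr.ecr.<aws-region>.amazonaws.com/javatest:20180415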
Scan container image
Once the image is built, create a new buildspec in the same project or a new one that looks similar to the following (update the ECR URL as necessary):
version: 0.2
phases:
  pre_build:
    commands:
      - echo Pulling down LI Scan API client scripts
      - git clone https://github.com/LayeredInsight/scan-api-example-python.git
      - echo Setting up LI Scan API client
      - cd scan-api-example-python
      - pip install layint_scan_api
      - pip install -r requirements.txt
  build:
    commands:
      - echo Scanning container started on `date`
      - IMAGEID=$(./li_add_image --name <aws-region>.amazonaws.com/javatest:20180415)
      - ./li_wait_for_scan -v --imageid $IMAGEID
      - ./li_run_image_compliance -v --imageid $IMAGEID --policyid PB15260f1acb6b2aa5b597e9d22feffb538256a01fbb4e5a95
Add the buildspec file to the git repo, push it, and then create a CodeBuild project using the instrumented Python 3.3.6 CodeBuild image at <aws-region>.amazonaws.com/codebuild-python:3.3.6-layered. Set the following environment variables in the CodeBuild project:
● LI_APPLICATIONNAME – name of the build to display
● LI_LOCATION – location of the build project to display
● LI_API_KEY – ApiKey:<key-name>:<api-key>
● LI_API_HOST – location of the Layered Insight API service
Instrument container image
Next, to instrument the new container image:
In the Layered Insight runtime console, ensure that the ECR registry and credentials are defined (click the Setup icon and the ‘+’ sign on the top right of the screen to add a new container registry). Note the name given to the registry in the console, as this needs to be referenced in the li_add_image command in the script below.
Next, add a new buildspec (with a new name) to the CodeCommit project, such as the one shown below. This code will download the Layered Insight runtime client, and use it to instruct the Layered Insight service to instrument the image that was just built:
version: 0.2
phases:
  pre_build:
    commands:
      - echo Pulling down LI API Runtime client scripts
      - git clone https://github.com/LayeredInsight/runtime-api-example-python
      - echo Setting up LI API client
      - cd runtime-api-example-python
      - pip install layint-runtime-api
      - pip install -r requirements.txt
  build:
    commands:
      - echo Instrumentation started on `date`
      - ./li_add_image --registry "Javatest ECR" --name IMAGE_NAME:TAG --description "IMAGE DESCRIPTION" --policy "Default Policy" --instrument --wait --verbose
Commit and push the new buildspec file.
Going back to CodeBuild, create a new project, with the same CodeCommit repo, but this time select the new buildspec file. Use a Python 3.3.6 builder – either the AWS or LI Instrumented version.
Click Continue
Click Save
Run the build, again on the master branch.
If everything runs successfully, a new image should appear in the ECR registry with a -layered suffix. This is the instrumented image.
Run instrumented container image
When the instrumented container is run (in ECS, Fargate, or elsewhere), it logs data back to the Layered Insight runtime console. Its appearance in the console can be modified by setting the LI_APPLICATIONNAME and LI_LOCATION environment variables when running the container.
Conclusion
In this post, we walked through the steps needed to embed governance and runtime security in your build pipelines running on AWS CodeBuild using Layered Insight.
Thanks to Raja Mani, AWS Solutions Architect, for this great blog.
—
In this blog post, I’ll walk you through the steps for setting up continuous replication of an AWS CodeCommit repository from one AWS region to another AWS region using a serverless architecture. CodeCommit is a fully-managed, highly scalable source control service that stores anything from source code to binaries. It works seamlessly with your existing Git tools and eliminates the need to operate your own source control system. Replicating an AWS CodeCommit repository from one AWS region to another AWS region enables you to achieve lower latency pulls for global developers. This same approach can also be used to automatically back up repositories currently hosted on other services (for example, GitHub or BitBucket) to AWS CodeCommit.
This solution uses AWS Lambda and AWS Fargate for continuous replication. Benefits of this approach include:
The replication process can be easily setup to trigger based on events, such as commits made to the repository.
Setting up a serverless architecture means you don’t need to provision, maintain, or administer servers.
Note: AWS Fargate has a limitation of 10 GB for storage and is available in US East (N. Virginia) region. A similar solution that uses Amazon EC2 instances to replicate the repositories on a schedule was published in a previous blog and can be used if your repository does not meet these conditions.
Replication using Fargate
As you follow this blog post, you’ll set up an architecture that looks like this:
Any change in the AWS CodeCommit repository will trigger a Lambda function. The Lambda function will call the Fargate task that replicates the repository using a Git command line tool.
Let us assume a user wants to replicate a repository (Source) from US East (N. Virginia/us-east-1) region to a repository (Destination) in US West (Oregon/us-west-2) region. I’ll walk you through the steps for it:
Prerequisites
Create an AWS service IAM role for Amazon EC2 that has permissions for both the source and destination repositories, the IAM CreateRole and AttachRolePolicy actions, and Amazon ECR. Here is the EC2 role policy I used:
You need a Docker environment to build this solution. You can launch an EC2 instance and install Docker, or you can use AWS Cloud9, which comes with Docker and Git preinstalled. I used an EC2 instance and installed Docker on it. Use the IAM role created in the previous step when creating the EC2 instance. I refer to this environment as the “Docker environment” in the following steps.
You need to install the AWS CLI on the Docker environment. For AWS CLI installation, refer to this page.
You need to install Git, including the Git command line, on the Docker environment.
Step 1: Create the Docker image
To create the Docker image, first it needs a Dockerfile. A Dockerfile is a manifest that describes the base image to use for your Docker image and what you want installed and running on it. For more information about Dockerfiles, go to the Dockerfile Reference.
1. Choose a directory in the Docker environment and perform the following steps in that directory. I used /home/ec2-user directory to perform the following steps.
2. Clone the AWS CodeCommit repository in the Docker environment. Open the terminal to the Docker environment and run the following commands to clone your source AWS CodeCommit repository (I ran the commands from /home/ec2-user directory):
Note: Change the URLs to your own source and destination repository URLs.
3. Create a file called Dockerfile (case sensitive) with the following content (I created it in /home/ec2-user directory):
# Pull the Amazon Linux latest base image
FROM amazonlinux:latest
#Install aws-cli and git command line tools
RUN yum -y install unzip aws-cli
RUN yum -y install git
WORKDIR /home/ec2-user
RUN mkdir LocalRepository
WORKDIR /home/ec2-user/LocalRepository
#Copy Cloned CodeCommit repository to Docker container
COPY ./LocalRepository /home/ec2-user/LocalRepository
#Copy shell script that does the replication
COPY ./repl_repository.bash /home/ec2-user/LocalRepository
RUN chmod ugo+rwx /home/ec2-user/LocalRepository/repl_repository.bash
WORKDIR /home/ec2-user/LocalRepository
#Call this script when Docker starts the container
ENTRYPOINT ["/home/ec2-user/LocalRepository/repl_repository.bash"]
4. Copy the following shell script into a file called repl_repository.bash in the Dockerfile directory in the Docker environment (I created it in the /home/ec2-user directory).
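A minimal sketch of what the script does (configure the AWS CLI credential helper for CodeCommit over HTTPS, pull the latest changes from the source repository, and push them to the destination repository in us-west-2) looks like this:

#!/bin/bash
# Configure git to authenticate to CodeCommit over HTTPS with the instance/task role
git config --global credential.helper '!aws codecommit credential-helper $@'
git config --global credential.UseHttpPath true
# Pull the latest changes from the source repository into the working copy
git pull
# Push all branches to the destination repository in the target region
git push https://git-codecommit.us-west-2.amazonaws.com/v1/repos/Destination --all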
6. Verify whether the replication is working by running the repl_repository.bash script from the LocalRepository directory. Go to LocalRepository directory and run this command: . ../repl_repository.bash If it is successful, you will get the “Everything up-to-date” at the last line of the result like this:
$ . ../repl_repository.bash
Everything up-to-date
Step 2: Build the Docker Image
1. Build the Docker image by running this command from the directory where you created the DockerFile in the Docker environment in the previous step (I ran it from /home/ec2-user directory):
$ docker build . -t ccrepl
Output: It installs various packages and sets environment variables as part of steps 1 to 3 from the Dockerfile. Steps 4 to 11 from the Dockerfile should produce an output similar to the following:
2. Run the following command to verify that the image was created successfully. It will display “Everything up-to-date” at the end if it is successful.
[ec2-user@<docker-environment> LocalRepository]$ docker run ccrepl
Everything up-to-date
Step 3: Push the Docker Image to Amazon Elastic Container Registry (ECR)
Perform the following steps in the Docker Environment.
1. Run the AWS CLI configure command and set default region as your source repository region (I used us-east-1).
$ aws configure set default.region <Source Repository Region>
2. Create an Amazon ECR repository using this command to store your ccrepl image (Note the repositoryUri in the output):
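Creating the repository and pushing the image typically looks like this (run the docker login command produced by aws ecr get-login first, and replace <repositoryUri> with the value returned by create-repository):

$ aws ecr create-repository --repository-name ccrepl
$ docker tag ccrepl:latest <repositoryUri>:latest
$ docker push <repositoryUri>:latest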
Step 4: Create the Fargate Task
1. Create a file called trustpolicyforecs.json in the Docker environment that allows Amazon ECS tasks to assume a role (a sketch is shown after the next command).
2. Create a role called AccessRoleForCCfromFG using the following command in the DockerEnvironment:
$ aws iam create-role --role-name AccessRoleForCCfromFG --assume-role-policy-document file://trustpolicyforecs.json
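A minimal sketch of trustpolicyforecs.json, using the ECS tasks service principal, looks like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"Service": "ecs-tasks.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }
  ]
}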
3. Assign CodeCommit service full access to the above role using the following command in the DockerEnvironment:
$ aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/AWSCodeCommitFullAccess --role-name AccessRoleForCCfromFG
4. In the Amazon ECS Console, choose Repositories and select the ccrepl repository that was created in the previous step. Copy the Repository URI.
5. In the Amazon ECS Console, choose Task Definitions and click Create New Task Definition.
6. Select launch type compatibility as FARGATE and click Next Step.
7. In the create task definition screen, do the following:
In Task Definition Name, type ccrepl
In Task Role, choose AccessRoleForCCfromFG
In Task Memory, choose 2GB
In Task CPU, choose 1 vCPU
Click Add Container under Container Definitions in the same screen. In the Add Container screen, do the following:
Enter Container name as ccreplcont
Enter Image URL copied from step 4
Enter Memory Limits as 128 and click Add.
Note: Select TaskExecutionRole as “ecsTaskExecutionRole” if it already exists. If not, select create new role and it will create “ecsTaskExecutionRole” for you.
8. Click the Create button in the task definition screen to create the task. It will successfully create the task, execution role and AWS CloudWatch Log groups.
9. In the Amazon ECS Console, click Clusters and create cluster. Select template as “Networking only, Powered by AWS Fargate” and click next step.
10. Enter cluster name as ccreplcluster and click create.
Step 5: Create the Lambda Function
In this section, I use the Amazon Elastic Container Service (ECS) RunTask API from Lambda to invoke the Fargate task.
1. In the IAM Console, create a new role called ECSLambdaRole with permissions for AWS CodeCommit and Amazon ECS, as well as the iam:PassRole privileges needed to run the ECS task. Your statement should look similar to the following (replace <your account id>):
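A minimal sketch covering ecs:RunTask, passing the task roles, and writing CloudWatch Logs looks like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "ecs:RunTask",
      "Resource": "arn:aws:ecs:us-east-1:<your account id>:task-definition/ccrepl:*"
    },
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": [
        "arn:aws:iam::<your account id>:role/AccessRoleForCCfromFG",
        "arn:aws:iam::<your account id>:role/ecsTaskExecutionRole"
      ]
    },
    {
      "Effect": "Allow",
      "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
      "Resource": "*"
    }
  ]
}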
2. In AWS management console, select VPC service and click subnets in the left navigation screen. Note down the Subnet IDs that you want to run the Fargate task in.
3. Create a new Lambda Node.js function called FargateTaskExecutionFunc and assign the role ECSLambdaRole with the following content:
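A minimal sketch of the function, calling the ECS RunTask API with the Fargate launch type and the cluster and task definition names created earlier (the subnet IDs are placeholders), looks like this:

// Sketch of FargateTaskExecutionFunc: starts the replication task on Fargate
const AWS = require('aws-sdk');
const ecs = new AWS.ECS();

exports.handler = (event, context, callback) => {
    const params = {
        cluster: 'ccreplcluster',
        taskDefinition: 'ccrepl',
        launchType: 'FARGATE',
        count: 1,
        networkConfiguration: {
            awsvpcConfiguration: {
                subnets: ['subnet-XXXXXXXX', 'subnet-YYYYYYYY'], // replace with your subnet IDs
                assignPublicIp: 'ENABLED'
            }
        }
    };
    // Run the Fargate task that replicates the repository
    ecs.runTask(params, (err, data) => {
        if (err) {
            console.error(err);
            callback(err);
        } else {
            console.log('Started Fargate task:', JSON.stringify(data.tasks));
            callback(null, data);
        }
    });
};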
Note: Replace the subnets values with the subnet IDs you identified as the subnets you want to run the Fargate task on in Step 2 of this section.
1. In the Lambda Console, click FargateTaskExecutionFunc under functions.
2. Under Add triggers in the Designer, select CodeCommit
3. In the Configure triggers screen, do the following:
Enter Repository name as Source (your source repository name)
Enter trigger name as LambdaTrigger
Leave the Events as “All repository events”
Leave the Branch names as “All branches”
Click Add button
Click Save button to save the changes
Step 6: Verification
To test the application, make a commit and push the changes to the source repository in AWS CodeCommit. That should automatically trigger the Lambda function and replicate the changes in the destination repository. You can verify this by checking CloudWatch Logs for Lambda and ECS, or simply going to the destination repository and verifying the change appears.
Conclusion
Congratulations! You have successfully configured repository replication of an AWS CodeCommit repository using AWS Lambda and AWS Fargate. You can use this technique in a deployment pipeline. You can also tweak the trigger configuration in AWS CodeCommit to call the Lambda function in response to any supported trigger event in AWS CodeCommit.
Let’s Encrypt recently launched SCT embedding in certificates (https://community.letsencrypt.org/t/signed-certificate-timestamps-embedded-in-certificates/57187). This feature allows browsers to check that a certificate was submitted to a Certificate Transparency log. As part of the launch, we did a thorough review that the encoding of Signed Certificate Timestamps (SCTs) in our certificates matches the relevant specifications. In this post, I’ll dive into the details. You’ll learn more about X.509, ASN.1, DER, and TLS encoding, with references to the relevant RFCs.
Certificate Transparency offers three ways to deliver SCTs to a browser: in a TLS extension, in stapled OCSP, or embedded in a certificate. We chose to implement the embedding method because it would just work for Let’s Encrypt subscribers without additional work. In the SCT embedding method, we submit a “precertificate” with a poison extension to a set of CT logs, and get back SCTs. We then issue a real certificate based on the precertificate, with two changes: the poison extension is removed, and the SCTs obtained earlier are added in another extension.
Given a certificate, let’s first look for the SCT list extension. According to CT (RFC 6962, section 3.3, https://tools.ietf.org/html/rfc6962#section-3.3), the extension OID for a list of SCTs is 1.3.6.1.4.1.11129.2.4.2. An OID (object ID) is a series of integers, hierarchically assigned and globally unique. They are used extensively in X.509, for instance to uniquely identify extensions.
We can download an example certificate (https://acme-v01.api.letsencrypt.org/acme/cert/031f2484307c9bc511b3123cb236a480d451) and view it using OpenSSL (if your OpenSSL is old, it may not display the detailed information):
$ openssl x509 -noout -text -inform der -in Downloads/031f2484307c9bc511b3123cb236a480d451
…
    CT Precertificate SCTs:
        Signed Certificate Timestamp:
            Version   : v1(0)
            Log ID    : DB:74:AF:EE:CB:29:EC:B1:FE:CA:3E:71:6D:2C:E5:B9:
                        AA:BB:36:F7:84:71:83:C7:5D:9D:4F:37:B6:1F:BF:64
            Timestamp : Mar 29 18:45:07.993 2018 GMT
            Extensions: none
            Signature : ecdsa-with-SHA256
                        30:44:02:20:7E:1F:CD:1E:9A:2B:D2:A5:0A:0C:81:E7:
                        13:03:3A:07:62:34:0D:A8:F9:1E:F2:7A:48:B3:81:76:
                        40:15:9C:D3:02:20:65:9F:E9:F1:D8:80:E2:E8:F6:B3:
                        25:BE:9F:18:95:6D:17:C6:CA:8A:6F:2B:12:CB:0F:55:
                        FB:70:F7:59:A4:19
        Signed Certificate Timestamp:
            Version   : v1(0)
            Log ID    : 29:3C:51:96:54:C8:39:65:BA:AA:50:FC:58:07:D4:B7:
                        6F:BF:58:7A:29:72:DC:A4:C3:0C:F4:E5:45:47:F4:78
            Timestamp : Mar 29 18:45:08.010 2018 GMT
            Extensions: none
            Signature : ecdsa-with-SHA256
                        30:46:02:21:00:AB:72:F1:E4:D6:22:3E:F8:7F:C6:84:
                        91:C2:08:D2:9D:4D:57:EB:F4:75:88:BB:75:44:D3:2F:
                        95:37:E2:CE:C1:02:21:00:8A:FF:C4:0C:C6:C4:E3:B2:
                        45:78:DA:DE:4F:81:5E:CB:CE:2D:57:A5:79:34:21:19:
                        A1:E6:5B:C7:E5:E6:9C:E2
Now let’s go a little deeper. How is that extension represented in the certificate? Certificates are expressed in ASN.1 (https://en.wikipedia.org/wiki/Abstract_Syntax_Notation_One), which generally refers to both a language for expressing data structures and a set of formats for encoding them. The most common format, DER (https://en.wikipedia.org/wiki/X.690#DER_encoding), is a tag-length-value format. That is, to encode an object, first you write down a tag representing its type (usually one byte), then you write down a number expressing how long the object is, then you write down the object contents. This is recursive: an object can contain multiple objects within it, each of which has its own tag, length, and value.
One of the cool things about DER and other tag-length-value formats is that you can decode them to some degree without knowing what they mean. For instance, I can tell you that 0x30 means the data type “SEQUENCE” (a struct, in ASN.1 terms), and 0x02 means “INTEGER”, then give you this hex byte sequence to decode:

30 06 02 01 03 02 01 0A

You could tell me right away that decodes to:
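Based on the tag values just described, the bytes break down as:

30 06        SEQUENCE, 6 bytes of content
   02 01 03  INTEGER 3
   02 01 0A  INTEGER 10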
Try it yourself with this great JavaScript ASN.1 decoder (https://lapo.it/asn1js/#300602010302010A). However, you wouldn’t know what those integers represent without the corresponding ASN.1 schema (or “module”). For instance, if you knew that this was a piece of DogData, and the schema was:

DogData ::= SEQUENCE {
  legs INTEGER,
  cutenessLevel INTEGER
}

You’d know this referred to a three-legged dog with a cuteness level of 10.
We can take some of this knowledge and apply it to our certificates. As a first step, convert the above certificate to hex with xxd -ps < Downloads/031f2484307c9bc511b3123cb236a480d451. You can then copy and paste the result into https://lapo.it/asn1js.
You can also run openssl asn1parse -i -inform der -in Downloads/031f2484307c9bc511b3123cb236a480d451 to use OpenSSL’s parser, which is less easy to use in some ways, but easier to copy and paste.
<p>In the decoded data, we can find the OID <code>1.3.6.1.4.1.11129.2.4.2</code>, indicating the SCT list extension. Per <a href="https://tools.ietf.org/html/rfc5280#page-17">RFC 5280, section 4.1</a>, an extension is defined:</p>
<pre><code>Extension  ::=  SEQUENCE  {
     extnID      OBJECT IDENTIFIER,
     critical    BOOLEAN DEFAULT FALSE,
     extnValue   OCTET STRING
                 -- contains the DER encoding of an ASN.1 value
                 -- corresponding to the extension type identified
                 -- by extnID
     }
</code></pre>
<p>We’ve found the <code>extnID</code>. The “critical” field is omitted because it has the default value (false). Next up is the <code>extnValue</code>. This has the type <code>OCTET STRING</code>, which has the tag “0x04”. <code>OCTET STRING</code> means “here’s a bunch of bytes!” In this case, as described by the spec, those bytes happen to contain more DER. This is a fairly common pattern in X.509 to deal with parameterized data. For instance, this allows defining a structure for extensions without knowing ahead of time all the structures that a future extension might want to carry in its value. If you’re a C programmer, think of it as a <code>void*</code> for data structures. If you prefer Go, think of it as an <code>interface{}</code>.</p>
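<p>Here’s the extnValue TLV for the SCT list extension, taken from the certificate hex above (trailing bytes elided):</p>
<pre><code>0481F50481F200F0007500DB74AFEEC… </code></pre>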
<p>That’s tag “0x04”, meaning <code>OCTET STRING</code>, followed by “0x81 0xF5”, meaning “this string is 245 bytes long” (the 0x81 prefix is part of <a href="#variable-length">variable length number encoding</a>).</p>
<p>According to <a href="https://tools.ietf.org/html/rfc6962#section-3.3">RFC 6962, section 3.3</a>, “obtained SCTs can be directly embedded in the final certificate, by encoding the SignedCertificateTimestampList structure as an ASN.1 <code>OCTET STRING</code> and inserting the resulting data in the TBSCertificate as an X.509v3 certificate extension”.</p>
<p>So, we have an <code>OCTET STRING</code>, all’s good, right? Except if you remove the tag and length from extnValue to get its value, you’re left with:</p>
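<pre><code>0481F200F0007500DB74AFEEC… </code></pre>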
<p>There’s that “0x04” tag again, but with a shorter length. Why do we nest one <code>OCTET STRING</code> inside another? It’s because the contents of extnValue are required by RFC 5280 to be valid DER, but a SignedCertificateTimestampList is not encoded using DER (more on that in a minute). So, by RFC 6962, a SignedCertificateTimestampList is wrapped in an <code>OCTET STRING</code>, which is wrapped in another <code>OCTET STRING</code> (the extnValue).</p>
<p>Once we decode that second <code>OCTET STRING</code>, we’re left with the contents:</p>
<pre><code>00F0007500DB74AFEEC… </code></pre>
<p>“0x00” isn’t a valid tag in DER. What is this? It’s TLS encoding. This is defined in <a href="https://tools.ietf.org/html/rfc5246#section-4">RFC 5246, section 4</a> (the TLS 1.2 RFC). TLS encoding, like ASN.1, has both a way to define data structures and a way to encode those structures. TLS encoding differs from DER in that there are no tags, and lengths are only encoded when necessary for variable-length arrays. Within an encoded structure, the type of a field is determined by its position, rather than by a tag. This means that TLS-encoded structures are more compact than DER structures, but also that they can’t be processed without knowing the corresponding schema. For instance, here’s the top-level schema from <a href="https://tools.ietf.org/html/rfc6962#section-3.3">RFC 6962, section 3.3</a>:</p>
<pre><code>The contents of the ASN.1 OCTET STRING embedded in an OCSP extension or X509v3 certificate extension are as follows:

        opaque SerializedSCT<1..2^16-1>;

        struct {
            SerializedSCT sct_list <1..2^16-1>;
        } SignedCertificateTimestampList;

Here, "SerializedSCT" is an opaque byte string that contains the serialized TLS structure. </code></pre>
<p>Right away, we’ve found one of those variable-length arrays. The length of such an array (in bytes) is always represented by a length field just big enough to hold the max array size. The max size of an <code>sct_list</code> is 65535 bytes, so the length field is two bytes wide. Sure enough, those first two bytes are “0x00 0xF0”, or 240 in decimal. In other words, this <code>sct_list</code> will have 240 bytes. We don’t yet know how many SCTs will be in it. That will become clear only by continuing to parse the encoded data and seeing where each struct ends (spoiler alert: there are two SCTs!).</p>
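<p>If it helps to see that bookkeeping in code, here is a small illustrative Python sketch (not from the original post) that peels those length prefixes off a TLS-encoded SCT list:</p>
<pre><code>def split_sct_list(sct_list_bytes):
    """Split a TLS-encoded SignedCertificateTimestampList into its SCTs."""
    total = int.from_bytes(sct_list_bytes[0:2], 'big')   # two-byte list length (240 here)
    body = sct_list_bytes[2:2 + total]
    scts, i = [], 0
    while i < len(body):
        sct_len = int.from_bytes(body[i:i + 2], 'big')   # two-byte SerializedSCT length
        scts.append(body[i + 2:i + 2 + sct_len])
        i += 2 + sct_len
    return scts
</code></pre>
<p>Feeding it the 240-byte list from this certificate returns two byte strings (the two SCTs), whose lengths we’ll now work out by hand.</p>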
<p>Now we know the first SerializedSCT starts with <code>0075…</code>. SerializedSCT is itself a variable-length field, this time containing <code>opaque</code> bytes (much like <code>OCTET STRING</code> back in the ASN.1 world). Like SignedCertificateTimestampList, it has a max size of 65535 bytes, so we pull off the first two bytes and discover that the first SerializedSCT is 0x0075 (117 decimal) bytes long. Here’s the whole thing, in hex:</p>
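<pre><code>00DB74AFEECB29ECB1FECA3E716D2CE5B9AABB36F7847183C75D9D4F37B61FBF64000001627313EB19000004030046304402207E1FCD1E9A2BD2A50A0C81E713033A0762340DA8F91EF27A48B3817640159CD30220659FE9F1D880E2E8F6B325BE9F18956D17C6CA8A6F2B12CB0F55FB70F759A419</code></pre>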
<p>This can be decoded using the TLS encoding struct defined in <a href="https://tools.ietf.org/html/rfc6962#page-13">RFC 6962, section 3.2</a>:</p>
<pre><code>enum { v1(0), (255) } Version;

struct {
    opaque key_id[32];
} LogID;

opaque CtExtensions<0..2^16-1>;
…

struct {
    Version sct_version;
    LogID id;
    uint64 timestamp;
    CtExtensions extensions;
    digitally-signed struct {
        Version sct_version;
        SignatureType signature_type = certificate_timestamp;
        uint64 timestamp;
        LogEntryType entry_type;
        select(entry_type) {
            case x509_entry: ASN.1Cert;
            case precert_entry: PreCert;
        } signed_entry;
        CtExtensions extensions;
    };
} SignedCertificateTimestamp; </code></pre>
<p>Breaking that down:</p>
<pre><code># Version sct_version v1(0)
00
# LogID id (aka opaque key_id[32])
DB74AFEECB29ECB1FECA3E716D2CE5B9AABB36F7847183C75D9D4F37B61FBF64
# uint64 timestamp (milliseconds since the epoch)
000001627313EB19
# CtExtensions extensions (zero-length array)
0000
# digitally-signed struct
04030046304402207E1FCD1E9A2BD2A50A0C81E713033A0762340DA8F91EF27A48B3817640159CD30220659FE9F1D880E2E8F6B325BE9F18956D17C6CA8A6F2B12CB0F55FB70F759A419
</code></pre>
<p>To understand the “digitally-signed struct,” we need to turn back to <a href="https://tools.ietf.org/html/rfc5246#section-4.7">RFC 5246, section 4.7</a>. It says:</p>
<pre><code>A digitally-signed element is encoded as a struct DigitallySigned:

struct {
   SignatureAndHashAlgorithm algorithm;
   opaque signature<0..2^16-1>;
} DigitallySigned;
</code></pre>
<p>We have “0x0403”, which corresponds to sha256(4) and ecdsa(3). The next two bytes, “0x0046”, tell us the length of the “opaque signature” field, 70 bytes in decimal. To decode the signature, we reference <a href="https://tools.ietf.org/html/rfc4492#page-20">RFC 4492 section 5.4</a>, which says:</p>
<pre><code>The digitally-signed element is encoded as an opaque vector <0..2^16-1>, the contents of which are the DER encoding corresponding to the following ASN.1 notation.
Ecdsa-Sig-Value ::= SEQUENCE { r INTEGER, s INTEGER } </code></pre>
<p>Having dived through two layers of TLS encoding, we are now back in ASN.1 land! We <a href="https://lapo.it/asn1js/#304402207E1FCD1E9A2BD2A50A0C81E713033A0762340DA8F91EF27A48B3817640159CD30220659FE9F1D880E2E8F6B325BE9F18956D17C6CA8A6F2B12CB0F55FB70F759A419">decode</a> the remaining bytes into a SEQUENCE containing two INTEGERS. And we’re done! Here’s the whole extension decoded:</p>
<p>One surprising thing you might notice: In the first SCT, <code>r</code> and <code>s</code> are thirty-two bytes long. In the second SCT, they are both thirty-three bytes long, and have a leading zero. Integers in DER are two’s complement, so if the leftmost bit is set, they are interpreted as negative. Since <code>r</code> and <code>s</code> are positive, if the leftmost bit would otherwise be a 1, an extra zero byte has to be added so that the leftmost bit can be 0.</p>
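<p>A quick illustrative check in Python (not from the original post), using the second SCT’s <code>r</code> value from the hex above:</p>
<pre><code># DER INTEGERs are two's complement, so a positive value whose top bit is set
# needs a leading 0x00 byte to keep it from being read as negative.
r = 0xAB72F1E4D6223EF87FC68491C208D29D4D57EBF47588BB7544D32F9537E2CEC1
encoded = r.to_bytes((r.bit_length() + 8) // 8, 'big')  # minimal two's-complement encoding
print(len(encoded), encoded.hex()[:10])                 # 33 00ab72f1e4
</code></pre>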
<p>This is a little taste of what goes into encoding a certificate. I hope it was informative! If you’d like to learn more, I recommend “<a href="http://luca.ntop.org/Teaching/Appunti/asn1.html">A Layman’s Guide to a Subset of ASN.1, BER, and DER</a>.”</p>
<p><a name="poison"></a>Footnote 1: A “poison extension” is defined by <a href="https://tools.ietf.org/html/rfc6962#section-3.1">RFC 6962 section 3.1</a>:</p>
<pre><code>The Precertificate is constructed from the certificate to be issued by adding a special critical poison extension (OID `1.3.6.1.4.1.11129.2.4.3`, whose extnValue OCTET STRING contains ASN.1 NULL data (0x05 0x00)) </code></pre>
<p>In other words, it’s an empty extension whose only purpose is to ensure that certificate processors will not accept precertificates as valid certificates. The specification ensures this by setting the “critical” bit on the extension, which ensures that code that doesn’t recognize the extension will reject the whole certificate. Code that does recognize the extension specifically as poison will also reject the certificate.</p>
<p><a name="variable-length"></a>Footnote 2: Lengths from 0-127 are represented by a single byte (short form). To express longer lengths, more bytes are used (long form). The high bit (0x80) on the first byte is set to distinguish long form from short form. The remaining bits are used to express how many more bytes to read for the length. For instance, 0x81F5 means “this is long form because the length is greater than 127, but there’s still only one byte of length (0xF5) to decode.”</p>
I thought I’d write up some notes on the memcached DDoS. Specifically, I describe how many vulnerable memcached servers I found by scanning the Internet with masscan, and how masscan can be used as a kill switch to neuter the worst of the attacks.
Test your servers
I added code to my port scanner for this, then scanned the Internet.
The scan covers the entire Internet (/0); replace 0.0.0.0/0 with your own address range (or ranges) to test just your servers.
This produces output that looks like this:
Banner on port 11211/udp on 172.246.132.226: [memcached] uptime=230130 time=1520485357 version=1.4.13
Banner on port 11211/udp on 89.110.149.218: [memcached] uptime=3935192 time=1520485363 version=1.4.17
Banner on port 11211/udp on 172.246.132.226: [memcached] uptime=230130 time=1520485357 version=1.4.13
Banner on port 11211/udp on 84.200.45.2: [memcached] uptime=399858 time=1520485362 version=1.4.20
Banner on port 11211/udp on 5.1.66.2: [memcached] uptime=29429482 time=1520485363 version=1.4.20
Banner on port 11211/udp on 103.248.253.112: [memcached] uptime=2879363 time=1520485366 version=1.2.6
Banner on port 11211/udp on 193.240.236.171: [memcached] uptime=42083736 time=1520485365 version=1.4.13
The “banners” check keeps only hosts that return a valid memcached response, so you don’t get other stuff that isn’t memcached. To filter this output further, use ‘cut’ to grab just column 6:
… | cut -d ' ' -f 6 | cut -d: -f1
You often get multiple responses to just one query, so you’ll want to sort/uniq the list:
… | sort | uniq
My results from an Internet wide scan
I got 15181 results (or roughly 15,000).
People are using Shodan to find lists of memcached servers. They might be getting a lot of results back for servers that respond on TCP instead of UDP. Only UDP can be used for the attack.
Other researchers scanned the Internet a few days ago and found ~31k. I don’t know if this means people have been removing these from the Internet.
Masscan as exploit script
BTW, you can use masscan not only to find amplifiers but also to carry out the DDoS: simply import the list of amplifier IP addresses, then spoof the source address as that of the target. All the responses will go back to that spoofed address, in other words to the target.
I point this out to show how there’s no magic in exploiting this. Numerous exploit scripts have been released, because it’s so easy.
Why memcached servers are vulnerable
Like many servers, memcached listens to local IP address 127.0.0.1 for local administration. By listening only on the local IP address, remote people cannot talk to the server.
However, this configuration is often botched, and the daemon ends up listening on either 0.0.0.0 (all interfaces) or on one of the external interfaces. There’s a common Linux networking issue where this keeps happening, such as when trying to get VMs connected to the network. I forget the exact details, but the point is that lots of servers that intend to listen only on 127.0.0.1 end up listening on external interfaces instead. Binding to localhost is not a good security barrier.
Thus, there are lots of memcached servers listening on their control port (11211) on external interfaces.
How the protocol works
The protocol is documented here. It’s pretty straightforward.
The easiest amplification attack is to send the “stats” command. This is a 15-byte UDP packet that causes the server to send back a large response full of statistics about the server; you often see around 10 kilobytes of response across several packets.
A harder, but more effective, attack uses a two-step process. You first use the “add” or “set” commands to put chunks of data into the server, then send a “get” command to retrieve it. You can easily put 100 megabytes of data into the server this way, and cause it all to be retrieved with a single “get” command.
That’s why this has been the largest amplification attack ever: a single 100-byte packet can, in theory, trigger a 100-megabyte response.
Doing the math, the 1.3 terabit/second DDoS divided across the roughly 15,000 vulnerable servers I found works out to an average of under 100 megabits/second (about 87 Mbps) per server. This is fairly minor, and is indeed something even small servers (like Raspberry Pis) can generate.
Neutering the attack (“kill switch”)
If they are using the more powerful attack against you, you can neuter it: you can send a “flush_all” command back at the servers who are flooding you, causing them to drop all those large chunks of data from the cache.
I’m going to describe how I would do this.
First, get a list of attackers, meaning, the amplifiers that are flooding you. The way to do this is grab a packet sniffer and capture all packets with a source port of 11211. Here is an example using tcpdump.
tcpdump -i <interface> -w attackers.pcap src port 11211
Let that run for a while, then hit [ctrl-c] to stop, then extract the list of IP addresses in the capture file. The way I do this is with tshark (comes with Wireshark):
Now, craft a flush_all payload. There are many ways of doing this. For example, if you are using nmap or masscan, you can add the bytes to the nmap-payloads file. Masscan can also read the payload directly from a packet capture file. To do this, first craft a packet, such as with the following command line foo:
Capture this packet using tcpdump or something, and save into a file “flush_all.pcap”. If you want to skip this step, I’ve already done this for you, go grab the file from GitHub:
Reportedly, “shutdown” may also work to completely shutdown the amplifiers. I’ll leave that as an exercise for the reader, since of course you’ll be adversely affecting the servers.
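If you’d rather craft and send the flush_all probe programmatically, here is a minimal Python sketch. It assumes memcached’s standard 8-byte UDP frame header, and reads amplifier IPs from a hypothetical attackers.txt file (one per line); only send this to servers that are actively flooding you.
import socket
import struct

def flush(ip, port=11211, timeout=2):
    # memcached's UDP protocol prefixes each datagram with an 8-byte frame header:
    # request id, sequence number, total datagrams, reserved.
    header = struct.pack('!HHHH', 0, 0, 1, 0)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(header + b'flush_all\r\n', (ip, port))
        try:
            print(ip, s.recvfrom(1500)[0])      # a cooperating server answers with "OK"
        except socket.timeout:
            print(ip, 'no response')

# 'attackers.txt' is a hypothetical file holding one amplifier IP per line
for line in open('attackers.txt'):
    flush(line.strip())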
The AWS team is always eager to add features that make it easier for you to protect your sensitive data and to help you to achieve your compliance objectives. For example, in 2017 we launched encryption at rest for SQS and EFS, additional encryption options for S3, and server-side encryption of Kinesis Data Streams.
Today we are giving you another data protection option with the introduction of encryption at rest for Amazon DynamoDB. You simply enable encryption when you create a new table and DynamoDB takes care of the rest. Your data (tables, local secondary indexes, and global secondary indexes) will be encrypted using AES-256 and a service-default AWS Key Management Service (KMS) key. The encryption adds no storage overhead and is completely transparent; you can insert, query, scan, and delete items as before. The team did not observe any changes in latency after enabling encryption and running several different workloads on an encrypted DynamoDB table.
Creating an Encrypted Table
You can create an encrypted table from the AWS Management Console, API (CreateTable), or CLI (create-table). I’ll use the console! I enter the name and set up the primary key as usual:
Before proceeding, I uncheck Use default settings, scroll down to the Encryption section, and check Enable encryption. Then I click Create and my table is created in encrypted form:
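If you prefer the CreateTable API over the console, a minimal boto3 sketch looks like this; the table name, key schema, and throughput values are placeholders:
import boto3

dynamodb = boto3.client('dynamodb')

dynamodb.create_table(
    TableName='MyEncryptedTable',                       # placeholder name
    AttributeDefinitions=[{'AttributeName': 'id', 'AttributeType': 'S'}],
    KeySchema=[{'AttributeName': 'id', 'KeyType': 'HASH'}],
    ProvisionedThroughput={'ReadCapacityUnits': 5, 'WriteCapacityUnits': 5},
    SSESpecification={'Enabled': True},                 # encryption at rest
)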
I can see the encryption setting for the table at a glance:
When my compliance team asks me to show them how DynamoDB uses the key to encrypt the data, I can create an AWS CloudTrail trail, insert an item, and then scan the table to see the calls to the AWS KMS API. Here’s an extract from the trail:
Available Now
This feature is available now in the US East (N. Virginia), US East (Ohio), US West (Oregon), and EU (Ireland) Regions and you can start using it today.
There’s no charge for the encryption; you will be charged for the calls that DynamoDB makes to AWS KMS on your behalf.
Thank you to my colleague Harvey Bendana for this blog on how to do shallow cloning on AWS CodeBuild using GitHub Enterprise as a source.
Today we are announcing support for using GitHub Enterprise as a source type for CodeBuild. You can now initiate build tasks from changes in source code hosted on your own implementation of GitHub Enterprise.
We are also announcing support for shallow cloning of a repo when you use CodeCommit, Bitbucket, GitHub, or GitHub Enterprise as a source type. Shallow cloning allows you to truncate the history of a repo in order to save space and speed up cloning times.
In this post, I’ll walk you through how to configure GitHub Enterprise as a source type with a defined clone depth for an AWS CodeBuild project. I’ll also show you all the moving parts associated with a successful implementation.
AWS CodeBuild is a fully managed build service. There are no servers to provision and scale, or software to install, configure, and operate. You just specify the location of your source code, choose your build settings, and CodeBuild runs build scripts for compiling, testing, and packaging your code.
GitHub Enterprise is the on-premises version of GitHub.com. It makes collaborative coding possible and enjoyable for large-scale enterprise software development teams.
Many enterprises choose GitHub Enterprise as their preferred source code/version control repository because it can be hosted in their own trusted network, whether that is an on-premises data center or their own Amazon VPCs.
Requirements
You’ll need an AWS account.
You’ll need a GitHub Enterprise implementation with a repo. If you’d like to deploy one inside your own Amazon VPC, check out our Quick Start Guide.
You’ll need an S3 bucket to store your GitHub Enterprise self-signed SSL certificate.
Download your GitHub Enterprise SSL certificate:
Note: The following steps are required only if you are using a self-signed certificate. You can skip installing the certificate if you instead default to HTTP communication with your repo. For this post, I am using a self-signed certificate and the Firefox browser. These steps may vary, depending on your browser of choice.
1. Navigate to your GitHub Enterprise environment and sign in with your credentials.
2. Choose the lock icon in the upper-left corner to view and export your GitHub Enterprise certificate.
3. When you export and download your certificate, make sure you select the format type, which includes the entire certificate chain.
2. If there are no existing build projects, a welcome page is displayed. Choose Get started. If you already have build projects, then choose Create project.
3. On the Configure your project page, type a name for your build project. Build project names must be unique across each AWS account.
6. Enter your personal access token in your CodeBuild project and choose Save Token.
7. Enter the repository URL and choose a Git clone depth value that makes sense for you. Allowed values are 1, 5, 25, or Full. For this post, I am using a depth of 1.
8. Select the Webhook check box.
9. Continue with the rest of the configuration for your project, choosing options that best suit your build needs. For this post, I am using an AWS CodeBuild managed image running the Ubuntu OS with the base runtime configuration. Enter your build specifications or build commands. I am using a simple build command of git log . so that it can be easily found in the CloudWatch logs of the CodeBuild project. It will also be used to demonstrate the shallow clone feature.
10. Next, select Install certificate from your S3 to install your GitHub Enterprise self-signed certificate from S3. For Bucket of certificate, I’ve entered the S3 bucket where I uploaded the certificate. For Object key of certificate, I’ve entered the name of the certificate.
11. Lastly, configure artifacts, caching, IAM roles, and VPC configurations. For this post, I chose not to generate any artifacts from this build. From the following screenshot, you’ll see I’ve opted out of cache, requested a new IAM role with the required permissions, and have not defined VPC access. Choose Continue to validate and complete the creation of the CodeBuild project.
Note: If your GitHub Enterprise environment is in an Amazon VPC, configure VPC access for your project. Define the VPC ID, subnet ID, and security group so that your project has access to the EC2 instances hosting your GitHub Enterprise environment.
12. After the project is created, a dialog box displays a CodeBuild payload URL and secret. They are used to create a webhook for the repo in the GitHub Enterprise environment.
Create a webhook in your GitHub Enterprise repo:
1. In your GitHub Enterprise repo, navigate to Settings, choose Hooks & services, and then choose Add webhook.
2. Paste the payload URL and secret into their respective fields. Under Which events would you like to trigger this webhook? choose an option. For this post, I am using Let me select individual events. I then chose Pull request and Push as the two event triggers.
3. Make sure Active is selected and then choose Add webhook.
4. A webhook has now been created in the GitHub Enterprise repo.
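If you prefer to script the project instead of using the console, the same settings (including the Git clone depth) can be passed to the CreateProject API. Here is a hedged boto3 sketch; the repository URL, build image, and role ARN are placeholders, the personal access token is still associated with CodeBuild as in step 6, and the webhook is still created manually in GitHub Enterprise as described above:
import boto3

codebuild = boto3.client('codebuild')

codebuild.create_project(
    name='GHESampleBuild',
    source={
        'type': 'GITHUB_ENTERPRISE',
        'location': 'https://github-enterprise.example.com/org/repo.git',
        'gitCloneDepth': 1,          # shallow clone, as chosen in the console walkthrough
        'insecureSsl': False,
        'buildspec': 'version: 0.2\nphases:\n  build:\n    commands:\n      - git log',
    },
    artifacts={'type': 'NO_ARTIFACTS'},
    environment={
        'type': 'LINUX_CONTAINER',
        'image': 'aws/codebuild/ubuntu-base:14.04',      # placeholder managed image
        'computeType': 'BUILD_GENERAL1_SMALL',
    },
    serviceRole='arn:aws:iam::123456789012:role/CodeBuildServiceRole',
)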
Now I will show you how to test this.
Trigger your AWS CodeBuild project by pushing a change to the GitHub Enterprise repo
1. Clone the repo to the local file system. For information, see Clone the Repository Using the Command Line on the GitHub Help website. Now create a feature branch, push a change, and generate a pull request for review, and, ultimately, merge to master. Here is the state of the GitHub Enterprise repo and AWS CodeBuild project before pushing a change:
4. After the changes have been saved, push them to the feature branch.
5. There is now notification of a new branch in the GitHub Enterprise environment.
6. Generate a pull request from the feature branch in preparation of review and merge to master.
7. The reviewer(s) will then review and merge the pull request, pushing all changes to the master branch.
8. Here is the updated repo:
9. After the change has been pushed successfully, a new build is initiated.
10. In the following screenshot, you’ll see the initiator is Github-Hookshot/eb0c46 and the source version is 03169095b8f16ac077388471035becb2070aa12c.
11. In the Recent Deliveries section, under the configuration of the GitHub Enterprise repo webhook, the CodeBuild project initiator is defined as User-Agent. The source version is denoted in the Payload output. They match!
12. A successfully completed build should appear under the CodeBuild project.
13. The entire log output of the CodeBuild project can be viewed in CloudWatch logs. In the following screenshot, the source was downloaded successfully from the GitHub Enterprise repo and the build command of git log . was run successfully. Only the most recent commit appears in the git history output. This is because I defined a clone depth of 1.
14. If I query the git history of the repo on the local repository, the output has the full commit history. This is expected because I am doing a full clone locally.
Conclusion
In this blog post, I showed you how to configure GitHub Enterprise as a source type for your AWS CodeBuild project with a clone depth of 1. These new features expand the capabilities of AWS CodeBuild and the suite of AWS Developer Tools for CI/CD and DevOps processes.
I hope you found this post useful. Feel free to leave your feedback or suggestions in the comments.
We launched some important new EC2 instance types and features at AWS re:Invent. I’ve already told you about the M5, H1, T2 Unlimited and Bare Metal instances, and about Spot features such as Hibernation and the New Pricing Model. Randall told you about the Amazon Time Sync Service. Today I would like to tell you about two of the features that we launched: Spread placement groups and Launch Templates. Both features are available in the EC2 Console and from the EC2 APIs, and can be used in all of the AWS Regions in the “aws” partition:
Launch Templates
You can use launch templates to store the instance, network, security, storage, and advanced parameters that you use to launch EC2 instances, and can also include any desired tags. Each template can include any desired subset of the full collection of parameters. You can, for example, define common configuration parameters such as tags or network configurations in a template, and allow the other parameters to be specified as part of the actual launch.
Templates give you the power to set up a consistent launch environment that spans instances launched in On-Demand and Spot form, as well as through EC2 Auto Scaling and as part of a Spot Fleet. You can use them to implement organization-wide standards and to enforce best practices, and you can give your IAM users the ability to launch instances via templates while withholding the ability to do so via the underlying APIs.
Templates are versioned and you can use any desired version when you launch an instance. You can create templates from scratch, base them on the previous version, or copy the parameters from a running instance.
Here’s how you create a launch template in the Console:
Here’s how to include network interfaces, storage volumes, tags, and security groups:
And here’s how to specify advanced and specialized parameters:
You don’t have to specify values for all of these parameters in your templates; enter the values that are common to multiple instances or launches and specify the rest at launch time.
When you click Create launch template, the template is created and can be used to launch On-Demand instances, create Auto Scaling Groups, and create Spot Fleets:
The Launch Instance button now gives you the option to launch from a template:
Simply choose the template and the version, and finalize all of the launch parameters:
You can also manage your templates and template versions from the Console:
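For those who prefer the API, here is a minimal boto3 sketch of the same flow; the AMI ID, instance type, and tag values are placeholders:
import boto3

ec2 = boto3.client('ec2')

ec2.create_launch_template(
    LaunchTemplateName='web-tier',
    LaunchTemplateData={
        'ImageId': 'ami-0123456789abcdef0',          # placeholder AMI
        'InstanceType': 't2.micro',
        'TagSpecifications': [{
            'ResourceType': 'instance',
            'Tags': [{'Key': 'Team', 'Value': 'web'}],
        }],
    },
)

# Launch from the template; anything not specified here falls back to the template's values.
ec2.run_instances(
    MinCount=1,
    MaxCount=1,
    LaunchTemplate={'LaunchTemplateName': 'web-tier', 'Version': '1'},
)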
Spread Placement Groups
Spread placement groups indicate that you do not want the instances in the group to share the same underlying hardware. Applications that rely on a small number of critical instances can launch them in a spread placement group to reduce the odds that one hardware failure will impact more than one instance. Here are a couple of things to keep in mind when you use spread placement groups:
Availability Zones – A single spread placement group can span multiple Availability Zones. You can have a maximum of seven running instances per Availability Zone per group.
Unique Hardware – Launch requests can fail if there is insufficient unique hardware available. The situation changes over time as overall usage changes and as we add additional hardware; you can retry failed requests at a later time.
Instance Types – You can launch a wide variety of M4, M5, C3, R3, R4, X1, X1e, D2, H1, I2, I3, HS1, F1, G2, G3, P2, and P3 instance types in spread placement groups.
Reserved Instances – Instances launched into a spread placement group can make use of reserved capacity. However, you cannot currently reserve capacity for a placement group and could receive an ICE (Insufficient Capacity Error) even if you have some RI’s available.
Applicability – You cannot use spread placement groups in conjunction with Dedicated Instances or Dedicated Hosts.
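Here is a hedged boto3 sketch of the API side: create a spread placement group and launch instances into it (the AMI ID and instance type are placeholders):
import boto3

ec2 = boto3.client('ec2')

ec2.create_placement_group(GroupName='critical-spread', Strategy='spread')

# Remember: a maximum of seven running instances per Availability Zone per group.
ec2.run_instances(
    ImageId='ami-0123456789abcdef0',                 # placeholder AMI
    InstanceType='m5.large',
    MinCount=3,
    MaxCount=3,
    Placement={'GroupName': 'critical-spread'},
)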
Use the GPIO pins of a Raspberry Pi Zero while running Debian Stretch on a PC or Mac with our new GPIO expander software! With this tool, you can easily access a Pi Zero’s GPIO pins from your x86 laptop without using SSH, and you can also take advantage of your x86 computer’s processing power in your physical computing projects.
What is this magic?
Running our x86 Stretch distribution on a PC or Mac, whether installed on the hard drive or as a live image, is a great way of taking advantage of a well controlled and simple Linux distribution without the need for a Raspberry Pi.
The downside of not using a Pi, however, is that there aren’t any GPIO pins with which your Scratch or Python programs could communicate. This is a shame, because it means you are limited in your physical computing projects.
I was thinking about this while playing around with the Pi Zero’s USB booting capabilities, having seen people employ the Linux gadget USB mode to use the Pi Zero as an Ethernet device. It struck me that, using the udev subsystem, we could create a simple GUI application that automatically pops up when you plug a Pi Zero into your computer’s USB port. Then the Pi Zero could be programmed to turn into an Ethernet-connected computer running pigpio to provide you with remote GPIO pins.
So we went ahead and built this GPIO expander application, and your PC or Mac can now have GPIO pins which are accessible through Scratch or the GPIO Zero Python library. Note that you can only use this tool to access a Pi Zero’s GPIO pins.
You can also install the application on the Raspberry Pi. Theoretically, you could connect a number of Pi Zeros to a single Pi and (without a USB hub) use a maximum of 140 pins! But I’ve not tested this — one for you, I think…
Making the GPIO expander work
If you’re using a PC or Mac and you haven’t set up x86 Debian Stretch yet, you’ll need to do that first. An easy way to do it is to download a copy of the Stretch release from this page and image it onto a USB stick. Boot from the USB stick (on most computers, you just need to press F10 during booting and select the stick when asked), and then run Stretch directly from the USB key. You can also install it to the hard drive, but be aware that installing it will overwrite anything that was on your hard drive before.
Whether on a Mac, PC, or Pi, boot through to the Stretch desktop, open a terminal window, and install the GPIO expander application:
sudo apt install usbbootgui
Next, plug in your Raspberry Pi Zero (don’t insert an SD card), and after a few seconds the GUI will appear.
The Raspberry Pi USB programming GUI
Select GPIO expansion board and click OK. The Pi Zero will now be programmed as a locally connected Ethernet port (if you run ifconfig, you’ll see the new interface usb0 coming up).
What’s really cool about this is that your plugged-in Pi Zero is now running pigpio, which allows you to control its GPIOs through the network interface.
With Scratch 2
To utilise the pins with Scratch 2, just click on the start bar and select Programming > Scratch 2.
In Scratch, click on More Blocks, select Add an Extension, and then click Pi GPIO.
Two new blocks will be added: the first is used to set the output pin, the second is used to get the pin value (it is true if the pin is read high).
This is a simple application using a Pibrella I had hanging around:
With Python
This is a Python example using the GPIO Zero library to flash an LED:
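(The post’s original listing isn’t reproduced here; the following is a representative sketch using GPIO Zero’s remote pigpio pin factory, and the pin number is an assumption.)
from signal import pause
from gpiozero import LED
from gpiozero.pins.pigpio import PiGPIOFactory

# usb0 is the network interface created when the Pi Zero is plugged in
factory = PiGPIOFactory(host='fe80::1%usb0')
led = LED(17, pin_factory=factory)   # GPIO 17 is an assumption; use whichever pin your LED is on

led.blink()                          # toggle the LED on and off once a second
pause()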
Note that in the code above the IP address of the Pi Zero is an IPv6 address and is shortened to fe80::1%usb0, where usb0 is the network interface created by the first Pi Zero.
With pigs directly
Another option you have is to use the pigpio library and the pigs application and redirect the output to the Pi Zero network port running IPv6. To do this, you’ll first need to set some environment variables (PIGPIO_ADDR, and optionally PIGPIO_PORT) for the redirection:
With the commands above, you should be able to flash the LED on the Pi Zero.
The secret sauce
I know there’ll be some people out there who would be interested in how we put this together. And I’m sure many people are interested in the ‘buildroot’ we created to run on the Pi Zero — after all, there are lots of things you can create if you’ve got a Pi Zero on the end of a piece of IPv6 string! For a closer look, find the build scripts for the GPIO expander here and the source code for the USB boot GUI here.
And be sure to share your projects built with the GPIO expander by tagging us on social media or posting links in the comments!
This is a guest post by Rafi Ton, founder and CEO of NUVIAD. NUVIAD is, in their own words, “a mobile marketing platform providing professional marketers, agencies and local businesses state of the art tools to promote their products and services through hyper targeting, big data analytics and advanced machine learning tools.”
At NUVIAD, we’ve been using Amazon Redshift as our main data warehouse solution for more than 3 years.
We store massive amounts of ad transaction data that our users and partners analyze to determine ad campaign strategies. When running real-time bidding (RTB) campaigns in large scale, data freshness is critical so that our users can respond rapidly to changes in campaign performance. We chose Amazon Redshift because of its simplicity, scalability, performance, and ability to load new data in near real time.
Over the past three years, our customer base grew significantly and so did our data. We saw our Amazon Redshift cluster grow from three nodes to 65 nodes. To balance cost and analytics performance, we looked for a way to store large amounts of less-frequently analyzed data at a lower cost. Yet, we still wanted to have the data immediately available for user queries and to meet their expectations for fast performance. We turned to Amazon Redshift Spectrum.
In this post, I explain the reasons why we extended Amazon Redshift with Redshift Spectrum as our modern data warehouse. I cover how our data growth and the need to balance cost and performance led us to adopt Redshift Spectrum. I also share key performance metrics in our environment, and discuss the additional AWS services that provide a scalable and fast environment, with data available for immediate querying by our growing user base.
Amazon Redshift as our foundation
The ability to provide fresh, up-to-the-minute data to our customers and partners was always a main goal with our platform. We saw other solutions provide data that was a few hours old, but this was not good enough for us. We insisted on providing the freshest data possible. For us, that meant loading Amazon Redshift in frequent micro batches and allowing our customers to query Amazon Redshift directly to get results in near real time.
The benefits were immediately evident. Our customers could see how their campaigns performed faster than with other solutions, and react sooner to the ever-changing media supply pricing and availability. They were very happy.
However, this approach required Amazon Redshift to store a lot of data for long periods, and our data grew substantially. At our peak, we maintained a cluster running 65 DC1.large nodes. The impact on our Amazon Redshift cluster was evident, and we saw our CPU utilization grow to 90%.
Why we extended Amazon Redshift to Redshift Spectrum
Redshift Spectrum gives us the ability to run SQL queries using the powerful Amazon Redshift query engine against data stored in Amazon S3, without needing to load the data. With Redshift Spectrum, we store data where we want, at the cost that we want. We have the data available for analytics when our users need it with the performance they expect.
Seamless scalability, high performance, and unlimited concurrency
Scaling Redshift Spectrum is a simple process. First, it allows us to leverage Amazon S3 as the storage engine and get practically unlimited data capacity.
Second, if we need more compute power, we can leverage Redshift Spectrum’s distributed compute engine over thousands of nodes to provide superior performance – perfect for complex queries running against massive amounts of data.
Third, all Redshift Spectrum clusters access the same data catalog so that we don’t have to worry about data migration at all, making scaling effortless and seamless.
Lastly, since Redshift Spectrum distributes queries across potentially thousands of nodes, they are not affected by other queries, providing much more stable performance and unlimited concurrency.
Keeping it SQL
Redshift Spectrum uses the same query engine as Amazon Redshift. This means that we did not need to change our BI tools or query syntax, whether we used complex queries across a single table or joins across multiple tables.
An interesting capability introduced recently is the ability to create a view that spans both Amazon Redshift and Redshift Spectrum external tables. With this feature, you can query frequently accessed data in your Amazon Redshift cluster and less-frequently accessed data in Amazon S3, using a single view.
Leveraging Parquet for higher performance
Parquet is a columnar data format that provides superior performance and allows Redshift Spectrum (or Amazon Athena) to scan significantly less data. With less I/O, queries run faster and we pay less per query. You can read all about Parquet at https://parquet.apache.org/ or https://en.wikipedia.org/wiki/Apache_Parquet.
Lower cost
From a cost perspective, we pay standard rates for our data in Amazon S3, and only small amounts per query to analyze data with Redshift Spectrum. Using the Parquet format, we can significantly reduce the amount of data scanned. Our costs are now lower, and our users get fast results even for large complex queries.
What we learned about Amazon Redshift vs. Redshift Spectrum performance
When we first started looking at Redshift Spectrum, we wanted to put it to the test. We wanted to know how it would compare to Amazon Redshift, so we looked at two key questions:
What is the performance difference between Amazon Redshift and Redshift Spectrum on simple and complex queries?
Does the data format impact performance?
During the migration phase, we had our dataset stored in Amazon Redshift and S3 as CSV/GZIP and as Parquet file formats. We tested three configurations:
Amazon Redshift cluster with 28 DC1.large nodes
Redshift Spectrum using CSV/GZIP
Redshift Spectrum using Parquet
We performed benchmarks for simple and complex queries on one month’s worth of data. We tested how much time it took to perform the query, and how consistent the results were when running the same query multiple times. The data we used for the tests was already partitioned by date and hour. Properly partitioning the data improves performance significantly and reduces query times.
Simple query
First, we tested a simple query aggregating billing data across a month:
SELECT
user_id,
count(*) AS impressions,
SUM(billing)::decimal /1000000 AS billing
FROM <table_name>
WHERE
date >= '2017-08-01' AND
date <= '2017-08-31'
GROUP BY
user_id;
We ran the same query seven times and measured the response times:
Execution Time (seconds)

          Amazon Redshift   Redshift Spectrum CSV   Redshift Spectrum Parquet
Run #1    39.65             45.11                   11.92
Run #2    15.26             43.13                   12.05
Run #3    15.27             46.47                   13.38
Run #4    21.22             51.02                   12.74
Run #5    17.27             43.35                   11.76
Run #6    16.67             44.23                   13.67
Run #7    25.37             40.39                   12.75
Average   21.53             44.82                   12.61
For simple queries, Amazon Redshift performed better than Redshift Spectrum, as we thought, because the data is local to Amazon Redshift.
What was surprising was that using Parquet data format in Redshift Spectrum significantly beat ‘traditional’ Amazon Redshift performance. For our queries, using Parquet data format with Redshift Spectrum delivered an average 40% performance gain over traditional Amazon Redshift. Furthermore, Redshift Spectrum showed high consistency in execution time with a smaller difference between the slowest run and the fastest run.
Comparing the amount of data scanned when using CSV/GZIP and Parquet, the difference was also significant:
Data Scanned (GB)

CSV (GZIP)   135.49
Parquet        2.83
Because we pay only for the data scanned by Redshift Spectrum, the cost saving of using Parquet is evident and substantial.
Complex query
Next, we compared the same three configurations with a complex query.
Execution Time (seconds)

          Amazon Redshift   Redshift Spectrum CSV   Redshift Spectrum Parquet
Run #1    329.80            84.20                   42.40
Run #2    167.60            65.30                   35.10
Run #3    165.20            62.20                   23.90
Run #4    273.90            74.90                   55.90
Run #5    167.70            69.00                   58.40
Average   220.84            71.12                   43.14
This time, Redshift Spectrum using Parquet cut the average query time by 80% compared to traditional Amazon Redshift!
Bottom line: For complex queries, Redshift Spectrum provided a 67% performance gain over Amazon Redshift. Using the Parquet data format, Redshift Spectrum delivered an 80% performance improvement over Amazon Redshift. For us, this was substantial.
Optimizing the data structure for different workloads
Because the cost of S3 is relatively inexpensive and we pay only for the data scanned by each query, we believe that it makes sense to keep our data in different formats for different workloads and different analytics engines. It is important to note that we can have any number of tables pointing to the same data on S3. It all depends on how we partition the data and update the table partitions.
Data permutations
For example, we have a process that runs every minute and generates statistics for the last minute of data collected. With Amazon Redshift, this would be done by running the query on the table with something as follows:
SELECT
user,
COUNT(*)
FROM
events_table
WHERE
ts BETWEEN '2017-08-01 14:00:00' AND '2017-08-01 14:00:59'
GROUP BY
user;
(Assuming ‘ts’ is your column storing the time stamp for each event.)
With Redshift Spectrum, we pay for the data scanned in each query. If the data is partitioned by the minute instead of the hour, a query looking at one minute would be 1/60th the cost. If we use a temporary table that points only to the data of the last minute, we save that unnecessary cost.
Creating Parquet data efficiently
On the average, we have 800 instances that process our traffic. Each instance sends events that are eventually loaded into Amazon Redshift. When we started three years ago, we would offload data from each server to S3 and then perform a periodic copy command from S3 to Amazon Redshift.
Recently, Amazon Kinesis Firehose added the capability to offload data directly to Amazon Redshift. While this is now a viable option, we kept the same collection process that worked flawlessly and efficiently for three years.
This changed, however, when we incorporated Redshift Spectrum. With Redshift Spectrum, we needed to find a way to:
Collect the event data from the instances.
Save the data in Parquet format.
Partition the data effectively.
To accomplish this, we save the data as CSV and then transform it to Parquet. The most effective method to generate the Parquet files is to:
Send the data in one-minute intervals from the instances to Kinesis Firehose with an S3 temporary bucket as the destination.
Aggregate hourly data and convert it to Parquet using AWS Lambda and AWS Glue.
Add the Parquet data to S3 by updating the table partitions.
With this new process, we had to give more attention to validating the data before we sent it to Kinesis Firehose, because a single corrupted record in a partition fails queries on that partition.
Data validation
To store our click data in a table, we considered the following SQL create table command:
create external TABLE spectrum.blog_clicks (
user_id varchar(50),
campaign_id varchar(50),
os varchar(50),
ua varchar(255),
ts bigint,
billing float
)
partitioned by (date date, hour smallint)
stored as parquet
location 's3://nuviad-temp/blog/clicks/';
The above statement defines a new external table (all Redshift Spectrum tables are external tables) with a few attributes. We stored ‘ts’ as a Unix time stamp and not as Timestamp, and billing data is stored as float and not decimal (more on that later). We also said that the data is partitioned by date and hour, and then stored as Parquet on S3.
First, we need to get the table definitions. This can be achieved by running the following query:
SELECT
*
FROM
svv_external_columns
WHERE
tablename = 'blog_clicks';
This query lists all the columns in the table with their respective definitions:
schemaname   tablename     columnname    external_type   columnnum   part_key
spectrum     blog_clicks   user_id       varchar(50)     1           0
spectrum     blog_clicks   campaign_id   varchar(50)     2           0
spectrum     blog_clicks   os            varchar(50)     3           0
spectrum     blog_clicks   ua            varchar(255)    4           0
spectrum     blog_clicks   ts            bigint          5           0
spectrum     blog_clicks   billing       double          6           0
spectrum     blog_clicks   date          date            7           1
spectrum     blog_clicks   hour          smallint        8           2
Now we can use this data to create a validation schema for our data:
Next, we create a function that uses this schema to validate data:
// Validate a single value against its column schema definition
function valueIsValid(value, item_schema) {
    if (item_schema.type == 'string') {
        return (typeof value == 'string' && value.length <= item_schema.max_length);
    }
    else if (item_schema.type == 'integer') {
        return (typeof value == 'number' && value >= item_schema.min_value && value <= item_schema.max_value);
    }
    else if (item_schema.type == 'float' || item_schema.type == 'double') {
        return (typeof value == 'number' && value >= item_schema.min_value && value <= item_schema.max_value);
    }
    else if (item_schema.type == 'boolean') {
        return typeof value == 'boolean';
    }
    else if (item_schema.type == 'timestamp') {
        return (new Date(value)).getTime() > 0;
    }
    else {
        return true;
    }
}
Near real-time data loading with Kinesis Firehose
On Kinesis Firehose, we created a new delivery stream to handle the events as follows:
Delivery stream name: events
Source: Direct PUT
S3 bucket: nuviad-events
S3 prefix: rtb/
IAM role: firehose_delivery_role_1
Data transformation: Disabled
Source record backup: Disabled
S3 buffer size (MB): 100
S3 buffer interval (sec): 60
S3 Compression: GZIP
S3 Encryption: No Encryption
Status: ACTIVE
Error logging: Enabled
This delivery stream aggregates event data every minute, or up to 100 MB, and writes the data to an S3 bucket as a CSV/GZIP compressed file. Next, after we have the data validated, we can safely send it to our Kinesis Firehose API:
if (validated) {
let itemString = item.join('|')+'\n'; //Sending csv delimited by pipe and adding new line
let params = {
DeliveryStreamName: 'events',
Record: {
Data: itemString
}
};
firehose.putRecord(params, function(err, data) {
if (err) {
console.error(err, err.stack);
}
else {
// Continue to your next step
}
});
}
Now, we have a single CSV file representing one minute of event data stored in S3. The files are named automatically by Kinesis Firehose by adding a UTC time prefix in the format YYYY/MM/DD/HH before writing objects to S3. Because we use the date and hour as partitions, we need to change the file naming and location to fit our Redshift Spectrum schema.
Automating data distribution using AWS Lambda
We created a simple Lambda function triggered by an S3 put event that copies the file to a different location (or locations), while renaming it to fit our data structure and processing flow. As mentioned before, the files generated by Kinesis Firehose are structured in a pre-defined hierarchy, such as:
All we need to do is parse the object name and restructure it as we see fit. In our case, we did the following (the event is an object received in the Lambda function with all the data about the object written to S3):
/*
object key structure in the event object:
your-prefix/2017/08/01/20/event-4-2017-08-01-20-06-06-536f5c40-6893-4ee4-907d-81e4d3b09455.gz
*/
let key_parts = event.Records[0].s3.object.key.split('/');
let event_type = key_parts[0];
let date = key_parts[1] + '-' + key_parts[2] + '-' + key_parts[3];
let hour = key_parts[4];
if (hour.indexOf('0') == 0) {
hour = parseInt(hour, 10) + '';
}
let parts1 = key_parts[5].split('-');
let minute = parts1[7];
if (minute.indexOf('0') == 0) {
minute = parseInt(minute, 10) + '';
}
Now, we can redistribute the file to the two destinations we need—one for the minute processing task and the other for hourly aggregation:
Kinesis Firehose stores the data in a temporary folder. We copy the object to another folder that holds the data for the last processed minute. This folder is connected to a small Redshift Spectrum table where the data is being processed without needing to scan a much larger dataset. We also copy the data to a folder that holds the data for the entire hour, to be later aggregated and converted to Parquet.
Because we partition the data by date and hour, we created a new partition on the Redshift Spectrum table if the processed minute is the first minute in the hour (that is, minute 0). We ran the following:
ALTER TABLE
spectrum.events
ADD partition
(date='2017-08-01', hour=0)
LOCATION 's3://nuviad-temp/events/2017-08-01/0/';
After the data is processed and added to the table, we delete the processed data from the temporary Kinesis Firehose storage and from the minute storage folder.
Migrating CSV to Parquet using AWS Glue and Amazon EMR
The simplest way we found to run an hourly job converting our CSV data to Parquet is using Lambda and AWS Glue (and thanks to the awesome AWS Big Data team for their help with this).
Creating AWS Glue jobs
What this simple AWS Glue script does:
Gets parameters for the job, date, and hour to be processed
Creates a Spark EMR context allowing us to run Spark code
Reads CSV data into a DataFrame
Writes the data as Parquet to the destination S3 bucket
Adds or modifies the Redshift Spectrum / Amazon Athena table partition for the table
Note: Because Redshift Spectrum and Athena both use the AWS Glue Data Catalog, we could use the Athena client to add the partition to the table.
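The script itself isn’t reproduced here, but a hedged PySpark sketch of the approach might look like the following; the bucket paths, database, table, and argument names are assumptions:
import sys
import boto3
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext

args = getResolvedOptions(sys.argv, ['day_partition_key', 'hour_partition_key'])
day, hour = args['day_partition_key'], args['hour_partition_key']

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read the hour's pipe-delimited CSV data and rewrite it as Parquet
df = spark.read.option('delimiter', '|').csv('s3://nuviad-temp/events/{}/{}/'.format(day, hour))
df.write.mode('overwrite').parquet('s3://nuviad-temp/parquet/{}/hour={}/'.format(day, hour))

# Register the partition through the Athena client so Redshift Spectrum and Athena can see it
athena = boto3.client('athena')
athena.start_query_execution(
    QueryString="ALTER TABLE spectrum.events ADD IF NOT EXISTS PARTITION "
                "(date='{}', hour={}) LOCATION 's3://nuviad-temp/parquet/{}/hour={}/'".format(day, hour, day, hour),
    ResultConfiguration={'OutputLocation': 's3://nuviad-temp/athena-results/'},
)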
Here are a few words about float, decimal, and double. Using decimal proved to be more challenging than we expected, as it seems that Redshift Spectrum and Spark use them differently. Whenever we used decimal in Redshift Spectrum and in Spark, we kept getting errors, such as:
S3 Query Exception (Fetch). Task failed due to an internal error. File 'https://s3-external-1.amazonaws.com/nuviad-temp/events/2017-08-01/hour=2/part-00017-48ae5b6b-906e-4875-8cde-bc36c0c6d0ca.c000.snappy.parquet has an incompatible Parquet schema for column 's3://nuviad-events/events.lat'. Column type: DECIMAL(18, 8), Parquet schema:\noptional float lat [i:4 d:1 r:0]\n (https://s3-external-1.amazonaws.com/nuviad-temp/events/2017-08-01/hour=2/part-00017-48ae5b6b-906e-4875-8cde-bc36c0c6d0ca.c000.snappy.parq
We had to experiment with a few floating-point formats until we found that the only combination that worked was to define the column as double in the Spark code and float in Spectrum. This is the reason you see billing defined as float in Spectrum and double in the Spark code.
Creating a Lambda function to trigger conversion
Next, we created a simple Lambda function to trigger the AWS Glue script hourly using a simple Python code:
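The original code isn’t reproduced above; a minimal sketch of such a Lambda trigger might look like this (the argument keys are assumptions, while the job name comes from the text below):
import datetime
import boto3

glue = boto3.client('glue')

def lambda_handler(event, context):
    # Process the previous hour's partition
    previous_hour = datetime.datetime.utcnow() - datetime.timedelta(hours=1)
    glue.start_job_run(
        JobName='convertEventsParquetHourly',
        Arguments={
            '--day_partition_key': previous_hour.strftime('%Y-%m-%d'),
            '--hour_partition_key': str(previous_hour.hour),
        },
    )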
Using Amazon CloudWatch Events, we trigger this function hourly. This function triggers an AWS Glue job named ‘convertEventsParquetHourly’ and runs it for the previous hour, passing job names and values of the partitions to process to AWS Glue.
Redshift Spectrum and Node.js
Our development stack is based on Node.js, which is well-suited for high-speed, light servers that need to process a huge number of transactions. However, a few limitations of the Node.js environment required us to create workarounds and use other tools to complete the process.
Node.js and Parquet
The lack of Parquet modules for Node.js required us to implement an AWS Glue/Amazon EMR process to effectively migrate data from CSV to Parquet. We would rather save directly to Parquet, but we couldn’t find an effective way to do it.
One interesting project in the works is the development of a Parquet NPM by Marc Vertes called node-parquet (https://www.npmjs.com/package/node-parquet). It is not in a production state yet, but we think it would be well worth following the progress of this package.
Timestamp data type
According to the Parquet documentation, Timestamp data are stored in Parquet as 64-bit integers. However, JavaScript does not support 64-bit integers, because the native number type is a 64-bit double, giving only 53 bits of integer range.
The result is that you cannot store Timestamp correctly in Parquet using Node.js. The solution is to store Timestamp as string and cast the type to Timestamp in the query. Using this method, we did not witness any performance degradation whatsoever.
Lessons learned
You can benefit from our trial-and-error experience.
Lesson #1: Data validation is critical
As mentioned earlier, a single corrupt entry in a partition can fail queries running against this partition, especially when using Parquet, which is harder to edit than a simple CSV file. Make sure that you validate your data before scanning it with Redshift Spectrum.
Lesson #2: Structure and partition data effectively
One of the biggest benefits of using Redshift Spectrum (or Athena for that matter) is that you don’t need to keep nodes up and running all the time. You pay only for the queries you perform and only for the data scanned per query.
Keeping different permutations of your data for different queries makes a lot of sense in this case. For example, you can partition your data by date and hour to run time-based queries, and also have another set partitioned by user_id and date to run user-based queries. This results in faster and more efficient performance of your data warehouse.
Storing data in the right format
Use Parquet whenever you can. The benefits of Parquet are substantial: faster performance, less data to scan, and a much more efficient columnar format. However, it is not supported out of the box by Kinesis Firehose, so you need to implement your own ETL. AWS Glue is a great option.
Creating small tables for frequent tasks
When we started using Redshift Spectrum, we saw our Amazon Redshift costs jump by hundreds of dollars per day. Then we realized that we were unnecessarily scanning a full day’s worth of data every minute. Take advantage of the ability to define multiple tables on the same S3 bucket or folder, and create temporary and small tables for frequent queries.
Lesson #3: Combine Athena and Redshift Spectrum for optimal performance
Moving to Redshift Spectrum also allowed us to take advantage of Athena as both use the AWS Glue Data Catalog. Run fast and simple queries using Athena while taking advantage of the advanced Amazon Redshift query engine for complex queries using Redshift Spectrum.
Redshift Spectrum excels when running complex queries. It can push many compute-intensive tasks, such as predicate filtering and aggregation, down to the Redshift Spectrum layer, so that queries use much less of your cluster’s processing capacity.
Lesson #4: Sort your Parquet data within the partition
We achieved another performance improvement by sorting data within the partition using sortWithinPartitions(sort_field). For example:
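The original example isn’t shown above; here is a hedged PySpark sketch, assuming df is an hourly DataFrame with date, hour, and user_id columns (illustrative names):
# df is assumed to be the hourly DataFrame from the Glue job sketched earlier,
# with date, hour, and user_id columns.
(df.repartition('date', 'hour')
   .sortWithinPartitions('user_id')
   .write.mode('append')
   .partitionBy('date', 'hour')
   .parquet('s3://nuviad-temp/parquet/'))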
We were extremely pleased with using Amazon Redshift as our core data warehouse for over three years. But as our client base and volume of data grew substantially, we extended Amazon Redshift to take advantage of scalability, performance, and cost with Redshift Spectrum.
Redshift Spectrum lets us scale to virtually unlimited storage, scale compute transparently, and deliver super-fast results for our users. With Redshift Spectrum, we store data where we want at the cost we want, and have the data available for analytics when our users need it with the performance they expect.
About the Author
With 7 years of experience in the AdTech industry and 15 years in leading technology companies, Rafi Ton is the founder and CEO of NUVIAD. He enjoys exploring new technologies and putting them to use in cutting-edge products and services in the real world, generating real money. As an experienced entrepreneur, Rafi believes in practical programming and fast adaptation of new technologies to achieve a significant market advantage.
Thank you to Michael Edge, Senior Cloud Architect, for a great blog on CodeCommit pull requests.
~~~~~~~
AWS CodeCommit is a fully managed service for securely hosting private Git repositories. CodeCommit now supports pull requests, which allows repository users to review, comment upon, and interactively iterate on code changes. Used as a collaboration tool between team members, pull requests help you to review potential changes to a CodeCommit repository before merging those changes into the repository. Each pull request goes through a simple lifecycle, as follows:
The new features to be merged are added as one or more commits to a feature branch. The commits are not merged into the destination branch.
The pull request is created, usually from the difference between two branches.
Team members review and comment on the pull request. The pull request might be updated with additional commits that contain changes made in response to comments, or include changes made to the destination branch.
Once team members are happy with the pull request, it is merged into the destination branch. The commits are applied to the destination branch in the same order they were added to the pull request.
Commenting is an integral part of the pull request process, and is used to collaborate between the developers and the reviewer. Reviewers add comments and questions to a pull request during the review process, and developers respond to these with explanations. Pull request comments can be added to the overall pull request, a file within the pull request, or a line within a file.
To make the comments more useful, sign in to the AWS Management Console as an AWS Identity and Access Management (IAM) user. The username will then be associated with the comment, indicating the owner of the comment. Pull request comments are a great quality improvement tool as they allow the entire development team visibility into what reviewers are looking for in the code. They also serve as a record of the discussion between team members at a point in time, and shouldn’t be deleted.
AWS CodeCommit is also introducing the ability to add comments to a commit, another useful collaboration feature that allows team members to discuss code changed as part of a commit. This helps you discuss changes made in a repository, including why the changes were made, whether further changes are necessary, or whether changes should be merged. As is the case with pull request comments, you can comment on an overall commit, on a file within a commit, or on a specific line or change within a file, and other repository users can respond to your comments. Comments are not restricted to commits; they can also be used to comment on the differences between two branches, or between two tags. Commit comments are separate from pull request comments: you will not see commit comments when reviewing a pull request, only pull request comments.
A pull request example
Let’s get started by running through an example. We’ll take a typical pull request scenario and look at how we’d use CodeCommit and the AWS Management Console for each of the steps.
To try out this scenario, you’ll need:
An AWS CodeCommit repository with some sample code in the master branch. We’ve provided sample code below.
Git installed on your local computer, and access configured for AWS CodeCommit.
A clone of the AWS CodeCommit repository on your local computer.
In the course of this example, you’ll sign in to the AWS CodeCommit console as one IAM user to create the pull request, and as a second IAM user to review the pull request. To learn more about how to set up your IAM users and how to connect to AWS CodeCommit with Git, see the following topics:
If you’d like to use the same ‘hello world’ application as used in this article, here is the source code:
package com.amazon.helloworld;

public class Main {
    public static void main(String[] args) {
        System.out.println("Hello, world");
    }
}
The scenario below uses the us-east-2 region.
Creating the branches
Before we jump in and create a pull request, we’ll need at least two branches. In this example, we’ll follow a branching strategy similar to the one described in GitFlow. We’ll create a new branch for our feature from the main development branch (the default branch). We’ll develop the feature in the feature branch. Once we’ve written and tested the code for the new feature in that branch, we’ll create a pull request that contains the differences between the feature branch and the main development branch. Our team lead (the second IAM user) will review the changes in the pull request. Once the changes have been reviewed, the feature branch will be merged into the development branch.
Figure 1: Pull request link
Sign in to the AWS CodeCommit console with the IAM user you want to use as the developer. You can use an existing repository or you can go ahead and create a new one. We won’t be merging any changes to the master branch of your repository, so it’s safe to use an existing repository for this example. You’ll find the Pull requests link has been added just above the Commits link (see Figure 1), and below Commits you’ll find the Branches link. Click Branches and create a new branch called ‘develop’, branched from the ‘master’ branch. Then create a new branch called ‘feature1’, branched from the ‘develop’ branch. You’ll end up with three branches, as you can see in Figure 2. (Your repository might contain other branches in addition to the three shown in the figure).
Figure 2: Create a feature branch
If you haven’t cloned your repo yet, go to the Code link in the CodeCommit console and click the Connect button. Follow the instructions to clone your repo (detailed instructions are here). Open a terminal or command line and paste the git clone command supplied in the Connect instructions for your repository. The example below shows cloning a repository named codecommit-demo:
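(The command from the original isn’t reproduced here; for a repository named codecommit-demo in us-east-2, the clone command supplied by the Connect instructions follows this pattern.)
git clone https://git-codecommit.us-east-2.amazonaws.com/v1/repos/codecommit-demo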
If you’ve previously cloned the repo you’ll need to update your local repo with the branches you created. Open a terminal or command line and make sure you’re in the root directory of your repo, then run the following command:
git remote update origin
You’ll see your new branches pulled down to your local repository.
Now we’ll make a change to the ‘feature1’ branch. Open a terminal or command line and check out the feature1 branch by running the following command:
git checkout feature1
$ git checkout feature1
Branch feature1 set up to track remote branch feature1 from origin.
Switched to a new branch 'feature1'
Make code changes
Edit a file in the repo using your favorite editor and save the changes. Commit your changes to the local repository, and push your changes to CodeCommit. For example:
git commit -am 'added new feature'
git push origin feature1
$ git commit -am 'added new feature'
[feature1 8f6cb28] added new feature
1 file changed, 1 insertion(+), 1 deletion(-)
$ git push origin feature1
Counting objects: 9, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (4/4), done.
Writing objects: 100% (9/9), 617 bytes | 617.00 KiB/s, done.
Total 9 (delta 2), reused 0 (delta 0)
To https://git-codecommit.us-east-2.amazonaws.com/v1/repos/codecommit-demo
2774a53..8f6cb28 feature1 -> feature1
Creating the pull request
Now we have a ‘feature1’ branch that differs from the ‘develop’ branch. At this point we want to merge our changes into the ‘develop’ branch. We’ll create a pull request to notify our team members to review our changes and check whether they are ready for a merge.
In the AWS CodeCommit console, click Pull requests. Click Create pull request. On the next page select ‘develop’ as the destination branch and ‘feature1’ as the source branch. Click Compare. CodeCommit will check for merge conflicts and highlight whether the branches can be automatically merged using the fast-forward option, or whether a manual merge is necessary. A pull request can be created in both situations.
Figure 3: Create a pull request
After comparing the two branches, the CodeCommit console displays the information you’ll need in order to create the pull request. In the ‘Details’ section, the ‘Title’ for the pull request is mandatory, and you may optionally provide comments to your reviewers to explain the code change you have made and what you’d like them to review. In the ‘Notifications’ section, there is an option to set up notifications to notify subscribers of changes to your pull request. Notifications will be sent on creation of the pull request as well as for any pull request updates or comments. And finally, you can review the changes that make up this pull request. This includes both the individual commits (a pull request can contain one or more commits, available in the Commits tab) as well as the changes made to each file, i.e. the diff between the two branches referenced by the pull request, available in the Changes tab. After you have reviewed this information and added a title for your pull request, click the Create button. You will see a confirmation screen, as shown in Figure 4, indicating that your pull request has been successfully created, and can be merged without conflicts into the ‘develop’ branch.
Figure 4: Pull request confirmation page
Reviewing the pull request
Now let’s view the pull request from the perspective of the team lead. If you set up notifications for this CodeCommit repository, creating the pull request would have sent an email notification to the team lead, and he/she can use the links in the email to navigate directly to the pull request. In this example, sign in to the AWS CodeCommit console as the IAM user you’re using as the team lead, and click Pull requests. You will see the same information you did during creation of the pull request, plus a record of activity related to the pull request, as you can see in Figure 5.
Figure 5: Team lead reviewing the pull request
Commenting on the pull request
You now perform a thorough review of the changes and make a number of comments using the new pull request comment feature. To gain an overall perspective on the pull request, you might first go to the Commits tab and review how many commits are included in this pull request. Next, you might visit the Changes tab to review the changes, which displays the differences between the feature branch code and the develop branch code. At this point, you can add comments to the pull request as you work through each of the changes. Let’s go ahead and review the pull request. During the review, you can add review comments at three levels:
The overall pull request
A file within the pull request
An individual line within a file
The overall pull request
In the Changes tab near the bottom of the page you’ll see a ‘Comments on changes’ box. We’ll add comments here related to the overall pull request. Add your comments as shown in Figure 6 and click the Save button.
Figure 6: Pull request comment
A specific file in the pull request
Hovering your mouse over a filename in the Changes tab will cause a blue ‘comments’ icon to appear to the left of the filename. Clicking the icon will allow you to enter comments specific to this file, as in the example in Figure 7. Go ahead and add comments for one of the files changed by the developer. Click the Save button to save your comment.
Figure 7: File comment
A specific line in a file in the pull request
A blue ‘comments’ icon will appear as you hover over individual lines within each file in the pull request, allowing you to create comments against lines that have been added, removed or are unchanged. In Figure 8, you add comments against a line that has been added to the source code, encouraging the developer to review the naming standards. Go ahead and add line comments for one of the files changed by the developer. Click the Save button to save your comment.
Figure 8: Line comment
A pull request that has been commented on at all three levels will look similar to Figure 9. The pull request comment is shown expanded in the ‘Comments on changes’ section, while the comments at file and line level are shown collapsed. A ‘comment’ icon indicates that comments exist at file and line level. Clicking the icon will expand and show the comment. Since you are expecting the developer to make further changes based on your comments, you won’t merge the pull request at this stage, but will leave it open awaiting feedback. Each comment you made results in a notification being sent to the developer, who can respond to the comments. This is great for remote working, where developers and the team lead may be in different time zones.
Figure 9: Fully commented pull request
Adding a little complexity
A typical development team is going to be creating pull requests on a regular basis. It’s highly likely that the team lead will merge other pull requests into the ‘develop’ branch while pull requests on feature branches are in the review stage. This may result in a change to the ‘Mergable’ status of a pull request. Let’s add this scenario into the mix and check out how a developer will handle this.
To test this scenario, we could create a new pull request and ask the team lead to merge this to the ‘develop’ branch. But for the sake of simplicity we’ll take a shortcut. Clone your CodeCommit repo to a new folder, switch to the ‘develop’ branch, and make a change to one of the same files that were changed in your pull request. Make sure you change a line of code that was also changed in the pull request. Commit and push this back to CodeCommit. Since you’ve just changed a line of code in the ‘develop’ branch that has also been changed in the ‘feature1’ branch, the ‘feature1’ branch cannot be cleanly merged into the ‘develop’ branch. Your developer will need to resolve this merge conflict.
A developer reviewing the pull request would see the pull request now looks similar to Figure 10, with a ‘Resolve conflicts’ status rather than the ‘Mergable’ status it had previously (see Figure 5).
Figure 10: Pull request with merge conflicts
Reviewing the review comments
Once the team lead has completed his review, the developer will review the comments and make the suggested changes. As a developer, you’ll see the list of review comments made by the team lead in the pull request Activity tab, as shown in Figure 11. The Activity tab shows the history of the pull request, including commits and comments. You can reply to the review comments directly from the Activity tab, by clicking the Reply button, or you can do this from the Changes tab. The Changes tab shows the comments for the latest commit, as comments on previous commits may be associated with lines that have changed or been removed in the current commit. Comments for previous commits are available to view and reply to in the Activity tab.
In the Activity tab, use the shortcut link (which looks like this </>) to move quickly to the source code associated with the comment. In this example, you will make further changes to the source code to address the pull request review comments, so let’s go ahead and do this now. But first, you will need to resolve the ‘Resolve conflicts’ status.
Figure 11: Pull request activity
Resolving the ‘Resolve conflicts’ status
The ‘Resolve conflicts’ status indicates there is a merge conflict between the ‘develop’ branch and the ‘feature1’ branch. This will require manual intervention to restore the pull request back to the ‘Mergable’ state. We will resolve this conflict next.
Open a terminal or command line and check out the develop branch by running the following command:
git checkout develop
$ git checkout develop
Switched to branch 'develop'
Your branch is up-to-date with 'origin/develop'.
To incorporate the changes the team lead made to the ‘develop’ branch, merge the remote ‘develop’ branch with your local copy:
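(The command isn’t shown in the original; a plain git pull while the ‘develop’ branch is checked out accomplishes this.)
git pull
Then switch back to the ‘feature1’ branch: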
git checkout feature1
$ git checkout feature1
Switched to branch 'feature1'
Your branch is up-to-date with 'origin/feature1'.
Now merge the changes from the ‘develop’ branch into your ‘feature1’ branch:
git merge develop
$ git merge develop
Auto-merging src/main/java/com/amazon/helloworld/Main.java
CONFLICT (content): Merge conflict in src/main/java/com/amazon/helloworld/Main.java
Automatic merge failed; fix conflicts and then commit the result.
Yes, this fails. The file Main.java has been changed in both branches, resulting in a merge conflict that can’t be resolved automatically. However, Main.java will now contain markers that indicate where the conflicting code is, and you can use these to resolve the issues manually. Edit Main.java using your favorite IDE, and you’ll see it looks something like this:
package com.amazon.helloworld;

import java.util.*;

/**
 * This class prints a hello world message
 */

public class Main {
    public static void main(String[] args) {
<<<<<<< HEAD
        Date todaysdate = Calendar.getInstance().getTime();
        System.out.println("Hello, earthling. Today's date is: " + todaysdate);
=======
        System.out.println("Hello, earth");
>>>>>>> develop
    }
}
The code between HEAD and ‘===’ is the code the developer added in the ‘feature1’ branch (HEAD represents ‘feature1’ because this is the current checked out branch). The code between ‘===’ and ‘>>> develop’ is the code added to the ‘develop’ branch by the team lead. We’ll resolve the conflict by manually merging both changes, resulting in an updated Main.java:
package com.amazon.helloworld;

import java.util.*;

/**
 * This class prints a hello world message
 */

public class Main {
    public static void main(String[] args) {
        Date todaysdate = Calendar.getInstance().getTime();
        System.out.println("Hello, earth. Today's date is: " + todaysdate);
    }
}
After saving the change you can add and commit it to your local repo:
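(The exact commands aren’t reproduced in the original; staging and committing the resolved file typically looks like this.)
git add src/main/java/com/amazon/helloworld/Main.java
git commit -m 'merged develop into feature1 and resolved the conflict in Main.java'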
Now you are ready to address the comments made by the team lead. If you are no longer pointing to the ‘feature1’ branch, check out the ‘feature1’ branch by running the following command:
git checkout feature1
$ git checkout feature1
Branch feature1 set up to track remote branch feature1 from origin.
Switched to a new branch 'feature1'
Edit the source code in your favorite IDE and make the changes to address the comments. In this example, the developer has updated the source code as follows:
package com.amazon.helloworld;

import java.util.*;

/**
 * This class prints a hello world message
 *
 * @author Michael Edge
 * @see HelloEarth
 * @version 1.0
 */

public class Main {
    public static void main(String[] args) {
        Date todaysDate = Calendar.getInstance().getTime();
        System.out.println("Hello, earth. Today's date is: " + todaysDate);
    }
}
After saving the changes, commit and push to the CodeCommit ‘feature1’ branch as you did previously:
git commit -am 'updated based on review comments'
git push origin feature1
Responding to the reviewer
Now that you’ve fixed the code issues you will want to respond to the review comments. In the AWS CodeCommit console, check that your latest commit appears in the pull request Commits tab. You now have a pull request consisting of more than one commit. The pull request in Figure 12 has four commits, which originated from the following activities:
8th Nov: the original commit used to initiate this pull request
10th Nov, 3 hours ago: the commit by the team lead to the ‘develop’ branch, merged into our ‘feature1’ branch
10th Nov, 24 minutes ago: the commit by the developer that resolved the merge conflict
10th Nov, 4 minutes ago: the final commit by the developer addressing the review comments
Figure 12: Pull request with multiple commits
Let’s reply to the review comments provided by the team lead. In the Activity tab, reply to the pull request comment and save it, as shown in Figure 13.
Figure 13: Replying to a pull request comment
At this stage, your code has been committed and you’ve updated your pull request comments, so you are ready for a final review by the team lead.
Final review
The team lead reviews the code changes and comments made by the developer. As team lead, you own the ‘develop’ branch and it’s your decision whether to merge the changes in the pull request into the ‘develop’ branch. You can close the pull request with or without merging using the Merge and Close buttons at the bottom of the pull request page (see Figure 13). Clicking Close will allow you to add comments on why you are closing the pull request without merging. Merging will perform a fast-forward merge, incorporating the commits referenced by the pull request. Let’s go ahead and click the Merge button to merge the pull request into the ‘develop’ branch.
Figure 14: Merging the pull request
After merging a pull request, development of that feature is complete and the feature branch is no longer needed. It’s common practice to delete the feature branch after merging. CodeCommit provides a check box during merge to automatically delete the associated feature branch, as seen in Figure 14. Clicking the Merge button will merge the pull request into the ‘develop’ branch, as shown in Figure 15. This will update the status of the pull request to ‘Merged’, and will close the pull request.
Conclusion
This blog has demonstrated how pull requests can be used to request a code review, and enable reviewers to get a comprehensive summary of what is changing, provide feedback to the author, and merge the code into production. For more information on pull requests, see the documentation.
This post courtesy of Aaron Friedman, Healthcare and Life Sciences Partner Solutions Architect, AWS and Angel Pizarro, Genomics and Life Sciences Senior Solutions Architect, AWS
Precision medicine is tailored to individuals based on quantitative signatures, including genomics, lifestyle, and environment. It is often considered to be the driving force behind the next wave of human health. Through new initiatives and technologies such as population-scale genomics sequencing and IoT-backed wearables, researchers and clinicians in both commercial and public sectors are gaining new, previously inaccessible insights.
Many of these precision medicine initiatives are already happening on AWS. A few of these include:
PrecisionFDA – This initiative is led by the US Food and Drug Administration. The goal is to define the next-generation standard of care for genomics in precision medicine.
Deloitte ConvergeHEALTH – Gives healthcare and life sciences organizations the ability to analyze their disparate datasets on a singular real world evidence platform.
Central to many of these initiatives is genomics, which gives healthcare organizations the ability to establish a baseline for longitudinal studies. Due to its wide applicability in precision medicine initiatives—from rare disease diagnosis to improving outcomes of clinical trials—genomics data is growing globally at a rate that outpaces Moore’s law. Many expect these datasets to grow to be in the range of tens of exabytes by 2025.
Genomics data is also regularly re-analyzed by the community as researchers develop new computational methods or compare older data with newer genome references. These trends are driving innovations in data analysis methods and algorithms to address the massive increase of computational requirements.
Edico Genome, an AWS Partner Network (APN) Partner, has developed a novel solution that accelerates genomics analysis using field-programmable gate arrays, or FPGAs. Historically, Edico Genome deployed their FPGA appliances on-premises. When AWS announced the Amazon EC2 F1 FPGA-based instance family in December 2016, Edico Genome adopted a cloud-first strategy, became an F1 launch partner, and was one of the first partners to deploy FPGA-enabled applications on AWS.
On October 19, 2017, Edico Genome partnered with the Children’s Hospital of Philadelphia (CHOP) to demonstrate DRAGEN, their FPGA-accelerated genomic pipeline software, which can significantly reduce time-to-insight for patient genomes. Together they analyzed 1,000 genomes from the Center for Applied Genomics Biobank in the shortest time possible, setting a Guinness World Record for the fastest analysis of 1,000 whole human genomes, using 1,000 EC2 f1.2xlarge instances in a single AWS region. Not only were they able to analyze genomes at high throughput, they did so at an average cost of approximately $3 of AWS compute per whole human genome.
The version of DRAGEN that Edico Genome used for this analysis was also the same one used in the precisionFDA Hidden Treasures – Warm Up challenge, where they were one of the top performers in every assessment.
In the remainder of this post, we walk through the architecture used by Edico Genome, combining EC2 F1 instances and AWS Batch to achieve this milestone.
EC2 F1 instances and Edico’s DRAGEN
EC2 F1 instances provide access to programmable hardware-acceleration using FPGAs at a cloud scale. AWS customers use F1 instances for a wide variety of applications, including big data, financial analytics and risk analysis, image and video processing, engineering simulations, AR/VR, and accelerated genomics. Edico Genome’s FPGA-backed DRAGEN Bio-IT Platform is now integrated with EC2 F1 instances. You can access the accuracy, speed, flexibility, and low compute cost of DRAGEN through a number of third-party platforms, AWS Marketplace, and Edico Genome’s own platform. The DRAGEN platform offers a scalable, accelerated, and cost-efficient secondary analysis solution for a wide variety of genomics applications. Edico Genome also provides a highly optimized mechanism for the efficient storage of genomic data.
Scaling DRAGEN on AWS
Edico Genome used 1,000 EC2 F1 instances to help their customer, the Children’s Hospital of Philadelphia (CHOP), to process and analyze all 1,000 whole human genomes in parallel. They used AWS Batch to provision compute resources and orchestrate DRAGEN compute jobs across the 1,000 EC2 F1 instances. This solution successfully addressed the challenge of creating a scalable genomic processing pipeline that can easily scale to thousands of engines running in parallel.
Architecture
A simplified view of the architecture used for the analysis is shown in the following diagram:
DRAGEN’s portal uses Elastic Load Balancing and Auto Scaling groups to scale out the EC2 instances that submit jobs to AWS Batch.
Job metadata is stored in their Workflow Management (WFM) database, built on top of Amazon Aurora.
The DRAGEN Workflow Manager API submits jobs to AWS Batch.
These jobs are executed in the AWS Batch managed compute environment that is responsible for launching the EC2 F1 instances.
These jobs run as Docker containers that have the requisite DRAGEN binaries for whole genome analysis.
As each job runs, it retrieves and stores genomics data that is staged in Amazon S3.
The steps listed previously can also be bucketed into the following higher-level layers:
Workflow: Edico Genome used their Workflow Management API to orchestrate the submission of AWS Batch jobs. Metadata for the jobs (such as the S3 locations of the genomes, etc.) resides in the Workflow Management Database backed by Amazon Aurora.
Batch execution: AWS Batch launches EC2 F1 instances and coordinates the execution of DRAGEN jobs on these compute resources. AWS Batch enabled Edico to quickly and easily scale up to the full number of instances they needed as jobs were submitted. They also scaled back down as each job was completed, to optimize for both cost and performance.
Compute/job: Edico Genome stored their binaries in a Docker container that AWS Batch deployed onto each of the F1 instances, giving each instance the ability to run DRAGEN without the need to pre-install the core executables. The AWS based DRAGEN solution streams all genomics data from S3 for local computation and then writes the results to a destination bucket. They used an AWS Batch job role that specified the IAM permissions. The role ensured that DRAGEN only had access to the buckets or S3 key space it needed for the analysis. Jobs didn’t need to embed AWS credentials.
In the following sections, we dive deeper into several tasks that enabled Edico Genome’s scalable FPGA genome analysis on AWS:
Prepare your Amazon FPGA Image for AWS Batch
Create a Dockerfile and build your Docker image
Set up your AWS Batch FPGA compute environment
Prerequisites
In brief, you need a modern Linux distribution (kernel 3.10+), the Amazon ECS container agent, the awslogs log driver, and Docker configured on your image. There are additional recommendations in the Compute Resource AMI specification.
Preparing your Amazon FPGA Image for AWS Batch
You can use any Amazon Machine Image (AMI) or Amazon FPGA Image (AFI) with AWS Batch, provided that it meets the Compute Resource AMI specification. This gives you the ability to customize any workload by increasing the size of root or data volumes, adding instance stores, and connecting with the FPGA (F) and GPU (G and P) instance families.
Next, install the AWS CLI:
pip install awscli
Add any additional software required to interact with the FPGAs on the F1 instances.
As a starting point, AWS publishes an FPGA Developer AMI in the AWS Marketplace. It is based on a CentOS Linux image and includes pre-integrated FPGA development tools. It also includes the runtime tools required to develop and use custom FPGAs for hardware acceleration applications.
For more information about how to set up custom AMIs for your AWS Batch managed compute environments, see Creating a Compute Resource AMI.
Building your Dockerfile
There are two common methods for connecting to AWS Batch to run FPGA-enabled algorithms. The first method, which is the route Edico Genome took, involves storing your binaries in the Docker container itself and running that on top of an F1 instance with Docker installed. The following code example is what a Dockerfile to build your container might look like for this scenario.
# DRAGEN_EXEC Docker image generator --
# Run this Dockerfile from a local directory that contains the latest release of
# - Dragen RPM and Linux DMA Driver available from Edico
# - Edico's Dragen WFMS Wrapper files
FROM centos:centos7
RUN rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
# Install Basic packages needed for Dragen
RUN yum -y install \
perl \
sos \
coreutils \
gdb \
time \
systemd-libs \
bzip2-libs \
R \
ca-certificates \
ipmitool \
smartmontools \
rsync
# Install the Dragen RPM
RUN mkdir -m777 -p /var/log/dragen /var/run/dragen
ADD . /root
RUN rpm -Uvh /root/edico_driver*.rpm || true
RUN rpm -Uvh /root/dragen-aws*.rpm || true
# Auto generate the Dragen license
RUN /opt/edico/bin/dragen_lic -i auto
#########################################################
# Now install the Edico WFMS "Wrapper" functions
# Add development tools needed for some util
RUN yum groupinstall -y "Development Tools"
# Install necessary standard packages
RUN yum -y install \
dstat \
git \
python-devel \
python-pip \
time \
tree && \
pip install --upgrade pip && \
easy_install requests && \
pip install psutil && \
pip install python-dateutil && \
pip install constants && \
easy_install boto3
# Setup Python path used by the wrapper
RUN mkdir -p /opt/workflow/python/bin
RUN ln -s /usr/bin/python /opt/workflow/python/bin/python2.7
RUN ln -s /usr/bin/python /opt/workflow/python/bin/python
# Install d_haul and dragen_job_execute wrapper functions and associated packages
RUN mkdir -p /root/wfms/trunk/scheduler/scheduler
COPY scheduler/d_haul /root/wfms/trunk/scheduler/
COPY scheduler/dragen_job_execute /root/wfms/trunk/scheduler/
COPY scheduler/scheduler/aws_utils.py /root/wfms/trunk/scheduler/scheduler/
COPY scheduler/scheduler/constants.py /root/wfms/trunk/scheduler/scheduler/
COPY scheduler/scheduler/job_utils.py /root/wfms/trunk/scheduler/scheduler/
COPY scheduler/scheduler/logger.py /root/wfms/trunk/scheduler/scheduler/
COPY scheduler/scheduler/scheduler_utils.py /root/wfms/trunk/scheduler/scheduler/
COPY scheduler/scheduler/webapi.py /root/wfms/trunk/scheduler/scheduler/
COPY scheduler/scheduler/wfms_exception.py /root/wfms/trunk/scheduler/scheduler/
RUN touch /root/wfms/trunk/scheduler/scheduler/__init__.py
# Landing directory should be where DJX is located
WORKDIR "/root/wfms/trunk/scheduler/"
# Debug print of container's directories
RUN tree /root/wfms/trunk/scheduler
# Default behaviour. Over-ride with --entrypoint on docker run cmd line
ENTRYPOINT ["/root/wfms/trunk/scheduler/dragen_job_execute"]
CMD []
Note: Edico Genome’s custom Python wrapper functions for its Workflow Management System (WFMS) in the latter part of this Dockerfile should be replaced with functions that are specific to your workflow.
The second method is to install binaries and then use Docker as a lightweight connector between AWS Batch and the AFI. For example, this might be a route you would choose to use if you were provisioning DRAGEN from the AWS Marketplace.
In this case, the Dockerfile would not contain the installation of the binaries to run DRAGEN, but would contain any other packages necessary for job completion. When you run your Docker container, you enable Docker to access the underlying file system.
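A hedged sketch of what that invocation might look like follows; the image name, mount paths, and entry script are assumptions rather than Edico Genome’s actual setup, and the real DRAGEN install and FPGA device locations depend on your AMI:
docker run --rm --privileged \
  -v /opt/edico:/opt/edico \
  -v /var/lib/edico:/var/lib/edico \
  -v /scratch:/scratch \
  my-dragen-wrapper:latest ./run_analysis.sh
Running the container with --privileged (or with the specific FPGA device files mapped in) and bind-mounting the host directories is what gives the containerized wrapper access to the host’s DRAGEN installation.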
Connecting to AWS Batch
AWS Batch provisions compute resources and runs your jobs, choosing the right instance types based on your job requirements and scaling down resources as work is completed. AWS Batch users submit a job, based on a template or “job definition”, to an AWS Batch job queue.
Job queues are mapped to one or more compute environments that describe the quantity and types of resources that AWS Batch can provision. In this case, Edico created a managed compute environment that was able to launch 1,000 EC2 F1 instances across multiple Availability Zones in us-east-1. As jobs are submitted to a job queue, the service launches the required quantity and types of instances that are needed. As instances become available, AWS Batch then runs each job within appropriately sized Docker containers.
The Edico Genome workflow manager API submits jobs to an AWS Batch job queue. This job queue maps to an AWS Batch managed compute environment containing On-Demand F1 instances. In this section, you can set this up yourself.
To create the compute environment that DRAGEN can use:
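The individual setup steps aren’t reproduced above. As a rough sketch (the names, service role, AMI ID, subnets, and security group below are placeholders, not Edico Genome’s actual values), a managed compute environment limited to F1 instances, plus a job queue that maps to it, can be created with the AWS CLI:

aws batch create-compute-environment \
  --compute-environment-name dragen-f1-ce \
  --type MANAGED \
  --state ENABLED \
  --service-role arn:aws:iam::123456789012:role/AWSBatchServiceRole \
  --compute-resources '{
    "type": "EC2",
    "minvCpus": 0,
    "desiredvCpus": 0,
    "maxvCpus": 8000,
    "instanceTypes": ["f1.2xlarge"],
    "imageId": "ami-0123456789abcdef0",
    "subnets": ["subnet-aaaa1111", "subnet-bbbb2222"],
    "securityGroupIds": ["sg-cccc3333"],
    "instanceRole": "ecsInstanceRole"
  }'

aws batch create-job-queue \
  --job-queue-name dragen-queue \
  --state ENABLED \
  --priority 1 \
  --compute-environment-order order=1,computeEnvironment=dragen-f1-ce

A maxvCpus value of 8,000 corresponds to 1,000 f1.2xlarge instances at 8 vCPUs each.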
An f1.2xlarge EC2 instance contains one FPGA, eight vCPUs, and 122-GiB RAM. As DRAGEN requires an entire FPGA to run, Edico Genome needed to ensure that only one analysis at a time executed on an instance. By using the f1.2xlarge vCPUs and memory as a proxy in their AWS Batch job definition, Edico Genome could ensure that only one job runs on an instance at a time. Here’s what that looks like in the AWS CLI:
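The CLI example itself isn’t reproduced above; a sketch of a job definition that reserves essentially all of an f1.2xlarge (the ECR image URI and memory value are illustrative), together with a job submission, might look like the following:

aws batch register-job-definition \
  --job-definition-name dragen-wgs \
  --type container \
  --container-properties '{
    "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/dragen:latest",
    "vcpus": 8,
    "memory": 120000,
    "privileged": true
  }'

aws batch submit-job \
  --job-name dragen-sample-genome \
  --job-queue dragen-queue \
  --job-definition dragen-wgs

The submit-job call returns a jobId in its output.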
You can query the status of your DRAGEN job with the following command:
aws batch describe-jobs --jobs <the job ID from the above command>
The logs for your job are written to the /aws/batch/job CloudWatch log group.
Conclusion
In this post, we demonstrated how to set up an environment with AWS Batch that can run DRAGEN on EC2 F1 instances at scale. If you followed the walkthrough, you’ve replicated much of the architecture Edico Genome used to set the Guinness World Record.
There are several ways in which you can harness the computational power of DRAGEN to analyze genomes at scale. First, DRAGEN is available through several different genomics platforms, such as the DNAnexus Platform. DRAGEN is also available on the AWS Marketplace. You can apply the architecture presented in this post to build a scalable solution that is both performant and cost-optimized.
For more information about how AWS Batch can facilitate genomics processing at scale, be sure to check out our aws-batch-genomics GitHub repo on high-throughput genomics on AWS.
Update from March 28, 2018: We updated the Amazon Trust Services table by replacing an out-of-date value with a new value.
Transport Layer Security (TLS, formerly called Secure Sockets Layer [SSL]) is essential for encrypting information that is exchanged on the internet. For example, Amazon.com uses TLS for all traffic on its website, and AWS uses it to secure calls to AWS services.
An electronic document called a certificate verifies the identity of the server when creating such an encrypted connection. The certificate helps establish proof that your web browser is communicating securely with the website that you typed in your browser’s address field. Certificate Authorities, also known as CAs, issue certificates to specific domains. When a domain presents a certificate that is issued by a trusted CA, your browser or application knows it’s safe to make the connection.
In January 2016, AWS launched AWS Certificate Manager (ACM), a service that lets you easily provision, manage, and deploy SSL/TLS certificates for use with AWS services. These certificates are available for no additional charge through Amazon’s own CA: Amazon Trust Services. For browsers and other applications to trust a certificate, the certificate’s issuer must be included in the browser’s trust store, which is a list of trusted CAs. If the issuing CA is not in the trust store, the browser will display an error message (see an example) and applications will show an application-specific error. To ensure the ubiquity of the Amazon Trust Services CA, AWS purchased the Starfield Services CA, a root that is found in most browsers and has been valid since 2005. This means you shouldn’t have to take any action to use the certificates issued by Amazon Trust Services.
AWS has been offering free certificates to AWS customers from the Amazon Trust Services CA. Now, AWS is in the process of moving certificates for services such as Amazon EC2 and Amazon DynamoDB to use certificates from Amazon Trust Services as well. Most software doesn’t need to be changed to handle this transition, but there are exceptions. In this blog post, I show you how to verify that you are prepared to use the Amazon Trust Services CA.
How to tell if the Amazon Trust Services CAs are in your trust store
The following table lists the Amazon Trust Services certificates. To verify that these certificates are in your browser’s trust store, click each Test URL in the following table to verify that it works for you. When a Test URL does not work, it displays an error similar to this example.
* Note: Amazon doesn’t own this root and doesn’t have a test URL for it. The certificate can be downloaded from here.
You can calculate the SHA-256 hash of Subject Public Key Information as follows. With the PEM-encoded certificate stored in certificate.pem, run the following openssl commands:
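(The exact commands from the original post aren’t reproduced here; one common pipeline that produces the base64-encoded SHA-256 hash of the SubjectPublicKeyInfo is the following.)

openssl x509 -in certificate.pem -noout -pubkey | \
  openssl pkey -pubin -outform der | \
  openssl dgst -sha256 -binary | \
  openssl enc -base64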
As an example, with the Starfield Class 2 Certification Authority self-signed cert in a PEM encoded file sf-class2-root.crt, you can use the following openssl commands:
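(Again, the original commands aren’t reproduced; the same pipeline applied to the downloaded root produces the value to compare against the hash published for that root.)

openssl x509 -in sf-class2-root.crt -noout -pubkey | \
  openssl pkey -pubin -outform der | \
  openssl dgst -sha256 -binary | \
  openssl enc -base64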
What to do if the Amazon Trust Services CAs are not in your trust store
If your tests of any of the Test URLs failed, you must update your trust store. The easiest way to update your trust store is to upgrade the operating system or browser that you are using.
You will find the Amazon Trust Services CAs in the following operating systems (release dates are in parentheses):
Microsoft Windows versions that have January 2005 or later updates installed, Windows Vista, Windows 7, Windows Server 2008, and newer versions
Mac OS X 10.4 with Java for Mac OS X 10.4 Release 5, Mac OS X 10.5 and newer versions
Red Hat Enterprise Linux 5 (March 2007), 6, and 7, and CentOS 5, 6, and 7
Ubuntu 8.10
Debian 5.0
Amazon Linux (all versions)
Java 1.4.2_12, Java 5 update 2, and all newer versions, including Java 6, Java 7, and Java 8
All modern browsers trust Amazon’s CAs. You can update the certificate bundle in your browser simply by updating your browser. You can find instructions for updating the following browsers on their respective websites:
If your application is using a custom trust store, you must add the Amazon root CAs to your application’s trust store. The instructions for doing this vary based on the application or platform. Please refer to the documentation for the application or platform you are using.
AWS SDKs and CLIs
Most AWS SDKs and CLIs are not impacted by the transition to the Amazon Trust Services CA. If you are using a version of the Python AWS SDK or CLI released before October 29, 2013, you must upgrade. The .NET, Java, PHP, Go, JavaScript, and C++ SDKs and CLIs do not bundle any certificates, so their certificates come from the underlying operating system. The Ruby SDK has included at least one of the required CAs since June 10, 2015. Before that date, the Ruby V2 SDK did not bundle certificates.
Certificate pinning
If you are using a technique called certificate pinning to lock down the CAs you trust on a domain-by-domain basis, you must adjust your pinning to include the Amazon Trust Services CAs. Certificate pinning helps defend you from an attacker using misissued certificates to fool an application into creating a connection to a spoofed host (an illegitimate host masquerading as a legitimate host). The restriction to a specific, pinned certificate is made by checking that the certificate issued is the expected certificate. This is done by checking that the hash of the certificate public key received from the server matches the expected hash stored in the application. If the hashes do not match, the code stops the connection.
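As an illustration of that check, and not production code, the sketch below computes the SPKI hash of a server’s certificate and compares it against a pinned set; the host name and the pin value are placeholders, and the third-party cryptography package is assumed to be installed:

import base64
import hashlib
import ssl

from cryptography import x509
from cryptography.hazmat.primitives.serialization import Encoding, PublicFormat

# Placeholder pin set: in practice you would store the published SPKI hashes
# of the CAs you trust (for example, the Amazon Trust Services roots).
PINNED_SPKI_HASHES = {"replace-with-a-pinned-base64-spki-hash="}

def spki_sha256_b64(host, port=443):
    # Fetch the server's certificate and hash its SubjectPublicKeyInfo.
    pem = ssl.get_server_certificate((host, port))
    cert = x509.load_pem_x509_certificate(pem.encode())
    spki = cert.public_key().public_bytes(Encoding.DER, PublicFormat.SubjectPublicKeyInfo)
    return base64.b64encode(hashlib.sha256(spki).digest()).decode()

# A fuller implementation would check every certificate in the chain, not just
# the leaf; that is how pinning to a CA rather than a single certificate works.
if spki_sha256_b64("example.amazonaws.com") not in PINNED_SPKI_HASHES:
    raise ssl.SSLError("Certificate pin mismatch; refusing to connect")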
AWS recommends against using certificate pinning because it introduces a potential availability risk. If the certificate to which you pin is replaced, your application will fail to connect. If your use case requires pinning, we recommend that you pin to a CA rather than to an individual certificate. If you are pinning to an Amazon Trust Services CA, you should pin to all CAs shown in the table earlier in this post.
If you have comments about this post, submit them in the “Comments” section below. If you have questions about this post, start a new thread on the ACM forum.
– Jonathan