Tag Archives: AWS Batch

Processing batch jobs quickly, cost-efficiently, and reliably with Amazon EC2 On-Demand and Spot Instances

Post Syndicated from Bala Thekkedath original https://aws.amazon.com/blogs/compute/processing-batch-jobs-quickly-cost-efficiently-and-reliably-with-amazon-ec2-on-demand-and-spot-instances/

This post is contributed by Alex Kimber, Global Solutions Architect

No one asks for their High Performance Computing (HPC) jobs to take longer, cost more, or have more variability in the time to get results. Fortunately, you can combine Amazon EC2 and Amazon EC2 Auto Scaling to make the delivery of batch workloads fast, cost-efficient, and reliable. Spot Instances offer spare AWS compute power at a considerable discount. Customers such as Yelp, NASA, and FINRA use them to reduce costs and get results faster.

This post outlines an approach that combines On-Demand Instances and Spot Instances to balance a predictable delivery of HPC results with an opportunistic approach to cost optimization.

 

Prerequisites

This approach will be demonstrated via a simple batch-processing environment with the following components:

  • A producer Python script to generate batches of tasks to process. You can develop this script in the AWS Cloud9 development environment. This solution also uses the environment to run the script and generate tasks.
  • An Amazon SQS queue to manage the tasks.
  • A consumer Python script to take incomplete tasks from the queue, simulate work, and then remove them from the queue after they’re complete (a minimal sketch appears after this list).
  • Amazon EC2 Auto Scaling groups to model scenarios.
  • Amazon CloudWatch alarms to trigger the Auto Scaling groups and detect whether the queue is empty. The EC2 instances run the consumer script in a loop on startup.
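The original post does not reproduce the consumer script, so the following is only a minimal sketch of what it might look like, assuming a hypothetical queue URL and a simulated unit of work; it is illustrative, not the exact script used in the tests below.

import time
import boto3

# Hypothetical queue URL; substitute the URL of your own SQS queue.
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/batch-tasks"

sqs = boto3.client("sqs")

def process(body):
    # Simulate roughly eight minutes of work per task on one vCPU.
    time.sleep(8 * 60)

while True:
    # Long-poll for a single task.
    response = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20
    )
    for message in response.get("Messages", []):
        process(message["Body"])
        # Remove the task from the queue only after it completes.
        sqs.delete_message(
            QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"]
        )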

 

Testing On-Demand Instances

In this scenario, an HPC batch of 6,000 tasks must complete within five hours. Each task takes eight minutes to complete on a single vCPU.

A simple approach to meeting the target is to provision 160 vCPUs using 20 c5.2xlarge On-Demand Instances. Each of the instances should complete 60 tasks per hour, completing the batch in approximately five hours. This approach provides an adequate level of predictability. You can test this approach with a simple Auto Scaling group configuration, set to create 20 c5.2xlarge instances if the queue has any pending visible messages. As expected, the batch takes approximately five hours, as shown in the following screenshot.

In the Ireland Region, each c5.2xlarge On-Demand Instance costs $0.384 per hour, so running 20 instances for five hours gives a batch total of $38.40.
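The sizing and cost arithmetic above can be checked with a few lines of Python; the figures come from the post, only the calculation is new.

tasks = 6000
task_minutes = 8           # minutes per task on one vCPU
vcpus_per_instance = 8     # c5.2xlarge
instances = 20
price_per_hour = 0.384     # c5.2xlarge On-Demand, Ireland

tasks_per_instance_hour = vcpus_per_instance * 60 / task_minutes   # 60 tasks/hour
hours = tasks / (instances * tasks_per_instance_hour)              # 5.0 hours
total_cost = instances * hours * price_per_hour                    # 38.40
print(hours, total_cost)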

 

Testing On-Demand and Spot Instances

The alternative approach also provisions sufficient On-Demand Instances to meet the target time, in this case 20 instances. This gives confidence that you can meet the five-hour batch target regardless of what other capacity you add.

You can then configure the Auto Scaling group to also add a number of Spot Instances. These additional instances aim to deliver the results at a lower cost and allow the batch to complete much earlier than it otherwise would. When the queue is empty, the configuration automatically terminates all of the instances involved to prevent further charges. This example configures the Auto Scaling group to have 80 instances in total: 20 On-Demand Instances and 60 Spot Instances. Selecting multiple different instance types is a good strategy for securing Spot capacity through diversification.

Spot Instances occasionally experience interruptions when AWS must reclaim the capacity, with a two-minute warning. You can handle this gracefully by configuring your batch processor code to react to the interruption, for example by checkpointing progress to a data store. This example sets the visibility timeout on the SQS queue to nine minutes, so SQS re-queues any task that doesn’t complete in that time.
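A minimal sketch of how the consumer could watch for the two-minute interruption notice and stop taking new work follows. The metadata endpoint is the standard EC2 Spot instance-action path; the checkpointing hook mentioned in the comment is a hypothetical placeholder, not part of the original example.

import urllib.error
import urllib.request

INSTANCE_ACTION_URL = (
    "http://169.254.169.254/latest/meta-data/spot/instance-action"
)

def interruption_pending():
    # The endpoint returns 404 until a Spot interruption is scheduled.
    try:
        urllib.request.urlopen(INSTANCE_ACTION_URL, timeout=1)
        return True
    except (urllib.error.HTTPError, urllib.error.URLError):
        return False

# Inside the consumer loop: stop pulling new tasks once an interruption is
# scheduled. Any in-flight task that cannot finish simply becomes visible
# again in SQS after the nine-minute visibility timeout; a checkpoint()
# hook (hypothetical) could save partial progress before shutdown.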

To test the impact of the new configuration, another 6,000 tasks are submitted to the SQS queue. The Auto Scaling group quickly provisions 20 On-Demand and 60 Spot Instances.

The instances then quickly set to work on the queue.

The batch completes in approximately 30 minutes, which is a significant improvement. This result is due to the additional Spot Instance capacity, which gave a total of 2,140 vCPUs.

The batch used the following instances for 30 minutes.

 

Instance Type | Provisioning | Host Count | Hourly Instance Cost | Total 30-minute batch cost
c5.18xlarge   | Spot         | 15         | $1.2367              | $9.2753
c5.2xlarge    | Spot         | 22         | $0.1547              | $1.7017
c5.4xlarge    | Spot         | 12         | $0.2772              | $1.6632
c5.9xlarge    | Spot         | 11         | $0.6239              | $3.4315
c5.2xlarge    | On-Demand    | 13         | $0.3840              | $2.4960
c5.4xlarge    | On-Demand    | 3          | $0.7680              | $1.1520
c5.9xlarge    | On-Demand    | 4          | $1.7280              | $3.4560

The total cost is $23.18, which is approximately 60 percent of the On-Demand cost and allows you to compute the batch 10 times faster. This example also shows no interruptions to the Spot Instances.

 

Summary

This post demonstrated that by combining On-Demand and Spot Instances you can improve the performance of a loosely coupled HPC batch workload without compromising on the predictability of runtime. This approach balances reliability with improved performance while reducing costs. The use of Auto Scaling groups and CloudWatch alarms makes the solution largely automated, responding to demand and provisioning and removing capacity as required.

Optimizing for cost, availability and throughput by selecting your AWS Batch allocation strategy

Post Syndicated from Bala Thekkedath original https://aws.amazon.com/blogs/compute/optimizing-for-cost-availability-and-throughput-by-selecting-your-aws-batch-allocation-strategy/

This post is contributed by Steve Kendrex, Senior Technical Product Manager, AWS Batch

 

Introduction

 

AWS offers a broad range of instances that are advantageous for batch workloads. The scale and provisioning speed of AWS compute instances allow you to get up and running at peak capacity in minutes without paying for downtime. Today, I’m pleased to introduce allocation strategies: a significant new capability in AWS Batch that makes provisioning compute resources flexible and simple. In this blog post, I explain how the AWS Batch allocation strategies work, when you should use them for your workload, and provide an example CloudFormation script. This helps you get started on building the Compute Environment (CE) most appropriate to your workloads.

Overview

AWS Batch is a fully managed, cloud-native batch scheduler. It manages the queuing and scheduling of your batch jobs, and the resources required to run your jobs. One of AWS Batch’s great strengths is the ability to manage instance provisioning as your workload requirements and budget needs change. AWS Batch takes advantage of AWS’s broad base of compute types. For example, you can launch compute-optimized or memory-optimized instances to handle different workload types, without having to worry about building a cluster to meet peak demand.

Previously, AWS Batch had a cost-controlling approach to managing compute instances for your workloads. The service chose an instance that was the best fit for your jobs based on vCPU, memory, and GPU requirements, at the lowest cost. Now, the newly added allocation strategies provide flexibility. They allow AWS Batch to consider capacity and throughput in addition to cost when provisioning your instances. This lets you prioritize differently when launching instances depending on your workloads’ needs, such as controlling cost, maximizing throughput, or minimizing Amazon EC2 Spot Instance interruption rates.

There are now three instance allocation strategies from which to choose when creating an AWS Batch Compute Environment (CE). They are:

1. Spot Capacity Optimized

2. Best Fit Progressive

3. Best Fit

 

Spot Capacity Optimized

As the name implies, the Spot capacity optimized strategy is only available when launching Spot CEs in AWS Batch. In fact, I recommend the Spot capacity optimized strategy for most of your interruptible workloads running on Spot today. This strategy takes advantage of the recently released EC2 Auto Scaling and EC2 Fleet capacity optimized strategy. Next, I examine how this strategy behaves in AWS Batch.

Let’s say you’re running a simulation workload in AWS Batch. Your workload is appropriate for Spot (see this whitepaper to determine whether it is), so you want to take advantage of the savings you can glean from using Spot. However, you also want to minimize your Spot interruption rate, so you’ve followed the Spot best practices: your jobs can run across multiple instance types and multiple Availability Zones. When creating your Spot CE in AWS Batch, enter all the instance types with which your workload is compatible in the instance field, or select ‘optimal’, which allows AWS Batch to choose from among the M, C, and R instance families. The image below shows how this appears in the console:

AWS Batch console with SPOT_CAPACITY_OPTIMIZED selected

When evaluating your workload, AWS Batch selects from the allowed instance types specified in the compute resources parameter of your Spot CE that are capable of running your jobs. From those, AWS Batch calculates the assortment of instance types that have the most Spot capacity. AWS Batch then launches those instances on your behalf, and runs your jobs when those instances are available. This strategy gives you access to AWS compute resources at a fraction of the On-Demand cost. The Spot capacity optimized strategy works whether you’re trying to launch hundreds of thousands (or a million!) of vCPUs in Spot, or simply trying to lower your chance of interruption. Additionally, AWS Batch manages your instance pool to meet the capacity needed to run your workload as time passes.

For example, as your workloads run, demand in an Availability Zone may shift. This might lead to several of your instances being reclaimed. In that event, AWS Batch automatically attempts to scale instances of a different type based on the deepest capacity pools. Assuming you set a retry attempt count, your jobs then automatically retry. AWS Batch continues scaling new instances until either it meets the desired capacity, or it runs out of instance types to launch from those provided. That is why I recommend that you give AWS Batch as many instance types and families as possible to choose from when running Spot capacity optimized. Additional detail on behavior can be found in the capacity optimized documentation.
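The retry attempt count mentioned above is set on the job itself. A hedged boto3 sketch of submitting a job with retries follows; the queue and job definition names are placeholders, not resources from this post.

import boto3

batch = boto3.client("batch")

# Placeholder names; substitute your own queue and job definition.
batch.submit_job(
    jobName="simulation-task",
    jobQueue="spot-capacity-optimized-queue",
    jobDefinition="simulation-job-def",
    # Retry up to 3 times so work interrupted on reclaimed Spot capacity
    # is rescheduled onto newly launched instances.
    retryStrategy={"attempts": 3},
)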

To launch a Spot capacity optimized CE, follow these steps:

1. Navigate to the AWS Batch console.

2. Create a new Compute Environment.

3. Select “Spot Capacity Optimized” in the Allocation Strategy field.

Alternatively, you can use the CreateComputeEnvironment API and pass SPOT_CAPACITY_OPTIMIZED as the allocation strategy. In a CloudFormation template, this looks like the following:

…
"TestAllocationStrategyCE": {
    "Type": "AWS::Batch::ComputeEnvironment",
    "Properties": {
        "State": "ENABLED",
        "Type": "MANAGED",
        "ComputeResources": {
            "Subnets": [
                {"Ref": "TestSubnet"}
            ],
            "InstanceRole": {
                "Ref": "TestIamInstanceProfile"
            },
            "MinvCpus": 0,
            "InstanceTypes": [
                "optimal"
            ],
            "SecurityGroupIds": [
                {"Ref": "TestSecurityGroup"}
            ],
            "DesiredvCpus": 0,
            "MaxvCpus": 12,
            "AllocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
            "Type": "EC2"
        },
        "ServiceRole": {
            "Ref": "TestAWSBatchServiceRole"
        }
    }
},
…

Once you follow these steps, your Spot capacity optimized CE should be up and running.
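If you prefer calling the API directly rather than using CloudFormation, a hedged boto3 sketch might look like the following. The subnet, security group, and role ARNs are placeholders, the compute resources type is set to SPOT, and a Spot Fleet role is included since Spot CEs commonly require one; adjust to your account.

import boto3

batch = boto3.client("batch")

# All ARNs and IDs below are placeholders for your own resources.
batch.create_compute_environment(
    computeEnvironmentName="spot-capacity-optimized-ce",
    type="MANAGED",
    state="ENABLED",
    computeResources={
        "type": "SPOT",
        "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
        "minvCpus": 0,
        "desiredvCpus": 0,
        "maxvCpus": 256,
        "instanceTypes": ["optimal"],
        "subnets": ["subnet-0123456789abcdef0"],
        "securityGroupIds": ["sg-0123456789abcdef0"],
        "instanceRole": "arn:aws:iam::111122223333:instance-profile/ecsInstanceRole",
        "spotIamFleetRole": "arn:aws:iam::111122223333:role/AmazonEC2SpotFleetRole",
    },
    serviceRole="arn:aws:iam::111122223333:role/service-role/AWSBatchServiceRole",
)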

 

Best Fit Progressive

Imagine you have a time-sensitive machine learning workload that is very compute-intensive. You want to run this workload on C5 instances because you know that those have a preferable vCPU-to-memory ratio for your jobs. In a pinch, however, you know that M5 instances can run your workload perfectly well. You’re happy to take advantage of Spot prices, but you also need a base level of throughput, so you run part of the workload on On-Demand Instances. In this case, I recommend the best fit progressive strategy. This strategy is available in both On-Demand and Spot CEs, and I recommend it for most On-Demand workloads. The best fit progressive strategy lets AWS Batch choose the best fit instance for your workload (based on your jobs’ vCPU and memory requirements). In this context, “best fit” means AWS Batch provisions the smallest number of instances capable of running your jobs at the lowest cost.

Sometimes, AWS Batch cannot provision enough of the best fit instances to meet your capacity. When this is the case, AWS Batch progressively looks for the next best fit instance type from what you specified in the compute resources parameter. Generally, AWS Batch attempts to spin up different instance sizes within the same family first, because it has already determined that the family’s vCPU and memory ratio fits your workload. If it still cannot find enough instances to meet your capacity, AWS Batch launches instances from a different family. These attempts continue until capacity is met, or until it runs out of available instances from which to select.

To create a best fit progressive CE, follow the steps detailed in the Spot capacity optimized strategy section. However, specify the strategy BEST_FIT_PROGRESSIVE when creating a CE, for example:


…
{
    "Ref": "TestIamInstanceProfile"
},
"MinvCpus": 0,
"InstanceTypes": [
    "optimal"
],
"SecurityGroupIds": [
    {"Ref": "TestSecurityGroup"}
],
"DesiredvCpus": 0,
"MaxvCpus": 12,
"AllocationStrategy": "BEST_FIT_PROGRESSIVE",
"Type": "EC2"
},
"ServiceRole": {
    "Ref": "TestAWSBatchServiceRole"
}
…

Important note: you can always restrict AWS Batch’s ability to launch instances by using the max vCPUs setting in your CE. With the best fit progressive and Spot capacity optimized strategies, AWS Batch may go above max vCPUs to meet your capacity requirements. In this event, it will never exceed max vCPUs by more than a single instance (for example, no more than one instance from among those specified in your CE compute resources parameter).

 

How to Combine Strategies

You can combine strategies using separate AWS Batch Compute Environments. Let’s take the case I mentioned earlier: you’re happy to take advantage of Spot prices, but you want a base level of throughput for your time-sensitive workloads.

This diagram shows an On-Demand CE with a secondary Spot CE, attached to the same queue

 

In this case, you can create two AWS Batch CEs:

1. Create an On-Demand CE that uses the best fit progressive strategy.

2. Set its max vCPUs at the level of throughput that is necessary for your workload.

3. Create a Spot CE using the Spot capacity optimized strategy.

4. Attach both CEs to your job queue, with the On-Demand CE higher in order. Once you start submitting jobs to your queue, AWS Batch spins up your On-Demand CE first and starts placing jobs.

If the On-Demand CE reaches its max vCPUs limit, AWS Batch spins up instances in the next CE, in this case your Spot CE, and places any additional jobs there. AWS Batch continues to place jobs on both CEs until the queue is empty.
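A hedged boto3 sketch of attaching the two CEs to one queue in that order follows; the CE and queue names are placeholders for the environments described above.

import boto3

batch = boto3.client("batch")

# CE names are placeholders for the two environments described above.
batch.create_job_queue(
    jobQueueName="mixed-ondemand-spot-queue",
    state="ENABLED",
    priority=100,
    computeEnvironmentOrder=[
        # On-Demand CE first: it fills up to its max vCPUs...
        {"order": 1, "computeEnvironment": "ondemand-best-fit-progressive-ce"},
        # ...then additional jobs spill over to the Spot CE.
        {"order": 2, "computeEnvironment": "spot-capacity-optimized-ce"},
    ],
)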

Please see this repository for sample CloudFormation code to replicate this environment. Or, click here for more examples of leveraging Spot with AWS Batch.

 

Best Fit

Imagine you have a well-defined genomics sequencing workload. You know that this workload performs best on M5 instances, and you run it On-Demand because it is not interruptible. You’ve run this workload on AWS Batch before and you’re happy with its current behavior. You’re willing to trade off occasional capacity constraints in return for strict control over cost. In this case, the best fit strategy may be a good option. This strategy used to be AWS Batch’s only behavior: it examines the queue and picks the best fit instance type and size for the workload. As described earlier, best fit to AWS Batch means the smallest number of instances capable of running the workload, at the lowest cost. In general, we recommend the best fit strategy only when you want the lowest cost for your instances, and you’re willing to trade throughput and availability for it.

Note: AWS Batch will not launch instances above max vCPUs while using the best fit strategy. The configuration of a best fit CE looks like the following:

…
{
    "Ref": "TestIamInstanceProfile"
},
"MinvCpus": 0,
"InstanceTypes": [
    "optimal"
],
"SecurityGroupIds": [
    {"Ref": "TestSecurityGroup"}
],
"DesiredvCpus": 0,
"MaxvCpus": 12,
"AllocationStrategy": "BEST_FIT",
"Type": "EC2"
},
"ServiceRole": {
    "Ref": "TestAWSBatchServiceRole"
}
…

Important Note for AWS Batch Allocation Strategies with Spot Instances:

You always have the option to set a percentage of the On-Demand price when creating a Spot CE. When you set this percentage, AWS Batch only launches instances that have Spot prices lower than the lowest per-unit-hour instance. In general, setting a percentage of the On-Demand price lowers your availability, and should only be used if you want strict cost controls. If you want to enjoy the cost savings of Spot with better availability, I recommend that you do not set a percentage of the On-Demand price.
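For reference, this limit corresponds to the bidPercentage field in the compute resources of a Spot CE. A hedged sketch of where it would appear follows; the values are illustrative only.

# Fragment of the computeResources dict for a Spot CE; 80 means AWS Batch
# only launches instances whose Spot price is below 80% of On-Demand.
compute_resources = {
    "type": "SPOT",
    "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
    "bidPercentage": 80,
    # ...remaining fields (instanceTypes, subnets, roles) as shown earlier
}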

Conclusion

With these new allocation strategies, you now have much greater flexibility to control how AWS Batch provisions your instances. This allows you to make better throughput and cost trade-offs depending on the sensitivity of your workload. To learn more about how these strategies behave, please visit the AWS Batch documentation. Feel free to experiment with AWS Batch on your own to get an idea of how they help you run your specific workload.

 

Thanks to Chad Schmutzer for his support on the CloudFormation template.

Leveraging Elastic Fabric Adapter to run HPC and ML Workloads on AWS Batch

Post Syndicated from Bala Thekkedath original https://aws.amazon.com/blogs/compute/leveraging-efa-to-run-hpc-and-ml-workloads-on-aws-batch/


 This post is contributed by  Sean Smith, Software Development Engineer II, AWS ParallelCluster & Arya Hezarkhani, Software Development Engineer II, AWS Batch and HPC

 

On August 2, 2019, AWS Batch announced support for Elastic Fabric Adapter (EFA). This enables you to run highly performant, distributed high performance computing (HPC) and machine learning (ML) workloads by using AWS Batch’s managed resource provisioning and job scheduling.

EFA is a network interface for Amazon EC2 instances that enables you to run applications requiring high levels of inter-node communication at scale on AWS. Its custom-built operating system (OS) bypass hardware interface enhances the performance of inter-instance communications, which is critical to scaling these applications. With EFA, HPC applications using the Message Passing Interface (MPI) and ML applications using the NVIDIA Collective Communications Library (NCCL) can scale to thousands of cores or GPUs. As a result, you get the application performance of on-premises HPC clusters with the on-demand elasticity and flexibility of the AWS Cloud.

AWS Batch is a cloud-native batch scheduler that manages instance provisioning and job scheduling. AWS Batch automatically provisions instances according to job specifications, with the appropriate placement group, networking configurations, and any user-specified file system. It also automatically sets up the EFA interconnect to the instances it launches, which you specify through a single launch template parameter.

In this post, we walk through the setup of EFA on AWS Batch and run the NAS Parallel Benchmark (NPB), a benchmark suite that evaluates the performance of parallel supercomputers, using the open source implementation of MPI, OpenMPI.

 

Prerequisites

This walk-through assumes:

 

Configuring your compute environment

First, configure your compute environment to launch instances with the EFA device.

Creating an EC2 placement group

The first step is to create a cluster placement group. This is a logical grouping of instances within a single Availability Zone. The chief benefit of a cluster placement group is non-blocking, non-oversubscribed, fully bi-sectional network connectivity. Use a Region that supports EFA—currently, that is us-east-1, us-east-2, us-west-2, and eu-west-1. Run the following command:

$ aws ec2 create-placement-group --group-name "efa" --strategy "cluster" --region [your-region]

Creating an EC2 launch template

Next, create a launch template that contains a user-data script to install EFA libraries onto the instance. Launch templates enable you to store launch parameters so that you do not have to specify them every time you launch an instance. This will be the launch template used by AWS Batch to scale the necessary compute resources in your AWS Batch Compute Environment.

First, encode the user data in base64. This example uses the CLI utility base64 to do so.

 

$ echo "MIME-Version: 1.0
Content-Type: multipart/mixed; boundary=\"==MYBOUNDARY==\"

--==MYBOUNDARY==
Content-Type: text/cloud-boothook; charset=\"us-ascii\"

cloud-init-per once yum_wget yum install -y wget

cloud-init-per once wget_efa wget -q --timeout=20 https://s3-us-west-2.amazonaws.com/aws-efa-installer/aws-efa-installer-latest.tar.gz -O /tmp/aws-efa-installer-latest.tar.gz

cloud-init-per once tar_efa tar -xf /tmp/aws-efa-installer-latest.tar.gz -C /tmp

pushd /tmp/aws-efa-installer
cloud-init-per once install_efa ./efa_installer.sh -y
pop /tmp/aws-efa-installer

cloud-init-per once efa_info /opt/amazon/efa/bin/fi_info -p efa

--==MYBOUNDARY==--" | base64

 

Save the base64-encoded output, because you need it to create the launch template.

 

Next, make sure that your default security group is configured correctly. On the EC2 console, select the default security group associated with your default VPC and edit the inbound rules to allow SSH and All traffic to itself. This must be set explicitly to the security group ID for EFA to work, as seen in the following screenshot.

SecurityGroupInboundRules

 

Then edit the outbound rules and add a rule that allows all outbound traffic to the security group itself, as seen in the following screenshot. This is a requirement for EFA to work.

SecurityGroupOutboundRules
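If you prefer to script the self-referencing rules instead of using the console, a hedged boto3 sketch might look like the following; the security group ID is a placeholder, and the SSH inbound rule from the console steps is not repeated here.

import boto3

ec2 = boto3.client("ec2")
sg_id = "sg-0123456789abcdef0"  # placeholder: your default security group

# Allow all traffic from the security group to itself (inbound)...
ec2.authorize_security_group_ingress(
    GroupId=sg_id,
    IpPermissions=[{"IpProtocol": "-1", "UserIdGroupPairs": [{"GroupId": sg_id}]}],
)
# ...and all traffic to itself (outbound), which EFA requires.
ec2.authorize_security_group_egress(
    GroupId=sg_id,
    IpPermissions=[{"IpProtocol": "-1", "UserIdGroupPairs": [{"GroupId": sg_id}]}],
)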

 

Now, create an ecsInstanceRole, the Amazon ECS instance profile that will be applied to Amazon EC2 instances in a Compute Environment. To create a role, follow these steps.

  1. Choose Roles, then Create Role.
  2. Select AWS Service, then EC2.
  3. Choose Permissions.
  4. Attach the managed policy AmazonEC2ContainerServiceforEC2Role.
  5. Name the role ecsInstanceRole.

 

You will create the launch template using the ID of the security group, the ID of a subnet in your default VPC, and the ecsInstanceRole that you created.

Next, choose an instance type that supports EFA; EFA-enabled types are denoted by the n in the instance name. This example uses c5n.18xlarge instances.

You also need an Amazon Machine Image (AMI) ID. This example uses the latest ECS-optimized AMI based on Amazon Linux 2. Grab the AMI ID that corresponds to the Region that you are using.

This example uses UserData to install EFA. This adds 1.5 minutes of bootstrap time to the instance launch. In production workloads, bake the EFA installation into the AMI to avoid this additional bootstrap delay.

Now create a file called launch_template.json with the following content, making sure to substitute the account ID, security group, subnet ID, AMI ID, and key name.

{
    "LaunchTemplateName": "EFA-Batch-LaunchTemplate",
    "LaunchTemplateData": {
        "InstanceType": "c5n.18xlarge",
        "IamInstanceProfile": {
            "Arn": "arn:aws:iam::<Account Id>:instance-profile/ecsInstanceRole"
        },
        "NetworkInterfaces": [
            {
                "DeviceIndex": 0,
                "Groups": [
                    "<Security Group>"
                ],
                "SubnetId": "<Subnet Id>",
                "InterfaceType": "efa",
                "Description": "NetworkInterfaces Configuration For EFA and Batch"
            }
        ],
        "Placement": {
            "GroupName": "efa"
        },
        "TagSpecifications": [
            {
                "ResourceType": "instance",
                "Tags": [
                    {
                        "Key": "from-lt",
                        "Value": "networkInterfacesConfig-EFA-Batch"
                    }
                ]
            }
        ],
        "UserData": "TUlNRS1WZXJzaW9uOiAxLjAKQ29udGVudC1UeXBlOiBtdWx0aXBhcnQvbWl4ZWQ7IGJvdW5kYXJ5PSI9PU1ZQk9VTkRBUlk9PSIKCi0tPT1NWUJPVU5EQVJZPT0KQ29udGVudC1UeXBlOiB0ZXh0L2Nsb3VkLWJvb3Rob29rOyBjaGFyc2V0PSJ1cy1hc2NpaSIKCmNsb3VkLWluaXQtcGVyIG9uY2UgeXVtX3dnZXQgeXVtIGluc3RhbGwgLXkgd2dldAoKY2xvdWQtaW5pdC1wZXIgb25jZSB3Z2V0X2VmYSB3Z2V0IC1xIC0tdGltZW91dD0yMCBodHRwczovL3MzLXVzLXdlc3QtMi5hbWF6b25hd3MuY29tL2F3cy1lZmEtaW5zdGFsbGVyL2F3cy1lZmEtaW5zdGFsbGVyLWxhdGVzdC50YXIuZ3ogLU8gL3RtcC9hd3MtZWZhLWluc3RhbGxlci1sYXRlc3QudGFyLmd6CgpjbG91ZC1pbml0LXBlciBvbmNlIHRhcl9lZmEgdGFyIC14ZiAvdG1wL2F3cy1lZmEtaW5zdGFsbGVyLWxhdGVzdC50YXIuZ3ogLUMgL3RtcAoKcHVzaGQgL3RtcC9hd3MtZWZhLWluc3RhbGxlcgpjbG91ZC1pbml0LXBlciBvbmNlIGluc3RhbGxfZWZhIC4vZWZhX2luc3RhbGxlci5zaCAteQpwb3AgL3RtcC9hd3MtZWZhLWluc3RhbGxlcgoKY2xvdWQtaW5pdC1wZXIgb25jZSBlZmFfaW5mbyAvb3B0L2FtYXpvbi9lZmEvYmluL2ZpX2luZm8gLXAgZWZhCgotLT09TVlCT1VOREFSWT09LS0K"
    }
}

Create a launch template from that file:

 

$ aws ec2 create-launch-template --cli-input-json file://launch_template.json
{
    "LaunchTemplate": {
        "LatestVersionNumber": 1,
        "LaunchTemplateId": "lt-*****************",
        "LaunchTemplateName": "EFA-Batch-LaunchTemplate",
        "DefaultVersionNumber": 1,
        "CreatedBy": "arn:aws:iam::************:user/desktop-user",
        "CreateTime": "2019-09-23T13:00:21.000Z"
    }
}

 

Creating a compute environment

Next, create an AWS Batch Compute Environment. This uses the information from the launch template EFA-Batch-LaunchTemplate created earlier. Save the following to a file called compute_environment.json:

 

{
    "computeEnvironmentName": "EFA-Batch-ComputeEnvironment",
    "type": "MANAGED",
    "state": "ENABLED",
    "computeResources": {
        "type": "EC2",
        "minvCpus": 0,
        "maxvCpus": 2088,
        "desiredvCpus": 0,
        "instanceTypes": [
            "c5n.18xlarge"
        ],
        "subnets": [
            "<same-subnet-as-in-LaunchTemplate>"
        ],
        "instanceRole": "arn:aws:iam::<account-id>:instance-profile/ecsInstanceRole",
        "launchTemplate": {
            "launchTemplateName": "EFA-Batch-LaunchTemplate",
            "version": "$Latest"
        }
    },
    "serviceRole": "arn:aws:iam::<account-id>:role/service-role/AWSBatchServiceRole"
}

 

Now, create the compute environment:

$ aws batch create-compute-environment --cli-input-json file://compute_environment.json
{
    "computeEnvironmentName": "EFA-Batch-ComputeEnvironment",
    "computeEnvironmentArn": "arn:aws:batch:us-east-1:<Account Id>:compute-environment"
}

 

Building the container image

To build the container, first install git if needed, then clone the repository that contains the Dockerfile used in this example:

$ git clone https://github.com/aws-samples/aws-batch-efa.git

 

In that repository, there are several files, one of which is the following Dockerfile.

 

FROM amazonlinux:1
ENV USER efauser

RUN yum update -y
RUN yum install -y which util-linux make tar.x86_64 iproute2 gcc-gfortran openssh-server
RUN pip-2.7 install supervisor

RUN useradd -ms /bin/bash $USER
ENV HOME /home/$USER

#####################################################
## SSH SETUP
ENV SSHDIR $HOME/.ssh

RUN mkdir -p ${SSHDIR} \
    && touch ${SSHDIR}/sshd_config \
    && ssh-keygen -t rsa -f ${SSHDIR}/ssh_host_rsa_key -N '' \
    && cp ${SSHDIR}/ssh_host_rsa_key.pub ${SSHDIR}/authorized_keys \
    && cp ${SSHDIR}/ssh_host_rsa_key ${SSHDIR}/id_rsa \
    && echo "  IdentityFile ${SSHDIR}/id_rsa" >> ${SSHDIR}/config \
    && echo "  StrictHostKeyChecking no" >> ${SSHDIR}/config \
    && echo "  UserKnownHostsFile /dev/null" >> ${SSHDIR}/config \
    && echo "  Port 2022" >> ${SSHDIR}/config \
    && echo 'Port 2022' >> ${SSHDIR}/sshd_config \
    && echo 'UsePrivilegeSeparation no' >> ${SSHDIR}/sshd_config \
    && echo "HostKey ${SSHDIR}/ssh_host_rsa_key" >> ${SSHDIR}/sshd_config \
    && echo "PidFile ${SSHDIR}/sshd.pid" >> ${SSHDIR}/sshd_config \
    && chmod -R 600 ${SSHDIR}/* \
    && chown -R ${USER}:${USER} ${SSHDIR}/

# check if ssh agent is running or not, if not, run
RUN eval `ssh-agent -s` && ssh-add ${SSHDIR}/id_rsa

#################################################
## EFA and MPI SETUP
RUN curl -O https://s3-us-west-2.amazonaws.com/aws-efa-installer/aws-efa-installer-1.5.0.tar.gz \
    && tar -xf aws-efa-installer-1.5.0.tar.gz \
    && cd aws-efa-installer \
    && ./efa_installer.sh -y --skip-kmod --skip-limit-conf --no-verify

RUN wget https://www.nas.nasa.gov/assets/npb/NPB3.3.1.tar.gz \
    && tar xzf NPB3.3.1.tar.gz
COPY make.def_efa /NPB3.3.1/NPB3.3-MPI/config/make.def
COPY suite.def /NPB3.3.1/NPB3.3-MPI/config/suite.def

RUN cd /NPB3.3.1/NPB3.3-MPI \
    && make suite \
    && chmod -R 755 /NPB3.3.1/NPB3.3-MPI/

###################################################
## supervisor container startup

ADD conf/supervisord/supervisord.conf /etc/supervisor/supervisord.conf
ADD supervised-scripts/mpi-run.sh supervised-scripts/mpi-run.sh
RUN chmod 755 supervised-scripts/mpi-run.sh

EXPOSE 2022
ADD batch-runtime-scripts/entry-point.sh batch-runtime-scripts/entry-point.sh
RUN chmod 755 batch-runtime-scripts/entry-point.sh

CMD /batch-runtime-scripts/entry-point.sh

 

To build this Dockerfile, run the included Makefile with:

make

Now, push the container image to Amazon Elastic Container Registry (ECR) so that you can use it in your AWS Batch JobDefinition. From the AWS CLI, create an ECR repository; this example calls it aws-batch-efa:

$ aws ecr create-repository --repository-name aws-batch-efa
{
    "repository": {
        "registryId": "<Account-Id>",
        "repositoryName": "aws-batch-efa",
        "repositoryArn": "arn:aws:ecr:us-east-2:<Account-Id>:repository/aws-batch-efa",
        "createdAt": 1568154893.0,
        "repositoryUri": "<Account-Id>.dkr.ecr.us-east-2.amazonaws.com/aws-batch-efa"
    }
}

Edit the top of the makefile and add your AWS account number and AWS Region.

AWS_REGION=<REGION>
ACCOUNT_ID=<ACCOUNT-ID>

To push the image to the ECR repository, run:

make tag
make push

 

Run the application

To run the application using AWS Batch multi-node parallel jobs, follow these steps.

 

Setting up the AWS Batch multi-node job definition

Set up the AWS Batch multi-node job definition and expose the EFA device to the container by following these steps.

 

First, create a file called job_definition.json with the following contents. This file holds the configurations for the AWS Batch JobDefinition. Specifically, this JobDefinition uses the newly supported field LinuxParameters.Devices to expose a particular device—in this case, the EFA device path /dev/infiniband/uverbs0—to the container. Be sure to substitute the image URI with the one you pushed to ECR in the previous step. This is used to start the container.

 

{
    "jobDefinitionName": "EFA-MPI-JobDefinition",
    "type": "multinode",
    "nodeProperties": {
        "numNodes": 8,
        "mainNode": 0,
        "nodeRangeProperties": [
            {
                "targetNodes": "0:",
                "container": {
                    "user": "efauser",
                    "image": "<Docker Image From Previous Section>",
                    "vcpus": 72,
                    "memory": 184320,
                    "linuxParameters": {
                        "devices": [
                            {
                                "hostPath": "/dev/infiniband/uverbs0"
                            }
                        ]
                    },
                    "ulimits": [
                        {
                            "hardLimit": -1,
                            "name": "memlock",
                            "softLimit": -1
                        }
                    ]
                }
            }
        ]
    }
}

 

$ aws batch register-job-definition --cli-input-json file://job_definition.json
{
    "jobDefinitionArn": "arn:aws:batch:us-east-1:<account-id>:job-definition/EFA-MPI-JobDefinition",
    "jobDefinitionName": "EFA-MPI-JobDefinition",
    "revision": 1
}

 

Run the job

Next, create a job queue. This job queue points at the compute environment created before. When jobs are submitted to it, they queue until instances are available to run them. Save the following to a file called job_queue.json:

 

{
    "jobQueueName": "EFA-Batch-JobQueue",
    "state": "ENABLED",
    "priority": 10,
    "computeEnvironmentOrder": [
        {
            "order": 1,
            "computeEnvironment": "EFA-Batch-ComputeEnvironment"
        }
    ]
}

 

aws batch create-job-queue --cli-input-json file://job_queue.json

Now that you’ve created all the resources, submit the job. The numNodes=8 parameter tells the job definition to use eight nodes.

aws batch submit-job --job-name example-mpi-job --job-queue EFA-Batch-JobQueue --job-definition EFA-MPI-JobDefinition --node-overrides numNodes=8

 

NPB overview

NPB is a small set of benchmarks derived from computational fluid dynamics (CFD) applications. They consist of five kernels and three pseudo-applications. This example runs the 3D Fast Fourier Transform (FFT) benchmark, as it tests all-to-all communication. For this run, use c5n.18xlarge, as configured in the compute environment earlier. This is an excellent choice for this workload, as it has an Intel Skylake processor (72 hyperthreaded cores) and 100 Gbps of network bandwidth, which you can take advantage of with EFA.

 

This test runs the FT “C” Benchmark with eight nodes * 72 vcpus = 576 vcpus.

 

NAS Parallel Benchmarks 3.3 -- FT Benchmark

No input file inputft.data. Using compiled defaults
Size                : 512x 512x 512
Iterations          : 20
Number of processes : 512
Processor array     : 1x 512
Layout type         : 1D
Initialization time = 1.3533580760000063

T = 1 Checksum = 5.195078707457D+02 5.149019699238D+02

T = 2 Checksum = 5.155422171134D+02 5.127578201997D+02
T = 3 Checksum = 5.144678022222D+02 5.122251847514D+02
T = 4 Checksum = 5.140150594328D+02 5.121090289018D+02
T = 5 Checksum = 5.137550426810D+02 5.121143685824D+02
T = 6 Checksum = 5.135811056728D+02 5.121496764568D+02
T = 7 Checksum = 5.134569343165D+02 5.121870921893D+02
T = 8 Checksum = 5.133651975661D+02 5.122193250322D+02
T = 9 Checksum = 5.132955192805D+02 5.122454735794D+02
T = 10 Checksum = 5.132410471738D+02 5.122663649603D+02
T = 11 Checksum = 5.131971141679D+02 5.122830879827D+02
T = 12 Checksum = 5.131605205716D+02 5.122965869718D+02
T = 13 Checksum = 5.131290734194D+02 5.123075927445D+02
T = 14 Checksum = 5.131012720314D+02 5.123166486553D+02
T = 15 Checksum = 5.130760908195D+02 5.123241541685D+02
T = 16 Checksum = 5.130528295923D+02 5.123304037599D+02
T = 17 Checksum = 5.130310107773D+02 5.123356167976D+02
T = 18 Checksum = 5.130103090133D+02 5.123399592211D+02
T = 19 Checksum = 5.129905029333D+02 5.123435588985D+02
T = 20 Checksum = 5.129714421109D+02 5.123465164008D+02

 

Result verification successful
class = C
FT Benchmark Completed.
Class           = C
Size            = 512x 512x 512
Iterations      = 20
Time in seconds = 1.92
Total processes = 512
Compiled procs  = 512
Mop/s total     = 206949.17
Mop/s/process   = 404.20
Operation type  = floating point
Verification    = SUCCESSFUL

 

Summary

In this post, we covered how to run MPI Batch jobs with an EFA-enabled elastic network interface using AWS Batch multi-node parallel jobs and an EC2 launch template. We used a launch template to configure the AWS Batch compute environment to launch an instance with the EFA device installed. We showed you how to expose the EFA device to the container. You also learned how to package an MPI benchmarking application, the NPB, as a Docker container, and how to run the application as an AWS Batch multi-node parallel job.

We hope you found the information in this post helpful and encouraging as to all the possibilities for HPC on AWS.

Creating an AWS Batch environment for mixed CPU and GPU genomics workflows

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/creating-an-aws-batch-environment-for-mixed-cpu-and-gpu-genomics-workflows/

This post is courtesy of Lee Pang – AWS Technical Business Development 

I recently worked with a customer who needed to process a bunch of raw sequence files (FastQs) into Hi-C format (*.hic), which is used for the structural analysis of DNA/chromatin loops and sequence accessibility. The tooling they were interested in using was the Juicer suite and they needed a minimal workflow:

  • Align the sample to the reference using the juicer CLI utility.
  • Annotate loops using the HiCCUPS algorithm from the juicer-tools library.

Because they had many files to process, they wanted to do this as scalably as possible. The juicer step of the workflow was CPU and memory-intensive, while the HiCCUPS step needed GPU acceleration. So, they were interested in using AWS Step Functions and AWS Batch.

Since its launch, AWS Batch has made it possible to create scalable compute environments for processing a mixture of CPU- and memory-intensive jobs, which covers the needs of the majority of genomics workflows. So, how do you create a genomics workflow environment using AWS Batch that also includes GPUs? Thankfully, AWS Batch recently announced support for GPU resources!

In this post, I show you how to use these new features to execute mixed CPU and GPU genomics workflows. By the end of this post, you will be able to build the architecture shown below.

Configuring AWS Batch for running CPU and GPU jobs

To handle a mixture of CPU and GPU jobs, the recommended strategy is to create multiple compute environments and job queues:

  • GPU-only resources
    • Compute environments (using the ECS GPU Optimized AMI)
    • Spot Instances of the P2 and P3 instance family
    • On-Demand Instances of the P2 and P3 instance family
    • Job queue for GPU compute environments
  • CPU-only resources
    • Compute environments (using the default ECS Optimized AMI)
    • Spot Instances of the “optimal” instance family
    • On-Demand Instances of the “optimal” instance family
    • Job queue for CPU compute environments

With the above in place, you then point CPU jobs to the “CPU” queue and GPU jobs to the “GPU” queue.

Notice that CPU and GPU resources are kept separate with queues and compute environments. I don’t recommend creating a compute environment or job queue that mixes CPU-optimized and GPU-optimized instance types. In a mixed compute environment or queue, there is a chance that CPU jobs could be placed on GPU instances when no GPU jobs are scheduled. This could result in few (or no) GPU instances being available when GPU jobs must run.
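To illustrate the routing, here is a hedged boto3 sketch of submitting one job to each queue. The queue and job definition names are placeholders, and the GPU job requests a single GPU through resourceRequirements.

import boto3

batch = boto3.client("batch")

# CPU job: the alignment step goes to the CPU-only queue (placeholder names).
batch.submit_job(
    jobName="juicer-align",
    jobQueue="cpu",
    jobDefinition="juicer",
)

# GPU job: HiCCUPS goes to the GPU-only queue and asks for one GPU.
batch.submit_job(
    jobName="hiccups-annotate",
    jobQueue="gpu",
    jobDefinition="hiccups",
    containerOverrides={"resourceRequirements": [{"type": "GPU", "value": "1"}]},
)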

Create a GPU compute environment

Amazon EC2 has a wide variety of instance types, including the P3 family of instances that enable GPU-accelerated computing. Previously, the best option for using GPUs in AWS Batch was to create a compute environment based on the publicly available deep learning AMIs used by AI/ML services such as Amazon SageMaker, which support running containers and come with NVIDIA/CUDA drivers pre-installed.

Earlier this year, the Amazon ECS team announced the availability of the ECS GPU-optimized AMI. It’s essentially the same as the existing Amazon ECS-optimized Amazon Linux 2 AMI but with pre-installed capabilities to provide Docker containers with access to GPU acceleration. For more information about what’s included, see Amazon ECS-optimized AMIs.

The key point is that this is a much more lightweight solution, which makes creating GPU-specific AWS Batch compute environments much easier.

Creating an AWS Batch compute environment specifically for GPU jobs is similar to creating one for CPU jobs. The key difference is that you select only GPU instance families for the instance types.

For compute environments that use the P2, P3, and P3dn instance families, AWS Batch automatically associates the ECS GPU-optimized AMI. You don’t have to create a custom AMI for GPU jobs to run on these instances.

The G2 and G3 families use a different type of GPU and have different drivers. Compute environments that use G2 and G3 families need a custom AMI to take advantage of acceleration. Otherwise, they default to the ECS-optimized AMI.

To create a GPU-enabled compute environment with the AWS CLI, create a file called gpu-ce.json with the following contents:

{
    "computeEnvironmentName": "gpu",
    "type": "MANAGED",
    "state": "ENABLED",
    "serviceRole": "arn:aws:iam::<account-id>:role/service-role/AWSBatchServiceRole",
    "computeResources": {
        "type": "EC2",
        "subnets": [
            "<subnet-id-1>",
            "<subnet-id-2>",
            "<subnet-id-3>"
        ],
        "tags": {
            "Name": "batch-gpu-worker"
        },
        "desiredvCpus": 0,
        "minvCpus": 0,
        "instanceTypes": [
            "p3",
            "p2"
        ],
        "instanceRole": "arn:aws:iam::<account-id>:instance-profile/ecsInstanceRole",
        "maxvCpus": 256,
        "securityGroupIds": [
            "<security-group-1>",
            "<security-group-2>"
        ],
        "ec2KeyPair": "<keypair-name>"
    }
}

From the command line, run the following:

aws batch create-compute-environment --cli-input-json file://gpu-ce.json

Create a GPU job queue

When you have a GPU compute environment, you can associate it with a dedicated GPU job queue.

From the command line, run the following:

aws batch create-job-queue \
    --job-queue-name gpu \
    --state ENABLED \
    --priority 100 \
    --compute-environment-order order=1,computeEnvironment=gpu

Specifying GPU resources in AWS Batch jobs

Creating job definitions in AWS Batch has not changed much. However, now there’s an additional field under Resource requirements that lets you specify how many GPUs the job should use.

The JSON for registering a job definition with GPU requirements looks like the following:

{
    "jobDefinitionName": "hiccups", 
    "type": "container", 
    "parameters": {
        "OutputS3Prefix": "s3://<bucket-name>/juicer/HIC003", 
        "InputHICS3Path": "s3://<bucket-name>/juicer/HIC003/aligned/inter.hic"
    }, 
    "containerProperties": {
        "mountPoints": [], 
        "image": "<docker-image-repository>/juicer-tools:latest", 
        "environment": [], 
        "vcpus": 8, 
        "command": [
            "hiccups", 
            "Ref::InputHICS3Path", 
            "Ref::OutputS3Prefix"
        ], 

        /* BEGIN NEW STUFF (delete comment before use) */
        "resourceRequirements" : [
            {
                "type" : "GPU",
                "value" : "1"
            }
        ],
        /* END NEW STUFF (delete comment before use) */
        
        "volumes": [], 
        "memory": 60000, 
        "ulimits": []
    }
}

Containerization considerations

To ensure that your containerized task can use GPU acceleration, use nvidia/cuda base images when building the container image for your job. For example, for a CentOS-based image with CUDA 9.2, your Dockerfile should have the following:

FROM nvidia/cuda:9.2-devel-centos7

This could be at the top if you are building the container entirely from scratch, or later if you are using a multi-stage build.

In this case, I already had a CentOS-based image for the juicer utility that I could recycle for juicer-tools. So my Dockerfile looked like the following:

FROM juicer AS base
FROM nvidia/cuda:9.2-devel-centos7

COPY --from=base /opt/juicer /opt/juicer

RUN yum install -y awscli
RUN yum install -y java-1.8.0-openjdk

WORKDIR /opt/juicer/scripts
COPY juicer-tools.aws.sh .

WORKDIR /opt/juicer/work
ENTRYPOINT ["/opt/juicer/scripts/juicer-tools.aws.sh"]

Running a mixed CPU / GPU workflow

To run a workflow that contains a mixture of CPU– and GPU-based jobs, you can use AWS Step Functions. This blog channel previously covered how Step Functions and AWS Batch can be combined to run genomics workflows on AWS. Also, Step Functions and AWS Batch are now more tightly integrated, making scalable workflow solutions much easier to build.

With all the AWS Batch resources in place, it is a matter of pointing each task at the job queue with the right compute resources.

The solution I put together for the customer with the Juicer suite resulted in the following state machine:

{
    "Comment": "State machine for FASTQ to HIC with annotation using juicer and hiccups",
    "StartAt": "JuicerTask",
    "States": {
        "JuicerTask": {
            "Type": "Task",
            "InputPath": "$",
            "ResultPath": "$.juicer.status",
            "Resource": "arn:aws:states:::batch:submitJob.sync",
            "Parameters": {
                "JobDefinition": "arn:aws:batch:<region>:<account_number>:job-definition/juicer:2",
                "JobName": "juicer",
                "JobQueue": "arn:aws:batch:<region>:<account_number>:job-queue/cpu",
                "Parameters.$": "$.juicer.parameters"
            },
            "Next": "HiccupsTask"
        },
        "HiccupsTask": {
            "Type": "Task",
            "InputPath": "$",
            "ResultPath": "$.hiccups.status",
            "Resource": "arn:aws:states:::batch:submitJob.sync",
            "Parameters": {
                "JobDefinition": "arn:aws:batch:<region>:<account_number>:job-definition/juicer-tools:2",
                "JobName": "hiccups",
                "JobQueue": "arn:aws:batch:<region>:<account_number>:job-queue/gpu",
                "Parameters.$": "$.hiccups.parameters"
            },
            "End": true
        }
    }
}

JuicerTask is submitted to the cpu job queue, while HiccupsTask is submitted to the gpu job queue.
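To kick off the workflow, you start an execution of the state machine with an input document carrying the parameters each task expects under $.juicer.parameters and $.hiccups.parameters. A hedged boto3 sketch follows; the state machine ARN is a placeholder, and the juicer parameter names are hypothetical (only the hiccups ones appear in the job definition above).

import json
import boto3

sfn = boto3.client("stepfunctions")

# Placeholder ARN and illustrative parameter values only.
sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:111122223333:stateMachine:juicer-hiccups",
    input=json.dumps({
        # Hypothetical parameter name for the juicer alignment step.
        "juicer": {"parameters": {"InputFastqS3Path": "s3://<bucket-name>/juicer/HIC003/fastq/"}},
        # Parameter names taken from the hiccups job definition shown earlier.
        "hiccups": {"parameters": {
            "InputHICS3Path": "s3://<bucket-name>/juicer/HIC003/aligned/inter.hic",
            "OutputS3Prefix": "s3://<bucket-name>/juicer/HIC003",
        }},
    }),
)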

Conclusion

With the ECS GPU-optimized AMI and AWS Batch support for GPU resources, you can easily build a scalable solution for running genomics workflows with both CPU and GPU resources.

Build on!

GPU workloads on AWS Batch

Post Syndicated from Josh Rad original https://aws.amazon.com/blogs/compute/gpu-workloads-on-aws-batch/

Contributed by Manuel Manzano Hoss, Cloud Support Engineer

I remember playing around with graphics processing unit (GPU) workload examples in 2017 when the Deep Learning on AWS Batch post was published by my colleague Kiuk Chung. He provided an example of how to train a convolutional neural network (CNN), the LeNet architecture, to recognize handwritten digits from the MNIST dataset, using Apache MXNet as the framework. Back then, to run such jobs with GPU capabilities, I had to do the following:

  • Create a custom GPU-enabled AMI that had installed Docker, the ECS agent, NVIDIA driver and container runtime, and CUDA.
  • Identify the type of P2 EC2 instance that had the required amount of GPU for my job.
  • Check the amount of vCPUs that it offered (even if I was not interested on using them).
  • Specify that number of vCPUs for my job.

All that, without any certainty that the required GPU would actually be available to my job once it was running. Back then, there was no GPU pinning. Other jobs running on the same EC2 instance were able to use that GPU, making the orchestration of my jobs a tricky task.

Fast forward two years. Today, AWS Batch announced integrated support for Amazon EC2 Accelerated Instances. It is now possible to specify a number of GPUs as a resource that AWS Batch considers in choosing the EC2 instance to run your job, along with vCPU and memory. That allows me to take advantage of the main benefits of using AWS Batch: the compute resource selection algorithm and job scheduler. It also frees me from having to check which types of EC2 instances have enough GPUs.

Also, I can take advantage of the Amazon ECS GPU-optimized AMI maintained by AWS. It comes with the NVIDIA drivers and all the necessary software to run GPU-enabled jobs. When I allow the P2 or P3 instance types on my compute environment, AWS Batch launches my compute resources using the Amazon ECS GPU-optimized AMI automatically.

In other words, now I don’t worry about the GPU task list mentioned earlier. I can focus on deciding which framework and command to run on my GPU-accelerated workload. At the same time, I’m now sure that my jobs have access to the required performance, as physical GPUs are pinned to each job and not shared among them.

A GPU race against the past

As a kind of GPU-race exercise, I checked a similar example to the one from Kiuk’s post, to see how fast it could be to run a GPU-enabled job now. I used the AWS Management Console to demonstrate how simple the steps are.

In this case, I decided to use the deep neural network architecture called multilayer perceptron (MLP), not the LeNet CNN, to compare the validation accuracy between them.

To make the test even simpler and faster to implement, I thought I would use one of the recently announced AWS Deep Learning containers, which come pre-packed with different frameworks and ready-to-process data. I chose the container that comes with MXNet and Python 2.7, customized for Training and GPU. For more information about the Docker images available, see the AWS Deep Learning Containers documentation.

In the AWS Batch console, I created a managed compute environment with the default settings, allowing AWS Batch to create the required IAM roles on my behalf.

On the configuration of the compute resources, I selected the P2 and P3 families of instances, as those are the type of instance with GPU capabilities. You can select On-Demand Instances, but in this case I decided to use Spot Instances to take advantage of the discounts that this pricing model offers. I left the defaults for all other settings, selecting the AmazonEC2SpotFleetRole role that I created the first time that I used Spot Instances:

Finally, I also left the network settings as default. My compute environment selected the default VPC, three subnets, and a security group. They are enough to run my jobs and at the same time keep my environment safe by limiting connections from outside the VPC:

I created a job queue, GPU_JobQueue, attaching it to the compute environment that I just created:

Next, I registered the same job definition that I would have created following Kiuk’s post. I specified enough memory to run this test, one vCPU, and the AWS Deep Learning Docker image that I chose, in this case mxnet-training:1.4.0-gpu-py27-cu90-ubuntu16.04. The number of GPUs required was, in this case, one. To have access to run the script, the container must run as privileged, or use the root user.

Finally, I submitted the job. I first cloned the MXNet repository for the train_mnist.py Python script. Then I ran the script itself, with the parameter --gpus 0 to indicate that the assigned GPU should be used. The job inherits all the other parameters from the job definition:

sh -c 'git clone -b 1.3.1 https://github.com/apache/incubator-mxnet.git && python /incubator-mxnet/example/image-classification/train_mnist.py --gpus 0'

That’s all, and my GPU-enabled job was running. It took me less than two minutes to go from zero to having the job submitted. This is the log of my job, from which I removed the iterations from epoch 1 to 18 to make it shorter:

14:32:31     Cloning into 'incubator-mxnet'...

14:33:50     Note: checking out '19c501680183237d52a862e6ae1dc4ddc296305b'.

14:33:51     INFO:root:start with arguments Namespace(add_stn=False, batch_size=64, disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none', gpus='0', initializer='default', kv_store='device', load_epoch=None, loss='', lr=0.05, lr_factor=0.1, lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9, monitor=0, network='mlp', num_classes=10, num_epochs=20, num_examples=60000, num_layers=No

14:33:51     DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): yann.lecun.com:80

14:33:54     DEBUG:urllib3.connectionpool:http://yann.lecun.com:80 "GET /exdb/mnist/train-labels-idx1-ubyte.gz HTTP/1.1" 200 28881

14:33:55     DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): yann.lecun.com:80

14:33:55     DEBUG:urllib3.connectionpool:http://yann.lecun.com:80 "GET /exdb/mnist/train-images-idx3-ubyte.gz HTTP/1.1" 200 9912422

14:33:59     DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): yann.lecun.com:80

14:33:59     DEBUG:urllib3.connectionpool:http://yann.lecun.com:80 "GET /exdb/mnist/t10k-labels-idx1-ubyte.gz HTTP/1.1" 200 4542

14:33:59     DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): yann.lecun.com:80

14:34:00     DEBUG:urllib3.connectionpool:http://yann.lecun.com:80 "GET /exdb/mnist/t10k-images-idx3-ubyte.gz HTTP/1.1" 200 1648877

14:34:04     INFO:root:Epoch[0] Batch [0-100] Speed: 37038.30 samples/sec accuracy=0.793472

14:34:04     INFO:root:Epoch[0] Batch [100-200] Speed: 36457.89 samples/sec accuracy=0.906719

14:34:04     INFO:root:Epoch[0] Batch [200-300] Speed: 36981.20 samples/sec accuracy=0.927500

14:34:04     INFO:root:Epoch[0] Batch [300-400] Speed: 36925.04 samples/sec accuracy=0.935156

14:34:04     INFO:root:Epoch[0] Batch [400-500] Speed: 37262.36 samples/sec accuracy=0.940156

14:34:05     INFO:root:Epoch[0] Batch [500-600] Speed: 37729.64 samples/sec accuracy=0.942813

14:34:05     INFO:root:Epoch[0] Batch [600-700] Speed: 37493.55 samples/sec accuracy=0.949063

14:34:05     INFO:root:Epoch[0] Batch [700-800] Speed: 37320.80 samples/sec accuracy=0.953906

14:34:05     INFO:root:Epoch[0] Batch [800-900] Speed: 37705.85 samples/sec accuracy=0.958281

14:34:05     INFO:root:Epoch[0] Train-accuracy=0.924024

14:34:05     INFO:root:Epoch[0] Time cost=1.633

...  LOGS REMOVED

14:34:44     INFO:root:Epoch[19] Batch [0-100] Speed: 36864.44 samples/sec accuracy=0.999691

14:34:44     INFO:root:Epoch[19] Batch [100-200] Speed: 37088.35 samples/sec accuracy=1.000000

14:34:44     INFO:root:Epoch[19] Batch [200-300] Speed: 36706.91 samples/sec accuracy=0.999687

14:34:44     INFO:root:Epoch[19] Batch [300-400] Speed: 37941.19 samples/sec accuracy=0.999687

14:34:44     INFO:root:Epoch[19] Batch [400-500] Speed: 37180.97 samples/sec accuracy=0.999844

14:34:44     INFO:root:Epoch[19] Batch [500-600] Speed: 37122.30 samples/sec accuracy=0.999844

14:34:45     INFO:root:Epoch[19] Batch [600-700] Speed: 37199.37 samples/sec accuracy=0.999687

14:34:45     INFO:root:Epoch[19] Batch [700-800] Speed: 37284.93 samples/sec accuracy=0.999219

14:34:45     INFO:root:Epoch[19] Batch [800-900] Speed: 36996.80 samples/sec accuracy=0.999844

14:34:45     INFO:root:Epoch[19] Train-accuracy=0.999733

14:34:45     INFO:root:Epoch[19] Time cost=1.617

14:34:45     INFO:root:Epoch[19] Validation-accuracy=0.983579

Summary

As you can see, after AWS Batch launched the instance, the job took slightly more than two minutes to run. I spent roughly five minutes from start to finish. That was much faster than the time that I was previously spending just to configure the AMI. Using the AWS CLI, one of the AWS SDKs, or AWS CloudFormation, the same environment could be created even faster.

From a training point of view, I lost on the validation accuracy, as the results obtained using the LeNet CNN are higher than when using an MLP network. On the other hand, my job was faster, with an average time cost of 1.6 seconds per epoch. As the software stack evolves and hardware capabilities improve, these numbers keep getting better, but that shouldn’t mean extra complexity. Using managed primitives like the one presented in this post enables a simpler implementation.

I encourage you to test this example and see for yourself how just a few clicks or commands lets you start running GPU jobs with AWS Batch. Then, it is just a matter of replacing the Docker image that I used for one with the framework of your choice, TensorFlow, Caffe, PyTorch, Keras, etc. Start to run your GPU-enabled machine learning, deep learning, computational fluid dynamics (CFD), seismic analysis, molecular modeling, genomics, or computational finance workloads. It’s faster and easier than ever.

If you decide to give it a try, have any doubt or just want to let me know what you think about this post, please write in the comments section!

Building Simpler Genomics Workflows on AWS Step Functions

Post Syndicated from Christie Gifrin original https://aws.amazon.com/blogs/compute/building-simpler-genomics-workflows-on-aws-step-functions/

This post is courtesy of Ryan Ulaszek, AWS Genomics Partner Solutions Architect and Aaron Friedman, AWS Healthcare and Life Sciences Partner Solutions Architect

In 2017, we published a four part blog series on how to build a genomics workflow on AWS. In part 1, we introduced a general architecture highlighting three common layers: job, batch and workflow.  In part 2, we described building the job layer with Docker and Amazon Elastic Container Registry (Amazon ECR).  In part 3, we tackled the batch layer and built a batch engine using AWS Batch.  In part 4, we built out the workflow layer using AWS Step Functions and AWS Lambda.

Since then, we’ve worked with many AWS customers and APN partners to implement this solution in genomics as well as in other workloads-of-interest. Today, we wanted to highlight a new feature in Step Functions that simplifies how customers and partners can build high-throughput genomics workflows on AWS.

Step Functions now supports native integration with AWS Batch, which simplifies how you can create an AWS Batch state that submits an asynchronous job and waits for that job to finish.

Before, you needed to build a state machine building block that submitted a job to AWS Batch, and then polled and checked its execution. Now, you can just submit the job to AWS Batch using the new AWS Batch task type.  Step Functions waits to proceed until the job is completed. This reduces the complexity of your state machine and makes it easier to build a genomics workflow with asynchronous AWS Batch steps.

The new integrations include support for the following API actions:

  • AWS Batch SubmitJob
  • Amazon SNS Publish
  • Amazon SQS SendMessage
  • Amazon ECS RunTask
  • AWS Fargate RunTask
  • Amazon DynamoDB
    • PutItem
    • GetItem
    • UpdateItem
    • DeleteItem
  • Amazon SageMaker
    • CreateTrainingJob
    • CreateTransformJob
  • AWS Glue
    • StartJobRun

You can also pass parameters to the service API.  To use the new integrations, the role that you assume when running a state machine needs to have the appropriate permissions.  For more information, see the AWS Step Functions Developer Guide.
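
As a rough, non-authoritative sketch, the execution role’s inline policy for the synchronous AWS Batch integration looks something like the following. The role and policy names are placeholders, and the events permissions cover the managed rule that Step Functions uses to learn when the job finishes; always confirm the exact permissions in the Developer Guide.

aws iam put-role-policy \
    --role-name StatesExecutionRole \
    --policy-name BatchSubmitJobSyncAccess \
    --policy-document '{
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": ["batch:SubmitJob", "batch:DescribeJobs", "batch:TerminateJob"],
          "Resource": "*"
        },
        {
          "Effect": "Allow",
          "Action": ["events:PutTargets", "events:PutRule", "events:DescribeRule"],
          "Resource": "arn:aws:events:*:*:rule/StepFunctionsGetEventsForBatchJobsRule"
        }
      ]
    }'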

Using a job status poller

In our 2017 post series, we created a job poller “pattern” with two separate Lambda functions: one to submit the job and one to poll for its status. When the job finishes, the state machine proceeds to the next step and operates according to the necessary business logic.  This is a useful pattern to manage asynchronous jobs when a direct integration is unavailable.

The steps in this building block state machine are as follows:

  1. A job is submitted through a Lambda function.
  2. The state machine queries the AWS Batch API for the job status in another Lambda function.
  3. The job status is checked to see if the job has completed.  If the job status equals SUCCESS, the final job status is logged. If the job status equals FAILED, the execution of the state machine ends. In all other cases, wait 30 seconds and go back to Step 2.

Both the Submit Job and Get Job Status Lambda functions are available as example Lambda functions in the console.  The job status poller is available in the Step Functions console as a sample project.

Here is the JSON representing this state machine.

{
  "Comment": "A simple example that submits a job to AWS Batch",
  "StartAt": "SubmitJob",
  "States": {
    "SubmitJob": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:<account-id>::function:batchSubmitJob",
      "Next": "GetJobStatus"
    },
    "GetJobStatus": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:<account-id>:function:batchGetJobStatus",
      "Next": "CheckJobStatus",
      "InputPath": "$",
      "ResultPath": "$.status"
    },
    "CheckJobStatus": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.status",
          "StringEquals": "FAILED",
          "End": true
        },
        {
          "Variable": "$.status",
          "StringEquals": "SUCCEEDED",
          "Next": "GetFinalJobStatus"
        }
      ],
      "Default": "Wait30Seconds"
    },
    "Wait30Seconds": {
      "Type": "Wait",
      "Seconds": 30,
      "Next": "GetJobStatus"
    },
    "GetFinalJobStatus": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:<account-id>:function:batchGetJobStatus",
      "End": true
    }
  }
}

With Step Functions Service Integrations

With Step Functions service integrations, it is now simpler to submit and wait for an AWS Batch job, or any other supported service.

The following code block is the JSON representing the new state machine for an asynchronous batch job. If you are familiar with the AWS Batch SubmitJob API action, you may notice that the parameters are consistent with what you would see in that API call. You can also use the optional AWS Batch parameters in addition to JobDefinition, JobName, and JobQueue.

{
 "StartAt": "RunBatchJob",
 "States": {
     "RunIsaacJob":{
     "Type":"Task",
     "Resource":"arn:aws:states:::batch:submitJob.sync",
     "Parameters":{
        "JobDefinition":"Isaac",
        "JobName.$":"$.isaac.JobName",
        "JobQueue":"HighPriority",
        "Parameters.$": "$.isaac"
     },
     "TimeoutSeconds": 900,
     "HeartbeatSeconds": 60,
     "Next":"Parallel",
     "InputPath":"$",
     "ResultPath":"$.status",
     "Retry" : [
        {
          "ErrorEquals": [ "States.Timeout" ],
          "IntervalSeconds": 3,
          "MaxAttempts": 2,
          "BackoffRate": 1.5
        }
     ]
  }
 }
}

Here is an example of the workflow input JSON.  Pass all of the container parameters that were being constructed in the submit job Lambda function.

{
  "isaac": {
    "WorkingDir": "/scratch",
    "JobName": "isaac-1",
    "FastQ1S3Path": "s3://aws-batch-genomics-resources/fastq/SRR1919605_1.fastq.gz",
    "BAMS3FolderPath": "s3://aws-batch-genomics-resources/fastq/SRR1919605_2.fastq.gz",
    "FastQ2S3Path": "s3://bccn-genome-data/fastq/NIST7035_R2_trimmed.fastq.gz",
    "ReferenceS3Path": "s3://aws-batch-genomics-resources/reference/isaac/"
  }
}

When you deploy the job definition, add the command attribute that was previously being constructed in the Lambda function launching the AWS Batch job.

IsaacJobDefinition:
    Type: AWS::Batch::JobDefinition
    Properties:
      JobDefinitionName: "Isaac"
      Type: container
      RetryStrategy:
        Attempts: 1
      Parameters:
        BAMS3FolderPath: !Sub "s3://${JobResultsBucket}/NA12878_states_1/bam"
        FastQ1S3Path: "s3://aws-batch-genomics-resources/fastq/SRR1919605_1.fastq.gz"
        FastQ2S3Path: "s3://aws-batch-genomics-resources/fastq/SRR1919605_2.fastq.gz"
        ReferenceS3Path: "s3://aws-batch-genomics-resources/reference/isaac/"
        WorkingDir: "/scratch"
      ContainerProperties:
        Image: "rulaszek/isaac"
        Vcpus: 32
        Memory: 80000
        JobRoleArn:
          Fn::ImportValue: !Sub "${RoleStackName}:ECSTaskRole"
        Command:
          - "--bam_s3_folder_path"
          - "Ref::BAMS3FolderPath"
          - "--fastq1_s3_path"
          - "Ref::FastQ1S3Path"
          - "--fastq2_s3_path"
          - "Ref::FastQ2S3Path"
          - "--reference_s3_path"
          - "Ref::ReferenceS3Path"
          - "--working_dir"
          - "Ref::WorkingDir"
        MountPoints:
          - ContainerPath: "/scratch"
            ReadOnly: false
            SourceVolume: docker_scratch
        Volumes:
          - Name: docker_scratch
            Host:
              SourcePath: "/docker_scratch"

The key-value parameters passed into the workflow are mapped by key, using Parameters.$, to the values in the job definition. Value substitutions then take place at run time. The Docker run looks like the following:

docker run <isaac_container_uri> --bam_s3_folder_path s3://batch-genomics-pipeline-jobresultsbucket-1kzdu216m2b0k/NA12878_states_3/bam
                                 --fastq1_s3_path s3://aws-batch-genomics-resources/fastq/SRR1919605_1.fastq.gz
                                 --fastq2_s3_path s3://aws-batch-genomics-resources/fastq/SRR1919605_2.fastq.gz 
                                 --reference_s3_path s3://aws-batch-genomics-resources/reference/isaac/ 
                                 --working_dir /scratch

Genomics workflow: Before and after

Overall, connectors dramatically simplify your genomics workflow.  The following workflow is a simple genomics secondary analysis pipeline, which we highlighted in our original post series.

The first step aligns the sample against a reference genome.  When alignment is complete, variant calling and QA metrics are calculated in two parallel steps.  When variant calling is complete, variant annotation is performed.  Before, each AWS Batch step in this workflow required the submit, poll, check, and wait states from the poller building block. Now, each step is a single task state that uses the native AWS Batch integration.

Here is the new workflow JSON:

{
   "Comment":"A simple genomics secondary-analysis workflow",
   "StartAt":"RunIsaacJob",
   "States":{
      "RunIsaacJob":{
         "Type":"Task",
         "Resource":"arn:aws:states:::batch:submitJob.sync",
         "Parameters":{
            "JobDefinition":"Isaac",
            "JobName.$":"$.isaac.JobName",
            "JobQueue":"HighPriority",
            "Parameters.$": "$.isaac"
         },
         "TimeoutSeconds": 900,
         "HeartbeatSeconds": 60,
         "Next":"Parallel",
         "InputPath":"$",
         "ResultPath":"$.status",
         "Retry" : [
            {
              "ErrorEquals": [ "States.Timeout" ],
              "IntervalSeconds": 3,
              "MaxAttempts": 2,
              "BackoffRate": 1.5
            }
         ]
      },
      "Parallel":{
         "Type":"Parallel",
         "Next":"FinalState",
         "Branches":[
            {
               "StartAt":"RunStrelkaJob",
               "States":{
                  "RunStrelkaJob":{
                     "Type":"Task",
                     "Resource":"arn:aws:states:::batch:submitJob.sync",
                     "Parameters":{
                        "JobDefinition":"Strelka",
                        "JobName.$":"$.strelka.JobName",
                        "JobQueue":"HighPriority",
                        "Parameters.$": "$.strelka"
                     },
                     "TimeoutSeconds": 900,
                     "HeartbeatSeconds": 60,
                     "Next":"RunSnpEffJob",
                     "InputPath":"$",
                     "ResultPath":"$.status",
                     "Retry" : [
                        {
                          "ErrorEquals": [ "States.Timeout" ],
                          "IntervalSeconds": 3,
                          "MaxAttempts": 2,
                          "BackoffRate": 1.5
                        }
                     ]
                  },
                  "RunSnpEffJob":{
                     "Type":"Task",
                     "Resource":"arn:aws:states:::batch:submitJob.sync",
                     "Parameters":{
                        "JobDefinition":"SNPEff",
                        "JobName.$":"$.snpeff.JobName",
                        "JobQueue":"HighPriority",
                        "Parameters.$": "$.snpeff"
                     },
                     "TimeoutSeconds": 900,
                     "HeartbeatSeconds": 60,
                     "Retry" : [
                        {
                          "ErrorEquals": [ "States.Timeout" ],
                          "IntervalSeconds": 3,
                          "MaxAttempts": 2,
                          "BackoffRate": 1.5
                        }
                     ],
                     "End":true
                  }
               }
            },
            {
               "StartAt":"RunSamtoolsStatsJob",
               "States":{
                  "RunSamtoolsStatsJob":{
                     "Type":"Task",
                     "Resource":"arn:aws:states:::batch:submitJob.sync",
                     "Parameters":{
                        "JobDefinition":"SamtoolsStats",
                        "JobName.$":"$.samtools.JobName",
                        "JobQueue":"HighPriority",
                        "Parameters.$": "$.samtools"
                     },
                     "TimeoutSeconds": 900,
                     "HeartbeatSeconds": 60,
                     "End":true,
                     "Retry" : [
                        {
                          "ErrorEquals": [ "States.Timeout" ],
                          "IntervalSeconds": 3,
                          "MaxAttempts": 2,
                          "BackoffRate": 1.5
                        }
                     ]
                  }
               }
            }
         ]
      },
      "FinalState":{
         "Type":"Pass",
         "End":true
      }
   }
}

Here is the new AWS CloudFormation template for deploying the AWS Batch job definitions for each tool:

AWSTemplateFormatVersion: 2010-09-09

Description: Batch job definitions for batch genomics

Parameters:
  RoleStackName:
    Description: "Stack that deploys roles for genomic workflow"
    Type: String
  VPCStackName:
    Description: "Stack that deploys vps for genomic workflow"
    Type: String
  JobResultsBucket:
    Description: "Bucket that holds workflow job results"
    Type: String

Resources:
  IsaacJobDefinition:
    Type: AWS::Batch::JobDefinition
    Properties:
      JobDefinitionName: "Isaac"
      Type: container
      RetryStrategy:
        Attempts: 1
      Parameters:
        BAMS3FolderPath: !Sub "s3://${JobResultsBucket}/NA12878_states_1/bam"
        FastQ1S3Path: "s3://aws-batch-genomics-resources/fastq/SRR1919605_1.fastq.gz"
        FastQ2S3Path: "s3://aws-batch-genomics-resources/fastq/SRR1919605_2.fastq.gz"
        ReferenceS3Path: "s3://aws-batch-genomics-resources/reference/isaac/"
        WorkingDir: "/scratch"
      ContainerProperties:
        Image: "rulaszek/isaac"
        Vcpus: 32
        Memory: 80000
        JobRoleArn:
          Fn::ImportValue: !Sub "${RoleStackName}:ECSTaskRole"
        Command:
          - "--bam_s3_folder_path"
          - "Ref::BAMS3FolderPath"
          - "--fastq1_s3_path"
          - "Ref::FastQ1S3Path"
          - "--fastq2_s3_path"
          - "Ref::FastQ2S3Path"
          - "--reference_s3_path"
          - "Ref::ReferenceS3Path"
          - "--working_dir"
          - "Ref::WorkingDir"
        MountPoints:
          - ContainerPath: "/scratch"
            ReadOnly: false
            SourceVolume: docker_scratch
        Volumes:
          - Name: docker_scratch
            Host:
              SourcePath: "/docker_scratch"

  StrelkaJobDefinition:
    Type: AWS::Batch::JobDefinition
    Properties:
      JobDefinitionName: "Strelka"
      Type: container
      RetryStrategy:
        Attempts: 1
      Parameters:
        BAMS3Path: !Sub "s3://${JobResultsBucket}/NA12878_states_1/bam/sorted.bam"
        BAIS3Path: !Sub "s3://${JobResultsBucket}/NA12878_states_1/bam/sorted.bam.bai"
        ReferenceS3Path: "s3://aws-batch-genomics-resources/reference/hg38.fa"
        ReferenceIndexS3Path: "s3://aws-batch-genomics-resources/reference/hg38.fa.fai"
        VCFS3Path: !Sub "s3://${JobResultsBucket}/NA12878_states_1/vcf"
        WorkingDir: "/scratch"
      ContainerProperties:
        Image: "rulaszek/strelka"
        Vcpus: 32
        Memory: 32000
        JobRoleArn:
          Fn::ImportValue: !Sub "${RoleStackName}:ECSTaskRole"
        Command:
          - "--bam_s3_path"
          - "Ref::BAMS3Path"
          - "--bai_s3_path"
          - "Ref::BAIS3Path"
          - "--reference_s3_path"
          - "Ref::ReferenceS3Path"
          - "--reference_index_s3_path"
          - "Ref::ReferenceIndexS3Path"
          - "--vcf_s3_path"
          - "Ref::VCFS3Path"
          - "--working_dir"
          - "Ref::WorkingDir"
        MountPoints:
          - ContainerPath: "/scratch"
            ReadOnly: false
            SourceVolume: docker_scratch
        Volumes:
          - Name: docker_scratch
            Host:
              SourcePath: "/docker_scratch"

  SnpEffJobDefinition:
    Type: AWS::Batch::JobDefinition
    Properties:
      JobDefinitionName: "SNPEff"
      Type: container
      RetryStrategy:
        Attempts: 1
      Parameters:
        VCFS3Path: !Sub "s3://${JobResultsBucket}/NA12878_states_1/vcf/variants/genome.vcf.gz"
        AnnotatedVCFS3Path: !Sub "s3://${JobResultsBucket}/NA12878_states_1/vcf/genome.anno.vcf"
        CommandArgs: " -t hg38 "
        WorkingDir: "/scratch"
      ContainerProperties:
        Image: "rulaszek/snpeff"
        Vcpus: 4
        Memory: 10000
        JobRoleArn:
          Fn::ImportValue: !Sub "${RoleStackName}:ECSTaskRole"
        Command:
          - "--annotated_vcf_s3_path"
          - "Ref::AnnotatedVCFS3Path"
          - "--vcf_s3_path"
          - "Ref::VCFS3Path"
          - "--cmd_args"
          - "Ref::CommandArgs"
          - "--working_dir"
          - "Ref::WorkingDir"
        MountPoints:
          - ContainerPath: "/scratch"
            ReadOnly: false
            SourceVolume: docker_scratch
        Volumes:
          - Name: docker_scratch
            Host:
              SourcePath: "/docker_scratch"

  SamtoolsStatsJobDefinition:
    Type: AWS::Batch::JobDefinition
    Properties:
      JobDefinitionName: "SamtoolsStats"
      Type: container
      RetryStrategy:
        Attempts: 1
      Parameters:
        ReferenceS3Path: "s3://aws-batch-genomics-resources/reference/hg38.fa"
        BAMS3Path: !Sub "s3://${JobResultsBucket}/NA12878_states_1/bam/sorted.bam"
        BAMStatsS3Path: !Sub "s3://${JobResultsBucket}/NA12878_states_1/bam/sorted.bam.stats"
        WorkingDir: "/scratch"
      ContainerProperties:
        Image: "rulaszek/samtools-stats"
        Vcpus: 4
        Memory: 10000
        JobRoleArn:
          Fn::ImportValue: !Sub "${RoleStackName}:ECSTaskRole"
        Command:
          - "--bam_s3_path"
          - "Ref::BAMS3Path"
          - "--bam_stats_s3_path"
          - "Ref::BAMStatsS3Path"
          - "--reference_s3_path"
          - "Ref::ReferenceS3Path"
          - "--working_dir"
          - "Ref::WorkingDir"
        MountPoints:
          - ContainerPath: "/scratch"
            ReadOnly: false
            SourceVolume: docker_scratch
        Volumes:
          - Name: docker_scratch
            Host:
              SourcePath: "/docker_scratch"

Here is the new CloudFormation script that deploys the new workflow:

AWSTemplateFormatVersion: 2010-09-09

Description: State machine for batch genomics

Parameters:
  RoleStackName:
    Description: "Stack that deploys roles for genomic workflow"
    Type: String
  VPCStackName:
    Description: "Stack that deploys vps for genomic workflow"
    Type: String

Resources:
  # Step Functions state machine
  GenomicWorkflow:
    Type: AWS::StepFunctions::StateMachine
    Properties:
      RoleArn:
        Fn::ImportValue: !Sub "${RoleStackName}:StatesExecutionRole"
      DefinitionString: !Sub |-
        {
           "Comment":"A simple example that submits a job to AWS Batch",
           "StartAt":"RunIsaacJob",
           "States":{
              "RunIsaacJob":{
                 "Type":"Task",
                 "Resource":"arn:aws:states:::batch:submitJob.sync",
                 "Parameters":{
                    "JobDefinition":"Isaac",
                    "JobName.$":"$.isaac.JobName",
                    "JobQueue":"HighPriority",
                    "Parameters.$": "$.isaac"
                 },
                 "TimeoutSeconds": 900,
                 "HeartbeatSeconds": 60,
                 "Next":"Parallel",
                 "InputPath":"$",
                 "ResultPath":"$.status",
                 "Retry" : [
                    {
                      "ErrorEquals": [ "States.Timeout" ],
                      "IntervalSeconds": 3,
                      "MaxAttempts": 2,
                      "BackoffRate": 1.5
                    }
                 ]
              },
              "Parallel":{
                 "Type":"Parallel",
                 "Next":"FinalState",
                 "Branches":[
                    {
                       "StartAt":"RunStrelkaJob",
                       "States":{
                          "RunStrelkaJob":{
                             "Type":"Task",
                             "Resource":"arn:aws:states:::batch:submitJob.sync",
                             "Parameters":{
                                "JobDefinition":"Strelka",
                                "JobName.$":"$.strelka.JobName",
                                "JobQueue":"HighPriority",
                                "Parameters.$": "$.strelka"
                             },
                             "TimeoutSeconds": 900,
                             "HeartbeatSeconds": 60,
                             "Next":"RunSnpEffJob",
                             "InputPath":"$",
                             "ResultPath":"$.status",
                             "Retry" : [
                                {
                                  "ErrorEquals": [ "States.Timeout" ],
                                  "IntervalSeconds": 3,
                                  "MaxAttempts": 2,
                                  "BackoffRate": 1.5
                                }
                             ]
                          },
                          "RunSnpEffJob":{
                             "Type":"Task",
                             "Resource":"arn:aws:states:::batch:submitJob.sync",
                             "Parameters":{
                                "JobDefinition":"SNPEff",
                                "JobName.$":"$.snpeff.JobName",
                                "JobQueue":"HighPriority",
                                "Parameters.$": "$.snpeff"
                             },
                             "TimeoutSeconds": 900,
                             "HeartbeatSeconds": 60,
                             "Retry" : [
                                {
                                  "ErrorEquals": [ "States.Timeout" ],
                                  "IntervalSeconds": 3,
                                  "MaxAttempts": 2,
                                  "BackoffRate": 1.5
                                }
                             ],
                             "End":true
                          }
                       }
                    },
                    {
                       "StartAt":"RunSamtoolsStatsJob",
                       "States":{
                          "RunSamtoolsStatsJob":{
                             "Type":"Task",
                             "Resource":"arn:aws:states:::batch:submitJob.sync",
                             "Parameters":{
                                "JobDefinition":"SamtoolsStats",
                                "JobName.$":"$.samtools.JobName",
                                "JobQueue":"HighPriority",
                                "Parameters.$": "$.samtools"
                             },
                             "TimeoutSeconds": 900,
                             "HeartbeatSeconds": 60,
                             "End":true,
                             "Retry" : [
                                {
                                  "ErrorEquals": [ "States.Timeout" ],
                                  "IntervalSeconds": 3,
                                  "MaxAttempts": 2,
                                  "BackoffRate": 1.5
                                }
                             ]
                          }
                       }
                    }
                 ]
              },
              "FinalState":{
                 "Type":"Pass",
                 "End":true
              }
           }
        }

Outputs:
  GenomicsWorkflowArn:
    Description: GenomicWorkflow ARN
    Value: !Ref GenomicWorkflow
  StackName:
    Description: StackName
    Value: !Sub ${AWS::StackName}

Conclusion

AWS Step Functions service integrations are a great way to simplify creating complex workflows with asynchronous steps. While we highlighted the use case with AWS Batch today, there are many other ways that healthcare and life sciences customers can use this new feature, such as with message processing.

For more information about how AWS can enable your genomics workloads, be sure to check out the AWS Genomics page.

We’ve updated the open-source project to take advantage of the new AWS Batch integration in Step Functions.  You can find the changes in the aws-batch-genomics/tree/v2.0.0 folder.

Happy coding!

Building a tightly coupled molecular dynamics workflow with multi-node parallel jobs in AWS Batch

Post Syndicated from Josh Rad original https://aws.amazon.com/blogs/compute/building-a-tightly-coupled-molecular-dynamics-workflow-with-multi-node-parallel-jobs-in-aws-batch/

Contributed by Amr Ragab, HPC Application Consultant, AWS Professional Services and Aswin Damodar, Senior Software Development Engineer, AWS Batch

At Supercomputing 2018 in Dallas, TX, AWS announced AWS Batch support for running tightly coupled workloads in a multi-node parallel jobs environment. This AWS Batch feature enables applications that require strong scaling for efficient computational workloads.

Some of the more popular workloads that can take advantage of this feature enhancement include computational fluid dynamics (CFD) codes such as OpenFOAM, Fluent, and ANSYS. Other workloads include molecular dynamics (MD) applications such as AMBER, GROMACS, and NAMD.

Running tightly coupled, distributed deep learning frameworks is also now possible on AWS Batch. Applications that can take advantage include TensorFlow, MXNet, PyTorch, and Chainer. Essentially, any application that benefits from tightly coupled scaling can now be integrated with AWS Batch.

In this post, we show you how to build a workflow executing an MD simulation using GROMACS running on GPUs, using the p3 instance family.

AWS Batch overview

AWS Batch is a service providing managed planning, scheduling, and execution of containerized workloads on AWS. Purpose-built for scalable compute workloads, AWS Batch is ideal for high throughput, distributed computing jobs such as video and image encoding, loosely coupled numerical calculations, and multistep computational workflows.

If you are new to AWS Batch, consider gaining familiarity with the service by following the tutorial in the Creating a Simple “Fetch & Run” AWS Batch Job post.

Prerequisites

You need an AWS account to go through this walkthrough. Other prerequisites include:

  • Launch an ECS container instance using the p3.2xlarge instance type, which is backed by an NVIDIA Tesla V100 GPU. Use the Amazon Linux 2 AMI for ECS.
  • In the ECS instance, install the latest CUDA 10 stack, which provides the toolchain and compilation libraries as well as the NVIDIA driver.
  • Install nvidia-docker2.
  • In your /etc/docker/daemon.json file, ensure that the default-runtime value is set to nvidia.
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}
  • Finally, save the EC2 instance as an AMI in your account. Copy the AMI ID, as you need it later in the post.

Deploying the workload

In a production environment, it’s important to efficiently execute the compute workload with multi-node parallel jobs. Most of the optimization is on the application layer and how efficiently the Message Passing Interface (MPI) ranks (MPI and OpenMP threads) are distributed across nodes. Application-level optimization is out of scope for this post, but should be considered when running in production.

One of the key requirements for running on AWS Batch is a Dockerized image with the application, libraries, scripts, and code. For multi-node parallel jobs, you need an MPI stack for the tightly coupled communication layer and a wrapper script for the MPI orchestration. The running child Docker containers need to pass container IP address information to the master node to fill out the MPI host file.

The undifferentiated heavy lifting that AWS Batch provides is the Docker-to-Docker communication across nodes using Amazon ECS task networking. With multi-node parallel jobs, the ECS container receives environmental variables from the backend, which can be used to establish which running container is the master and which is the child.

  • AWS_BATCH_JOB_MAIN_NODE_INDEX—The index of the main node in a multi-node parallel job. This is the node on which the MPI job is launched.
  • AWS_BATCH_JOB_MAIN_NODE_PRIVATE_IPV4_ADDRESS—The IPv4 address of the main node. This is presented in the environment of all child nodes.
  • AWS_BATCH_JOB_NODE_INDEX—The index of the current node.
  • AWS_BATCH_JOB_NUM_NODES—The number of nodes launched as part of the node group for your multi-node parallel job.

If AWS_BATCH_JOB_MAIN_NODE_INDEX = AWS_BATCH_JOB_NODE_INDEX, then this is the main node. The following code block is an example MPI synchronization script that you can include as part of the CMD structure of the Docker container. Save the following code as mpi-run.sh.

#!/bin/bash

cd $JOB_DIR

PATH="$PATH:/opt/openmpi/bin/"
BASENAME="${0##*/}"
log () {
  echo "${BASENAME} - ${1}"
}
HOST_FILE_PATH="/tmp/hostfile"
AWS_BATCH_EXIT_CODE_FILE="/tmp/batch-exit-code"

aws s3 cp $S3_INPUT $SCRATCH_DIR
tar -xvf $SCRATCH_DIR/*.tar.gz -C $SCRATCH_DIR

sleep 2

usage () {
  if [ "${#@}" -ne 0 ]; then
    log "* ${*}"
    log
  fi
  cat <<ENDUSAGE
Usage:
export AWS_BATCH_JOB_NODE_INDEX=0
export AWS_BATCH_JOB_NUM_NODES=10
export AWS_BATCH_JOB_MAIN_NODE_INDEX=0
export AWS_BATCH_JOB_ID=string
./mpi-run.sh
ENDUSAGE

  error_exit
}

# Standard function to print an error and exit with a failing return code
error_exit () {
  log "${BASENAME} - ${1}" >&2
  log "${2:-1}" > $AWS_BATCH_EXIT_CODE_FILE
  kill  $(cat /tmp/supervisord.pid)
}

# Set child by default switch to main if on main node container
NODE_TYPE="child"
if [ "${AWS_BATCH_JOB_MAIN_NODE_INDEX}" == 
"${AWS_BATCH_JOB_NODE_INDEX}" ]; then
  log "Running synchronize as the main node"
  NODE_TYPE="main"
fi


# wait for all nodes to report
wait_for_nodes () {
  log "Running as master node"

  touch $HOST_FILE_PATH
  ip=$(/sbin/ip -o -4 addr list eth0 | awk '{print $4}' | cut -d/ -f1)
  
  if [ -x "$(command -v nvidia-smi)" ] ; then
      NUM_GPUS=$(ls -l /dev/nvidia[0-9] | wc -l)
      availablecores=$NUM_GPUS
  else
      availablecores=$(nproc)
  fi

  log "master details -> $ip:$availablecores"
  echo "$ip slots=$availablecores" >> $HOST_FILE_PATH

  lines=$(uniq $HOST_FILE_PATH|wc -l)
  while [ "$AWS_BATCH_JOB_NUM_NODES" -gt "$lines" ]
  do
    log "$lines out of $AWS_BATCH_JOB_NUM_NODES nodes joined, check again in 1 second"
    sleep 1
    lines=$(uniq $HOST_FILE_PATH|wc -l)
  done
  # Make the temporary file executable and run it with any given arguments
  log "All nodes successfully joined"
  
  # remove duplicates if there are any.
  awk '!a[$0]++' $HOST_FILE_PATH > ${HOST_FILE_PATH}-deduped
  cat ${HOST_FILE_PATH}-deduped
  log "executing main MPIRUN workflow"

  cd $SCRATCH_DIR
  . /opt/gromacs/bin/GMXRC
  /opt/openmpi/bin/mpirun --mca btl_tcp_if_include eth0 \
                          -x PATH -x LD_LIBRARY_PATH -x GROMACS_DIR -x GMXBIN -x GMXMAN -x GMXDATA \
                          --allow-run-as-root --machinefile ${HOST_FILE_PATH}-deduped \
                          $GMX_COMMAND
  sleep 2

  tar -czvf $JOB_DIR/batch_output_$AWS_BATCH_JOB_ID.tar.gz $SCRATCH_DIR/*
  aws s3 cp $JOB_DIR/batch_output_$AWS_BATCH_JOB_ID.tar.gz $S3_OUTPUT

  log "done! goodbye, writing exit code to 
$AWS_BATCH_EXIT_CODE_FILE and shutting down my supervisord"
  echo "0" > $AWS_BATCH_EXIT_CODE_FILE
  kill  $(cat /tmp/supervisord.pid)
  exit 0
}


# Fetch and run a script
report_to_master () {
  # get own ip and num cpus
  #
  ip=$(/sbin/ip -o -4 addr list eth0 | awk '{print $4}' | cut -d/ -f1)

  if [ -x "$(command -v nvidia-smi)" ] ; then
      NUM_GPUS=$(ls -l /dev/nvidia[0-9] | wc -l)
      availablecores=$NUM_GPUS
  else
      availablecores=$(nproc)
  fi

  log "I am a child node -> $ip:$availablecores, reporting to the master node -> 
${AWS_BATCH_JOB_MAIN_NODE_PRIVATE_IPV4_ADDRESS}"
  until echo "$ip slots=$availablecores" | ssh 
${AWS_BATCH_JOB_MAIN_NODE_PRIVATE_IPV4_ADDRESS} "cat >> 
/$HOST_FILE_PATH"
  do
    echo "Sleeping 5 seconds and trying again"
  done
  log "done! goodbye"
  exit 0
  }


# Main - dispatch user request to appropriate function
log $NODE_TYPE
case $NODE_TYPE in
  main)
    wait_for_nodes "${@}"
    ;;

  child)
    report_to_master "${@}"
    ;;

  *)
    log $NODE_TYPE
    usage "Could not determine node type. Expected (main/child)"
    ;;
esac

The synchronization script supports downloading the assets from Amazon S3 as well as preparing the MPI host file based on GPU scheduling for GROMACS.

Furthermore, the mpirun stanza is captured in this script. This script can be a template for several multi-node parallel job applications by just changing a few lines.  These lines are essentially the GROMACS-specific steps:

. /opt/gromacs/bin/GMXRC
export OMP_NUM_THREADS=$OMP_THREADS
/opt/openmpi/bin/mpirun -np $MPI_THREADS --mca btl_tcp_if_include eth0 \
-x OMP_NUM_THREADS -x PATH -x LD_LIBRARY_PATH -x GROMACS_DIR -x GMXBIN -x GMXMAN -x GMXDATA \
--allow-run-as-root --machinefile ${HOST_FILE_PATH}-deduped \
$GMX_COMMAND

In your development environment for building Docker images, create a Dockerfile that prepares the software stack for running GROMACS. The key elements of the Dockerfile are:

  1. Set up a passwordless-ssh keygen.
  2. Download and compile OpenMPI. In this Dockerfile, you are downloading the recently released OpenMPI 4.0.0 source and compiling it on an NVIDIA Tesla V100 GPU-backed instance (p3.2xlarge).
  3. Download and compile GROMACS.
  4. Set up supervisor to run SSH at Docker container startup as well as processing the mpi-run.sh script as the CMD.

Save the following script as a Dockerfile:

FROM nvidia/cuda:latest

ENV USER root

# -------------------------------------------------------------------------------------
# install needed software -
# openssh
# mpi
# awscli
# supervisor
# -------------------------------------------------------------------------------------

RUN apt update
RUN DEBIAN_FRONTEND=noninteractive apt install -y iproute2 cmake openssh-server openssh-client python python-pip build-essential gfortran wget curl
RUN pip install supervisor awscli

RUN mkdir -p /var/run/sshd
ENV DEBIAN_FRONTEND noninteractive

ENV NOTVISIBLE "in users profile"

#####################################################
## SSH SETUP

RUN sed -i 's/PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
RUN sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd
RUN echo "export VISIBLE=now" >> /etc/profile

RUN echo "${USER} ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers
ENV SSHDIR /root/.ssh
RUN mkdir -p ${SSHDIR}
RUN touch ${SSHDIR}/sshd_config
RUN ssh-keygen -t rsa -f ${SSHDIR}/ssh_host_rsa_key -N ''
RUN cp ${SSHDIR}/ssh_host_rsa_key.pub ${SSHDIR}/authorized_keys
RUN cp ${SSHDIR}/ssh_host_rsa_key ${SSHDIR}/id_rsa
RUN echo " IdentityFile ${SSHDIR}/id_rsa" >> /etc/ssh/ssh_config
RUN echo "Host *" >> /etc/ssh/ssh_config && echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config
RUN chmod -R 600 ${SSHDIR}/* && \
chown -R ${USER}:${USER} ${SSHDIR}/
# check if ssh agent is running or not, if not, run
RUN eval `ssh-agent -s` && ssh-add ${SSHDIR}/id_rsa

##################################################
## S3 OPTIMIZATION

RUN aws configure set default.s3.max_concurrent_requests 30
RUN aws configure set default.s3.max_queue_size 10000
RUN aws configure set default.s3.multipart_threshold 64MB
RUN aws configure set default.s3.multipart_chunksize 16MB
RUN aws configure set default.s3.max_bandwidth 4096MB/s
RUN aws configure set default.s3.addressing_style path

##################################################
## CUDA MPI

RUN wget -O /tmp/openmpi.tar.gz https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.0.tar.gz && \
tar -xvf /tmp/openmpi.tar.gz -C /tmp
RUN cd /tmp/openmpi* && ./configure --prefix=/opt/openmpi --with-cuda --enable-mpirun-prefix-by-default && \
make -j $(nproc) && make install
RUN echo "export PATH=$PATH:/opt/openmpi/bin" >> /etc/profile
RUN echo "export LD_LIBRARY_PATH=$LD_LIRBARY_PATH:/opt/openmpi/lib:/usr/local/cuda/include:/usr/local/cuda/lib64" >> /etc/profile

###################################################
## GROMACS 2018 INSTALL

ENV PATH $PATH:/opt/openmpi/bin
ENV LD_LIBRARY_PATH $LD_LIBRARY_PATH:/opt/openmpi/lib:/usr/local/cuda/include:/usr/local/cuda/lib64
RUN wget -O /tmp/gromacs.tar.gz http://ftp.gromacs.org/pub/gromacs/gromacs-2018.4.tar.gz && \
tar -xvf /tmp/gromacs.tar.gz -C /tmp
RUN cd /tmp/gromacs* && mkdir build
RUN cd /tmp/gromacs*/build && \
cmake .. -DGMX_MPI=on -DGMX_THREAD_MPI=ON -DGMX_GPU=ON -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda -DGMX_BUILD_OWN_FFTW=ON -DCMAKE_INSTALL_PREFIX=/opt/gromacs && \
make -j $(nproc) && make install
RUN echo "source /opt/gromacs/bin/GMXRC" >> /etc/profile

###################################################
## supervisor container startup

ADD conf/supervisord/supervisord.conf /etc/supervisor/supervisord.conf
ADD supervised-scripts/mpi-run.sh supervised-scripts/mpi-run.sh
RUN chmod 755 supervised-scripts/mpi-run.sh

EXPOSE 22
RUN export PATH="$PATH:/opt/openmpi/bin"
ADD batch-runtime-scripts/entry-point.sh batch-runtime-scripts/entry-point.sh
RUN chmod 0755 batch-runtime-scripts/entry-point.sh

CMD /batch-runtime-scripts/entry-point.sh

After the container is built, push the image to your Amazon ECR repository and note the container image URI for later steps.
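
For reference, pushing the image to ECR typically looks like the following sketch. It assumes the AWS CLI v2, the us-east-2 Region, and a repository named gromacs_mpi, matching the job definition later in this post.

# Create the repository once, log Docker in to ECR, then build, tag, and push.
aws ecr create-repository --repository-name gromacs_mpi --region us-east-2
aws ecr get-login-password --region us-east-2 | \
    docker login --username AWS --password-stdin <accountid>.dkr.ecr.us-east-2.amazonaws.com
docker build -t gromacs_mpi .
docker tag gromacs_mpi:latest <accountid>.dkr.ecr.us-east-2.amazonaws.com/gromacs_mpi:latest
docker push <accountid>.dkr.ecr.us-east-2.amazonaws.com/gromacs_mpi:latest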

Set up GROMACS

For the input files, use the Chalcone Synthase (1CGZ) example, from RCSB.org. For this post, just run a simple simulation following the Lysozyme in Water GROMACS tutorial.

Execute the production MD run before the analysis (that is, after the system is solvated, neutralized, and equilibrated), so you can show that the longest part of the simulation can be achieved in a containerized workflow.

It is possible to run the entire workflow from the tutorial, from PDB preparation through solvation, energy minimization, and analysis, in AWS Batch.

Set up the compute environment

For the purpose of running the MD simulation in this test case, use two p3.2xlarge instances. Each instance provides one NVIDIA Tesla V100 GPU, across which GROMACS distributes the job. You don’t have to launch specific instance types: within the p3 family, the MPI wrapper adjusts the MPI ranks to match the GPU count and node topology.

When the job is executed, instantiate two MPI processes with two OpenMP threads per MPI process. For this post, launch On-Demand Instances; with the Amazon Linux AMIs, you can take advantage of per-second billing.

Under Create Compute Environment, choose a managed compute environment and provide a name, such as gromacs-gpu-ce. Attach two roles:

  • AWSBatchServiceRole—Allows AWS Batch to make EC2 calls on your behalf.
  • ecsInstanceRole—Allows the underlying instance to make AWS API calls.

In the next panel, specify the following field values:

  • Provisioning model: EC2
  • Allowed instance types: p3 family
  • Minimum vCPUs: 0
  • Desired vCPUs: 0
  • Maximum vCPUs: 128

Select Enable user-specified AMI ID and enter the AMI that you created earlier.

Finally, for the compute environment, specify the network VPC and subnets for launching the instances, as well as a security group. We recommend specifying a placement group for tightly coupled workloads for better performance. You can also create EC2 tags for the launch instances. We used name=gromacs-gpu-processor.

Next, choose Job Queues and create a gromacs-queue queue coupled with the compute environment created earlier. Set the priority to 1 and select Enable job queue.
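
If you prefer the AWS CLI over the console, creating the same queue is a single call once the gromacs-gpu-ce compute environment exists; a minimal sketch:

aws batch create-job-queue \
    --job-queue-name gromacs-queue \
    --state ENABLED \
    --priority 1 \
    --compute-environment-order order=1,computeEnvironment=gromacs-gpu-ce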

Set up the job definition

In the job definition setup, you create a two-node group, where each node pulls the gromacs_mpi image. Because you are using the p3.2xlarge instance providing one V100 GPU per instance, your vCPU slots = 8 for scheduling purposes.

{
    "jobDefinitionName": "gromacs-jobdef",
    "jobDefinitionArn": "arn:aws:batch:us-east-2:<accountid>:job-definition/gromacs-jobdef:1",
    "revision": 6,
    "status": "ACTIVE",
    "type": "multinode",
    "parameters": {},
    "nodeProperties": {
        "numNodes": 2,
        "mainNode": 0,
        "nodeRangeProperties": [
            {
                "targetNodes": "0:1",
                "container": {
                    "image": "<accountid>.dkr.ecr.us-east-2.amazonaws.com/gromacs_mpi:latest",
                    "vcpus": 8,
                    "memory": 24000,
                    "command": [],
                    "jobRoleArn": "arn:aws:iam::<accountid>:role/ecsTaskExecutionRole",
                    "volumes": [
                        {
                            "host": {
                                "sourcePath": "/scratch"
                            },
                            "name": "scratch"
                        },
                        {
                            "host": {
                                "sourcePath": "/efs"
                            },
                            "name": "efs"
                        }
                    ],
                    "environment": [
                        {
                            "name": "SCRATCH_DIR",
                            "value": "/scratch"
                        },
                        {
                            "name": "JOB_DIR",
                            "value": "/efs"
                        },
                        {
                            "name": "GMX_COMMAND",
                            "value": "gmx_mpi mdrun -deffnm md_0_1 -nb gpu -ntomp 1"
                        },
                        {
                            "name": "OMP_THREADS",
                            "value": "2"
                        },
                        {
                            "name": "MPI_THREADS",
                            "value": "1"
                        },
                        {
                            "name": "S3_INPUT",
                            "value": "s3://ragab-md/1CGZ.tar.gz"
                        },
                        {
                            "name": "S3_OUTPUT",
                            "value": "s3://ragab-md"
                        }
                    ],
                    "mountPoints": [
                        {
                            "containerPath": "/scratch",
                            "sourceVolume": "scratch"
                        },
                        {
                            "containerPath": "/efs",
                            "sourceVolume": "efs"
                        }
                    ],
                    "ulimits": [],
                    "instanceType": "p3.2xlarge"
                }
            }
        ]
    }
}
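
If you save the JSON above to a file, you can register it with the CLI, as in the sketch below. Remove the read-only fields (jobDefinitionArn, revision, and status) first, because they are returned by describe calls and are not accepted on registration.

aws batch register-job-definition --cli-input-json file://gromacs-jobdef.json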

Submit the GROMACS job

In the AWS Batch job submission portal, provide a job name and select the job definition created earlier as well as the job queue. Ensure that the vCPU value is set to 8 and the Memory (MiB) value is 24000.

Under Environment variables, within each node group, ensure that the keys are set correctly as follows.

Key          Value
SCRATCH_DIR  /scratch
JOB_DIR      /efs
OMP_THREADS  2
GMX_COMMAND  gmx_mpi mdrun -deffnm md_0_1 -nb gpu
MPI_THREADS  2
S3_INPUT     s3://<your input>
S3_OUTPUT    s3://<your output>

Submit the job and wait for it to enter into the RUNNING state. After the job is in the RUNNING state, select the job ID and choose Nodes.
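
The same submission can also be scripted; a minimal sketch using the queue and job definition names from this post:

# Submit the multi-node parallel job and capture the job ID.
JOB_ID=$(aws batch submit-job \
    --job-name gromacs-md-run \
    --job-queue gromacs-queue \
    --job-definition gromacs-jobdef \
    --query jobId --output text)

# Check the status until it reaches RUNNING (and eventually SUCCEEDED).
aws batch describe-jobs --jobs "$JOB_ID" --query 'jobs[0].status' --output text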

The containers listed each write to a separate Amazon CloudWatch log stream where you can monitor the progress.

After the job completes, the entire working directory is compressed and uploaded to S3. The trajectories (*.xtc) and input .gro files can then be viewed in your favorite MD analysis package. For more information about preparing a desktop, see Deploying a 4x4K, GPU-backed Linux desktop instance on AWS.
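
Retrieving the results is just a matter of copying the archive back from your output bucket and unpacking it; the bucket name and job ID below are placeholders:

aws s3 cp s3://<your-output-bucket>/batch_output_<job-id>.tar.gz .
tar -xzvf batch_output_<job-id>.tar.gz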

You can view the trajectories in PyMOL as well as running any subsequent trajectory analysis.

Extending the solution

As we mentioned earlier, you can take this core workload and extend it as part of a job execution chain in a workflow. Native support for job dependencies exists in AWS Batch and alternatively in AWS Step Functions. With Step Functions, you can create a decision-based workflow tree to run the preparation, solvation, energy minimization, equilibration, production MD, and analysis.

Conclusion

In this post, we showed that tightly coupled, scalable MD simulations can be executed using the recently released multi-node parallel jobs feature for AWS Batch. You finally have an end-to-end solution for distributed and MPI-based workloads.

As we mentioned earlier, many other applications can also take advantage of this feature set. We invite you to try this out and let us know how it goes.

Want to discuss how your tightly coupled workloads can benefit on AWS Batch? Contact AWS.

Using Cromwell with AWS Batch

Post Syndicated from Josh Rad original https://aws.amazon.com/blogs/compute/using-cromwell-with-aws-batch/

Contributed by W. Lee Pang and Emil Lerch, WWPS Professional Services

DNA is often referred to as the “source code of life.” All living cells contain long chains of deoxyribonucleic acid that encode instructions on how they are constructed and behave in their surroundings. Genomics is the study of the structure and function of DNA at the molecular level. It has recently shown immense potential to provide improved detection, diagnosis, and treatment of human diseases.

Continuous improvements in genome sequencing technologies have accelerated genomics research by providing unprecedented speed, accuracy, and quantity of DNA sequence data. In fact, the rate of sequencing efficiency has been shown to outpace Moore’s law. Processing this influx of genomic data is ideally aligned with the power and scalability of cloud computing.

Genomic data processing typically uses a wide assortment of specialized bioinformatics tools, like sequence alignment algorithms, variant callers, and statistical analysis methods. These tools are run in sequence as workflow pipelines that can range from a couple of steps to many long toolchains executing in parallel.

Traditionally, bioinformaticians and genomics scientists relied on Bash, Perl, or Python scripts to orchestrate their pipelines. As pipelines have gotten more complex, and maintainability and reproducibility have become standard requirements in science, the need for specialized orchestration tooling and portable workflow definitions has grown significantly.

What is Cromwell?

The Broad Institute’s Cromwell is purpose-built for this need. It is a workflow execution engine for orchestrating command line and containerized tools. Most importantly, it is the engine that drives the GATK Best Practices genome analysis pipeline.

Workflows for Cromwell are defined using the Workflow Definition Language (WDL, pronounced “widdle”), a flexible meta-scripting language that allows researchers to focus on the pieces of their workflow that matter: the tools for each step and their respective inputs and outputs, not the plumbing in between.

Genomics data is not small (on the order of TBs-PBs for one experiment!), so processing it usually requires significant computing scale, like HPC clusters and cloud computing. Cromwell has previously enabled this with support for many backends such as Spark, and HPC frameworks like Sun GridEngine and SLURM.

AWS and Cromwell

We are excited to announce that Cromwell now supports AWS! In this post, we go over how to configure Cromwell on AWS and get started running genomics pipelines in the cloud.

In a nutshell, the AWS backend for Cromwell is a layer that communicates with AWS Batch. Why AWS Batch? As stated before, genomics analysis pipelines are composed of many different tools. Each of these tools can have specific computing requirements. Operations like genome alignment can be memory-intensive, whereas joint genotyping may be compute-heavy.

AWS Batch dynamically provisions the optimal quantity and type of compute resources (for example, CPU or memory-optimized instances). Provisioning is based on the volume and specific resource requirements of the batch jobs submitted. This means that each step of a genomics workflow gets the most ideal instance to run on.

The AWS backend translates Cromwell task definitions into batch job definitions and submits them via API calls to a user-specified batch queue. Runtime parameters, such as the container image to use and resources like desired vCPUs and memory, are also translated from the WDL task and transmitted to the batch job. A number of environment variables are automatically set on the job to support data localization and de-localization on the job instance. Ultimately, scientists and genomics researchers should find this way of submitting jobs to AWS Batch familiar, because it uses their existing WDL files and research processes.

Getting started

To get started using Cromwell with AWS, create a custom AMI. This is necessary to ensure that the AMI is private to the account, encrypted, and has tooling specific to genomics workloads and Cromwell.

One feature of this tooling is the automatic creation and attachment of additional Amazon Elastic Block Store (Amazon EBS) capacity as additional data is copied onto the EC2 instance for processing. It also contains an ECS agent that has been customized to the needs of Cromwell, and a Cromwell Docker image responsible for interfacing the Cromwell task with Amazon S3.

After the custom AMI is created, install Cromwell on your workstation or EC2 instance. Configure an S3 bucket to hold Cromwell execution directories. For the purposes of this post, we refer to the bucket as s3-bucket-name. Lastly, go to the AWS Batch console, and create a job queue. Save the ARN of the queue, as this is needed later.
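
If you created the queue in the console, its ARN can also be retrieved from the CLI; a quick sketch (the queue name is a placeholder):

aws batch describe-job-queues \
    --job-queues <your-queue-name> \
    --query 'jobQueues[0].jobQueueArn' --output text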

To set up these resources with a single click, this link provides a set of AWS CloudFormation templates that gets all the needed infrastructure running in minutes.

The next step is to configure Cromwell to work with AWS Batch and your newly created S3 bucket. Use the sample hello.wdl and hello.inputs files from the Cromwell AWS backend tutorial. You also need a custom configuration file so that Cromwell can interact with AWS Batch.

The following sample file can be used on an EC2 instance with the appropriate IAM role attached, or on a developer workstation with the AWS CLI configured. Keep in mind that you must replace <s3-bucket-name> in the configuration file with the appropriate bucket name. Also, replace “your ARN here” with the ARN of the job queue that you created earlier.

// aws.conf

include required(classpath("application"))

aws {

    application-name = "cromwell"
    
    auths = [
        {
         name = "default"
         scheme = "default"
        }
    ]
    
    region = "default"
    // uses region from ~/.aws/config set by aws configure command,
    // or us-east-1 by default
}

engine {
     filesystems {
         s3 {
            auth = "default"
         }
    }
}

backend {
     default = "AWSBATCH"
     providers {
         AWSBATCH {
             actor-factory = "cromwell.backend.impl.aws.AwsBatchBackendLifecycleActorFactory"
             config {
                 // Base bucket for workflow executions
                 root = "s3://<s3-bucket-name>/cromwell-execution"
                
                 // A reference to an auth defined in the `aws` stanza at the top. This auth is used to create
                 // Jobs and manipulate auth JSONs.
                 auth = "default"

                 numSubmitAttempts = 3
                 numCreateDefinitionAttempts = 3

                 concurrent-job-limit = 16
                
                 default-runtime-attributes {
                    queueArn: "<your ARN here>"
                 }
                
                 filesystems {
                     s3 {
                         // A reference to a potentially different auth for manipulating files via engine functions.
                         auth = "default"
                     }
                 }
             }
         }
     }
}

Now, you can run your workflow. The following command runs Hello World, and ensures that everything is connected properly:

$ java -Dconfig.file=aws.conf -jar cromwell-34.jar run hello.wdl -i hello.inputs

After the workflow has run, your workflow logs should report the workflow outputs.

[info] SingleWorkflowRunnerActor workflow finished with status 'Succeeded'.
{
 "outputs": {
    "wf_hello.hello.message": "Hello World! Welcome to Cromwell . . . on AWS!"
 },
 "id": "08213b40-bcf5-470d-b8b7-1d1a9dccb10e"
}

You also see your job in the “succeeded” section of the AWS Batch Jobs console.

After the environment is configured properly, other Cromwell WDL files can be used as usual.

Conclusion

With AWS Batch, a customized AMI instance, and Cromwell workflow definitions, AWS provides a simple solution to process genomics data easily. We invite you to incorporate this into your automated pipeline.

Deploy an 8K HEVC pipeline using Amazon EC2 P3 instances with AWS Batch

Post Syndicated from Geoff Murase original https://aws.amazon.com/blogs/compute/deploy-an-8k-hevc-pipeline-using-amazon-ec2-p3-instances-with-aws-batch/

Contributed by Amr Ragab, HPC Application Consultant, AWS Professional Services

AWS provides several managed services for file- and streaming-based media encoding options.

Currently, these services offer up to 4K encoding. Recent developments and the growing popularity of 8K content has now increased the need to distribute higher resolution content.

In this solution, you use an Amazon EC2 P3 instance to create a file-based encoding pipeline utilizing AWS Batch by first uploading a sample 8K (7680×4320) file to Amazon S3.

AWS Batch

AWS Batch enables developers, scientists, and engineers to easily and efficiently run hundreds of thousands of batch computing jobs on AWS. AWS Batch dynamically provisions the optimal quantity and type of compute resources (e.g., CPU or memory optimized instances) based on the volume and specific resource requirements of the batch jobs submitted. With AWS Batch, there is no need to install and manage batch computing software or server clusters that you use to run your jobs, allowing you to focus on analyzing results and solving problems. AWS Batch plans, schedules, and executes your batch computing workloads across the full range of AWS compute services and features, such as Amazon EC2 and Spot Instances.

P3 instances for video transcoding workloads

The P3 instance comes equipped with the NVIDIA Tesla V100 GPU. The V100 is a 16 GB, 5,120 CUDA core GPU based on the latest Volta architecture and is well suited for video coding workloads. The largest instance size in that family, p3.16xlarge, has 64 vCPUs, 488 GB of RAM, 8 NVIDIA Tesla V100 GPUs, and 25 Gbps of networking bandwidth.

Other than being a mainstay in computational workloads, the V100 offers enhanced hardware-based encoding and decoding (NVENC/NVDEC). The following tables summarize the NVENC/NVDEC options available compared to other GPUs offered at AWS.

NVENC Support Matrix

  • G2 (Kepler, GRID K520): H.264 (AVCHD) YUV 4:2:0
  • P2 (Kepler 2nd Gen, Tesla K80): H.264 (AVCHD) YUV 4:2:0
  • G3 (Maxwell 2nd Gen, Tesla M60): H.264 (AVCHD) YUV 4:2:0, YUV 4:4:4, and Lossless; H.265 (HEVC) 4K YUV 4:2:0
  • P3 (Volta, Tesla V100): H.264 (AVCHD) YUV 4:2:0, YUV 4:4:4, and Lossless; H.265 (HEVC) 4K YUV 4:2:0, YUV 4:4:4, and Lossless; H.265 (HEVC) 8K

NVDEC Support Matrix

  • G2 (Kepler, GRID K520): MPEG-2, VC-1, H.264 (AVCHD)
  • P2 (Kepler 2nd Gen, Tesla K80): MPEG-2, VC-1, H.264 (AVCHD)
  • G3 (Maxwell 2nd Gen, Tesla M60): MPEG-2, VC-1, H.264 (AVCHD), VP8
  • P3 (Volta, Tesla V100): MPEG-2, VC-1, H.264 (AVCHD), H.265 (HEVC), VP8, VP9

Cinematic 8K encoding is supported using the Tesla V100 (P3 instance family), in either landscape or portrait orientation, using the HEVC codec.

GPU          Supported NVENC features
Tesla M60    H264, H264_444, H264_ME; H264_WxH up to 4096x4096; HEVC; HEVC_WxH up to 4096x4096
Tesla V100   H264, H264_444, H264_ME; H264_WxH up to 4096x4096; HEVC, HEVC_Main10, HEVC_Lossless, HEVC_SAO, HEVC_444, HEVC_ME; HEVC_WxH up to 8192x8192

Prerequisites

To follow along with these procedures, ensure that you have the following:

  • An AWS account with permissions to create IAM roles and policies, as well as read and write access to S3
  • Registration with the NVIDIA Developer Network
  • Familiarity with Docker

Deployment

For deployment, you containerize the encoding pipeline. After building the underlying P3 container instance, you then use nvidia-docker2 to build the video-encoding Docker image, which is registered with Amazon Elastic Container Registry (Amazon ECR).

As shown in the following diagram, the pipeline reads an input raw YUV file from S3, then pulls the containerized encoding application to execute at scale on the P3 container instance. The encoded video file is then transferred to S3.

The nvidia-docker2 image video encoding stack contains the following components:

  • NVIDIA CUDA 9.2
  • FFMPEG 4.0
  • NVIDIA Video Codec SDK 8.1

This is a relatively lengthy procedure. However, after it’s built, the underlying instance and Docker image are reusable and can be quickly deployed as part of a high performance computing (HPC) pipeline.

Creating the ECS container instance

The underlying instance can be built by selecting the Amazon Linux AMI with the p3.2xlarge instance type in a public subnet. Additionally, add an EBS volume (150 GB), which is used for the 8K input, raw YUV, and output files. Scale the storage amount for larger input files. Persist the mount in /etc/fstab, as shown in the sketch below. Then connect to the instance over SSH, install any OS updates, and install the EPEL release package, supporting packages, and the base docker-ce package.
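
A minimal sketch of preparing and persisting that volume follows; the device name /dev/xvdb, the ext4 file system, and the /mnt/data mount point are assumptions, so adjust them to match your instance.

# Format the attached 150 GB EBS volume and mount it (device name is an assumption).
sudo mkfs -t ext4 /dev/xvdb
sudo mkdir -p /mnt/data
sudo mount /dev/xvdb /mnt/data

# Persist the mount across reboots in /etc/fstab.
echo '/dev/xvdb /mnt/data ext4 defaults,nofail 0 2' | sudo tee -a /etc/fstab

With the volume mounted, proceed with the package installation: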

sudo yum update -y
sudo yum install yum-utils \
                 device-mapper-persistent-data \
                 lvm2 -y

sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
sudo yum install epel-release-latest-7.noarch.rpm
sudo yum update
sudo yum install docker-ce -y

The NVIDIA/CUDA stack can be installed using the cuda-repo-rhel7.rpm file. The CUDA framework installs the NVIDIA driver dependencies.

sudo yum install cuda -y

Next, install nvidia-docker2 as provided in the NVIDIA GitHub repo.

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | \
  sudo tee /etc/yum.repos.d/nvidia-docker.repo
sudo yum install -y nvidia-docker2
sudo pkill -SIGHUP dockerd

sudo tee /etc/docker/daemon.json <<EOF
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}
EOF

sudo systemctl restart docker

With the base components in place, make this instance compatible with the ECS service:

sudo yum install ecs-init -y

Create the /etc/ecs/ecs.config file with the following template:

sudo tee /etc/ecs/ecs.config << EOF
ECS_DATADIR=/data
ECS_ENABLE_TASK_IAM_ROLE=true
ECS_ENABLE_TASK_IAM_ROLE_NETWORK_HOST=true
ECS_LOGFILE=/log/ecs-agent.log
ECS_AVAILABLE_LOGGING_DRIVERS=["json-file","awslogs"]
ECS_LOGLEVEL=info
ECS_CLUSTER=default
EOF

Iptables and packet forwarding rules need to be created to pass IAM roles into task operations:

sudo sh -c "echo 'net.ipv4.conf.all.route_localnet = 1' >> /etc/sysctl.conf"
sudo sysctl -p /etc/sysctl.conf
sudo iptables -t nat -A PREROUTING -p tcp -d 169.254.170.2 --dport 80 -j DNAT --to-destination 127.0.0.1:51679
sudo iptables -t nat -A OUTPUT -d 169.254.170.2 -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 51679
sudo sh -c 'iptables-save > /etc/sysconfig/iptables'

Finally, a systemd unit file needs to be created:

sudo tee /etc/systemd/system/docker-container@ecs-agent.service << EOF
[Unit]
Description=Docker Container %I
Requires=docker.service
After=docker.service

[Service]
Restart=always
ExecStartPre=-/usr/bin/docker rm -f %i
ExecStart=/usr/bin/docker run --name %i \
--privileged \
--restart=on-failure:10 \
--volume=/var/run:/var/run \
--volume=/var/log/ecs/:/log:Z \
--volume=/var/lib/ecs/data:/data:Z \
--volume=/etc/ecs:/etc/ecs \
--net=host \
--env-file=/etc/ecs/ecs.config \
amazon/amazon-ecs-agent:latest
ExecStop=/usr/bin/docker stop %i

[Install]
WantedBy=default.target
EOF

sudo systemctl enable docker-container@ecs-agent.service
sudo systemctl start docker-container@ecs-agent.service
sudo systemctl status docker-container@ecs-agent.service

Ensure that the docker-container@ecs-agent service starts successfully.

Creating the NVIDIA-Docker image

With Docker installed, pull the latest nvidia/cuda:latest image from DockerHub.

docker pull nvidia/cuda:latest

It is best at this point to run the Docker container in interactive mode; a Dockerfile can be created afterwards. At the time of publication, only CUDA 9.0 is installed in this image, but NVIDIA has already provided the necessary repositories. Install CUDA 9.2 and supporting packages inside the Docker container; these commands are prefixed with the (docker) label:

docker run -it --runtime=nvidia --rm nvidia/cuda
(docker) apt update
(docker) apt install pkg-config build-essential wget curl nasm unzip \
                     git libglew-dev cuda-toolkit-9-2 python3-pip -y
(docker) pip3 install awscli

Next, download the FFMPEG 4.0, nv-codec-headers, and the Video Codec SDK 8.1 from the NVIDIA Developer platform.

First, extract the nv-codec-headers archive, change into the resulting directory, and build and install the headers:

(docker) make
(docker) make install

Extract the ffmpeg-4.0 archive, change into the directory, and then configure, compile, and install FFmpeg:

(docker) ./configure --enable-cuda --enable-cuvid --enable-nvenc --enable-nonfree --enable-libnpp --extra-cflags=-I/usr/local/cuda/include --extra-ldflags=-L/usr/local/cuda/lib64
(docker) make -j 4
(docker) make install

Download and extract the NVIDIA Video Codec SDK 8.1. The “Samples” directory has a preconfigured Makefile that compiles the binaries in the SDK. After the build succeeds, confirm that the binaries are set up correctly:

(docker): ~/Video_Codec_SDK_8.1.24/Samples/AppEncode/AppEncCuda$ ./AppEncCuda -h
Options:
-i Input file path
-o Output file path
-s Input resolution in this form: WxH
-if Input format: iyuv nv12 yuv444 p010 yuv444p16 bgra bgra10 ayuv abgr abgr10
-gpu Ordinal of GPU to use
-codec Codec: h264 hevc
-preset Preset: default hp hq bd ll ll_hp ll_hq lossless lossless_hp
-profile H264: baseline main high high444; HEVC: main main10 frext
-444 (Only for RGB input) YUV444 encode
-rc Rate control mode: constqp vbr cbr cbr_ll_hq cbr_hq vbr_hq
-fps Frame rate
-gop Length of GOP (Group of Pictures)
-bf Number of consecutive B-frames
-bitrate Average bit rate, can be in unit of 1, K, M
-maxbitrate Max bit rate, can be in unit of 1, K, M
-vbvbufsize VBV buffer size in bits, can be in unit of 1, K, M
-vbvinit VBV initial delay in bits, can be in unit of 1, K, M
-aq Enable spatial AQ and set its stength (range 1-15, 0-auto)
-temporalaq (No value) Enable temporal AQ
-lookahead Maximum depth of lookahead (range 0-32)
-cq Target constant quality level for VBR mode (range 1-51, 0-auto)
-qmin Min QP value
-qmax Max QP value
-initqp Initial QP value
-constqp QP value for constqp rate control mode
Note: QP value can be in the form of qp_of_P_B_I or qp_P,qp_B,qp_I (no space)

Encoder Capability
# GPU H264 H264_444 H264_ME H264_WxH HEVC HEVC_Main10 HEVC_Lossless HEVC_SAO HEVC_444 HEVC_ME HEVC_WxH
0 Tesla V100-SXM2-16GB + + + 4096x4096 + + + + + + 8192x8192

Create a small script to be used for the 8K-encoding test inside the Docker container. Save the file as /root/nvenc-processor.sh. In the basic form, this script encodes using a single thread. For comparison, the same file is encoded using four threads.

(docker)
#!/bin/bash -xe
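# Fetch the raw 8K source from S3, convert it to raw YUV with FFmpeg, encode it with the
# Video Codec SDK samples (single-threaded, then with four threads), and upload the result.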
time aws s3 cp $S3_INPUT /mnt/8k.webm

time /usr/local/bin/ffmpeg -y -hwaccel cuda -i /mnt/8k.webm -c:v rawvideo -pix_fmt yuv420p /mnt/8k.yuv
time /root/Video_Codec_SDK/Samples/AppEncode/AppEncCuda/AppEncCuda -i /mnt/8k.yuv -o /mnt/8k.hevc -s 7680x4320 -codec hevc
time /root/Video_Codec_SDK/Samples/AppEncode/AppEncPerf/AppEncPerf -i /mnt/8k.yuv -s 7680x4320 -thread 4 -codec hevc

time aws s3 cp /mnt/8k.hevc $S3_OUTPUT

This script downloads a file from S3 and processes it through FFmpeg. Using the AppEncCuda and AppEncPerf samples, it creates the 8K-encoded file, which is then uploaded back to S3. Commit your Docker container into a new Docker image:

docker commit -m "creating hvec-processor image" <containerid> nvidia-hvec:latest

Ensure that a Docker repository has been created in Amazon ECR. In the console, choose Repositories, Create repository. After you open the repository, choose View Push Commands. Push the newly created image to your ECR repository.

After confirming that your image is in your ECR repo, delete all images locally in the instance:

docker rmi -f $(docker images -a -q)

Before stopping the instance, remove the ECS agent checkpoint file:

sudo rm -rf /var/lib/ecs/data/ecs_agent_data.json

Create an AMI from the instance, maintaining the attached EBS volume. Note the AMI ID.
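
If you prefer to script this step, a minimal boto3 sketch might look like the following; the instance ID and AMI name are placeholders rather than values from this walkthrough:

import boto3

ec2 = boto3.client('ec2')

# The instance ID and AMI name below are placeholders.
response = ec2.create_image(
    InstanceId='i-0123456789abcdef0',   # the P3 container instance built above
    Name='p3-nvenc-ecs-ami',            # hypothetical AMI name
    Description='ECS container instance with NVIDIA driver, nvidia-docker2, and the ECS agent',
    NoReboot=False                      # allow a reboot so the attached EBS volume is captured consistently
)

print('AMI ID:', response['ImageId'])   # note this ID for the AWS Batch compute environment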

Creating IAM role permissions

To ensure that access to ECS is controlled and to allow AWS Batch to be called, create two IAM roles:

  • BatchServiceRole allows AWS Batch to call services on your behalf.
  • ecsInstanceRole is specific to this workflow and adds permissions for S3FullAccess. This allows the container to read from and write to your S3 bucket. The following screenshot shows the example policy stack.

In AWS Batch, choose Compute environments and create a managed compute environment. Assign a cluster name and minimum and maximum vCPU values. Use the AMI ID and IAM roles created earlier. Use the Spot pricing model with a bid of 60% of the On-Demand price, and look at the current Spot price to see whether more aggressive discounts are possible.

Note the cluster name. In Amazon ECS, you should see the cluster created. Next, create a job queue and associate this job queue with the compute environment created earlier. Note the job queue name.
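
The compute environment and job queue can also be created programmatically. The following boto3 sketch is an illustration only; every name, ARN, subnet, and security group in it is a placeholder that you would replace with your own values:

import boto3

batch = boto3.client('batch')

# Placeholders: the account ID, AMI, subnet, security group, and names are illustrative only.
batch.create_compute_environment(
    computeEnvironmentName='nvenc-p3-spot',
    type='MANAGED',
    serviceRole='arn:aws:iam::123456789012:role/BatchServiceRole',
    computeResources={
        'type': 'SPOT',
        'bidPercentage': 60,                 # run at up to 60% of the On-Demand price
        'minvCpus': 0,
        'maxvCpus': 64,
        'instanceTypes': ['p3.2xlarge'],
        'imageId': 'ami-0123456789abcdef0',  # the AMI created earlier
        'subnets': ['subnet-0123456789abcdef0'],
        'securityGroupIds': ['sg-0123456789abcdef0'],
        'instanceRole': 'ecsInstanceRole',
        'spotIamFleetRole': 'arn:aws:iam::123456789012:role/AmazonEC2SpotFleetRole'
    }
)

# In practice, wait for the compute environment to reach the VALID state before creating the queue.
batch.create_job_queue(
    jobQueueName='nvenc-queue',
    priority=1,
    computeEnvironmentOrder=[
        {'order': 1, 'computeEnvironment': 'nvenc-p3-spot'}
    ]
)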

Next, create a job definition file. This provides the job parameters to be used including mounting paths, CPU, and memory requirements.

{
    "containerProperties": {
        "mountPoints": [
            {
                "sourceVolume": "codec-data",
                "readOnly": false,
                "containerPath": "/mnt"
            }],
        "image": "<accountnumber>.dkr.ecr.us-east-1.amazonaws.com/nvidia/nvidia-hvec:latest",
        "command": ["/root/nvenc-processor.sh"],
        "volumes": [
            {
                "host": {"sourcePath": "/mnt"},
                "name": "codec-data"
            }],
        "memory": 32768,
        "vcpus": 8,
        "privileged": true,
        "environment": [
            {
                "name": "S3_INPUT",
                "value": "s3://<bucket>/<key_name>"
            },
            {
                "name": "S3_OUTPUT",
                "value": "s3://<bucket>"
            }
        ],
        "ulimits": []
    },
    "type": "container",
    "jobDefinitionName": "nvenc-test"
}

Save the file as nvenc-test.json and register the job in AWS Batch.

aws batch register-job-definition --cli-input-json file://nvenc-test.json

In the AWS Batch console, create a job queue, assigning a priority of 1 to the compute environment created earlier. Then create a job, specifying a job name, the job definition, and the job queue. Add environment variables for the S3 bucket, and ensure that the bucket and input file exist:

S3_INPUT = s3://<bucket>/<key_name> 
S3_OUTPUT = s3://<bucket> 

Execute the job. In a few moments, the job should be in the Running state. Check the CloudWatch logs for an updated status of the job progression. Open the job record information and scroll down to CloudWatch metrics. The events are logged in a new AWS Batch log stream.
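
If you prefer to submit the job from code instead of the console, a boto3 sketch might look like this; the queue name, bucket, and key are placeholders:

import boto3

batch = boto3.client('batch')

response = batch.submit_job(
    jobName='nvenc-8k-test',
    jobQueue='nvenc-queue',                  # placeholder job queue name
    jobDefinition='nvenc-test',
    containerOverrides={
        'environment': [
            {'name': 'S3_INPUT', 'value': 's3://my-bucket/8k.webm'},   # placeholder bucket/key
            {'name': 'S3_OUTPUT', 'value': 's3://my-bucket'}
        ]
    }
)

print('Submitted job:', response['jobId'])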

A 1-minute 8K YUV 4:2:0 file took approximately 10 minutes single-threaded (top panel), and 58 seconds using four threads (bottom panel). The nvenc-processor.sh script serves as a basic implementation of 8K encoding. Explore the options provided by the NVIDIA Video Codec SDK for additional encoding/decoding and transcoding options.

Conclusion

With AWS Batch, a customized container instance, and a Dockerized NVIDIA video encoding platform, AWS can support your HD, 4K, and now 8K media processing and distribution. I invite you to incorporate this into your automated pipeline.

With some minor modification, it’s possible to trigger this pipeline automatically after a new file is uploaded to S3, executing it through AWS Lambda or as part of an AWS Step Functions workflow.
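
As a sketch of that idea, the following hypothetical Lambda handler submits an encoding job for each uploaded object; the queue and job definition names are placeholders, and the code is an assumption rather than part of the original pipeline:

import boto3

batch = boto3.client('batch')

def handler(event, context):
    """Submit an AWS Batch encoding job for each object in the S3 event notification."""
    for record in event.get('Records', []):
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']

        # In practice, scope the S3 trigger to the input prefix or suffix so that the
        # encoded output written back to the bucket does not re-trigger this function.
        batch.submit_job(
            jobName='nvenc-encode',          # job names may repeat; kept simple here
            jobQueue='nvenc-queue',          # placeholder job queue
            jobDefinition='nvenc-test',
            containerOverrides={
                'environment': [
                    {'name': 'S3_INPUT', 'value': 's3://{}/{}'.format(bucket, key)},
                    {'name': 'S3_OUTPUT', 'value': 's3://{}/encoded/'.format(bucket)}
                ]
            }
        )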

Spring 2018 AWS SOC Reports are Now Available with 11 Services Added in Scope

Post Syndicated from Chris Gile original https://aws.amazon.com/blogs/security/spring-2018-aws-soc-reports-are-now-available-with-11-services-added-in-scope/

Since our last System and Organization Control (SOC) audit, our service and compliance teams have been working to increase the number of AWS Services in scope, prioritized based on customer requests. Today, we’re happy to report 11 services are newly SOC compliant, which is a 21 percent increase in the last six months.

With the addition of the following 11 new services, you can now select from a total of 62 SOC-compliant services. To see the full list, go to our Services in Scope by Compliance Program page:

• Amazon Athena
• Amazon QuickSight
• Amazon WorkDocs
• AWS Batch
• AWS CodeBuild
• AWS Config
• AWS OpsWorks Stacks
• AWS Snowball
• AWS Snowball Edge
• AWS Snowmobile
• AWS X-Ray

Our latest SOC 1, 2, and 3 reports covering the period from October 1, 2017 to March 31, 2018 are now available. The SOC 1 and 2 reports are available on-demand through AWS Artifact by logging into the AWS Management Console. The SOC 3 report can be downloaded here.

Finally, prospective customers can read our SOC 1 and 2 reports by reaching out to AWS Compliance.

Want more AWS Security news? Follow us on Twitter.

EC2 Fleet – Manage Thousands of On-Demand and Spot Instances with One Request

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/ec2-fleet-manage-thousands-of-on-demand-and-spot-instances-with-one-request/

EC2 Spot Fleets are really cool. You can launch a fleet of Spot Instances that spans EC2 instance types and Availability Zones without having to write custom code to discover capacity or monitor prices. You can set the target capacity (the size of the fleet) in units that are meaningful to your application and have Spot Fleet create and then maintain the fleet on your behalf. Our customers are creating Spot Fleets of all sizes. For example, one financial service customer runs Monte Carlo simulations across 10 different EC2 instance types. They routinely make requests for hundreds of thousands of vCPUs and count on Spot Fleet to give them access to massive amounts of capacity at the best possible price.

EC2 Fleet
Today we are extending and generalizing the set-it-and-forget-it model that we pioneered in Spot Fleet with EC2 Fleet, a new building block that gives you the ability to create fleets that are composed of a combination of EC2 On-Demand, Reserved, and Spot Instances with a single API call. You tell us what you need, capacity and instance-wise, and we’ll handle all the heavy lifting. We will launch, manage, monitor and scale instances as needed, without the need for scaffolding code.

You can specify the capacity of your fleet in terms of instances, vCPUs, or application-oriented units, and also indicate how much of the capacity should be fulfilled by Spot Instances. The application-oriented units allow you to specify the relative power of each EC2 instance type in a way that directly maps to the needs of your application. All three capacity specification options (instances, vCPUs, and application-oriented units) are known as weights.

I think you’ll find a number of ways this feature makes managing a fleet of instances easier, and believe that you will also appreciate the team’s near-term feature roadmap (more on that in a bit).

Using EC2 Fleet
There are a number of ways that you can use this feature, whether you’re running a stateless web service, a big data cluster or a continuous integration pipeline. Today I’m going to describe how you can use EC2 Fleet for genomic processing, but this is similar to workloads like risk analysis, log processing or image rendering. Modern DNA sequencers can produce multiple terabytes of raw data each day; to process that data into meaningful information in a timely fashion, you need lots of processing power. I’ll be showing you how to deploy a “grid” of worker nodes that can quickly crunch through secondary analysis tasks in parallel.

Projects in genomics can use the elasticity EC2 provides to experiment and try out new pipelines on hundreds or even thousands of servers. With EC2 you can access as many cores as you need and only pay for what you use. Prior to today, you would need to use the RunInstances API or an Auto Scaling group for the On-Demand & Reserved Instance portion of your grid. To get the best price performance you’d also create and manage a Spot Fleet or multiple Spot Auto Scaling groups with different instance types if you wanted to add Spot Instances to turbo-boost your secondary analysis. Finally, to automate scaling decisions across multiple APIs and Auto Scaling groups you would need to write Lambda functions that periodically assess your grid’s progress & backlog, as well as current Spot prices – modifying your Auto Scaling Groups and Spot Fleets accordingly.

You can now replace all of this with a single EC2 Fleet, analyzing genomes at scale for as little as $1 per analysis. In my grid, each step in the pipeline requires 1 vCPU and 4 GiB of memory, a perfect match for M4 and M5 instances with 4 GiB of memory per vCPU. I will create a fleet using M4 and M5 instances with weights that correspond to the number of vCPUs on each instance:

  • m4.16xlarge – 64 vCPUs, weight = 64
  • m5.24xlarge – 96 vCPUs, weight = 96

This is expressed in a template that looks like this:

"Overrides": [
{
  "InstanceType": "m4.16xlarge",
  "WeightedCapacity": 64
},
{
  "InstanceType": "m5.24xlarge",
  "WeightedCapacity": 96
}
]

By default, EC2 Fleet will select the most cost effective combination of instance types and Availability Zones (both specified in the template) using the current prices for the Spot Instances and public prices for the On-Demand Instances (if you specify instances for which you have matching RIs, your discounts will apply). The default mode takes weights into account to get the instances that have the lowest price per unit. So for my grid, fleet will find the instance that offers the lowest price per vCPU.

Now I can request capacity in terms of vCPUs, knowing EC2 Fleet will select the lowest cost option using only the instance types I’ve defined as acceptable. Also, I can specify how many vCPUs I want to launch using On-Demand or Reserved Instance capacity and how many vCPUs should be launched using Spot Instance capacity:

"TargetCapacitySpecification": {
	"TotalTargetCapacity": 2880,
	"OnDemandTargetCapacity": 960,
	"SpotTargetCapacity": 1920,
	"DefaultTargetCapacityType": "Spot"
}

The above means that I want a total of 2880 vCPUs, with 960 vCPUs fulfilled using On-Demand and 1920 using Spot. The On-Demand price per vCPU is lower for m5.24xlarge than the On-Demand price per vCPU for m4.16xlarge, so EC2 Fleet will launch 10 m5.24xlarge instances to fulfill 960 vCPUs. Based on current Spot pricing (again, on a per-vCPU basis), EC2 Fleet will choose to launch 30 m4.16xlarge instances or 20 m5.24xlarges, delivering 1920 vCPUs either way.

Putting it all together, I have a single file (fl1.json) that describes my fleet:

{
    "LaunchTemplateConfigs": [
        {
            "LaunchTemplateSpecification": {
                "LaunchTemplateId": "lt-0e8c754449b27161c",
                "Version": "1"
            },
            "Overrides": [
                {
                    "InstanceType": "m4.16xlarge",
                    "WeightedCapacity": 64
                },
                {
                    "InstanceType": "m5.24xlarge",
                    "WeightedCapacity": 96
                }
            ]
        }
    ],
    "TargetCapacitySpecification": {
        "TotalTargetCapacity": 2880,
        "OnDemandTargetCapacity": 960,
        "SpotTargetCapacity": 1920,
        "DefaultTargetCapacityType": "Spot"
    }
}

I can launch my fleet with a single command:

$ aws ec2 create-fleet --cli-input-json file:///home/ec2-user/fl1.json
{
    "FleetId":"fleet-838cf4e5-fded-4f68-acb5-8c47ee1b248a"
}

My entire fleet is created within seconds and was built using 10 m5.24xlarge On-Demand Instances and 30 m4.16xlarge Spot Instances, since the current Spot price was 1.5¢ per vCPU for m4.16xlarge and 1.6¢ per vCPU for m5.24xlarge.

Now let’s imagine my grid has crunched through its backlog and no longer needs the additional Spot Instances. I can then modify the size of my fleet by changing the target capacity in my fleet specification, like this:

{
    "TotalTargetCapacity": 960
}
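
If you drive the fleet from the SDK rather than the CLI, the same change can be applied with a short boto3 call; this sketch reuses the fleet ID returned by create-fleet above:

import boto3

ec2 = boto3.client('ec2')

# Reduce the fleet to the On-Demand portion only; Spot capacity is released.
ec2.modify_fleet(
    FleetId='fleet-838cf4e5-fded-4f68-acb5-8c47ee1b248a',
    TargetCapacitySpecification={'TotalTargetCapacity': 960}
)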

Since 960 was equal to the amount of On-Demand vCPUs I had requested, when I describe my fleet I will see all of my capacity being delivered using On-Demand capacity:

"TargetCapacitySpecification": {
	"TotalTargetCapacity": 960,
	"OnDemandTargetCapacity": 960,
	"SpotTargetCapacity": 0,
	"DefaultTargetCapacityType": "Spot"
}

When I no longer need my fleet I can delete it and terminate the instances in it like this:

$ aws ec2 delete-fleets --fleet-id fleet-838cf4e5-fded-4f68-acb5-8c47ee1b248a \
  --terminate-instances   
{
    "UnsuccessfulFleetDeletions": [],
    "SuccessfulFleetDeletions": [
        {
            "CurrentFleetState": "deleted_terminating",
            "PreviousFleetState": "active",
            "FleetId": "fleet-838cf4e5-fded-4f68-acb5-8c47ee1b248a"
        }
    ]
}

Earlier I described how RI discounts apply when EC2 Fleet launches instances for which you have matching RIs, so you might be wondering how else RI customers benefit from EC2 Fleet. Let’s say that I own regional RIs for M4 instances. In my EC2 Fleet I would remove m5.24xlarge and specify m4.10xlarge and m4.16xlarge. Then when EC2 Fleet creates the grid, it will quickly find M4 capacity across the sizes and AZs I’ve specified, and my RI discounts apply automatically to this usage.

In the Works
We plan to connect EC2 Fleet and EC2 Auto Scaling groups. This will let you create a single fleet that mixes instance types and Spot, Reserved, and On-Demand Instances, while also taking advantage of EC2 Auto Scaling features such as health checks and lifecycle hooks. This integration will also bring EC2 Fleet functionality to services such as Amazon ECS, Amazon EKS, and AWS Batch that build on and make use of EC2 Auto Scaling for fleet management.

Available Now
You can create and make use of EC2 Fleets today in all public AWS Regions!

Jeff;

AWS Adds 16 More Services to Its PCI DSS Compliance Program

Post Syndicated from Chad Woolf original https://aws.amazon.com/blogs/security/aws-adds-16-more-services-to-its-pci-dss-compliance-program/

PCI logo

AWS has added 16 more AWS services to its Payment Card Industry Data Security Standard (PCI DSS) compliance program, giving you more options, flexibility, and functionality to process and store sensitive payment card data in the AWS Cloud. The services were audited by Coalfire to ensure that they meet strict PCI DSS standards.

The newly compliant AWS services are:

AWS now offers 58 services that are officially PCI DSS compliant, giving administrators more service options for implementing a PCI-compliant cardholder environment.

For more information about the AWS PCI DSS compliance program, see Compliance Resources, AWS Services in Scope by Compliance Program, and PCI DSS Compliance.

– Chad Woolf

AWS Updated Its ISO Certifications and Now Has 67 Services Under ISO Compliance

Post Syndicated from Chad Woolf original https://aws.amazon.com/blogs/security/aws-updated-its-iso-certifications-and-now-has-67-services-under-iso-compliance/

ISO logo

AWS has updated its certifications against ISO 9001, ISO 27001, ISO 27017, and ISO 27018 standards, bringing the total to 67 services now under ISO compliance. We added the following 29 services this cycle:

• Amazon Aurora
• Amazon Cloud Directory
• Amazon CloudWatch Logs
• Amazon Cognito
• Amazon Connect
• Amazon Elastic Container Registry
• Amazon Inspector
• Amazon Kinesis Data Streams
• Amazon Macie
• Amazon QuickSight
• Amazon S3 Transfer Acceleration
• Amazon SageMaker
• Amazon Simple Notification Service
• Auto Scaling
• AWS Batch
• AWS CodeBuild
• AWS CodeCommit
• AWS CodeDeploy
• AWS CodePipeline
• AWS IoT Core
• AWS Lambda@Edge
• AWS Managed Services
• AWS OpsWorks Stacks
• AWS Shield
• AWS Snowball Edge
• AWS Snowmobile
• AWS Step Functions
• AWS Systems Manager (formerly Amazon EC2 Systems Manager)
• AWS X-Ray

For the complete list of services under ISO compliance, see AWS Services in Scope by Compliance Program.

AWS maintains certifications through extensive audits of its controls to ensure that information security risks that affect the confidentiality, integrity, and availability of company and customer information are appropriately managed.

You can download copies of the AWS ISO certificates that contain AWS’s in-scope services and Regions, and use these certificates to jump-start your own certification efforts.

AWS does not increase service costs in any AWS Region as a result of updating its certifications.

To learn more about compliance in the AWS Cloud, see AWS Cloud Compliance.

– Chad

Amazon EC2 Update – Streamlined Access to Spot Capacity, Smooth Price Changes, Instance Hibernation

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/amazon-ec2-update-streamlined-access-to-spot-capacity-smooth-price-changes-instance-hibernation/

EC2 Spot Instances give you access to spare compute capacity in the AWS Cloud. Our customers use fleets of Spot Instances to power their CI/CD environments & traffic generators, host web servers & microservices, render movies, and to run many types of analytics jobs, all at prices that offer significant savings in comparison to On-Demand Instances.

New Streamlined Access
Today we are introducing a new, streamlined access model for Spot Instances. You simply indicate your desire to use Spot capacity when you launch an instance via the RunInstances function, the run-instances command, or the AWS Management Console to submit a request that will be fulfilled as long as the capacity is available. With no extra effort on your part, you’ll save up to 90% off of the On-Demand price for the instance type, allowing you to boost your overall application throughput by up to 10x for the same budget. The instances that you launch in this way will continue to run until you terminate them or until EC2 needs to reclaim them for On-Demand usage. At that point the instance will be given the usual 2-minute warning and then reclaimed, making this a great fit for applications that are fault-tolerant.

Unlike the old model which required an understanding of Spot markets, bidding, and calls to a standalone asynchronous API, the new model is synchronous and as easy to use as On-Demand. Your code or your script receives an Instance ID immediately and need not check back to see if the request has been processed and accepted.

We’ve made this as clean and as simple as possible, with the expectation that it will be easy to modify many current scripts and applications to request and make use of Spot capacity. If you want to exercise additional control over your Spot instance budget, you have the option to specify a maximum price when you make a request for capacity. If you use Spot capacity to power your Amazon EMR, Amazon ECS, or AWS Batch clusters, or if you launch Spot instances by way of an AWS CloudFormation template or Auto Scaling group, you will benefit from this new model without having to make any changes.

Applications that are built around RequestSpotInstances or RequestSpotFleet will continue to work just fine with no changes. However, you now have the option to make requests that do not include the SpotPrice parameter.

Smooth Price Changes
As part of today’s launch we are also changing the way that Spot prices change, moving to a model where prices adjust more gradually, based on longer-term trends in supply and demand. As I mentioned earlier, you will continue to save an average of 70-90% off the On-Demand price, and you will continue to pay the Spot price that’s in effect for the time period your instances are running. Applications built around our Spot Fleet feature will continue to automatically diversify placement of their Spot Instances across the most cost-effective pools based on the configuration you specified when you created the fleet.

Spot in Action
To launch a Spot Instance from the command line, simply specify the Spot market:

$ aws ec2 run-instances --market Spot --image-id ami-1a2b3c4d --count 1 --instance-type c3.large
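
In the SDK, the streamlined request is expressed through the InstanceMarketOptions parameter of RunInstances. Here is a minimal boto3 sketch that mirrors the command above; the AMI ID is the same placeholder:

import boto3

ec2 = boto3.client('ec2')

response = ec2.run_instances(
    ImageId='ami-1a2b3c4d',            # placeholder AMI, as in the command above
    InstanceType='c3.large',
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={
        'MarketType': 'spot'
        # Optionally cap your budget with: 'SpotOptions': {'MaxPrice': '0.05'}
    }
)

print(response['Instances'][0]['InstanceId'])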

Instance Hibernation
If you run workloads that keep a lot of state in memory, you will love this new feature!

You can arrange for instances to save their in-memory state when they are reclaimed, allowing the instances and the applications on them to pick up where they left off when capacity is once again available, just like closing and then opening your laptop. This feature works on C3, C4, and certain sizes of R3, R4, and M4 instances running Amazon Linux, Ubuntu, or Windows Server, and is supported by the EC2 Hibernation Agent.

The in-memory state is written to the root EBS volume of the instance using space that is set aside when the instance launches. The private IP address and any Elastic IP addresses are also preserved across a stop/start cycle.

Jeff;

Resume AWS Step Functions from Any State

Post Syndicated from Andy Katz original https://aws.amazon.com/blogs/compute/resume-aws-step-functions-from-any-state/


Yash Pant, Solutions Architect, AWS


Aaron Friedman, Partner Solutions Architect, AWS

When we discuss how to build applications with customers, we often align to the Well-Architected Framework pillars of security, reliability, performance efficiency, cost optimization, and operational excellence. Designing for failure is an essential component of developing well-architected applications that are resilient to spurious errors that may occur.

There are many ways you can use AWS services to achieve high availability and resiliency of your applications. For example, you can couple Elastic Load Balancing with Auto Scaling and Amazon EC2 instances to build highly available applications. Or use Amazon API Gateway and AWS Lambda to rapidly scale out a microservices-based architecture. Many AWS services have built in solutions to help with the appropriate error handling, such as Dead Letter Queues (DLQ) for Amazon SQS or retries in AWS Batch.

AWS Step Functions is an AWS service that makes it easy for you to coordinate the components of distributed applications and microservices. Step Functions allows you to easily design for failure, by incorporating features such as error retries and custom error handling from AWS Lambda exceptions. These features allow you to programmatically handle many common error modes and build robust, reliable applications.

In some rare cases, however, your application may fail in an unexpected manner. In these situations, you might not want to duplicate in a repeat execution those portions of your state machine that have already run. This is especially true when orchestrating long-running jobs or executing a complex state machine as part of a microservice. Here, you need to know the last successful state in your state machine from which to resume, so that you don’t duplicate previous work. In this post, we present a solution to enable you to resume from any given state in your state machine in the case of an unexpected failure.

Resuming from a given state

To resume a failed state machine execution from the state at which it failed, you first run a script that dynamically creates a new state machine. When the new state machine is executed, it resumes the failed execution from the point of failure. The script contains the following two primary steps:

  1. Parse the execution history of the failed execution to find the name of the state at which it failed, as well as the JSON input to that state.
  2. Create a new state machine, which adds an additional state, called "GoToState", to the failed state machine. "GoToState" is a choice state at the beginning of the state machine that branches execution directly to the failed state, allowing you to skip states that had succeeded in the previous execution.

The full script along with a CloudFormation template that creates a demo of this is available in the aws-sfn-resume-from-any-state GitHub repo.

Diving into the script

In this section, we walk you through the script and highlight the core components of its functionality. The script contains a main function, which adds a command line parameter for the failedExecutionArn so that you can easily call the script from the command line:

python gotostate.py --failedExecutionArn '<Failed_Execution_Arn>'

Identifying the failed state in your execution

First, the script extracts the name of the failed state along with the input to that state. It does so by using the failed state machine execution history, which is identified by the Amazon Resource Name (ARN) of the execution. The failed state is marked in the execution history, along with the input to that state (which is also the output of the preceding successful state). The script is able to parse these values from the log.

The script loops through the execution history of the failed state machine, and traces it backwards until it finds the failed state. If the state machine failed in a parallel state, then it must restart from the beginning of the parallel state. The script is able to capture the name of the parallel state that failed, rather than any substate within the parallel state that may have caused the failure. The following code is the Python function that does this.


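# Note: the functions below assume a module-level Step Functions client, for example:
#   import boto3
#   client = boto3.client('stepfunctions')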
def parseFailureHistory(failedExecutionArn):

    '''
    Parses the execution history of a failed state machine to get the name of failed state and the input to the failed state:
    Input failedExecutionArn = A string containing the execution ARN of a failed state machine
    Output = A list with two elements: [name of failed state, input to failed state]
    '''
    failedAtParallelState = False
    try:
        #Get the execution history
        response = client.get_execution_history(
            executionArn=failedExecutionArn,
            reverseOrder=True
        )
        failedEvents = response['events']
    except Exception as ex:
        raise ex
    #Confirm that the execution actually failed, raise exception if it didn't fail.
    try:
        failedEvents[0]['executionFailedEventDetails']
    except:
        raise Exception('Execution did not fail')
        
    '''
    If you have a 'States.Runtime' error (for example, if a task state in your state machine attempts to execute a Lambda function in a different region than the state machine), get the ID of the failed state, and use it to determine the failed state name and input.
    '''
    
    if failedEvents[0]['executionFailedEventDetails']['error'] == 'States.Runtime':
        failedId = int(''.join(filter(str.isdigit, str(failedEvents[0]['executionFailedEventDetails']['cause'].split()[13]))))
        failedState = failedEvents[-1 * failedId]['stateEnteredEventDetails']['name']
        failedInput = failedEvents[-1 * failedId]['stateEnteredEventDetails']['input']
        return (failedState, failedInput)
        
    '''
    You need to loop through the execution history, tracing back the executed steps.
    The first state you encounter is the failed state. If you failed on a parallel state, you need the name of the parallel state rather than the name of a state within a parallel state that it failed on. This is because you can only attach goToState to the parallel state, but not a substate within the parallel state.
    This loop starts with the ID of the latest event and uses the previous event IDs to trace back the execution to the beginning (id 0). However, it returns as soon it finds the name of the failed state.
    '''

    currentEventId = failedEvents[0]['id']
    while currentEventId != 0:
        #multiply event ID by -1 for indexing because you're looking at the reversed history
        currentEvent = failedEvents[-1 * currentEventId]
        
        '''
        You can determine whether the failed state was a parallel state because an event with 'type'='ParallelStateFailed' appears in the execution history before the name of the failed state
        '''

        if currentEvent['type'] == 'ParallelStateFailed':
            failedAtParallelState = True

        '''
        If the failed state is not a parallel state, then the name of failed state to return is the name of the state in the first 'TaskStateEntered' event type you run into when tracing back the execution history
        '''

        if currentEvent['type'] == 'TaskStateEntered' and failedAtParallelState == False:
            failedState = currentEvent['stateEnteredEventDetails']['name']
            failedInput = currentEvent['stateEnteredEventDetails']['input']
            return (failedState, failedInput)

        '''
        If the failed state was a parallel state, then you need to trace execution back to the first event with 'type'='ParallelStateEntered', and return the name of the state
        '''

        if currentEvent['type'] == 'ParallelStateEntered' and failedAtParallelState:
            failedState = currentEvent['stateEnteredEventDetails']['name']
            failedInput = currentEvent['stateEnteredEventDetails']['input']
            return (failedState, failedInput)
        #Update the ID for the next execution of the loop
        currentEventId = currentEvent['previousEventId']
        

Create the new state machine

The script uses the name of the failed state to create the new state machine, with "GoToState" branching execution directly to the failed state.

To do this, the script requires the Amazon States Language (ASL) definition of the failed state machine. It modifies the definition to append "GoToState", and create a new state machine from it.

The script gets the ARN of the failed state machine from the execution ARN of the failed state machine. This ARN allows it to get the ASL definition of the failed state machine by calling the DescribeStateMachine API action. It creates a new state machine with "GoToState".

When the script creates the new state machine, it also adds an additional input variable called "resuming". When you execute this new state machine, you specify this resuming variable as true in the input JSON. This tells "GoToState" to branch execution to the state that had previously failed. Here’s the function that does this:

def attachGoToState(failedStateName, stateMachineArn):

    '''
    Given a state machine ARN and the name of a state in that state machine, create a new state machine that starts at a new choice state called 'GoToState'. "GoToState" branches to the named state, and sends the input of the state machine to that state, when a variable called "resuming" is set to True.
    Input failedStateName = A string with the name of the failed state
          stateMachineArn = A string with the ARN of the state machine
    Output response from the create_state_machine call, which is the API call that creates a new state machine
    '''

    try:
        response = client.describe_state_machine(
            stateMachineArn=stateMachineArn
        )
    except:
        raise('Could not get ASL definition of state machine')
    roleArn = response['roleArn']
    stateMachine = json.loads(response['definition'])
    #Create a name for the new state machine
    newName = response['name'] + '-with-GoToState'
    #Get the StartAt state for the original state machine, because you point the 'GoToState' to this state
    originalStartAt = stateMachine['StartAt']

    '''
    Create the GoToState with the variable $.resuming.
    If new state machine is executed with $.resuming = True, then the state machine skips to the failed state.
    Otherwise, it executes the state machine from the original start state.
    '''

    goToState = {'Type':'Choice', 'Choices':[{'Variable':'$.resuming', 'BooleanEquals':False, 'Next':originalStartAt}], 'Default':failedStateName}
    #Add GoToState to the set of states in the new state machine
    stateMachine['States']['GoToState'] = goToState
    #Add StartAt
    stateMachine['StartAt'] = 'GoToState'
    #Create new state machine
    try:
        response = client.create_state_machine(
            name=newName,
            definition=json.dumps(stateMachine),
            roleArn=roleArn
        )
    except:
        raise('Failed to create new state machine with GoToState')
    return response
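
The post does not reproduce the script’s main routine, so the following is only a sketch of how the two functions might be wired together; the derivation of the state machine ARN from the execution ARN is an assumption based on the standard ARN format:

import argparse

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--failedExecutionArn', required=True,
                        help='Execution ARN of the failed state machine execution')
    args = parser.parse_args()

    # Find the failed state and its input in the execution history.
    failedState, failedInput = parseFailureHistory(args.failedExecutionArn)

    # Derive the state machine ARN from the execution ARN, assuming the standard format
    # arn:aws:states:REGION:ACCOUNT:execution:STATE_MACHINE_NAME:EXECUTION_NAME
    stateMachineArn = args.failedExecutionArn.replace(':execution:', ':stateMachine:').rsplit(':', 1)[0]

    response = attachGoToState(failedState, stateMachineArn)
    print('New state machine ARN: ' + response['stateMachineArn'])
    print('Input to the failed state: ' + failedInput)

if __name__ == '__main__':
    main()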

Testing the script

Now that you understand how the script works, you can test it out.

The following screenshot shows an example state machine that has failed, called "TestMachine". This state machine successfully completed "FirstState" and "ChoiceState", but when it branched to "FirstMatchState", it failed.

Use the script to create a new state machine that allows you to rerun this state machine, but skip the "FirstState" and the "ChoiceState" steps that already succeeded. You can do this by calling the script as follows:

python gotostate.py --failedExecutionArn 'arn:aws:states:us-west-2:<AWS_ACCOUNT_ID>:execution:TestMachine-with-GoToState:b2578403-f41d-a2c7-e70c-7500045288595'

This creates a new state machine called "TestMachine-with-GoToState", and returns its ARN, along with the input that had been sent to "FirstMatchState". You can then inspect the input to determine what caused the error. In this case, you notice that the input to "FirstMatchState" was the following:

{
"foo": 1,
"Message": true
}

However, this state machine expects the "Message" field of the JSON to be a string rather than a Boolean. Execute the new "TestMachine-with-GoToState" state machine, change the input to be a string, and add the "resuming" variable that "GoToState" requires:

{
"foo": 1,
"Message": "Hello!",
"resuming":true
}

When you execute the new state machine, it skips "FirstState" and "ChoiceState", and goes directly to "FirstMatchState", which was the state that failed:
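
If you prefer to start the resumed execution programmatically rather than from the console, a minimal boto3 sketch might look like this; the state machine ARN is a placeholder:

import json
import boto3

sfn = boto3.client('stepfunctions')

sfn.start_execution(
    stateMachineArn='arn:aws:states:us-west-2:123456789012:stateMachine:TestMachine-with-GoToState',
    input=json.dumps({
        'foo': 1,
        'Message': 'Hello!',
        'resuming': True    # tells GoToState to branch straight to the failed state
    })
)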

Look at what happens when you have a state machine with multiple parallel steps. This example is included in the GitHub repository associated with this post. The repo contains a CloudFormation template that sets up this state machine and provides instructions to replicate this solution.

The following state machine, "ParallelStateMachine", takes an input through two subsequent parallel states before doing some final processing and exiting. The JSON below shows the ASL definition of the state machine.

{
  "Comment": "An example of the Amazon States Language using a parallel state to execute two branches at the same time.",
  "StartAt": "Parallel",
  "States": {
    "Parallel": {
      "Type": "Parallel",
      "ResultPath":"$.output",
      "Next": "Parallel 2",
      "Branches": [
        {
          "StartAt": "Parallel Step 1, Process 1",
          "States": {
            "Parallel Step 1, Process 1": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaA",
              "End": true
            }
          }
        },
        {
          "StartAt": "Parallel Step 1, Process 2",
          "States": {
            "Parallel Step 1, Process 2": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaA",
              "End": true
            }
          }
        }
      ]
    },
    "Parallel 2": {
      "Type": "Parallel",
      "Next": "Final Processing",
      "Branches": [
        {
          "StartAt": "Parallel Step 2, Process 1",
          "States": {
            "Parallel Step 2, Process 1": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXXX:function:LambdaB",
              "End": true
            }
          }
        },
        {
          "StartAt": "Parallel Step 2, Process 2",
          "States": {
            "Parallel Step 2, Process 2": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaB",
              "End": true
            }
          }
        }
      ]
    },
    "Final Processing": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaC",
      "End": true
    }
  }
}

First, use an input that initially fails:

{
  "Message": "Hello!"
}

This fails because the state machine expects you to have a variable in the input JSON called "foo" in the second parallel state to run "Parallel Step 2, Process 1" and "Parallel Step 2, Process 2". Instead, the original input gets processed by the first parallel state and produces the following output to pass to the second parallel state:

{
  "output": [
    {
      "Message": "Hello!"
    },
    {
      "Message": "Hello!"
    }
  ]
}

Run the script on the failed state machine to create a new state machine that allows it to resume directly at the second parallel state instead of having to redo the first parallel state. This creates a new state machine called "ParallelStateMachine-with-GoToState". The following JSON was created by the script to define the new state machine in ASL. It contains the "GoToState" value that was attached by the script.

{
   "Comment":"An example of the Amazon States Language using a parallel state to execute two branches at the same time.",
   "States":{
      "Final Processing":{
         "Resource":"arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaC",
         "End":true,
         "Type":"Task"
      },
      "GoToState":{
         "Default":"Parallel 2",
         "Type":"Choice",
         "Choices":[
            {
               "Variable":"$.resuming",
               "BooleanEquals":false,
               "Next":"Parallel"
            }
         ]
      },
      "Parallel":{
         "Branches":[
            {
               "States":{
                  "Parallel Step 1, Process 1":{
                     "Resource":"arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaA",
                     "End":true,
                     "Type":"Task"
                  }
               },
               "StartAt":"Parallel Step 1, Process 1"
            },
            {
               "States":{
                  "Parallel Step 1, Process 2":{
                     "Resource":"arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaA",
                     "End":true,
                     "Type":"Task"
                  }
               },
               "StartAt":"Parallel Step 1, Process 2"
            }
         ],
         "ResultPath":"$.output",
         "Type":"Parallel",
         "Next":"Parallel 2"
      },
      "Parallel 2":{
         "Branches":[
            {
               "States":{
                  "Parallel Step 2, Process 1":{
                     "Resource":"arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaB",
                     "End":true,
                     "Type":"Task"
                  }
               },
               "StartAt":"Parallel Step 2, Process 1"
            },
            {
               "States":{
                  "Parallel Step 2, Process 2":{
                     "Resource":"arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaB",
                     "End":true,
                     "Type":"Task"
                  }
               },
               "StartAt":"Parallel Step 2, Process 2"
            }
         ],
         "Type":"Parallel",
         "Next":"Final Processing"
      }
   },
   "StartAt":"GoToState"
}

You can then execute this state machine with the correct input by adding the "foo" and "resuming" variables:

{
  "foo": 1,
  "output": [
    {
      "Message": "Hello!"
    },
    {
      "Message": "Hello!"
    }
  ],
  "resuming": true
}

This yields the following result. Notice that this time, the state machine executed successfully to completion, and skipped the steps that had previously failed.


Conclusion

When you’re building out complex workflows, it’s important to be prepared for failure. You can do this by taking advantage of features such as automatic error retries in Step Functions and custom error handling of Lambda exceptions.

Nevertheless, state machines still have the possibility of failing. With the methodology and script presented in this post, you can resume a failed state machine from its point of failure. This allows you to skip the execution of steps in the workflow that had already succeeded, and recover the process from the point of failure.

To see more examples, please visit the Step Functions Getting Started page.

If you have questions or suggestions, please comment below.

AWS HIPAA Eligibility Update (October 2017) – Sixteen Additional Services

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-hipaa-eligibility-post-update-october-2017-sixteen-additional-services/

Our Health Customer Stories page lists just a few of the many customers that are building and running healthcare and life sciences applications that run on AWS. Customers like Verge Health, Care Cloud, and Orion Health trust AWS with Protected Health Information (PHI) and Personally Identifying Information (PII) as part of their efforts to comply with HIPAA and HITECH.

Sixteen More Services
In my last HIPAA Eligibility Update I shared the news that we added eight additional services to our list of HIPAA eligible services. Today I am happy to let you know that we have added another sixteen services to the list, bringing the total up to 46. Here are the newest additions, along with some short descriptions and links to some of my blog posts to jog your memory:

Amazon Aurora with PostgreSQL Compatibility – This brand-new addition to Amazon Aurora allows you to encrypt your relational databases using keys that you create and manage through AWS Key Management Service (KMS). When you enable encryption for an Amazon Aurora database, the underlying storage is encrypted, as are automated backups, read replicas, and snapshots. Read New – Encryption at Rest for Amazon Aurora to learn more.

Amazon CloudWatch Logs – You can use the logs to monitor and troubleshoot your systems and applications. You can monitor your existing system, application, and custom log files in near real-time, watching for specific phrases, values, or patterns. Log data can be stored durably and at low cost, for as long as needed. To learn more, read Store and Monitor OS & Application Log Files with Amazon CloudWatch and Improvements to CloudWatch Logs and Dashboards.

Amazon Connect – This self-service, cloud-based contact center makes it easy for you to deliver better customer service at a lower cost. You can use the visual designer to set up your contact flows, manage agents, and track performance, all without specialized skills. Read Amazon Connect – Customer Contact Center in the Cloud and New – Amazon Connect and Amazon Lex Integration to learn more.

Amazon ElastiCache for Redis – This service lets you deploy, operate, and scale an in-memory data store or cache that you can use to improve the performance of your applications. Each ElastiCache for Redis cluster publishes key performance metrics to Amazon CloudWatch. To learn more, read Caching in the Cloud with Amazon ElastiCache and Amazon ElastiCache – Now With a Dash of Redis.

Amazon Kinesis Streams – This service allows you to build applications that process or analyze streaming data such as website clickstreams, financial transactions, social media feeds, and location-tracking events. To learn more, read Amazon Kinesis – Real-Time Processing of Streaming Big Data and New: Server-Side Encryption for Amazon Kinesis Streams.

Amazon RDS for MariaDB – This service lets you set up scalable, managed MariaDB instances in minutes, and offers high performance, high availability, and a simplified security model that makes it easy for you to encrypt data at rest and in transit. Read Amazon RDS Update – MariaDB is Now Available to learn more.

Amazon RDS SQL Server – This service lets you set up scalable, managed Microsoft SQL Server instances in minutes, and also offers high performance, high availability, and a simplified security model. To learn more, read Amazon RDS for SQL Server and .NET support for AWS Elastic Beanstalk and Amazon RDS for Microsoft SQL Server – Transparent Data Encryption (TDE) to learn more.

Amazon Route 53 – This is a highly available Domain Name Server. It translates names like www.example.com into IP addresses. To learn more, read Moving Ahead with Amazon Route 53.

AWS Batch – This service lets you run large-scale batch computing jobs on AWS. You don’t need to install or maintain specialized batch software or build your own server clusters. Read AWS Batch – Run Batch Computing Jobs on AWS to learn more.

AWS CloudHSM – A cloud-based Hardware Security Module (HSM) for key storage and management at cloud scale. Designed for sensitive workloads, CloudHSM lets you manage your own keys using FIPS 140-2 Level 3 validated HSMs. To learn more, read AWS CloudHSM – Secure Key Storage and Cryptographic Operations and AWS CloudHSM Update – Cost Effective Hardware Key Management at Cloud Scale for Sensitive & Regulated Workloads.

AWS Key Management Service – This service makes it easy for you to create and control the encryption keys used to encrypt your data. It uses HSMs to protect your keys, and is integrated with AWS CloudTrail in order to provide you with a log of all key usage. Read New AWS Key Management Service (KMS) to learn more.

AWS Lambda – This service lets you run event-driven application or backend code without thinking about or managing servers. To learn more, read AWS Lambda – Run Code in the Cloud, AWS Lambda – A Look Back at 2016, and AWS Lambda – In Full Production with New Features for Mobile Devs.

Lambda@Edge – You can use this new feature of AWS Lambda to run Node.js functions across the global network of AWS locations without having to provision or manage servers, in order to deliver rich, personalized content to your users with low latency. Read Lambda@Edge – Intelligent Processing of HTTP Requests at the Edge to learn more.

AWS Snowball Edge – This is a data transfer device with 100 terabytes of on-board storage as well as compute capabilities. You can use it to move large amounts of data into or out of AWS, as a temporary storage tier, or to support workloads in remote or offline locations. To learn more, read AWS Snowball Edge – More Storage, Local Endpoints, Lambda Functions.

AWS Snowmobile – This is an exabyte-scale data transfer service. Pulled by a semi-trailer truck, each Snowmobile packs 100 petabytes of storage into a ruggedized 45-foot long shipping container. Read AWS Snowmobile – Move Exabytes of Data to the Cloud in Weeks to learn more (and to see some of my finest LEGO work).

AWS Storage Gateway – This hybrid storage service lets your on-premises applications use AWS cloud storage (Amazon Simple Storage Service (S3), Amazon Glacier, and Amazon Elastic File System) in a simple and seamless way, with storage for volumes, files, and virtual tapes. To learn more, read The AWS Storage Gateway – Integrate Your Existing On-Premises Applications with AWS Cloud Storage and File Interface to AWS Storage Gateway.

And there you go! Check out my earlier post for a list of resources that will help you to build applications that comply with HIPAA and HITECH.

Jeff;