Tag Archives: AWS Batch

GPU workloads on AWS Batch

Post Syndicated from Josh Rad original https://aws.amazon.com/blogs/compute/gpu-workloads-on-aws-batch/

Contributed by Manuel Manzano Hoss, Cloud Support Engineer

I remember playing around with graphics processing unit (GPU) workload examples in 2017, when my colleague Kiuk Chung published the Deep Learning on AWS Batch post. He provided an example of how to train a convolutional neural network (CNN), the LeNet architecture, to recognize handwritten digits from the MNIST dataset using Apache MXNet as the framework. Back then, to run such jobs with GPU capabilities, I had to do the following:

  • Create a custom GPU-enabled AMI that had Docker, the ECS agent, the NVIDIA driver and container runtime, and CUDA installed.
  • Identify the P2 EC2 instance type that had the required number of GPUs for my job.
  • Check the number of vCPUs that it offered (even if I was not interested in using them).
  • Specify that number of vCPUs for my job.

All that, without any certainty that the required GPU would actually be available once my job was running. Back then, there was no GPU pinning. Other jobs running on the same EC2 instance could use that GPU, making the orchestration of my jobs a tricky task.

Fast forward two years. Today, AWS Batch announced integrated support for Amazon EC2 Accelerated Instances. It is now possible to specify a number of GPUs as a resource that AWS Batch considers in choosing the EC2 instance to run your job, along with vCPU and memory. That allows me to take advantage of the main benefits of using AWS Batch: the compute resource selection algorithm and the job scheduler. It also frees me from having to check which EC2 instance types have enough GPUs.

Also, I can take advantage of the Amazon ECS GPU-optimized AMI maintained by AWS. It comes with the NVIDIA drivers and all the necessary software to run GPU-enabled jobs. When I allow the P2 or P3 instance types on my compute environment, AWS Batch launches my compute resources using the Amazon ECS GPU-optimized AMI automatically.

In other words, I no longer have to worry about the GPU task list mentioned earlier. I can focus on deciding which framework and command to run on my GPU-accelerated workload. At the same time, I'm sure that my jobs have access to the required performance, as physical GPUs are pinned to each job and not shared among them.

A GPU race against the past

As a kind of GPU-race exercise, I ran an example similar to the one from Kiuk's post to see how quickly I could get a GPU-enabled job running today. I used the AWS Management Console to demonstrate how simple the steps are.

In this case, I decided to use the deep neural network architecture called multilayer perceptron (MLP), not the LeNet CNN, to compare the validation accuracy between them.

To make the test even simpler and faster to implement, I thought I would use one of the recently announced AWS Deep Learning containers, which come pre-packed with different frameworks and ready-to-process data. I chose the container that comes with MXNet and Python 2.7, customized for Training and GPU. For more information about the Docker images available, see the AWS Deep Learning Containers documentation.

In the AWS Batch console, I created a managed compute environment with the default settings, allowing AWS Batch to create the required IAM roles on my behalf.

On the configuration of the compute resources, I selected the P2 and P3 families of instances, as those are the instance types with GPU capabilities. You can select On-Demand Instances, but in this case I decided to use Spot Instances to take advantage of the discounts that this pricing model offers. I left the defaults for all other settings, selecting the AmazonEC2SpotFleetRole role that I created the first time that I used Spot Instances.

Finally, I also left the network settings as default. My compute environment selected the default VPC, three subnets, and a security group. They are enough to run my jobs and at the same time keep my environment safe by limiting connections from outside the VPC.
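
The console is enough for this walkthrough, but the same environment could also be scripted with one of the AWS SDKs. The following boto3 snippet is only a rough sketch of the compute environment described above; the environment name, subnet and security group IDs, and role ARNs are placeholders that you would replace with your own values.

import boto3

batch = boto3.client('batch')

# Rough sketch of the Spot compute environment created in the console.
# All names, IDs, and ARNs below are placeholders.
batch.create_compute_environment(
    computeEnvironmentName='gpu-spot-ce',
    type='MANAGED',
    state='ENABLED',
    computeResources={
        'type': 'SPOT',
        'minvCpus': 0,
        'desiredvCpus': 0,
        'maxvCpus': 256,
        'instanceTypes': ['p2', 'p3'],      # GPU instance families
        'subnets': ['subnet-11111111', 'subnet-22222222', 'subnet-33333333'],
        'securityGroupIds': ['sg-44444444'],
        'instanceRole': 'ecsInstanceRole',
        'bidPercentage': 100,
        'spotIamFleetRole': 'arn:aws:iam::<account-id>:role/AmazonEC2SpotFleetRole',
    },
    serviceRole='arn:aws:iam::<account-id>:role/service-role/AWSBatchServiceRole',
)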

I created a job queue, GPU_JobQueue, attaching it to the compute environment that I just created.

Next, I registered the same job definition that I would have created following Kiuk's post. I specified enough memory to run this test, one vCPU, and the AWS Deep Learning Docker image that I chose, in this case mxnet-training:1.4.0-gpu-py27-cu90-ubuntu16.04. The number of GPUs required was, in this case, one. To have access to run the script, the container must run as privileged, or using the root user.
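
For reference, here is a minimal boto3 sketch of registering a similar GPU-enabled job definition. The job definition name, memory value, and image registry are illustrative placeholders, not the exact values used in the console.

import boto3

batch = boto3.client('batch')

# Minimal sketch of a GPU-enabled job definition; names and values are illustrative.
batch.register_job_definition(
    jobDefinitionName='mxnet-gpu-training',
    type='container',
    containerProperties={
        # Placeholder for the AWS Deep Learning Containers image URI
        'image': '<registry>/mxnet-training:1.4.0-gpu-py27-cu90-ubuntu16.04',
        'vcpus': 1,
        'memory': 8192,
        'privileged': True,                  # run as privileged, as described above
        'resourceRequirements': [
            {'type': 'GPU', 'value': '1'},   # request one GPU for the job
        ],
    },
)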

Finally, I submitted the job. I first cloned the MXNet repository for the train_mnist.py Python script. Then I ran the script itself, with the parameter --gpus 0 to indicate that the assigned GPU should be used. The job inherits all the other parameters from the job definition:

sh -c 'git clone -b 1.3.1 https://github.com/apache/incubator-mxnet.git && python /incubator-mxnet/example/image-classification/train_mnist.py --gpus 0'

That’s all, and my GPU-enabled job was running. It took me less than two minutes to go from zero to having the job submitted. This is the log of my job, from which I removed the iterations from epoch 1 to 18 to make it shorter:

14:32:31     Cloning into 'incubator-mxnet'...

14:33:50     Note: checking out '19c501680183237d52a862e6ae1dc4ddc296305b'.

14:33:51     INFO:root:start with arguments Namespace(add_stn=False, batch_size=64, disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none', gpus='0', initializer='default', kv_store='device', load_epoch=None, loss='', lr=0.05, lr_factor=0.1, lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9, monitor=0, network='mlp', num_classes=10, num_epochs=20, num_examples=60000, num_layers=No

14:33:51     DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): yann.lecun.com:80

14:33:54     DEBUG:urllib3.connectionpool:http://yann.lecun.com:80 "GET /exdb/mnist/train-labels-idx1-ubyte.gz HTTP/1.1" 200 28881

14:33:55     DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): yann.lecun.com:80

14:33:55     DEBUG:urllib3.connectionpool:http://yann.lecun.com:80 "GET /exdb/mnist/train-images-idx3-ubyte.gz HTTP/1.1" 200 9912422

14:33:59     DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): yann.lecun.com:80

14:33:59     DEBUG:urllib3.connectionpool:http://yann.lecun.com:80 "GET /exdb/mnist/t10k-labels-idx1-ubyte.gz HTTP/1.1" 200 4542

14:33:59     DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): yann.lecun.com:80

14:34:00     DEBUG:urllib3.connectionpool:http://yann.lecun.com:80 "GET /exdb/mnist/t10k-images-idx3-ubyte.gz HTTP/1.1" 200 1648877

14:34:04     INFO:root:Epoch[0] Batch [0-100] Speed: 37038.30 samples/sec accuracy=0.793472

14:34:04     INFO:root:Epoch[0] Batch [100-200] Speed: 36457.89 samples/sec accuracy=0.906719

14:34:04     INFO:root:Epoch[0] Batch [200-300] Speed: 36981.20 samples/sec accuracy=0.927500

14:34:04     INFO:root:Epoch[0] Batch [300-400] Speed: 36925.04 samples/sec accuracy=0.935156

14:34:04     INFO:root:Epoch[0] Batch [400-500] Speed: 37262.36 samples/sec accuracy=0.940156

14:34:05     INFO:root:Epoch[0] Batch [500-600] Speed: 37729.64 samples/sec accuracy=0.942813

14:34:05     INFO:root:Epoch[0] Batch [600-700] Speed: 37493.55 samples/sec accuracy=0.949063

14:34:05     INFO:root:Epoch[0] Batch [700-800] Speed: 37320.80 samples/sec accuracy=0.953906

14:34:05     INFO:root:Epoch[0] Batch [800-900] Speed: 37705.85 samples/sec accuracy=0.958281

14:34:05     INFO:root:Epoch[0] Train-accuracy=0.924024

14:34:05     INFO:root:Epoch[0] Time cost=1.633

...  LOGS REMOVED

14:34:44     INFO:root:Epoch[19] Batch [0-100] Speed: 36864.44 samples/sec accuracy=0.999691

14:34:44     INFO:root:Epoch[19] Batch [100-200] Speed: 37088.35 samples/sec accuracy=1.000000

14:34:44     INFO:root:Epoch[19] Batch [200-300] Speed: 36706.91 samples/sec accuracy=0.999687

14:34:44     INFO:root:Epoch[19] Batch [300-400] Speed: 37941.19 samples/sec accuracy=0.999687

14:34:44     INFO:root:Epoch[19] Batch [400-500] Speed: 37180.97 samples/sec accuracy=0.999844

14:34:44     INFO:root:Epoch[19] Batch [500-600] Speed: 37122.30 samples/sec accuracy=0.999844

14:34:45     INFO:root:Epoch[19] Batch [600-700] Speed: 37199.37 samples/sec accuracy=0.999687

14:34:45     INFO:root:Epoch[19] Batch [700-800] Speed: 37284.93 samples/sec accuracy=0.999219

14:34:45     INFO:root:Epoch[19] Batch [800-900] Speed: 36996.80 samples/sec accuracy=0.999844

14:34:45     INFO:root:Epoch[19] Train-accuracy=0.999733

14:34:45     INFO:root:Epoch[19] Time cost=1.617

14:34:45     INFO:root:Epoch[19] Validation-accuracy=0.983579

Summary

As you can see, after AWS Batch launched the instance, the job took slightly more than two minutes to run. I spent roughly five minutes from start to finish. That is much faster than the time I previously spent just configuring the AMI. Using the AWS CLI, one of the AWS SDKs, or AWS CloudFormation, the same environment could be created even faster.
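
As an illustration of the SDK route, a job submission roughly equivalent to the console steps above could look like the following boto3 sketch; the job definition and queue names are the hypothetical ones used earlier in this post.

import boto3

batch = boto3.client('batch')

command = (
    'git clone -b 1.3.1 https://github.com/apache/incubator-mxnet.git && '
    'python /incubator-mxnet/example/image-classification/train_mnist.py --gpus 0'
)

# Submit the training job to the GPU queue created earlier.
response = batch.submit_job(
    jobName='mxnet-train-mnist',
    jobQueue='GPU_JobQueue',
    jobDefinition='mxnet-gpu-training',   # hypothetical job definition name from above
    containerOverrides={'command': ['sh', '-c', command]},
)
print(response['jobId'])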

From a training point of view, I lost on validation accuracy, as the results obtained using the LeNet CNN are higher than those from an MLP network. On the other hand, my job was faster, with a time cost of 1.6 seconds on average for each epoch. As the software stack evolves and increased hardware capabilities come along, these numbers keep improving, but that shouldn't mean extra complexity. Using managed primitives like the one presented in this post enables a simpler implementation.

I encourage you to test this example and see for yourself how just a few clicks or commands let you start running GPU jobs with AWS Batch. Then, it is just a matter of replacing the Docker image that I used with one for the framework of your choice: TensorFlow, Caffe, PyTorch, Keras, and so on. Start running your GPU-enabled machine learning, deep learning, computational fluid dynamics (CFD), seismic analysis, molecular modeling, genomics, or computational finance workloads. It's faster and easier than ever.

If you decide to give it a try, have any doubts, or just want to let me know what you think about this post, please write in the comments section!

Building Simpler Genomics Workflows on AWS Step Functions

Post Syndicated from Christie Gifrin original https://aws.amazon.com/blogs/compute/building-simpler-genomics-workflows-on-aws-step-functions/

This post is courtesy of Ryan Ulaszek, AWS Genomics Partner Solutions Architect and Aaron Friedman, AWS Healthcare and Life Sciences Partner Solutions Architect

In 2017, we published a four part blog series on how to build a genomics workflow on AWS. In part 1, we introduced a general architecture highlighting three common layers: job, batch and workflow.  In part 2, we described building the job layer with Docker and Amazon Elastic Container Registry (Amazon ECR).  In part 3, we tackled the batch layer and built a batch engine using AWS Batch.  In part 4, we built out the workflow layer using AWS Step Functions and AWS Lambda.

Since then, we’ve worked with many AWS customers and APN partners to implement this solution in genomics as well as in other workloads of interest. Today, we wanted to highlight a new feature in Step Functions that simplifies how customers and partners can build high-throughput genomics workflows on AWS.

Step Functions now supports native integration with AWS Batch, which simplifies how you can create an AWS Batch state that submits an asynchronous job and waits for that job to finish.

Before, you needed to build a state machine building block that submitted a job to AWS Batch, and then polled and checked its execution. Now, you can just submit the job to AWS Batch using the new AWS Batch task type.  Step Functions waits to proceed until the job is completed. This reduces the complexity of your state machine and makes it easier to build a genomics workflow with asynchronous AWS Batch steps.

The new integrations include support for the following API actions:

  • AWS Batch SubmitJob
  • Amazon SNS Publish
  • Amazon SQS SendMessage
  • Amazon ECS RunTask
  • AWS Fargate RunTask
  • Amazon DynamoDB
    • PutItem
    • GetItem
    • UpdateItem
    • DeleteItem
  • Amazon SageMaker
    • CreateTrainingJob
    • CreateTransformJob
  • AWS Glue
    • StartJobRun

You can also pass parameters to the service API.  To use the new integrations, the role that you assume when running a state machine needs to have the appropriate permissions.  For more information, see the AWS Step Functions Developer Guide.

Using a job status poller

In our 2017 post series, we created a job poller “pattern” with two separate Lambda functions. When the job finishes, the state machine proceeds to the next step and operates according to the necessary business logic.  This is a useful pattern to manage asynchronous jobs when a direct integration is unavailable.

The steps in this building block state machine are as follows:

  1. A job is submitted through a Lambda function.
  2. The state machine queries the AWS Batch API for the job status in another Lambda function.
  3. The job status is checked to see if the job has completed. If the job status equals SUCCEEDED, the final job status is logged. If the job status equals FAILED, the execution of the state machine ends. In all other cases, wait 30 seconds and go back to Step 2.

Both the Submit Job and Get Job Status Lambda functions are available as example Lambda functions in the console.  The job status poller is available in the Step Functions console as a sample project.

Here is the JSON representing this state machine.

{
  "Comment": "A simple example that submits a job to AWS Batch",
  "StartAt": "SubmitJob",
  "States": {
    "SubmitJob": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:<account-id>::function:batchSubmitJob",
      "Next": "GetJobStatus"
    },
    "GetJobStatus": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:<account-id>:function:batchGetJobStatus",
      "Next": "CheckJobStatus",
      "InputPath": "$",
      "ResultPath": "$.status"
    },
    "CheckJobStatus": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.status",
          "StringEquals": "FAILED",
          "Next": "JobFailed"
        },
        {
          "Variable": "$.status",
          "StringEquals": "SUCCEEDED",
          "Next": "GetFinalJobStatus"
        }
      ],
      "Default": "Wait30Seconds"
    },
    "Wait30Seconds": {
      "Type": "Wait",
      "Seconds": 30,
      "Next": "GetJobStatus"
    },
    "GetFinalJobStatus": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:<account-id>:function:batchGetJobStatus",
      "End": true
    }
  }
}

With Step Functions Service Integrations

With Step Functions service integrations, it is now simpler to submit and wait for an AWS Batch job, or any other supported service.

The following code block is the JSON representing the new state machine for an asynchronous batch job. If you are familiar with the AWS Batch SubmitJob API action, you may notice that the parameters are consistent with what you would see in that API call. You can also use the optional AWS Batch parameters in addition to JobDefinition, JobName, and JobQueue.

{
  "StartAt": "RunIsaacJob",
  "States": {
    "RunIsaacJob": {
      "Type": "Task",
      "Resource": "arn:aws:states:::batch:submitJob.sync",
      "Parameters": {
        "JobDefinition": "Isaac",
        "JobName.$": "$.isaac.JobName",
        "JobQueue": "HighPriority",
        "Parameters.$": "$.isaac"
      },
      "TimeoutSeconds": 900,
      "HeartbeatSeconds": 60,
      "InputPath": "$",
      "ResultPath": "$.status",
      "Retry": [
        {
          "ErrorEquals": [ "States.Timeout" ],
          "IntervalSeconds": 3,
          "MaxAttempts": 2,
          "BackoffRate": 1.5
        }
      ],
      "End": true
    }
  }
}

Here is an example of the workflow input JSON.  Pass all of the container parameters that were being constructed in the submit job Lambda function.

{
  "isaac": {
    "WorkingDir": "/scratch",
    "JobName": "isaac-1",
    "FastQ1S3Path": "s3://aws-batch-genomics-resources/fastq/SRR1919605_1.fastq.gz",
    "BAMS3FolderPath": "s3://aws-batch-genomics-resources/fastq/SRR1919605_2.fastq.gz",
    "FastQ2S3Path": "s3://bccn-genome-data/fastq/NIST7035_R2_trimmed.fastq.gz",
    "ReferenceS3Path": "s3://aws-batch-genomics-resources/reference/isaac/"
  }
}
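
To kick off the workflow with that input, you can start an execution of the state machine. The following boto3 sketch assumes the input JSON above has been saved locally as workflow-input.json and uses a placeholder state machine ARN.

import boto3

sfn = boto3.client('stepfunctions')

# Read the workflow input shown above from a local file.
with open('workflow-input.json') as f:
    workflow_input = f.read()

response = sfn.start_execution(
    stateMachineArn='arn:aws:states:us-east-1:<account-id>:stateMachine:GenomicsWorkflow',  # placeholder
    name='isaac-run-1',
    input=workflow_input,
)
print(response['executionArn'])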

When you deploy the job definition, add the command attribute that was previously being constructed in the Lambda function launching the AWS Batch job.

IsaacJobDefinition:
    Type: AWS::Batch::JobDefinition
    Properties:
      JobDefinitionName: "Isaac"
      Type: container
      RetryStrategy:
        Attempts: 1
      Parameters:
        BAMS3FolderPath: !Sub "s3://${JobResultsBucket}/NA12878_states_1/bam"
        FastQ1S3Path: "s3://aws-batch-genomics-resources/fastq/SRR1919605_1.fastq.gz"
        FastQ2S3Path: "s3://aws-batch-genomics-resources/fastq/SRR1919605_2.fastq.gz"
        ReferenceS3Path: "s3://aws-batch-genomics-resources/reference/isaac/"
        WorkingDir: "/scratch"
      ContainerProperties:
        Image: "rulaszek/isaac"
        Vcpus: 32
        Memory: 80000
        JobRoleArn:
          Fn::ImportValue: !Sub "${RoleStackName}:ECSTaskRole"
        Command:
          - "--bam_s3_folder_path"
          - "Ref::BAMS3FolderPath"
          - "--fastq1_s3_path"
          - "Ref::FastQ1S3Path"
          - "--fastq2_s3_path"
          - "Ref::FastQ2S3Path"
          - "--reference_s3_path"
          - "Ref::ReferenceS3Path"
          - "--working_dir"
          - "Ref::WorkingDir"
        MountPoints:
          - ContainerPath: "/scratch"
            ReadOnly: false
            SourceVolume: docker_scratch
        Volumes:
          - Name: docker_scratch
            Host:
              SourcePath: "/docker_scratch"

The key-value parameters passed into the workflow are mapped using Parameters.$ to the corresponding keys in the job definition, and the Ref:: placeholders in the container command are substituted with those values. The resulting Docker run looks like the following:

docker run <isaac_container_uri> --bam_s3_folder_path s3://batch-genomics-pipeline-jobresultsbucket-1kzdu216m2b0k/NA12878_states_3/bam
                                 --fastq1_s3_path s3://aws-batch-genomics-resources/fastq/SRR1919605_1.fastq.gz
                                 --fastq2_s3_path s3://aws-batch-genomics-resources/fastq/SRR1919605_2.fastq.gz 
                                 --reference_s3_path s3://aws-batch-genomics-resources/reference/isaac/ 
                                 --working_dir /scratch

Genomics workflow: Before and after

Overall, connectors dramatically simplify your genomics workflow.  The following workflow is a simple genomics secondary analysis pipeline, which we highlighted in our original post series.

The first step aligns the sample against a reference genome.  When alignment is complete, variant calling and QA metrics are calculated in two parallel steps.  When variant calling is complete, variant annotation is performed.  Before, our genomics workflow looked like this:

Now it looks like this:

Here is the new workflow JSON:

{
   "Comment":"A simple genomics secondary-analysis workflow",
   "StartAt":"RunIsaacJob",
   "States":{
      "RunIsaacJob":{
         "Type":"Task",
         "Resource":"arn:aws:states:::batch:submitJob.sync",
         "Parameters":{
            "JobDefinition":"Isaac",
            "JobName.$":"$.isaac.JobName",
            "JobQueue":"HighPriority",
            "Parameters.$": "$.isaac"
         },
         "TimeoutSeconds": 900,
         "HeartbeatSeconds": 60,
         "Next":"Parallel",
         "InputPath":"$",
         "ResultPath":"$.status",
         "Retry" : [
            {
              "ErrorEquals": [ "States.Timeout" ],
              "IntervalSeconds": 3,
              "MaxAttempts": 2,
              "BackoffRate": 1.5
            }
         ]
      },
      "Parallel":{
         "Type":"Parallel",
         "Next":"FinalState",
         "Branches":[
            {
               "StartAt":"RunStrelkaJob",
               "States":{
                  "RunStrelkaJob":{
                     "Type":"Task",
                     "Resource":"arn:aws:states:::batch:submitJob.sync",
                     "Parameters":{
                        "JobDefinition":"Strelka",
                        "JobName.$":"$.strelka.JobName",
                        "JobQueue":"HighPriority",
                        "Parameters.$": "$.strelka"
                     },
                     "TimeoutSeconds": 900,
                     "HeartbeatSeconds": 60,
                     "Next":"RunSnpEffJob",
                     "InputPath":"$",
                     "ResultPath":"$.status",
                     "Retry" : [
                        {
                          "ErrorEquals": [ "States.Timeout" ],
                          "IntervalSeconds": 3,
                          "MaxAttempts": 2,
                          "BackoffRate": 1.5
                        }
                     ]
                  },
                  "RunSnpEffJob":{
                     "Type":"Task",
                     "Resource":"arn:aws:states:::batch:submitJob.sync",
                     "Parameters":{
                        "JobDefinition":"SNPEff",
                        "JobName.$":"$.snpeff.JobName",
                        "JobQueue":"HighPriority",
                        "Parameters.$": "$.snpeff"
                     },
                     "TimeoutSeconds": 900,
                     "HeartbeatSeconds": 60,
                     "Retry" : [
                        {
                          "ErrorEquals": [ "States.Timeout" ],
                          "IntervalSeconds": 3,
                          "MaxAttempts": 2,
                          "BackoffRate": 1.5
                        }
                     ],
                     "End":true
                  }
               }
            },
            {
               "StartAt":"RunSamtoolsStatsJob",
               "States":{
                  "RunSamtoolsStatsJob":{
                     "Type":"Task",
                     "Resource":"arn:aws:states:::batch:submitJob.sync",
                     "Parameters":{
                        "JobDefinition":"SamtoolsStats",
                        "JobName.$":"$.samtools.JobName",
                        "JobQueue":"HighPriority",
                        "Parameters.$": "$.samtools"
                     },
                     "TimeoutSeconds": 900,
                     "HeartbeatSeconds": 60,
                     "End":true,
                     "Retry" : [
                        {
                          "ErrorEquals": [ "States.Timeout" ],
                          "IntervalSeconds": 3,
                          "MaxAttempts": 2,
                          "BackoffRate": 1.5
                        }
                     ]
                  }
               }
            }
         ]
      },
      "FinalState":{
         "Type":"Pass",
         "End":true
      }
   }
}

Here is the new AWS CloudFormation template for deploying the AWS Batch job definitions for each tool:

AWSTemplateFormatVersion: 2010-09-09

Description: Batch job definitions for batch genomics

Parameters:
  RoleStackName:
    Description: "Stack that deploys roles for genomic workflow"
    Type: String
  VPCStackName:
    Description: "Stack that deploys vps for genomic workflow"
    Type: String
  JobResultsBucket:
    Description: "Bucket that holds workflow job results"
    Type: String

Resources:
  IsaacJobDefinition:
    Type: AWS::Batch::JobDefinition
    Properties:
      JobDefinitionName: "Isaac"
      Type: container
      RetryStrategy:
        Attempts: 1
      Parameters:
        BAMS3FolderPath: !Sub "s3://${JobResultsBucket}/NA12878_states_1/bam"
        FastQ1S3Path: "s3://aws-batch-genomics-resources/fastq/SRR1919605_1.fastq.gz"
        FastQ2S3Path: "s3://aws-batch-genomics-resources/fastq/SRR1919605_2.fastq.gz"
        ReferenceS3Path: "s3://aws-batch-genomics-resources/reference/isaac/"
        WorkingDir: "/scratch"
      ContainerProperties:
        Image: "rulaszek/isaac"
        Vcpus: 32
        Memory: 80000
        JobRoleArn:
          Fn::ImportValue: !Sub "${RoleStackName}:ECSTaskRole"
        Command:
          - "--bam_s3_folder_path"
          - "Ref::BAMS3FolderPath"
          - "--fastq1_s3_path"
          - "Ref::FastQ1S3Path"
          - "--fastq2_s3_path"
          - "Ref::FastQ2S3Path"
          - "--reference_s3_path"
          - "Ref::ReferenceS3Path"
          - "--working_dir"
          - "Ref::WorkingDir"
        MountPoints:
          - ContainerPath: "/scratch"
            ReadOnly: false
            SourceVolume: docker_scratch
        Volumes:
          - Name: docker_scratch
            Host:
              SourcePath: "/docker_scratch"

  StrelkaJobDefinition:
    Type: AWS::Batch::JobDefinition
    Properties:
      JobDefinitionName: "Strelka"
      Type: container
      RetryStrategy:
        Attempts: 1
      Parameters:
        BAMS3Path: !Sub "s3://${JobResultsBucket}/NA12878_states_1/bam/sorted.bam"
        BAIS3Path: !Sub "s3://${JobResultsBucket}/NA12878_states_1/bam/sorted.bam.bai"
        ReferenceS3Path: "s3://aws-batch-genomics-resources/reference/hg38.fa"
        ReferenceIndexS3Path: "s3://aws-batch-genomics-resources/reference/hg38.fa.fai"
        VCFS3Path: !Sub "s3://${JobResultsBucket}/NA12878_states_1/vcf"
        WorkingDir: "/scratch"
      ContainerProperties:
        Image: "rulaszek/strelka"
        Vcpus: 32
        Memory: 32000
        JobRoleArn:
          Fn::ImportValue: !Sub "${RoleStackName}:ECSTaskRole"
        Command:
          - "--bam_s3_path"
          - "Ref::BAMS3Path"
          - "--bai_s3_path"
          - "Ref::BAIS3Path"
          - "--reference_s3_path"
          - "Ref::ReferenceS3Path"
          - "--reference_index_s3_path"
          - "Ref::ReferenceIndexS3Path"
          - "--vcf_s3_path"
          - "Ref::VCFS3Path"
          - "--working_dir"
          - "Ref::WorkingDir"
        MountPoints:
          - ContainerPath: "/scratch"
            ReadOnly: false
            SourceVolume: docker_scratch
        Volumes:
          - Name: docker_scratch
            Host:
              SourcePath: "/docker_scratch"

  SnpEffJobDefinition:
    Type: AWS::Batch::JobDefinition
    Properties:
      JobDefinitionName: "SNPEff"
      Type: container
      RetryStrategy:
        Attempts: 1
      Parameters:
        VCFS3Path: !Sub "s3://${JobResultsBucket}/NA12878_states_1/vcf/variants/genome.vcf.gz"
        AnnotatedVCFS3Path: !Sub "s3://${JobResultsBucket}/NA12878_states_1/vcf/genome.anno.vcf"
        CommandArgs: " -t hg38 "
        WorkingDir: "/scratch"
      ContainerProperties:
        Image: "rulaszek/snpeff"
        Vcpus: 4
        Memory: 10000
        JobRoleArn:
          Fn::ImportValue: !Sub "${RoleStackName}:ECSTaskRole"
        Command:
          - "--annotated_vcf_s3_path"
          - "Ref::AnnotatedVCFS3Path"
          - "--vcf_s3_path"
          - "Ref::VCFS3Path"
          - "--cmd_args"
          - "Ref::CommandArgs"
          - "--working_dir"
          - "Ref::WorkingDir"
        MountPoints:
          - ContainerPath: "/scratch"
            ReadOnly: false
            SourceVolume: docker_scratch
        Volumes:
          - Name: docker_scratch
            Host:
              SourcePath: "/docker_scratch"

  SamtoolsStatsJobDefinition:
    Type: AWS::Batch::JobDefinition
    Properties:
      JobDefinitionName: "SamtoolsStats"
      Type: container
      RetryStrategy:
        Attempts: 1
      Parameters:
        ReferenceS3Path: "s3://aws-batch-genomics-resources/reference/hg38.fa"
        BAMS3Path: !Sub "s3://${JobResultsBucket}/NA12878_states_1/bam/sorted.bam"
        BAMStatsS3Path: !Sub "s3://${JobResultsBucket}/NA12878_states_1/bam/sorted.bam.stats"
        WorkingDir: "/scratch"
      ContainerProperties:
        Image: "rulaszek/samtools-stats"
        Vcpus: 4
        Memory: 10000
        JobRoleArn:
          Fn::ImportValue: !Sub "${RoleStackName}:ECSTaskRole"
        Command:
          - "--bam_s3_path"
          - "Ref::BAMS3Path"
          - "--bam_stats_s3_path"
          - "Ref::BAMStatsS3Path"
          - "--reference_s3_path"
          - "Ref::ReferenceS3Path"
          - "--working_dir"
          - "Ref::WorkingDir"
        MountPoints:
          - ContainerPath: "/scratch"
            ReadOnly: false
            SourceVolume: docker_scratch
        Volumes:
          - Name: docker_scratch
            Host:
              SourcePath: "/docker_scratch"

Here is the new CloudFormation script that deploys the new workflow:

AWSTemplateFormatVersion: 2010-09-09

Description: State Machine for batch genomics

Parameters:
  RoleStackName:
    Description: "Stack that deploys roles for genomic workflow"
    Type: String
  VPCStackName:
    Description: "Stack that deploys vps for genomic workflow"
    Type: String

Resources:
  # S3
  GenomicWorkflow:
    Type: AWS::StepFunctions::StateMachine
    Properties:
      RoleArn:
        Fn::ImportValue: !Sub "${RoleStackName}:StatesExecutionRole"
      DefinitionString: !Sub |-
        {
           "Comment":"A simple example that submits a job to AWS Batch",
           "StartAt":"RunIsaacJob",
           "States":{
              "RunIsaacJob":{
                 "Type":"Task",
                 "Resource":"arn:aws:states:::batch:submitJob.sync",
                 "Parameters":{
                    "JobDefinition":"Isaac",
                    "JobName.$":"$.isaac.JobName",
                    "JobQueue":"HighPriority",
                    "Parameters.$": "$.isaac"
                 },
                 "TimeoutSeconds": 900,
                 "HeartbeatSeconds": 60,
                 "Next":"Parallel",
                 "InputPath":"$",
                 "ResultPath":"$.status",
                 "Retry" : [
                    {
                      "ErrorEquals": [ "States.Timeout" ],
                      "IntervalSeconds": 3,
                      "MaxAttempts": 2,
                      "BackoffRate": 1.5
                    }
                 ]
              },
              "Parallel":{
                 "Type":"Parallel",
                 "Next":"FinalState",
                 "Branches":[
                    {
                       "StartAt":"RunStrelkaJob",
                       "States":{
                          "RunStrelkaJob":{
                             "Type":"Task",
                             "Resource":"arn:aws:states:::batch:submitJob.sync",
                             "Parameters":{
                                "JobDefinition":"Strelka",
                                "JobName.$":"$.strelka.JobName",
                                "JobQueue":"HighPriority",
                                "Parameters.$": "$.strelka"
                             },
                             "TimeoutSeconds": 900,
                             "HeartbeatSeconds": 60,
                             "Next":"RunSnpEffJob",
                             "InputPath":"$",
                             "ResultPath":"$.status",
                             "Retry" : [
                                {
                                  "ErrorEquals": [ "States.Timeout" ],
                                  "IntervalSeconds": 3,
                                  "MaxAttempts": 2,
                                  "BackoffRate": 1.5
                                }
                             ]
                          },
                          "RunSnpEffJob":{
                             "Type":"Task",
                             "Resource":"arn:aws:states:::batch:submitJob.sync",
                             "Parameters":{
                                "JobDefinition":"SNPEff",
                                "JobName.$":"$.snpeff.JobName",
                                "JobQueue":"HighPriority",
                                "Parameters.$": "$.snpeff"
                             },
                             "TimeoutSeconds": 900,
                             "HeartbeatSeconds": 60,
                             "Retry" : [
                                {
                                  "ErrorEquals": [ "States.Timeout" ],
                                  "IntervalSeconds": 3,
                                  "MaxAttempts": 2,
                                  "BackoffRate": 1.5
                                }
                             ],
                             "End":true
                          }
                       }
                    },
                    {
                       "StartAt":"RunSamtoolsStatsJob",
                       "States":{
                          "RunSamtoolsStatsJob":{
                             "Type":"Task",
                             "Resource":"arn:aws:states:::batch:submitJob.sync",
                             "Parameters":{
                                "JobDefinition":"SamtoolsStats",
                                "JobName.$":"$.samtools.JobName",
                                "JobQueue":"HighPriority",
                                "Parameters.$": "$.samtools"
                             },
                             "TimeoutSeconds": 900,
                             "HeartbeatSeconds": 60,
                             "End":true,
                             "Retry" : [
                                {
                                  "ErrorEquals": [ "States.Timeout" ],
                                  "IntervalSeconds": 3,
                                  "MaxAttempts": 2,
                                  "BackoffRate": 1.5
                                }
                             ]
                          }
                       }
                    }
                 ]
              },
              "FinalState":{
                 "Type":"Pass",
                 "End":true
              }
           }
        }

Outputs:
  GenomicsWorkflowArn:
    Description: GenomicWorkflow ARN
    Value: !Ref GenomicWorkflow
  StackName:
    Description: StackName
    Value: !Sub ${AWS::StackName}

Conclusion

AWS Step Functions service integrations are a great way to simplify creating complex workflows with asynchronous steps. While we highlighted the use case with AWS Batch today, there are many other ways that healthcare and life sciences customers can use this new feature, such as with message processing.

For more information about how AWS can enable your genomics workloads, be sure to check out the AWS Genomics page.

We’ve updated the open-source project to take advantage of the new AWS Batch integration in Step Functions.  You can find the changes in the aws-batch-genomics/tree/v2.0.0 folder.

Original posts in this four-part series:

Happy coding!

Building a tightly coupled molecular dynamics workflow with multi-node parallel jobs in AWS Batch

Post Syndicated from Josh Rad original https://aws.amazon.com/blogs/compute/building-a-tightly-coupled-molecular-dynamics-workflow-with-multi-node-parallel-jobs-in-aws-batch/

Contributed by Amr Ragab, HPC Application Consultant, AWS Professional Services and Aswin Damodar, Senior Software Development Engineer, AWS Batch

At Supercomputing 2018 in Dallas, TX, AWS announced AWS Batch support for running tightly coupled workloads in a multi-node parallel jobs environment. This AWS Batch feature enables applications that require strong scaling for efficient computational workloads.

Some of the more popular workloads that can take advantage of this feature enhancement include computational fluid dynamics (CFD) codes such as OpenFOAM, Fluent, and ANSYS. Other workloads include molecular dynamics (MD) applications such as AMBER, GROMACS, and NAMD.

Running tightly coupled, distributed, deep learning frameworks is also now possible on AWS Batch. Applications that can take advantage include TensorFlow, MXNet, PyTorch, and Chainer. Essentially, any application that benefits from tightly coupled scaling can now be integrated into AWS Batch.

In this post, we show you how to build a workflow executing an MD simulation using GROMACS running on GPUs, using the p3 instance family.

AWS Batch overview

AWS Batch is a service providing managed planning, scheduling, and execution of containerized workloads on AWS. Purpose-built for scalable compute workloads, AWS Batch is ideal for high throughput, distributed computing jobs such as video and image encoding, loosely coupled numerical calculations, and multistep computational workflows.

If you are new to AWS Batch, consider gaining familiarity with the service by following the tutorial in the Creating a Simple “Fetch & Run” AWS Batch Job post.

Prerequisites

You need an AWS account to go through this walkthrough. Other prerequisites include:

  • Launch an ECS instance on a p3.2xlarge, which has an NVIDIA Tesla V100 GPU. Use the Amazon Linux 2 AMI for ECS.
  • In the ECS instance, install the latest CUDA 10 stack, which provides the toolchain and compilation libraries, as well as the NVIDIA driver.
  • Install nvidia-docker2.
  • In your /etc/docker/daemon.json file, ensure that the default-runtime value is set to nvidia.
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}
  • Finally, save the EC2 instance as an AMI in your account. Copy the AMI ID, as you need it later in the post.
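
If you prefer to script that last step, a minimal boto3 sketch for creating the AMI could look like the following; the instance ID and AMI name are placeholders.

import boto3

ec2 = boto3.client('ec2')

# Create an AMI from the GPU instance configured above (placeholder instance ID).
image = ec2.create_image(
    InstanceId='i-0123456789abcdef0',
    Name='ecs-gpu-gromacs-ami',
    Description='ECS GPU AMI with CUDA 10 and nvidia-docker2',
)
print(image['ImageId'])   # keep this AMI ID for the compute environment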

Deploying the workload

In a production environment, it’s important to efficiently execute the compute workload with multi-node parallel jobs. Most of the optimization is on the application layer and how efficiently the Message Passing Interface (MPI) ranks (MPI and OpenMP threads) are distributed across nodes. Application-level optimization is out of scope for this post, but should be considered when running in production.

One of the key requirements for running on AWS Batch is a Dockerized image with the application, libraries, scripts, and code. For multi-node parallel jobs, you need an MPI stack for the tightly coupled communication layer and a wrapper script for the MPI orchestration. The running child Docker containers need to pass container IP address information to the master node to fill out the MPI host file.

The undifferentiated heavy lifting that AWS Batch provides is the Docker-to-Docker communication across nodes using Amazon ECS task networking. With multi-node parallel jobs, the ECS container receives environment variables from the backend, which can be used to establish which running container is the master and which is the child.

  • AWS_BATCH_JOB_MAIN_NODE_INDEX—The index of the main node in a multi-node parallel job. This is the node on which the MPI job is launched.
  • AWS_BATCH_JOB_MAIN_NODE_PRIVATE_IPV4_ADDRESS—The IPv4 address of the main node. This is presented in the environment for all child nodes.
  • AWS_BATCH_JOB_NODE_INDEX—The index of the current node.
  • AWS_BATCH_JOB_NUM_NODES—The number of nodes launched as part of the node group for your multi-node parallel job.

If AWS_BATCH_JOB_MAIN_NODE_INDEX = AWS_BATCH_JOB_NODE_INDEX, then this is the main node. The following code block is an example MPI synchronization script that you can include as part of the CMD structure of the Docker container. Save the following code as mpi-run.sh.

#!/bin/bash

cd $JOB_DIR

PATH="$PATH:/opt/openmpi/bin/"
BASENAME="${0##*/}"
log () {
  echo "${BASENAME} - ${1}"
}
HOST_FILE_PATH="/tmp/hostfile"
AWS_BATCH_EXIT_CODE_FILE="/tmp/batch-exit-code"

aws s3 cp $S3_INPUT $SCRATCH_DIR
tar -xvf $SCRATCH_DIR/*.tar.gz -C $SCRATCH_DIR

sleep 2

usage () {
  if [ "${#@}" -ne 0 ]; then
    log "* ${*}"
    log
  fi
  cat <<ENDUSAGE
Usage:
export AWS_BATCH_JOB_NODE_INDEX=0
export AWS_BATCH_JOB_NUM_NODES=10
export AWS_BATCH_JOB_MAIN_NODE_INDEX=0
export AWS_BATCH_JOB_ID=string
./mpi-run.sh
ENDUSAGE

  error_exit
}

# Standard function to print an error and exit with a failing return code
error_exit () {
  log "${BASENAME} - ${1}" >&2
  log "${2:-1}" > $AWS_BATCH_EXIT_CODE_FILE
  kill  $(cat /tmp/supervisord.pid)
}

# Set child by default switch to main if on main node container
NODE_TYPE="child"
if [ "${AWS_BATCH_JOB_MAIN_NODE_INDEX}" == 
"${AWS_BATCH_JOB_NODE_INDEX}" ]; then
  log "Running synchronize as the main node"
  NODE_TYPE="main"
fi


# wait for all nodes to report
wait_for_nodes () {
  log "Running as master node"

  touch $HOST_FILE_PATH
  ip=$(/sbin/ip -o -4 addr list eth0 | awk '{print $4}' | cut -d/ -f1)
  
  if [ -x "$(command -v nvidia-smi)" ] ; then
      NUM_GPUS=$(ls -l /dev/nvidia[0-9] | wc -l)
      availablecores=$NUM_GPUS
  else
      availablecores=$(nproc)
  fi

  log "master details -> $ip:$availablecores"
  echo "$ip slots=$availablecores" >> $HOST_FILE_PATH

  lines=$(uniq $HOST_FILE_PATH|wc -l)
  while [ "$AWS_BATCH_JOB_NUM_NODES" -gt "$lines" ]
  do
    log "$lines out of $AWS_BATCH_JOB_NUM_NODES nodes joined, check again in 1 second"
    sleep 1
    lines=$(uniq $HOST_FILE_PATH|wc -l)
  done
  # Make the temporary file executable and run it with any given arguments
  log "All nodes successfully joined"
  
  # remove duplicates if there are any.
  awk '!a[$0]++' $HOST_FILE_PATH > ${HOST_FILE_PATH}-deduped
  cat $HOST_FILE_PATH-deduped
  log "executing main MPIRUN workflow"

  cd $SCRATCH_DIR
  . /opt/gromacs/bin/GMXRC
  /opt/openmpi/bin/mpirun --mca btl_tcp_if_include eth0 \
                          -x PATH -x LD_LIBRARY_PATH -x GROMACS_DIR -x GMXBIN -x GMXMAN -x GMXDATA \
                          --allow-run-as-root --machinefile ${HOST_FILE_PATH}-deduped \
                          $GMX_COMMAND
  sleep 2

  tar -czvf $JOB_DIR/batch_output_$AWS_BATCH_JOB_ID.tar.gz $SCRATCH_DIR/*
  aws s3 cp $JOB_DIR/batch_output_$AWS_BATCH_JOB_ID.tar.gz $S3_OUTPUT

  log "done! goodbye, writing exit code to $AWS_BATCH_EXIT_CODE_FILE and shutting down my supervisord"
  echo "0" > $AWS_BATCH_EXIT_CODE_FILE
  kill  $(cat /tmp/supervisord.pid)
  exit 0
}


# Fetch and run a script
report_to_master () {
  # get own ip and num cpus
  #
  ip=$(/sbin/ip -o -4 addr list eth0 | awk '{print $4}' | cut -d/ -f1)

  if [ -x "$(command -v nvidia-smi)" ] ; then
      NUM_GPUS=$(ls -l /dev/nvidia[0-9] | wc -l)
      availablecores=$NUM_GPUS
  else
      availablecores=$(nproc)
  fi

  log "I am a child node -> $ip:$availablecores, reporting to the master node -> 
${AWS_BATCH_JOB_MAIN_NODE_PRIVATE_IPV4_ADDRESS}"
  until echo "$ip slots=$availablecores" | ssh 
${AWS_BATCH_JOB_MAIN_NODE_PRIVATE_IPV4_ADDRESS} "cat >> 
/$HOST_FILE_PATH"
  do
    echo "Sleeping 5 seconds and trying again"
  done
  log "done! goodbye"
  exit 0
  }


# Main - dispatch user request to appropriate function
log $NODE_TYPE
case $NODE_TYPE in
  main)
    wait_for_nodes "${@}"
    ;;

  child)
    report_to_master "${@}"
    ;;

  *)
    log $NODE_TYPE
    usage "Could not determine node type. Expected (main/child)"
    ;;
esac

The synchronization script supports downloading the assets from Amazon S3 as well as preparing the MPI host file based on GPU scheduling for GROMACS.

Furthermore, the mpirun stanza is captured in this script. This script can be a template for several multi-node parallel job applications by just changing a few lines.  These lines are essentially the GROMACS-specific steps:

. /opt/gromacs/bin/GMXRC
export OMP_NUM_THREADS=$OMP_THREADS
/opt/openmpi/bin/mpirun -np $MPI_THREADS --mca btl_tcp_if_include eth0 \
-x OMP_NUM_THREADS -x PATH -x LD_LIBRARY_PATH -x GROMACS_DIR -x GMXBIN -x GMXMAN -x GMXDATA \
--allow-run-as-root --machinefile ${HOST_FILE_PATH}-deduped \
$GMX_COMMAND

In your development environment for building Docker images, create a Dockerfile that prepares the software stack for running GROMACS. The key elements of the Dockerfile are:

  1. Set up passwordless SSH key generation.
  2. Download and compile OpenMPI. In this Dockerfile, you are downloading the recently released OpenMPI 4.0.0 source and compiling it on an NVIDIA Tesla V100 GPU-backed instance (p3.2xlarge).
  3. Download and compile GROMACS.
  4. Set up supervisor to run SSH at Docker container startup as well as processing the mpi-run.sh script as the CMD.

Save the following script as a Dockerfile:

FROM nvidia/cuda:latest

ENV USER root

# -------------------------------------------------------------------------------------
# install needed software -
# openssh
# mpi
# awscli
# supervisor
# -------------------------------------------------------------------------------------

RUN apt update
RUN DEBIAN_FRONTEND=noninteractive apt install -y iproute2 cmake openssh-server openssh-client python python-pip build-essential gfortran wget curl
RUN pip install supervisor awscli

RUN mkdir -p /var/run/sshd
ENV DEBIAN_FRONTEND noninteractive

ENV NOTVISIBLE "in users profile"

#####################################################
## SSH SETUP

RUN sed -i 's/PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
RUN sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd
RUN echo "export VISIBLE=now" >> /etc/profile

RUN echo "${USER} ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers
ENV SSHDIR /root/.ssh
RUN mkdir -p ${SSHDIR}
RUN touch ${SSHDIR}/sshd_config
RUN ssh-keygen -t rsa -f ${SSHDIR}/ssh_host_rsa_key -N ''
RUN cp ${SSHDIR}/ssh_host_rsa_key.pub ${SSHDIR}/authorized_keys
RUN cp ${SSHDIR}/ssh_host_rsa_key ${SSHDIR}/id_rsa
RUN echo " IdentityFile ${SSHDIR}/id_rsa" >> /etc/ssh/ssh_config
RUN echo "Host *" >> /etc/ssh/ssh_config && echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config
RUN chmod -R 600 ${SSHDIR}/* && \
chown -R ${USER}:${USER} ${SSHDIR}/
# check if ssh agent is running or not, if not, run
RUN eval `ssh-agent -s` && ssh-add ${SSHDIR}/id_rsa

##################################################
## S3 OPTIMIZATION

RUN aws configure set default.s3.max_concurrent_requests 30
RUN aws configure set default.s3.max_queue_size 10000
RUN aws configure set default.s3.multipart_threshold 64MB
RUN aws configure set default.s3.multipart_chunksize 16MB
RUN aws configure set default.s3.max_bandwidth 4096MB/s
RUN aws configure set default.s3.addressing_style path

##################################################
## CUDA MPI

RUN wget -O /tmp/openmpi.tar.gz https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.0.tar.gz && \
tar -xvf /tmp/openmpi.tar.gz -C /tmp
RUN cd /tmp/openmpi* && ./configure --prefix=/opt/openmpi --with-cuda --enable-mpirun-prefix-by-default && \
make -j $(nproc) && make install
RUN echo "export PATH=$PATH:/opt/openmpi/bin" >> /etc/profile
RUN echo "export LD_LIBRARY_PATH=$LD_LIRBARY_PATH:/opt/openmpi/lib:/usr/local/cuda/include:/usr/local/cuda/lib64" >> /etc/profile

###################################################
## GROMACS 2018 INSTALL

ENV PATH $PATH:/opt/openmpi/bin
ENV LD_LIBRARY_PATH $LD_LIBRARY_PATH:/opt/openmpi/lib:/usr/local/cuda/include:/usr/local/cuda/lib64
RUN wget -O /tmp/gromacs.tar.gz http://ftp.gromacs.org/pub/gromacs/gromacs-2018.4.tar.gz && \
tar -xvf /tmp/gromacs.tar.gz -C /tmp
RUN cd /tmp/gromacs* && mkdir build
RUN cd /tmp/gromacs*/build && \
cmake .. -DGMX_MPI=on -DGMX_THREAD_MPI=ON -DGMX_GPU=ON -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda -DGMX_BUILD_OWN_FFTW=ON -DCMAKE_INSTALL_PREFIX=/opt/gromacs && \
make -j $(nproc) && make install
RUN echo "source /opt/gromacs/bin/GMXRC" >> /etc/profile

###################################################
## supervisor container startup

ADD conf/supervisord/supervisord.conf /etc/supervisor/supervisord.conf
ADD supervised-scripts/mpi-run.sh supervised-scripts/mpi-run.sh
RUN chmod 755 supervised-scripts/mpi-run.sh

EXPOSE 22
RUN export PATH="$PATH:/opt/openmpi/bin"
ADD batch-runtime-scripts/entry-point.sh batch-runtime-scripts/entry-point.sh
RUN chmod 0755 batch-runtime-scripts/entry-point.sh

CMD /batch-runtime-scripts/entry-point.sh

After the container is built, push the image to your Amazon ECR repository and note the container image URI for later steps.

Set up GROMACS

For the input files, use the Chalcone Synthase (1CGZ) example, from RCSB.org. For this post, just run a simple simulation following the Lysozyme in Water GROMACS tutorial.

Execute the production MD run before the analysis (that is, after the system is solvated, neutralized, and equilibrated), so you can show that the longest part of the simulation can be achieved in a containerized workflow.

It is also possible to run the entire tutorial workflow in AWS Batch, from PDB preparation through solvation, energy minimization, and analysis.

Set up the compute environment

For the purpose of running the MD simulation in this test case, use two p3.2xlarge instances. Each instance provides one NVIDIA Tesla V100 GPU, across which GROMACS distributes the job. You don’t have to launch specific instance types. With the p3 family, the MPI wrapper can modify the MPI ranks to accommodate the current GPU and node topology.

When the job is executed, two MPI processes are instantiated, with two OpenMP threads per MPI process. For this post, launch On-Demand EC2 Instances; because we are using the Amazon Linux AMIs, we can take advantage of per-second billing.

Under Create Compute Environment, choose a managed compute environment and provide a name, such as gromacs-gpu-ce. Attach two roles:

  • AWSBatchServiceRole—Allows AWS Batch to make EC2 calls on your behalf.
  • ecsInstanceRole—Allows the underlying instance to make AWS API calls.

In the next panel, specify the following field values:

  • Provisioning model: EC2
  • Allowed instance types: p3 family
  • Minimum vCPUs: 0
  • Desired vCPUs: 0
  • Maximum vCPUs: 128

Select Enable user-specified AMI ID and enter the AMI that you created earlier.

Finally, for the compute environment, specify the network VPC and subnets for launching the instances, as well as a security group. We recommend specifying a placement group for tightly coupled workloads for better performance. You can also create EC2 tags for the launch instances. We used name=gromacs-gpu-processor.

Next, choose Job Queues and create a gromacs-queue queue coupled with the compute environment created earlier. Set the priority to 1 and select Enable job queue.
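
For completeness, the same queue could be created with a short boto3 call; the queue and compute environment names below match the ones used in this walkthrough, and the compute environment can also be given as a full ARN.

import boto3

batch = boto3.client('batch')

# Create the job queue and attach it to the GPU compute environment created above.
batch.create_job_queue(
    jobQueueName='gromacs-queue',
    state='ENABLED',
    priority=1,
    computeEnvironmentOrder=[
        {'order': 1, 'computeEnvironment': 'gromacs-gpu-ce'},  # name or ARN of the compute environment
    ],
)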

Set up the job definition

In the job definition setup, you create a two-node group, where each node pulls the gromacs_mpi image. Because you are using the p3.2xlarge instance providing one V100 GPU per instance, your vCPU slots = 8 for scheduling purposes.

{
    "jobDefinitionName": "gromacs-jobdef",
    "jobDefinitionArn": "arn:aws:batch:us-east-2:<accountid>:job-definition/gromacs-jobdef:1",
    "revision": 6,
    "status": "ACTIVE",
    "type": "multinode",
    "parameters": {},
    "nodeProperties": {
        "numNodes": 2,
        "mainNode": 0,
        "nodeRangeProperties": [
            {
                "targetNodes": "0:1",
                "container": {
                    "image": "<accountid>.dkr.ecr.us-east-2.amazonaws.com/gromacs_mpi:latest",
                    "vcpus": 8,
                    "memory": 24000,
                    "command": [],
                    "jobRoleArn": "arn:aws:iam::<accountid>:role/ecsTaskExecutionRole",
                    "volumes": [
                        {
                            "host": {
                                "sourcePath": "/scratch"
                            },
                            "name": "scratch"
                        },
                        {
                            "host": {
                                "sourcePath": "/efs"
                            },
                            "name": "efs"
                        }
                    ],
                    "environment": [
                        {
                            "name": "SCRATCH_DIR",
                            "value": "/scratch"
                        },
                        {
                            "name": "JOB_DIR",
                            "value": "/efs"
                        },
                        {
                            "name": "GMX_COMMAND",
                            "value": "gmx_mpi mdrun -deffnm md_0_1 -nb gpu -ntomp 1"
                        },
                        {
                            "name": "OMP_THREADS",
                            "value": "2"
                        },
                        {
                            "name": "MPI_THREADS",
                            "value": "1"
                        },
                        {
                            "name": "S3_INPUT",
                            "value": "s3://ragab-md/1CGZ.tar.gz"
                        },
                        {
                            "name": "S3_OUTPUT",
                            "value": "s3://ragab-md"
                        }
                    ],
                    "mountPoints": [
                        {
                            "containerPath": "/scratch",
                            "sourceVolume": "scratch"
                        },
                        {
                            "containerPath": "/efs",
                            "sourceVolume": "efs"
                        }
                    ],
                    "ulimits": [],
                    "instanceType": "p3.2xlarge"
                }
            }
        ]
    }
}

Submit the GROMACS job

In the AWS Batch job submission portal, provide a job name and select the job definition created earlier as well as the job queue. Ensure that the vCPU value is set to 8 and the Memory (MiB) value is 24000.

Under Environment Variables, within each node group, ensure that the keys are set correctly as follows.

  • SCRATCH_DIR: /scratch
  • JOB_DIR: /efs
  • OMP_THREADS: 2
  • GMX_COMMAND: gmx_mpi mdrun -deffnm md_0_1 -nb gpu
  • MPI_THREADS: 2
  • S3_INPUT: s3://<your input>
  • S3_OUTPUT: s3://<your output>

Submit the job and wait for it to enter the RUNNING state. After the job is in the RUNNING state, select the job ID and choose Nodes.
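
The same submission can also be made programmatically. This boto3 sketch overrides the per-node environment variables listed above for both nodes; the job name and the S3 paths are placeholders.

import boto3

batch = boto3.client('batch')

# Submit the two-node GROMACS job, overriding environment variables on both nodes.
response = batch.submit_job(
    jobName='gromacs-md-1cgz',
    jobQueue='gromacs-queue',
    jobDefinition='gromacs-jobdef',
    nodeOverrides={
        'nodePropertyOverrides': [
            {
                'targetNodes': '0:1',        # apply to the main node and the child node
                'containerOverrides': {
                    'environment': [
                        {'name': 'S3_INPUT', 'value': 's3://<your input>'},
                        {'name': 'S3_OUTPUT', 'value': 's3://<your output>'},
                    ],
                },
            },
        ],
    },
)
print(response['jobId'])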

The containers listed each write to a separate Amazon CloudWatch log stream where you can monitor the progress.

After the job is completed, the entire working directory is compressed and uploaded to S3. The trajectories (*.xtc) and input .gro files can be viewed in your favorite MD analysis package. For more information about preparing a desktop, see Deploying a 4x4K, GPU-backed Linux desktop instance on AWS.

You can view the trajectories in PyMOL, as well as run any subsequent trajectory analysis.

Extending the solution

As we mentioned earlier, you can take this core workload and extend it as part of a job execution chain in a workflow. Native support for job dependencies exists in AWS Batch and alternatively in AWS Step Functions. With Step Functions, you can create a decision-based workflow tree to run the preparation, solvation, energy minimization, equilibration, production MD, and analysis.

Conclusion

In this post, we showed that tightly coupled, scalable MD simulations can be executed using the recently released multi-node parallel jobs feature for AWS Batch. You finally have an end-to-end solution for distributed and MPI-based workloads.

As we mentioned earlier, many other applications can also take advantage of this feature set. We invite you to try this out and let us know how it goes.

Want to discuss how your tightly coupled workloads can benefit on AWS Batch? Contact AWS.

Using Cromwell with AWS Batch

Post Syndicated from Josh Rad original https://aws.amazon.com/blogs/compute/using-cromwell-with-aws-batch/

Contributed by W. Lee Pang and Emil Lerch, WWPS Professional Services

DNA is often referred to as the “source code of life.” All living cells contain long chains of deoxyribonucleic acid that encode instructions on how they are constructed and behave in their surroundings. Genomics is the study of the structure and function of DNA at the molecular level. It has recently shown immense potential to provide improved detection, diagnosis, and treatment of human diseases.

Continuous improvements in genome sequencing technologies have accelerated genomics research by providing unprecedented speed, accuracy, and quantity of DNA sequence data. In fact, the rate of sequencing efficiency has been shown to outpace Moore’s law. Processing this influx of genomic data is ideally aligned with the power and scalability of cloud computing.

Genomic data processing typically uses a wide assortment of specialized bioinformatics tools, like sequence alignment algorithms, variant callers, and statistical analysis methods. These tools are run in sequence as workflow pipelines that can range from a couple of steps to many long toolchains executing in parallel.

Traditionally, bioinformaticians and genomics scientists relied on Bash, Perl, or Python scripts to orchestrate their pipelines. As pipelines have gotten more complex, and maintainability and reproducibility have become standard requirements in science, the need for specialized orchestration tooling and portable workflow definitions has grown significantly.

What is Cromwell?

The Broad Institute’s Cromwell is purpose-built for this need. It is a workflow execution engine for orchestrating command line and containerized tools. Most importantly, it is the engine that drives the GATK Best Practices genome analysis pipeline.

Workflows for Cromwell are defined using the Workflow Definition Language (WDL – pronounced “widdle”), a flexible meta-scripting language that allows researchers to focus on the pieces of their workflow that matter: the tools for each step and their respective inputs and outputs, not the plumbing in between.

Genomics data is not small (on the order of TBs-PBs for one experiment!), so processing it usually requires significant computing scale, like HPC clusters and cloud computing. Cromwell has previously enabled this with support for many backends such as Spark, and HPC frameworks like Sun GridEngine and SLURM.

AWS and Cromwell

We are excited to announce that Cromwell now supports AWS! In this post, we go over how to configure Cromwell on AWS and get started running genomics pipelines in the cloud.

In a nutshell, the AWS backend for Cromwell is a layer that communicates with AWS Batch. Why AWS Batch? As stated before, genomics analysis pipelines are composed of many different tools. Each of these tools can have specific computing requirements. Operations like genome alignment can be memory-intensive, whereas joint genotyping may be compute-heavy.

AWS Batch dynamically provisions the optimal quantity and type of compute resources (for example, CPU or memory-optimized instances). Provisioning is based on the volume and specific resource requirements of the batch jobs submitted. This means that each step of a genomics workflow gets the most ideal instance to run on.

The AWS backend translates Cromwell task definitions into batch job definitions and submits them via API calls to a user-specified batch queue. Runtime parameters, such as the container image to use and resources like desired vCPUs and memory, are also translated from the WDL task and transmitted to the batch job. A number of environment variables are automatically set on the job to support data localization and de-localization to the job instance. Ultimately, this way of submitting jobs to AWS Batch should feel familiar to scientists and genomics researchers, because it uses their existing WDL files and research processes.

Getting started

To get started using Cromwell with AWS, create a custom AMI. This is necessary to ensure that the AMI is private to the account, encrypted, and has tooling specific to genomics workloads and Cromwell.

One feature of this tooling is the automatic creation and attachment of additional Amazon Elastic Block Store (Amazon EBS) capacity as additional data is copied onto the EC2 instance for processing. It also contains an ECS agent that has been customized to the needs of Cromwell, and a Cromwell Docker image responsible for interfacing the Cromwell task with Amazon S3.

After the custom AMI is created, install Cromwell on your workstation or EC2 instance. Configure an S3 bucket to hold Cromwell execution directories. For the purposes of this post, we refer to the bucket as s3-bucket-name. Lastly, go to the AWS Batch console, and create a job queue. Save the ARN of the queue, as this is needed later.
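
If you script these two steps instead of using the console, a minimal boto3 sketch looks like the following. The bucket and queue names are placeholders, and buckets created outside us-east-1 also need a CreateBucketConfiguration with a LocationConstraint.

import boto3

s3 = boto3.client("s3")
batch = boto3.client("batch")

# Placeholder bucket name for the Cromwell execution directories.
s3.create_bucket(Bucket="s3-bucket-name")

# Placeholder queue name; the returned ARN goes into the Cromwell configuration below.
queues = batch.describe_job_queues(jobQueues=["<your-job-queue>"])
print(queues["jobQueues"][0]["jobQueueArn"])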

To set up these resources with a single click, use this link, which provides a set of AWS CloudFormation templates that gets all the needed infrastructure running in minutes.

The next step is to configure Cromwell to work with AWS Batch and your newly created S3 bucket. Use the sample hello.wdl and hello.inputs files from the Cromwell AWS backend tutorial. You also need a custom configuration file so that Cromwell can interact with AWS Batch.

The following sample file can be used on an EC2 instance with the appropriate IAM role attached, or on a developer workstation with the AWS CLI configured. Keep in mind that you must replace <s3-bucket-name> in the configuration file with the appropriate bucket name. Also, replace “your ARN here” with the ARN of the job queue that you created earlier.

// aws.conf

include required(classpath("application"))

aws {

    application-name = "cromwell"
    
    auths = [
        {
         name = "default"
         scheme = "default"
        }
    ]
    
    region = "default"
    // uses region from ~/.aws/config set by aws configure command,
    // or us-east-1 by default
}

engine {
     filesystems {
         s3 {
            auth = "default"
         }
    }
}

backend {
     default = "AWSBATCH"
     providers {
         AWSBATCH {
             actor-factory = "cromwell.backend.impl.aws.AwsBatchBackendLifecycleActorFactory"
             config {
                 // Base bucket for workflow executions
                 root = "s3://<s3-bucket-name>/cromwell-execution"
                
                 // A reference to an auth defined in the `aws` stanza at the top. This auth is used to create
                 // Jobs and manipulate auth JSONs.
                 auth = "default"

                 numSubmitAttempts = 3
                 numCreateDefinitionAttempts = 3

                 concurrent-job-limit = 16
                
                 default-runtime-attributes {
                    queueArn: "<your ARN here>"
                 }
                
                 filesystems {
                     s3 {
                         // A reference to a potentially different auth for manipulating files via engine functions.
                         auth = "default"
                     }
                 }
             }
         }
     }
}

Now, you can run your workflow. The following command runs Hello World, and ensures that everything is connected properly:

$ java -Dconfig.file=aws.conf -jar cromwell-34.jar run hello.wdl -i hello.inputs

After the workflow has run, your workflow logs should report the workflow outputs.

[info] SingleWorkflowRunnerActor workflow finished with status 'Succeeded'.
{
 "outputs": {
    "wf_hello.hello.message": "Hello World! Welcome to Cromwell . . . on AWS!"
 },
 "id": "08213b40-bcf5-470d-b8b7-1d1a9dccb10e"
}

You also see your job in the “succeeded” section of the AWS Batch Jobs console.

After the environment is configured properly, other Cromwell WDL files can be used as usual.

Conclusion
With AWS Batch, a customized AMI, and Cromwell workflow definitions, AWS provides a simple solution for processing genomics data. We invite you to incorporate this into your automated pipeline.

Deploy an 8K HEVC pipeline using Amazon EC2 P3 instances with AWS Batch

Post Syndicated from Geoff Murase original https://aws.amazon.com/blogs/compute/deploy-an-8k-hevc-pipeline-using-amazon-ec2-p3-instances-with-aws-batch/

Contributed by Amr Ragab, HPC Application Consultant, AWS Professional Services

AWS provides several managed services for file- and streaming-based media encoding options.

Currently, these services offer up to 4K encoding. Recent developments and the growing popularity of 8K content have now increased the need to distribute higher resolution content.

In this solution, you create a file-based encoding pipeline that uses AWS Batch with an Amazon EC2 P3 instance, starting by uploading a sample 8K (7680×4320) file to Amazon S3.

AWS Batch

AWS Batch enables developers, scientists, and engineers to easily and efficiently run hundreds of thousands of batch computing jobs on AWS. AWS Batch dynamically provisions the optimal quantity and type of compute resources (e.g., CPU or memory optimized instances) based on the volume and specific resource requirements of the batch jobs submitted. With AWS Batch, there is no need to install and manage batch computing software or server clusters that you use to run your jobs, allowing you to focus on analyzing results and solving problems. AWS Batch plans, schedules, and executes your batch computing workloads across the full range of AWS compute services and features, such as Amazon EC2 and Spot Instances.

P3 instances for video transcoding workloads

The P3 instance comes equipped with the NVIDIA Tesla V100 GPU. The V100 is a 16 GB, 5,120 CUDA core GPU based on the latest Volta architecture, well suited for video encoding workloads. The largest instance size in that family, p3.16xlarge, has 64 vCPUs, 488 GB of RAM, 8 NVIDIA Tesla V100 GPUs, and 25 Gbps networking bandwidth.

Besides being a mainstay in computational workloads, the V100 offers enhanced hardware-based encoding/decoding (NVENC/NVDEC). The following tables summarize the NVENC/NVDEC options available compared to other GPUs offered at AWS.

NVENC Support Matrix

AWS GPU instance | GPU family | GPU | H.264 (AVCHD) YUV 4:2:0 | H.264 (AVCHD) YUV 4:4:4 | H.264 (AVCHD) Lossless | H.265 (HEVC) 4K YUV 4:2:0 | H.265 (HEVC) 4K YUV 4:4:4 | H.265 (HEVC) 4K Lossless | H.265 (HEVC) 8K
G2 | Kepler | GRID K520 | YES | | | | | |
P2 | Kepler (2nd Gen) | Tesla K80 | YES | | | | | |
G3 | Maxwell (2nd Gen) | Tesla M60 | YES | YES | YES | YES | | |
P3 | Volta | Tesla V100 | YES | YES | YES | YES | YES | YES | YES

NVDEC Support Matrix

AWS GPU instance | GPU family | GPU | MPEG-2 | VC-1 | H.264 (AVCHD) | H.265 (HEVC) | VP8 | VP9
G2 | Kepler | GRID K520 | YES | YES | YES | | |
P2 | Kepler (2nd Gen) | Tesla K80 | YES | YES | YES | | |
G3 | Maxwell (2nd Gen) | Tesla M60 | YES | YES | YES | YES | |
P3 | Volta | Tesla V100 | YES | YES | YES | YES | YES | YES

Cinematic 8K encoding is supported using the Tesla V100 (P3 instance family), in either landscape or portrait orientation, using the HEVC codec.

GPU | H264 | H264_444 | H264_ME | H264_WxH | HEVC | HEVC_Main10 | HEVC_Lossless | HEVC_SAO | HEVC_444 | HEVC_ME | HEVC_WxH
Tesla M60 | + | + | + | 4096x4096 | + | | | | | | 4096x4096
Tesla V100 | + | + | + | 4096x4096 | + | + | + | + | + | + | 8192x8192

Prerequisites

To follow along with these procedures, ensure that you have the following:

  • An AWS account with permissions to create IAM roles and policies, as well as read and write access to S3
  • Registration with the NVIDIA Developer Network
  • Familiarity with Docker

Deployment

For deployment, you containerize the encoding pipeline. After building the underlying P3 container instance, you then use nvidia-docker2 to build the video-encoding Docker image, which is registered with Amazon Elastic Container Registry (Amazon ECR).

As shown in the following diagram, the pipeline reads an input raw YUV file from S3, then pulls the containerized encoding application to execute at scale on the P3 container instance. The encoded video file is then transferred to S3.

The nvidia-docker2 image video encoding stack contains the following components:

  • NVIDIA CUDA 9.2
  • FFMPEG 4.0
  • NVIDIA Video Codec SDK 8.1

This is a relatively lengthy procedure. However, after it’s built, the underlying instance and Docker image are reusable and can be quickly deployed as part of a high performance computing (HPC) pipeline.

Creating the ECS container instance

The underlying instance can be built by selecting the Amazon Linux AMI with the p3.2xlarge instance type in a public subnet. Additionally, add an EBS volume (150 GB), which is used for the 8K input, raw YUV, and output files. Scale the storage amount for larger input files, and persist the mount in /etc/fstab. Connect to the instance over SSH, install any OS updates, and then install the EPEL release and support packages as well as the base docker-ce.

sudo yum update -y
sudo yum install -y yum-utils \
                 device-mapper-persistent-data \
                 lvm2
sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
sudo yum install epel-release-latest-7.noarch.rpm
sudo yum update
sudo yum install docker-ce -y

The NVIDIA/CUDA stack can be installed using the cuda-repo-rhel7.rpm file. The CUDA framework installs the NVIDIA driver dependencies.

sudo yum install cuda -y

Next, install nvidia-docker2 as provided in the NVIDIA GitHub repo.

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | \
  sudo tee /etc/yum.repos.d/nvidia-docker.repo
sudo yum install -y nvidia-docker2
sudo pkill -SIGHUP dockerd

sudo tee /etc/docker/daemon.json <<EOF
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}
EOF

sudo systemctl restart docker

With the base components in place, make this instance compatible with the ECS service:

sudo yum install ecs-init -y

Create the /etc/ecs/ecs.config file with the following template:

cat << EOF > /etc/ecs/ecs.config
ECS_DATADIR=/data
ECS_ENABLE_TASK_IAM_ROLE=true
ECS_ENABLE_TASK_IAM_ROLE_NETWORK_HOST=true
ECS_LOGFILE=/log/ecs-agent.log
ECS_AVAILABLE_LOGGING_DRIVERS=["json-file","awslogs"]
ECS_LOGLEVEL=info
ECS_CLUSTER=default
EOF

Iptables and packet forwarding rules need to be created to pass IAM roles into task operations:

sudo sh -c "echo 'net.ipv4.conf.all.route_localnet = 1' >> /etc/sysctl.conf"
sudo sysctl -p /etc/sysctl.conf
sudo iptables -t nat -A PREROUTING -p tcp -d 169.254.170.2 --dport 80 -j DNAT --to-destination 127.0.0.1:51679
sudo iptables -t nat -A OUTPUT -d 169.254.170.2 -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 51679
sudo sh -c 'iptables-save > /etc/sysconfig/iptables'

Finally, a systemd unit file needs to be created:

sudo tee /etc/systemd/system/docker-container@ecs-agent.service << EOF
[Unit]
Description=Docker Container %I
Requires=docker.service
After=docker.service

[Service]
Restart=always
ExecStartPre=-/usr/bin/docker rm -f %i
ExecStart=/usr/bin/docker run --name %i \
--privileged \
--restart=on-failure:10 \
--volume=/var/run:/var/run \
--volume=/var/log/ecs/:/log:Z \
--volume=/var/lib/ecs/data:/data:Z \
--volume=/etc/ecs:/etc/ecs \
--net=host \
--env-file=/etc/ecs/ecs.config \
amazon/amazon-ecs-agent:latest
ExecStop=/usr/bin/docker stop %i

[Install]
WantedBy=default.target
EOF

sudo systemctl enable docker-container@ecs-agent.service
sudo systemctl start docker-container@ecs-agent.service
sudo systemctl status docker-container@ecs-agent.service

Ensure that the docker-container@ecs-agent service starts successfully.

Creating the NVIDIA-Docker image

With Docker installed, pull the latest nvidia/cuda:latest image from DockerHub.

docker pull nvidia/cuda:latest

It is best at this point to run the Docker container in interactive mode. However, a Docker build file can be created afterwards. At the time of publication, only CUDA 9.0 is installed in the base image, but NVIDIA has already provided the necessary repositories. Install CUDA 9.2 and support packages inside the Docker container; commands to run inside the container are marked with the (docker) label:

docker run -it --runtime=nvidia --rm nvidia/cuda
(docker) apt update
(docker) apt install pkg-config build-essential wget curl nasm unzip \
                     git libglew-dev cuda-toolkit-9-2 python3-pip -y
(docker) pip3 install awscli

Next, download FFmpeg 4.0, the nv-codec-headers, and the Video Codec SDK 8.1 from the NVIDIA Developer platform.

First, extract the nv-codec-headers archive, change into its directory, and build and install the headers:

(docker) make
(docker) make install

Extract the ffmpeg-4.0 archive, change into its directory, and then configure, compile, and install FFmpeg:

(docker) ./configure --enable-cuda --enable-cuvid --enable-nvenc --enable-nonfree --enable-libnpp --extra-cflags=-I/usr/local/cuda/include --extra-ldflags=-L/usr/local/cuda/lib64
(docker) make -j 4
(docker) make install

Download and extract the NVIDIA Video Codec SDK 8.1. The “Samples” directory has a preconfigured Makefile that compiles the binaries in the SDK. After it’s successful, confirm that the binaries are correctly set up.

(docker): ~/Video_Codec_SDK_8.1.24/Samples/AppEncode/AppEncCuda$ ./AppEncCuda -h
Options:
-i Input file path
-o Output file path
-s Input resolution in this form: WxH
-if Input format: iyuv nv12 yuv444 p010 yuv444p16 bgra bgra10 ayuv abgr abgr10
-gpu Ordinal of GPU to use
-codec Codec: h264 hevc
-preset Preset: default hp hq bd ll ll_hp ll_hq lossless lossless_hp
-profile H264: baseline main high high444; HEVC: main main10 frext
-444 (Only for RGB input) YUV444 encode
-rc Rate control mode: constqp vbr cbr cbr_ll_hq cbr_hq vbr_hq
-fps Frame rate
-gop Length of GOP (Group of Pictures)
-bf Number of consecutive B-frames
-bitrate Average bit rate, can be in unit of 1, K, M
-maxbitrate Max bit rate, can be in unit of 1, K, M
-vbvbufsize VBV buffer size in bits, can be in unit of 1, K, M
-vbvinit VBV initial delay in bits, can be in unit of 1, K, M
-aq Enable spatial AQ and set its stength (range 1-15, 0-auto)
-temporalaq (No value) Enable temporal AQ
-lookahead Maximum depth of lookahead (range 0-32)
-cq Target constant quality level for VBR mode (range 1-51, 0-auto)
-qmin Min QP value
-qmax Max QP value
-initqp Initial QP value
-constqp QP value for constqp rate control mode
Note: QP value can be in the form of qp_of_P_B_I or qp_P,qp_B,qp_I (no space)

Encoder Capability
# GPU H264 H264_444 H264_ME H264_WxH HEVC HEVC_Main10 HEVC_Lossless HEVC_SAO HEVC_444 HEVC_ME HEVC_WxH
0 Tesla V100-SXM2-16GB + + + 4096x4096 + + + + + + 8192x8192

Create a small script to be used for the 8K-encoding test inside the Docker container. Save the file as /root/nvenc-processor.sh. In the basic form, this script encodes using a single thread. For comparison, the same file is encoded using four threads.

(docker)
#!/bin/bash -xe
time aws s3 cp $S3_INPUT /mnt/8k.webm

time /usr/local/bin/ffmpeg -y -hwaccel cuda -i /mnt/8k.webm -c:v rawvideo -pix_fmt yuv420p /mnt/8k.yuv
time /root/Video_Codec_SDK/Samples/AppEncode/AppEncCuda/AppEncCuda -i /mnt/8k.yuv -o /mnt/8k.hevc -s 7680x4320 -codec hevc
time /root/Video_Codec_SDK/Samples/AppEncode/AppEncPerf/AppEncPerf -i /mnt/8k.yuv -s 7680x4320 -thread 4 -codec hevc

time aws s3 cp /mnt/8k.hevc $S3_OUTPUT

This script downloads a file from S3 and processes it through FFmpeg. The AppEncCuda and AppEncPerf samples then create the 8K-encoded file, which is uploaded back to S3. Commit your Docker container into a new Docker image:

docker commit -m "creating hvec-processor image" <containerid> nvidia-hvec:latest

Ensure that a Docker repository has been created in Amazon ECR. Choose Repositories, Create Repository. After you open the repository, choose View Push Commands. Push the newly created image to your ECR repo.

After confirming that your image is in your ECR repo, delete all images locally in the instance:

docker rmi -f $(docker images -a -q)

Before stopping the instance, remove the ECS agent checkpoint file:

sudo rm -rf /var/lib/ecs/data/ecs_agent_data.json

Create an AMI from the instance, maintaining the attached EBS volume. Note the AMI ID.
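
The same step can be scripted; the following sketch uses a placeholder instance ID and AMI name. By default, create_image includes the attached EBS volumes in the block device mapping of the new AMI.

import boto3

ec2 = boto3.client("ec2")

# Placeholder instance ID and AMI name.
image = ec2.create_image(
    InstanceId="i-0123456789abcdef0",
    Name="p3-nvenc-ecs-ami",
    Description="P3 ECS container instance with nvidia-docker2 and CUDA",
)
print(image["ImageId"])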

Creating IAM role permissions

To ensure that access to ECS is controlled and to allow AWS Batch to be called, create two IAM roles:

  • BatchServiceRole allows AWS Batch to call services on your behalf.
  • ecsInstanceRole is specific to this workflow and adds permissions for S3FullAccess. This allows the container to read from and write to your S3 bucket. The following screenshot shows the example policy stack.

In the AWS Batch console, choose Compute environments and create a managed compute environment. Assign a cluster name and minimum and maximum vCPU values, and use the AMI ID and IAM roles created earlier. Use the Spot pricing model, setting the maximum price at around 60% of the On-Demand price. Look at the current Spot price to see if more aggressive discounts are possible.

Note the cluster name. In Amazon ECS, you should see the cluster created. Next, create a job queue and associate this job queue with the compute environment created earlier. Note the job queue name.
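
If you prefer to create the job queue programmatically, a sketch with boto3 follows; the queue and compute environment names are placeholders for the resources you just created.

import boto3

batch = boto3.client("batch")

# Placeholder names; the compute environment must already be VALID and ENABLED.
queue = batch.create_job_queue(
    jobQueueName="<your-job-queue>",
    state="ENABLED",
    priority=1,
    computeEnvironmentOrder=[{"order": 1, "computeEnvironment": "<your-compute-environment>"}],
)
print(queue["jobQueueArn"])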

Next, create a job definition file. This provides the job parameters to be used, including mount paths, vCPU, and memory requirements.

{
    "containerProperties": {
        "mountPoints": [
            {
                "sourceVolume": "codec-data",
                "readOnly": false,
                "containerPath": "/mnt"
            }],
        "image": "<accountnumber>.dkr.ecr.us-east-1.amazonaws.com/nvidia/nvidia-hvec:latest",
        "command": ["/root/nvenc-processor.sh"],
        "volumes": [
            {
                "host": {"sourcePath": "/mnt"},
                "name": "codec-data"
            }],
        "memory": 32768,
        "vcpus": 8,
        "privileged": true,
        "environment": [
            {
                "name": "S3_INPUT",
                "value": "s3://<bucket>/<key_name>"
            },
            {
                "name": "S3_OUTPUT",
                "value": "s3://<bucket>"
            }
        ],
        "ulimits": []
    },
    "type": "container",
    "jobDefinitionName": "nvenc-test"
}

Save the file as nvenc-test.json and register the job in AWS Batch.

aws batch register-job-definition --cli-input-json file://nvenc-test.json

In the AWS Batch console, create a job queue with a priority of 1, associated with the compute environment created earlier. Create a job, assigning a job name, the job definition, and the job queue. Add the environment variables for the S3 buckets, and ensure that these buckets and the input file exist.

S3_INPUT = s3://<bucket>/<key_name> 
S3_OUTPUT = s3://<bucket> 

Execute the job. In a few moments, the job should be in the RUNNING state. Check the CloudWatch logs for the progress of the job. Open the job record information and scroll down to CloudWatch metrics. The events are logged in a new AWS Batch log stream.
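
Submitting the job can also be scripted. The following sketch submits the nvenc-test job definition with the S3 environment variables overridden; the queue name and bucket values are placeholders.

import boto3

batch = boto3.client("batch")

# Placeholder queue and bucket values; the job definition name matches the one registered above.
job = batch.submit_job(
    jobName="nvenc-8k-test",
    jobQueue="<your-job-queue>",
    jobDefinition="nvenc-test",
    containerOverrides={
        "environment": [
            {"name": "S3_INPUT", "value": "s3://<bucket>/<key_name>"},
            {"name": "S3_OUTPUT", "value": "s3://<bucket>"},
        ]
    },
)
print(job["jobId"])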

A 1-minute 8K YUV 4:2:0 file took approximately 10 minutes to encode single-threaded, and 58 seconds using four threads. The nvenc-processor.sh script serves as a basic implementation of 8K encoding. Explore the options provided by the NVIDIA Video Codec SDK for additional encoding/decoding and transcoding options.

Conclusion

With AWS Batch, a customized container instance, and a dockerized NVIDIA video encoding platform, AWS can provide your HD, 4K, and now 8K media distribution. I invite you to incorporate this into your automated pipeline.

With some minor modification, it’s possible to trigger this pipeline after a new file is uploaded into S3, and execute it through AWS Lambda or as part of an AWS Step Functions workflow.

Spring 2018 AWS SOC Reports are Now Available with 11 Services Added in Scope

Post Syndicated from Chris Gile original https://aws.amazon.com/blogs/security/spring-2018-aws-soc-reports-are-now-available-with-11-services-added-in-scope/

Since our last System and Organization Control (SOC) audit, our service and compliance teams have been working to increase the number of AWS services in scope, prioritized based on customer requests. Today, we’re happy to report that 11 services are newly SOC compliant, which is a 21 percent increase in the last six months.

With the addition of the following 11 new services, you can now select from a total of 62 SOC-compliant services. To see the full list, go to our Services in Scope by Compliance Program page:

• Amazon Athena
• Amazon QuickSight
• Amazon WorkDocs
• AWS Batch
• AWS CodeBuild
• AWS Config
• AWS OpsWorks Stacks
• AWS Snowball
• AWS Snowball Edge
• AWS Snowmobile
• AWS X-Ray

Our latest SOC 1, 2, and 3 reports covering the period from October 1, 2017 to March 31, 2018 are now available. The SOC 1 and 2 reports are available on-demand through AWS Artifact by logging into the AWS Management Console. The SOC 3 report can be downloaded here.

Finally, prospective customers can read our SOC 1 and 2 reports by reaching out to AWS Compliance.

Want more AWS Security news? Follow us on Twitter.

EC2 Fleet – Manage Thousands of On-Demand and Spot Instances with One Request

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/ec2-fleet-manage-thousands-of-on-demand-and-spot-instances-with-one-request/

EC2 Spot Fleets are really cool. You can launch a fleet of Spot Instances that spans EC2 instance types and Availability Zones without having to write custom code to discover capacity or monitor prices. You can set the target capacity (the size of the fleet) in units that are meaningful to your application and have Spot Fleet create and then maintain the fleet on your behalf. Our customers are creating Spot Fleets of all sizes. For example, one financial service customer runs Monte Carlo simulations across 10 different EC2 instance types. They routinely make requests for hundreds of thousands of vCPUs and count on Spot Fleet to give them access to massive amounts of capacity at the best possible price.

EC2 Fleet
Today we are extending and generalizing the set-it-and-forget-it model that we pioneered in Spot Fleet with EC2 Fleet, a new building block that gives you the ability to create fleets that are composed of a combination of EC2 On-Demand, Reserved, and Spot Instances with a single API call. You tell us what you need, capacity and instance-wise, and we’ll handle all the heavy lifting. We will launch, manage, monitor and scale instances as needed, without the need for scaffolding code.

You can specify the capacity of your fleet in terms of instances, vCPUs, or application-oriented units, and also indicate how much of the capacity should be fulfilled by Spot Instances. The application-oriented units allow you to specify the relative power of each EC2 instance type in a way that directly maps to the needs of your application. All three capacity specification options (instances, vCPUs, and application-oriented units) are known as weights.

I think you’ll find a number of ways that this feature makes managing a fleet of instances easier, and I believe that you will also appreciate the team’s near-term feature roadmap (more on that in a bit).

Using EC2 Fleet
There are a number of ways that you can use this feature, whether you’re running a stateless web service, a big data cluster, or a continuous integration pipeline. Today I’m going to describe how you can use EC2 Fleet for genomic processing, but this is similar to workloads like risk analysis, log processing, or image rendering. Modern DNA sequencers can produce multiple terabytes of raw data each day; to process that data into meaningful information in a timely fashion, you need lots of processing power. I’ll be showing you how to deploy a “grid” of worker nodes that can quickly crunch through secondary analysis tasks in parallel.

Projects in genomics can use the elasticity EC2 provides to experiment and try out new pipelines on hundreds or even thousands of servers. With EC2 you can access as many cores as you need and only pay for what you use. Prior to today, you would need to use the RunInstances API or an Auto Scaling group for the On-Demand & Reserved Instance portion of your grid. To get the best price performance you’d also create and manage a Spot Fleet or multiple Spot Auto Scaling groups with different instance types if you wanted to add Spot Instances to turbo-boost your secondary analysis. Finally, to automate scaling decisions across multiple APIs and Auto Scaling groups you would need to write Lambda functions that periodically assess your grid’s progress & backlog, as well as current Spot prices – modifying your Auto Scaling Groups and Spot Fleets accordingly.

You can now replace all of this with a single EC2 Fleet, analyzing genomes at scale for as little as $1 per analysis. In my grid, each step in in the pipeline requires 1 vCPU and 4 GiB of memory, a perfect match for M4 and M5 instances with 4 GiB of memory per vCPU. I will create a fleet using M4 and M5 instances with weights that correspond to the number of vCPUs on each instance:

  • m4.16xlarge – 64 vCPUs, weight = 64
  • m5.24xlarge – 96 vCPUs, weight = 96

This is expressed in a template that looks like this:

"Overrides": [
{
  "InstanceType": "m4.16xlarge",
  "WeightedCapacity": 64,
},
{
  "InstanceType": "m5.24xlarge",
  "WeightedCapacity": 96,
},
]

By default, EC2 Fleet will select the most cost effective combination of instance types and Availability Zones (both specified in the template) using the current prices for the Spot Instances and public prices for the On-Demand Instances (if you specify instances for which you have matching RIs, your discounts will apply). The default mode takes weights into account to get the instances that have the lowest price per unit. So for my grid, fleet will find the instance that offers the lowest price per vCPU.

Now I can request capacity in terms of vCPUs, knowing EC2 Fleet will select the lowest cost option using only the instance types I’ve defined as acceptable. Also, I can specify how many vCPUs I want to launch using On-Demand or Reserved Instance capacity and how many vCPUs should be launched using Spot Instance capacity:

"TargetCapacitySpecification": {
	"TotalTargetCapacity": 2880,
	"OnDemandTargetCapacity": 960,
	"SpotTargetCapacity": 1920,
	"DefaultTargetCapacityType": "Spot"
}

The above means that I want a total of 2880 vCPUs, with 960 vCPUs fulfilled using On-Demand and 1920 using Spot. The On-Demand price per vCPU is lower for m5.24xlarge than the On-Demand price per vCPU for m4.16xlarge, so EC2 Fleet will launch 10 m5.24xlarge instances to fulfill 960 vCPUs. Based on current Spot pricing (again, on a per-vCPU basis), EC2 Fleet will choose to launch 30 m4.16xlarge instances or 20 m5.24xlarges, delivering 1920 vCPUs either way.

Putting it all together, I have a single file (fl1.json) that describes my fleet:

    "LaunchTemplateConfigs": [
        {
            "LaunchTemplateSpecification": {
                "LaunchTemplateId": "lt-0e8c754449b27161c",
                "Version": "1"
            }
        "Overrides": [
        {
          "InstanceType": "m4.16xlarge",
          "WeightedCapacity": 64,
        },
        {
          "InstanceType": "m5.24xlarge",
          "WeightedCapacity": 96,
        },
      ]
        }
    ],
    "TargetCapacitySpecification": {
        "TotalTargetCapacity": 2880,
        "OnDemandTargetCapacity": 960,
        "SpotTargetCapacity": 1920,
        "DefaultTargetCapacityType": "Spot"
    }
}

I can launch my fleet with a single command:

$ aws ec2 create-fleet --cli-input-json file:///home/ec2-user/fl1.json
{
    "FleetId":"fleet-838cf4e5-fded-4f68-acb5-8c47ee1b248a"
}

My entire fleet was created within seconds, built from 10 m5.24xlarge On-Demand Instances and 30 m4.16xlarge Spot Instances, since the current Spot price was 1.5¢ per vCPU for m4.16xlarge and 1.6¢ per vCPU for m5.24xlarge.

Now let’s imagine my grid has crunched through its backlog and no longer needs the additional Spot Instances. I can then modify the size of my fleet by changing the target capacity in my fleet specification, like this:

{
    "TotalTargetCapacity": 960
}
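
The same resize can be done from code; here is a minimal boto3 sketch that scales the fleet created above down to its On-Demand portion.

import boto3

ec2 = boto3.client("ec2")

# Scale the running fleet down to 960 target capacity units (the On-Demand portion).
ec2.modify_fleet(
    FleetId="fleet-838cf4e5-fded-4f68-acb5-8c47ee1b248a",
    TargetCapacitySpecification={"TotalTargetCapacity": 960},
)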

Since 960 is equal to the number of On-Demand vCPUs that I had requested, when I describe my fleet, I see all of my capacity being delivered using On-Demand capacity:

"TargetCapacitySpecification": {
	"TotalTargetCapacity": 960,
	"OnDemandTargetCapacity": 960,
	"SpotTargetCapacity": 0,
	"DefaultTargetCapacityType": "Spot"
}

When I no longer need my fleet I can delete it and terminate the instances in it like this:

$ aws ec2 delete-fleets --fleet-id fleet-838cf4e5-fded-4f68-acb5-8c47ee1b248a \
  --terminate-instances   
{
    "UnsuccessfulFleetDletetions": [],
    "SuccessfulFleetDeletions": [
        {
            "CurrentFleetState": "deleted_terminating",
            "PreviousFleetState": "active",
            "FleetId": "fleet-838cf4e5-fded-4f68-acb5-8c47ee1b248a"
        }
    ]
}

Earlier I described how RI discounts apply when EC2 Fleet launches instances for which you have matching RIs, so you might be wondering how else RI customers benefit from EC2 Fleet. Let’s say that I own regional RIs for M4 instances. In my EC2 Fleet I would remove m5.24xlarge and specify m4.10xlarge and m4.16xlarge. Then when EC2 Fleet creates the grid, it will quickly find M4 capacity across the sizes and AZs I’ve specified, and my RI discounts apply automatically to this usage.

In the Works
We plan to connect EC2 Fleet and EC2 Auto Scaling groups. This will let you create a single fleet that mixes instance types and Spot, Reserved, and On-Demand Instances, while also taking advantage of EC2 Auto Scaling features such as health checks and lifecycle hooks. This integration will also bring EC2 Fleet functionality to services such as Amazon ECS, Amazon EKS, and AWS Batch that build on and make use of EC2 Auto Scaling for fleet management.

Available Now
You can create and make use of EC2 Fleets today in all public AWS Regions!

Jeff;

AWS Adds 16 More Services to Its PCI DSS Compliance Program

Post Syndicated from Chad Woolf original https://aws.amazon.com/blogs/security/aws-adds-16-more-services-to-its-pci-dss-compliance-program/

AWS has added 16 more AWS services to its Payment Card Industry Data Security Standard (PCI DSS) compliance program, giving you more options, flexibility, and functionality to process and store sensitive payment card data in the AWS Cloud. The services were audited by Coalfire to ensure that they meet strict PCI DSS standards.

The newly compliant AWS services are:

AWS now offers 58 services that are officially PCI DSS compliant, giving administrators more service options for implementing a PCI-compliant cardholder environment.

For more information about the AWS PCI DSS compliance program, see Compliance Resources, AWS Services in Scope by Compliance Program, and PCI DSS Compliance.

– Chad Woolf

AWS Updated Its ISO Certifications and Now Has 67 Services Under ISO Compliance

Post Syndicated from Chad Woolf original https://aws.amazon.com/blogs/security/aws-updated-its-iso-certifications-and-now-has-67-services-under-iso-compliance/

AWS has updated its certifications against ISO 9001, ISO 27001, ISO 27017, and ISO 27018 standards, bringing the total to 67 services now under ISO compliance. We added the following 29 services this cycle:

• Amazon Aurora
• Amazon Cloud Directory
• Amazon CloudWatch Logs
• Amazon Cognito
• Amazon Connect
• Amazon Elastic Container Registry
• Amazon Inspector
• Amazon Kinesis Data Streams
• Amazon Macie
• Amazon QuickSight
• Amazon S3 Transfer Acceleration
• Amazon SageMaker
• Amazon Simple Notification Service
• Auto Scaling
• AWS Batch
• AWS CodeBuild
• AWS CodeCommit
• AWS CodeDeploy
• AWS CodePipeline
• AWS IoT Core
• AWS Lambda@Edge
• AWS Managed Services
• AWS OpsWorks Stacks
• AWS Shield
• AWS Snowball Edge
• AWS Snowmobile
• AWS Step Functions
• AWS Systems Manager (formerly Amazon EC2 Systems Manager)
• AWS X-Ray

For the complete list of services under ISO compliance, see AWS Services in Scope by Compliance Program.

AWS maintains certifications through extensive audits of its controls to ensure that information security risks that affect the confidentiality, integrity, and availability of company and customer information are appropriately managed.

You can download copies of the AWS ISO certificates that contain AWS’s in-scope services and Regions, and use these certificates to jump-start your own certification efforts:

AWS does not increase service costs in any AWS Region as a result of updating its certifications.

To learn more about compliance in the AWS Cloud, see AWS Cloud Compliance.

– Chad

Amazon EC2 Update – Streamlined Access to Spot Capacity, Smooth Price Changes, Instance Hibernation

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/amazon-ec2-update-streamlined-access-to-spot-capacity-smooth-price-changes-instance-hibernation/

EC2 Spot Instances give you access to spare compute capacity in the AWS Cloud. Our customers use fleets of Spot Instances to power their CI/CD environments & traffic generators, host web servers & microservices, render movies, and to run many types of analytics jobs, all at prices that offer significant savings in comparison to On-Demand Instances.

New Streamlined Access
Today we are introducing a new, streamlined access model for Spot Instances. You simply indicate your desire to use Spot capacity when you launch an instance via the RunInstances function, the run-instances command, or the AWS Management Console to submit a request that will be fulfilled as long as the capacity is available. With no extra effort on your part, you’ll save up to 90% off of the On-Demand price for the instance type, allowing you to boost your overall application throughput by up to 10x for the same budget. The instances that you launch in this way will continue to run until you terminate them or until EC2 needs to reclaim them for On-Demand usage. At that point, the instance is given the usual 2-minute warning and then reclaimed, making this a great fit for applications that are fault-tolerant.

Unlike the old model which required an understanding of Spot markets, bidding, and calls to a standalone asynchronous API, the new model is synchronous and as easy to use as On-Demand. Your code or your script receives an Instance ID immediately and need not check back to see if the request has been processed and accepted.

We’ve made this as clean and as simple as possible, with the expectation that it will be easy to modify many current scripts and applications to request and make use of Spot capacity. If you want to exercise additional control over your Spot instance budget, you have the option to specify a maximum price when you make a request for capacity. If you use Spot capacity to power your Amazon EMR, Amazon ECS, or AWS Batch clusters, or if you launch Spot instances by way of an AWS CloudFormation template or Auto Scaling Group, you will benefit from this new model without having to make any changes.

Applications that are built around RequestSpotInstances or RequestSpotFleet will continue to work just fine with no changes. However, you now have the option to make requests that do not include the SpotPrice parameter.

Smooth Price Changes
As part of today’s launch we are also changing the way that Spot prices change, moving to a model where prices adjust more gradually, based on longer-term trends in supply and demand. As I mentioned earlier, you will continue to save an average of 70-90% off the On-Demand price, and you will continue to pay the Spot price that’s in effect for the time period your instances are running. Applications built around our Spot Fleet feature will continue to automatically diversify placement of their Spot Instances across the most cost-effective pools based on the configuration you specified when you created the fleet.

Spot in Action
To launch a Spot Instance from the command line, simply specify the Spot market:

$ aws ec2 run-instances --market Spot --image-id ami-1a2b3c4d --count 1 --instance-type c3.large
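
The equivalent request can also be made through the EC2 API. The following boto3 sketch uses InstanceMarketOptions to ask for Spot capacity; the AMI ID is a placeholder.

import boto3

ec2 = boto3.client("ec2")

# Placeholder AMI ID; omitting SpotOptions.MaxPrice caps the price at the On-Demand rate.
response = ec2.run_instances(
    ImageId="ami-1a2b3c4d",
    InstanceType="c3.large",
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={"MarketType": "spot"},
)
print(response["Instances"][0]["InstanceId"])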

Instance Hibernation
If you run workloads that keep a lot of state in memory, you will love this new feature!

You can arrange for instances to save their in-memory state when they are reclaimed, allowing the instances and the applications on them to pick up where they left off when capacity is once again available, just like closing and then opening your laptop. This feature works on C3, C4, and certain sizes of R3, R4, and M4 instances running Amazon Linux, Ubuntu, or Windows Server, and is supported by the EC2 Hibernation Agent.

The in-memory state is written to the root EBS volume of the instance using space that is set aside when the instance launches. The private IP address and any Elastic IP Addresses are also preserved across a stop/start cycle.
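
As a sketch of how the interruption behavior is requested at launch time, the following boto3 call asks for Spot capacity that hibernates rather than terminates when reclaimed. The AMI and instance type are placeholders, and this assumes a persistent Spot request, which stop/hibernate behavior requires.

import boto3

ec2 = boto3.client("ec2")

# Placeholder AMI and instance type; the instance must be of a family and size that supports hibernation.
ec2.run_instances(
    ImageId="ami-1a2b3c4d",
    InstanceType="m4.large",
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {
            "SpotInstanceType": "persistent",
            "InstanceInterruptionBehavior": "hibernate",
        },
    },
)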

Jeff;

Resume AWS Step Functions from Any State

Post Syndicated from Andy Katz original https://aws.amazon.com/blogs/compute/resume-aws-step-functions-from-any-state/


Yash Pant, Solutions Architect, AWS


Aaron Friedman, Partner Solutions Architect, AWS

When we discuss how to build applications with customers, we often align to the Well Architected Framework pillars of security, reliability, performance efficiency, cost optimization, and operational excellence. Designing for failure is an essential component to developing well architected applications that are resilient to spurious errors that may occur.

There are many ways you can use AWS services to achieve high availability and resiliency of your applications. For example, you can couple Elastic Load Balancing with Auto Scaling and Amazon EC2 instances to build highly available applications. Or use Amazon API Gateway and AWS Lambda to rapidly scale out a microservices-based architecture. Many AWS services have built in solutions to help with the appropriate error handling, such as Dead Letter Queues (DLQ) for Amazon SQS or retries in AWS Batch.

AWS Step Functions is an AWS service that makes it easy for you to coordinate the components of distributed applications and microservices. Step Functions allows you to easily design for failure, by incorporating features such as error retries and custom error handling from AWS Lambda exceptions. These features allow you to programmatically handle many common error modes and build robust, reliable applications.

In some rare cases, however, your application may fail in an unexpected manner. In these situations, you might not want to duplicate in a repeat execution those portions of your state machine that have already run. This is especially true when orchestrating long-running jobs or executing a complex state machine as part of a microservice. Here, you need to know the last successful state in your state machine from which to resume, so that you don’t duplicate previous work. In this post, we present a solution to enable you to resume from any given state in your state machine in the case of an unexpected failure.

Resuming from a given state

To resume a failed state machine execution from the state at which it failed, you first run a script that dynamically creates a new state machine. When the new state machine is executed, it resumes the failed execution from the point of failure. The script contains the following two primary steps:

  1. Parse the execution history of the failed execution to find the name of the state at which it failed, as well as the JSON input to that state.
  2. Create a new state machine, which adds an additional state, called "GoToState", to the failed state machine. "GoToState" is a choice state at the beginning of the state machine that branches execution directly to the failed state, allowing you to skip states that had succeeded in the previous execution.

The full script along with a CloudFormation template that creates a demo of this is available in the aws-sfn-resume-from-any-state GitHub repo.

Diving into the script

In this section, we walk you through the script and highlight the core components of its functionality. The script contains a main function, which adds a command line parameter for the failedExecutionArn so that you can easily call the script from the command line:

python gotostate.py --failedExecutionArn '<Failed_Execution_Arn>'

Identifying the failed state in your execution

First, the script extracts the name of the failed state along with the input to that state. It does so by using the failed state machine execution history, which is identified by the Amazon Resource Name (ARN) of the execution. The failed state is marked in the execution history, along with the input to that state (which is also the output of the preceding successful state). The script is able to parse these values from the log.

The script loops through the execution history of the failed state machine, and traces it backwards until it finds the failed state. If the state machine failed in a parallel state, then it must restart from the beginning of the parallel state. The script is able to capture the name of the parallel state that failed, rather than any substate within the parallel state that may have caused the failure. The following code is the Python function that does this.


import json
import boto3

# Step Functions client used by the functions in this script
client = boto3.client('stepfunctions')

def parseFailureHistory(failedExecutionArn):

    '''
    Parses the execution history of a failed state machine to get the name of failed state and the input to the failed state:
    Input failedExecutionArn = A string containing the execution ARN of a failed state machine
    Output = A list with two elements: [name of failed state, input to failed state]
    '''
    failedAtParallelState = False
    try:
        #Get the execution history
        response = client.get_execution_history(
            executionArn=failedExecutionArn,
            reverseOrder=True
        )
        failedEvents = response['events']
    except Exception as ex:
        raise ex
    #Confirm that the execution actually failed, raise exception if it didn't fail.
    try:
        failedEvents[0]['executionFailedEventDetails']
    except:
        raise Exception('Execution did not fail')
        
    '''
    If you have a 'States.Runtime' error (for example, if a task state in your state machine attempts to execute a Lambda function in a different region than the state machine), get the ID of the failed state, and use it to determine the failed state name and input.
    '''
    
    if failedEvents[0]['executionFailedEventDetails']['error'] == 'States.Runtime':
        failedId = int(''.join(filter(str.isdigit, str(failedEvents[0]['executionFailedEventDetails']['cause'].split()[13]))))
        failedState = failedEvents[-1 * failedId]['stateEnteredEventDetails']['name']
        failedInput = failedEvents[-1 * failedId]['stateEnteredEventDetails']['input']
        return (failedState, failedInput)
        
    '''
    You need to loop through the execution history, tracing back the executed steps.
    The first state you encounter is the failed state. If you failed on a parallel state, you need the name of the parallel state rather than the name of a state within a parallel state that it failed on. This is because you can only attach goToState to the parallel state, but not a substate within the parallel state.
    This loop starts with the ID of the latest event and uses the previous event IDs to trace back the execution to the beginning (id 0). However, it returns as soon it finds the name of the failed state.
    '''

    currentEventId = failedEvents[0]['id']
    while currentEventId != 0:
        #multiply event ID by -1 for indexing because you're looking at the reversed history
        currentEvent = failedEvents[-1 * currentEventId]
        
        '''
        You can determine if the failed state was a parallel state because it and an event with 'type'='ParallelStateFailed' appears in the execution history before the name of the failed state
        '''

        if currentEvent['type'] == 'ParallelStateFailed':
            failedAtParallelState = True

        '''
        If the failed state is not a parallel state, then the name of failed state to return is the name of the state in the first 'TaskStateEntered' event type you run into when tracing back the execution history
        '''

        if currentEvent['type'] == 'TaskStateEntered' and failedAtParallelState == False:
            failedState = currentEvent['stateEnteredEventDetails']['name']
            failedInput = currentEvent['stateEnteredEventDetails']['input']
            return (failedState, failedInput)

        '''
        If the failed state was a parallel state, then you need to trace execution back to the first event with 'type'='ParallelStateEntered', and return the name of the state
        '''

        if currentEvent['type'] == 'ParallelStateEntered' and failedAtParallelState:
            failedState = currentEvent['stateEnteredEventDetails']['name']
            failedInput = currentEvent['stateEnteredEventDetails']['input']
            return (failedState, failedInput)
        #Update the ID for the next execution of the loop
        currentEventId = currentEvent['previousEventId']
        

Create the new state machine

The script uses the name of the failed state to create the new state machine, with "GoToState" branching execution directly to the failed state.

To do this, the script requires the Amazon States Language (ASL) definition of the failed state machine. It modifies the definition to append "GoToState", and create a new state machine from it.

The script gets the ARN of the failed state machine from the execution ARN of the failed state machine. This ARN allows it to get the ASL definition of the failed state machine by calling the DescribeStateMachine API action. It creates a new state machine with "GoToState".

When the script creates the new state machine, it also adds an additional input variable called "resuming". When you execute this new state machine, you specify this resuming variable as true in the input JSON. This tells "GoToState" to branch execution to the state that had previously failed. Here’s the function that does this:

def attachGoToState(failedStateName, stateMachineArn):

    '''
    Given a state machine ARN and the name of a state in that state machine, create a new state machine that starts at a new choice state called 'GoToState'. "GoToState" branches to the named state, and sends the input of the state machine to that state, when a variable called "resuming" is set to True.
    Input failedStateName = A string with the name of the failed state
          stateMachineArn = A string with the ARN of the state machine
    Output response from the create_state_machine call, which is the API call that creates a new state machine
    '''

    try:
        response = client.describe_state_machine(
            stateMachineArn=stateMachineArn
        )
    except:
        raise Exception('Could not get ASL definition of state machine')
    roleArn = response['roleArn']
    stateMachine = json.loads(response['definition'])
    #Create a name for the new state machine
    newName = response['name'] + '-with-GoToState'
    #Get the StartAt state for the original state machine, because you point the 'GoToState' to this state
    originalStartAt = stateMachine['StartAt']

    '''
    Create the GoToState with the variable $.resuming.
    If new state machine is executed with $.resuming = True, then the state machine skips to the failed state.
    Otherwise, it executes the state machine from the original start state.
    '''

    goToState = {'Type':'Choice', 'Choices':[{'Variable':'$.resuming', 'BooleanEquals':False, 'Next':originalStartAt}], 'Default':failedStateName}
    #Add GoToState to the set of states in the new state machine
    stateMachine['States']['GoToState'] = goToState
    #Add StartAt
    stateMachine['StartAt'] = 'GoToState'
    #Create new state machine
    try:
        response = client.create_state_machine(
            name=newName,
            definition=json.dumps(stateMachine),
            roleArn=roleArn
        )
    except:
        raise Exception('Failed to create new state machine with GoToState')
    return response
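
A minimal sketch of how these two functions can be tied together and the new state machine started follows; it assumes the boto3 Step Functions client (client) and the json import from the script above, and it sets the "resuming" flag automatically.

def resumeFailedExecution(failedExecutionArn):
    '''
    Sketch: parse the failed execution, build the new state machine with GoToState,
    and start it with "resuming" set to True so that it branches straight to the failed state.
    '''
    failedState, failedInput = parseFailureHistory(failedExecutionArn)
    # Derive the state machine ARN from the execution ARN by swapping the resource type
    # and dropping the execution name.
    stateMachineArn = ':'.join(
        failedExecutionArn.replace(':execution:', ':stateMachine:').split(':')[:-1]
    )
    newMachine = attachGoToState(failedState, stateMachineArn)

    resumeInput = json.loads(failedInput)
    resumeInput['resuming'] = True
    return client.start_execution(
        stateMachineArn=newMachine['stateMachineArn'],
        input=json.dumps(resumeInput)
    )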

Testing the script

Now that you understand how the script works, you can test it out.

The following screenshot shows an example state machine that has failed, called "TestMachine". This state machine successfully completed "FirstState" and "ChoiceState", but when it branched to "FirstMatchState", it failed.

Use the script to create a new state machine that allows you to rerun this state machine, but skip the "FirstState" and the "ChoiceState" steps that already succeeded. You can do this by calling the script as follows:

python gotostate.py --failedExecutionArn 'arn:aws:states:us-west-2:<AWS_ACCOUNT_ID>:execution:TestMachine:b2578403-f41d-a2c7-e70c-7500045288595'

This creates a new state machine called "TestMachine-with-GoToState", and returns its ARN, along with the input that had been sent to "FirstMatchState". You can then inspect the input to determine what caused the error. In this case, you notice that the input to "FirstMatchState" was the following:

{
"foo": 1,
"Message": true
}

However, this state machine expects the "Message" field of the JSON to be a string rather than a Boolean. Execute the new "TestMachine-with-GoToState" state machine, change the input to be a string, and add the "resuming" variable that "GoToState" requires:

{
"foo": 1,
"Message": "Hello!",
"resuming":true
}

When you execute the new state machine, it skips "FirstState" and "ChoiceState", and goes directly to "FirstMatchState", which was the state that failed:

Look at what happens when you have a state machine with multiple parallel steps. This example is included in the GitHub repository associated with this post. The repo contains a CloudFormation template that sets up this state machine and provides instructions to replicate this solution.

The following state machine, "ParallelStateMachine", takes an input through two subsequent parallel states before doing some final processing and exiting. Its ASL definition is shown in the JSON below.

{
  "Comment": "An example of the Amazon States Language using a parallel state to execute two branches at the same time.",
  "StartAt": "Parallel",
  "States": {
    "Parallel": {
      "Type": "Parallel",
      "ResultPath":"$.output",
      "Next": "Parallel 2",
      "Branches": [
        {
          "StartAt": "Parallel Step 1, Process 1",
          "States": {
            "Parallel Step 1, Process 1": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaA",
              "End": true
            }
          }
        },
        {
          "StartAt": "Parallel Step 1, Process 2",
          "States": {
            "Parallel Step 1, Process 2": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaA",
              "End": true
            }
          }
        }
      ]
    },
    "Parallel 2": {
      "Type": "Parallel",
      "Next": "Final Processing",
      "Branches": [
        {
          "StartAt": "Parallel Step 2, Process 1",
          "States": {
            "Parallel Step 2, Process 1": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXXX:function:LambdaB",
              "End": true
            }
          }
        },
        {
          "StartAt": "Parallel Step 2, Process 2",
          "States": {
            "Parallel Step 2, Process 2": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaB",
              "End": true
            }
          }
        }
      ]
    },
    "Final Processing": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaC",
      "End": true
    }
  }
}

First, use an input that initially fails:

{
  "Message": "Hello!"
}

This fails because the second parallel state expects the input JSON to contain a variable called "foo" in order to run "Parallel Step 2, Process 1" and "Parallel Step 2, Process 2". Instead, the original input gets processed by the first parallel state and produces the following output to pass to the second parallel state:

{
"output": [
    {
      "Message": "Hello!"
    },
    {
      "Message": "Hello!"
    }
  ]
}

Run the script on the failed state machine to create a new state machine that allows it to resume directly at the second parallel state instead of having to redo the first parallel state. This creates a new state machine called "ParallelStateMachine-with-GoToState". The following JSON was created by the script to define the new state machine in ASL. It contains the "GoToState" value that was attached by the script.

{
   "Comment":"An example of the Amazon States Language using a parallel state to execute two branches at the same time.",
   "States":{
      "Final Processing":{
         "Resource":"arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaC",
         "End":true,
         "Type":"Task"
      },
      "GoToState":{
         "Default":"Parallel 2",
         "Type":"Choice",
         "Choices":[
            {
               "Variable":"$.resuming",
               "BooleanEquals":false,
               "Next":"Parallel"
            }
         ]
      },
      "Parallel":{
         "Branches":[
            {
               "States":{
                  "Parallel Step 1, Process 1":{
                     "Resource":"arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaA",
                     "End":true,
                     "Type":"Task"
                  }
               },
               "StartAt":"Parallel Step 1, Process 1"
            },
            {
               "States":{
                  "Parallel Step 1, Process 2":{
                     "Resource":"arn:aws:lambda:us-west-2:XXXXXXXXXXXX:LambdaA",
                     "End":true,
                     "Type":"Task"
                  }
               },
               "StartAt":"Parallel Step 1, Process 2"
            }
         ],
         "ResultPath":"$.output",
         "Type":"Parallel",
         "Next":"Parallel 2"
      },
      "Parallel 2":{
         "Branches":[
            {
               "States":{
                  "Parallel Step 2, Process 1":{
                     "Resource":"arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaB",
                     "End":true,
                     "Type":"Task"
                  }
               },
               "StartAt":"Parallel Step 2, Process 1"
            },
            {
               "States":{
                  "Parallel Step 2, Process 2":{
                     "Resource":"arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaB",
                     "End":true,
                     "Type":"Task"
                  }
               },
               "StartAt":"Parallel Step 2, Process 2"
            }
         ],
         "Type":"Parallel",
         "Next":"Final Processing"
      }
   },
   "StartAt":"GoToState"
}
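
The core idea is simple enough to sketch in a few lines of Python. This is a simplified illustration rather than the actual script (which also inspects the failed execution's history to determine where to resume); the add_go_to_state function name, and the assumption that you already know which state to resume at, are mine:

import json

def add_go_to_state(definition_json, resume_at):
    """Return a new ASL definition that can skip ahead to resume_at."""
    definition = json.loads(definition_json)
    original_start = definition["StartAt"]
    # Prepend a Choice state: a normal run (resuming == false) follows the
    # original flow, while a resumed run jumps straight to resume_at.
    definition["States"]["GoToState"] = {
        "Type": "Choice",
        "Choices": [
            {
                "Variable": "$.resuming",
                "BooleanEquals": False,
                "Next": original_start
            }
        ],
        "Default": resume_at
    }
    definition["StartAt"] = "GoToState"
    return json.dumps(definition, indent=2)

# For example, to resume the parallel example at its second parallel state:
# new_definition = add_go_to_state(original_definition_json, "Parallel 2")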

You can then execute this state machine with the correct input by adding the "foo" and "resuming" variables:

{
  "foo": 1,
  "output": [
    {
      "Message": "Hello!"
    },
    {
      "Message": "Hello!"
    }
  ],
  "resuming": true
}
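
You can provide this input from the Step Functions console, or start the execution programmatically. Here is a minimal boto3 sketch; the state machine ARN is a placeholder for the ParallelStateMachine-with-GoToState ARN in your own account:

import json
import boto3

sfn = boto3.client('stepfunctions', region_name='us-west-2')

# The corrected input, including the "resuming" flag consumed by GoToState.
resume_input = {
    "foo": 1,
    "output": [{"Message": "Hello!"}, {"Message": "Hello!"}],
    "resuming": True
}

response = sfn.start_execution(
    stateMachineArn='arn:aws:states:us-west-2:XXXXXXXXXXXX:stateMachine:ParallelStateMachine-with-GoToState',
    input=json.dumps(resume_input)
)
print(response['executionArn'])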

Notice that this time the state machine runs successfully to completion, skipping the first parallel state that had already succeeded and resuming at the point of failure.


Conclusion

When you’re building out complex workflows, it’s important to be prepared for failure. You can do this by taking advantage of features such as automatic error retries in Step Functions and custom error handling of Lambda exceptions.

Nevertheless, state machines still have the possibility of failing. With the methodology and script presented in this post, you can resume a failed state machine from its point of failure. This allows you to skip the execution of steps in the workflow that had already succeeded, and recover the process from the point of failure.

To see more examples, please visit the Step Functions Getting Started page.

If you have questions or suggestions, please comment below.

AWS HIPAA Eligibility Update (October 2017) – Sixteen Additional Services

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-hipaa-eligibility-post-update-october-2017-sixteen-additional-services/

Our Health Customer Stories page lists just a few of the many customers that are building and running healthcare and life sciences applications that run on AWS. Customers like Verge Health, Care Cloud, and Orion Health trust AWS with Protected Health Information (PHI) and Personally Identifying Information (PII) as part of their efforts to comply with HIPAA and HITECH.

Sixteen More Services
In my last HIPAA Eligibility Update I shared the news that we added eight additional services to our list of HIPAA eligible services. Today I am happy to let you know that we have added another sixteen services to the list, bringing the total up to 46. Here are the newest additions, along with some short descriptions and links to some of my blog posts to jog your memory:

Amazon Aurora with PostgreSQL Compatibility – This brand-new addition to Amazon Aurora allows you to encrypt your relational databases using keys that you create and manage through AWS Key Management Service (KMS). When you enable encryption for an Amazon Aurora database, the underlying storage is encrypted, as are automated backups, read replicas, and snapshots. Read New – Encryption at Rest for Amazon Aurora to learn more.

Amazon CloudWatch Logs – You can use the logs to monitor and troubleshoot your systems and applications. You can monitor your existing system, application, and custom log files in near real-time, watching for specific phrases, values, or patterns. Log data can be stored durably and at low cost, for as long as needed. To learn more, read Store and Monitor OS & Application Log Files with Amazon CloudWatch and Improvements to CloudWatch Logs and Dashboards.

Amazon Connect – This self-service, cloud-based contact center makes it easy for you to deliver better customer service at a lower cost. You can use the visual designer to set up your contact flows, manage agents, and track performance, all without specialized skills. Read Amazon Connect – Customer Contact Center in the Cloud and New – Amazon Connect and Amazon Lex Integration to learn more.

Amazon ElastiCache for Redis – This service lets you deploy, operate, and scale an in-memory data store or cache that you can use to improve the performance of your applications. Each ElastiCache for Redis cluster publishes key performance metrics to Amazon CloudWatch. To learn more, read Caching in the Cloud with Amazon ElastiCache and Amazon ElastiCache – Now With a Dash of Redis.

Amazon Kinesis Streams – This service allows you to build applications that process or analyze streaming data such as website clickstreams, financial transactions, social media feeds, and location-tracking events. To learn more, read Amazon Kinesis – Real-Time Processing of Streaming Big Data and New: Server-Side Encryption for Amazon Kinesis Streams.

Amazon RDS for MariaDB – This service lets you set up scalable, managed MariaDB instances in minutes, and offers high performance, high availability, and a simplified security model that makes it easy for you to encrypt data at rest and in transit. Read Amazon RDS Update – MariaDB is Now Available to learn more.

Amazon RDS SQL Server – This service lets you set up scalable, managed Microsoft SQL Server instances in minutes, and also offers high performance, high availability, and a simplified security model. To learn more, read Amazon RDS for SQL Server and .NET support for AWS Elastic Beanstalk and Amazon RDS for Microsoft SQL Server – Transparent Data Encryption (TDE).

Amazon Route 53 – This is a highly available Domain Name System (DNS) service. It translates names like www.example.com into IP addresses. To learn more, read Moving Ahead with Amazon Route 53.

AWS Batch – This service lets you run large-scale batch computing jobs on AWS. You don’t need to install or maintain specialized batch software or build your own server clusters. Read AWS Batch – Run Batch Computing Jobs on AWS to learn more.

AWS CloudHSM – A cloud-based Hardware Security Module (HSM) for key storage and management at cloud scale. Designed for sensitive workloads, CloudHSM lets you manage your own keys using FIPS 140-2 Level 3 validated HSMs. To learn more, read AWS CloudHSM – Secure Key Storage and Cryptographic Operations and AWS CloudHSM Update – Cost Effective Hardware Key Management at Cloud Scale for Sensitive & Regulated Workloads.

AWS Key Management Service – This service makes it easy for you to create and control the encryption keys used to encrypt your data. It uses HSMs to protect your keys, and is integrated with AWS CloudTrail in order to provide you with a log of all key usage. Read New AWS Key Management Service (KMS) to learn more.

AWS Lambda – This service lets you run event-driven application or backend code without thinking about or managing servers. To learn more, read AWS Lambda – Run Code in the Cloud, AWS Lambda – A Look Back at 2016, and AWS Lambda – In Full Production with New Features for Mobile Devs.

Lambda@Edge – You can use this new feature of AWS Lambda to run Node.js functions across the global network of AWS locations without having to provision or manage servers, in order to deliver rich, personalized content to your users with low latency. Read Lambda@Edge – Intelligent Processing of HTTP Requests at the Edge to learn more.

AWS Snowball Edge – This is a data transfer device with 100 terabytes of on-board storage as well as compute capabilities. You can use it to move large amounts of data into or out of AWS, as a temporary storage tier, or to support workloads in remote or offline locations. To learn more, read AWS Snowball Edge – More Storage, Local Endpoints, Lambda Functions.

AWS Snowmobile – This is an exabyte-scale data transfer service. Pulled by a semi-trailer truck, each Snowmobile packs 100 petabytes of storage into a ruggedized 45-foot long shipping container. Read AWS Snowmobile – Move Exabytes of Data to the Cloud in Weeks to learn more (and to see some of my finest LEGO work).

AWS Storage Gateway – This hybrid storage service lets your on-premises applications use AWS cloud storage (Amazon Simple Storage Service (S3), Amazon Glacier, and Amazon Elastic File System) in a simple and seamless way, with storage for volumes, files, and virtual tapes. To learn more, read The AWS Storage Gateway – Integrate Your Existing On-Premises Applications with AWS Cloud Storage and File Interface to AWS Storage Gateway.

And there you go! Check out my earlier post for a list of resources that will help you to build applications that comply with HIPAA and HITECH.

Jeff;

 

New – Per-Second Billing for EC2 Instances and EBS Volumes

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-per-second-billing-for-ec2-instances-and-ebs-volumes/

Back in the old days, you needed to buy or lease a server if you needed access to compute power. When we launched EC2 back in 2006, the ability to use an instance for an hour, and to pay only for that hour, was big news. The pay-as-you-go model inspired our customers to think about new ways to develop, test, and run applications of all types.

Today, services like AWS Lambda prove that we can do a lot of useful work in a short time. Many of our customers are dreaming up applications for EC2 that can make good use of a large number of instances for shorter amounts of time, sometimes just a few minutes.

Per-Second Billing for EC2 and EBS
Effective October 2nd, usage of Linux instances that are launched in On-Demand, Reserved, and Spot form will be billed in one-second increments. Similarly, provisioned storage for EBS volumes will be billed in one-second increments.

Per-second billing also applies to Amazon EMR and AWS Batch:

Amazon EMR – Our customers add capacity to their EMR clusters in order to get their results more quickly. With per-second billing for the EC2 instances in the clusters, adding nodes is more cost-effective than ever.

AWS Batch – Many of the batch jobs that our customers run complete in less than an hour. AWS Batch already launches and terminates Spot Instances; with per-second billing, batch processing will become even more economical.

Some of our more sophisticated customers have built systems to get the most value from EC2 by strategically choosing the most advantageous target instances when managing their gaming, ad tech, or 3D rendering fleets. Per-second billing obviates the need for this extra layer of instance management, and brings the cost savings to all customers and all workloads.

While this will result in a price reduction for many workloads (and you know we love price reductions), I don’t think that’s the most important aspect of this change. I believe that this change will inspire you to innovate and to think about your compute-bound problems in new ways. How can you use it to improve your support for continuous integration? Can it change the way that you provision transient environments for your dev and test workloads? What about your analytics, batch processing, and 3D rendering?

One of the many advantages of cloud computing is the elastic nature of provisioning or deprovisioning resources as you need them. By billing usage down to the second, we enable customers to level up their elasticity, save money, and position themselves to take advantage of continuing advances in computing.

Things to Know
This change applies in all AWS Regions and is effective October 2 for all Linux instances that are newly launched or already running. Per-second billing is not currently applicable to instances running Microsoft Windows or Linux distributions that have a separate hourly charge. There is a one-minute minimum charge per instance.
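
To make the arithmetic concrete, here is a small sketch of how the one-minute minimum and per-second granularity interact; the hourly rate is a made-up figure for illustration, not an actual EC2 price:

# Illustration of per-second billing with a one-minute minimum charge.
HOURLY_RATE = 0.10        # hypothetical On-Demand rate in USD per hour, not a real price
MINIMUM_SECONDS = 60      # one-minute minimum charge per instance

def billed_cost(run_seconds):
    """Cost of a single Linux instance run, billed per second."""
    billable_seconds = max(run_seconds, MINIMUM_SECONDS)
    return billable_seconds * HOURLY_RATE / 3600.0

for seconds in (45, 90, 600, 3600):
    print("%5d seconds -> $%.6f" % (seconds, billed_cost(seconds)))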

List prices and Spot Market prices are still listed on a per-hour basis, but bills are calculated down to the second, as is Reserved Instance usage (you can launch, use, and terminate multiple instances within an hour and get the Reserved Instance Benefit for all of the instances). Also, bills will show usage times in decimal form.

The Dedicated Per Region Fee, EBS Snapshots, and products in AWS Marketplace are still billed on an hourly basis.

Jeff;

 

Building High-Throughput Genomics Batch Workflows on AWS: Workflow Layer (Part 4 of 4)

Post Syndicated from Andy Katz original https://aws.amazon.com/blogs/compute/building-high-throughput-genomics-batch-workflows-on-aws-workflow-layer-part-4-of-4/

Aaron Friedman is a Healthcare and Life Sciences Partner Solutions Architect at AWS

Angel Pizarro is a Scientific Computing Technical Business Development Manager at AWS

This post is the fourth in a series on how to build a genomics workflow on AWS. In Part 1, we introduced a general architecture, shown below, and highlighted the three common layers in a batch workflow:

  • Job
  • Batch
  • Workflow

In Part 2, you built a Docker container for each job that needed to run as part of your workflow, and stored them in Amazon ECR.

In Part 3, you tackled the batch layer and built a scalable, elastic, and easily maintainable batch engine using AWS Batch. This solution took care of dynamically scaling your compute resources in response to the number of runnable jobs in your job queue, as well as managing job placement.

In Part 4, you build out the workflow layer of your solution using AWS Step Functions and AWS Lambda. You then run an end-to-end genomic analysis, specifically exome secondary analysis, many times, at a cost of less than $1 per exome.

Step Functions makes it easy to coordinate the components of your applications using visual workflows. Building applications from individual components that each perform a single function lets you scale and change your workflow quickly. You can use the graphical console to arrange and visualize the components of your application as a series of steps, which simplifies building and running multi-step applications. You can change and add steps without writing code, so you can easily evolve your application and innovate faster.

An added benefit of using Step Functions to define your workflows is that the state machines you create are immutable. While you can delete a state machine, you cannot alter it after it is created. For regulated workloads where auditing is important, you can be assured that state machines you used in production cannot be altered.

In this blog post, you will create a Lambda state machine to orchestrate your batch workflow. For more information on how to create a basic state machine, please see this Step Functions tutorial.

All code related to this blog series can be found in the associated GitHub repository here.

Build a state machine building block

If you would like to skip the following steps, we have provided an AWS CloudFormation template that deploys your Step Functions state machine. You can use it in combination with the setup you did in Part 3 to quickly set up the environment in which to run your analysis.

The state machine is composed of smaller state machines that submit a job to AWS Batch, and then poll and check its execution.

The steps in this building block state machine are as follows:

  1. A job is submitted.
    Each analytical module/job has its own Lambda function for submission and calls the batchSubmitJob Lambda function that you built in the previous blog post. You will build these specialized Lambda functions in the following section.
  2. The state machine queries the AWS Batch API for the job status.
    This is also a Lambda function.
  3. The job status is checked to see if the job has completed.
    If the job status equals SUCCEEDED, proceed to log the final job status. If the job status equals FAILED, end the execution of the state machine. In all other cases, wait 30 seconds and go back to Step 2.

Here is the JSON representing this state machine.

{
  "Comment": "A simple example that submits a Job to AWS Batch",
  "StartAt": "SubmitJob",
  "States": {
    "SubmitJob": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:<account-id>::function:batchSubmitJob",
      "Next": "GetJobStatus"
    },
    "GetJobStatus": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:<account-id>:function:batchGetJobStatus",
      "Next": "CheckJobStatus",
      "InputPath": "$",
      "ResultPath": "$.status"
    },
    "CheckJobStatus": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.status",
          "StringEquals": "FAILED",
          "End": true
        },
        {
          "Variable": "$.status",
          "StringEquals": "SUCCEEDED",
          "Next": "GetFinalJobStatus"
        }
      ],
      "Default": "Wait30Seconds"
    },
    "Wait30Seconds": {
      "Type": "Wait",
      "Seconds": 30,
      "Next": "GetJobStatus"
    },
    "GetFinalJobStatus": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:<account-id>:function:batchGetJobStatus",
      "End": true
    }
  }
}

Building the Lambda functions for the state machine

You need two basic Lambda functions for this state machine. The first one submits a job to AWS Batch and the second checks the status of the AWS Batch job that was submitted.

In AWS Step Functions, you specify an input as JSON that is read into your state machine. Each state receives the aggregate of the steps immediately preceding it, and you can specify which components a state passes on to its children. Because you are using Lambda functions to execute tasks, one of the easiest routes to take is to modify the input JSON, represented as a Python dictionary, within the Lambda function and return the entire dictionary back for the next state to consume.

Building the batchSubmitIsaacJob Lambda function

For Step 1 above, you need a Lambda function for each of the steps in your analysis workflow. As you created a generic Lambda function in the previous post to submit a batch job (batchSubmitJob), you can use that function as the basis for the specialized functions you’ll include in this state machine. Here is such a Lambda function for the Isaac aligner.

from __future__ import print_function

import boto3
import json
import traceback

lambda_client = boto3.client('lambda')



def lambda_handler(event, context):
    try:
        # Generate output path for the aligned BAM files
        bam_s3_path = '/'.join([event['resultsS3Path'], event['sampleId'], 'bam/'])

        depends_on = event['dependsOn'] if 'dependsOn' in event else []

        # Generate run command
        command = [
            '--bam_s3_folder_path', bam_s3_path,
            '--fastq1_s3_path', event['fastq1S3Path'],
            '--fastq2_s3_path', event['fastq2S3Path'],
            '--reference_s3_path', event['isaac']['referenceS3Path'],
            '--working_dir', event['workingDir']
        ]

        if 'cmdArgs' in event['isaac']:
            command.extend(['--cmd_args', event['isaac']['cmdArgs']])
        if 'memory' in event['isaac']:
            command.extend(['--memory', event['isaac']['memory']])

        # Submit Payload
        response = lambda_client.invoke(
            FunctionName='batchSubmitJob',
            InvocationType='RequestResponse',
            LogType='Tail',
            Payload=json.dumps(dict(
                dependsOn=depends_on,
                containerOverrides={
                    'command': command,
                },
                jobDefinition=event['isaac']['jobDefinition'],
                jobName='-'.join(['isaac', event['sampleId']]),
                jobQueue=event['isaac']['jobQueue']
            )))

        response_payload = response['Payload'].read()

        # Update event
        event['bamS3Path'] = bam_s3_path
        event['jobId'] = json.loads(response_payload)['jobId']
        
        return event
    except Exception as e:
        traceback.print_exc()
        raise e

In the Lambda console, create a Python 2.7 Lambda function named batchSubmitIsaacJob and paste in the above code. Use the LambdaBatchExecutionRole that you created in the previous post. For more information, see Step 2.1: Create a Hello World Lambda Function.

This Lambda function reads in the inputs passed to the state machine it is part of, formats the data for the batchSubmitJob Lambda function, invokes that Lambda function, and then modifies the event dictionary to pass on to the subsequent states. You can repeat these steps for each of the other tools; the corresponding functions can be found in the tools/<tool-name>/lambda/lambda_function.py scripts in the GitHub repo.

Building the batchGetJobStatus Lambda function

For Step 2 above, the process queries the AWS Batch DescribeJobs API action with jobId to identify the state that the job is in. You can put this into a Lambda function to integrate it with Step Functions.

In the Lambda console, create a new Python 2.7 function with the LambdaBatchExecutionRole IAM role. Name your function batchGetJobStatus and paste in the following code. This is similar to the batch-get-job-python27 Lambda blueprint.

from __future__ import print_function

import boto3
import json

print('Loading function')

batch_client = boto3.client('batch')

def lambda_handler(event, context):
    # Log the received event
    print("Received event: " + json.dumps(event, indent=2))
    # Get jobId from the event
    job_id = event['jobId']

    try:
        response = batch_client.describe_jobs(
            jobs=[job_id]
        )
        job_status = response['jobs'][0]['status']
        return job_status
    except Exception as e:
        print(e)
        message = 'Error getting Batch Job status'
        print(message)
        raise Exception(message)

Structuring state machine input

You have structured the state machine input so that general file references are included at the top-level of the JSON object, and any job-specific items are contained within a nested JSON object. At a high level, this is what the input structure looks like:

{
  "general_field_1": "value1",
  "general_field_2": "value2",
  "general_field_3": "value3",
  "job1": {},
  "job2": {},
  "job3": {}
}

Building the full state machine

By chaining these state machine components together, you can quickly build flexible workflows that can process genomes in multiple ways. The larger state machine that defines the entire workflow uses four of the building blocks described above. You use the Lambda functions that you built in the previous section, renaming each building block's submission step to match the tool that it submits.

We have provided a CloudFormation template to deploy your state machine and the associated IAM roles. In the CloudFormation console, select Create Stack, choose your template (deploy_state_machine.yaml), and enter the ARNs for the Lambda functions you created.

Continue through the rest of the steps and deploy your stack. Be sure to check the box next to "I acknowledge that AWS CloudFormation might create IAM resources."

Once the CloudFormation stack has finished deploying, you can open the new state machine in the Step Functions console to see a visual representation of the workflow.

In short, you first submit a job for Isaac, which is the aligner you are using for the analysis. Next, you use a Parallel state to split the output from "GetFinalIsaacJobStatus" and send it to both your variant calling step, Strelka, and your QC step, Samtools Stats. These are then run in parallel, and you annotate the results from your Strelka step with snpEff.
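
To make the shape of the workflow easier to picture, here is a condensed outline expressed as a Python dictionary. It is not the repository's exact definition: each Task below stands in for a complete submit/poll/check building block, and the state names and Lambda function names are placeholders patterned after batchSubmitIsaacJob:

# Condensed outline of the full workflow (placeholders, not the real definition).
workflow_outline = {
    "Comment": "Isaac alignment, then Strelka and Samtools Stats in parallel, then snpEff",
    "StartAt": "IsaacBuildingBlock",
    "States": {
        "IsaacBuildingBlock": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:<account-id>:function:batchSubmitIsaacJob",
            "Next": "VariantCallingAndQC"
        },
        "VariantCallingAndQC": {
            "Type": "Parallel",
            "Next": "SnpEffBuildingBlock",
            "Branches": [
                {"StartAt": "StrelkaBuildingBlock", "States": {
                    "StrelkaBuildingBlock": {
                        "Type": "Task",
                        "Resource": "arn:aws:lambda:us-east-1:<account-id>:function:batchSubmitStrelkaJob",
                        "End": True}}},
                {"StartAt": "SamtoolsStatsBuildingBlock", "States": {
                    "SamtoolsStatsBuildingBlock": {
                        "Type": "Task",
                        "Resource": "arn:aws:lambda:us-east-1:<account-id>:function:batchSubmitSamtoolsStatsJob",
                        "End": True}}}
            ]
        },
        "SnpEffBuildingBlock": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:<account-id>:function:batchSubmitSnpeffJob",
            "End": True
        }
    }
}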

Putting it all together

Now that you have built all of the components for a genomics secondary analysis workflow, test the entire process.

We have provided sequences from an Illumina sequencer that cover a region of the genome known as the exome. Most of the positions in the genome that we have currently associated with disease or human traits reside in this region, which is 1–2% of the entire genome. The workflow that you have built works for analyzing both an exome and an entire genome.

Additionally, we have provided prebuilt reference genomes for Isaac, located at:

s3://aws-batch-genomics-resources/reference/

If you are interested, we have provided a script that sets up all of that data. To execute that script, run the following command on a large EC2 instance:

make reference REGISTRY=<your-ecr-registry>

Indexing and preparing this reference takes many hours on a large-memory EC2 instance. Be careful about the costs involved, and note that the prepared data is already available at the prebuilt reference path above.

Starting the execution

In a previous section, you established the structure of the JSON that is fed into your state machine. For convenience, we have pre-populated the input JSON for the state machine. You can also find it in the GitHub repo under workflow/test.input.json:

{
  "fastq1S3Path": "s3://aws-batch-genomics-resources/fastq/SRR1919605_1.fastq.gz",
  "fastq2S3Path": "s3://aws-batch-genomics-resources/fastq/SRR1919605_2.fastq.gz",
  "referenceS3Path": "s3://aws-batch-genomics-resources/reference/hg38.fa",
  "resultsS3Path": "s3://<bucket>/genomic-workflow/results",
  "sampleId": "NA12878_states_1",
  "workingDir": "/scratch",
  "isaac": {
    "jobDefinition": "isaac-myenv:1",
    "jobQueue": "arn:aws:batch:us-east-1:<account-id>:job-queue/highPriority-myenv",
    "referenceS3Path": "s3://aws-batch-genomics-resources/reference/isaac/"
  },
  "samtoolsStats": {
    "jobDefinition": "samtools_stats-myenv:1",
    "jobQueue": "arn:aws:batch:us-east-1:<account-id>:job-queue/lowPriority-myenv"
  },
  "strelka": {
    "jobDefinition": "strelka-myenv:1",
    "jobQueue": "arn:aws:batch:us-east-1:<account-id>:job-queue/highPriority-myenv",
    "cmdArgs": " --exome "
  },
  "snpEff": {
    "jobDefinition": "snpeff-myenv:1",
    "jobQueue": "arn:aws:batch:us-east-1:<account-id>:job-queue/lowPriority-myenv",
    "cmdArgs": " -t hg38 "
  }
}

You are now at the stage to run your full genomic analysis. Copy the above to a new text file, change paths and ARNs to the ones that you created previously, and save your JSON input as input.states.json.

In the CLI, execute the following command. You need the ARN of the state machine that you created in the previous post:

aws stepfunctions start-execution --state-machine-arn <your-state-machine-arn> --input file://input.states.json

Your analysis has now started. By using Spot Instances with AWS Batch, you can quickly scale out your workflows while concurrently optimizing for cost. While this is not guaranteed, most executions of the workflows presented here should cost under $1 for a full analysis.

Monitoring the execution

The output from the above CLI command gives you the ARN that describes the specific execution. Copy that and navigate to the Step Functions console. Select the state machine that you created previously and paste the ARN into the search bar.

The screen shows information about your specific execution. On the left, you see where your execution currently is in the workflow.

As the execution progresses, you can see that your workflow has successfully completed the alignment job and moved on to the subsequent steps, which are variant calling and generating quality information about your sample.

You can also navigate to the AWS Batch console and see the progress of all of your jobs reflected there as well.
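
If you prefer to monitor from the command line instead, here is a minimal boto3 sketch that polls the execution status and prints the most recent events; replace the execution ARN with the one returned by the start-execution call above:

import time
import boto3

sfn = boto3.client('stepfunctions', region_name='us-east-1')
execution_arn = '<execution-arn from the start-execution output>'

# Poll until the execution leaves the RUNNING state.
while True:
    status = sfn.describe_execution(executionArn=execution_arn)['status']
    print('Execution status:', status)
    if status != 'RUNNING':
        break
    time.sleep(60)

# The most recent history events show which states have executed.
history = sfn.get_execution_history(
    executionArn=execution_arn, maxResults=10, reverseOrder=True)
for event in history['events']:
    print(event['timestamp'], event['type'])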

Finally, after your workflow has completed successfully, check out the S3 path to which you wrote all of your files. If you run an aws s3 ls --recursive command on the S3 results path specified in the input to your state machine execution, you should see something similar to the following:

2017-05-02 13:46:32 6475144340 genomic-workflow/results/NA12878_run1/bam/sorted.bam
2017-05-02 13:46:34    7552576 genomic-workflow/results/NA12878_run1/bam/sorted.bam.bai
2017-05-02 13:46:32         45 genomic-workflow/results/NA12878_run1/bam/sorted.bam.md5
2017-05-02 13:53:20      68769 genomic-workflow/results/NA12878_run1/stats/bam_stats.dat
2017-05-02 14:05:12        100 genomic-workflow/results/NA12878_run1/vcf/stats/runStats.tsv
2017-05-02 14:05:12        359 genomic-workflow/results/NA12878_run1/vcf/stats/runStats.xml
2017-05-02 14:05:12  507577928 genomic-workflow/results/NA12878_run1/vcf/variants/genome.S1.vcf.gz
2017-05-02 14:05:12     723144 genomic-workflow/results/NA12878_run1/vcf/variants/genome.S1.vcf.gz.tbi
2017-05-02 14:05:12  507577928 genomic-workflow/results/NA12878_run1/vcf/variants/genome.vcf.gz
2017-05-02 14:05:12     723144 genomic-workflow/results/NA12878_run1/vcf/variants/genome.vcf.gz.tbi
2017-05-02 14:05:12   30783484 genomic-workflow/results/NA12878_run1/vcf/variants/variants.vcf.gz
2017-05-02 14:05:12    1566596 genomic-workflow/results/NA12878_run1/vcf/variants/variants.vcf.gz.tbi

Modifications to the workflow

You have now built and run your genomics workflow. While diving deep into modifications to this architecture is beyond the scope of these posts, we wanted to leave you with several suggestions for how you might modify this workflow to satisfy additional business requirements.

  • Job tracking with Amazon DynamoDB
    In many cases, such as if you are offering Genomics-as-a-Service, you might want to track the state of your jobs with DynamoDB to get fine-grained records of how your jobs are running. This way, you can easily identify the cost of individual jobs and workflows that you run.
  • Resuming from failure
    Both AWS Batch and Step Functions natively support job retries and can cover many of the standard cases where a job might be interrupted. There may be cases, however, where your workflow fails in a way that is unpredictable. In this case, you can use custom error handling with AWS Step Functions to build a workflow that is even more resilient. You can also build fail states into your state machine to fail at any point, such as when a batch job fails after a certain number of retries; see the sketch after this list.
  • Invoking Step Functions from Amazon API Gateway
    You can use API Gateway to build an API that acts as a "front door" to Step Functions. You can create a POST method that contains the input JSON to feed into the state machine you built. For more information, see the Implementing Serverless Manual Approval Steps in AWS Step Functions and Amazon API Gateway blog post.
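
As an example of the error handling mentioned above, here is a sketch of what native retries and a fail state could look like on the SubmitJob task from the building block earlier in this post, expressed as Python dictionaries. The retry values and the AnalysisFailed state name are illustrative choices, not values from the repository:

submit_job_state = {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:<account-id>:function:batchSubmitJob",
    # Retry transient task failures twice, 30 seconds apart, doubling the interval.
    "Retry": [
        {
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 30,
            "MaxAttempts": 2,
            "BackoffRate": 2.0
        }
    ],
    # If the retries are exhausted, route to an explicit failure state.
    "Catch": [
        {
            "ErrorEquals": ["States.ALL"],
            "Next": "AnalysisFailed"
        }
    ],
    "Next": "GetJobStatus"
}

analysis_failed_state = {
    "Type": "Fail",
    "Error": "AnalysisFailed",
    "Cause": "A batch job failed after exhausting its retries."
}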

Conclusion

While the approach we have demonstrated in this series has been focused on genomics, it is important to note that this can be generalized to nearly any high-throughput batch workload. We hope that you have found the information useful and that it can serve as a jump-start to building your own batch workloads on AWS with native AWS services.

For more information about how AWS can enable your genomics workloads, be sure to check out the AWS Genomics page.

Other posts in this four-part series:

  • Part 1: Introduction
  • Part 2: Job Layer
  • Part 3: Batch Layer

Please leave any questions and comments below.