Tag Archives: genomics

Building Simpler Genomics Workflows on AWS Step Functions

Post Syndicated from Christie Gifrin original https://aws.amazon.com/blogs/compute/building-simpler-genomics-workflows-on-aws-step-functions/

This post is courtesy of Ryan Ulaszek, AWS Genomics Partner Solutions Architect and Aaron Friedman, AWS Healthcare and Life Sciences Partner Solutions Architect

In 2017, we published a four part blog series on how to build a genomics workflow on AWS. In part 1, we introduced a general architecture highlighting three common layers: job, batch and workflow.  In part 2, we described building the job layer with Docker and Amazon Elastic Container Registry (Amazon ECR).  In part 3, we tackled the batch layer and built a batch engine using AWS Batch.  In part 4, we built out the workflow layer using AWS Step Functions and AWS Lambda.

Since then, we’ve worked with many AWS customers and APN partners to implement this solution in genomics as well as in other workloads-of-interest. Today, we wanted to highlight a new feature in Step Functions that simplifies how customers and partners can build high-throughput genomics workflows on AWS.

Step Functions now supports native integration with AWS Batch, which simplifies how you can create an AWS Batch state that submits an asynchronous job and waits for that job to finish.

Before, you needed to build a state machine building block that submitted a job to AWS Batch, and then polled and checked its execution. Now, you can just submit the job to AWS Batch using the new AWS Batch task type.  Step Functions waits to proceed until the job is completed. This reduces the complexity of your state machine and makes it easier to build a genomics workflow with asynchronous AWS Batch steps.

The new integrations include support for the following API actions:

  • AWS Batch SubmitJob
  • Amazon SNS Publish
  • Amazon SQS SendMessage
  • Amazon ECS RunTask
  • AWS Fargate RunTask
  • Amazon DynamoDB
    • PutItem
    • GetItem
    • UpdateItem
    • DeleteItem
  • Amazon SageMaker
    • CreateTrainingJob
    • CreateTransformJob
  • AWS Glue
    • StartJobRun

You can also pass parameters to the service API.  To use the new integrations, the role that you assume when running a state machine needs to have the appropriate permissions.  For more information, see the AWS Step Functions Developer Guide.

Using a job status poller

In our 2017 post series, we created a job poller “pattern” with two separate Lambda functions. When the job finishes, the state machine proceeds to the next step and operates according to the necessary business logic.  This is a useful pattern to manage asynchronous jobs when a direct integration is unavailable.

The steps in this building block state machine are as follows:

  1. A job is submitted through a Lambda function.
  2. The state machine queries the AWS Batch API for the job status in another Lambda function.
  3. The job status is checked to see if the job has completed.  If the job status equals SUCCESS, the final job status is logged. If the job status equals FAILED, the execution of the state machine ends. In all other cases, wait 30 seconds and go back to Step 2.

Both of the Submit Job and Get Job Lambda functions are available as example Lambda functions in the console.  The job status poller is available in the Step Functions console as a sample project.

Here is the JSON representing this state machine.

{
  "Comment": "A simple example that submits a job to AWS Batch",
  "StartAt": "SubmitJob",
  "States": {
    "SubmitJob": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:<account-id>::function:batchSubmitJob",
      "Next": "GetJobStatus"
    },
    "GetJobStatus": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:<account-id>:function:batchGetJobStatus",
      "Next": "CheckJobStatus",
      "InputPath": "$",
      "ResultPath": "$.status"
    },
    "CheckJobStatus": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.status",
          "StringEquals": "FAILED",
          "End": true
        },
        {
          "Variable": "$.status",
          "StringEquals": "SUCCEEDED",
          "Next": "GetFinalJobStatus"
        }
      ],
      "Default": "Wait30Seconds"
    },
    "Wait30Seconds": {
      "Type": "Wait",
      "Seconds": 30,
      "Next": "GetJobStatus"
    },
    "GetFinalJobStatus": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:<account-id>:function:batchGetJobStatus",
      "End": true
    }
  }
}

With Step Functions Service Integrations

With Step Functions service integrations, it is now simpler to submit and wait for an AWS Batch job, or any other supported service.

The following code block is the JSON representing the new state machine for an asynchronous batch job. If you are familiar with the AWS Batch SubmitJob API action, you may notice that the parameters are consistent with what you would see in that API call. You can also use the optional AWS Batch parameters in addition to JobDefinition, JobName, and JobQueue.

{
 "StartAt": "RunBatchJob",
 "States": {
     "RunIsaacJob":{
     "Type":"Task",
     "Resource":"arn:aws:states:::batch:submitJob.sync",
     "Parameters":{
        "JobDefinition":"Isaac",
        "JobName.$":"$.isaac.JobName",
        "JobQueue":"HighPriority",
        "Parameters.$": "$.isaac"
     },
     "TimeoutSeconds": 900,
     "HeartbeatSeconds": 60,
     "Next":"Parallel",
     "InputPath":"$",
     "ResultPath":"$.status",
     "Retry" : [
        {
          "ErrorEquals": [ "States.Timeout" ],
          "IntervalSeconds": 3,
          "MaxAttempts": 2,
          "BackoffRate": 1.5
        }
     ]
  }
}

Here is an example of the workflow input JSON.  Pass all of the container parameters that were being constructed in the submit job Lambda function.

{
  "isaac": {
    "WorkingDir": "/scratch",
    "JobName": "isaac-1",
    "FastQ1S3Path": "s3://aws-batch-genomics-resources/fastq/SRR1919605_1.fastq.gz",
    "BAMS3FolderPath": "s3://aws-batch-genomics-resources/fastq/SRR1919605_2.fastq.gz",
    "FastQ2S3Path": "s3://bccn-genome-data/fastq/NIST7035_R2_trimmed.fastq.gz",
    "ReferenceS3Path": "s3://aws-batch-genomics-resources/reference/isaac/"
  }
}

When you deploy the job definition, add the command attribute that was previously being constructed in the Lambda function launching the AWS Batch job.

IsaacJobDefinition:
    Type: AWS::Batch::JobDefinition
    Properties:
      JobDefinitionName: "Isaac"
      Type: container
      RetryStrategy:
        Attempts: 1
      Parameters:
        BAMS3FolderPath: !Sub "s3://${JobResultsBucket}/NA12878_states_1/bam"
        FastQ1S3Path: "s3://aws-batch-genomics-resources/fastq/SRR1919605_1.fastq.gz"
        FastQ2S3Path: "s3://aws-batch-genomics-resources/fastq/SRR1919605_2.fastq.gz"
        ReferenceS3Path: "s3://aws-batch-genomics-resources/reference/isaac/"
        WorkingDir: "/scratch"
      ContainerProperties:
        Image: "rulaszek/isaac"
        Vcpus: 32
        Memory: 80000
        JobRoleArn:
          Fn::ImportValue: !Sub "${RoleStackName}:ECSTaskRole"
        Command:
          - "--bam_s3_folder_path"
          - "Ref::BAMS3FolderPath"
          - "--fastq1_s3_path"
          - "Ref::FastQ1S3Path"
          - "--fastq2_s3_path"
          - "Ref::FastQ2S3Path"
          - "--reference_s3_path"
          - "Ref::ReferenceS3Path"
          - "--working_dir"
          - "Ref::WorkingDir"
        MountPoints:
          - ContainerPath: "/scratch"
            ReadOnly: false
            SourceVolume: docker_scratch
        Volumes:
          - Name: docker_scratch
            Host:
              SourcePath: "/docker_scratch"

The key-value parameters passed into the workflow are mapped using Parameters.$ to the values in the job definition using the keys.  Value substitutions do take place. The Docker run looks like the following:

docker run <isaac_container_uri> --bam_s3_folder_path s3://batch-genomics-pipeline-jobresultsbucket-1kzdu216m2b0k/NA12878_states_3/bam
                                 --fastq1_s3_path s3://aws-batch-genomics-resources/fastq/SRR1919605_1.fastq.gz
                                 --fastq2_s3_path s3://aws-batch-genomics-resources/fastq/SRR1919605_2.fastq.gz 
                                 --reference_s3_path s3://aws-batch-genomics-resources/reference/isaac/ 
                                 --working_dir /scratch

Genomics workflow: Before and after

Overall, connectors dramatically simplify your genomics workflow.  The following workflow is a simple genomics secondary analysis pipeline, which we highlighted in our original post series.

The first step aligns the sample against a reference genome.  When alignment is complete, variant calling and QA metrics are calculated in two parallel steps.  When variant calling is complete, variant annotation is performed.  Before, our genomics workflow looked like this:

Now it looks like this:

Here is the new workflow JSON:

{
   "Comment":"A simple genomics secondary-analysis workflow",
   "StartAt":"RunIsaacJob",
   "States":{
      "RunIsaacJob":{
         "Type":"Task",
         "Resource":"arn:aws:states:::batch:submitJob.sync",
         "Parameters":{
            "JobDefinition":"Isaac",
            "JobName.$":"$.isaac.JobName",
            "JobQueue":"HighPriority",
            "Parameters.$": "$.isaac"
         },
         "TimeoutSeconds": 900,
         "HeartbeatSeconds": 60,
         "Next":"Parallel",
         "InputPath":"$",
         "ResultPath":"$.status",
         "Retry" : [
            {
              "ErrorEquals": [ "States.Timeout" ],
              "IntervalSeconds": 3,
              "MaxAttempts": 2,
              "BackoffRate": 1.5
            }
         ]
      },
      "Parallel":{
         "Type":"Parallel",
         "Next":"FinalState",
         "Branches":[
            {
               "StartAt":"RunStrelkaJob",
               "States":{
                  "RunStrelkaJob":{
                     "Type":"Task",
                     "Resource":"arn:aws:states:::batch:submitJob.sync",
                     "Parameters":{
                        "JobDefinition":"Strelka",
                        "JobName.$":"$.strelka.JobName",
                        "JobQueue":"HighPriority",
                        "Parameters.$": "$.strelka"
                     },
                     "TimeoutSeconds": 900,
                     "HeartbeatSeconds": 60,
                     "Next":"RunSnpEffJob",
                     "InputPath":"$",
                     "ResultPath":"$.status",
                     "Retry" : [
                        {
                          "ErrorEquals": [ "States.Timeout" ],
                          "IntervalSeconds": 3,
                          "MaxAttempts": 2,
                          "BackoffRate": 1.5
                        }
                     ]
                  },
                  "RunSnpEffJob":{
                     "Type":"Task",
                     "Resource":"arn:aws:states:::batch:submitJob.sync",
                     "Parameters":{
                        "JobDefinition":"SNPEff",
                        "JobName.$":"$.snpeff.JobName",
                        "JobQueue":"HighPriority",
                        "Parameters.$": "$.snpeff"
                     },
                     "TimeoutSeconds": 900,
                     "HeartbeatSeconds": 60,
                     "Retry" : [
                        {
                          "ErrorEquals": [ "States.Timeout" ],
                          "IntervalSeconds": 3,
                          "MaxAttempts": 2,
                          "BackoffRate": 1.5
                        }
                     ],
                     "End":true
                  }
               }
            },
            {
               "StartAt":"RunSamtoolsStatsJob",
               "States":{
                  "RunSamtoolsStatsJob":{
                     "Type":"Task",
                     "Resource":"arn:aws:states:::batch:submitJob.sync",
                     "Parameters":{
                        "JobDefinition":"SamtoolsStats",
                        "JobName.$":"$.samtools.JobName",
                        "JobQueue":"HighPriority",
                        "Parameters.$": "$.samtools"
                     },
                     "TimeoutSeconds": 900,
                     "HeartbeatSeconds": 60,
                     "End":true,
                     "Retry" : [
                        {
                          "ErrorEquals": [ "States.Timeout" ],
                          "IntervalSeconds": 3,
                          "MaxAttempts": 2,
                          "BackoffRate": 1.5
                        }
                     ]
                  }
               }
            }
         ]
      },
      "FinalState":{
         "Type":"Pass",
         "End":true
      }
   }
}

Here is the new Amazon CloudFormation template for deploying the AWS Batch job definitions for each tool:

AWSTemplateFormatVersion: 2010-09-09

Description: Batch job definitions for batch genomics

Parameters:
  RoleStackName:
    Description: "Stack that deploys roles for genomic workflow"
    Type: String
  VPCStackName:
    Description: "Stack that deploys vps for genomic workflow"
    Type: String
  JobResultsBucket:
    Description: "Bucket that holds workflow job results"
    Type: String

Resources:
  IsaacJobDefinition:
    Type: AWS::Batch::JobDefinition
    Properties:
      JobDefinitionName: "Isaac"
      Type: container
      RetryStrategy:
        Attempts: 1
      Parameters:
        BAMS3FolderPath: !Sub "s3://${JobResultsBucket}/NA12878_states_1/bam"
        FastQ1S3Path: "s3://aws-batch-genomics-resources/fastq/SRR1919605_1.fastq.gz"
        FastQ2S3Path: "s3://aws-batch-genomics-resources/fastq/SRR1919605_2.fastq.gz"
        ReferenceS3Path: "s3://aws-batch-genomics-resources/reference/isaac/"
        WorkingDir: "/scratch"
      ContainerProperties:
        Image: "rulaszek/isaac"
        Vcpus: 32
        Memory: 80000
        JobRoleArn:
          Fn::ImportValue: !Sub "${RoleStackName}:ECSTaskRole"
        Command:
          - "--bam_s3_folder_path"
          - "Ref::BAMS3FolderPath"
          - "--fastq1_s3_path"
          - "Ref::FastQ1S3Path"
          - "--fastq2_s3_path"
          - "Ref::FastQ2S3Path"
          - "--reference_s3_path"
          - "Ref::ReferenceS3Path"
          - "--working_dir"
          - "Ref::WorkingDir"
        MountPoints:
          - ContainerPath: "/scratch"
            ReadOnly: false
            SourceVolume: docker_scratch
        Volumes:
          - Name: docker_scratch
            Host:
              SourcePath: "/docker_scratch"

  StrelkaJobDefinition:
    Type: AWS::Batch::JobDefinition
    Properties:
      JobDefinitionName: "Strelka"
      Type: container
      RetryStrategy:
        Attempts: 1
      Parameters:
        BAMS3Path: !Sub "s3://${JobResultsBucket}/NA12878_states_1/bam/sorted.bam"
        BAIS3Path: !Sub "s3://${JobResultsBucket}/NA12878_states_1/bam/sorted.bam.bai"
        ReferenceS3Path: "s3://aws-batch-genomics-resources/reference/hg38.fa"
        ReferenceIndexS3Path: "s3://aws-batch-genomics-resources/reference/hg38.fa.fai"
        VCFS3Path: !Sub "s3://${JobResultsBucket}/NA12878_states_1/vcf"
        WorkingDir: "/scratch"
      ContainerProperties:
        Image: "rulaszek/strelka"
        Vcpus: 32
        Memory: 32000
        JobRoleArn:
          Fn::ImportValue: !Sub "${RoleStackName}:ECSTaskRole"
        Command:
          - "--bam_s3_path"
          - "Ref::BAMS3Path"
          - "--bai_s3_path"
          - "Ref::BAIS3Path"
          - "--reference_s3_path"
          - "Ref::ReferenceS3Path"
          - "--reference_index_s3_path"
          - "Ref::ReferenceIndexS3Path"
          - "--vcf_s3_path"
          - "Ref::VCFS3Path"
          - "--working_dir"
          - "Ref::WorkingDir"
        MountPoints:
          - ContainerPath: "/scratch"
            ReadOnly: false
            SourceVolume: docker_scratch
        Volumes:
          - Name: docker_scratch
            Host:
              SourcePath: "/docker_scratch"

  SnpEffJobDefinition:
    Type: AWS::Batch::JobDefinition
    Properties:
      JobDefinitionName: "SNPEff"
      Type: container
      RetryStrategy:
        Attempts: 1
      Parameters:
        VCFS3Path: !Sub "s3://${JobResultsBucket}/NA12878_states_1/vcf/variants/genome.vcf.gz"
        AnnotatedVCFS3Path: !Sub "s3://${JobResultsBucket}/NA12878_states_1/vcf/genome.anno.vcf"
        CommandArgs: " -t hg38 "
        WorkingDir: "/scratch"
      ContainerProperties:
        Image: "rulaszek/snpeff"
        Vcpus: 4
        Memory: 10000
        JobRoleArn:
          Fn::ImportValue: !Sub "${RoleStackName}:ECSTaskRole"
        Command:
          - "--annotated_vcf_s3_path"
          - "Ref::AnnotatedVCFS3Path"
          - "--vcf_s3_path"
          - "Ref::VCFS3Path"
          - "--cmd_args"
          - "Ref::CommandArgs"
          - "--working_dir"
          - "Ref::WorkingDir"
        MountPoints:
          - ContainerPath: "/scratch"
            ReadOnly: false
            SourceVolume: docker_scratch
        Volumes:
          - Name: docker_scratch
            Host:
              SourcePath: "/docker_scratch"

  SamtoolsStatsJobDefinition:
    Type: AWS::Batch::JobDefinition
    Properties:
      JobDefinitionName: "SamtoolsStats"
      Type: container
      RetryStrategy:
        Attempts: 1
      Parameters:
        ReferenceS3Path: "s3://aws-batch-genomics-resources/reference/hg38.fa"
        BAMS3Path: !Sub "s3://${JobResultsBucket}/NA12878_states_1/bam/sorted.bam"
        BAMStatsS3Path: !Sub "s3://${JobResultsBucket}/NA12878_states_1/bam/sorted.bam.stats"
        WorkingDir: "/scratch"
      ContainerProperties:
        Image: "rulaszek/samtools-stats"
        Vcpus: 4
        Memory: 10000
        JobRoleArn:
          Fn::ImportValue: !Sub "${RoleStackName}:ECSTaskRole"
        Command:
          - "--bam_s3_path"
          - "Ref::BAMS3Path"
          - "--bam_stats_s3_path"
          - "Ref::BAMStatsS3Path"
          - "--reference_s3_path"
          - "Ref::ReferenceS3Path"
          - "--working_dir"
          - "Ref::WorkingDir"
        MountPoints:
          - ContainerPath: "/scratch"
            ReadOnly: false
            SourceVolume: docker_scratch
        Volumes:
          - Name: docker_scratch
            Host:
              SourcePath: "/docker_scratch"

Here is the new CloudFormation script that deploys the new workflow:

AWSTemplateFormatVersion: 2010-09-09

Description: State Machine for batch benomics

Parameters:
  RoleStackName:
    Description: "Stack that deploys roles for genomic workflow"
    Type: String
  VPCStackName:
    Description: "Stack that deploys vps for genomic workflow"
    Type: String

Resources:
  # S3
  GenomicWorkflow:
    Type: AWS::StepFunctions::StateMachine
    Properties:
      RoleArn:
        Fn::ImportValue: !Sub "${RoleStackName}:StatesExecutionRole"
      DefinitionString: !Sub |-
        {
           "Comment":"A simple example that submits a job to AWS Batch",
           "StartAt":"RunIsaacJob",
           "States":{
              "RunIsaacJob":{
                 "Type":"Task",
                 "Resource":"arn:aws:states:::batch:submitJob.sync",
                 "Parameters":{
                    "JobDefinition":"Isaac",
                    "JobName.$":"$.isaac.JobName",
                    "JobQueue":"HighPriority",
                    "Parameters.$": "$.isaac"
                 },
                 "TimeoutSeconds": 900,
                 "HeartbeatSeconds": 60,
                 "Next":"Parallel",
                 "InputPath":"$",
                 "ResultPath":"$.status",
                 "Retry" : [
                    {
                      "ErrorEquals": [ "States.Timeout" ],
                      "IntervalSeconds": 3,
                      "MaxAttempts": 2,
                      "BackoffRate": 1.5
                    }
                 ]
              },
              "Parallel":{
                 "Type":"Parallel",
                 "Next":"FinalState",
                 "Branches":[
                    {
                       "StartAt":"RunStrelkaJob",
                       "States":{
                          "RunStrelkaJob":{
                             "Type":"Task",
                             "Resource":"arn:aws:states:::batch:submitJob.sync",
                             "Parameters":{
                                "JobDefinition":"Strelka",
                                "JobName.$":"$.strelka.JobName",
                                "JobQueue":"HighPriority",
                                "Parameters.$": "$.strelka"
                             },
                             "TimeoutSeconds": 900,
                             "HeartbeatSeconds": 60,
                             "Next":"RunSnpEffJob",
                             "InputPath":"$",
                             "ResultPath":"$.status",
                             "Retry" : [
                                {
                                  "ErrorEquals": [ "States.Timeout" ],
                                  "IntervalSeconds": 3,
                                  "MaxAttempts": 2,
                                  "BackoffRate": 1.5
                                }
                             ]
                          },
                          "RunSnpEffJob":{
                             "Type":"Task",
                             "Resource":"arn:aws:states:::batch:submitJob.sync",
                             "Parameters":{
                                "JobDefinition":"SNPEff",
                                "JobName.$":"$.snpeff.JobName",
                                "JobQueue":"HighPriority",
                                "Parameters.$": "$.snpeff"
                             },
                             "TimeoutSeconds": 900,
                             "HeartbeatSeconds": 60,
                             "Retry" : [
                                {
                                  "ErrorEquals": [ "States.Timeout" ],
                                  "IntervalSeconds": 3,
                                  "MaxAttempts": 2,
                                  "BackoffRate": 1.5
                                }
                             ],
                             "End":true
                          }
                       }
                    },
                    {
                       "StartAt":"RunSamtoolsStatsJob",
                       "States":{
                          "RunSamtoolsStatsJob":{
                             "Type":"Task",
                             "Resource":"arn:aws:states:::batch:submitJob.sync",
                             "Parameters":{
                                "JobDefinition":"SamtoolsStats",
                                "JobName.$":"$.samtools.JobName",
                                "JobQueue":"HighPriority",
                                "Parameters.$": "$.samtools"
                             },
                             "TimeoutSeconds": 900,
                             "HeartbeatSeconds": 60,
                             "End":true,
                             "Retry" : [
                                {
                                  "ErrorEquals": [ "States.Timeout" ],
                                  "IntervalSeconds": 3,
                                  "MaxAttempts": 2,
                                  "BackoffRate": 1.5
                                }
                             ]
                          }
                       }
                    }
                 ]
              },
              "FinalState":{
                 "Type":"Pass",
                 "End":true
              }
           }
        }

Outputs:
  GenomicsWorkflowArn:
    Description: GenomicWorkflow ARN
    Value: !Ref GenomicWorkflow
  StackName:
    Description: StackName
    Value: !Sub ${AWS::StackName}

Conclusion

AWS Step Functions service integrations are a great way to simplify creating complex workflows with asynchronous steps. While we highlighted the use case with AWS Batch today, there are many other ways that healthcare and life sciences customers can use this new feature, such as with message processing.

For more information about how AWS can enable your genomics workloads, be sure to check out the AWS Genomics page.

We’ve updated the open-source project to take advantage of the new AWS Batch integration in Step Functions.  You can find the changes aws-batch-genomics/tree/v2.0.0 folder.

Original posts in this four-part series:

Happy coding!

Using Cromwell with AWS Batch

Post Syndicated from Josh Rad original https://aws.amazon.com/blogs/compute/using-cromwell-with-aws-batch/

Contributed by W. Lee Pang and Emil Lerch, WWPS Professional Services

DNA is often referred to as the “source code of life.” All living cells contain long chains of deoxyribonucleic acid that encode instructions on how they are constructed and behave in their surroundings. Genomics is the study of the structure and function of DNA at the molecular level. It has recently shown immense potential to provide improved detection, diagnosis, and treatment of human diseases.

Continuous improvements in genome sequencing technologies have accelerated genomics research by providing unprecedented speed, accuracy, and quantity of DNA sequence data. In fact, the rate of sequencing efficiency has been shown to outpace Moore’s law. Processing this influx of genomic data is ideally aligned with the power and scalability of cloud computing.

Genomic data processing typically uses a wide assortment of specialized bioinformatics tools, like sequence alignment algorithms, variant callers, and statistical analysis methods. These tools are run in sequence as workflow pipelines that can range from a couple of steps to many long toolchains executing in parallel.

Traditionally, bioinformaticians and genomics scientists relied on Bash, Perl, or Python scripts to orchestrate their pipelines. As pipelines have gotten more complex, and maintainability and reproducibility have become standard requirements in science, the need for specialized orchestration tooling and portable workflow definitions has grown significantly.

What is Cromwell?

The Broad Institute’s Cromwell is purpose-built for this need. It is a workflow execution engine for orchestrating command line and containerized tools. Most importantly, it is the engine that drives the GATK Best Practices genome analysis pipeline.

Workflows for Cromwell are defined using the Workflow Definition Language (WDL – pronounced “widdle”), a flexible meta-scripting language that allows researchers to focus on the pieces of their workflow that matter. That’s the tools for each step and their respective inputs and outputs, and not the plumbing in between.

Genomics data is not small (on the order of TBs-PBs for one experiment!), so processing it usually requires significant computing scale, like HPC clusters and cloud computing. Cromwell has previously enabled this with support for many backends such as Spark, and HPC frameworks like Sun GridEngine and SLURM.

AWS and Cromwell

We are excited to announce that Cromwell now supports AWS! In this post, we go over how to configure Cromwell on AWS and get started running genomics pipelines in the cloud.

In a nutshell, the AWS backend for Cromwell is a layer that communicates with AWS Batch. Why AWS Batch? As stated before, genomics analysis pipelines are composed of many different tools. Each of these tools can have specific computing requirements. Operations like genome alignment can be memory-intensive, whereas joint genotyping may be compute-heavy.

AWS Batch dynamically provisions the optimal quantity and type of compute resources (for example, CPU or memory-optimized instances). Provisioning is based on the volume and specific resource requirements of the batch jobs submitted. This means that each step of a genomics workflow gets the most ideal instance to run on.

The AWS backend translates Cromwell task definitions into batch job definitions and submits them via API calls to a user-specified batch queue. Runtime parameters such as the container image to use, and resources like desired vCPUs and memory are also translated from the WDL task and transmitted to the batch job. A number of environment variables are automatically set on the job to support data localization and de-localization to the job instance. Ultimately, scientists and genomics researchers should be familiar with the backend method to submit jobs to AWS Batch because it uses their existing WDL files and research processes.

Getting started

To get started using Cromwell with AWS, create a custom AMI. This is necessary to ensure that the AMI is private to the account, encrypted, and has tooling specific to genomics workloads and Cromwell.

One feature of this tooling is the automatic creation and attachment of additional Amazon Elastic Block Store (Amazon EBS) capacity as additional data is copied onto the EC2 instance for processing. It also contains an ECS agent that has been customized to the needs of Cromwell, and a Cromwell Docker image responsible for interfacing the Cromwell task with Amazon S3.

After the custom AMI is created, install Cromwell on your workstation or EC2 instance. Configure an S3 bucket to hold Cromwell execution directories. For the purposes of this post, we refer to the bucket as s3-bucket-name. Lastly, go to the AWS Batch console, and create a job queue. Save the ARN of the queue, as this is needed later.

To get up these resources with a single click, this link provides a set of AWS CloudFormation templates that gets all the needed infrastructure running in minutes.

The next step is to configure Cromwell to work with AWS Batch and your newly created S3 bucket. Use the sample hello.wdl and hello.inputs files from the Cromwell AWS backend tutorial. You also need a custom configuration file so that Cromwell can interact with AWS Batch.

The following sample file can be used on an EC2 instance with the appropriate IAM role attached, or on a developer workstation with the AWS CLI configured. Keep in mind that you must replace <s3-bucket-name> in the configuration file with the appropriate bucket name. Also, replace “your ARN here” with the ARN of the job queue that you created earlier.

// aws.conf

include required(classpath("application"))

aws {

    application-name = "cromwell"
    
    auths = [
        {
         name = "default"
         scheme = "default"
        }
    ]
    
    region = "default"
    // uses region from ~/.aws/config set by aws configure command,
    // or us-east-1 by default
}

engine {
     filesystems {
         s3 {
            auth = "default"
         }
    }
}

backend {
     default = "AWSBATCH"
     providers {
         AWSBATCH {
             actor-factory = "cromwell.backend.impl.aws.AwsBatchBackendLifecycleActorFactory"
             config {
                 // Base bucket for workflow executions
                 root = "s3://<s3-bucket-name>/cromwell-execution"
                
                 // A reference to an auth defined in the `aws` stanza at the top. This auth is used to create
                 // Jobs and manipulate auth JSONs.
                 auth = "default"

                 numSubmitAttempts = 3
                 numCreateDefinitionAttempts = 3

                 concurrent-job-limit = 16
                
                 default-runtime-attributes {
                    queueArn: "<your ARN here>"
                 }
                
                 filesystems {
                     s3 {
                         // A reference to a potentially different auth for manipulating files via engine functions.
                         auth = "default"
                     }
                 }
             }
         }
     }
}

Now, you can run your workflow. The following command runs Hello World, and ensures that everything is connected properly:

$ java -Dconfig.file=aws.conf -jar cromwell-34.jar run hello.wdl -i hello.inputs

After the workflow has run, your workflow logs should report the workflow outputs.

[info] SingleWorkflowRunnerActor workflow finished with status 'Succeeded'.
{
 "outputs": {
    "wf_hello.hello.message": "Hello World! Welcome to Cromwell . . . on AWS!"
 },
 "id": "08213b40-bcf5-470d-b8b7-1d1a9dccb10e"
}

You also see your job in the “succeeded” section of the AWS Batch Jobs console.

After the environment is configured properly, other Cromwell WDL files can be used as usual.

Conclusion
With AWS Batch, a customized AMI instance, and Cromwell workflow definitions, AWS provides a simple solution to process genomics data easily. We invite you to incorporate this into your automated pipeline.

Machine Learning with AWS Fargate and AWS CodePipeline at Corteva Agriscience

Post Syndicated from Nathan Taber original https://aws.amazon.com/blogs/compute/machine-learning-with-aws-fargate-and-aws-codepipeline-at-corteva-agriscience/

This post contributed by Duke Takle and Kevin Hayes at Corteva Agriscience

At Corteva Agriscience, the agricultural division of DowDuPont, our purpose is to enrich the lives of those who produce and those who consume, ensuring progress for generations to come. As a global business, we support a network of research stations to improve agricultural productivity around the world

As analytical technology advances the volume of data, as well as the speed at which it must be processed, meeting the needs of our scientists poses unique challenges. Corteva Cloud Engineering teams are responsible for collaborating with and enabling software developers, data scientists, and others. Their work allows Corteva research and development to become the most efficient innovation machine in the agricultural industry.

Recently, our Systems and Innovations for Breeding and Seed Products organization approached the Cloud Engineering team with the challenge of how to deploy a novel machine learning (ML) algorithm for scoring genetic markers. The solution would require supporting labs across six continents in a process that is run daily. This algorithm replaces time-intensive manual scoring of genotypic assays with a robust, automated solution. When examining the solution space for this challenge, the main requirements for our solution were global deployability, application uptime, and scalability.

Before the implementing this algorithm in AWS, ML autoscoring was done as a proof of concept using pre-production instances on premises. It required several technicians to continue to process assays by hand. After implementing on AWS, we have enabled those technicians to be better used in other areas, such as technology development.

Solutions Considered

A RESTful web service seemed to be an obvious way to solve the problem presented. AWS has several patterns that could implement a RESTful web service, such as Amazon API Gateway, AWS Lambda, Amazon EC2, AWS Auto Scaling, Amazon Elastic Container Service (ECS) using the EC2 launch type, and AWS Fargate.

At the time, the project came into our backlog, we had just heard of Fargate. Fargate does have a few limitations (scratch storage, CPU, and memory), none of which were a problem. So EC2, Auto Scaling, and ECS with the EC2 launch type were ruled out because they would have introduced unneeded complexity. The unneeded complexity is mostly around management of EC2 instances to either run the application or the container needed for the solution.

When the project came into our group, there had been a substantial proof-of-concept done with a Docker container. While we are strong API Gateway and Lambda proponents, there is no need to duplicate processes or services that AWS provides. We also knew that we needed to be able to move fast. We wanted to put the power in the hands of our developers to focus on building out the solution. Additionally, we needed something that could scale across our organization and provide some rationalization in how we approach these problems. AWS services, such as Fargate, AWS CodePipeline, and AWS CloudFormation, made that possible.

Solution Overview

Our group prefers using existing AWS services to bring a complete project to the production environment.

CI/CD Pipeline

A complete discussion of the CI/CD pipeline for the project is beyond the scope of this post. However, in broad strokes, the pipeline is:

  1. Compile some C++ code wrapped in Python, create a Python wheel, and publish it to an artifact store.
  2. Create a Docker image with that wheel installed and publish it to ECR.
  3. Deploy and test the new image to our test environment.
  4. Deploy the new image to the production environment.

Solution

As mentioned earlier, the application is a Docker container deployed with the Fargate launch type. It uses an Aurora PostgreSQL DB instance for the backend data. The application itself is only needed internally so the Application Load Balancer is created with the scheme set to “internal” and deployed into our private application subnets.

Our environments are all constructed with CloudFormation templates. Each environment is constructed in a separate AWS account and connected back to a central utility account. The infrastructure stacks export a number of useful bits like the VPC, subnets, IAM roles, security groups, etc. This scheme allows us to move projects through the several accounts without changing the CloudFormation templates, just the parameters that are fed into them.

For this solution, we use an existing VPC, set of subnets, IAM role, and ACM certificate in the us-east-1 Region. The solution CloudFormation stack describes and manages the following resources:

AWS::ECS::Cluster*
AWS::EC2::SecurityGroup
AWS::EC2::SecurityGroupIngress
AWS::Logs::LogGroup
AWS::ECS::TaskDefinition*
AWS::ElasticLoadBalancingV2::LoadBalancer
AWS::ElasticLoadBalancingV2::TargetGroup
AWS::ElasticLoadBalancingV2::Listener
AWS::ECS::Service*
AWS::ApplicationAutoScaling::ScalableTarget
AWS::ApplicationAutoScaling::ScalingPolicy
AWS::ElasticLoadBalancingV2::ListenerRule

A complete discussion of all the resources for the solution is beyond the scope of this post. However, we can explore the resource definitions of the components specific to Fargate. The following three simple segments of CloudFormation are all that is needed to create a Fargate stack: an ECS cluster, task definition, and service. More complete examples of the CloudFormation templates are linked at the end of this post, with stack creation instructions.

AWS::ECS::Cluster:

"ECSCluster": {
    "Type":"AWS::ECS::Cluster",
    "Properties" : {
        "ClusterName" : { "Ref": "clusterName" }
    }
}

The ECS Cluster resource is a simple grouping for the other ECS resources to be created. The cluster created in this stack holds the tasks and service that implement the actual solution. Finally, in the AWS Management Console, the cluster is the entry point to find info about your ECS resources.

AWS::ECS::TaskDefinition

"fargateDemoTaskDefinition": {
    "Type": "AWS::ECS::TaskDefinition",
    "Properties": {
        "ContainerDefinitions": [
            {
                "Essential": "true",
                "Image": { "Ref": "taskImage" },
                "LogConfiguration": {
                    "LogDriver": "awslogs",
                    "Options": {
                        "awslogs-group": {
                            "Ref": "cloudwatchLogsGroup"
                        },
                        "awslogs-region": {
                            "Ref": "AWS::Region"
                        },
                        "awslogs-stream-prefix": "fargate-demo-app"
                    }
                },
                "Name": "fargate-demo-app",
                "PortMappings": [
                    {
                        "ContainerPort": 80
                    }
                ]
            }
        ],
        "ExecutionRoleArn": {"Fn::ImportValue": "fargateDemoRoleArnV1"},
        "Family": {
            "Fn::Join": [
                "",
                [ { "Ref": "AWS::StackName" }, "-fargate-demo-app" ]
            ]
        },
        "NetworkMode": "awsvpc",
        "RequiresCompatibilities" : [ "FARGATE" ],
        "TaskRoleArn": {"Fn::ImportValue": "fargateDemoRoleArnV1"},
        "Cpu": { "Ref": "cpuAllocation" },
        "Memory": { "Ref": "memoryAllocation" }
    }
}

The ECS Task Definition is where we specify and configure the container. Interesting things to note are the CPU and memory configuration items. It is important to note the valid combinations for CPU/memory settings, as shown in the following table.

CPUMemory
0.25 vCPU0.5 GB, 1 GB, and 2 GB
0.5 vCPUMin. 1 GB and Max. 4 GB, in 1-GB increments
1 vCPUMin. 2 GB and Max. 8 GB, in 1-GB increments
2 vCPUMin. 4 GB and Max. 16 GB, in 1-GB increments
4 vCPUMin. 8 GB and Max. 30 GB, in 1-GB increments

AWS::ECS::Service

"fargateDemoService": {
     "Type": "AWS::ECS::Service",
     "DependsOn": [
         "fargateDemoALBListener"
     ],
     "Properties": {
         "Cluster": { "Ref": "ECSCluster" },
         "DesiredCount": { "Ref": "minimumCount" },
         "LaunchType": "FARGATE",
         "LoadBalancers": [
             {
                 "ContainerName": "fargate-demo-app",
                 "ContainerPort": "80",
                 "TargetGroupArn": { "Ref": "fargateDemoTargetGroup" }
             }
         ],
         "NetworkConfiguration":{
             "AwsvpcConfiguration":{
                 "SecurityGroups": [
                     { "Ref":"fargateDemoSecuityGroup" }
                 ],
                 "Subnets":[
                    {"Fn::ImportValue": "privateSubnetOneV1"},
                    {"Fn::ImportValue": "privateSubnetTwoV1"},
                    {"Fn::ImportValue": "privateSubnetThreeV1"}
                 ]
             }
         },
         "TaskDefinition": { "Ref":"fargateDemoTaskDefinition" }
     }
}

The ECS Service resource is how we can configure where and how many instances of tasks are executed to solve our problem. In this case, we see that there are at least minimumCount instances of the task running in any of three private subnets in our VPC.

Conclusion

Deploying this algorithm on AWS using containers and Fargate allowed us to start running the application at scale with low support overhead. This has resulted in faster turnaround time with fewer staff and a concomitant reduction in cost.

“We are very excited with the deployment of Polaris, the autoscoring of the marker lab genotyping data using AWS technologies. This key technology deployment has enhanced performance, scalability, and efficiency of our global labs to deliver over 1.4 Billion data points annually to our key customers in Plant Breeding and Integrated Operations.”

Sandra Milach, Director of Systems and Innovations for Breeding and Seed Products.

We are distributing this solution to all our worldwide laboratories to harmonize data quality, and speed. We hope this enables an increase in the velocity of genetic gain to increase yields of crops for farmers around the world.

You can learn more about the work we do at Corteva at www.corteva.com.

Try it yourself:

The snippets above are instructive but not complete. We have published two repositories on GitHub that you can explore to see how we built this solution:

Note: the components in these repos do not include our production code, but they show you how this works using Amazon ECS and AWS Fargate.

EC2 Fleet – Manage Thousands of On-Demand and Spot Instances with One Request

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/ec2-fleet-manage-thousands-of-on-demand-and-spot-instances-with-one-request/

EC2 Spot Fleets are really cool. You can launch a fleet of Spot Instances that spans EC2 instance types and Availability Zones without having to write custom code to discover capacity or monitor prices. You can set the target capacity (the size of the fleet) in units that are meaningful to your application and have Spot Fleet create and then maintain the fleet on your behalf. Our customers are creating Spot Fleets of all sizes. For example, one financial service customer runs Monte Carlo simulations across 10 different EC2 instance types. They routinely make requests for hundreds of thousands of vCPUs and count on Spot Fleet to give them access to massive amounts of capacity at the best possible price.

EC2 Fleet
Today we are extending and generalizing the set-it-and-forget-it model that we pioneered in Spot Fleet with EC2 Fleet, a new building block that gives you the ability to create fleets that are composed of a combination of EC2 On-Demand, Reserved, and Spot Instances with a single API call. You tell us what you need, capacity and instance-wise, and we’ll handle all the heavy lifting. We will launch, manage, monitor and scale instances as needed, without the need for scaffolding code.

You can specify the capacity of your fleet in terms of instances, vCPUs, or application-oriented units, and also indicate how much of the capacity should be fulfilled by Spot Instances. The application-oriented units allow you to specify the relative power of each EC2 instance type in a way that directly maps to the needs of your application. All three capacity specification options (instances, vCPUs, and application-oriented units) are known as weights.

I think you’ll find a number ways this feature makes managing a fleet of instances easier, and believe that you will also appreciate the team’s near-term feature roadmap of interest (more on that in a bit).

Using EC2 Fleet
There are a number of ways that you can use this feature, whether you’re running a stateless web service, a big data cluster or a continuous integration pipeline. Today I’m going to describe how you can use EC2 Fleet for genomic processing, but this is similar to workloads like risk analysis, log processing or image rendering. Modern DNA sequencers can produce multiple terabytes of raw data each day, to process that data into meaningful information in a timely fashion you need lots of processing power. I’ll be showing you how to deploy a “grid” of worker nodes that can quickly crunch through secondary analysis tasks in parallel.

Projects in genomics can use the elasticity EC2 provides to experiment and try out new pipelines on hundreds or even thousands of servers. With EC2 you can access as many cores as you need and only pay for what you use. Prior to today, you would need to use the RunInstances API or an Auto Scaling group for the On-Demand & Reserved Instance portion of your grid. To get the best price performance you’d also create and manage a Spot Fleet or multiple Spot Auto Scaling groups with different instance types if you wanted to add Spot Instances to turbo-boost your secondary analysis. Finally, to automate scaling decisions across multiple APIs and Auto Scaling groups you would need to write Lambda functions that periodically assess your grid’s progress & backlog, as well as current Spot prices – modifying your Auto Scaling Groups and Spot Fleets accordingly.

You can now replace all of this with a single EC2 Fleet, analyzing genomes at scale for as little as $1 per analysis. In my grid, each step in in the pipeline requires 1 vCPU and 4 GiB of memory, a perfect match for M4 and M5 instances with 4 GiB of memory per vCPU. I will create a fleet using M4 and M5 instances with weights that correspond to the number of vCPUs on each instance:

  • m4.16xlarge – 64 vCPUs, weight = 64
  • m5.24xlarge – 96 vCPUs, weight = 96

This is expressed in a template that looks like this:

"Overrides": [
{
  "InstanceType": "m4.16xlarge",
  "WeightedCapacity": 64,
},
{
  "InstanceType": "m5.24xlarge",
  "WeightedCapacity": 96,
},
]

By default, EC2 Fleet will select the most cost effective combination of instance types and Availability Zones (both specified in the template) using the current prices for the Spot Instances and public prices for the On-Demand Instances (if you specify instances for which you have matching RIs, your discounts will apply). The default mode takes weights into account to get the instances that have the lowest price per unit. So for my grid, fleet will find the instance that offers the lowest price per vCPU.

Now I can request capacity in terms of vCPUs, knowing EC2 Fleet will select the lowest cost option using only the instance types I’ve defined as acceptable. Also, I can specify how many vCPUs I want to launch using On-Demand or Reserved Instance capacity and how many vCPUs should be launched using Spot Instance capacity:

"TargetCapacitySpecification": {
	"TotalTargetCapacity": 2880,
	"OnDemandTargetCapacity": 960,
	"SpotTargetCapacity": 1920,
	"DefaultTargetCapacityType": "Spot"
}

The above means that I want a total of 2880 vCPUs, with 960 vCPUs fulfilled using On-Demand and 1920 using Spot. The On-Demand price per vCPU is lower for m5.24xlarge than the On-Demand price per vCPU for m4.16xlarge, so EC2 Fleet will launch 10 m5.24xlarge instances to fulfill 960 vCPUs. Based on current Spot pricing (again, on a per-vCPU basis), EC2 Fleet will choose to launch 30 m4.16xlarge instances or 20 m5.24xlarges, delivering 1920 vCPUs either way.

Putting it all together, I have a single file (fl1.json) that describes my fleet:

    "LaunchTemplateConfigs": [
        {
            "LaunchTemplateSpecification": {
                "LaunchTemplateId": "lt-0e8c754449b27161c",
                "Version": "1"
            }
        "Overrides": [
        {
          "InstanceType": "m4.16xlarge",
          "WeightedCapacity": 64,
        },
        {
          "InstanceType": "m5.24xlarge",
          "WeightedCapacity": 96,
        },
      ]
        }
    ],
    "TargetCapacitySpecification": {
        "TotalTargetCapacity": 2880,
        "OnDemandTargetCapacity": 960,
        "SpotTargetCapacity": 1920,
        "DefaultTargetCapacityType": "Spot"
    }
}

I can launch my fleet with a single command:

$ aws ec2 create-fleet --cli-input-json file://home/ec2-user/fl1.json
{
    "FleetId":"fleet-838cf4e5-fded-4f68-acb5-8c47ee1b248a"
}

My entire fleet is created within seconds and was built using 10 m5.24xlarge On-Demand Instances and 30 m4.16xlarge Spot Instances, since the current Spot price was 1.5¢ per vCPU for m4.16xlarge and 1.6¢ per vCPU for m5.24xlarge.

Now lets imagine my grid has crunched through its backlog and no longer needs the additional Spot Instances. I can then modify the size of my fleet by changing the target capacity in my fleet specification, like this:

{         
    "TotalTargetCapacity": 960,
}

Since 960 was equal to the amount of On-Demand vCPUs I had requested, when I describe my fleet I will see all of my capacity being delivered using On-Demand capacity:

"TargetCapacitySpecification": {
	"TotalTargetCapacity": 960,
	"OnDemandTargetCapacity": 960,
	"SpotTargetCapacity": 0,
	"DefaultTargetCapacityType": "Spot"
}

When I no longer need my fleet I can delete it and terminate the instances in it like this:

$ aws ec2 delete-fleets --fleet-id fleet-838cf4e5-fded-4f68-acb5-8c47ee1b248a \
  --terminate-instances   
{
    "UnsuccessfulFleetDletetions": [],
    "SuccessfulFleetDeletions": [
        {
            "CurrentFleetState": "deleted_terminating",
            "PreviousFleetState": "active",
            "FleetId": "fleet-838cf4e5-fded-4f68-acb5-8c47ee1b248a"
        }
    ]
}

Earlier I described how RI discounts apply when EC2 Fleet launches instances for which you have matching RIs, so you might be wondering how else RI customers benefit from EC2 Fleet. Let’s say that I own regional RIs for M4 instances. In my EC2 Fleet I would remove m5.24xlarge and specify m4.10xlarge and m4.16xlarge. Then when EC2 Fleet creates the grid, it will quickly find M4 capacity across the sizes and AZs I’ve specified, and my RI discounts apply automatically to this usage.

In the Works
We plan to connect EC2 Fleet and EC2 Auto Scaling groups. This will let you create a single fleet that mixed instance types and Spot, Reserved and On-Demand, while also taking advantage of EC2 Auto Scaling features such as health checks and lifecycle hooks. This integration will also bring EC2 Fleet functionality to services such as Amazon ECS, Amazon EKS, and AWS Batch that build on and make use of EC2 Auto Scaling for fleet management.

Available Now
You can create and make use of EC2 Fleets today in all public AWS Regions!

Jeff;

Event-Driven Computing with Amazon SNS and AWS Compute, Storage, Database, and Networking Services

Post Syndicated from Christie Gifrin original https://aws.amazon.com/blogs/compute/event-driven-computing-with-amazon-sns-compute-storage-database-and-networking-services/

Contributed by Otavio Ferreira, Manager, Software Development, AWS Messaging

Like other developers around the world, you may be tackling increasingly complex business problems. A key success factor, in that case, is the ability to break down a large project scope into smaller, more manageable components. A service-oriented architecture guides you toward designing systems as a collection of loosely coupled, independently scaled, and highly reusable services. Microservices take this even further. To improve performance and scalability, they promote fine-grained interfaces and lightweight protocols.

However, the communication among isolated microservices can be challenging. Services are often deployed onto independent servers and don’t share any compute or storage resources. Also, you should avoid hard dependencies among microservices, to preserve maintainability and reusability.

If you apply the pub/sub design pattern, you can effortlessly decouple and independently scale out your microservices and serverless architectures. A pub/sub messaging service, such as Amazon SNS, promotes event-driven computing that statically decouples event publishers from subscribers, while dynamically allowing for the exchange of messages between them. An event-driven architecture also introduces the responsiveness needed to deal with complex problems, which are often unpredictable and asynchronous.

What is event-driven computing?

Given the context of microservices, event-driven computing is a model in which subscriber services automatically perform work in response to events triggered by publisher services. This paradigm can be applied to automate workflows while decoupling the services that collectively and independently work to fulfil these workflows. Amazon SNS is an event-driven computing hub, in the AWS Cloud, that has native integration with several AWS publisher and subscriber services.

Which AWS services publish events to SNS natively?

Several AWS services have been integrated as SNS publishers and, therefore, can natively trigger event-driven computing for a variety of use cases. In this post, I specifically cover AWS compute, storage, database, and networking services, as depicted below.

Compute services

  • Auto Scaling: Helps you ensure that you have the correct number of Amazon EC2 instances available to handle the load for your application. You can configure Auto Scaling lifecycle hooks to trigger events, as Auto Scaling resizes your EC2 cluster.As an example, you may want to warm up the local cache store on newly launched EC2 instances, and also download log files from other EC2 instances that are about to be terminated. To make this happen, set an SNS topic as your Auto Scaling group’s notification target, then subscribe two Lambda functions to this SNS topic. The first function is responsible for handling scale-out events (to warm up cache upon provisioning), whereas the second is in charge of handling scale-in events (to download logs upon termination).

  • AWS Elastic Beanstalk: An easy-to-use service for deploying and scaling web applications and web services developed in a number of programming languages. You can configure event notifications for your Elastic Beanstalk environment so that notable events can be automatically published to an SNS topic, then pushed to topic subscribers.As an example, you may use this event-driven architecture to coordinate your continuous integration pipeline (such as Jenkins CI). That way, whenever an environment is created, Elastic Beanstalk publishes this event to an SNS topic, which triggers a subscribing Lambda function, which then kicks off a CI job against your newly created Elastic Beanstalk environment.

  • Elastic Load Balancing: Automatically distributes incoming application traffic across Amazon EC2 instances, containers, or other resources identified by IP addresses.You can configure CloudWatch alarms on Elastic Load Balancing metrics, to automate the handling of events derived from Classic Load Balancers. As an example, you may leverage this event-driven design to automate latency profiling in an Amazon ECS cluster behind a Classic Load Balancer. In this example, whenever your ECS cluster breaches your load balancer latency threshold, an event is posted by CloudWatch to an SNS topic, which then triggers a subscribing Lambda function. This function runs a task on your ECS cluster to trigger a latency profiling tool, hosted on the cluster itself. This can enhance your latency troubleshooting exercise by making it timely.

Storage services

  • Amazon S3: Object storage built to store and retrieve any amount of data.You can enable S3 event notifications, and automatically get them posted to SNS topics, to automate a variety of workflows. For instance, imagine that you have an S3 bucket to store incoming resumes from candidates, and a fleet of EC2 instances to encode these resumes from their original format (such as Word or text) into a portable format (such as PDF).In this example, whenever new files are uploaded to your input bucket, S3 publishes these events to an SNS topic, which in turn pushes these messages into subscribing SQS queues. Then, encoding workers running on EC2 instances poll these messages from the SQS queues; retrieve the original files from the input S3 bucket; encode them into PDF; and finally store them in an output S3 bucket.

  • Amazon EFS: Provides simple and scalable file storage, for use with Amazon EC2 instances, in the AWS Cloud.You can configure CloudWatch alarms on EFS metrics, to automate the management of your EFS systems. For example, consider a highly parallelized genomics analysis application that runs against an EFS system. By default, this file system is instantiated on the “General Purpose” performance mode. Although this performance mode allows for lower latency, it might eventually impose a scaling bottleneck. Therefore, you may leverage an event-driven design to handle it automatically.Basically, as soon as the EFS metric “Percent I/O Limit” breaches 95%, CloudWatch could post this event to an SNS topic, which in turn would push this message into a subscribing Lambda function. This function automatically creates a new file system, this time on the “Max I/O” performance mode, then switches the genomics analysis application to this new file system. As a result, your application starts experiencing higher I/O throughput rates.

  • Amazon Glacier: A secure, durable, and low-cost cloud storage service for data archiving and long-term backup.You can set a notification configuration on an Amazon Glacier vault so that when a job completes, a message is published to an SNS topic. Retrieving an archive from Amazon Glacier is a two-step asynchronous operation, in which you first initiate a job, and then download the output after the job completes. Therefore, SNS helps you eliminate polling your Amazon Glacier vault to check whether your job has been completed, or not. As usual, you may subscribe SQS queues, Lambda functions, and HTTP endpoints to your SNS topic, to be notified when your Amazon Glacier job is done.

  • AWS Snowball: A petabyte-scale data transport solution that uses secure appliances to transfer large amounts of data.You can leverage Snowball notifications to automate workflows related to importing data into and exporting data from AWS. More specifically, whenever your Snowball job status changes, Snowball can publish this event to an SNS topic, which in turn can broadcast the event to all its subscribers.As an example, imagine a Geographic Information System (GIS) that distributes high-resolution satellite images to users via Web browser. In this example, the GIS vendor could capture up to 80 TB of satellite images; create a Snowball job to import these files from an on-premises system to an S3 bucket; and provide an SNS topic ARN to be notified upon job status changes in Snowball. After Snowball changes the job status from “Importing” to “Completed”, Snowball publishes this event to the specified SNS topic, which delivers this message to a subscribing Lambda function, which finally creates a CloudFront web distribution for the target S3 bucket, to serve the images to end users.

Database services

  • Amazon RDS: Makes it easy to set up, operate, and scale a relational database in the cloud.RDS leverages SNS to broadcast notifications when RDS events occur. As usual, these notifications can be delivered via any protocol supported by SNS, including SQS queues, Lambda functions, and HTTP endpoints.As an example, imagine that you own a social network website that has experienced organic growth, and needs to scale its compute and database resources on demand. In this case, you could provide an SNS topic to listen to RDS DB instance events. When the “Low Storage” event is published to the topic, SNS pushes this event to a subscribing Lambda function, which in turn leverages the RDS API to increase the storage capacity allocated to your DB instance. The provisioning itself takes place within the specified DB maintenance window.

  • Amazon ElastiCache: A web service that makes it easy to deploy, operate, and scale an in-memory data store or cache in the cloud.ElastiCache can publish messages using Amazon SNS when significant events happen on your cache cluster. This feature can be used to refresh the list of servers on client machines connected to individual cache node endpoints of a cache cluster. For instance, an ecommerce website fetches product details from a cache cluster, with the goal of offloading a relational database and speeding up page load times. Ideally, you want to make sure that each web server always has an updated list of cache servers to which to connect.To automate this node discovery process, you can get your ElastiCache cluster to publish events to an SNS topic. Thus, when ElastiCache event “AddCacheNodeComplete” is published, your topic then pushes this event to all subscribing HTTP endpoints that serve your ecommerce website, so that these HTTP servers can update their list of cache nodes.

  • Amazon Redshift: A fully managed data warehouse that makes it simple to analyze data using standard SQL and BI (Business Intelligence) tools.Amazon Redshift uses SNS to broadcast relevant events so that data warehouse workflows can be automated. As an example, imagine a news website that sends clickstream data to a Kinesis Firehose stream, which then loads the data into Amazon Redshift, so that popular news and reading preferences might be surfaced on a BI tool. At some point though, this Amazon Redshift cluster might need to be resized, and the cluster enters a ready-only mode. Hence, this Amazon Redshift event is published to an SNS topic, which delivers this event to a subscribing Lambda function, which finally deletes the corresponding Kinesis Firehose delivery stream, so that clickstream data uploads can be put on hold.At a later point, after Amazon Redshift publishes the event that the maintenance window has been closed, SNS notifies a subscribing Lambda function accordingly, so that this function can re-create the Kinesis Firehose delivery stream, and resume clickstream data uploads to Amazon Redshift.

  • AWS DMS: Helps you migrate databases to AWS quickly and securely. The source database remains fully operational during the migration, minimizing downtime to applications that rely on the database.DMS also uses SNS to provide notifications when DMS events occur, which can automate database migration workflows. As an example, you might create data replication tasks to migrate an on-premises MS SQL database, composed of multiple tables, to MySQL. Thus, if replication tasks fail due to incompatible data encoding in the source tables, these events can be published to an SNS topic, which can push these messages into a subscribing SQS queue. Then, encoders running on EC2 can poll these messages from the SQS queue, encode the source tables into a compatible character set, and restart the corresponding replication tasks in DMS. This is an event-driven approach to a self-healing database migration process.

Networking services

  • Amazon Route 53: A highly available and scalable cloud-based DNS (Domain Name System). Route 53 health checks monitor the health and performance of your web applications, web servers, and other resources.You can set CloudWatch alarms and get automated Amazon SNS notifications when the status of your Route 53 health check changes. As an example, imagine an online payment gateway that reports the health of its platform to merchants worldwide, via a status page. This page is hosted on EC2 and fetches platform health data from DynamoDB. In this case, you could configure a CloudWatch alarm for your Route 53 health check, so that when the alarm threshold is breached, and the payment gateway is no longer considered healthy, then CloudWatch publishes this event to an SNS topic, which pushes this message to a subscribing Lambda function, which finally updates the DynamoDB table that populates the status page. This event-driven approach avoids any kind of manual update to the status page visited by merchants.

  • AWS Direct Connect (AWS DX): Makes it easy to establish a dedicated network connection from your premises to AWS, which can reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than Internet-based connections.You can monitor physical DX connections using CloudWatch alarms, and send SNS messages when alarms change their status. As an example, when a DX connection state shifts to 0 (zero), indicating that the connection is down, this event can be published to an SNS topic, which can fan out this message to impacted servers through HTTP endpoints, so that they might reroute their traffic through a different connection instead. This is an event-driven approach to connectivity resilience.

More event-driven computing on AWS

In addition to SNS, event-driven computing is also addressed by Amazon CloudWatch Events, which delivers a near real-time stream of system events that describe changes in AWS resources. With CloudWatch Events, you can route each event type to one or more targets, including:

Many AWS services publish events to CloudWatch. As an example, you can get CloudWatch Events to capture events on your ETL (Extract, Transform, Load) jobs running on AWS Glue and push failed ones to an SQS queue, so that you can retry them later.

Conclusion

Amazon SNS is a pub/sub messaging service that can be used as an event-driven computing hub to AWS customers worldwide. By capturing events natively triggered by AWS services, such as EC2, S3 and RDS, you can automate and optimize all kinds of workflows, namely scaling, testing, encoding, profiling, broadcasting, discovery, failover, and much more. Business use cases presented in this post ranged from recruiting websites, to scientific research, geographic systems, social networks, retail websites, and news portals.

Start now by visiting Amazon SNS in the AWS Management Console, or by trying the AWS 10-Minute Tutorial, Send Fan-out Event Notifications with Amazon SNS and Amazon SQS.

 

New – Amazon EC2 Instances with Up to 8 NVIDIA Tesla V100 GPUs (P3)

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-amazon-ec2-instances-with-up-to-8-nvidia-tesla-v100-gpus-p3/

Driven by customer demand and made possible by on-going advances in the state-of-the-art, we’ve come a long way since the original m1.small instance that we launched in 2006, with instances that are emphasize compute power, burstable performance, memory size, local storage, and accelerated computing.

The New P3
Today we are making the next generation of GPU-powered EC2 instances available in four AWS regions. Powered by up to eight NVIDIA Tesla V100 GPUs, the P3 instances are designed to handle compute-intensive machine learning, deep learning, computational fluid dynamics, computational finance, seismic analysis, molecular modeling, and genomics workloads.

P3 instances use customized Intel Xeon E5-2686v4 processors running at up to 2.7 GHz. They are available in three sizes (all VPC-only and EBS-only):

ModelNVIDIA Tesla V100 GPUsGPU MemoryNVIDIA NVLinkvCPUsMain MemoryNetwork BandwidthEBS Bandwidth
p3.2xlarge116 GiBn/a861 GiBUp to 10 Gbps1.5 Gbps
p3.8xlarge464 GiB200 GBps32244 GiB10 Gbps7 Gbps
p3.16xlarge8128 GiB300 GBps64488 GiB25 Gbps14 Gbps

Each of the NVIDIA GPUs is packed with 5,120 CUDA cores and another 640 Tensor cores and can deliver up to 125 TFLOPS of mixed-precision floating point, 15.7 TFLOPS of single-precision floating point, and 7.8 TFLOPS of double-precision floating point. On the two larger sizes, the GPUs are connected together via NVIDIA NVLink 2.0 running at a total data rate of up to 300 GBps. This allows the GPUs to exchange intermediate results and other data at high speed, without having to move it through the CPU or the PCI-Express fabric.

What’s a Tensor Core?
I had not heard the term Tensor core before starting to write this post. According to this very helpful post on the NVIDIA Blog, Tensor cores are designed to speed up the training and inference of large, deep neural networks. Each core is able to quickly and efficiently multiply a pair of 4×4 half-precision (also known as FP16) matrices together, add the resulting 4×4 matrix to another half or single-precision (FP32) matrix, and store the resulting 4×4 matrix in either half or single-precision form. Here’s a diagram from NVIDIA’s blog post:

This operation is in the innermost loop of the training process for a deep neural network, and is an excellent example of how today’s NVIDIA GPU hardware is purpose-built to address a very specific market need. By the way, the mixed-precision qualifier on the Tensor core performance means that it is flexible enough to work with with a combination of 16-bit and 32-bit floating point values.

Performance in Perspective
I always like to put raw performance numbers into a real-world perspective so that they are easier to relate to and (hopefully) more meaningful. This turned out to be surprisingly difficult, given that the eight NVIDIA Tesla V100 GPUs on a single p3.16xlarge can do 125 trillion single-precision floating point multiplications per second.

Let’s go back to the dawn of the microprocessor era, and consider the Intel 8080A chip that powered the MITS Altair that I bought in the summer of 1977. With a 2 MHz clock, it was able to do about 832 multiplications per second (I used this data and corrected it for the faster clock speed). The p3.16xlarge is roughly 150 billion times faster. However, just 1.2 billion seconds have gone by since that summer. In other words, I can do 100x more calculations today in one second than my Altair could have done in the last 40 years!

What about the innovative 8087 math coprocessor that was an optional accessory for the IBM PC that was announced in the summer of 1981? With a 5 MHz clock and purpose-built hardware, it was able to do about 52,632 multiplications per second. 1.14 billion seconds have elapsed since then, p3.16xlarge is 2.37 billion times faster, so the poor little PC would be barely halfway through a calculation that would run for 1 second today.

Ok, how about a Cray-1? First delivered in 1976, this supercomputer was able to perform vector operations at 160 MFLOPS, making the p3.x16xlarge 781,000 times faster. It could have iterated on some interesting problem 1500 times over the years since it was introduced.

Comparisons between the P3 and today’s scale-out supercomputers are harder to make, given that you can think of the P3 as a step-and-repeat component of a supercomputer that you can launch on as as-needed basis.

Run One Today
In order to take full advantage of the NVIDIA Tesla V100 GPUs and the Tensor cores, you will need to use CUDA 9 and cuDNN7. These drivers and libraries have already been added to the newest versions of the Windows AMIs and will be included in an updated Amazon Linux AMI that is scheduled for release on November 7th. New packages are already available in our repos if you want to to install them on your existing Amazon Linux AMI.

The newest AWS Deep Learning AMIs come preinstalled with the latest releases of Apache MxNet, Caffe2, and Tensorflow (each with support for the NVIDIA Tesla V100 GPUs), and will be updated to support P3 instances with other machine learning frameworks such as Microsoft Cognitive Toolkit and PyTorch as soon as these frameworks release support for the NVIDIA Tesla V100 GPUs. You can also use the NVIDIA Volta Deep Learning AMI for NGC.

P3 instances are available in the US East (Northern Virginia), US West (Oregon), EU (Ireland), and Asia Pacific (Tokyo) Regions in On-Demand, Spot, Reserved Instance, and Dedicated Host form.

Jeff;

 

Building High-Throughput Genomics Batch Workflows on AWS: Workflow Layer (Part 4 of 4)

Post Syndicated from Andy Katz original https://aws.amazon.com/blogs/compute/building-high-throughput-genomics-batch-workflows-on-aws-workflow-layer-part-4-of-4/

Aaron Friedman is a Healthcare and Life Sciences Partner Solutions Architect at AWS

Angel Pizarro is a Scientific Computing Technical Business Development Manager at AWS

This post is the fourth in a series on how to build a genomics workflow on AWS. In Part 1, we introduced a general architecture, shown below, and highlighted the three common layers in a batch workflow:

  • Job
  • Batch
  • Workflow

In Part 2, you built a Docker container for each job that needed to run as part of your workflow, and stored them in Amazon ECR.

In Part 3, you tackled the batch layer and built a scalable, elastic, and easily maintainable batch engine using AWS Batch. This solution took care of dynamically scaling your compute resources in response to the number of runnable jobs in your job queue length as well as managed job placement.

In part 4, you build out the workflow layer of your solution using AWS Step Functions and AWS Lambda. You then run an end-to-end genomic analysis―specifically known as exome secondary analysis―for many times at a cost of less than $1 per exome.

Step Functions makes it easy to coordinate the components of your applications using visual workflows. Building applications from individual components that each perform a single function lets you scale and change your workflow quickly. You can use the graphical console to arrange and visualize the components of your application as a series of steps, which simplify building and running multi-step applications. You can change and add steps without writing code, so you can easily evolve your application and innovate faster.

An added benefit of using Step Functions to define your workflows is that the state machines you create are immutable. While you can delete a state machine, you cannot alter it after it is created. For regulated workloads where auditing is important, you can be assured that state machines you used in production cannot be altered.

In this blog post, you will create a Lambda state machine to orchestrate your batch workflow. For more information on how to create a basic state machine, please see this Step Functions tutorial.

All code related to this blog series can be found in the associated GitHub repository here.

Build a state machine building block

To skip the following steps, we have provided an AWS CloudFormation template that can deploy your Step Functions state machine. You can use this in combination with the setup you did in part 3 to quickly set up the environment in which to run your analysis.

The state machine is composed of smaller state machines that submit a job to AWS Batch, and then poll and check its execution.

The steps in this building block state machine are as follows:

  1. A job is submitted.
    Each analytical module/job has its own Lambda function for submission and calls the batchSubmitJob Lambda function that you built in the previous blog post. You will build these specialized Lambda functions in the following section.
  2. The state machine queries the AWS Batch API for the job status.
    This is also a Lambda function.
  3. The job status is checked to see if the job has completed.
    If the job status equals SUCCESS, proceed to log the final job status. If the job status equals FAILED, end the execution of the state machine. In all other cases, wait 30 seconds and go back to Step 2.

Here is the JSON representing this state machine.

{
  "Comment": "A simple example that submits a Job to AWS Batch",
  "StartAt": "SubmitJob",
  "States": {
    "SubmitJob": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:<account-id>::function:batchSubmitJob",
      "Next": "GetJobStatus"
    },
    "GetJobStatus": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:<account-id>:function:batchGetJobStatus",
      "Next": "CheckJobStatus",
      "InputPath": "$",
      "ResultPath": "$.status"
    },
    "CheckJobStatus": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.status",
          "StringEquals": "FAILED",
          "End": true
        },
        {
          "Variable": "$.status",
          "StringEquals": "SUCCEEDED",
          "Next": "GetFinalJobStatus"
        }
      ],
      "Default": "Wait30Seconds"
    },
    "Wait30Seconds": {
      "Type": "Wait",
      "Seconds": 30,
      "Next": "GetJobStatus"
    },
    "GetFinalJobStatus": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:<account-id>:function:batchGetJobStatus",
      "End": true
    }
  }
}

Building the Lambda functions for the state machine

You need two basic Lambda functions for this state machine. The first one submits a job to AWS Batch and the second checks the status of the AWS Batch job that was submitted.

In AWS Step Functions, you specify an input as JSON that is read into your state machine. Each state receives the aggregate of the steps immediately preceding it, and you can specify which components a state passes on to its children. Because you are using Lambda functions to execute tasks, one of the easiest routes to take is to modify the input JSON, represented as a Python dictionary, within the Lambda function and return the entire dictionary back for the next state to consume.

Building the batchSubmitIsaacJob Lambda function

For Step 1 above, you need a Lambda function for each of the steps in your analysis workflow. As you created a generic Lambda function in the previous post to submit a batch job (batchSubmitJob), you can use that function as the basis for the specialized functions you’ll include in this state machine. Here is such a Lambda function for the Isaac aligner.

from __future__ import print_function

import boto3
import json
import traceback

lambda_client = boto3.client('lambda')



def lambda_handler(event, context):
    try:
        # Generate output put
        bam_s3_path = '/'.join([event['resultsS3Path'], event['sampleId'], 'bam/'])

        depends_on = event['dependsOn'] if 'dependsOn' in event else []

        # Generate run command
        command = [
            '--bam_s3_folder_path', bam_s3_path,
            '--fastq1_s3_path', event['fastq1S3Path'],
            '--fastq2_s3_path', event['fastq2S3Path'],
            '--reference_s3_path', event['isaac']['referenceS3Path'],
            '--working_dir', event['workingDir']
        ]

        if 'cmdArgs' in event['isaac']:
            command.extend(['--cmd_args', event['isaac']['cmdArgs']])
        if 'memory' in event['isaac']:
            command.extend(['--memory', event['isaac']['memory']])

        # Submit Payload
        response = lambda_client.invoke(
            FunctionName='batchSubmitJob',
            InvocationType='RequestResponse',
            LogType='Tail',
            Payload=json.dumps(dict(
                dependsOn=depends_on,
                containerOverrides={
                    'command': command,
                },
                jobDefinition=event['isaac']['jobDefinition'],
                jobName='-'.join(['isaac', event['sampleId']]),
                jobQueue=event['isaac']['jobQueue']
            )))

        response_payload = response['Payload'].read()

        # Update event
        event['bamS3Path'] = bam_s3_path
        event['jobId'] = json.loads(response_payload)['jobId']
        
        return event
    except Exception as e:
        traceback.print_exc()
        raise e

In the Lambda console, create a Python 2.7 Lambda function named batchSubmitIsaacJob and paste in the above code. Use the LambdaBatchExecutionRole that you created in the previous post. For more information, see Step 2.1: Create a Hello World Lambda Function.

This Lambda function reads in the inputs passed to the state machine it is part of, formats the data for the batchSubmitJob Lambda function, invokes that Lambda function, and then modifies the event dictionary to pass onto the subsequent states. You can repeat these for each of the other tools, which can be found in the tools//lambda/lambda_function.py script in the GitHub repo.

Building the batchGetJobStatus Lambda function

For Step 2 above, the process queries the AWS Batch DescribeJobs API action with jobId to identify the state that the job is in. You can put this into a Lambda function to integrate it with Step Functions.

In the Lambda console, create a new Python 2.7 function with the LambdaBatchExecutionRole IAM role. Name your function batchGetJobStatus and paste in the following code. This is similar to the batch-get-job-python27 Lambda blueprint.

from __future__ import print_function

import boto3
import json

print('Loading function')

batch_client = boto3.client('batch')

def lambda_handler(event, context):
    # Log the received event
    print("Received event: " + json.dumps(event, indent=2))
    # Get jobId from the event
    job_id = event['jobId']

    try:
        response = batch_client.describe_jobs(
            jobs=[job_id]
        )
        job_status = response['jobs'][0]['status']
        return job_status
    except Exception as e:
        print(e)
        message = 'Error getting Batch Job status'
        print(message)
        raise Exception(message)

Structuring state machine input

You have structured the state machine input so that general file references are included at the top-level of the JSON object, and any job-specific items are contained within a nested JSON object. At a high level, this is what the input structure looks like:

{
        "general_field_1": "value1",
        "general_field_2": "value2",
        "general_field_3": "value3",
        "job1": {},
        "job2": {},
        "job3": {}
}

Building the full state machine

By chaining these state machine components together, you can quickly build flexible workflows that can process genomes in multiple ways. The development of the larger state machine that defines the entire workflow uses four of the above building blocks. You use the Lambda functions that you built in the previous section. Rename each building block submission to match the tool name.

We have provided a CloudFormation template to deploy your state machine and the associated IAM roles. In the CloudFormation console, select Create Stack, choose your template (deploy_state_machine.yaml), and enter in the ARNs for the Lambda functions you created.

Continue through the rest of the steps and deploy your stack. Be sure to check the box next to "I acknowledge that AWS CloudFormation might create IAM resources."

Once the CloudFormation stack is finished deploying, you should see the following image of your state machine.

In short, you first submit a job for Isaac, which is the aligner you are using for the analysis. Next, you use parallel state to split your output from "GetFinalIsaacJobStatus" and send it to both your variant calling step, Strelka, and your QC step, Samtools Stats. These then are run in parallel and you annotate the results from your Strelka step with snpEff.

Putting it all together

Now that you have built all of the components for a genomics secondary analysis workflow, test the entire process.

We have provided sequences from an Illumina sequencer that cover a region of the genome known as the exome. Most of the positions in the genome that we have currently associated with disease or human traits reside in this region, which is 1–2% of the entire genome. The workflow that you have built works for both analyzing an exome, as well as an entire genome.

Additionally, we have provided prebuilt reference genomes for Isaac, located at:

s3://aws-batch-genomics-resources/reference/

If you are interested, we have provided a script that sets up all of that data. To execute that script, run the following command on a large EC2 instance:

make reference REGISTRY=<your-ecr-registry>

Indexing and preparing this reference takes many hours on a large-memory EC2 instance. Be careful about the costs involved and note that the data is available through the prebuilt reference genomes.

Starting the execution

In a previous section, you established a provenance for the JSON that is fed into your state machine. For ease, we have auto-populated the input JSON for you to the state machine. You can also find this in the GitHub repo under workflow/test.input.json:

{
  "fastq1S3Path": "s3://aws-batch-genomics-resources/fastq/SRR1919605_1.fastq.gz",
  "fastq2S3Path": "s3://aws-batch-genomics-resources/fastq/SRR1919605_2.fastq.gz",
  "referenceS3Path": "s3://aws-batch-genomics-resources/reference/hg38.fa",
  "resultsS3Path": "s3://<bucket>/genomic-workflow/results",
  "sampleId": "NA12878_states_1",
  "workingDir": "/scratch",
  "isaac": {
    "jobDefinition": "isaac-myenv:1",
    "jobQueue": "arn:aws:batch:us-east-1:<account-id>:job-queue/highPriority-myenv",
    "referenceS3Path": "s3://aws-batch-genomics-resources/reference/isaac/"
  },
  "samtoolsStats": {
    "jobDefinition": "samtools_stats-myenv:1",
    "jobQueue": "arn:aws:batch:us-east-1:<account-id>:job-queue/lowPriority-myenv"
  },
  "strelka": {
    "jobDefinition": "strelka-myenv:1",
    "jobQueue": "arn:aws:batch:us-east-1:<account-id>:job-queue/highPriority-myenv",
    "cmdArgs": " --exome "
  },
  "snpEff": {
    "jobDefinition": "snpeff-myenv:1",
    "jobQueue": "arn:aws:batch:us-east-1:<account-id>:job-queue/lowPriority-myenv",
    "cmdArgs": " -t hg38 "
  }
}

You are now at the stage to run your full genomic analysis. Copy the above to a new text file, change paths and ARNs to the ones that you created previously, and save your JSON input as input.states.json.

In the CLI, execute the following command. You need the ARN of the state machine that you created in the previous post:

aws stepfunctions start-execution --state-machine-arn <your-state-machine-arn> --input file://input.states.json

Your analysis has now started. By using Spot Instances with AWS Batch, you can quickly scale out your workflows while concurrently optimizing for cost. While this is not guaranteed, most executions of the workflows presented here should cost under $1 for a full analysis.

Monitoring the execution

The output from the above CLI command gives you the ARN that describes the specific execution. Copy that and navigate to the Step Functions console. Select the state machine that you created previously and paste the ARN into the search bar.

The screen shows information about your specific execution. On the left, you see where your execution currently is in the workflow.

In the following screenshot, you can see that your workflow has successfully completed the alignment job and moved onto the subsequent steps, which are variant calling and generating quality information about your sample.

You can also navigate to the AWS Batch console and see that progress of all of your jobs reflected there as well.

Finally, after your workflow has completed successfully, check out the S3 path to which you wrote all of your files. If you run a ls –recursive command on the S3 results path, specified in the input to your state machine execution, you should see something similar to the following:

2017-05-02 13:46:32 6475144340 genomic-workflow/results/NA12878_run1/bam/sorted.bam
2017-05-02 13:46:34    7552576 genomic-workflow/results/NA12878_run1/bam/sorted.bam.bai
2017-05-02 13:46:32         45 genomic-workflow/results/NA12878_run1/bam/sorted.bam.md5
2017-05-02 13:53:20      68769 genomic-workflow/results/NA12878_run1/stats/bam_stats.dat
2017-05-02 14:05:12        100 genomic-workflow/results/NA12878_run1/vcf/stats/runStats.tsv
2017-05-02 14:05:12        359 genomic-workflow/results/NA12878_run1/vcf/stats/runStats.xml
2017-05-02 14:05:12  507577928 genomic-workflow/results/NA12878_run1/vcf/variants/genome.S1.vcf.gz
2017-05-02 14:05:12     723144 genomic-workflow/results/NA12878_run1/vcf/variants/genome.S1.vcf.gz.tbi
2017-05-02 14:05:12  507577928 genomic-workflow/results/NA12878_run1/vcf/variants/genome.vcf.gz
2017-05-02 14:05:12     723144 genomic-workflow/results/NA12878_run1/vcf/variants/genome.vcf.gz.tbi
2017-05-02 14:05:12   30783484 genomic-workflow/results/NA12878_run1/vcf/variants/variants.vcf.gz
2017-05-02 14:05:12    1566596 genomic-workflow/results/NA12878_run1/vcf/variants/variants.vcf.gz.tbi

Modifications to the workflow

You have now built and run your genomics workflow. While diving deep into modifications to this architecture are beyond the scope of these posts, we wanted to leave you with several suggestions of how you might modify this workflow to satisfy additional business requirements.

  • Job tracking with Amazon DynamoDB
    In many cases, such as if you are offering Genomics-as-a-Service, you might want to track the state of your jobs with DynamoDB to get fine-grained records of how your jobs are running. This way, you can easily identify the cost of individual jobs and workflows that you run.
  • Resuming from failure
    Both AWS Batch and Step Functions natively support job retries and can cover many of the standard cases where a job might be interrupted. There may be cases, however, where your workflow might fail in a way that is unpredictable. In this case, you can use custom error handling with AWS Step Functions to build out a workflow that is even more resilient. Also, you can build in fail states into your state machine to fail at any point, such as if a batch job fails after a certain number of retries.
  • Invoking Step Functions from Amazon API Gateway
    You can use API Gateway to build an API that acts as a "front door" to Step Functions. You can create a POST method that contains the input JSON to feed into the state machine you built. For more information, see the Implementing Serverless Manual Approval Steps in AWS Step Functions and Amazon API Gateway blog post.

Conclusion

While the approach we have demonstrated in this series has been focused on genomics, it is important to note that this can be generalized to nearly any high-throughput batch workload. We hope that you have found the information useful and that it can serve as a jump-start to building your own batch workloads on AWS with native AWS services.

For more information about how AWS can enable your genomics workloads, be sure to check out the AWS Genomics page.

Other posts in this four-part series:

Please leave any questions and comments below.

Building High-Throughput Genomic Batch Workflows on AWS: Batch Layer (Part 3 of 4)

Post Syndicated from Andy Katz original https://aws.amazon.com/blogs/compute/building-high-throughput-genomic-batch-workflows-on-aws-batch-layer-part-3-of-4/

Aaron Friedman is a Healthcare and Life Sciences Partner Solutions Architect at AWS

Angel Pizarro is a Scientific Computing Technical Business Development Manager at AWS

This post is the third in a series on how to build a genomics workflow on AWS. In Part 1, we introduced a general architecture, shown below, and highlighted the three common layers in a batch workflow:

  • Job
  • Batch
  • Workflow

In Part 2, you built a Docker container for each job that needed to run as part of your workflow, and stored them in Amazon ECR.

In Part 3, you tackle the batch layer and build a scalable, elastic, and easily maintainable batch engine using AWS Batch.

AWS Batch enables developers, scientists, and engineers to easily and efficiently run hundreds of thousands of batch computing jobs on AWS. It dynamically provisions the optimal quantity and type of compute resources (for example, CPU or memory optimized instances) based on the volume and specific resource requirements of the batch jobs that you submit. With AWS Batch, you do not need to install and manage your own batch computing software or server clusters, which allows you to focus on analyzing results, such as those of your genomic analysis.

Integrating applications into AWS Batch

If you are new to AWS Batch, we recommend reading Setting Up AWS Batch to ensure that you have the proper permissions and AWS environment.

After you have a working environment, you define several types of resources:

  • IAM roles that provide service permissions
  • A compute environment that launches and terminates compute resources for jobs
  • A custom Amazon Machine Image (AMI)
  • A job queue to submit the units of work and to schedule the appropriate resources within the compute environment to execute those jobs
  • Job definitions that define how to execute an application

After the resources are created, you’ll test the environment and create an AWS Lambda function to send generic jobs to the queue.

This genomics workflow covers the basic steps. For more information, see Getting Started with AWS Batch.

Creating the necessary IAM roles

AWS Batch simplifies batch processing by managing a number of underlying AWS services so that you can focus on your applications. As a result, you create IAM roles that give the service permissions to act on your behalf. In this section, deploy the AWS CloudFormation template included in the GitHub repository and extract the ARNs for later use.

To deploy the stack, go to the top level in the repo with the following command:

aws cloudformation create-stack --template-body file://batch/setup/iam.template.yaml --stack-name iam --capabilities CAPABILITY_NAMED_IAM

You can capture the output from this stack in the Outputs tab in the CloudFormation console:

Creating the compute environment

In AWS Batch, you will set up a managed compute environments. Managed compute environments automatically launch and terminate compute resources on your behalf based on the aggregate resources needed by your jobs, such as vCPU and memory, and simple boundaries that you define.

When defining your compute environment, specify the following:

  • Desired instance types in your environment
  • Min and max vCPUs in the environment
  • The Amazon Machine Image (AMI) to use
  • Percentage value for bids on the Spot Market and VPC subnets that can be used.

AWS Batch then provisions an elastic and heterogeneous pool of Amazon EC2 instances based on the aggregate resource requirements of jobs sitting in the RUNNABLE state. If a mix of CPU and memory-intensive jobs are ready to run, AWS Batch provisions the appropriate ratio and size of CPU and memory-optimized instances within your environment. For this post, you will use the simplest configuration, in which instance types are set to "optimal" allowing AWS Batch to choose from the latest C, M, and R EC2 instance families.

While you could create this compute environment in the console, we provide the following CLI commands. Replace the subnet IDs and key name with your own private subnets and key, and the image-id with the image you will build in the next section.

ACCOUNTID=<your account id>
SERVICEROLE=<from output in CloudFormation template>
IAMFLEETROLE=<from output in CloudFormation template>
JOBROLEARN=<from output in CloudFormation template>
SUBNETS=<comma delimited list of subnets>
SECGROUPS=<your security groups>
SPOTPER=50 # percentage of on demand
IMAGEID=<ami-id corresponding to the one you created>
INSTANCEROLE=<from output in CloudFormation template>
REGISTRY=${ACCOUNTID}.dkr.ecr.us-east-1.amazonaws.com
KEYNAME=<your key name>
MAXCPU=1024 # max vCPUs in compute environment
ENV=myenv

# Creates the compute environment
aws batch create-compute-environment --compute-environment-name genomicsEnv-$ENV --type MANAGED --state ENABLED --service-role ${SERVICEROLE} --compute-resources type=SPOT,minvCpus=0,maxvCpus=$MAXCPU,desiredvCpus=0,instanceTypes=optimal,imageId=$IMAGEID,subnets=$SUBNETS,securityGroupIds=$SECGROUPS,ec2KeyPair=$KEYNAME,instanceRole=$INSTANCEROLE,bidPercentage=$SPOTPER,spotIamFleetRole=$IAMFLEETROLE

Creating the custom AMI for AWS Batch

While you can use default Amazon ECS-optimized AMIs with AWS Batch, you can also provide your own image in managed compute environments. We will use this feature to provision additional scratch EBS storage on each of the instances that AWS Batch launches and also to encrypt both the Docker and scratch EBS volumes.

AWS Batch has the same requirements for your AMI as Amazon ECS. To build the custom image, modify the default Amazon ECS-Optimized Amazon Linux AMI in the following ways:

  • Attach a 1 TB scratch volume to /dev/sdb
  • Encrypt the Docker and new scratch volumes
  • Mount the scratch volume to /docker_scratch by modifying /etcfstab

The first two tasks can be addressed when you create the custom AMI in the console. Spin up a small t2.micro instance, and proceed through the standard EC2 instance launch.

After your instance has launched, record the IP address and then SSH into the instance. Copy and paste the following code:

sudo yum -y update
sudo parted /dev/xvdb mklabel gpt
sudo parted /dev/xvdb mkpart primary 0% 100%
sudo mkfs -t ext4 /dev/xvdb1
sudo mkdir /docker_scratch
sudo echo -e '/dev/xvdb1\t/docker_scratch\text4\tdefaults\t0\t0' | sudo tee -a /etc/fstab
sudo mount -a

This auto-mounts your scratch volume to /docker_scratch, which is your scratch directory for batch processing. Next, create your new AMI and record the image ID.

Creating the job queues

AWS Batch job queues are used to coordinate the submission of batch jobs. Your jobs are submitted to job queues, which can be mapped to one or more compute environments. Job queues have priority relative to each other. You can also specify the order in which they consume resources from your compute environments.

In this solution, use two job queues. The first is for high priority jobs, such as alignment or variant calling. Set this with a high priority (1000) and map back to the previously created compute environment. Next, set a second job queue for low priority jobs, such as quality statistics generation. To create these compute environments, enter the following CLI commands:

aws batch create-job-queue --job-queue-name highPriority-${ENV} --compute-environment-order order=0,computeEnvironment=genomicsEnv-${ENV}  --priority 1000 --state ENABLED
aws batch create-job-queue --job-queue-name lowPriority-${ENV} --compute-environment-order order=0,computeEnvironment=genomicsEnv-${ENV}  --priority 1 --state ENABLED

Creating the job definitions

To run the Isaac aligner container image locally, supply the Amazon S3 locations for the FASTQ input sequences, the reference genome to align to, and the output BAM file. For more information, see tools/isaac/README.md.

The Docker container itself also requires some information on a suitable mountable volume so that it can read and write files temporary files without running out of space.

Note: In the following example, the FASTQ files as well as the reference files to run are in a publicly available bucket.

FASTQ1=s3://aws-batch-genomics-resources/fastq/SRR1919605_1.fastq.gz
FASTQ2=s3://aws-batch-genomics-resources/fastq/SRR1919605_2.fastq.gz
REF=s3://aws-batch-genomics-resources/reference/isaac/
BAM=s3://mybucket/genomic-workflow/test_results/bam/

mkdir ~/scratch

docker run --rm -ti -v $(HOME)/scratch:/scratch $REPO_URI --bam_s3_folder_path $BAM \
--fastq1_s3_path $FASTQ1 \
--fastq2_s3_path $FASTQ2 \
--reference_s3_path $REF \
--working_dir /scratch 

Locally running containers can typically expand their CPU and memory resource headroom. In AWS Batch, the CPU and memory requirements are hard limits and are allocated to the container image at runtime.

Isaac is a fairly resource-intensive algorithm, as it creates an uncompressed index of the reference genome in memory to match the query DNA sequences. The large memory space is shared across multiple CPU threads, and Isaac can scale almost linearly with the number of CPU threads given to it as a parameter.

To fit these characteristics, choose an optimal instance size to maximize the number of CPU threads based on a given large memory footprint, and deploy a Docker container that uses all of the instance resources. In this case, we chose a host instance with 80+ GB of memory and 32+ vCPUs. The following code is example JSON that you can pass to the AWS CLI to create a job definition for Isaac.

aws batch register-job-definition --job-definition-name isaac-${ENV} --type container --retry-strategy attempts=3 --container-properties '
{"image": "'${REGISTRY}'/isaac",
"jobRoleArn":"'${JOBROLEARN}'",
"memory":80000,
"vcpus":32,
"mountPoints": [{"containerPath": "/scratch", "readOnly": false, "sourceVolume": "docker_scratch"}],
"volumes": [{"name": "docker_scratch", "host": {"sourcePath": "/docker_scratch"}}]
}'

You can copy and paste the following code for the other three job definitions:

aws batch register-job-definition --job-definition-name strelka-${ENV} --type container --retry-strategy attempts=3 --container-properties '
{"image": "'${REGISTRY}'/strelka",
"jobRoleArn":"'${JOBROLEARN}'",
"memory":32000,
"vcpus":32,
"mountPoints": [{"containerPath": "/scratch", "readOnly": false, "sourceVolume": "docker_scratch"}],
"volumes": [{"name": "docker_scratch", "host": {"sourcePath": "/docker_scratch"}}]
}'

aws batch register-job-definition --job-definition-name snpeff-${ENV} --type container --retry-strategy attempts=3 --container-properties '
{"image": "'${REGISTRY}'/snpeff",
"jobRoleArn":"'${JOBROLEARN}'",
"memory":10000,
"vcpus":4,
"mountPoints": [{"containerPath": "/scratch", "readOnly": false, "sourceVolume": "docker_scratch"}],
"volumes": [{"name": "docker_scratch", "host": {"sourcePath": "/docker_scratch"}}]
}'

aws batch register-job-definition --job-definition-name samtoolsStats-${ENV} --type container --retry-strategy attempts=3 --container-properties '
{"image": "'${REGISTRY}'/samtools_stats",
"jobRoleArn":"'${JOBROLEARN}'",
"memory":10000,
"vcpus":4,
"mountPoints": [{"containerPath": "/scratch", "readOnly": false, "sourceVolume": "docker_scratch"}],
"volumes": [{"name": "docker_scratch", "host": {"sourcePath": "/docker_scratch"}}]
}'

The value for "image" comes from the previous post on creating a Docker image and publishing to ECR. The value for jobRoleArn you can find from the output of the CloudFormation template that you deployed earlier. In addition to providing the number of CPU cores and memory required by Isaac, you also give it a storage volume for scratch and staging. The volume comes from the previously defined custom AMI.

Testing the environment

After you have created the Isaac job definition, you can submit the job using the AWS Batch submitJob API action. While the base mappings for Docker run are taken care of in the job definition that you just built, the specific job parameters should be specified in the container overrides section of the API call. Here’s what this would look like in the CLI, using the same parameters as in the bash commands shown earlier:

aws batch submit-job --job-name testisaac --job-queue highPriority-${ENV} --job-definition isaac-${ENV}:1 --container-overrides '{
"command": [
			"--bam_s3_folder_path", "s3://mybucket/genomic-workflow/test_batch/bam/",
            "--fastq1_s3_path", "s3://aws-batch-genomics-resources/fastq/ SRR1919605_1.fastq.gz",
            "--fastq2_s3_path", "s3://aws-batch-genomics-resources/fastq/SRR1919605_2.fastq.gz",
            "--reference_s3_path", "s3://aws-batch-genomics-resources/reference/isaac/",
            "--working_dir", "/scratch",
			"—cmd_args", " --exome ",]
}'

When you execute a submitJob call, jobId is returned. You can then track the progress of your job using the describeJobs API action:

aws batch describe-jobs –jobs <jobId returned from submitJob>

You can also track the progress of all of your jobs in the AWS Batch console dashboard.

To see exactly where a RUNNING job is at, use the link in the AWS Batch console to direct you to the appropriate location in CloudWatch logs.

Completing the batch environment setup

To finish, create a Lambda function to submit a generic AWS Batch job.

In the Lambda console, create a Python 2.7 Lambda function named batchSubmitJob. Copy and paste the following code. This is similar to the batch-submit-job-python27 Lambda blueprint. Use the LambdaBatchExecutionRole that you created earlier. For more information about creating functions, see Step 2.1: Create a Hello World Lambda Function.

from __future__ import print_function

import json
import boto3

batch_client = boto3.client('batch')

def lambda_handler(event, context):
    # Log the received event
    print("Received event: " + json.dumps(event, indent=2))
    # Get parameters for the SubmitJob call
    # http://docs.aws.amazon.com/batch/latest/APIReference/API_SubmitJob.html
    job_name = event['jobName']
    job_queue = event['jobQueue']
    job_definition = event['jobDefinition']
    
    # containerOverrides, dependsOn, and parameters are optional
    container_overrides = event['containerOverrides'] if event.get('containerOverrides') else {}
    parameters = event['parameters'] if event.get('parameters') else {}
    depends_on = event['dependsOn'] if event.get('dependsOn') else []
    
    try:
        response = batch_client.submit_job(
            dependsOn=depends_on,
            containerOverrides=container_overrides,
            jobDefinition=job_definition,
            jobName=job_name,
            jobQueue=job_queue,
            parameters=parameters
        )
        
        # Log response from AWS Batch
        print("Response: " + json.dumps(response, indent=2))
        
        # Return the jobId
        event['jobId'] = response['jobId']
        return event
    
    except Exception as e:
        print(e)
        message = 'Error getting Batch Job status'
        print(message)
        raise Exception(message)

Conclusion

In part 3 of this series, you successfully set up your data processing, or batch, environment in AWS Batch. We also provided a Python script in the corresponding GitHub repo that takes care of all of the above CLI arguments for you, as well as building out the job definitions for all of the jobs in the workflow: Isaac, Strelka, SAMtools, and snpEff. You can check the script’s README for additional documentation.

In Part 4, you’ll cover the workflow layer using AWS Step Functions and AWS Lambda.

Please leave any questions and comments below.

AWS Enables Consortium Science to Accelerate Discovery

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-enables-consortium-science-to-accelerate-discovery/

My colleague Mia Champion is a scientist (check out her publications), an AWS Certified Solutions Architect, and an AWS Certified Developer. The time that she spent doing research on large-data datasets gave her an appreciation for the value of cloud computing in the bioinformatics space, which she summarizes and explains in the guest post below!

Jeff;


Technological advances in scientific research continue to enable the collection of exponentially growing datasets that are also increasing in the complexity of their content. The global pace of innovation is now also fueled by the recent cloud-computing revolution, which provides researchers with a seemingly boundless scalable and agile infrastructure. Now, researchers can remove the hindrances of having to own and maintain their own sequencers, microscopes, compute clusters, and more. Using the cloud, scientists can easily store, manage, process and share datasets for millions of patient samples with gigabytes and more of data for each individual. As American physicist, John Bardeen once said: “Science is a collaborative effort. The combined results of several people working together is much more effective than could be that of an individual scientist working alone”.

Prioritizing Reproducible Innovation, Democratization, and Data Protection
Today, we have many individual researchers and organizations leveraging secure cloud enabled data sharing on an unprecedented scale and producing innovative, customized analytical solutions using the AWS cloud.  But, can secure data sharing and analytics be done on such a collaborative scale as to revolutionize the way science is done across a domain of interest or even across discipline/s of science? Can building a cloud-enabled consortium of resources remove the analytical variability that leads to diminished reproducibility, which has long plagued the interpretability and impact of research discoveries? The answers to these questions are ‘yes’ and initiatives such as the Neuro Cloud Consortium, The Global Alliance for Genomics and Health (GA4GH), and The Sage Bionetworks Synapse platform, which powers many research consortiums including the DREAM challenges, are starting to put into practice model cloud-initiatives that will not only provide impactful discoveries in the areas of neuroscience, infectious disease, and cancer, but are also revolutionizing the way in which scientific research is done.

Bringing Crowd Developed Models, Algorithms, and Functions to the Data
Collaborative projects have traditionally allowed investigators to download datasets such as those used for comparative sequence analysis or for training a deep learning algorithm on medical imaging data. Investigators were then able to develop and execute their analysis using institutional clusters, local workstations, or even laptops:

This method of collaboration is problematic for many reasons. The first concern is data security, since dataset download essentially permits “chain-data-sharing” with any number of recipients. Second, analytics done using compute environments that are not templated at some level introduces the risk of variable analytics that itself is not reproducible by a different investigator, or even the same investigator using a different compute environment. Third, the required data dump, processing, and then re-upload or distribution to the collaborative group is highly inefficient and dependent upon each individual’s networking and compute capabilities. Overall, traditional methods of scientific collaboration have introduced methods in which security is compromised and time to discovery is hampered.

Using the AWS cloud, collaborative researchers can share datasets easily and securely by taking advantage of Identity and Access Management (IAM) policy restrictions for user bucket access as well as S3 bucket policies or Access Control Lists (ACLs). To streamline analysis and ensure data security, many researchers are eliminating the necessity to download datasets entirely by leveraging resources that facilitate moving the analytics to the data source and/or taking advantage of remote API requests to access a shared database or data lake. One way our customers are accomplishing this is to leverage container based Docker technology to provide collaborators with a way to submit algorithms or models for execution on the system hosting the shared datasets:

Docker container images have all of the application’s dependencies bundled together, and therefore provide a high degree of versatility and portability, which is a significant advantage over using other executable-based approaches. In the case of collaborative machine learning projects, each docker container will contain applications, language runtime, packages and libraries, as well as any of the more popular deep learning frameworks commonly used by researchers including: MXNet, Caffe, TensorFlow, and Theano.

A common feature in these frameworks is the ability to leverage a host machine’s Graphical Processing Units (GPUs) for significant acceleration of the matrix and vector operations involved in the machine learning computations. As such, researchers with these objectives can leverage EC2’s new P2 instance types in order to power execution of submitted machine learning models. In addition, GPUs can be mounted directly to containers using the NVIDIA Docker tool and appear at the system level as additional devices. By leveraging Amazon EC2 Container Service and the EC2 Container Registry, collaborators are able to execute analytical solutions submitted to the project repository by their colleagues in a reproducible fashion as well as continue to build on their existing environment.  Researchers can also architect a continuous deployment pipeline to run their docker-enabled workflows.

In conclusion, emerging cloud-enabled consortium initiatives serve as models for the broader research community for how cloud-enabled community science can expedite discoveries in Precision Medicine while also providing a platform where data security and discovery reproducibility is inherent to the project execution.

Mia D. Champion, Ph.D.

 

EC2 F1 Instances with FPGAs – Now Generally Available

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/ec2-f1-instances-with-fpgas-now-generally-available/

We launched the Developer Preview of the FPGA-equipped F1 instances at AWS re:Invent. The response to the announcement was quick and overwhelming! We received over 2000 requests for entry, and were able to provide over 200 developers with access to the Hardware Development Kit (HDK) and the actual F1 instances.

In the post that I wrote for re:Invent, I told you that:

This highly parallelized model is ideal for building custom accelerators to process compute-intensive problems. Properly programmed, an FPGA has the potential to provide a 30x speedup to many types of genomics, seismic analysis, financial risk analysis, big data search, and encryption algorithms and applications.

During the preview, partners and developers have been working on all sorts of exciting tools, services, and applications. I’ll tell you more about them in just a moment.

Now Generally Available
Today we are making the F1 instances generally available in the US East (Northern Virginia) Region, with plans to bring them to other regions before too long.

We continued to add features and functions during the preview, while also making the development tools more efficient and easier to use. Here’s a summary:

Developer Community – We launched the AWS FPGA Development Forum to provide a place for FPGA developers to hang out and to communicate with us and with each other.

HDK and SDK – We published the EC2 FPGA Hardware (HDK) and Software Development Kit to GitHub, and made many improvements in response to feedback that we received during the preview.

The improvements include support for VHDL (in addition to Verilog), an improved virtual lab environment (Virtual JTAG, Virtual LED, and Virtual DipSwitch), AWS libraries for FPGA management and the FPGA runtime, and support for OpenCL including the AWS OpenCL runtime library.

FPGA Developer AMI – This Marketplace AMI contains a full set of FPGA development tools including an RTL compiler and simulator, along with Xilinx SDAccel for OpenCL development, all tuned for use on C4, M4, and R4 instances.

FPGAs At Work
Here’s a sampling of the impressive work that our partners have been doing with the F1’s:

Edico Genome is deploying their DRAGEN Bio-IT Platform on F1 instances, with the expectation that it will provide whole-genome sequencing that runs in real time.

Ryft offers the Ryft Cloud, an accelerator for data analytics and machine learning that extends Elastic Stack. It sources data from Amazon Kinesis, Amazon Simple Storage Service (S3), Amazon Elastic Block Store (EBS), and local instance storage and uses massive bitwise parallelism to drive performance. The product supports high-level JDBC, ODBC, and REST interfaces along with low-level C, C++, Java, and Python APIs (see the Ryft API page for more information).

Reconfigure.io launched a cloud-based service that allows you to program FPGAs using the Go programming language. You can build, test, and deploy your code from within their cloud-based environment while taking advantage of concurrency-oriented language features such as goroutines (lightweight threads), channels, and selects.

NGCodec ported their RealityCodec video encoder to the F1 and used it to produce broadcast-quality video at 80 frames per second. Their solution can encode up to 32 independent video streams on a single F1 instance (read their new post, You Deserve Better than Grainy Giraffes, to learn more).

FPGAs In School & Research
Research groups and graduate classes at top-tier universities contacted us via AWS Educate and were eager to gain access to F1 instances.

UCLA‘s CS133 class (Parallel and Distributed Computing) is setting up an F1-based FPGA lab that will be operational within 3 or 4 weeks. According to UCLA Chancellor’s Professor Jason Cong, they are expanding multiple research projects to cover F1 including FPGA performance debugging, machine learning acceleration, Spark to FPGA compilation, and systolic array compilation.

Last month we announced that we are collaborating with the National Science Foundation (NSF) to foster innovation in big data research (read AWS Collaborates With the National Science Foundation to Foster Innovation to learn more and to find out how to apply for a grant).

FPGA’s in the AWS Marketplace
As I shared in my original post, we have built a complete beginning to end solution that lets developers build FPGA-powered applications and services and list them in the AWS Marketplace. I can’t wait to see what kinds of cool things show up there!

Jeff;

Month in Review: December 2016

Post Syndicated from Derek Young original https://aws.amazon.com/blogs/big-data/month-in-review-december-2016/

Another month of big data solutions on the Big Data Blog.

Take a look at our summaries below and learn, comment, and share. Thank you for reading!

Implementing Authorization and Auditing using Apache Ranger on Amazon EMR
Apache Ranger is a framework to enable, monitor, and manage comprehensive data security across the Hadoop platform. Features include centralized security administration, fine-grained authorization across many Hadoop components (Hadoop, Hive, HBase, Storm, Knox, Solr, Kafka, and YARN) and central auditing. In this post, walk through the steps to enable authorization and audit for Amazon EMR clusters using Apache Ranger.

Amazon Redshift Engineering’s Advanced Table Design Playbook
Amazon Redshift is a fully managed, petabyte scale, massively parallel data warehouse that offers simple operations and high performance. In practice, the best way to improve query performance by orders of magnitude is by tuning Amazon Redshift tables to better meet your workload requirements. This five-part blog series will guide you through applying distribution styles, sort keys, and compression encodings and configuring tables for data durability and recovery purposes.

Interactive Analysis of Genomic Datasets Using Amazon Athena
In this post, learn to prepare genomic data for analysis with Amazon Athena. We’ll demonstrate how Athena is well-adapted to address common genomics query paradigms using the Thousand Genomes dataset hosted on Amazon S3, a seminal genomics study. Although this post is focused on genomic analysis, similar approaches can be applied to any discipline where large-scale, interactive analysis is required.

Joining and Enriching Streaming Data on Amazon Kinesis
In this blog post, learn three approaches for joining and enriching streaming data on Amazon Kinesis Streams by using Amazon Kinesis Analytics, AWS Lambda, and Amazon DynamoDB.

Using SaltStack to Run Commands in Parallel on Amazon EMR
SaltStack is an open source project for automation and configuration management. It started as a remote execution engine designed to scale to many machines while delivering high-speed execution. You can now use the new bootstrap action that installs SaltStack on Amazon EMR. It provides a basic configuration that enables selective targeting of the nodes based on instance roles, instance groups, and other parameters.

Building an Event-Based Analytics Pipeline for Amazon Game Studios’ Breakaway
Amazon Game Studios’ new title Breakaway is an online 4v4 team battle sport that delivers fast action, teamwork, and competition. In this post, learn the technical details of how the Breakaway team uses AWS to collect, process, and analyze gameplay telemetry to answer questions about arena design.

Respond to State Changes on Amazon EMR Clusters with Amazon CloudWatch Events
With new support for Amazon EMR in Amazon CloudWatch Events, you can be notified quickly and programmatically respond to state changes in your EMR clusters. Additionally, these events are also displayed in the Amazon EMR console. CloudWatch Events allows you to create filters and rules to match these events and route them to Amazon SNS topics, AWS Lambda functions, Amazon SQS queues, streams in Amazon Kinesis Streams, or built-in targets.

Run Jupyter Notebook and JupyterHub on Amazon EMR
Data scientists who run Jupyter and JupyterHub on Amazon EMR can use Python, R, Julia, and Scala to process, analyze, and visualize big data stored in Amazon S3. Jupyter notebooks can be saved to S3 automatically, so users can shut down and launch new EMR clusters, as needed. See how EMR makes it easy to spin up clusters with different sizes and CPU/memory configurations to suit different workloads and budgets.

Derive Insights from IoT in Minutes using AWS IoT, Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight
In this post, see how you can build a business intelligence capability for streaming IoT device data using AWS serverless and managed services. You can be up and running in minutes―starting small, but able to easily grow to millions of devices and billions of messages.

Serving Real-Time Machine Learning Predictions on Amazon EMR
The typical progression for creating and using a trained model for recommendations falls into two general areas: training the model and hosting the model. Model training has become a well-known standard practice. In this post, we highlight one way to host those recommendations using Amazon EMR with JobServer

Powering Amazon Redshift Analytics with Apache Spark and Amazon Machine Learning
In this post, learn to generate a predictive model for flight delays that can be used to help pick the flight least likely to add to your travel stress. To accomplish this, you’ll use Apache Spark running on Amazon EMR for extracting, transforming, and loading (ETL) the data, Amazon Redshift for analysis, and Amazon Machine Learning for creating predictive models.

FROM THE ARCHIVE

Running sparklyr – RStudio’s R Interface to Spark on Amazon EMR
Sparklyr is an R interface to Spark that allows users to use Spark as the backend for dplyr, one of the most popular data manipulation packages. Sparklyr provides interfaces to Spark packages and also allows users to query data in Spark using SQL and develop extensions for the full Spark API. This short post shows you how to run RStudio and sparklyr on EMR.


Want to learn more about Big Data or Streaming Data? Check out our Big Data and Streaming data educational pages.

Leave a comment below to let us know what big data topics you’d like to see next on the AWS Big Data Blog.

Serving Real-Time Machine Learning Predictions on Amazon EMR

Post Syndicated from Derek Graeber original https://aws.amazon.com/blogs/big-data/serving-real-time-machine-learning-predictions-on-amazon-emr/

The typical progression for creating and using a trained model for recommendations falls into two general areas: training the model and hosting the model. Model training has become a well-known standard practice. We want to highlight one of many ways to host those recommendations (for example, see the Analyzing Genomics Data at Scale using R, AWS Lambda, and Amazon API Gateway post).

In this post, we look at one possible way to host a trained ALS model on Amazon EMR using Apache Spark to serve movie predictions in real time. It is a continuation of two recent posts that are prerequisite:

In future posts we will cover other alternatives for serving real-time machine-learning predictions, namely AWS Lambda and Amazon EC2 Container Service, by running the prediction functions locally and loading the saved models from S3 to the local execution environments.

Walkthrough: Trained ALS model

For this walkthrough, you use the MovieLens dataset as set forth in the Building a Recommendation Engine post; the data model should have already been generated and persisted to Amazon S3. It uses the Alternating Least Squares (ALS) algorithm to train the data for generating the proper model.

Using JobServer, you take that model and persist it in memory in JobServer on Amazon EMR. After it’s persisted, you can expose RESTful endpoints to AWS Lambda, which in turn can be invoked from a static UI page hosted on S3, securing access with Amazon Cognito.

Here are the steps that you follow:

  1. Create the infrastructure, including EMR with JobServer and Lambda.
  2. Load the trained model into Spark on EMR via JobServer.
  3. Stage a static HTML page on S3.
  4. Access the AWS Lambda endpoints via the static HTML page authenticated with Amazon Cognito.

The following diagram shows the infrastructure architecture.

o_realtime_1_1_1

Create the hosting environment

Because you are already familiar with JobServer on EMR from the prerequisite Installing and Running JobServer post, this post won’t review most of that work. You use EMR 4.7.1, as does the earlier post. At the time, of this writing, JobServer does not have a version for Spark 2.0, so you use JobServer v.0.6.2.

Be sure to get the updated aws-blog-jobserver-emr . The project structure has not changed, but the code and configuration has been added. Specifically, under the <project> directory, reference the following:

  • <project>/BA – the bootstrap actions to install JobServer on EMR
  • <project>/cfn – the location of the necessary AWS CloudFormation templates
  • <project>/html – the static HTML page to be hosted in S3
  • <project>/jobserver_configs – the configurations for JobServer
  • <project>/ml_commands – sample invocation commands using CURL (just reference)
  • <project>/policy – the role policies for artifacts
  • <project>/python_lambda – the source code for the AWS Lambda Python code
  • <project>/src – the Scala source code for JobServer

You expose three endpoints in JobServer (and Lambda):

  1. LoadModelAndData – This takes both the model and data stored on S3 and loads them into JobServer. You MUST invoke this endpoint first.
  2. MoviesRec – The recommendation endpoint for all movies for a user (in this case, userId=100).
  3. MoviesRecByGenre – The recommendation for moves by a specific genre for a specific user (in this case, userId=100)

Use AWS CloudFormation to create the EMR cluster that JobServer runs on and the Lambda functions that access JobServer.

To get started, prep the deployment and review the assumptions.

Prep the deployment and stage configurations for JobServer

You use S3 as the landing point for all necessary configurations for JobServer and Lambda. You MUST be familiar with the JobServer on EMR post and understand the configurations discussed in detail. We touch on them only briefly in this post as we walk you through the following steps. For each step, be sure to note the full S3 object path.

  1. Compile the JobServer code with Maven and put the output .jar file in the S3 bucket (jobserverEmr-1.0.jar).
  2. Put the file <project>/jobserver_configs/emr_contexts_ml.conf in the S3 bucket.
  3. Put the file <project>/jobserver_configs/emr_v1.6.2.conf in the S3 bucket.
  4. Open the BA at <project>/BA/install_jobserver_ML.sh and update the following properties:
    • EMR_SH_FILE property – the path to the object on S3 to emr_v1.6.2.conf
    • EMR_CONF_FILE property – the path to the object on S3 to emr_contexts_ml.conf
  5. Put the install_jobserver_ML.sh BA in the S3 bucket.
  6. Open the BA at <project>/BA/startServer.sh and update the properties:
    • EMR_JOBSERVER_FILE – object on S3 to jobserverEmr-1.0.jar
  7. Put the startServer.sh BA in the S3 bucket.

At this point, your JobServer is configured. Right now, you should have noted the following:

  • The full S3 object path to the install_jobserver_ML.sh bootstrap action
  • The full S3 object path to startServer.sh bootstrap action

Now, you can create the Python code for Lambda.

Prep the deployment and stage configurations for Lambda

Create a zip file with the Python code and a dependency, using Python 2.7 for the runtime. Under <project>/python_lambda, the models.py file is the source code. It uses the sjs-client API. Install it for Python to use and put it in the deployment package zip file. For more information, see Creating a Deployment Package (Python).

pip install python-sjsclient –t <project> 

Zip the models.py script with the dependency sjs-client libraries (make sure that models.py is at the root and not in a directory). In the working environment, name the output .zip file python-lambda.zip or similar. Place this zip file in the S3 bucket and be sure to make a note of the full S3 object path.

Now that you have all the dependencies, you can move onto creating the environment for CloudFormation.

Create the JobServer environment

To create the JobServer environment, use CloudFormation and the items you staged on S3. The CloudFormation template is located at <project>/cfn/ml-jobserver.template.

It assumes the following:

  1. The user creating the environment has access to IAM, Route 53, Lambda, and EMR.
  2. It is being run in the default VPC of the account, in a public subnet.
  3. The account already has the EMR_DefaultRole and EMR_EC2_DefaultRole roles defined.

Get the VPCID value of the default VPC and the SubnetID value for the subnet in which the artifacts are created. Provide the following seven parameters as inputs to the CloudFormation template:

  • VPCID – the default VPC
    • vpc-XXXXXXXX
  • SubnetID – the subnet in the VPC to use
    • subnet-XXXXXXXX
  • EMRKeyPair – the name of an EC2 key to be installed on the EMR master node
    • YOUR-EC2-KEYPAIR
  • BaJobServerLoc – where to put the install JobServer BA
    • s3://<S3folder>/jobserver/BA/install_jobserver_ML_BA.sh
  • StepStartJobServerLoc – where to put the startServer.sh BA
    • s3:// <S3folder>/jobserver/BA/startServer.sh
  • LambdaS3BucketName – the name of the bucket where the Python code is located
    • <S3folder>
  • LambdaS3BucketKey – the key (without the bucket) where your Python code is located
    • ml/lambda/python_lambda.zip

Browse to the console and execute this template being sure to supply the required parameters. This template takes approximately 10-12 minutes to execute.

When the template has completed, the EMR cluster with JobServer is installed with your ML-Jobserver code and the Lambda endpoints are installed. You can find the names of the Lambda endpoints in the output section. You need these later on to host a static webpage in S3 and invoke the endpoints. The three endpoints created correspond to the three access points you deployed in JobServer:

  • LoadDataLambdaARN
  • RecommendationLambdaARN
  • GenreLambdaARN

o_realtime_2

Using the following template for each of the Lambda functions created, you can test each of the Lambda endpoints (be sure to test LoadDataLambda first). In this use case, the model you created with Zeppelin was persisted to S3 as denoted and the MovieLens data is on S3:

{
  "userId": 100,
  "genre": "Comedy",
  "s3DataLoc": "dgraeberaws-blogs/ml/data/movielens/small/",
  "s3ModelLoc": "dgraeberaws-blogs/ml/models/movielens/recommendations/"
}

With the server deployed and the model loaded and tested via Lambda test events, you can move on to creating a static HTML page to access these endpoints.

Create the static UI

You host a static HTML page on S3 and authenticate (unauthorized) with Amazon Cognito. In this use case, you use a federated identity pool, created using the console. You can either create a new IAM user manually or let the Amazon Cognito console create one for you. For the role that is the unauthenticated role, you must have a policy attached to that role to allow access to Lambda as well as the proper Amazon Cognito trust policy (see <project>/policy for samples).

After the identity pool is created, make note of the pool ID. Now, edit the mltest.html code located in <project>/html. Make four updates to give this page access to the Lambda code:

  • The Amazon Cognito identity pool ID
  • The ARN of the LoadDataLambdaARN function (from the CloudFormation output)
  • The ARN of the GenreLambdaARN function (from the CloudFormation output)
  • The ARN of the RecommendationLambdaARN function (from the CloudFormation output)

Edit the mltest.html page with these values, as indicated below:

o_realtime_3
Create an S3 bucket and configure it as a static website. After it’s created, place the mltest.html page in the bucket (with proper permissions for public read access) and browse to that object.

o_realtime_4

Conclusion

We have shown you one possible way to host a real-time recommendation engine on Amazon EMR with JobServer. You can quickly create infrastructure and load the model data. Also, using RESTful endpoints and AWS Lambda, you can expose the model to end users to provide recommendations in real time. This is only one example of a way to provide real-time predictions using standard technologies. Look for future posts related to Machine Learning on the AWS Big Data Blog.

If you have questions or suggestions, please comment below.


About the Authors


derek_graeber_90_1Derek Graeber is a Senior Consultant in Big Data & Analytics for AWS Professional Services.
He works with enterprise customers to provide leadership on big data projects, helping them realize their business goals when running on AWS. In his spare time, he enjoys spending time with his wife and family, occasionally getting out to reaffirm that he will never be a good golfer.

 

guy_ernest_90Guy Ernest is a principal business development manager for machine learning in AWS. He has the exciting opportunity to help shape and deliver on a strategy to build mind share and broad use of Amazon’s cloud computing platform for AI, machine learning and deep learning use cases. In his spare time, he enjoys spending time with his wife and family, gathering embarrassing stories, to share in talks about Amazon and the future of AI.


Related

Readmission Prediction Through Patient Risk Stratification Using Amazon Machine Learning

Readmission_1

 

 

 

 

 

 


Interactive Analysis of Genomic Datasets Using Amazon Athena

Post Syndicated from Aaron Friedman original https://aws.amazon.com/blogs/big-data/interactive-analysis-of-genomic-datasets-using-amazon-athena/

Aaron Friedman is a Healthcare and Life Sciences Solutions Architect with Amazon Web Services

The genomics industry is in the midst of a data explosion. Due to the rapid drop in the cost to sequence genomes, genomics is now central to many medical advances. When your genome is sequenced and analyzed, raw sequencing files are processed in a multi-step workflow to identify where your genome differs from a standard reference. Your variations are stored in a Variant Call Format (VCF) file, which is then combined with other individuals to enable population-scale analyses. Many of these datasets are publicly available, and an increasing number are hosted on AWS as part of our Open Data project.

To mine genomic data for new discoveries, researchers in both industry and academia build complex models to analyze populations at scale. When building models, they first explore the datasets-of-interest to understand what questions the data might answer. In this step, interactivity is key, as it allows them to move easily from one question to the next.

Recently, we launched Amazon Athena as an interactive query service to analyze data on Amazon S3. With Amazon Athena there are no clusters to manage and tune, no infrastructure to setup or manage, and customers pay only for the queries they run. Athena is able to query many file types straight from S3. This flexibility gives you the ability to interact easily with your datasets, whether they are in a raw text format (CSV/JSON) or specialized formats (e.g. Parquet). By being able to flexibly query different types of data sources, researchers can more rapidly progress through the data exploration phase for discovery. Additionally, researchers don’t have to know nuances of managing and running a big data system. This makes Athena an excellent complement to data warehousing on Amazon Redshift and big data analytics on Amazon EMR 

In this post, I discuss how to prepare genomic data for analysis with Amazon Athena as well as demonstrating how Athena is well-adapted to address common genomics query paradigms.  I use the Thousand Genomes dataset hosted on Amazon S3, a seminal genomics study, to demonstrate these approaches. All code that is used as part of this post is available in our GitHub repository.

Although this post is focused on genomic analysis, similar approaches can be applied to any discipline where large-scale, interactive analysis is required.

 

Select, aggregate, annotate query pattern in genomics

Genomics researchers may ask different questions of their dataset, such as:

  • What variations in a genome may increase the risk of developing disease?
  • What positions in the genome have abnormal levels of variation, suggesting issues in quality of sequencing or errors in the genomic reference?
  • What variations in a genome influence how an individual may respond to a specific drug treatment?
  • Does a group of individuals contain a higher frequency of a genomic variant known to alter response to a drug relative to the general population?

All these questions, and more, can be generalized under a common query pattern I like to call “Select, Aggregate, Annotate”. Some of our genomics customers, such as Human Longevity, Inc., routinely use this query pattern in their work.

In each of the above queries, you execute the following steps:

SELECT: Specify the cohort of individuals meeting certain criteria (disease, drug response, age, BMI, entire population, etc.).

AGGREGATE: Generate summary statistics of genomic variants across the cohort that you selected.

ANNOTATE: Assign meaning to each of the variants by joining on known information about each variant.

Dataset preparation

Properly organizing your dataset is one of the most critical decisions for enabling fast, interactive analysies. Based on the query pattern I just described, the table representing your population needs to have the following information:

  • A unique sample ID corresponding to each sample in your population
  • Information about each variant, specifically its location in the genome as well as the specific deviation from the reference
  • Information about how many times in a sample a variant occurs (0, 1, or 2 times) as well as if there are multiple variants in the same site. This is known as a genotype.

The extract, transform, load (ETL) process to generate the appropriate data representation has two main steps. First, you use ADAM, a genomics analysis platform built on top of Spark, to convert the variant information residing a VCF file to Parquet for easier downstream analytics, in a process similar to the one described in the Will Spark Power the Data behind Precision Medicine? post. Then, you use custom Python code to massage the data and select only the appropriate fields that you need for analysis with Athena.

First, spin up an EMR cluster (version 5.0.3) for the ETL process. I used a c4.xlarge for my master node and m4.4xlarges with 1 TB of scratch for my core nodes.

After you SSH into your master node, clone the git repository. You can also put this in as a bootstrap action when spinning up your cluster.

sudo yum –y install git
git clone https://github.com/awslabs/aws-big-data-blog.git 

You then need to install ADAM and configure it to run with Spark 2.0. In your terminal, enter the following:

# Install Maven
wget http://apache.claz.org/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
tar xvzf apache-maven-3.3.9-bin.tar.gz
export PATH=/home/hadoop/apache-maven-3.3.9/bin:$PATH

# Install ADAM
git clone https://github.com/bigdatagenomics/adam.git
cd adam
sed -i 's/2.6.0/2.7.2/' pom.xml
./scripts/move_to_spark_2.sh
export MAVEN_OPTS="-Xmx512m -XX:MaxPermSize=256m"
mvn clean package -DskipTests

export PATH=/home/hadoop/adam/bin:$PATH

Now that ADAM is installed, enter the working directory of the repo that you first cloned, which contains the code relevant to this post. Your next step is to convert a VCF containing all the information about your population.

For this post, you are going to focus on a single region in the genome, defined as chromosome 22, for all 2,504 samples in the Thousand Genomes dataset. As chromosome 22 was the first chromosome to be sequenced as part of the Human Genome Project, it is your first here as well. You can scale the size of your cluster depending on whether you are looking at just chromosome 22, or the entire genome. In the following lines, replace mytestbucket with your bucket name of choice.

PARQUETS3PATH=s3://<mytestbucket>/thousand_genomes/grch37/chr22.parquet/
cd ~/aws-big-data-blog/aws-blog-athena-genomics/etl/thousand_genomes/
chmod +x ./convert_vcf.sh
./convert_vcf.sh $PARQUETS3PATH

After this conversion completes, you can reduce your Parquet file to only the fields you need.

TRIMMEDS3PATH=s3://<mytestbucket>/thousand_genomes/grch37_trimmed/chr22.parquet/
spark-submit --master yarn --executor-memory 40G --deploy-mode cluster create_trimmed_parquet.py --input_s3_path $PARQUETS3PATH --output_s3_path $TRIMMEDS3PATH 

Annotation data preparation

For this post, use the ClinVar dataset, which is an archive of information about variation in genomes and health. While Parquet is a preferred format for Athena relative to raw text, using the ClinVar TXT file demonstrates the ability of Athena to read multiple file types.

In a Unix terminal, execute the following commands (replace mytestbucket with your bucket name of choice):

ANNOPATH=s3://<mytestbucket>/clinvar/

wget ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz

# Need to strip out the first line (header)
zcat variant_summary.txt.gz | sed '1d' > temp ; mv -f temp variant_summary.trim.txt ; gzip variant_summary.trim.txt

aws s3 cp variant_summary.trim.txt.gz $ANNOPATH --sse

Creating Tables using Amazon Athena

In the recent Amazon Athena – Interactive SQL Queries for Data in Amazon S3 post, Jeff Barr covered how to get started with Athena and navigate to the console. Athena uses the Hive DDL when you create, alter, or drop a table. This means you can use the standard Hive syntax for creating tables. These commands can also be found in the aws-blog-athena-genomics GitHub repo under sql/setup.

First, create your database:

CREATE DATABASE kg;

Next, define your table with the appropriate mappings from the previous ETL processes. Be sure to change to your appropriate bucket name.

For your population data:

If you want to skip the ETL process described above for your population data, you can use the following path for your analysis: s3://aws-bigdata-blog/artifacts/athena_genomics.parquet/ – be sure to substitute it in for the LOCATION below.

CREATE EXTERNAL TABLE demo.samplevariants
(
  alternateallele STRING
  ,chromosome STRING
  ,endposition BIGINT
  ,genotype0 STRING
  ,genotype1 STRING
  ,referenceallele STRING
  ,sampleid STRING
  ,startposition BIGINT
)
STORED AS PARQUET
LOCATION ' s3://<mytestbucket>/thousand_genomes/grch37_trimmed/chr22.parquet/';

For your annotation data:

CREATE EXTERNAL TABLE demo.clinvar
(
    alleleId STRING
    ,variantType STRING
    ,hgvsName STRING
    ,geneID STRING
    ,geneSymbol STRING
    ,hgncId STRING
    ,clinicalSignificance STRING
    ,clinSigSimple STRING
    ,lastEvaluated STRING
    ,rsId STRING
    ,dbVarId STRING
    ,rcvAccession STRING
    ,phenotypeIds STRING
    ,phenotypeList STRING
    ,origin STRING
    ,originSimple STRING
    ,assembly STRING
    ,chromosomeAccession STRING
    ,chromosome STRING
    ,startPosition INT
    ,endPosition INT
    ,referenceAllele STRING
    ,alternateAllele STRING
    ,cytogenetic STRING
    ,reviewStatus STRING
    ,numberSubmitters STRING
    ,guidelines STRING
    ,testedInGtr STRING
    ,otherIds STRING
    ,submitterCategories STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://<mytestbucket>/clinvar/';

You can confirm that both tables have been created by choosing the eye icon next to a table name in the console. This automatically runs a query akin to SELECT * FROM table LIMIT 10. If it succeeds, then your data has been loaded successfully.

o_athena_genomes_1

Applying the select, aggregate, annotate paradigm

In this section, I walk you through how to use the query pattern described earlier to answer two common questions in genomics. You examine the frequency of different genotypes, which is a distinct combination of all fields in the samplevariants table with the exception of sampleid. These queries can also be found in the aws-blog-athena-genomics GitHub repo under sql/queries.

Population drug response

What small molecules/drugs are most likely to affect a subpopulation of individuals (ancestry, age, etc.) based on their genomic information?

In this query, assume that you have some phenotype data about your population. In this case, also assume that all samples sharing the pattern “NA12” are part of a specific demographic.

In this query, use sampleid as your predicate pushdown. The general steps are:

  1. Filter by the samples in your subpopulation
  2. Aggregate variant frequencies for the subpopulation-of-interest
  3. Join on ClinVar dataset
  4. Filter by variants that have been implicated in drug-response
  5. Order by highest frequency variants

To answer this question, you can craft the following query:

SELECT
    count(*)/cast(numsamples AS DOUBLE) AS genotypefrequency
    ,cv.rsid
    ,cv.phenotypelist
    ,sv.chromosome
    ,sv.startposition
    ,sv.endposition
    ,sv.referenceallele
    ,sv.alternateallele
    ,sv.genotype0
    ,sv.genotype1
FROM demo.samplevariants sv
CROSS JOIN
    (SELECT count(1) AS numsamples
    FROM
        (SELECT DISTINCT sampleid
        FROM demo.samplevariants
        WHERE sampleid LIKE 'NA12%'))
JOIN demo.clinvar cv
ON sv.chromosome = cv.chromosome
    AND sv.startposition = cv.startposition - 1
    AND sv.endposition = cv.endposition
    AND sv.referenceallele = cv.referenceallele
    AND sv.alternateallele = cv.alternateallele
WHERE assembly='GRCh37'
    AND cv.clinicalsignificance LIKE '%response%'
    AND sampleid LIKE 'NA12%'
GROUP BY  sv.chromosome
          ,sv.startposition
          ,sv.endposition
          ,sv.referenceallele
          ,sv.alternateallele
          ,sv.genotype0
          ,sv.genotype1
          ,cv.clinicalsignificance
          ,cv.phenotypelist
          ,cv.rsid
          ,numsamples
ORDER BY  genotypefrequency DESC LIMIT 50

Enter the query into the console and you see something like the following:

o_athena_genomes_2

When you inspect the results, you can quickly see (often in a matter of seconds!) that this population of individuals has a high frequency of variants associated with the metabolism of debrisoquine.

o_athena_genomes_3

Quality control

Are you systematically finding locations in your reference that are called as variation? In other words, what positions in the genome have abnormal levels of variation, suggesting issues in quality of sequencing or errors in the reference? Are any of these variants implicated in clinically? If so, could this be a false positive clinical finding?

In this query, the entire population is needed so the SELECT predicate is not used. The general steps are:

  1. Aggregate variant frequencies for the entire Thousand Genomes population
  2. Join on the ClinVar dataset
  3. Filter by variants that have been implicated in disease
  4. Order by the highest frequency variants

This translates into the following query:

SELECT
    count(*)/cast(numsamples AS DOUBLE) AS genotypefrequency
    ,cv.clinicalsignificance
    ,cv.phenotypelist
    ,sv.chromosome
    ,sv.startposition
    ,sv.endposition
    ,sv.referenceallele
    ,sv.alternateallele
    ,sv.genotype0
    ,sv.genotype1
FROM demo.samplevariants sv
CROSS JOIN
    (SELECT count(1) AS numsamples
    FROM
        (SELECT DISTINCT sampleid
        FROM demo.samplevariants))
    JOIN demo.clinvar cv
    ON sv.chromosome = cv.chromosome
        AND sv.startposition = cv.startposition - 1
        AND sv.endposition = cv.endposition
        AND sv.referenceallele = cv.referenceallele
        AND sv.alternateallele = cv.alternateallele
WHERE assembly='GRCh37'
        AND cv.clinsigsimple='1'
GROUP BY  sv.chromosome ,
          sv.startposition
          ,sv.endposition
          ,sv.referenceallele
          ,sv.alternateallele
          ,sv.genotype0
          ,sv.genotype1
          ,cv.clinicalsignificance
          ,cv.phenotypelist
          ,numsamples
ORDER BY  genotypefrequency DESC LIMIT 50

As you can see in the following screenshot, the highest frequency results have conflicting information, being listed both as potentially causing disease and being benign. In genomics, variants with a higher frequency are less likely to cause disease. Your quick analysis allows you to discount the pathogenic clinical significance annotations of these high genotype frequency variants.

o_athena_genomes_4

 

Summary

I hope you have seen how you can easily integrate datasets of different types and values, and combine them to quickly derive new meaning from your data. Do you have a dataset you want to explore or want to learn more about Amazon Athena? Check out our Getting Started Guide and happy querying!

If you have questions or suggestions, please comment below.


About the author

friedman_90Dr. Aaron Friedman a Healthcare and Life Sciences Partner Solutions Architect at Amazon Web Services. He works with our ISVs and SIs to architect healthcare solutions on AWS, and bring the best possible experience to their customers. His passion is working at the intersection of science, big data, and software. In his spare time, he’s out hiking or learning a new thing to cook.

 

 


Related

Analyzing Data in S3 using Amazon Athena

o_athena_10