Tag Archives: vcf

Building High-Throughput Genomics Batch Workflows on AWS: Workflow Layer (Part 4 of 4)

Post Syndicated from Andy Katz original https://aws.amazon.com/blogs/compute/building-high-throughput-genomics-batch-workflows-on-aws-workflow-layer-part-4-of-4/

Aaron Friedman is a Healthcare and Life Sciences Partner Solutions Architect at AWS

Angel Pizarro is a Scientific Computing Technical Business Development Manager at AWS

This post is the fourth in a series on how to build a genomics workflow on AWS. In Part 1, we introduced a general architecture, shown below, and highlighted the three common layers in a batch workflow:

  • Job
  • Batch
  • Workflow

In Part 2, you built a Docker container for each job that needed to run as part of your workflow, and stored them in Amazon ECR.

In Part 3, you tackled the batch layer and built a scalable, elastic, and easily maintainable batch engine using AWS Batch. This solution dynamically scaled your compute resources in response to the number of runnable jobs in your job queue and managed job placement for you.

In Part 4, you build out the workflow layer of your solution using AWS Step Functions and AWS Lambda. You then run an end-to-end genomic analysis, specifically an exome secondary analysis, which you can run many times at a cost of less than $1 per exome.

Step Functions makes it easy to coordinate the components of your applications using visual workflows. Building applications from individual components that each perform a single function lets you scale and change your workflow quickly. You can use the graphical console to arrange and visualize the components of your application as a series of steps, which simplifies building and running multi-step applications. You can change and add steps without writing code, so you can easily evolve your application and innovate faster.

An added benefit of using Step Functions to define your workflows is that the state machines you create are immutable. While you can delete a state machine, you cannot alter it after it is created. For regulated workloads where auditing is important, you can be assured that state machines you used in production cannot be altered.

In this blog post, you create a state machine that uses Lambda functions to orchestrate your batch workflow. For more information on how to create a basic state machine, please see this Step Functions tutorial.

All code related to this blog series can be found in the associated GitHub repository.

Build a state machine building block

To skip the following steps, we have provided an AWS CloudFormation template that can deploy your Step Functions state machine. You can use this in combination with the setup you did in Part 3 to quickly set up the environment in which to run your analysis.

The state machine is composed of smaller state machines that submit a job to AWS Batch, and then poll and check its execution.

The steps in this building block state machine are as follows:

  1. A job is submitted.
    Each analytical module/job has its own Lambda function for submission and calls the batchSubmitJob Lambda function that you built in the previous blog post. You will build these specialized Lambda functions in the following section.
  2. The state machine queries the AWS Batch API for the job status.
    This is also a Lambda function.
  3. The job status is checked to see if the job has completed.
    If the job status equals SUCCEEDED, proceed to log the final job status. If the job status equals FAILED, fail the execution of the state machine. In all other cases, wait 30 seconds and go back to Step 2.

Here is the JSON representing this state machine.

{
  "Comment": "A simple example that submits a Job to AWS Batch",
  "StartAt": "SubmitJob",
  "States": {
    "SubmitJob": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:<account-id>::function:batchSubmitJob",
      "Next": "GetJobStatus"
    },
    "GetJobStatus": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:<account-id>:function:batchGetJobStatus",
      "Next": "CheckJobStatus",
      "InputPath": "$",
      "ResultPath": "$.status"
    },
    "CheckJobStatus": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.status",
          "StringEquals": "FAILED",
          "End": true
        },
        {
          "Variable": "$.status",
          "StringEquals": "SUCCEEDED",
          "Next": "GetFinalJobStatus"
        }
      ],
      "Default": "Wait30Seconds"
    },
    "Wait30Seconds": {
      "Type": "Wait",
      "Seconds": 30,
      "Next": "GetJobStatus"
    },
    "GetFinalJobStatus": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:<account-id>:function:batchGetJobStatus",
      "End": true
    }
  }
}

Building the Lambda functions for the state machine

You need two basic Lambda functions for this state machine. The first one submits a job to AWS Batch and the second checks the status of the AWS Batch job that was submitted.

In AWS Step Functions, you specify an input as JSON that is read into your state machine. Each state receives the aggregate of the steps immediately preceding it, and you can specify which components a state passes on to its children. Because you are using Lambda functions to execute tasks, one of the easiest routes to take is to modify the input JSON, represented as a Python dictionary, within the Lambda function and return the entire dictionary back for the next state to consume.
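
As a minimal illustration of this pattern, a task Lambda function can simply enrich the incoming event and return the whole document for the next state to consume (the field names here are placeholders, not part of the actual workflow input):

def lambda_handler(event, context):
    # Read values that earlier states placed in the input document
    sample_id = event['sampleId']

    # ... do some work, for example submit a job and capture its ID ...
    job_id = 'example-job-id'  # placeholder for a real submission result

    # Add new keys so that downstream states can consume them, then
    # return the entire document for the next state
    event['jobId'] = job_id
    return event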

Building the batchSubmitIsaacJob Lambda function

For Step 1 above, you need a Lambda function for each of the steps in your analysis workflow. As you created a generic Lambda function in the previous post to submit a batch job (batchSubmitJob), you can use that function as the basis for the specialized functions you’ll include in this state machine. Here is such a Lambda function for the Isaac aligner.

from __future__ import print_function

import boto3
import json
import traceback

lambda_client = boto3.client('lambda')



def lambda_handler(event, context):
    try:
        # Generate output path
        bam_s3_path = '/'.join([event['resultsS3Path'], event['sampleId'], 'bam/'])

        depends_on = event['dependsOn'] if 'dependsOn' in event else []

        # Generate run command
        command = [
            '--bam_s3_folder_path', bam_s3_path,
            '--fastq1_s3_path', event['fastq1S3Path'],
            '--fastq2_s3_path', event['fastq2S3Path'],
            '--reference_s3_path', event['isaac']['referenceS3Path'],
            '--working_dir', event['workingDir']
        ]

        if 'cmdArgs' in event['isaac']:
            command.extend(['--cmd_args', event['isaac']['cmdArgs']])
        if 'memory' in event['isaac']:
            command.extend(['--memory', event['isaac']['memory']])

        # Submit Payload
        response = lambda_client.invoke(
            FunctionName='batchSubmitJob',
            InvocationType='RequestResponse',
            LogType='Tail',
            Payload=json.dumps(dict(
                dependsOn=depends_on,
                containerOverrides={
                    'command': command,
                },
                jobDefinition=event['isaac']['jobDefinition'],
                jobName='-'.join(['isaac', event['sampleId']]),
                jobQueue=event['isaac']['jobQueue']
            )))

        response_payload = response['Payload'].read()

        # Update event
        event['bamS3Path'] = bam_s3_path
        event['jobId'] = json.loads(response_payload)['jobId']
        
        return event
    except Exception as e:
        traceback.print_exc()
        raise e

In the Lambda console, create a Python 2.7 Lambda function named batchSubmitIsaacJob and paste in the above code. Use the LambdaBatchExecutionRole that you created in the previous post. For more information, see Step 2.1: Create a Hello World Lambda Function.

This Lambda function reads in the inputs passed to the state machine it is part of, formats the data for the batchSubmitJob Lambda function, invokes that Lambda function, and then modifies the event dictionary to pass on to the subsequent states. You can repeat these steps for each of the other tools; the corresponding code can be found in the tools//lambda/lambda_function.py scripts in the GitHub repo.

Building the batchGetJobStatus Lambda function

For Step 2 above, the process queries the AWS Batch DescribeJobs API action with the jobId to identify the state that the job is in. You can put this into a Lambda function to integrate it with Step Functions.

In the Lambda console, create a new Python 2.7 function with the LambdaBatchExecutionRole IAM role. Name your function batchGetJobStatus and paste in the following code. This is similar to the batch-get-job-python27 Lambda blueprint.

from __future__ import print_function

import boto3
import json

print('Loading function')

batch_client = boto3.client('batch')

def lambda_handler(event, context):
    # Log the received event
    print("Received event: " + json.dumps(event, indent=2))
    # Get jobId from the event
    job_id = event['jobId']

    try:
        response = batch_client.describe_jobs(
            jobs=[job_id]
        )
        job_status = response['jobs'][0]['status']
        return job_status
    except Exception as e:
        print(e)
        message = 'Error getting Batch Job status'
        print(message)
        raise Exception(message)

Structuring state machine input

The state machine input is structured so that general file references are included at the top level of the JSON object, and any job-specific items are contained within a nested JSON object. At a high level, this is what the input structure looks like:

{
        "general_field_1": "value1",
        "general_field_2": "value2",
        "general_field_3": "value3",
        "job1": {},
        "job2": {},
        "job3": {}
}

Building the full state machine

By chaining these state machine components together, you can quickly build flexible workflows that can process genomes in multiple ways. The larger state machine that defines the entire workflow chains together four of these building blocks and uses the Lambda functions that you built in the previous section. Rename each building block's submission step to match the tool that it invokes.

We have provided a CloudFormation template to deploy your state machine and the associated IAM roles. In the CloudFormation console, select Create Stack, choose your template (deploy_state_machine.yaml), and enter the ARNs for the Lambda functions you created.

Continue through the rest of the steps and deploy your stack. Be sure to check the box next to "I acknowledge that AWS CloudFormation might create IAM resources."

Once the CloudFormation stack is finished deploying, you can view your state machine in the Step Functions console.

In short, you first submit a job for Isaac, which is the aligner you are using for the analysis. Next, you use a Parallel state to split the output from "GetFinalIsaacJobStatus" and send it to both your variant calling step, Strelka, and your QC step, Samtools Stats. These run in parallel, and you then annotate the results from the Strelka step with snpEff.
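
The CloudFormation template deploys the full definition for you, but the overall shape of the workflow is roughly the following, shown here as a Python dictionary that could be serialized with json.dumps. The state and Lambda function names are illustrative, and the polling and status-checking states of each building block are collapsed for brevity:

workflow_skeleton = {
    "Comment": "Isaac alignment, then Strelka plus snpEff and Samtools Stats in parallel",
    "StartAt": "SubmitIsaacJob",
    "States": {
        # Alignment building block (its polling states are collapsed here)
        "SubmitIsaacJob": {
            "Type": "Task",
            "Resource": "<batchSubmitIsaacJob-arn>",
            "Next": "RunInParallel"
        },
        "RunInParallel": {
            "Type": "Parallel",
            "End": True,
            "Branches": [
                {   # Variant calling followed by annotation
                    "StartAt": "SubmitStrelkaJob",
                    "States": {
                        "SubmitStrelkaJob": {
                            "Type": "Task",
                            "Resource": "<batchSubmitStrelkaJob-arn>",
                            "Next": "SubmitSnpEffJob"
                        },
                        "SubmitSnpEffJob": {
                            "Type": "Task",
                            "Resource": "<batchSubmitSnpEffJob-arn>",
                            "End": True
                        }
                    }
                },
                {   # QC branch
                    "StartAt": "SubmitSamtoolsStatsJob",
                    "States": {
                        "SubmitSamtoolsStatsJob": {
                            "Type": "Task",
                            "Resource": "<batchSubmitSamtoolsStatsJob-arn>",
                            "End": True
                        }
                    }
                }
            ]
        }
    }
}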

Putting it all together

Now that you have built all of the components for a genomics secondary analysis workflow, test the entire process.

We have provided sequences from an Illumina sequencer that cover a region of the genome known as the exome. Most of the positions in the genome that we have currently associated with disease or human traits reside in this region, which is 1–2% of the entire genome. The workflow that you have built works for analyzing both an exome and an entire genome.

Additionally, we have provided prebuilt reference genomes for Isaac, located at:

s3://aws-batch-genomics-resources/reference/

If you are interested, we have provided a script that sets up all of that data. To execute that script, run the following command on a large EC2 instance:

make reference REGISTRY=<your-ecr-registry>

Indexing and preparing this reference takes many hours on a large-memory EC2 instance. Be careful about the costs involved and note that the data is available through the prebuilt reference genomes.

Starting the execution

In a previous section, you defined the structure of the JSON that is fed into your state machine. For convenience, we have pre-populated this input JSON for you. You can also find it in the GitHub repo under workflow/test.input.json:

{
  "fastq1S3Path": "s3://aws-batch-genomics-resources/fastq/SRR1919605_1.fastq.gz",
  "fastq2S3Path": "s3://aws-batch-genomics-resources/fastq/SRR1919605_2.fastq.gz",
  "referenceS3Path": "s3://aws-batch-genomics-resources/reference/hg38.fa",
  "resultsS3Path": "s3://<bucket>/genomic-workflow/results",
  "sampleId": "NA12878_states_1",
  "workingDir": "/scratch",
  "isaac": {
    "jobDefinition": "isaac-myenv:1",
    "jobQueue": "arn:aws:batch:us-east-1:<account-id>:job-queue/highPriority-myenv",
    "referenceS3Path": "s3://aws-batch-genomics-resources/reference/isaac/"
  },
  "samtoolsStats": {
    "jobDefinition": "samtools_stats-myenv:1",
    "jobQueue": "arn:aws:batch:us-east-1:<account-id>:job-queue/lowPriority-myenv"
  },
  "strelka": {
    "jobDefinition": "strelka-myenv:1",
    "jobQueue": "arn:aws:batch:us-east-1:<account-id>:job-queue/highPriority-myenv",
    "cmdArgs": " --exome "
  },
  "snpEff": {
    "jobDefinition": "snpeff-myenv:1",
    "jobQueue": "arn:aws:batch:us-east-1:<account-id>:job-queue/lowPriority-myenv",
    "cmdArgs": " -t hg38 "
  }
}

You are now ready to run your full genomic analysis. Copy the above to a new text file, change the paths and ARNs to the ones that you created previously, and save your JSON input as input.states.json.

In the CLI, execute the following command. You need the ARN of the state machine that you created earlier in this post:

aws stepfunctions start-execution --state-machine-arn <your-state-machine-arn> --input file://input.states.json

Your analysis has now started. By using Spot Instances with AWS Batch, you can quickly scale out your workflows while concurrently optimizing for cost. While this is not guaranteed, most executions of the workflows presented here should cost under $1 for a full analysis.
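
You can also start executions from Python rather than the CLI. Here is a minimal boto3 sketch that uses the same state machine ARN and input file:

import boto3

sfn = boto3.client('stepfunctions')

with open('input.states.json') as f:
    workflow_input = f.read()

response = sfn.start_execution(
    stateMachineArn='<your-state-machine-arn>',
    input=workflow_input
)

# The execution ARN identifies this specific run for monitoring
print(response['executionArn'])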

Monitoring the execution

The output from the above CLI command gives you the ARN that describes the specific execution. Copy that and navigate to the Step Functions console. Select the state machine that you created previously and paste the ARN into the search bar.

The screen shows information about your specific execution. On the left, you see where your execution currently is in the workflow.

As the execution proceeds, you can see that your workflow has successfully completed the alignment job and moved on to the subsequent steps, which are variant calling and generating quality information about your sample.

You can also navigate to the AWS Batch console and see the progress of all of your jobs reflected there as well.
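
If you prefer to poll the execution status programmatically, here is a minimal boto3 sketch, assuming you have the execution ARN returned by start-execution:

import time

import boto3

sfn = boto3.client('stepfunctions')
execution_arn = '<your-execution-arn>'

# Poll once a minute until the execution leaves the RUNNING state
while True:
    status = sfn.describe_execution(executionArn=execution_arn)['status']
    print('Execution status:', status)
    if status != 'RUNNING':
        break
    time.sleep(60)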

Finally, after your workflow has completed successfully, check out the S3 path to which you wrote all of your files. If you run an aws s3 ls --recursive command on the S3 results path, specified in the input to your state machine execution, you should see something similar to the following:

2017-05-02 13:46:32 6475144340 genomic-workflow/results/NA12878_run1/bam/sorted.bam
2017-05-02 13:46:34    7552576 genomic-workflow/results/NA12878_run1/bam/sorted.bam.bai
2017-05-02 13:46:32         45 genomic-workflow/results/NA12878_run1/bam/sorted.bam.md5
2017-05-02 13:53:20      68769 genomic-workflow/results/NA12878_run1/stats/bam_stats.dat
2017-05-02 14:05:12        100 genomic-workflow/results/NA12878_run1/vcf/stats/runStats.tsv
2017-05-02 14:05:12        359 genomic-workflow/results/NA12878_run1/vcf/stats/runStats.xml
2017-05-02 14:05:12  507577928 genomic-workflow/results/NA12878_run1/vcf/variants/genome.S1.vcf.gz
2017-05-02 14:05:12     723144 genomic-workflow/results/NA12878_run1/vcf/variants/genome.S1.vcf.gz.tbi
2017-05-02 14:05:12  507577928 genomic-workflow/results/NA12878_run1/vcf/variants/genome.vcf.gz
2017-05-02 14:05:12     723144 genomic-workflow/results/NA12878_run1/vcf/variants/genome.vcf.gz.tbi
2017-05-02 14:05:12   30783484 genomic-workflow/results/NA12878_run1/vcf/variants/variants.vcf.gz
2017-05-02 14:05:12    1566596 genomic-workflow/results/NA12878_run1/vcf/variants/variants.vcf.gz.tbi

Modifications to the workflow

You have now built and run your genomics workflow. While diving deep into modifications to this architecture is beyond the scope of these posts, we wanted to leave you with several suggestions of how you might modify this workflow to satisfy additional business requirements.

  • Job tracking with Amazon DynamoDB
    In many cases, such as if you are offering Genomics-as-a-Service, you might want to track the state of your jobs with DynamoDB to get fine-grained records of how your jobs are running. This way, you can easily identify the cost of individual jobs and workflows that you run.
  • Resuming from failure
    Both AWS Batch and Step Functions natively support job retries and can cover many of the standard cases where a job might be interrupted. There may be cases, however, where your workflow fails in a way that is unpredictable. In this case, you can use custom error handling with AWS Step Functions to build out a workflow that is even more resilient. You can also build fail states into your state machine so that it fails at any point you choose, such as when a batch job fails after a certain number of retries (see the Retry/Catch sketch after this list).
  • Invoking Step Functions from Amazon API Gateway
    You can use API Gateway to build an API that acts as a "front door" to Step Functions. You can create a POST method that contains the input JSON to feed into the state machine you built. For more information, see the Implementing Serverless Manual Approval Steps in AWS Step Functions and Amazon API Gateway blog post.
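
As an example of the custom error handling mentioned above, a Task state in one of the building blocks could declare Retry and Catch fields so that transient errors are retried with backoff and anything else routes to a terminal Fail state. This is only a sketch, expressed as a Python dictionary; the state and function names are illustrative:

submit_job_state = {
    "Type": "Task",
    "Resource": "<batchSubmitIsaacJob-arn>",
    # Retry transient task failures with exponential backoff
    "Retry": [{
        "ErrorEquals": ["States.TaskFailed"],
        "IntervalSeconds": 30,
        "MaxAttempts": 3,
        "BackoffRate": 2.0
    }],
    # Route anything that still fails to a Fail state defined elsewhere in the machine
    "Catch": [{
        "ErrorEquals": ["States.ALL"],
        "Next": "WorkflowFailed"
    }],
    "Next": "GetJobStatus"
}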

Conclusion

While the approach we have demonstrated in this series has been focused on genomics, it is important to note that this can be generalized to nearly any high-throughput batch workload. We hope that you have found the information useful and that it can serve as a jump-start to building your own batch workloads on AWS with native AWS services.

For more information about how AWS can enable your genomics workloads, be sure to check out the AWS Genomics page.

Please leave any questions and comments below.

Interactive Analysis of Genomic Datasets Using Amazon Athena

Post Syndicated from Aaron Friedman original https://aws.amazon.com/blogs/big-data/interactive-analysis-of-genomic-datasets-using-amazon-athena/

Aaron Friedman is a Healthcare and Life Sciences Solutions Architect with Amazon Web Services

The genomics industry is in the midst of a data explosion. Due to the rapid drop in the cost to sequence genomes, genomics is now central to many medical advances. When your genome is sequenced and analyzed, raw sequencing files are processed in a multi-step workflow to identify where your genome differs from a standard reference. Your variations are stored in a Variant Call Format (VCF) file, which is then combined with other individuals to enable population-scale analyses. Many of these datasets are publicly available, and an increasing number are hosted on AWS as part of our Open Data project.

To mine genomic data for new discoveries, researchers in both industry and academia build complex models to analyze populations at scale. When building models, they first explore the datasets-of-interest to understand what questions the data might answer. In this step, interactivity is key, as it allows them to move easily from one question to the next.

Recently, we launched Amazon Athena as an interactive query service to analyze data on Amazon S3. With Amazon Athena there are no clusters to manage and tune, no infrastructure to set up or manage, and customers pay only for the queries they run. Athena is able to query many file types straight from S3. This flexibility gives you the ability to interact easily with your datasets, whether they are in a raw text format (CSV/JSON) or specialized formats (e.g., Parquet). By being able to flexibly query different types of data sources, researchers can more rapidly progress through the data exploration phase for discovery. Additionally, researchers don’t have to know the nuances of managing and running a big data system. This makes Athena an excellent complement to data warehousing on Amazon Redshift and big data analytics on Amazon EMR.

In this post, I discuss how to prepare genomic data for analysis with Amazon Athena and demonstrate how Athena is well adapted to address common genomics query paradigms. I use the Thousand Genomes dataset, a seminal genomics study whose data is hosted on Amazon S3, to demonstrate these approaches. All code that is used as part of this post is available in our GitHub repository.

Although this post is focused on genomic analysis, similar approaches can be applied to any discipline where large-scale, interactive analysis is required.

 

Select, aggregate, annotate query pattern in genomics

Genomics researchers may ask different questions of their dataset, such as:

  • What variations in a genome may increase the risk of developing disease?
  • What positions in the genome have abnormal levels of variation, suggesting issues in quality of sequencing or errors in the genomic reference?
  • What variations in a genome influence how an individual may respond to a specific drug treatment?
  • Does a group of individuals contain a higher frequency of a genomic variant known to alter response to a drug relative to the general population?

All these questions, and more, can be generalized under a common query pattern I like to call “Select, Aggregate, Annotate”. Some of our genomics customers, such as Human Longevity, Inc., routinely use this query pattern in their work.

In each of the above queries, you execute the following steps:

SELECT: Specify the cohort of individuals meeting certain criteria (disease, drug response, age, BMI, entire population, etc.).

AGGREGATE: Generate summary statistics of genomic variants across the cohort that you selected.

ANNOTATE: Assign meaning to each of the variants by joining on known information about each variant.

Dataset preparation

Properly organizing your dataset is one of the most critical decisions for enabling fast, interactive analyses. Based on the query pattern I just described, the table representing your population needs to have the following information:

  • A unique sample ID corresponding to each sample in your population
  • Information about each variant, specifically its location in the genome as well as the specific deviation from the reference
  • Information about how many times in a sample a variant occurs (0, 1, or 2 times) as well as if there are multiple variants in the same site. This is known as a genotype.

The extract, transform, load (ETL) process to generate the appropriate data representation has two main steps. First, you use ADAM, a genomics analysis platform built on top of Spark, to convert the variant information residing in a VCF file to Parquet for easier downstream analytics, in a process similar to the one described in the Will Spark Power the Data behind Precision Medicine? post. Then, you use custom Python code to massage the data and select only the appropriate fields that you need for analysis with Athena.

First, spin up an EMR cluster (version 5.0.3) for the ETL process. I used a c4.xlarge for my master node and m4.4xlarges with 1 TB of scratch for my core nodes.

After you SSH into your master node, clone the git repository. You can also put this in as a bootstrap action when spinning up your cluster.

sudo yum -y install git
git clone https://github.com/awslabs/aws-big-data-blog.git 

You then need to install ADAM and configure it to run with Spark 2.0. In your terminal, enter the following:

# Install Maven
wget http://apache.claz.org/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
tar xvzf apache-maven-3.3.9-bin.tar.gz
export PATH=/home/hadoop/apache-maven-3.3.9/bin:$PATH

# Install ADAM
git clone https://github.com/bigdatagenomics/adam.git
cd adam
sed -i 's/2.6.0/2.7.2/' pom.xml
./scripts/move_to_spark_2.sh
export MAVEN_OPTS="-Xmx512m -XX:MaxPermSize=256m"
mvn clean package -DskipTests

export PATH=/home/hadoop/adam/bin:$PATH

Now that ADAM is installed, enter the working directory of the repo that you first cloned, which contains the code relevant to this post. Your next step is to convert a VCF containing all the information about your population.

For this post, you are going to focus on a single region in the genome, defined as chromosome 22, for all 2,504 samples in the Thousand Genomes dataset. As chromosome 22 was the first chromosome to be sequenced as part of the Human Genome Project, it is your first here as well. You can scale the size of your cluster depending on whether you are looking at just chromosome 22, or the entire genome. In the following lines, replace mytestbucket with your bucket name of choice.

PARQUETS3PATH=s3://<mytestbucket>/thousand_genomes/grch37/chr22.parquet/
cd ~/aws-big-data-blog/aws-blog-athena-genomics/etl/thousand_genomes/
chmod +x ./convert_vcf.sh
./convert_vcf.sh $PARQUETS3PATH

After this conversion completes, you can reduce your Parquet file to only the fields you need.

TRIMMEDS3PATH=s3://<mytestbucket>/thousand_genomes/grch37_trimmed/chr22.parquet/
spark-submit --master yarn --executor-memory 40G --deploy-mode cluster create_trimmed_parquet.py --input_s3_path $PARQUETS3PATH --output_s3_path $TRIMMEDS3PATH 
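
The details live in create_trimmed_parquet.py in the repository, but the essence of such a trimming step is a column selection in PySpark. The following is only a rough sketch; the nested field names are assumptions about the ADAM schema, not the exact names the script uses, so inspect the printSchema() output before adapting it:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('trim-chr22-parquet').getOrCreate()

genotypes = spark.read.parquet('s3://<mytestbucket>/thousand_genomes/grch37/chr22.parquet/')

# Inspect the ADAM schema first; the field names below are placeholders
genotypes.printSchema()

# Keep only the columns the Athena table needs. The real script also derives
# the genotype0/genotype1 columns from the genotype calls.
trimmed = genotypes.select(
    F.col('variant.contigName').alias('chromosome'),
    F.col('variant.start').alias('startposition'),
    F.col('variant.end').alias('endposition'),
    F.col('variant.referenceAllele').alias('referenceallele'),
    F.col('variant.alternateAllele').alias('alternateallele'),
    F.col('sampleId').alias('sampleid'),
)

trimmed.write.parquet('s3://<mytestbucket>/thousand_genomes/grch37_trimmed/chr22.parquet/')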

Annotation data preparation

For this post, use the ClinVar dataset, which is an archive of information about variation in genomes and health. While Parquet is a preferred format for Athena relative to raw text, using the ClinVar TXT file demonstrates the ability of Athena to read multiple file types.

In a Unix terminal, execute the following commands (replace mytestbucket with your bucket name of choice):

ANNOPATH=s3://<mytestbucket>/clinvar/

wget ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz

# Need to strip out the first line (header)
zcat variant_summary.txt.gz | sed '1d' > temp ; mv -f temp variant_summary.trim.txt ; gzip variant_summary.trim.txt

aws s3 cp variant_summary.trim.txt.gz $ANNOPATH --sse

Creating Tables using Amazon Athena

In the recent Amazon Athena – Interactive SQL Queries for Data in Amazon S3 post, Jeff Barr covered how to get started with Athena and navigate to the console. Athena uses the Hive DDL when you create, alter, or drop a table. This means you can use the standard Hive syntax for creating tables. These commands can also be found in the aws-blog-athena-genomics GitHub repo under sql/setup.

First, create your database:

CREATE DATABASE demo;

Next, define your tables with the appropriate mappings from the previous ETL processes. Be sure to change the LOCATION to point to your own bucket.

For your population data:

If you want to skip the ETL process described above for your population data, you can use the following path for your analysis: s3://aws-bigdata-blog/artifacts/athena_genomics.parquet/ – be sure to substitute it in for the LOCATION below.

CREATE EXTERNAL TABLE demo.samplevariants
(
  alternateallele STRING
  ,chromosome STRING
  ,endposition BIGINT
  ,genotype0 STRING
  ,genotype1 STRING
  ,referenceallele STRING
  ,sampleid STRING
  ,startposition BIGINT
)
STORED AS PARQUET
LOCATION 's3://<mytestbucket>/thousand_genomes/grch37_trimmed/chr22.parquet/';

For your annotation data:

CREATE EXTERNAL TABLE demo.clinvar
(
    alleleId STRING
    ,variantType STRING
    ,hgvsName STRING
    ,geneID STRING
    ,geneSymbol STRING
    ,hgncId STRING
    ,clinicalSignificance STRING
    ,clinSigSimple STRING
    ,lastEvaluated STRING
    ,rsId STRING
    ,dbVarId STRING
    ,rcvAccession STRING
    ,phenotypeIds STRING
    ,phenotypeList STRING
    ,origin STRING
    ,originSimple STRING
    ,assembly STRING
    ,chromosomeAccession STRING
    ,chromosome STRING
    ,startPosition INT
    ,endPosition INT
    ,referenceAllele STRING
    ,alternateAllele STRING
    ,cytogenetic STRING
    ,reviewStatus STRING
    ,numberSubmitters STRING
    ,guidelines STRING
    ,testedInGtr STRING
    ,otherIds STRING
    ,submitterCategories STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://<mytestbucket>/clinvar/';

You can confirm that both tables have been created by choosing the eye icon next to a table name in the console. This automatically runs a query akin to SELECT * FROM table LIMIT 10. If it succeeds, then your data has been loaded successfully.
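
You can run the same sanity check from the AWS SDK instead of the console. Here is a minimal boto3 sketch; the query results location is an assumption that you should replace with your own bucket:

import time

import boto3

athena = boto3.client('athena')

query = athena.start_query_execution(
    QueryString='SELECT * FROM samplevariants LIMIT 10',
    QueryExecutionContext={'Database': 'demo'},
    ResultConfiguration={'OutputLocation': 's3://<mytestbucket>/athena-results/'}
)
execution_id = query['QueryExecutionId']

# Wait for the query to finish, then count the returned rows
while True:
    state = athena.get_query_execution(QueryExecutionId=execution_id)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(2)

if state == 'SUCCEEDED':
    rows = athena.get_query_results(QueryExecutionId=execution_id)['ResultSet']['Rows']
    print(len(rows), 'rows returned (including the header row)')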


Applying the select, aggregate, annotate paradigm

In this section, I walk you through how to use the query pattern described earlier to answer two common questions in genomics. You examine the frequency of different genotypes, where a genotype here is a distinct combination of all fields in the samplevariants table with the exception of sampleid. These queries can also be found in the aws-blog-athena-genomics GitHub repo under sql/queries.

Population drug response

What small molecules/drugs are most likely to affect a subpopulation of individuals (ancestry, age, etc.) based on their genomic information?

In this query, assume that you have some phenotype data about your population. In this case, also assume that all samples sharing the pattern “NA12” are part of a specific demographic.

In this query, use sampleid as your predicate pushdown. The general steps are:

  1. Filter by the samples in your subpopulation
  2. Aggregate variant frequencies for the subpopulation-of-interest
  3. Join on ClinVar dataset
  4. Filter by variants that have been implicated in drug-response
  5. Order by highest frequency variants

To answer this question, you can craft the following query:

SELECT
    count(*)/cast(numsamples AS DOUBLE) AS genotypefrequency
    ,cv.rsid
    ,cv.phenotypelist
    ,sv.chromosome
    ,sv.startposition
    ,sv.endposition
    ,sv.referenceallele
    ,sv.alternateallele
    ,sv.genotype0
    ,sv.genotype1
FROM demo.samplevariants sv
CROSS JOIN
    (SELECT count(1) AS numsamples
    FROM
        (SELECT DISTINCT sampleid
        FROM demo.samplevariants
        WHERE sampleid LIKE 'NA12%'))
JOIN demo.clinvar cv
ON sv.chromosome = cv.chromosome
    AND sv.startposition = cv.startposition - 1
    AND sv.endposition = cv.endposition
    AND sv.referenceallele = cv.referenceallele
    AND sv.alternateallele = cv.alternateallele
WHERE assembly='GRCh37'
    AND cv.clinicalsignificance LIKE '%response%'
    AND sampleid LIKE 'NA12%'
GROUP BY  sv.chromosome
          ,sv.startposition
          ,sv.endposition
          ,sv.referenceallele
          ,sv.alternateallele
          ,sv.genotype0
          ,sv.genotype1
          ,cv.clinicalsignificance
          ,cv.phenotypelist
          ,cv.rsid
          ,numsamples
ORDER BY  genotypefrequency DESC LIMIT 50

Enter the query into the console and run it.


When you inspect the results, you can quickly see (often in a matter of seconds!) that this population of individuals has a high frequency of variants associated with the metabolism of debrisoquine.


Quality control

Are you systematically finding locations in your reference that are called as variation? In other words, what positions in the genome have abnormal levels of variation, suggesting issues in the quality of sequencing or errors in the reference? Are any of these variants clinically implicated? If so, could this be a false positive clinical finding?

In this query, the entire population is needed so the SELECT predicate is not used. The general steps are:

  1. Aggregate variant frequencies for the entire Thousand Genomes population
  2. Join on the ClinVar dataset
  3. Filter by variants that have been implicated in disease
  4. Order by the highest frequency variants

This translates into the following query:

SELECT
    count(*)/cast(numsamples AS DOUBLE) AS genotypefrequency
    ,cv.clinicalsignificance
    ,cv.phenotypelist
    ,sv.chromosome
    ,sv.startposition
    ,sv.endposition
    ,sv.referenceallele
    ,sv.alternateallele
    ,sv.genotype0
    ,sv.genotype1
FROM demo.samplevariants sv
CROSS JOIN
    (SELECT count(1) AS numsamples
    FROM
        (SELECT DISTINCT sampleid
        FROM demo.samplevariants))
    JOIN demo.clinvar cv
    ON sv.chromosome = cv.chromosome
        AND sv.startposition = cv.startposition - 1
        AND sv.endposition = cv.endposition
        AND sv.referenceallele = cv.referenceallele
        AND sv.alternateallele = cv.alternateallele
WHERE assembly='GRCh37'
        AND cv.clinsigsimple='1'
GROUP BY  sv.chromosome ,
          sv.startposition
          ,sv.endposition
          ,sv.referenceallele
          ,sv.alternateallele
          ,sv.genotype0
          ,sv.genotype1
          ,cv.clinicalsignificance
          ,cv.phenotypelist
          ,numsamples
ORDER BY  genotypefrequency DESC LIMIT 50

When you inspect the results, you see that the highest frequency variants have conflicting information, being listed both as potentially causing disease and as benign. In genomics, variants with a higher frequency are less likely to cause disease. Your quick analysis allows you to discount the pathogenic clinical significance annotations of these high genotype frequency variants.


 

Summary

I hope you have seen how you can easily integrate datasets of different types and values, and combine them to quickly derive new meaning from your data. Do you have a dataset you want to explore or want to learn more about Amazon Athena? Check out our Getting Started Guide and happy querying!

If you have questions or suggestions, please comment below.


About the author

Dr. Aaron Friedman is a Healthcare and Life Sciences Partner Solutions Architect at Amazon Web Services. He works with our ISVs and SIs to architect healthcare solutions on AWS and bring the best possible experience to their customers. His passion is working at the intersection of science, big data, and software. In his spare time, he’s out hiking or learning a new thing to cook.

 

 


Related

Analyzing Data in S3 using Amazon Athena


Will Spark Power the Data behind Precision Medicine?

Post Syndicated from Christopher Crosbie original https://blogs.aws.amazon.com/bigdata/post/Tx1GE3J0NATVJ39/Will-Spark-Power-the-Data-behind-Precision-Medicine

Christopher Crosbie is a Healthcare and Life Science Solutions Architect with Amazon Web Services.
This post was co-authored by Ujjwal Ratan, a Solutions Architect with Amazon Web Services.

———————————

“And that’s the promise of precision medicine — delivering the right treatments, at the right time, every time to the right person.“ (President Obama, 2015 State of the Union address)

The promise of precision medicine that President Obama envisions with this statement is a far-reaching goal that will require sweeping changes to the ways physicians treat patients, health data is collected, and global collaborative research is performed. Precision medicine typically describes an approach for treating and preventing disease that takes into account a patient’s individual variation in genes, lifestyle, and environment. Achieving this mission relies on the intersection of several technology innovations and a major restructuring of health data to focus on the genetic makeup of an individual.

The healthcare ecosystem has chosen a variety of tools and techniques for working with big data, but one tool that comes up again and again in many of the architectures we design and review is Spark on Amazon EMR.

Spark is already known for being a major player in big data analysis, but it is also uniquely capable of advancing genomics algorithms, given the complex nature of genomics research. This post introduces gene analysis using Spark on EMR and ADAM for those new to precision medicine.

Development of precision medicine

Data-driven, tailored care has long been commonplace for certain treatments, such as blood transfusions. Historically, however, most treatment plans are deeply subjective, due to the many disparate pieces of information that physicians must tie together to make a health plan based on the individual’s specific conditions.

To move past the idiosyncratic nature of most medical treatments, we need to amass properly collected and curated biological data to compare and correlate outcomes and biomarkers across varying patient populations.

The data of precision medicine inherently involves representing large numbers of living, mobile, and irrationally complex humans. The recent blog post How The Healthcare of Tomorrow is Being Delivered Today detailed the ways in which AWS underlies some of the most advanced and innovative big data technologies in precision medicine. Technologies like these will enable the research and analysis of the data structures necessary to create true individualized care.

Genomics data sets require exploration

The study of genomics dates back much further than the Obama era. The field benefits from the results of a prodigious amount of research spanning Gregor Mendel’s pea pods in the 1860s to the Human Genome Project of the 1990s.

As with most areas of science, building knowledge often goes hand in hand with legacy analysis features that turn out to be outdated as the discipline evolves. The generation of scientists following Mendel, for example, significantly altered his calculations due to the wide adoption of the p-value in statistics.

The anachronism for many of the most common genomics algorithms today is the failure to properly optimize for cloud technology. Memory requirements are often limited to a single compute node or expect a POSIX file system. These tools may have been sufficient for the computing task involved in analyzing the genome of a single person. However, a shift to cloud computing will be necessary as we move to the full-scale population studies that will be required to develop novel methods in precision medicine.

Many existing genomics algorithms must be refactored to address the scale at which research is done today. The Broad Institute of MIT and Harvard, a leading genomic research center, recognized this and moved many of the algorithms that could be pulled apart and parallelized into a MapReduce paradigm within a Genome Analysis Toolkit (GATK). Despite the MapReduce migration and many Hadoop-based projects such as BioPig, the bioinformatics community did not fully embrace the Hadoop ecosystem due to the sequential nature of the reads and the overhead associated with splitting the MapReduce tasks.

Precision medicine is also going to rely heavily on referencing public data sets. Generally available, open data sets are often downloaded and copied among many research centers. These multiple copies of the same data create an inefficiency for researchers that is addressed through the AWS Public Data Sets program. This program allows researchers to leverage popular genomics data sets like TCGA and ICGC without having to pay for raw storage.

Migrating genetics to Spark on Amazon EMR

The transitioning of genomics algorithms that are popular today to Spark is one path that scientists are taking to capture the distributed processing capabilities of the cloud. However, precision medicine will require an abundance of exploration and new approaches. Many of these are already being built on top of Spark on EMR.

This is because Spark on EMR provides much of the functionality that aligns well with the goals of precision medicine. For instance, using Amazon S3 as an extension of HDFS storage makes it easy to share the results of your analysis with collaborators from all over the world by simply allowing access to an S3 URL. It also provides the ability for researchers to adjust their cluster to the algorithm they are trying to build instead of adjusting their algorithm to the cluster to which they have access. Competition for compute resources with other cluster users is another drawback that can be mitigated with a move towards EMR.

The introduction of Spark has now overcome many of the previous limitations associated with parallelizing the data based on the index files that accompany the standard Variant Call Format (VCF) and Binary Alignment Map (BAM) files used by genomics researchers. The AMPLab at UC Berkeley, through projects like ADAM, demonstrated the value of a Spark infrastructure for genomics data processing.

Moreover, genomics scientists began identifying the need to get away from the undifferentiated heavy lifting of developing custom distributed system code. These developments motivated the Broad to develop the next version of the GATK with an option to run in the cloud on Spark. The upcoming GATK is a major step forward for the scientific community, since it will soon be able to incorporate many of the features of EMR, such as on-demand clusters of various types and Amazon S3–backed storage.

Although Spark on EMR provides many infrastructure advantages, Spark still speaks the languages that are popular with the research community. Languages like SparkR on EMR make for an easy transition into the cloud. That way when there are issues that arise such as needing to unpack Gzip-compressed files to make them split-able, familiar R code can be used to make the transformation.

What’s under the hood of ADAM?

ADAM is an in-memory MapReduce system. ADAM is the result of contributions from universities, biotech, and pharma companies. It is entirely open source under the Apache 2 license. ADAM uses Apache Avro and Parquet file formats for data storage to provide a schema-based description of genomic data. This eliminates the dependencies on format-based libraries, which in the past has created incompatibilities.

Apache Parquet is a columnar storage format that is well suited for genomic data. The primary reason is that genomic data consists of a large number of similar data items that align well with columnar storage. By using Apache Parquet as its storage format, ADAM makes querying and storing genomic data highly efficient. It limits I/O to the data actually needed by loading only the columns that need to be accessed. Because of its columnar nature, storing data in Parquet saves space as a result of better compression ratios.

The columnar Parquet file format enables up to 25% improvement in storage volume compared to compressed genomics file formats like BAM. Moreover, the in-memory caching using Spark and the ability to parallelize data processing over multiple nodes have been known to provide up to 50 times better performance on average on a 100-node cluster.

In addition to using the Parquet format for columnar storage, ADAM makes use of a new schema for genomics data referred to as bdg-formats, a project that provides schemas for describing common genomic data types such as variants, assemblies, and genotypes. It uses Apache Avro as its base framework and as a result, works well with common programming languages and platforms. By using the Apache Avro-based schema as its core data structure, workflows built using ADAM are flexible and easy to maintain using familiar technologies.

ADAM on Amazon EMR

In a typical genomic workload designed on AWS using ADAM, file formats like SAM/BAM and VCF are uploaded as objects on Amazon S3.

Amazon EMR provides a way to run Spark compute jobs with ADAM and other applications on an as-needed basis while keeping the genomics data itself stored in separate, cheap object storage.

Your first genomics analysis with ADAM on Amazon EMR

To install ADAM on EMR, launch an EMR cluster from the AWS Management Console. Make sure you select the option to install Spark, as shown in the screen shot below.

While launching the EMR cluster, you should configure the EMR master security groups to allow you access to port 22 so you can SSH to the master node. The master node should be able to communicate to the Internet to download the necessary packages and build ADAM.

ADAM installation steps

During the creation of a cluster, a shell script known as a bootstrap action can automate the installation process of ADAM. In case you want to install ADAM manually, the following instructions walk through what the script does.

After the EMR cluster with Apache Spark is up and running, you can log in to the master node using SSH and install ADAM. ADAM requires Maven to build its packages; after you install Maven, you can clone ADAM from the ADAM GitHub repository.

SSH into the master node of the EMR cluster.

Install Maven by downloading the apache libraries from its website as shown below:

/* create a new directory to install maven */
mkdir maven
cd maven
/* download the maven zip from the apache mirror website */
echo "copying the maven zip file from the apache mirror"
wget http://apachemirror.ovidiudan.com/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
echo "unzipping the file"
tar -zxvf apache-maven-3.3.9-bin.tar.gz
/* add the maven bin directory to the PATH */
echo "adding maven to the PATH"
export PATH=$HOME/maven/apache-maven-3.3.9/bin:$PATH

Before cloning ADAM from GitHub, install Git on the master node of the EMR cluster. This can be done by running the following command:

/* install git in the home directory */
cd $HOME
sudo yum install git

Clone the ADAM repository from GitHub and begin your build.

/* clone the ADAM repository from GitHub and build the package */
git clone https://github.com/bigdatagenomics/adam.git
cd adam

Finally, begin your build

export MAVEN_OPTS="-Xmx512m -XX:MaxPermSize=256m"
mvn clean package -DskipTests

On completion, you see a message confirming that ADAM was built successfully.

Analysis of genomic data using ADAM

After installation, ADAM provides a set of functions to transform and analyze genomic data sets. The functions range from actions that count k-mers to conversion operations that convert standard genomics file formats like SAM, BAM, or VCF to ADAM Parquet. A list of ADAM functions can be viewed by invoking adam-submit from the command line.

For the purposes of this analysis, you use a VCF file from an AWS public data set, specifically the 1000 Genomes Project.

This is a project that aims to build the most detailed map of human genetic variation available. When complete, the 1000 Genomes Project will have genomic data from over 2,661 people. Amazon S3 hosts the initial pilot data for this project in a public S3 bucket. The latest data set available to everyone hosts data for approximately 1,700 people and is more than 200 TB in size.

From the master node of the EMR cluster, you can connect to the S3 bucket and download the file to the master node. Copy the file into HDFS so that it can be accessed using ADAM, and then convert it to ADAM Parquet. This can be done by running the following commands:

//copying a vcf file from S3 to master node
$ aws s3 cp s3://1000genomes/phase1/analysis_results/integrated_call_sets/ALL.chr1.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz /home/hadoop/

// Unzip the file
$ gunzip ALL.chr1.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz

//copying file to hdfs
$ hadoop fs -put /home/hadoop/ALL.chr1.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf /user/hadoop/

//convert the VCF to ADAM Parquet (the vcf2adam action; adjust the action name to your ADAM version)
$ adam-submit vcf2adam /user/hadoop/ALL.chr1.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf /user/hadoop/adamfiles

The command generates more than 690 ADAM Parquet files. The following screenshot shows a subset of these files.

You can now process the ADAM Parquet files as regular Parquet files in Apache Spark using Scala. ADAM provides its own shell, which is invoked from the command line using adam-shell. In this example, read the ADAM Parquet files as a data frame and print the schema. Then register the data frame as a temp table and use it to query the genomic data sets.

val gnomeDF = sqlContext.read.parquet("/user/hadoop/adamfiles/")
gnomeDF.printSchema()
gnomeDF.registerTempTable("gnome")
val gnome_data = sqlContext.sql("select count(*) from gnome")
gnome_data.show()

Here’s how to use the ADAM interactive shell to perform a read count and a k-mer count. A k-mer count returns all of the substrings of length k in the DNA sequence. This is an important first step toward understanding all of the possible DNA sequences (here, of length 20) that are contained in this file.

import org.bdgenomics.adam.rdd.ADAMContext._
val reads = sc.loadAlignments("part-r-00000.gz.parquet").cache()
reads.count()
val gnomeDF = sqlContext.read.parquet("part-r-00000.gz.parquet")
gnomeDF.printSchema()
println(reads.first)

/*
The following command cuts reads into _k_-mers, and then counts the number of
occurrences of each _k_-mer
*/
val kmers = reads.adamCountKmers(20).cache()

kmers.count()
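
For readers new to k-mers, the following plain Python snippet (independent of ADAM and Spark) illustrates what a k-mer count computes:

from collections import Counter

def count_kmers(sequence, k):
    # Count every substring of length k in a DNA sequence
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

print(count_kmers('ACGTACGTAC', 4))
# Counter({'ACGT': 2, 'CGTA': 2, 'GTAC': 2, 'TACG': 1})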

Conclusion  

Spark on Amazon EMR provides unique advantages over traditional genomic processing and may become a necessary tool as genomics moves into the scale of population-based studies required for precision medicine. Innovative companies like Human Longevity Inc. have discussed at re:Invent  how they use AWS tools such as Spark on EMR with Amazon Redshift to build a platform that is pushing precision medicine forward.

Of course, simply distributing compute resources will not solve all of the complexities associated with understanding the human condition. There is an old adage that reminds us that nine women cannot be used to make a baby in one month. Often in biology problems, there is a need to wait for one step to complete before the next can begin. This procedure does not necessarily lend itself well to Spark, which benefits from distributing many small tasks at once. There are still many areas, such as sequence assembly, that may not have an easy transition to Spark. However, Spark on Amazon EMR will be a very interesting project to watch on our move toward precision medicine.

If you have questions or suggestions, please leave a comment below.

Special thanks to Angel Pizarro for his help on this post!

———————————–

Related

Extending Seven Bridges Genomics with Amazon Redshift and R

Want to learn more about Big Data or Streaming Data? Check out our Big Data and Streaming data educational pages.

 

Will Spark Power the Data behind Precision Medicine?

Post Syndicated from Christopher Crosbie original https://blogs.aws.amazon.com/bigdata/post/Tx1GE3J0NATVJ39/Will-Spark-Power-the-Data-behind-Precision-Medicine

Christopher Crosbie is a Healthcare and Life Science Solutions Architect with Amazon Web Services.
This post was co-authored by Ujjwal Ratan, a Solutions Architect with Amazon Web Services.

———————————

“And that’s the promise of precision medicine — delivering the right treatments, at the right time, every time to the right person.“ (President Obama, 2015 State of the Union address)

The promise of precision medicine that President Obama envisions with this statement is a far-reaching goal that will require sweeping changes to the ways physicians treat patients, health data is collected, and global collaborative research is performed. Precision medicine typically describes an approach for treating and preventing disease that takes into account a patient’s individual variation in genes, lifestyle, and environment. Achieving this mission relies on the intersection of several technology innovations and a major restructuring of health data to focus on the genetic makeup of an individual.

The healthcare ecosystem has chosen a variety of tools and techniques for working with big data, but one tool that comes up again and again in many of the architectures we design and review is Spark on Amazon EMR.

Spark is already known for being a major player in big data analysis, but it is additionally uniquely capable in advancing genomics algorithms given the complex nature of genomics research. This post introduces gene analysis using Spark on EMR and ADAM, for those new to precision medicine.

Development of precision medicine

Data-driven tailored treatments have been commonplace for certain treatments like blood transfusions for a long time. Historically, however, most treatment plans are deeply subjective, due to the many disparate pieces of information that physicians must tie together to make a health plan based on the individual’s specific conditions.

To move past the idiosyncratic nature of most medical treatments, we need to amass properly collected and curated biological data to compare and correlate outcomes and biomarkers across varying patient populations.

The data of precision medicine inherently involves the data representation of large volumes of living, mobile, and irrationally complex humans. The recent blog post How The Healthcare of Tomorrow is Being Delivered Today detailed the ways in which AWS underlies some of the most advanced and innovative big data technologies in precision medicine. Technologies like these will enable the research and analysis of the data structures necessary to create true individualized care.

Genomics data sets require exploration

The study of genomics dates back much further than the Obama era. The field benefits from the results of a prodigious amount of research spanning Gregor Mendel’s pea pods in the 1860s to the Human Genome Project of the 1990s.

As with most areas of science, building knowledge often goes hand in hand with legacy analysis features that turn out to be outdated as the discipline evolves. The generation of scientists following Mendel, for example, significantly altered his calculations due to the wide adoption of the p-value in statistics.

The anachronism for many of the most common genomics algorithms today is the failure to properly optimize for cloud technology. Memory requirements are often limited to a single compute node or expect a POSIX file system. These tools may have been sufficient for the computing task involved in analyzing the genome of a single person. However, a shift to cloud computing will be necessary as we move to the full-scale population studies that will be required to develop novel methods in precision medicine.

Many existing genomics algorithms must be refactored to address the scale at which research is done today. The Broad Institute of MIT and Harvard, a leading genomic research center, recognized this and moved many of the algorithms that could be pulled apart and parallelized into a MapReduce paradigm within a Genome Analysis Toolkit (GATK). Despite the MapReduce migration and many Hadoop-based projects such as BioPig, the bioinformatics community did not fully embrace the Hadoop ecosystem due to the sequential nature of the reads and the overhead associated with splitting the MapReduce tasks.

Precision medicine is also going to rely heavily on referencing public data sets. Generally available, open data sets are often downloaded and copied among many researcher centers. These multiple copies of the same data create an inefficiency for researchers that is addressed through the AWS Public Data Sets program. This program allows researchers to leverage popular genomics data sets like TCGA and ICGC without having to pay for raw storage.

Migrating genetics to Spark on Amazon EMR

The transitioning of genomics algorithms that are popular today to Spark is one path that scientists are taking to capture the distributed processing capabilities of the cloud. However, precision medicine will require an abundance of exploration and new approaches. Many of these are already being built on top of Spark on EMR.

This is because Spark on EMR provides much of the functionality that aligns well with the goals of the precision medicine. For instance, using Amazon S3 as an extension of HDFS storage makes it easy to share the results of your analysis with collaborators from all over the world by simply allowing access to an S3 URL. It also provides the ability for researchers to adjust their cluster to the algorithm they are trying to build instead of adjusting their algorithm to the cluster to which they have access. Competition for compute resources with other cluster users is another drawback that can be mitigated with a move towards EMR.

The introduction of Spark has now overcome many of the previous limitations associated with parallelizing the data based on index files that comprise the standard Variant Control Files (VCF) and Binary Alignment Map (BAM) formats used by genomics researchers. The AMPLab at UC Berkeley, through projects like ADAM, demonstrated the value of a Spark infrastructure for genomics data processing.

Moreover, genomics scientists began identifying the need to get away from the undifferentiated heavy lifting of developing custom distributed-system code. These developments motivated the Broad to develop the next version of the GATK with an option to run in the cloud on Spark. The upcoming GATK is a major step forward for the scientific community, since it will soon be able to take advantage of many features of EMR, such as on-demand clusters of various instance types and Amazon S3–backed storage.

Beyond its infrastructure advantages, Spark on EMR also speaks the languages that are popular with the research community. Interfaces like SparkR on EMR make for an easy transition into the cloud. That way, when issues arise, such as needing to unpack Gzip-compressed files to make them splittable, familiar R code can be used to make the transformation.

What’s under the hood of ADAM?

ADAM is an in-memory, distributed genomics analysis system built on Apache Spark. It is the result of contributions from universities, biotech, and pharma companies, and it is entirely open source under the Apache 2 license. ADAM uses the Apache Avro and Parquet file formats for data storage to provide a schema-based description of genomic data. This eliminates dependencies on format-specific libraries, which in the past have created incompatibilities.

Apache Parquet is a columnar storage format that is well suited for genomic data, primarily because genomic data consists of a large number of similar data items, which align well with columnar storage. By using Apache Parquet as its storage format, ADAM makes querying and storing genomic data highly efficient: it limits I/O to the data actually needed by loading only the columns that need to be accessed. Because of its columnar nature, storing data in Parquet also saves space as a result of better compression ratios.

The columnar Parquet file format enables up to a 25% reduction in storage volume compared to compressed genomics file formats like BAM. Moreover, in-memory caching with Spark and the ability to parallelize data processing over multiple nodes have been reported to provide up to 50 times better performance on average on a 100-node cluster.

In addition to using the Parquet format for columnar storage, ADAM makes use of a new schema for genomics data referred to as bdg-formats, a project that provides schemas for describing common genomic data types such as variants, assemblies, and genotypes. It uses Apache Avro as its base framework and as a result, works well with common programming languages and platforms. By using the Apache Avro-based schema as its core data structure, workflows built using ADAM are flexible and easy to maintain using familiar technologies.

ADAM on Amazon EMR

The figure above represents a typical genomic workload designed on AWS using ADAM. File formats like SAM/BAM and VCF are uploaded as objects on Amazon S3.

Amazon EMR provides a way to run Spark compute jobs with ADAM and other applications on an as-needed basis while keeping the genomics data itself stored in separate, cheap object storage.

Your first genomics analysis with ADAM on Amazon EMR

To install ADAM on EMR, launch an EMR cluster from the AWS Management Console. Make sure you select the option to install Spark, as shown in the screen shot below.

While launching the EMR cluster, configure the EMR master security group to allow access to port 22 so that you can SSH to the master node. The master node should also be able to communicate with the Internet to download the necessary packages and build ADAM.
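
If you prefer the command line, an equivalent cluster can be launched with the AWS CLI. The following is a minimal sketch; the cluster name, release label, key pair, instance type, and instance count are illustrative values that you should adjust for your own account and region.

# launch a small EMR cluster with Spark installed (values shown are illustrative)
aws emr create-cluster \
  --name "adam-spark" \
  --release-label emr-4.7.2 \
  --applications Name=Spark \
  --ec2-attributes KeyName=YOUR-KEY-PAIR \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --use-default-roles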

ADAM installation steps

During the creation of a cluster, a shell script known as a bootstrap action can automate the installation of ADAM. If you prefer to install ADAM manually, the following instructions walk through what such a script does.

After the EMR cluster with Apache Spark is up and running, you can log in to the master node using SSH and install ADAM. ADAM requires Maven to build its packages; after you install Maven, you can clone ADAM from the ADAM GitHub repository.

SSH into the master node of the EMR cluster.

Install Maven by downloading the Apache binaries from a mirror, as shown below:

# create a new directory in which to install Maven
mkdir maven
cd maven
# download the Maven binary archive from an Apache mirror
echo "copying the Maven archive from the Apache mirror"
wget http://apachemirror.ovidiudan.com/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
echo "unpacking the archive"
tar -zxvf apache-maven-3.3.9-bin.tar.gz
# add the Maven bin directory to the PATH
echo "updating PATH"
export PATH=$HOME/maven/apache-maven-3.3.9/bin:$PATH

Before cloning ADAM from GitHub, install Git on the master node of the EMR cluster. This can be done by running the following command:

# move to the home directory and install Git
cd $HOME
sudo yum install git

Clone the ADAM repository from GitHub.

# clone the ADAM repository from GitHub
git clone https://github.com/bigdatagenomics/adam.git
cd adam

Finally, begin your build:

# give Maven enough memory for the build, then compile ADAM (skipping tests)
export MAVEN_OPTS="-Xmx512m -XX:MaxPermSize=256m"
mvn clean package -DskipTests

On completion, you see a message confirming that ADAM was built successfully.

Analysis of genomic data using ADAM

After installation, ADAM provides a set of functions to transform and analyze genomic data sets. The functions range from actions that count k-mers to conversion operations that transform standard genomics file formats like SAM, BAM, or VCF into ADAM Parquet. A list of ADAM functions can be viewed by invoking adam-submit from the command line.

For the purposes of this analysis, you use a VCF file from an AWS public data set, specifically the 1000 Genomes Project.

This is a project that aims to build the most detailed map of human genetic variation available. When complete, the 1000 Genomes Project will have genomic data from over 2,661 people. Amazon S3 hosts the initial pilot data for this project in a public S3 bucket. The latest data set available to everyone holds data for approximately 1,700 people and is more than 200 TB in size.

From the master node of the EMR cluster, you can download the file from the S3 bucket and then copy it into HDFS so that it can be accessed by ADAM. This can be done by running the following commands:

# copy a VCF file from S3 to the master node
$ aws s3 cp s3://1000genomes/phase1/analysis_results/integrated_call_sets/ALL.chr1.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz /home/hadoop/

# unzip the file
$ gunzip ALL.chr1.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz

# copy the file into HDFS
$ hadoop fs -put /home/hadoop/ALL.chr1.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf /user/hadoop/
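
Next, convert the VCF file in HDFS into ADAM Parquet files. A minimal sketch using the ADAM CLI is shown below; it assumes the vcf2adam action shipped with ADAM releases of this era and an output directory named adamfiles, so the exact action name and paths may differ in your build.

# convert the VCF in HDFS into a directory of ADAM Parquet files
# (vcf2adam is assumed here; newer ADAM releases rename this action)
$ ./bin/adam-submit vcf2adam \
    /user/hadoop/ALL.chr1.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf \
    /user/hadoop/adamfiles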

The conversion generates more than 690 ADAM Parquet files. The following screenshot shows a subset of these files.

You can now process the ADAM Parquet files as regular Parquet files in Apache Spark using Scala. ADAM also provides its own interactive shell, which is invoked from the command line using adam-shell. In this example, read the ADAM Parquet files as a data frame and print the schema. Then register the data frame as a temp table and use it to query the genomic data sets.

// load the ADAM Parquet files as a Spark data frame
val gnomeDF = sqlContext.read.parquet("/user/hadoop/adamfiles")
gnomeDF.printSchema()
// register the data frame as a temp table and query it with Spark SQL
gnomeDF.registerTempTable("gnome")
val gnome_data = sqlContext.sql("select count(*) from gnome")
gnome_data.show()

Here’s how to use the ADAM interactive shell to perform a read count and a k-mer count. A k-mer count returns all of the strings of length k found in the DNA sequences. This is an important first step toward understanding all of the possible subsequences (here, of length 20) that are contained in this file.

import org.bdgenomics.adam.rdd.ADAMContext._

// load one of the ADAM Parquet files as alignment records and cache it
val reads = sc.loadAlignments("part-r-00000.gz.parquet").cache()
reads.count()
val gnomeDF = sqlContext.read.parquet("part-r-00000.gz.parquet")
gnomeDF.printSchema()
println(reads.first)

/*
The following command cuts reads into k-mers, and then counts the number of
occurrences of each k-mer
*/
val kmers = reads.adamCountKmers(20).cache()
kmers.count()

Conclusion  

Spark on Amazon EMR provides unique advantages over traditional genomic processing and may become a necessary tool as genomics moves into the scale of population-based studies required for precision medicine. Innovative companies like Human Longevity, Inc. have discussed at re:Invent how they use AWS tools such as Spark on EMR with Amazon Redshift to build a platform that is pushing precision medicine forward.

Of course, simply distributing compute resources will not solve all of the complexities associated with understanding the human condition. There is an old adage that nine women cannot make a baby in one month. Biology problems often require one step to complete before the next can begin, and such serial procedures do not necessarily lend themselves well to Spark, which benefits from distributing many small tasks at once. There are still many areas, such as sequence assembly, that may not have an easy transition to Spark. Nevertheless, Spark on Amazon EMR will be a very interesting project to watch as we move toward precision medicine.

If you have questions or suggestions, please leave a comment below.

Special thanks to Angel Pizarro for his help on this post!

———————————–

Related

Extending Seven Bridges Genomics with Amazon Redshift and R

Want to learn more about Big Data or Streaming Data? Check out our Big Data and Streaming data educational pages.

 

Extending Seven Bridges Genomics with Amazon Redshift and R

Post Syndicated from Christopher Crosbie original https://blogs.aws.amazon.com/bigdata/post/TxB9H9MGP4JBBQ/Extending-Seven-Bridges-Genomics-with-Amazon-Redshift-and-R

Christopher Crosbie is a Healthcare and Life Science Solutions Architect with Amazon Web Services

The article was co-authored by Zeynep Onder, Scientist, Seven Bridges Genomics, an AWS Advanced Technology Partner.

“ACTGCTTCGACTCGGGTCCA“

That is probably not a coding language readily understood by many reading this blog post, but it is a programming framework that defines all life on the planet. These letters are known as base pairs in a DNA sequence and represent four chemicals found in all organisms. When put into a specific order, these DNA sequences contain the instructions that kick off processes which eventually render all the characteristics and traits (also known as phenotypes) we see in nature.

Sounds simple enough. Just store the code, perform some complex decoding algorithms, and you are on your way to a great scientific discovery. Right? Well, not quite. Genomics analysis is one of the biggest data problems out there.

Here’s why: You and I have around 20,000 – 25,000 distinct sequences of DNA (genes) that create proteins and thus contain instructions for every process from development to regeneration. This is out of the 3.2 billion individual letters (bases) in each of us. Thousands of other organisms have also been decoded and stored in databases because comparing genes across species can help us understand the way these genes actually function. Algorithms such as BLAST that can search DNA sequences from more than 260,000 organisms containing over 190 billion bases are now commonplace in bioinformatics. It has also been estimated that the total amount of DNA base pairs on Earth is somewhere in the range of 5.0 x 10^37, or 50 trillion trillion trillion, DNA base pairs. WOW! Forget about clickstream and server logs—nature has given us the ultimate Big Data problem.

Scientists, developers, and technologists from a variety of industries have all chosen AWS as their infrastructure platform to meet the challenges of data volume, data variety, and data velocity. It should be no surprise that the field of genomics has also converged on AWS tooling for meeting their big data needs. From storing and processing raw data from the machines that read it (sequencers) to research centers that want to build collaborative analysis environments, the AWS cloud has been an enabler for scientists making real discoveries in this field.

For blog readers who have not yet encountered this life sciences data, “genomics” is probably a term you have heard a lot in the past year – from President Obama’s precision medicine initiative to many mainstream news articles to, potentially, your own healthcare. That is why for this Big Data Blog post, we’ll provide non-bioinformatics readers with a starting point for gaining an understanding of what is often seen as a clandestine technology, while at the same time offering those in the bioinformatics field a fresh look at ways the cloud can enhance their research.

We will walk through an end-to-end genomics analysis using the Seven Bridges Genomics platform, an AWS advanced technology partner that offers a solution for data scientists looking to analyze large-scale, next-generation, sequencing (NGS) data securely. We will also demonstrate how easy it is to extend this platform to use the additional analytics capabilities offered by AWS. First, we will identify the genetic “variants” in an individual—what makes them unique—and then conduct a comparison of variation across a group of people.

Seven Bridges Genomics platform

The Seven Bridges platform is a secure, end-to-end solution in itself. It manages all aspects of analysis, from data storage and logistics (via Amazon S3), to user tracking, permissions, logging pipeline executions to support reproducible analysis, and visualizing results.

We show you how to use the visual interface of the platform. We like to think of this as an integrated development environment for bioinformatics, but each step in this workflow can also be performed via the API. The Seven Bridges platform also offers an easy way to take the output of your analysis to other applications so you can extend its functionality. In this case, we will do additional work using Amazon Redshift and R.

You can follow along with this blog post by signing up for a free Seven Bridges trial account that comes with $100 of free computation and storage credits. This is enough for you to perform variant calling on about 50 whole exomes. In other words, you can modify and re-run the steps in this post 50 different ways before having to pay a dime.

Setting up our Seven Bridges environment

First, we will start with sequencing the 180,000 regions of the genome that make up protein-coding genes, also known as whole exome DNA sequencing. Our starting point will be a file in the FASTQ format, a text-based format for storing biological sequences made up of As, Cs, Ts, and Gs.
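
To make the format concrete, a FASTQ record is four lines long: a read identifier, the sequence itself, a separator line, and a per-base quality string. The record below is a generic illustration (the read name and quality characters are made up), not data from this analysis.

@read_001
ACTGCTTCGACTCGGGTCCA
+
IIIIIHHHHHGGGGFFFFEE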

We’ll use publicly available data from the 1000 Genomes project as an example, but you could just as easily upload your own FASTQ files. We also show you how to quickly modify the analysis to modify tool parameters or start with data in other formats. We walk you through the main points here, but the Seven Bridges Quick Start guide or this video also provide a tutorial.

The first step to run an analysis on the Seven Bridges platform is to create a project. Each project corresponds to a distinct scientific investigation, serving as a container for its data, analysis pipelines, results and collaborators. Think of this as our own private island where we can invite friends (collaborators) to join us. We maintain fine-grained control over what our friends can do on our island, from seeing files to running executions.

Adding genomics data

Next, we need to add the data we want to analyze to the project. There are several ways to add files to the platform: we can add them using the graphical interface, the command line interface, or via FTP/HTTP. Here, we analyze publicly available 1000 Genomes Project data that is available in the Public Files hosted on the Seven Bridges platform (which are also available in a public S3 bucket). The Seven Bridges Public Files repository contains reference files, data sets, and other frequently used genomic data that you might find useful. For this analysis, we’ll add two FASTQ files to our project. Because the two files have been sequenced together on the same lane, we set their Lane/Slide metadata to the same value so that the pipeline can process them together.

Sequencing analysis typically works by comparing a unique FASTQ file to a complete reference genome. The Seven Bridges platform has the most common reference files preloaded, so we won’t need to worry about this now. Of course, additional reference files can be added if we want to further customize our analysis.

Pipelines abound: Customizable conduits for GATK and beyond

Now that the data is ready, we need a pipeline to analyze it. The Seven Bridges platform comes preloaded with a broad range of gold-standard bioinformatics pipelines that allow us to execute analyses immediately according to community best practices. We can use these as templates to immediately and reproducibly perform complex analysis routines without having to install any tools or manage data flow between them. Using the SDK, we can also put virtually any custom tool on the platform and create completely custom pipelines.

In this example, we use the Whole Exome Analysis - BWA + GATK 2.3.9-Lite (with Metrics) pipeline as our starting point. In this pipeline, we first align FASTQ files to a reference using a software package known as the Burrows-Wheeler Aligner (BWA). Next, alignments are refined according to GATK best practices before we determine what makes our FASTQ file different from the reference file. This step of looking for differences between our file and the reference file is known as variant calling and typically produces a Variant Call Format (VCF) file.

While public pipelines are set based on best practices, we can tweak them to suit our needs using the pipeline editor.

Analyzing the source code of a person

Next, we run the analysis. Choosing Run on the pipeline brings us to the task execution page where we can select the files to analyze, and set any modifiable parameters (executions can also be started via the API). We also need to add any reference and annotation files required by our pipeline. Because we are using a Seven Bridges optimized pipeline as the starting point, the recommended reference files can all be selected and added with one click rather than hunting these down from public repositories.

After we choose Run, our optimized execution is performed and we receive an email in an hour or so, letting us know that it has been completed.

After the execution completes, we can return to the Tasks tab to see the inputs, parameters, and output files of our analysis. Everything is kept in one place so that we can return in months or years and always know the precise analysis that was performed. The pipeline we executed returns both aligned BAM (alignment) files and VCFs, so if we decide to change the variant calling in the future, we don’t need to re-run the alignment.

Extending from individual analysis to population studies in AWS

The walkthrough thus far ran an analysis on an individual level, but what if we wanted to expand our analysis to an entire population? AWS and Seven Bridges together offer a wealth of tools that are a perfect fit for this type of study. Let’s start with Amazon Redshift and the open source language R to see just a couple of examples of how this might work.

First, we select all the resulting VCF files from the file browser and select get links. This generates signed URLs pointing to our files’ locations in Amazon S3. Note that these links expire in 24 hours, so we want to use them right away.
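
If your own files live in an S3 bucket that you control, the AWS CLI can generate equivalent time-limited links. The sketch below uses a placeholder bucket and key and a 24-hour expiration.

# generate a signed URL that expires in 24 hours (86400 seconds)
aws s3 presign s3://YOUR-BUCKET/your_sample.vcf --expires-in 86400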

As the links will vary based on the 1000 Genomes file picked and individual accounts, the rest of this post uses the chromosome 1 VCF file stored on Amazon S3 as an example.

Launching Bioconductor

A very common way to get started analyzing genomics data is by using the R software package, Bioconductor. Bioconductor is built for working with genomics data and has built-in functionality for interpreting many of the common genomics file types. It is also very simple to get started with Bioconductor in the cloud because an AWS CloudFormation template is provided and maintained for quickly provisioning an environment that is ready to go.

To launch a Bioconductor environment, choose AWS CloudFormation from the AWS Management Console, create a new stack, and enter the following value for Specify an Amazon S3 template URL under Template:

https://s3.amazonaws.com/bioc-cloudformation-templates/start_ssh_instance.json

We must be in the us-east-1 region for this template to function correctly. Also be aware that we must choose an instance type that supports an HVM instance store, so avoid the micro instance types.
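
The same stack can also be created from the command line. The sketch below uses a placeholder stack name and assumes the template's default parameters; if the template requires parameters such as a key pair name or instance type, supply them with --parameters.

# create the Bioconductor stack in us-east-1 from the published template
aws cloudformation create-stack \
  --region us-east-1 \
  --stack-name bioconductor-rstudio \
  --template-url https://s3.amazonaws.com/bioc-cloudformation-templates/start_ssh_instance.json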

After the CloudFormation stack has been launched, we should see a status of “CREATE_COMPLETE” and an Outputs tab should be exposed which contains the following important pieces of data:

URL – The URL to click to begin working with Bioconductor and R.

Username – User name to gain entry to the website provided by the URL.

Password – Password to gain entry to the website provided by the URL.

SSH Command – The command we can copy/paste into a terminal with SSH installed.

Before we jump into R, let’s add a couple items to our server that will come in handy later as we interact with AWS services from R. Connect to the instance using the SSH command provided by the outputs of the CloudFormation template. After we are in the instance, we run the following command:

sudo apt-get -y install libpq-dev

While logged into the instance, we can also use this opportunity to configure the AWS CLI. For the purposes of this post we’ll use the keys of a user who has read and write access to an Amazon S3 bucket.
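
Running aws configure prompts interactively for the access key, secret key, default region, and output format; the values shown below are placeholders.

# configure credentials for a user with read/write access to your S3 bucket
aws configure
#   AWS Access Key ID [None]: YOUR_KEY
#   AWS Secret Access Key [None]: YOUR_SECRET_KEY
#   Default region name [None]: us-east-1
#   Default output format [None]: json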

Load the VCF into the R cloud

We are now ready to get started with R and Bioconductor by clicking the URL provided in the CloudFormation output. After we log in, we should be able to interact with a fully configured and pre-loaded RStudio environment.

First we need to pull the relevant VCF files into our Bioconductor environment. In this sample code, we use the VCF files provided in the 1000 Genomes project hosted on AWS but the links could be switched with those found in the file browser of the Seven Bridges Genomics platform.

# RCurl to retrieve data from S3
library(RCurl)

# The VariantAnnotation package from Bioconductor will give us additional
# functionality for working with VCF files.
library(VariantAnnotation)

# No need to install these packages as they are part of the pre-configured
# Bioconductor install

# VCF files are composed of two parts: the data and the index stored in the .tbi file.
# The VariantAnnotation package relies on having both files side by side for
# performing fast data lookups

# pull the binaries of the data file into R's memory
bin_vcf <- getBinaryURL("http://1000genomes.s3.amazonaws.com/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz",
ssl.verifypeer=FALSE)

# pull the binaries of the index file into R's memory
bin_vcf_tabix <- getBinaryURL("http://1000genomes.s3.amazonaws.com/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi",
ssl.verifypeer=FALSE)

# Write both files to the working directory of R.
# use getwd() to see the current directory and use setwd() to change location.
con <- file("imported_vcf.gz", open = "wb")
writeBin(bin_vcf, con)
close(con)

con <- file("imported_vcf.gz.tbi", open = "wb")
writeBin(bin_vcf_tabix, con)
close(con)

# Create strings specifying the file name location that can be used by R
vcf_file <- paste(getwd(), "imported_vcf.gz", sep="/")
vcf.tbi <- paste(getwd(), "imported_vcf.gz.tbi", sep="/")

# use functionality of VariantAnnotation to review the header
# of the VCF file.
info(scanVcfHeader(vcf_file))

Extracting Relevant Variant Data

As a trivial example of the type of manipulation a bioinformatician may want to do with Bioconductor before sending the data off to other AWS analytics services for further analysis, let's filter by a range within the chromosome 1 VCF file and then extract the variations in genes that were actually found.

# filter by a region in the chromosome
loc <- GRanges(1, IRanges(232164611, 233164611))
params <- ScanVcfParam(which=loc)

# read the VCF file into an R object of type VCF.
# "hg19" is the genome identifier and tells Bioconductor to use the 19th
# build of the reference file from the human genome project.
vcf <- readVcf(TabixFile(vcf_file), "hg19", params)

# use fixed() to pull out the REF, ALT, QUAL and FILTER fields of the VCF and
# parse them into a dataframe.
# Using this as an easy to interpret example.
variant_df <- fixed(vcf)

# to get an idea of what is contained in this dataset, preview the first rows
# (head() is used here for readability)
head(variant_df)

The results of this code should produce a sample similar to the following:

Using Bioconductor and R to send variants to Amazon Redshift

This appears to be a very structured data set that we may want to collect on an ongoing basis from multiple VCFs to perform large population studies later. This is a scenario that aligns perfectly with Amazon Redshift. For more information about how Amazon Redshift relates to R, as well as details on how to connect to Amazon Redshift from R, see Connecting R with Amazon Redshift.

To move this dataset effectively into Amazon Redshift, we want to use the COPY command of Amazon Redshift and take advantage of the wide bandwidth that S3 provides. Therefore, the first step is moving our fixed variant data frame into S3.

#move into S3
variant_df <- fixed(vcf)

#export the dataframe to a CSV
write.csv(variant_df, file = "my_variants.csv")

# this step requires that 'aws configure' was previously run on the
# EC2 instance with user keys that have read/write on YOUR-BUCKET
system("aws s3 mv my_variants.csv s3://YOUR-BUCKET/my_variants.csv", intern = TRUE)

Now that this data frame is available to us in Amazon S3, we can open a connection to Amazon Redshift and load the filtered variant data.

#on first run,
#we need to install the RPostgreSQL package.
#install.packages("RPostgreSQL")
#open a connection to the Amazon Redshift cluster
library(RPostgreSQL)
con <- dbConnect(dbDriver("PostgreSQL"),
host="YOUR-redshift-cluster1.xx.region.amazonaws.com",
user= "USER_NAME",
password="PASSWORD",
dbname="DATABASE_NAME",
port = 5439 )
#Send a query to create
#the table which will hold our variants
dbGetQuery(con,
"CREATE TABLE SAMPLE_VARIANTS
( ROW_ID int, REF varchar(50),
ALT varchar(50), QUAL int,
FILTER varchar(5) )"
)

#Use the Amazon Redshift COPY
#command to load the table from S3
dbGetQuery(con,
"copy SAMPLE_VARIANTS from
's3://YOUR-BUCKET/my_variants.csv'
credentials 'aws_access_key_id=YOUR_KEY;aws_secret_access_key=YOUR_SECRET_KEY'
csv IGNOREHEADER 1 maxerror as 250;")

Running an analysis against Amazon Redshift from R

Now that the data has been loaded into Amazon Redshift, we can continue to add additional rows from more VCFs that are relevant to the region of the chromosome we wish to study, without having to worry about scaling since Amazon Redshift can easily query petabytes of data. We now also have the ability to write simple SQL against our dataset to find answers to our questions. For example, we can quickly ask what the top gene variations are in this chromosome region with the following statement:

top_variations <- dbGetQuery(con,
"select ref || '->' || alt as variation, count(*) as variations_in_vcf from SAMPLE_VARIANTS group by ref,alt order by variations_in_vcf DESC"
)

The most common variation found in our chromosome range is that a C becomes a T. Most of our top changes are a single letter change, which is also known as a single-nucleotide polymorphism or SNP (pronounced “snip”).

In addition to massive scale-out capabilities, there’s another advantage to collecting relevant results in a single Amazon Redshift table: we are storing the specific variant data for the chromosome region we want to study in an easy-to-interpret and simple-to-query format. This makes it easy to cut our massive datasets with SQL and hone in on the relevant cohorts or regions of data we want to explore further. After we have used SQL to focus our data, we can return to R and run a more sophisticated analysis.

As a spurious example of what this might look like, let's dig into the SNP example above. Simple counts let us know that SNPs are certainly the most common form of variation in the set. But what if we wanted to see whether there were base-pair changes in the SNPs that tended to cluster together? We could examine this by running a correspondence analysis, which is a graphical method of exploring the relationship between variables in a contingency table. In this pseudo-analysis, we treat our ref and alt variables as categorical variables and produce a visualization that displays how our refs and alts move together in chi-squared space.

single_changes = "select ref , alt,
count(*) as variations_in_vcf from SAMPLE_VARIANTS
WHERE len(ref) = 1 and len(alt) = 1
group by ref,alt
order by variations_in_vcf DESC"
count_single_variations <- dbGetQuery(con,single_changes)
library(ca)
count_single_variations
correspondance_table <- with(count_single_variations, table(ref,alt)) # create a 2 way table
prop.table(correspondance_table, 1) # row percentages
prop.table(correspondance_table, 2) # column percentages
fit <- ca(correspondance_table)
print(fit) #basic results
plot(fit, mass = TRUE, contrib = "absolute", map =
"rowgreen", arrows = c(FALSE, TRUE)) # asymmetric map

Conclusion

We are in the midst of a fundamental change in the accessibility and usability of data. Massive datasets once thought to only be archival in nature or impossible to interpret are being opened up for examination due to the processing power of the cloud. The genes that control the makeup of all living creatures are no exception and are finally falling under the same scrutiny.

We hope that this walkthrough has given the non-bioinformatics reader a quick peek under the hood of this fascinating field. We also hope it has provided insight into what it means to begin to understand the data in your DNA. Armed with this knowledge and an improved genomics vocabulary, you can be part of the conversation surrounding the precision medicine revolution.

Additionally, we hope the seasoned bioinformatics reader has been exposed to new tools that could accelerate research and has been empowered to explore the capabilities of Amazon Redshift and the Seven Bridges Genomics platform. This post has just scratched the surface. There are over 30 tested, peer-reviewed pipelines and 400 open source applications available for analyzing multidimensional, next-generation sequencing data integrated into the Seven Bridges Genomics platform.

If you have questions or suggestions, please leave a comment below.

————————-

Related

Connecting R with Amazon Redshift
