Post Syndicated from Christopher Crosbie original https://blogs.aws.amazon.com/bigdata/post/TxB9H9MGP4JBBQ/Extending-Seven-Bridges-Genomics-with-Amazon-Redshift-and-R
Christopher Crosbie is a Healthcare and Life Science Solutions Architect with Amazon Web Services
The article was co-authored by Zeynep Onder, Scientist, Seven Bridges Genomics, an AWS Advanced Technology Partner.
A string of the letters A, C, G, and T is probably not a coding language readily understood by many reading this blog post, but it is a programming framework that defines all life on the planet. These letters are known as bases in a DNA sequence and represent four chemicals found in all organisms. When put into a specific order, these DNA sequences contain the instructions that kick off processes which eventually render all the characteristics and traits (also known as phenotypes) we see in nature.
Sounds simple enough. Just store the code, perform some complex decoding algorithms, and you are on your way to a great scientific discovery. Right? Well, not quite. Genomics analysis is one of the biggest data problems out there.
Here’s why: You and I have around 20,000 – 25,000 distinct sequences of DNA (genes) that create proteins and thus contain instructions for every process from development to regeneration. This is out of the 3.2 billion individual letters (bases) in each of us. Thousands of other organisms have also been decoded and stored in databases because comparing genes across species can help us understand the way these genes actually function. Algorithms such as BLAST that can search DNA sequences from more than 260,000 organisms containing over 190 billion bases are now commonplace in bioinformatics. It has also been estimated that the total amount of DNA base pairs on Earth is somewhere in the range of 5.0 x 10^37, or 50 trillion trillion trillion, DNA base pairs. WOW! Forget about clickstream and server logs—nature has given us the ultimate Big Data problem.
Scientists, developers, and technologists from a variety of industries have all chosen AWS as their infrastructure platform to meet the challenges of data volume, data variety, and data velocity. It should be no surprise that the field of genomics has also converged on AWS tooling for meeting their big data needs. From storing and processing raw data from the machines that read it (sequencers) to research centers that want to build collaborative analysis environments, the AWS cloud has been an enabler for scientists making real discoveries in this field.
For blog readers who have not yet encountered this life sciences data, “genomics” is probably a term you have heard a lot in the past year – from President Obama’s precision medicine initiative to many mainstream news articles to potentially as part of your own healthcare. That is why for this Big Data Blog post, we’ll provide non-bioinformatics readers with a starting point for gaining an understanding of what is often seen as a clandestine technology, while at the same time offering those in the bioinformatics field a fresh look at ways the cloud can enhance their research.
We will walk through an end-to-end genomics analysis using the Seven Bridges Genomics platform, an AWS Advanced Technology Partner that offers a solution for data scientists looking to analyze large-scale next-generation sequencing (NGS) data securely. We will also demonstrate how easy it is to extend this platform to use the additional analytics capabilities offered by AWS. First, we will identify the genetic “variants” in an individual—what makes them unique—and then conduct a comparison of variation across a group of people.
Seven Bridges Genomics platform
The Seven Bridges platform is a secure, end-to-end solution in itself. It manages all aspects of analysis, from data storage and logistics (via Amazon S3) to user tracking and permissions, logging of pipeline executions to support reproducible analysis, and visualization of results.
We will show you how to use the visual interface of the platform. We like to think of this as an integrated development environment for bioinformatics, but each step in this workflow can also be performed via API. The Seven Bridges platform also offers an easy way to take the output of your analysis to other applications so you can extend its functionality. In this case, we will do additional work using Amazon Redshift and R.
You can follow along with this blog post by signing up for a free Seven Bridges trial account that comes with $100 of free computation and storage credits. This is enough for you to perform variant calling on about 50 whole exomes. In other words, you can modify and re-run the steps in this post 50 different ways before having to pay a dime.
Setting up our Seven Bridges environment
First, we will start with sequencing the 180,000 regions of the genome that make up protein-coding genes, also known as whole exome DNA sequencing. Our starting point will be a file in the FASTQ format, a text-based format that stores biological sequences (the letters A, C, G, and T) along with per-base quality scores.
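As a quick illustration (this record is invented, not drawn from the dataset used below), each read in a FASTQ file occupies four lines: an identifier, the sequence itself, a separator, and an ASCII-encoded quality score for each base:

```text
@SEQ_ID_001
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
```

A whole exome sample contains many millions of these four-line records, which is why the alignment step later in this post is computationally intensive.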
We’ll use publicly available data from the 1000 Genomes project as an example, but you could just as easily upload your own FASTQ files. We also show you how to quickly adapt the analysis to change tool parameters or start with data in other formats. We walk you through the main points here, but the Seven Bridges Quick Start guide or this video also provide a tutorial.
The first step to run an analysis on the Seven Bridges platform is to create a project. Each project corresponds to a distinct scientific investigation, serving as a container for its data, analysis pipelines, results and collaborators. Think of this as our own private island where we can invite friends (collaborators) to join us. We maintain fine-grained control over what our friends can do on our island, from seeing files to running executions.
Adding genomics data
Next, we need to add the data we want to analyze to the project. There are several ways to add files to the platform; we can add them using the graphical interface, the command line interface, or via FTP/HTTP. Here, we analyze publicly available 1000 Genomes Project data that is available in the Public Files hosted on the Seven Bridges platform (which are also available in a public S3 bucket). The Seven Bridges Public Files repository contains reference files, data sets, and other frequently used genomic data that you might find useful. For this analysis, we’ll add two FASTQ files to our project. Because the two files have been sequenced together on the same lane, we set their Lane/Slide metadata to the same value so that the pipeline can process them together.
Sequencing analysis typically works by comparing an individual FASTQ file to a complete reference genome. The Seven Bridges platform has the most common reference files preloaded, so we won’t need to worry about this now. Of course, additional reference files can be added if we want to further customize our analysis.
Pipelines abound: Customizable conduits for GATK and beyond
Now that the data is ready, we need a pipeline to analyze it. The Seven Bridges platform comes preloaded with a broad range of gold-standard bioinformatics pipelines that allow us to execute analyses immediately according to community best practices. We can use these as templates to immediately and reproducibly perform complex analysis routines without having to install any tools or manage data flow between them. Using the SDK, we can also put virtually any custom tool on the platform and create completely custom pipelines.
In this example, we use the Whole Exome Analysis BWA + GATK 2.3.9Lite (with Metrics) pipeline as our starting point. In this pipeline, we first align the FASTQ files to a reference using a software package known as the Burrows-Wheeler Aligner (BWA). Next, alignments are refined according to GATK best practices before we determine how our sample differs from the reference. This step of looking for differences between our sample and the reference is known as variant calling and typically produces a Variant Call Format (VCF) file.
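For orientation (the values here are invented), each data line of a VCF file records the chromosome and position of a variant, the reference allele, the alternate allele observed in the sample, and quality and filter information:

```text
#CHROM  POS        ID         REF  ALT  QUAL  FILTER  INFO
1       232164700  rs0000001  C    T    100   PASS    AF=0.25
```

The REF, ALT, QUAL, and FILTER columns are exactly the fields we will extract with Bioconductor later in this post.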
While public pipelines are set based on best practices, we can tweak them to suit our needs using the pipeline editor.
Analyzing the source code of a person
Next, we run the analysis. Choosing Run on the pipeline brings us to the task execution page where we can select the files to analyze, and set any modifiable parameters (executions can also be started via the API). We also need to add any reference and annotation files required by our pipeline. Because we are using a Seven Bridges optimized pipeline as the starting point, the recommended reference files can all be selected and added with one click rather than hunting these down from public repositories.
After we choose Run, our optimized execution is performed and we receive an email in an hour or so, letting us know that it has been completed.
After the execution completes, we can return to the Tasks tab to see the inputs, parameters, and output files of our analysis. Everything is kept in one place so that we can return in months or years and always know the precise analysis that was performed. The pipeline we executed returns both aligned BAM (alignment) files and VCFs, so if we decide to change the variant calling in the future, we don’t need to rerun the alignment.
Extending from individual analysis to population studies in AWS
The walkthrough thus far ran an analysis on an individual level, but what if we wanted to expand our analysis to an entire population? AWS and Seven Bridges together offer a wealth of tools that are a perfect fit for this type of study. Let’s start with Amazon Redshift and the open source language R to see just a couple examples of how this might work.
First, we select all the resulting VCF files from the file browser and select get links. This generates signed URLs pointing to our files’ locations in Amazon S3. Note that these links expire in 24 hours, so we should use them right away.
As links will vary based on the 1000 genomes file picked and individual accounts, the rest of this post uses the chromosome 1 VCF file stored on Amazon S3 as an example.
A very common way to get started analyzing genomics data is with the R software package Bioconductor. Bioconductor is built for working with genomics data and has built-in functionality for interpreting many of the common genomics file types. It is also very simple to get started with Bioconductor in the cloud because the Bioconductor project provides and maintains an AWS CloudFormation template for quickly provisioning an environment that is ready to go.
To launch a Bioconductor environment, choose AWS CloudFormation from the AWS Management Console, create a new stack, and enter the Bioconductor-provided template URL for Specify an Amazon S3 template URL under Template (the current URL is listed in the Bioconductor cloud documentation):
The template only functions correctly in the us-east-1 region. Also be aware that we must choose an instance type that supports an HVM instance store, so avoid the micro instance types.
After the CloudFormation stack has launched, we should see a status of “CREATE_COMPLETE” and an Outputs tab that contains the following important pieces of data:
URL – The URL to click to begin working with Bioconductor and R.
Username – User name to gain entry to the website provided by the URL.
Password – Password to gain entry to the website provided by the URL.
SSH Command – The command we can copy/paste into a terminal with SSH installed.
Before we jump into R, let’s add a couple of items to our server that will come in handy later as we interact with AWS services from R. Connect to the instance using the SSH command provided in the outputs of the CloudFormation template. After we are in the instance, we run the following command:
sudo apt-get -y install libpq-dev  # PostgreSQL client headers, needed to build the RPostgreSQL package
While logged into the instance, we can also use this opportunity to configure the AWS CLI. For the purposes of this post we’ll use the keys of a user who has read and write access to an Amazon S3 bucket.
Load the VCF into the R cloud
We are now ready to get started with R and Bioconductor by clicking the URL provided in the CloudFormation output. After we log in, we should be able to interact with a fully configured and pre-loaded RStudio environment.
First we need to pull the relevant VCF files into our Bioconductor environment. In this sample code, we use the VCF files provided in the 1000 Genomes project hosted on AWS but the links could be switched with those found in the file browser of the Seven Bridges Genomics platform.
#RCurl to retrieve data from S3.
#The VariantAnnotation package from Bioconductor will give us additional
#functionality for working with VCF files.
#No need to install these packages as they are part of the pre-configured
#Bioconductor install.
library(RCurl)
library(VariantAnnotation)

#VCF files are composed of two parts: the data and the index stored in the .tbi file.
#The VariantAnnotation package relies on having both files side by side for
#performing fast data lookups.

#pull the binaries of the data file into R's memory
bin_vcf <- getBinaryURL("http://1000genomes.s3.amazonaws.com/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz")

#pull the binaries of the index file into R's memory
bin_vcf_tabix <- getBinaryURL("http://1000genomes.s3.amazonaws.com/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi")

#Write both files to the working directory of R.
#Use getwd() to see the current directory and setwd() to change location.
con <- file("imported_vcf.gz", open = "wb")
writeBin(bin_vcf, con)
close(con)

con <- file("imported_vcf.gz.tbi", open = "wb")
writeBin(bin_vcf_tabix, con)
close(con)

#Create strings specifying the file name locations that can be used by R
vcf_file <- paste(getwd(), "imported_vcf.gz", sep="/")
vcf.tbi <- paste(getwd(), "imported_vcf.gz.tbi", sep="/")

#use the functionality of VariantAnnotation to review the header of the VCF file
scanVcfHeader(vcf_file)
Extracting relevant variant data
As a trivial example of the type of manipulation a bioinformatician may want to do with Bioconductor before sending the data to other AWS analytics services, let’s filter by a range within the chromosome 1 VCF file and then extract the variations that were actually found.
#filter by a region in the chromosome
loc <- GRanges(1, IRanges(232164611, 233164611))
params <- ScanVcfParam(which=loc)

#read the VCF file into an R object of type VCF.
#The "hg19" genome identifier tells Bioconductor to use build 19 of the
#UCSC human reference genome (equivalent to GRCh37).
vcf <- readVcf(TabixFile(vcf_file), "hg19", params)

#use fixed() to pull out the REF, ALT, QUAL and FILTER fields of the VCF
#and parse them into an easy-to-interpret data structure.
variant_df <- fixed(vcf)
#to get an idea of what is contained in this dataset, run:
head(variant_df)
The results of this code show the REF, ALT, QUAL, and FILTER values for each variant in the selected region.
Using Bioconductor and R to send variants to Amazon Redshift
This appears to be a very structured data set that we may want to collect on an ongoing basis from multiple VCFs to perform large population studies later. This scenario aligns perfectly with Amazon Redshift. For more information about how Amazon Redshift relates to R, as well as details on how to connect to Amazon Redshift from R, see Connecting R with Amazon Redshift.
To move this dataset efficiently into Amazon Redshift, we want to use the Amazon Redshift COPY command and take advantage of the high bandwidth that Amazon S3 provides. Therefore, the first step is moving our fixed-variant data frame into S3.
#move the data frame into S3
variant_df <- fixed(vcf)

#export the data frame to a CSV file
write.csv(variant_df, file = "my_variants.csv")

#this step requires that `aws configure` was previously run on the
#EC2 instance with user keys that have read/write access on YOUR-BUCKET
system("aws s3 mv my_variants.csv s3://YOUR-BUCKET/my_variants.csv", intern = TRUE)
Now that this data frame is available to us in Amazon S3, we can open a connection to Amazon Redshift and load the filtered variant data.
#on first run, we need to install the RPostgreSQL package
install.packages("RPostgreSQL")
library(RPostgreSQL)

#open a connection to the Amazon Redshift cluster
#(the host, dbname, user, and password below are placeholders:
#substitute the values for your own cluster)
con <- dbConnect(dbDriver("PostgreSQL"),
                 host = "your-cluster.example.us-east-1.redshift.amazonaws.com",
                 dbname = "yourdb",
                 user = "youruser",
                 password = "yourpassword",
                 port = 5439)

#send a query to create the table which will hold our variants
dbSendQuery(con,
  "CREATE TABLE SAMPLE_VARIANTS
   ( ROW_ID int, REF varchar(50),
     ALT varchar(50), QUAL int,
     FILTER varchar(5) )")

#use the Amazon Redshift COPY command to load the table from S3
#(the credentials string is a placeholder: substitute keys that can read YOUR-BUCKET)
dbSendQuery(con,
  "copy SAMPLE_VARIANTS from 's3://YOUR-BUCKET/my_variants.csv'
   credentials 'aws_access_key_id=YOUR_KEY;aws_secret_access_key=YOUR_SECRET'
   csv IGNOREHEADER 1 maxerror as 250;")
Running an analysis against Amazon Redshift from R
Now that the data has been loaded into Amazon Redshift, we can continue to add additional rows from more VCFs that are relevant to the region of the chromosome we wish to study, without having to worry about scaling since Amazon Redshift can easily query petabytes of data. We now also have the ability to write simple SQL against our dataset to find answers to our questions. For example, we can quickly ask what the top gene variations are in this chromosome region with the following statement:
top_variations <- dbGetQuery(con,
  "select ref || '->' || alt as variation, count(*) as variations_in_vcf
   from SAMPLE_VARIANTS group by ref, alt order by variations_in_vcf DESC")
The most common variation found in our chromosome range is that a C becomes a T. Most of our top changes are a single letter change, which is also known as a single-nucleotide polymorphism or SNP (pronounced “snip”).
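To make the aggregation concrete, here is a toy, self-contained sketch of the same group-by logic in plain R. The data frame below is invented and simply stands in for rows pulled back from the SAMPLE_VARIANTS table:

```r
#Invented stand-in for rows from SAMPLE_VARIANTS (ref and alt alleles)
variants <- data.frame(
  ref = c("C", "C", "A", "G", "C"),
  alt = c("T", "T", "G", "A", "T"),
  stringsAsFactors = FALSE
)

#Equivalent of: select ref, alt, count(*) ... group by ref, alt
counts <- aggregate(list(variations_in_vcf = rep(1, nrow(variants))),
                    by = list(ref = variants$ref, alt = variants$alt),
                    FUN = sum)

#Equivalent of: order by variations_in_vcf DESC
counts <- counts[order(-counts$variations_in_vcf), ]
counts
```

With this toy input, the C->T substitution comes out on top, mirroring what the Redshift query reports for the real chromosome 1 region.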
In addition to massive scale-out capabilities, there’s another advantage to collecting relevant results in a single Amazon Redshift table: we are storing the specific variant data for the chromosome region we want to study in an easy-to-interpret and simple-to-query format. This makes it easy to cut our massive datasets with SQL and home in on the relevant cohorts or regions of data we want to explore further. After we have used SQL to focus our data, we can return to R and run a more sophisticated analysis.
As a spurious example of what this might look like, let’s dig into our SNP example above. Simple counts let us know that SNPs are certainly the most common form of variation in the set. But what if we wanted to see whether there were base changes in the SNPs that tended to cluster together? We could examine this by running a correspondence analysis, which is a graphical method of exploring the relationship between variables in a contingency table. In this pseudo-analysis, we treat our ref and alt variables as categorical variables and produce a visualization that displays how our refs and alts move together in chi-squared space.
single_changes <- "select ref, alt,
  count(*) as variations_in_vcf from SAMPLE_VARIANTS
  WHERE len(ref) = 1 and len(alt) = 1
  group by ref, alt
  order by variations_in_vcf DESC"

count_single_variations <- dbGetQuery(con, single_changes)

#on first run, install the ca package for correspondence analysis:
#install.packages("ca")
library(ca)

correspondance_table <- with(count_single_variations, table(ref, alt)) # create a 2-way table
prop.table(correspondance_table, 1) # row percentages
prop.table(correspondance_table, 2) # column percentages

fit <- ca(correspondance_table)
print(fit) #basic results
plot(fit, mass = TRUE, contrib = "absolute", map = "rowgreen",
     arrows = c(FALSE, TRUE)) # asymmetric map
We are in the midst of a fundamental change in the accessibility and usability of data. Massive datasets once thought to only be archival in nature or impossible to interpret are being opened up for examination due to the processing power of the cloud. The genes that control the makeup of all living creatures are no exception and are finally falling under the same scrutiny.
We hope that this walkthrough has given the non-bioinformatics reader a quick peek under the hood of this fascinating field. We also hope it has provided insight into what it means to begin to understand the data in your DNA. Armed with this knowledge and an improved genomics vocabulary, you can be part of the conversation surrounding the precision medicine revolution.
Additionally, we hope the seasoned bioinformatics reader has been exposed to new tools that could accelerate research and has been empowered to explore the capabilities of Amazon Redshift and the Seven Bridges Genomics platform. This post has just scratched the surface. There are over 30 tested, peer-reviewed pipelines and 400 open source applications available for analyzing multidimensional, next-generation sequencing data integrated into the Seven Bridges Genomics platform.
If you have questions or suggestions, please leave a comment below.