Tag Archives: medicine

Analyze OpenFDA Data in R with Amazon S3 and Amazon Athena

Post Syndicated from Ryan Hood original https://aws.amazon.com/blogs/big-data/analyze-openfda-data-in-r-with-amazon-s3-and-amazon-athena/

One of the great benefits of Amazon S3 is the ability to host, share, or consume public data sets. This provides transparency into data to which an external data scientist or developer might not normally have access. By exposing the data to the public, you can glean many insights that would have been difficult with a data silo.

The openFDA project creates easy access to the high value, high priority, and public access data of the Food and Drug Administration (FDA). The data has been formatted and documented in consumer-friendly standards. Critical data related to drugs, devices, and food has been harmonized and can easily be called by application developers and researchers via API calls. OpenFDA has published two whitepapers that drill into the technical underpinnings of the API infrastructure as well as how to properly analyze the data in R. In addition, FDA makes openFDA data available on S3 in raw format.

In this post, I show how to use S3, Amazon EMR, and Amazon Athena to analyze the drug adverse events dataset. A drug adverse event is an undesirable experience associated with the use of a drug, including serious drug side effects, product use errors, product quality programs, and therapeutic failures.

Data considerations

Keep in mind that this data does have limitations. In addition, in the United States, these adverse events are submitted to the FDA voluntarily from consumers so there may not be reports for all events that occurred. There is no certainty that the reported event was actually due to the product. The FDA does not require that a causal relationship between a product and event be proven, and reports do not always contain the detail necessary to evaluate an event. Because of this, there is no way to identify the true number of events. The important takeaway to all this is that the information contained in this data has not been verified to produce cause and effect relationships. Despite this disclaimer, many interesting insights and value can be derived from the data to accelerate drug safety research.

Data analysis using SQL

For application developers who want to perform targeted searching and lookups, the API endpoints provided by the openFDA project are “ready to go” for software integration using a standard API powered by Elasticsearch, NodeJS, and Docker. However, for data analysis purposes, it is often easier to work with the data using SQL and statistical packages that expect a SQL table structure. For large-scale analysis, APIs often have query limits, such as 5000 records per query. This can cause extra work for data scientists who want to analyze the full dataset instead of small subsets of data.

To address the concern of requiring all the data in a single dataset, the openFDA project released the full 100 GB of harmonized data files that back the openFDA project onto S3. Athena is an interactive query service that makes it easy to analyze data in S3 using standard SQL. It’s a quick and easy way to answer your questions about adverse events and aspirin that does not require you to spin up databases or servers.

While you could point tools directly at the openFDA S3 files, you can find greatly improved performance and use of the data by following some of the preparation steps later in this post.

Architecture

This post explains how to use the following architecture to take the raw data provided by openFDA, leverage several AWS services, and derive meaning from the underlying data.

Steps:

  1. Load the openFDA /drug/event dataset into Spark and convert it to gzip to allow for streaming.
  2. Transform the data in Spark and save the results as a Parquet file in S3.
  3. Query the S3 Parquet file with Athena.
  4. Perform visualization and analysis of the data in R and Python on Amazon EC2.

Optimizing public data sets: A primer on data preparation

Those who want to jump right into preparing the files for Athena may want to skip ahead to the next section.

Transforming, or pre-processing, files is a common task for using many public data sets. Before you jump into the specific steps for transforming the openFDA data files into a format optimized for Athena, I thought it would be worthwhile to provide a quick exploration on the problem.

Making a dataset in S3 efficiently accessible with minimal transformation for the end user has two key elements:

  1. Partitioning the data into objects that contain a complete part of the data (such as data created within a specific month).
  2. Using file formats that make it easy for applications to locate subsets of data (for example, gzip, Parquet, ORC, etc.).

With these two key elements in mind, you can now apply transformations to the openFDA adverse event data to prepare it for Athena. You might find the data techniques employed in this post to be applicable to many of the questions you might want to ask of the public data sets stored in Amazon S3.

Before you get started, I encourage those who are interested in doing deeper healthcare analysis on AWS to make sure that you first read the AWS HIPAA Compliance whitepaper. This covers the information necessary for processing and storing patient health information (PHI).

Also, the adverse event analysis shown for aspirin is strictly for demonstration purposes and should not be used for any real decision or taken as anything other than a demonstration of AWS capabilities. However, there have been robust case studies published that have explored a causal relationship between aspirin and adverse reactions using OpenFDA data. If you are seeking research on aspirin or its risks, visit organizations such as the Centers for Disease Control and Prevention (CDC) or the Institute of Medicine (IOM).

Preparing data for Athena

For this walkthrough, you will start with the FDA adverse events dataset, which is stored as JSON files within zip archives on S3. You then convert it to Parquet for analysis. Why do you need to convert it? The original data download is stored in objects that are partitioned by quarter.

Here is a small sample of what you find in the adverse events (/drugs/event) section of the openFDA website.

If you were looking for events that happened in a specific quarter, this is not a bad solution. For most other scenarios, such as looking across the full history of aspirin events, it requires you to access a lot of data that you won’t need. The zip file format is not ideal for using data in place because zip readers must have random access to the file, which means the data can’t be streamed. Additionally, the zip files contain large JSON objects.

To read the data in these JSON files, a streaming JSON decoder must be used or a computer with a significant amount of RAM must decode the JSON. Opening up these files for public consumption is a great start. However, you still prepare the data with a few lines of Spark code so that the JSON can be streamed.

Step 1:  Convert the file types

Using Apache Spark on EMR, you can extract all of the zip files and pull out the events from the JSON files. To do this, use the Scala code below to deflate the zip file and create a text file. In addition, compress the JSON files with gzip to improve Spark’s performance and reduce your overall storage footprint. The Scala code can be run in either the Spark Shell or in an Apache Zeppelin notebook on your EMR cluster.

If you are unfamiliar with either Apache Zeppelin or the Spark Shell, the following posts serve as great references:

 

import scala.io.Source
import java.util.zip.ZipInputStream
import org.apache.spark.input.PortableDataStream
import org.apache.hadoop.io.compress.GzipCodec

// Input Directory
val inputFile = "s3://download.open.fda.gov/drug/event/2015q4/*.json.zip";

// Output Directory
val outputDir = "s3://{YOUR OUTPUT BUCKET HERE}/output/2015q4/";

// Extract zip files from 
val zipFiles = sc.binaryFiles(inputFile);

// Process zip file to extract the json as text file and save it
// in the output directory 
val rdd = zipFiles.flatMap((file: (String, PortableDataStream)) => {
    val zipStream = new ZipInputStream(file.2.open)
    val entry = zipStream.getNextEntry
    val iter = Source.fromInputStream(zipStream).getLines
    iter
}).map(.replaceAll("\s+","")).saveAsTextFile(outputDir, classOf[GzipCodec])

Step 2:  Transform JSON into Parquet

With just a few more lines of Scala code, you can use Spark’s abstractions to convert the JSON into a Spark DataFrame and then export the data back to S3 in Parquet format.

Spark requires the JSON to be in JSON Lines format to be parsed correctly into a DataFrame.

// Output Parquet directory
val outputDir = "s3://{YOUR OUTPUT BUCKET NAME}/output/drugevents"
// Input json file
val inputJson = "s3://{YOUR OUTPUT BUCKET NAME}/output/2015q4/*”
// Load dataframe from json file multiline 
val df = spark.read.json(sc.wholeTextFiles(inputJson).values)
// Extract results from dataframe
val results = df.select("results")
// Save it to Parquet
results.write.parquet(outputDir)

Step 3:  Create an Athena table

With the data cleanly prepared and stored in S3 using the Parquet format, you can now place an Athena table on top of it to get a better understanding of the underlying data.

Because the openFDA data structure incorporates several layers of nesting, it can be a complex process to try to manually derive the underlying schema in a Hive-compatible format. To shorten this process, you can load the top row of the DataFrame from the previous step into a Hive table within Zeppelin and then extract the “create  table” statement from SparkSQL.

results.createOrReplaceTempView("data")

val top1 = spark.sql("select * from data tablesample(1 rows)")

top1.write.format("parquet").mode("overwrite").saveAsTable("drugevents")

val show_cmd = spark.sql("show create table drugevents”).show(1, false)

This returns a “create table” statement that you can almost paste directly into the Athena console. Make some small modifications (adding the word “external” and replacing “using with “stored as”), and then execute the code in the Athena query editor. The table is created.

For the openFDA data, the DDL returns all string fields, as the date format used in your dataset does not conform to the yyy-mm-dd hh:mm:ss[.f…] format required by Hive. For your analysis, the string format works appropriately but it would be possible to extend this code to use a Presto function to convert the strings into time stamps.

CREATE EXTERNAL TABLE  drugevents (
   companynumb  string, 
   safetyreportid  string, 
   safetyreportversion  string, 
   receiptdate  string, 
   patientagegroup  string, 
   patientdeathdate  string, 
   patientsex  string, 
   patientweight  string, 
   serious  string, 
   seriousnesscongenitalanomali  string, 
   seriousnessdeath  string, 
   seriousnessdisabling  string, 
   seriousnesshospitalization  string, 
   seriousnesslifethreatening  string, 
   seriousnessother  string, 
   actiondrug  string, 
   activesubstancename  string, 
   drugadditional  string, 
   drugadministrationroute  string, 
   drugcharacterization  string, 
   drugindication  string, 
   drugauthorizationnumb  string, 
   medicinalproduct  string, 
   drugdosageform  string, 
   drugdosagetext  string, 
   reactionoutcome  string, 
   reactionmeddrapt  string, 
   reactionmeddraversionpt  string)
STORED AS parquet
LOCATION
  's3://{YOUR TARGET BUCKET}/output/drugevents'

With the Athena table in place, you can start to explore the data by running ad hoc queries within Athena or doing more advanced statistical analysis in R.

Using SQL and R to analyze adverse events

Using the openFDA data with Athena makes it very easy to translate your questions into SQL code and perform quick analysis on the data. After you have prepared the data for Athena, you can begin to explore the relationship between aspirin and adverse drug events, as an example. One of the most common metrics to measure adverse drug events is the Proportional Reporting Ratio (PRR). It is defined as:

PRR = (m/n)/( (M-m)/(N-n) )
Where
m = #reports with drug and event
n = #reports with drug
M = #reports with event in database
N = #reports in database

Gastrointestinal haemorrhage has the highest PRR of any reaction to aspirin when viewed in aggregate. One question you may want to ask is how the PRR has trended on a yearly basis for gastrointestinal haemorrhage since 2005.

Using the following query in Athena, you can see the PRR trend of “GASTROINTESTINAL HAEMORRHAGE” reactions with “ASPIRIN” since 2005:

with drug_and_event as 
(select rpad(receiptdate, 4, 'NA') as receipt_year
    , reactionmeddrapt
    , count(distinct (concat(safetyreportid,receiptdate,reactionmeddrapt))) as reports_with_drug_and_event 
from fda.drugevents
where rpad(receiptdate,4,'NA') 
     between '2005' and '2015' 
     and medicinalproduct = 'ASPIRIN'
     and reactionmeddrapt= 'GASTROINTESTINAL HAEMORRHAGE'
group by reactionmeddrapt, rpad(receiptdate, 4, 'NA') 
), reports_with_drug as 
(
select rpad(receiptdate, 4, 'NA') as receipt_year
    , count(distinct (concat(safetyreportid,receiptdate,reactionmeddrapt))) as reports_with_drug 
 from fda.drugevents 
 where rpad(receiptdate,4,'NA') 
     between '2005' and '2015' 
     and medicinalproduct = 'ASPIRIN'
group by rpad(receiptdate, 4, 'NA') 
), reports_with_event as 
(
   select rpad(receiptdate, 4, 'NA') as receipt_year
    , count(distinct (concat(safetyreportid,receiptdate,reactionmeddrapt))) as reports_with_event 
   from fda.drugevents
   where rpad(receiptdate,4,'NA') 
     between '2005' and '2015' 
     and reactionmeddrapt= 'GASTROINTESTINAL HAEMORRHAGE'
   group by rpad(receiptdate, 4, 'NA')
), total_reports as 
(
   select rpad(receiptdate, 4, 'NA') as receipt_year
    , count(distinct (concat(safetyreportid,receiptdate,reactionmeddrapt))) as total_reports 
   from fda.drugevents
   where rpad(receiptdate,4,'NA') 
     between '2005' and '2015' 
   group by rpad(receiptdate, 4, 'NA')
)
select  drug_and_event.receipt_year, 
(1.0 * drug_and_event.reports_with_drug_and_event/reports_with_drug.reports_with_drug)/ (1.0 * (reports_with_event.reports_with_event- drug_and_event.reports_with_drug_and_event)/(total_reports.total_reports-reports_with_drug.reports_with_drug)) as prr
, drug_and_event.reports_with_drug_and_event
, reports_with_drug.reports_with_drug
, reports_with_event.reports_with_event
, total_reports.total_reports
from drug_and_event
    inner join reports_with_drug on  drug_and_event.receipt_year = reports_with_drug.receipt_year   
    inner join reports_with_event on  drug_and_event.receipt_year = reports_with_event.receipt_year
    inner join total_reports on  drug_and_event.receipt_year = total_reports.receipt_year
order by  drug_and_event.receipt_year


One nice feature of Athena is that you can quickly connect to it via R or any other tool that can use a JDBC driver to visualize the data and understand it more clearly.

With this quick R script that can be run in R Studio either locally or on an EC2 instance, you can create a visualization of the PRR and Reporting Odds Ratio (RoR) for “GASTROINTESTINAL HAEMORRHAGE” reactions from “ASPIRIN” since 2005 to better understand these trends.

# connect to ATHENA
conn <- dbConnect(drv, '<Your JDBC URL>',s3_staging_dir="<Your S3 Location>",user=Sys.getenv(c("USER_NAME"),password=Sys.getenv(c("USER_PASSWORD"))

# Declare Adverse Event
adverseEvent <- "'GASTROINTESTINAL HAEMORRHAGE'"

# Build SQL Blocks
sqlFirst <- "SELECT rpad(receiptdate, 4, 'NA') as receipt_year, count(DISTINCT safetyreportid) as event_count FROM fda.drugsflat WHERE rpad(receiptdate,4,'NA') between '2005' and '2015'"
sqlEnd <- "GROUP BY rpad(receiptdate, 4, 'NA') ORDER BY receipt_year"

# Extract Aspirin with adverse event counts
sql <- paste(sqlFirst,"AND medicinalproduct ='ASPIRIN' AND reactionmeddrapt=",adverseEvent, sqlEnd,sep=" ")
aspirinAdverseCount = dbGetQuery(conn,sql)

# Extract Aspirin counts
sql <- paste(sqlFirst,"AND medicinalproduct ='ASPIRIN'", sqlEnd,sep=" ")
aspirinCount = dbGetQuery(conn,sql)

# Extract adverse event counts
sql <- paste(sqlFirst,"AND reactionmeddrapt=",adverseEvent, sqlEnd,sep=" ")
adverseCount = dbGetQuery(conn,sql)

# All Drug Adverse event Counts
sql <- paste(sqlFirst, sqlEnd,sep=" ")
allDrugCount = dbGetQuery(conn,sql)

# Select correct rows
selAll =  allDrugCount$receipt_year == aspirinAdverseCount$receipt_year
selAspirin = aspirinCount$receipt_year == aspirinAdverseCount$receipt_year
selAdverse = adverseCount$receipt_year == aspirinAdverseCount$receipt_year

# Calculate Numbers
m <- c(aspirinAdverseCount$event_count)
n <- c(aspirinCount[selAspirin,2])
M <- c(adverseCount[selAdverse,2])
N <- c(allDrugCount[selAll,2])

# Calculate proptional reporting ratio
PRR = (m/n)/((M-m)/(N-n))

# Calculate reporting Odds Ratio
d = n-m
D = N-M
ROR = (m/d)/(M/D)

# Plot the PRR and ROR
g_range <- range(0, PRR,ROR)
g_range[2] <- g_range[2] + 3
yearLen = length(aspirinAdverseCount$receipt_year)
axis(1,1:yearLen,lab=ax)
plot(PRR, type="o", col="blue", ylim=g_range,axes=FALSE, ann=FALSE)
axis(1,1:yearLen,lab=ax)
axis(2, las=1, at=1*0:g_range[2])
box()
lines(ROR, type="o", pch=22, lty=2, col="red")

As you can see, the PRR and RoR have both remained fairly steady over this time range. With the R Script above, all you need to do is change the adverseEvent variable from GASTROINTESTINAL HAEMORRHAGE to another type of reaction to analyze and compare those trends.

Summary

In this walkthrough:

  • You used a Scala script on EMR to convert the openFDA zip files to gzip.
  • You then transformed the JSON blobs into flattened Parquet files using Spark on EMR.
  • You created an Athena DDL so that you could query these Parquet files residing in S3.
  • Finally, you pointed the R package at the Athena table to analyze the data without pulling it into a database or creating your own servers.

If you have questions or suggestions, please comment below.


Next Steps

Take your skills to the next level. Learn how to optimize Amazon S3 for an architecture commonly used to enable genomic data analysis. Also, be sure to read more about running R on Amazon Athena.

 

 

 

 

 


About the Authors

Ryan Hood is a Data Engineer for AWS. He works on big data projects leveraging the newest AWS offerings. In his spare time, he enjoys watching the Cubs win the World Series and attempting to Sous-vide anything he can find in his refrigerator.

 

 

Vikram Anand is a Data Engineer for AWS. He works on big data projects leveraging the newest AWS offerings. In his spare time, he enjoys playing soccer and watching the NFL & European Soccer leagues.

 

 

Dave Rocamora is a Solutions Architect at Amazon Web Services on the Open Data team. Dave is based in Seattle and when he is not opening data, he enjoys biking and drinking coffee outside.

 

 

 

 

Protesters Physically Block HQ of Russian Web Blocking Watchdog

Post Syndicated from Andy original https://torrentfreak.com/protesters-physically-block-hq-of-russian-web-blocking-watchdog-170701/

Hardly a week goes by without the Russian web-blocking juggernaut rolling on to new targets. Whether they’re pirate websites, anonymity and proxy services, or sites that the government feels are inappropriate, web blocks are now a regular occurance in the region.

With thousands of domains and IP addresses blocked, the situation is serious. Just recently, however, blocks have been more problematic than usual. Telecoms watchdog Roskomnadzor, which oversees blocking, claims that innocent services are rarely hit. But critics say that overbroad IP address blockades are affecting the innocent.

Earlier this month there were reports that citizens across the country couldn’t access some of the country’s largest sites, including Google.ru, Yandex.ru, local Facebook variant vKontakte, and even the Telegram messaging app.

There have been various explanations for the problems, but the situation with Google appears to have stemmed from a redirect to an unauthorized gambling site. The problem was later resolved, and Google was removed from the register of banned sites, but critics say it should never have been included in the first place.

These and other developments have proven too much for some pro-freedom activists. This week they traveled to Roskomnadzor’s headquarters in St. Petersburg to give the blocking watchdog a small taste of its own medicine.

Activists from the “Open Russia” and “Civil Petersburg” movements positioned themselves outside the entrance to the telecom watchdog’s offices and built up their own barricade constructed from boxes. Each carried a label with the text “Blocked Citizens of Russia.”

Blockading the blockaders in Russia

“Freedom of information, like freedom of expression, are the basic values of our society. Those who try to attack them, must themselves be ‘blocked’ from society,” said Open Russia coordinator Andrei Pivovarov.

Rather like Internet blockades, the image above shows Open Russia’s blockade only partially doing its job by covering just three-quarters of Roskomnadzor’s entrance.

Whether that was deliberate or not is unknown but the video embedded below clearly shows staff walking around its perimeter. The protestors were probably just being considerate, but there are suggestions that staff might have been using VPNs or Tor.

Moving forward, new advice from Roskomnadzor to ISPs is that they should think beyond IP address and domain name blocking and consider using Deep Packet Inspection. This would help ensure blocks are carried out more accurately, the watchdog says.

There’s even a suggestion that rather than doing their own website filtering, Internet service providers could buy a “ready cleaned” Internet feed from an approved supplier instead. This would remove the need for additional filtering at their end, it’s argued, but it sounds like more problems waiting to happen.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Healthcare Industry Cybersecurity Report

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2017/06/healthcare_indu.html

New US government report: “Report on Improving Cybersecurity in the Health Care Industry.” It’s pretty scathing, but nothing in it will surprise regular readers of this blog.

It’s worth reading the executive summary, and then skimming the recommendations. Recommendations are in six areas.

The Task Force identified six high-level imperatives by which to organize its recommendations and action items. The imperatives are:

  1. Define and streamline leadership, governance, and expectations for health care industry cybersecurity.
  2. Increase the security and resilience of medical devices and health IT.

  3. Develop the health care workforce capacity necessary to prioritize and ensure cybersecurity awareness and technical capabilities.

  4. Increase health care industry readiness through improved cybersecurity awareness and education.

  5. Identify mechanisms to protect research and development efforts and intellectual property from attacks or exposure.

  6. Improve information sharing of industry threats, weaknesses, and mitigations.

News article.

Slashdot thread.

Torrents Help Researchers Worldwide to Study Babies’ Brains

Post Syndicated from Ernesto original https://torrentfreak.com/torrents-help-researchers-worldwide-to-study-babies-brains-170603/

One of the core pillars of academic research is sharing.

By letting other researchers know what you do, ideas are criticized, improved upon and extended. In today’s digital age, sharing is easier than ever before, especially with help from torrents.

One of the leading scientific projects that has adopted BitTorrent is the developing Human Connectome Project, or dHCP for short. The goal of the project is to map the brain wiring of developing babies in the wombs of their mothers.

To do so, a consortium of researchers with expertise ranging from computer science, to MRI physics and clinical medicine, has teamed up across three British institutions: Imperial College London, King’s College London and the University of Oxford.

The collected data is extremely valuable for the neuroscience community and the project has received mainstream press coverage and financial backing from the European Union Research Council. Not only to build the dataset, but also to share it with researchers around the globe. This is where BitTorrent comes in.

Sharing more than 150 GB of data with researchers all over the world can be quite a challenge. Regular HTTP downloads are not really up to the task, and many other transfer options have a high failure rate.

Baby brain scan (Credit: Developing Human Connectome Project)

This is why Jonathan Passerat-Palmbach, Research Associate Department of Computing Imperial College London, came up with the idea to embrace BitTorrent instead.

“For me, it was a no-brainer from day one that we couldn’t rely on plain old HTTP to make this dataset available. Our first pilot release is 150GB, and I expect the next ones to reach a couple of TB. Torrents seemed like the de facto solution to share this data with the world’s scientific community.” Passerat-Palmbach says.

The researchers opted to go for the Academic Torrents tracker, which specializes in sharing research data. A torrent with the first batch of images was made available there a few weeks ago.

“This initial release contains 3,629 files accounting for 167.20GB of data. While this figure might not appear extremely large at the moment, it will significantly grow as the project aims to make the data of 1,000 subjects available by the time it has completed.”

Torrent of the first dataset

The download numbers are nowhere in the region of an average Hollywood blockbuster, of course. Thus far the tracker has registered just 28 downloads. That said, as a superior and open file-transfer protocol, BitTorrent does aid in critical research that helps researchers to discover more about the development of conditions such as ADHD and autism.

Interestingly, the biggest challenges of implementing the torrent solution were not of a technical nature. Most time and effort went into assuring other team members that this was the right solution.

“I had to push for more than a year for the adoption of torrents within the consortium. While my colleagues could understand the potential of the approach and its technical inputs, they remained skeptical as to the feasibility to implement such a solution within an academic context and its reception by the world community.

“However, when the first dataset was put together, amounting to 150GB, it became obvious all the HTTP and FTP fallback plans would not fit our needs,” Passerat-Palmbach adds.

Baby brain scans (Credit: Developing Human Connectome Project)

When the consortium finally agreed that BitTorrent was an acceptable way to share the data, local IT staff at the university had to give their seal of approval. Imperial College London doesn’t allow torrent traffic to flow freely across the network, so an exception had to be made.

“Torrents are blocked across the wireless and VPN networks at Imperial. Getting an explicit firewall exception created for our seeding machine was not a walk in the park. It was the first time they were faced with such a situation and we were clearly told that it was not to become the rule.”

Then, finally, the data could be shared around the world.

While BitTorrent is probably the most efficient way to share large files, there were other proprietary solutions that could do the same. However, Passerat-Palmbach preferred not to force other researchers to install “proprietary black boxes” on their machines.

Torrents are free and open, which is more in line with the Open Access approach more academics take today.

Looking back, it certainly wasn’t a walk in the park to share the data via BitTorrent. Passerat-Palmbach was frequently confronted with the piracy stigma torrents have amoung many of his peers, even among younger generations.

“Considering how hard it was to convince my colleagues within the project to actually share this dataset using torrents (‘isn’t it illegal?’ and other kinds of misconceptions…), I think there’s still a lot of work to do to demystify the use of torrents with the public.

“I was even surprised to see that these misconceptions spread out not only to more senior scientists but also to junior researchers who I was expecting to be more tech-aware,” Passerat-Palmbach adds.

That said, the hard work is done now and in the months and years ahead the neuroscience community will have access to Petabytes of important data, with help from BitTorrent. That is definitely worth the effort.

Finally, we thought it was fitting to end with Passerat-Palmbach’s “pledge to seed,” which he shared with his peers. Keep on sharing!


On the importance of seeding

Dear fellow scientist,

Thank for you very much for the interest you are showing in the dHCP dataset!

Once you start downloading the dataset, you’ll notice that your torrent client mentions a sharing / seeding ratio. It means that as soon as you start downloading the dataset, you become part of our community of sharers and contribute to making the dataset available to other researchers all around the world!

There’s no reason to be scared! It’s perfectly legal as long as you’re allowed to have a copy of the dataset (that’s the bit you need to forward to your lab’s IT staff if they’re blocking your ports).

You’re actually providing a tremendous contribution to dHCP by spreading the data, so thank you again for that!

With your help, we can make sure this data remains available and can be downloaded relatively fast in the future. Over time, the dataset will grow and your contribution will be more and more important so that each and everyone of you can still obtain the data in the smoothest possible way.

We cannot do it without you. By seeding, you’re actually saying “cheers!” to your peers whom you downloaded your data from. So leave your client open and stay tuned!

All this is made possible thanks to the amazing folks at academictorrents and their infrastructure, so kudos academictorrents!

You can learn more about their project here and get some help to get started with torrent downloading here.

Jonathan Passerat-Palmbach

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

AWS Enables Consortium Science to Accelerate Discovery

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-enables-consortium-science-to-accelerate-discovery/

My colleague Mia Champion is a scientist (check out her publications), an AWS Certified Solutions Architect, and an AWS Certified Developer. The time that she spent doing research on large-data datasets gave her an appreciation for the value of cloud computing in the bioinformatics space, which she summarizes and explains in the guest post below!

Jeff;


Technological advances in scientific research continue to enable the collection of exponentially growing datasets that are also increasing in the complexity of their content. The global pace of innovation is now also fueled by the recent cloud-computing revolution, which provides researchers with a seemingly boundless scalable and agile infrastructure. Now, researchers can remove the hindrances of having to own and maintain their own sequencers, microscopes, compute clusters, and more. Using the cloud, scientists can easily store, manage, process and share datasets for millions of patient samples with gigabytes and more of data for each individual. As American physicist, John Bardeen once said: “Science is a collaborative effort. The combined results of several people working together is much more effective than could be that of an individual scientist working alone”.

Prioritizing Reproducible Innovation, Democratization, and Data Protection
Today, we have many individual researchers and organizations leveraging secure cloud enabled data sharing on an unprecedented scale and producing innovative, customized analytical solutions using the AWS cloud.  But, can secure data sharing and analytics be done on such a collaborative scale as to revolutionize the way science is done across a domain of interest or even across discipline/s of science? Can building a cloud-enabled consortium of resources remove the analytical variability that leads to diminished reproducibility, which has long plagued the interpretability and impact of research discoveries? The answers to these questions are ‘yes’ and initiatives such as the Neuro Cloud Consortium, The Global Alliance for Genomics and Health (GA4GH), and The Sage Bionetworks Synapse platform, which powers many research consortiums including the DREAM challenges, are starting to put into practice model cloud-initiatives that will not only provide impactful discoveries in the areas of neuroscience, infectious disease, and cancer, but are also revolutionizing the way in which scientific research is done.

Bringing Crowd Developed Models, Algorithms, and Functions to the Data
Collaborative projects have traditionally allowed investigators to download datasets such as those used for comparative sequence analysis or for training a deep learning algorithm on medical imaging data. Investigators were then able to develop and execute their analysis using institutional clusters, local workstations, or even laptops:

This method of collaboration is problematic for many reasons. The first concern is data security, since dataset download essentially permits “chain-data-sharing” with any number of recipients. Second, analytics done using compute environments that are not templated at some level introduces the risk of variable analytics that itself is not reproducible by a different investigator, or even the same investigator using a different compute environment. Third, the required data dump, processing, and then re-upload or distribution to the collaborative group is highly inefficient and dependent upon each individual’s networking and compute capabilities. Overall, traditional methods of scientific collaboration have introduced methods in which security is compromised and time to discovery is hampered.

Using the AWS cloud, collaborative researchers can share datasets easily and securely by taking advantage of Identity and Access Management (IAM) policy restrictions for user bucket access as well as S3 bucket policies or Access Control Lists (ACLs). To streamline analysis and ensure data security, many researchers are eliminating the necessity to download datasets entirely by leveraging resources that facilitate moving the analytics to the data source and/or taking advantage of remote API requests to access a shared database or data lake. One way our customers are accomplishing this is to leverage container based Docker technology to provide collaborators with a way to submit algorithms or models for execution on the system hosting the shared datasets:

Docker container images have all of the application’s dependencies bundled together, and therefore provide a high degree of versatility and portability, which is a significant advantage over using other executable-based approaches. In the case of collaborative machine learning projects, each docker container will contain applications, language runtime, packages and libraries, as well as any of the more popular deep learning frameworks commonly used by researchers including: MXNet, Caffe, TensorFlow, and Theano.

A common feature in these frameworks is the ability to leverage a host machine’s Graphical Processing Units (GPUs) for significant acceleration of the matrix and vector operations involved in the machine learning computations. As such, researchers with these objectives can leverage EC2’s new P2 instance types in order to power execution of submitted machine learning models. In addition, GPUs can be mounted directly to containers using the NVIDIA Docker tool and appear at the system level as additional devices. By leveraging Amazon EC2 Container Service and the EC2 Container Registry, collaborators are able to execute analytical solutions submitted to the project repository by their colleagues in a reproducible fashion as well as continue to build on their existing environment.  Researchers can also architect a continuous deployment pipeline to run their docker-enabled workflows.

In conclusion, emerging cloud-enabled consortium initiatives serve as models for the broader research community for how cloud-enabled community science can expedite discoveries in Precision Medicine while also providing a platform where data security and discovery reproducibility is inherent to the project execution.

Mia D. Champion, Ph.D.

 

Pollexy – Building a Special Needs Voice Assistant with Amazon Polly and Raspberry Pi

Post Syndicated from Ana Visneski original https://aws.amazon.com/blogs/aws/pollexy-building-a-special-needs-voice-assistant-with-amazon-polly-and-raspberry-pi/

April is Autism Awareness month and about 1 in 68 children in the U.S. have been identified with autism spectrum disorder (ASD) (CDC 2014). In this post from Troy Larson, a Sr. Devops Cloud Architect here at AWS, you get an introduction to a project he has been working on to help his son Calvin.

I have been asked how the minds at AWS come up with so many different ideas. Sometimes they come from a deeply personal place, where someone sees a way to help others. Pollexy is an amazing example of just that. Read about Pollexy and then watch the video here.

-Ana


Background

As a computer programming parent of a 16-year old non-verbal teenage boy with autism, I have been constantly searching over the years to find ways to use technology to make our lives together safer, happier and more comfortable. At the core of this challenge is the most basic of all human interaction—communication. While Calvin is able to respond to verbal instruction, he is not able to speak responsively. In his entire life, we’ve never had a conversation. He is able to be left alone in his room to play, but most every task or set of tasks requires a human to verbally prompt him along the way. Having other children and responsibilities in the home, at times the intensity of supervision can be negatively impactful on the home dynamic.

Genesis

When I saw the announcement of Amazon Polly and Amazon Lex at re:Invent last year, I immediately started churning on how we could leverage these technologies to assist Calvin. He responds well to human verbal prompts, but would he understand a digital voice? So one Saturday, I setup a Raspberry Pi in his room and closed his door and crouched around the corner with other family members so Calvin couldn’t see us. I connected to the Raspberry Pi and instructed Polly to speak in Joanna’s familiar pacific tone, “Calvin, it’s time to take a potty break. Go out of your bedroom and go to the bathroom.” In a few seconds, we heard his doorknob turn and I poked my head out of my hiding place. Calvin passed by, looking at me quizzically, then went into the bathroom as Joanna had instructed. We all looked at each other in amazement—he had listened and responded perfectly to the completely invisible voice of someone he’d never heard before. After discussing some ideas around this with co-workers, a colleague suggested I enter the IoT and AI Science Fair at our annual AWS Sales Kick-Off meeting. Less than two months after the Polly and Lex announcement and 3500 lines of code later, Pollexy—along with Calvin–debuted at the Science Fair.

Overview

Pollexy (“Polly” + “Lex”) is a Raspberry Pi and mobile-based special needs verbal assistant that lets caretakers schedule audio task prompts and messages both on a recurring schedule and/or on-demand. Caretakers can schedule regular medicine reminder messages or hourly bathroom break messages, for example, and at the same time use their Amazon Echo and mobile device to request a specific message be played immediately. Caretakers can even set it up so that the person needs to confirm that they’ve heard the message. For example, my son won’t pay attention to Pollexy unless Pollexy first asks him to “Push the blue button.” Pollexy will wait until he has pushed the button and then speak the actual message. Other people may be able to respond verbally using Lex, or not require a confirmation at all. Pollexy can be tailored to what works best.

And then most importantly—and most challenging—in a large house, how do we make sure the person is in the room where we play the message? What if we have a special needs adult living in an in-law suite? Are they in the living room or the kitchen? And what about multiple people? What if we have multiple people in different areas of the house, each of whom has a message? Let’s explore the basic elements and tie the pieces together.

Basic Elements of Pollexy

In the spirit of Amazon’s Leadership Principle “Invent and Simplify,” we want to minimize the complexity of the Pollexy architecture. We can break Pollexy down into three types of objects and three components, all of which work together in a way that’s easily explainable.

Object #1: Person

Pollexy can support any number of people. A person is a uniquely identifiable name. We can set basic preferences such as “requires confirmation” and most importantly, we can define a location schedule. This means that we can create an Outlook-like schedule that sets preferences where someone should be in the house.

Object #2: Location

A location is simply a uniquely identifiable location where a device is physically sitting. Based on the user’s location schedule, Pollexy will know which device to contact first, second, third, etc. We can also “mute” devices if needed (naptime, etc.)

Object #3: Message

Obviously, this is the actual message we want to play. Attached to each message is a person and a recurring schedule (only if it’s not a one-time message). We don’t store location with the message, because Pollexy figures out the person’s location when the message is ready to be delivered.

Component #1: Scheduler

Every message needs to be scheduled. This is a command-line tool where you basically say Tell “Calvin” that “you need to brush your teeth” every night at 8 p.m. This message is then stored in DynamoDB, waiting to be picked up by the queueing Lambda function.

Component #2: Queueing Engine

Every minute, a Lambda runs and checks the scheduler to see if there is a message or messages ready to be delivered. If a message is ready, it looks up the person’s location schedule and figures out where they are and then pushes the message or messages into an SQS queue for that location.

Component #3: Speaker Engine

Every minute on the Raspberry Pi device, the speaker engine spins up and checks the SQS for its location. If there are messages, then the speaker engine looks at the user’s preferences and initiates communication to convey the message. If the person doesn’t respond, the speaker engine will check if the person has a secondary location in their schedule and drop the message in the SQS Queue for that location. In the end, a message will either be delivered or eventually just timeout (if someone is out of the house for the day).

Respect and Freedom are the Keys

We often take our personal privacy and respect for granted, so imagine even for a special needs person, the lack of privacy and freedom around having a person constantly in your presence. This is exaggerated for those in the autism spectrum where invasion of personal space can escalate a sense of invasion, turning into anger and frustration. Pollexy becomes their own personal, gentle and never-flustered friend to coach to them along the way, giving them confidence, respect and the sense of privacy and freedom we all want to enjoy.

-Troy Larson

AWS Hot Startups – March 2017

Post Syndicated from Ana Visneski original https://aws.amazon.com/blogs/aws/aws-hot-startups-march-2017/

As the madness of March rounds up, take a break from all the basketball and check out the cool startups Tina Barr brings you for this month!

-Ana


The arrival of spring brings five new startups this month:

  • Amino Apps – providing social networks for hundreds of thousands of communities.
  • Appboy – empowering brands to strengthen customer relationships.
  • Arterys – revolutionizing the medical imaging industry.
  • Protenus – protecting patient data for healthcare organizations.
  • Syapse – improving targeted cancer care with shared data from across the country.

In case you missed them, check out February’s hot startups here.

Amino Apps (New York, NY)
Amino Logo
Amino Apps was founded on the belief that interest-based communities were underdeveloped and outdated, particularly when it came to mobile. CEO Ben Anderson and CTO Yin Wang created the app to give users access to hundreds of thousands of communities, each of them a complete social network dedicated to a single topic. Some of the largest communities have over 1 million members and are built around topics like popular TV shows, video games, sports, and an endless number of hobbies and other interests. Amino hosts communities from around the world and is currently available in six languages with many more on the way.

Navigating the Amino app is easy. Simply download the app (iOS or Android), sign up with a valid email address, choose a profile picture, and start exploring. Users can search for communities and join any that fit their interests. Each community has chatrooms, multimedia content, quizzes, and a seamless commenting system. If a community doesn’t exist yet, users can create it in minutes using the Amino Creator and Manager app (ACM). The largest user-generated communities are turned into their own apps, which gives communities their own piece of real estate on members’ phones, as well as in app stores.

Amino’s vast global network of hundreds of thousands of communities is run on AWS services. Every day users generate, share, and engage with an enormous amount of content across hundreds of mobile applications. By leveraging AWS services including Amazon EC2, Amazon RDS, Amazon S3, Amazon SQS, and Amazon CloudFront, Amino can continue to provide new features to their users while scaling their service capacity to keep up with user growth.

Interested in joining Amino? Check out their jobs page here.

Appboy (New York, NY)
In 2011, Bill Magnuson, Jon Hyman, and Mark Ghermezian saw a unique opportunity to strengthen and humanize relationships between brands and their customers through technology. The trio created Appboy to empower brands to build long-term relationships with their customers and today they are the leading lifecycle engagement platform for marketing, growth, and engagement teams. The team recognized that as rapid mobile growth became undeniable, many brands were becoming frustrated with the lack of compelling and seamless cross-channel experiences offered by existing marketing clouds. Many of today’s top mobile apps and enterprise companies trust Appboy to take their marketing to the next level. Appboy manages user profiles for nearly 700 million monthly active users, and is used to power more than 10 billion personalized messages monthly across a multitude of channels and devices.

Appboy creates a holistic user profile that offers a single view of each customer. That user profile in turn powers contextual cross-channel messaging, lifecycle engagement automation, and robust campaign insights and optimization opportunities. Appboy offers solutions that allow brands to create push notifications, targeted emails, in-app and in-browser messages, news feed cards, and webhooks to enhance the user experience and increase customer engagement. The company prides itself on its interoperability, connecting to a variety of complimentary marketing tools and technologies so brands can build the perfect stack to enable their strategies and experiments in real time.

AWS makes it easy for Appboy to dynamically size all of their service components and automatically scale up and down as needed. They use an array of services including Elastic Load Balancing, AWS Lambda, Amazon CloudWatch, Auto Scaling groups, and Amazon S3 to help scale capacity and better deal with unpredictable customer loads.

To keep up with the latest marketing trends and tactics, visit the Appboy digital magazine, Relate. Appboy was also recently featured in the #StartupsOnAir video series where they gave insight into their AWS usage.

Arterys (San Francisco, CA)
Getting test results back from a physician can often be a time consuming and tedious process. Clinicians typically employ a variety of techniques to manually measure medical images and then make their assessments. Arterys founders Fabien Beckers, John Axerio-Cilies, Albert Hsiao, and Shreyas Vasanawala realized that much more computation and advanced analytics were needed to harness all of the valuable information in medical images, especially those generated by MRI and CT scanners. Clinicians were often skipping measurements and making assessments based mostly on qualitative data. Their solution was to start a cloud/AI software company focused on accelerating data-driven medicine with advanced software products for post-processing of medical images.

Arterys’ products provide timely, accurate, and consistent quantification of images, improve speed to results, and improve the quality of the information offered to the treating physician. This allows for much better tracking of a patient’s condition, and thus better decisions about their care. Advanced analytics, such as deep learning and distributed cloud computing, are used to process images. The first Arterys product can contour cardiac anatomy as accurately as experts, but takes only 15-20 seconds instead of the 45-60 minutes required to do it manually. Their computing cloud platform is also fully HIPAA compliant.

Arterys relies on a variety of AWS services to process their medical images. Using deep learning and other advanced analytic tools, Arterys is able to render images without latency over a web browser using AWS G2 instances. They use Amazon EC2 extensively for all of their compute needs, including inference and rendering, and Amazon S3 is used to archive images that aren’t needed immediately, as well as manage costs. Arterys also employs Amazon Route 53, AWS CloudTrail, and Amazon EC2 Container Service.

Check out this quick video about the technology that Arterys is creating. They were also recently featured in the #StartupsOnAir video series and offered a quick demo of their product.

Protenus (Baltimore, MD)
Protenus Logo
Protenus founders Nick Culbertson and Robert Lord were medical students at Johns Hopkins Medical School when they saw first-hand how Electronic Health Record (EHR) systems could be used to improve patient care and share clinical data more efficiently. With increased efficiency came a huge issue – an onslaught of serious security and privacy concerns. Over the past two years, 140 million medical records have been breached, meaning that approximately 1 in 3 Americans have had their health data compromised. Health records contain a repository of sensitive information and a breach of that data can cause major havoc in a patient’s life – namely identity theft, prescription fraud, Medicare/Medicaid fraud, and improper performance of medical procedures. Using their experience and knowledge from former careers in the intelligence community and involvement in a leading hedge fund, Nick and Robert developed the prototype and algorithms that launched Protenus.

Today, Protenus offers a number of solutions that detect breaches and misuse of patient data for healthcare organizations nationwide. Using advanced analytics and AI, Protenus’ health data insights platform understands appropriate vs. inappropriate use of patient data in the EHR. It also protects privacy, aids compliance with HIPAA regulations, and ensures trust for patients and providers alike.

Protenus built and operates its SaaS offering atop Amazon EC2, where Dedicated Hosts and encrypted Amazon EBS volume are used to ensure compliance with HIPAA regulation for the storage of Protected Health Information. They use Elastic Load Balancing and Amazon Route 53 for DNS, enabling unique, secure client specific access points to their Protenus instance.

To learn more about threats to patient data, read Hospitals’ Biggest Threat to Patient Data is Hiding in Plain Sight on the Protenus blog. Also be sure to check out their recent video in the #StartupsOnAir series for more insight into their product.

Syapse (Palo Alto, CA)
Syapse provides a comprehensive software solution that enables clinicians to treat patients with precision medicine for targeted cancer therapies — treatments that are designed and chosen using genetic or molecular profiling. Existing hospital IT doesn’t support the robust infrastructure and clinical workflows required to treat patients with precision medicine at scale, but Syapse centralizes and organizes patient data to clinicians at the point of care. Syapse offers a variety of solutions for oncologists that allow them to access the full scope of patient data longitudinally, view recommended treatments or clinical trials for similar patients, and track outcomes over time. These solutions are helping health systems across the country to improve patient outcomes by offering the most innovative care to cancer patients.

Leading health systems such as Stanford Health Care, Providence St. Joseph Health, and Intermountain Healthcare are using Syapse to improve patient outcomes, streamline clinical workflows, and scale their precision medicine programs. A group of experts known as the Molecular Tumor Board (MTB) reviews complex cases and evaluates patient data, documents notes, and disseminates treatment recommendations to the treating physician. Syapse also provides reports that give health system staff insight into their institution’s oncology care, which can be used toward quality improvement, business goals, and understanding variables in the oncology service line.

Syapse uses Amazon Virtual Private Cloud, Amazon EC2 Dedicated Instances, and Amazon Elastic Block Store to build a high-performance, scalable, and HIPAA-compliant data platform that enables health systems to make precision medicine part of routine cancer care for patients throughout the country.

Be sure to check out the Syapse blog to learn more and also their recent video on the #StartupsOnAir video series where they discuss their product, HIPAA compliance, and more about how they are using AWS.

Thank you for checking out another month of awesome hot startups!

-Tina Barr

 

AWS and Healthcare: View From the Floor of HIMSS17

Post Syndicated from Ana Visneski original https://aws.amazon.com/blogs/aws/aws-and-healthcare-view-from-the-floor-of-himss17/

Jordin Green is our guest writer today, with an inside view from the floor of HIMSS17.

-Ana


HIMMS1Empathy. It’s not always a word you hear associated with technology but one might argue it should be a central tenet for the application of technology in healthcare, and I was reminded of that fact as I wandered the halls representing AWS at HIMSS17, the Healthcare Information and Management System Society annual meeting.

At Amazon, we’re taught to obsess over our customers, but this obsession takes on a new level of responsibility when those customers are directly working on improving the lives of patients and the overall wellness of society. Thinking about the challenges that healthcare professionals are dealing with every day drives home how important it is for AWS to ensure that using the cloud in healthcare is as frictionless as possible. So with that in mind I wanted to share some of the things I saw in and around HIMSS17 regarding healthcare and AWS.

I started my week at the HIMSS Cloud Computing Forum, which was a new full-day HIMSS pre-day focused on educating the healthcare community on cloud. I was particularly struck by the breadth of cloud use cases being explored throughout the industry, even compared to a few years ago. The program featured presentations on cloud-based care coordination, precision medicine, and security. Perhaps one of the most interesting presentations came from Jessica Kahn from the Center for Medicare & Medicaid Services (CMS), talking about the analytics platform that CMS has built in the cloud along with Nuna. Jessica talked about how a cloud-based platform allows CMS to make decisions based on data that is a month old, rather than a year or years old. Additionally, using an automated, rules-based approach, policymakers can directly query the data without having to rely on developers, bringing agility. Nuna has already gained a number of insights from hosting this de-identified Medicaid data for policy research, and is now looking to expand its services to private insurance.

The exhibition opened on Monday, and I was really excited to talk to customers about the new Healthcare & Life Sciences category in the AWS Marketplace. AWS Marketplace is a managed and curated software catalog that helps customers innovate faster and reduce costs, by making it easy to discover, evaluate, procure, immediately deploy and manage 3rd party software solutions. When a customer purchases software via the Marketplace, all of the infrastructure needed to run on AWS is deployed automatically, using the same pay-as-you-go pricing model that AWS uses. Creation of a dedicated category of healthcare is a huge step forward in making it easier for our customers to deploy cloud-based solutions. Our new category features telehealth solutions, products for managing HIPAA compliance, and products that can be used for revenue cycle management from AWS Partner Network (APN) such as PokitDok. We’re just getting started with this category; look for new additions throughout the year.

Later in the week, I tried to spend time at the number of APN Partners exhibiting at HIMSS this year, and it’s safe to say our ecosystem also had lots of moments to shine. Orion Health announced that they will migrate their Amadeus precision medicine platform to the AWS Cloud. Orion has been deploying on top of AWS for a while now, including notably the California-wide Health Information Exchange CalINDEX. Amadeus currently manages 110 million patient records; the migration will represent a significant volume of clinical data running on AWS. New APN Partner Merck announced a new Alexa skill challenge, asking developers to come up with new, innovative ways to use Alexa in the management of chronic disease. Healthcare Competency Partner ClearDATA announced its new fully-managed Containers-as-a-Service product, which simplifies development of healthcare applications by providing developers with a HIPAA-compliant environment for building, testing, and deployment.

This is only a small sample of the activity going on at HIMSS this year, and it’s impossible to capture everything in one post. You learn more about healthcare on AWS on our Cloud Computing in Healthcare page. Nonetheless, after spending four days diving in to healthcare IT, it was great to see how AWS is enabling our healthcare customers and partners deliver solutions that are impacting millions of lives across the globe.

Jordin Green

AWS Marketplace Adds Healthcare & Life Sciences Category

Post Syndicated from Ana Visneski original https://aws.amazon.com/blogs/aws/aws-marketplace-adds-healthcare-life-sciences-category/

Wilson To and Luis Daniel Soto are our guest bloggers today, telling you about a new industry vertical category that is being added to the AWS Marketplace.Check it out!

-Ana


AWS Marketplace is a managed and curated software catalog that helps customers innovate faster and reduce costs, by making it easy to discover, evaluate, procure, immediately deploy and manage 3rd party software solutions.  To continue supporting our customers, we’re now adding a new industry vertical category: Healthcare & Life Sciences.

healthpost

This new category brings together best-of-breed software tools and solutions from our growing vendor ecosystem that have been adapted to, or built from the ground up, to serve the healthcare and life sciences industry.

Healthcare
Within the AWS Marketplace HCLS category, you can find solutions for Clinical information systems, population health and analytics, health administration and compliance services. Some offerings include:

  1. Allgress GetCompliant HIPAA Edition – Reduce the cost of compliance management and adherence by providing compliance professionals improved efficiency by automating the management of their compliance processes around HIPAA.
  2. ZH Healthcare BlueEHS – Deploy a customizable, ONC-certified EHR that empowers doctors to define their clinical workflows and treatment plans to enhance patient outcomes.
  3. Dicom Systems DCMSYS CloudVNA – DCMSYS Vendor Neutral Archive offers a cost-effective means of consolidating disparate imaging systems into a single repository, while providing enterprise-wide access and archiving of all medical images and other medical records.

Life Sciences

  1. National Instruments LabVIEW – Graphical system design software that provides scientists and engineers with the tools needed to create and deploy measurement and control systems through simple yet powerful networks.
  2. NCBI Blast – Analysis tools and datasets that allow users to perform flexible sequence similarity searches.
  3. Acellera AceCloud – Innovative tools and technologies for the study of biophysical phenomena. Acellera leverages the power of AWS Cloud to enable molecular dynamics simulations.

Healthcare and life sciences companies deal with huge amounts of data, and many of their data sets are some of the most complex in the world. From physicians and nurses to researchers and analysts, these users are typically hampered by their current systems. Their legacy software cannot let them efficiently store or effectively make use of the immense amounts of data they work with. And protracted and complex software purchasing cycles keep them from innovating at speed to stay ahead of market and industry trends. Data analytics and business intelligence solutions in AWS Marketplace offer specialized support for these industries, including:

  • Tableau Server – Enable teams to visualize across costs, needs, and outcomes at once to make the most of resources. The solution helps hospitals identify the impact of evidence-based medicine, wellness programs, and patient engagement.
  • TIBCO Spotfire and JasperSoft. TIBCO provides technical teams powerful data visualization, data analytics, and predictive analytics for Amazon Redshift, Amazon RDS, and popular database sources via AWS Marketplace.
  • Qlik Sense Enterprise. Qlik enables healthcare organizations to explore clinical, financial and operational data through visual analytics to discover insights which lead to improvements in care, reduced costs and delivering higher value to patients.

With more than 5,000 listings across more than 35 categories, AWS Marketplace simplifies software licensing and procurement by enabling customers to accept user agreements, choose pricing options, and automate the deployment of software and associated AWS resources with just a few clicks. AWS Marketplace also simplifies billing for customers by delivering a single invoice detailing business software and AWS resource usage on a monthly basis.

With AWS Marketplace, we can help drive operational efficiencies and reduce costs in these ways:

  • Easily bring in new solutions to solve increasingly complex issues, gain quick insight into the huge amounts of data users handle.
  • Healthcare data will be more actionable. We offer pay-as-you-go solutions that make it considerably easier and more cost-effective to ingest, store, analyze, and disseminate data.
  • Deploy healthcare and life sciences software with 1-Click ease — then evaluate and deploy it in minutes. Users can now speed up their historically slow cycles in software procurement and implementation.
  • Pay only for what’s consumed — and manage software costs on your AWS bill.
  • In addition to the already secure AWS Cloud, AWS Marketplace offers industry-leading solutions to help you secure operating systems, platforms, applications and data that can integrate with existing controls in your AWS Cloud and hybrid environment.

Click here to see who the current list of vendors are in our new Healthcare & Life Sciences category.

Come on In
If you are a healthcare ISV and would like to list and sell your products on AWS, visit our Sell in AWS Marketplace page.

– Wilson To and Luis Daniel Soto

AWS Big Data is Coming to HIMSS!

Post Syndicated from Christopher Crosbie original https://aws.amazon.com/blogs/big-data/aws-big-data-is-coming-to-himss/

The AWS Big Data team is coming to HIMSS, the industry-leading conference for professionals in the field of healthcare technology. The conference brings together more than 40,000 health IT professionals, clinicians, administrators, and vendors to talk about the latest innovations in health technology. Because transitioning healthcare to the cloud is at the forefront of this year’s conversations, for the first time, HIMSS is hosting a conference pre-day on February 19 that is focused on the use of cloud in healthcare.

explore_aws_healthcare

This year’s conference will be held at the Orange County Convention Center in Orlando, Florida from February 20 – 23. You can visit us at booth 6969 to learn about how AWS healthcare customers like Cambia and Cleveland Clinic are leveraging cloud-based analytics to support healthcare’s digital transformation. The booth will be staffed by AWS certified solution architects who can answer questions about transitioning existing health applications into the cloud or creating new big data solutions to meet the evolving needs of healthcare.

If you’re interested in understanding how your health data skills fit in at AWS, there will be recruiters and hiring manages onsite to discuss AWS career opportunities. Just send e-mail to [email protected] to set up an informal chat.

Thousands of healthcare customers are using AWS to change the way they deliver care, engage with patients, or incorporate new technology into their organization by using HIPAA-eligible big data services such as:

  • Amazon EMR, a managed Hadoop framework.
  • Amazon DynamoDB, a fast and flexible NoSQL database service.
  • Amazon Aurora [MySQL-compatible edition only], a relational database engine that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases.
  • Amazon Redshift, a fast, simple, cost-effective data warehouse.
  • Amazon S3, a durable, massively scalable object store.

Check out some past AWS Big Data Blog posts to see how these technologies are being used to improve healthcare:

For more information about how healthcare customers are using AWS, visit aws.amazon.com/health.

The Visual Theremin

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/the-visual-theremin/

The theremin.

To some, it’s the instrument that reminds us of one of popular culture’s most famous theme songs. To others, it’s the confusing flailing of arms to create music. And to others still, it sounds like a medicine or a small rodent.

Theremin plays the theremin

“Honey, can you pass me a tube of Theremin? The theremin bit me and I think it might be infected.”

In order to help their visitors better understand the origin of the theremin, Australia’s MAAS Powerhouse Museum built an interactive exhibit, allowing visitors to get their hands on, or rather get their hands a few inches above, this unique instrument.

MAAS Powerhouse Theremin Raspberry Pi

As advocates of learning by doing, the team didn’t simply want to let their museumgoers hear the theremin. They wanted them to see it too. So to accomplish this task, they chose a BitScope Blade Uno, along with a Raspberry Pi, BitScope, and LCD monitor.

MAAS Theremin Raspberry Pi

The Raspberry Pi and BitScope are left on display so visitors may better understand how the exhibit works.

BitScope go into a deeper explanation of how the entire exhibit fits together on their website:

The Raspberry Pi, powered and mounted on the BitScope Blade Uno, provided the computing platform to drive the display (via HDMI), and it also ran the BitScope application.

The BitScope itself, also mounted on the Blade, was connected (via USB) to the Raspberry Pi and powered by the Blade.

The theremin output was connected via splitters (so they could also be connected to a sound amplifier) and BNC terminated coaxial cables to the analogue inputs via a BitScope probe adapter.

We’d love to see a video of the MAAS theremin in action so if you’re reading this, MAAS, please send us a video. Until then, have this:

Sheldon’s Theremin

Sheldon playing the star trek soundtrack on his theremin

The post The Visual Theremin appeared first on Raspberry Pi.

FDA Recommendations on Medical-Device Cybersecurity

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2017/01/fda_recommendat.html

The FDA has issued a report giving medical devices guidance on computer and network security. There’s nothing particularly new or interesting; it reads like standard security advice: write secure software, patch bugs, and so on.

Note that these are “non-binding recommendations,” so I’m really not sure why they bothered.

EDITED TO ADD (1/13): Why they bothered.

Create a Healthcare Data Hub with AWS and Mirth Connect

Post Syndicated from Joseph Fontes original https://aws.amazon.com/blogs/big-data/create-a-healthcare-data-hub-with-aws-and-mirth-connect/

As anyone visiting their doctor may have noticed, gone are the days of physicians recording their notes on paper. Physicians are more likely to enter the exam room with a laptop than with paper and pen. This change is the byproduct of efforts to improve patient outcomes, increase efficiency, and drive population health. Pushing for these improvements has created many new data opportunities as well as challenges. Using a combination of AWS services and open source software, we can use these new datasets to work towards these goals and beyond.

When you get a physical examination, your doctor’s office has an electronic chart with information about your demographics (name, date of birth, address, etc.), healthcare history, and current visit. When you go to the hospital for an emergency, a whole new record is created that may contain duplicate or conflicting information. A simple example would be that my primary care doctor lists me as Joe whereas the hospital lists me as Joseph.

Providers record patient information across different software platforms. Each of these platforms can have varying implementations of complex healthcare data standards. Also, each system needs to communicate with a central repository called a health information exchange (HIE) to build a central, complete clinical record for each patient.

In this post, I demonstrate the capability to consume different data types as messages, transform the information within the messages, and then use AWS service to take action depending on the message type.

Overview of Mirth Connect

Using open source technologies on AWS, you can build a system that transforms, stores, and processes this data as needed. The system can scale to meet the ever-increasing demands of modern medicine. The project that ingests and processes this data is called Mirth Connect.

Mirth Connect is an open source, cross-platform, bidirectional, healthcare integration engine. This project is a standalone server that functions as a central point for routing and processing healthcare information.

Running Mirth Connect on AWS provides the necessary scalability and elasticity to meet the current and future needs of healthcare organizations.

Healthcare data hub walkthrough

Healthcare information comes from various sources and can generate large amounts of data:

  • Health information exchange (HIE)
  • Electronic health records system (EHR)
  • Practice management system (PMS)
  • Insurance company systems
  • Pharmacy systems
  • Other source systems that can make data accessible

Messages typically require some form of modification (transformation) to accommodate ingestion and processing in other systems. Using another project, Blue Button, you can dissect large healthcare messages and locate the sections/items of interest. You can also convert those messages into other formats for storage and analysis.

Data types

The examples in this post focus on the following data types representing information made available from a typical healthcare organization:

HL7 version 2 messages define both a message format and communication protocol for health information. They are broken into different message types depending on the information that they transmit.

There are many message types available, such as ordering labs, prescription dispensing, billing, and more. During a routine doctor visit, numerous messages are created for each patient. This provides a lot of information but also a challenge in storage and processing. For a full list of message types, see Data Definition Tables, section A6. The two types used for this post are:

  • ADT A01 (patient admission and visit notification)

View a sample HL7 ADT A01 message

  • SIU S12 (new appointment booking)

View a sample SIU S12 message

As you can see, this text is formatted as delimited data, where the delimiters are defined in the top line message called the MSG segment. Mirth Connect can parse these messages and communicate using the standard HL7 network protocol.

o_mblog-016

CCD documents, on the other hand, were defined to give the patient a complete snapshot of their medical record. These documents are formatted in XML and use coding systems for procedures, diagnosis, prescriptions, and lab orders. In addition, they contain patient demographics, information about patient vital signs, and an ever-expanding list of other available patient information.

Health insurance companies can make data available to healthcare organizations through various formats. Many times, the term CSV is used to identify not just comma-delimited data types but also data types with other delimiters in use. This data type is commonly used to send datasets from many different industries, including healthcare.

Finally, DICOM is a standard used for medical imaging. This format is used for many different medical purposes such as x-rays, ultrasound, CT scans, MRI results, and more. These files can contain both image files as well as text results and details about the test. For many medical imaging tests, the image files can be quite large as they must contain very detailed information. The storage of these large files—in addition to the processing of the reporting data contained in the files—requires the ability to scale both storage and compute capacity. You must also use a storage option providing extensive data durability.

Infrastructure design

Before designing and deploying a healthcare application on AWS, make sure that you read through the AWS HIPAA Compliance whitepaper. This covers the information necessary for processing and storing patient health information (PHI).

To meet the needs of scalability and elasticity with the Mirth Connect application, use Amazon EC2 instances with Amazon EBS storage. This allows you to take advantage of AWS features such as Auto Scaling for instances. EBS allows you to deploy encrypted volumes, readily provision multiple block devices with varying sizes and throughput, and create snapshots for backups.

Mirth Connect also requires a database backend that must be secure, highly available, and scalable. To meet these needs with a HIPAA-eligible AWS service, use Amazon RDS with MySQL.

Next, implement JSON as the standard format for the message data storage to facilitate easier document query and retrieval. With the core functionality of Mirth Connect as well as other open source software offerings, you can convert each message type to JSON. The storage and retrieval of these converted messages requires the use of a database. If you were to use a relational database, you would either need to store blobs of text for items/fields not yet defined or create each column that may be needed for a message.

With the ever-updating standards and increases in data volume and complexity, it can be advantageous to use a database engine that can automatically expand the item schema. For this post, use Amazon DynamoDB. DynamoDB is a fully managed, fast, and flexible NoSQL database service providing scalable and reliable low latency data access.

If you use DynamoDB with JSON, you don’t have to define each data element field of a database message before insert. Because DynamoDB is a managed service, it eliminates the administrative overhead of deploying and maintaining a distributed, scalable NoSQL database cluster. For more information about the addition of the JSON data type to DynamoDB, see the Amazon DynamoDB Update – JSON, Expanded Free Tier, Flexible Scaling, Larger Items AWS Blog post.

The file storage requirements for running a healthcare integration engine can include the ability to store large numbers of small files, large numbers of large files, and everything in between. To ensure that the requirements for available storage space and data durability are met, use Amazon S3 to store the original files as well as processed files and images.

Finally, use some additional AWS services to further facilitate automation. One of these examples is Amazon CloudWatch. By sending Mirth Connect metrics to CloudWatch, we can enable AWS alerts and actions that ensure optimal application availability, efficiency, and responsiveness. An example Mirth Connect implementation design is available in the following diagram:

o_aws-service-diagram-2

AWS service configuration

When you launch the EC2 instance, make sure to encrypt volumes that contain patient information. Also, it would be a good practice to also encrypt the boot partition. For more information, see the New – Encrypted EBS Boot Volumes AWS Blog post. I am storing the protected information on a separate encrypted volume to make for easier snapshots, mounted as /mirth/. For more information, see Amazon EBS Encryption.

Also, launch an Amazon RDS instance to store the Mirth Connect configuration information, message information, and logs. For more information, see Encrypting Amazon RDS Resources.

During the configuration and deployment of your RDS instance, update the max_allowed_packet value to accommodate the larger size of DICOM images. Change this setting by creating a parameter group for Amazon RDS MySQL/MariaDB/Aurora. For more information, see Working with DB Parameter Groups. I have set the value to 1073741824.

After the change is complete, you can compare database parameter groups to ensure that the variable has been set properly. In the Amazon RDS console, choose Parameter groups and check the box next to the default parameter group for your database and the parameter group that you created. Choose Compare Parameters.

o_mblog-017

Use an Amazon DynamoDB table to store JSON-formatted message data. For more information, see Creating Tables and Loading Sample Data.

For your table name, use mirthdata. In the following examples, I used mirthid as the primary key as well as a sort key named mirthdate. The sort key contains the date and time that each message is processed.

o_mblog-009

o_mblog-018

Set up an Amazon S3 bucket to store your messages, images, and other information. Make your bucket name globally unique. For this post, I used the bucket named mirth-blog-data; you can create a name for your bucket and update the Mirth Connect example configuration to reflect that value. For more information, see Protecting Data Using Encryption.

Mirth Connect deployment

You can find installation packaging options for Mirth Connect at Downloads. Along with the Mirth Connect installation, also download and install the Mirth Connect command line interface. For information about installation and initial configuration steps, see the Mirth Connect Getting Started Guide.

After it’s installed, ensure that the Mirth Connect service starts up automatically on reboot, and then start the service.

 

chkconfig --add mcservice
/etc/init.d/mcservice start

o_mblog-001

The Mirth Connect Getting Started Guide covers the initial configuration steps and installation. On first login, you are asked to configure the admin user and to set a password.

o_mblog-004

o_mblog-005

o_mblog-006

You are now presented with the Mirth Connect administrative console. You can view an example console with a few deployed channels in the following screenshot.

o_mblog-019

By default, Mirth Connect uses the Derby database on the local instance for configuration and message storage. Your next agenda item is to switch to a MySQL-compatible database with Amazon RDS. For information about changing the Mirth Connect database type, see Moving Mirth Connect configuration from one database to another on the Mirth website.

Using the Mirth Connect command line interface (CLI), you can perform various functions for CloudWatch integration. Run the channel list command from an SSH console on your Mirth Connect EC2 instance.

In the example below, replace USER with the user name to use for connection (typically, admin if you have not created other users). Replace PASSWORD with the password that you set for the associated user. After you are logged in, run the command channel list to ensure that the connection is working properly:

 

mccommand -a https://127.0.0.1:8443 -u USER -p PASSWORD
channel list
quit

o_mblog-020

Using the command line interface, you can create bash scripts using the CloudWatch put-metric-data function. This allows you to push Mirth Connect metrics directly into CloudWatch. From there, you could build alerts and actions based on this information.

 

echo "dump stats \"/tmp/stats.txt\"" > m-cli-dump.txt
mccommand -a https://127.0.0.1:8443 -u admin -p $MIRTH_PASS -s m-cli-dump.txt
cat /tmp/stats.txt

o_mblog-021

More information for the administration and configuration of your new Mirth Connect implementation can be found on both the Mirth Connect Wiki as well as the PDF user guide.

Example configuration

Mirth Connect is designed around the concept of channels. Each channel has a source and one or more destinations. In addition, messages accepted from the source can be modified (transformed) when they are ingested at the source, or while they are being sent to their destination. The guides highlighted above as well as the Mirth Connect Wiki discuss these concepts and their configuration options.

The example configuration has been made available as independent channels available for download. You are encouraged to import and review the channel details to understand the design of the message flow and transformation. For a full list of example channels, see mirth-channels in the Big Data Blog GitHub repository.

Channels:

  • CCD-Pull – Processes the files in the directory
  • CCD-to-BB – Uses Blue Button to convert CCD into JSON
  • CCD-to-S3 – Sends files to S3
  • BB-to-DynamoDB – Inserts Blue Button JSON into DynamoDB tables
  • DICOM-Pull – Processes the DICOM files from the directory
  • DICOM-Image – Process DICOM embedded images and stores data in Amazon S3 and Amazon RDS
  • DICOM-to-JSON – Uses DCM4CHE to convert XML into JSON. Also inserts JSON into DynamoDB tables and saves the DICOM image attachment and original file in S3
  • HL7-to-DynamoDB – Inserts JSON data into Amazon DynamoDB
  • HL7-Inbound – Pulls HL7 messages into engine and uses NodeJS web service to convert data into JSON

For these examples, you are using file readers and file writers as opposed to the standard HL7 and DICOM network protocols. This allows you to test message flow without setting up HL7 and DICOM network brokers. To use the example configurations, create the necessary local file directories:

mkdir -p /mirth/app/{s3,ccd,bb,dicom,hl7}/{input,output,processed,raw}

Blue Button library

Blue Button is an initiative to make patient health records available and readable to patients. This initiative has driven the creation of a JavaScript library that you use to parse the CCD (XML) documents into JSON. This project can be used to both parse and create CCD documents.

For this post, run this library as a service using Node.js. This library is available for installation as blue-button via the Node.js package manager. For details about the package, see the npmjs website. Instructions for the installation of Node.js via a package manager can be found on nodejs.org.

For a simple example of running a Blue Button web server on the local Amazon EC2 instance with the test data, see the bb-conversion-listener.js script in the Big Data Blog repository.

Processing clinical messages

The flow diagram below illustrates the flow of CCDA/CCD messages as they are processed through Mirth Connect.

o_flow-ccd-1

Individual message files are placed in the “/mirth/app/ccd/input/” directory. From there, the channel, “CCD-Pull” reads the file and then transfers to another directory for processing. The message data is then sent to the Mirth Connect channel “CCD-to-S3”, which then saves a copy of the message to S3 while also creating an entry for each item in Amazon RDS. This allows you to create a more efficient file-naming pattern for S3 while still maintaining a link between the original file name and the new S3 key.

Next, the Mirth Connect channel “CCD-to-BB” passes the message data via an HTTP POST to a web service running on your Mirth Connect instance using the Blue Button open source healthcare project. The HTTP response contains the JSON-formatted translation of the document you sent. This JSON data can now be inserted into a DynamoDB table. When you are building a production environment, this web service could be separated onto its own instance and set up to run behind an Application Load Balancer with Auto Scaling.

You can find sample files online for processing across different locations. Try the following for CCD and DICOM examples:

AWS service and Mirth integration

To integrate with various AWS services, Mirth Connect can use the AWS SDK for Java by invoking custom Java code. For more information, see How to create and invoke custom Java code in Mirth Connect on the Mirth website.

For more information about the deployment and use of the AWS SDK for Java, see Getting Started with the AWS SDK for Java.

To facilitate the use of the example channels, a sample Java class implementation for use with Mirth Connect can be found at https://github.com/awslabs/aws-big-data-blog/tree/master/aws-blog-mirth-healthcare-hub.

For more information about building the code example using Maven, see Using the SDK with Apache Maven.

You can view the addition of these libraries to each channel in the following screenshots.

o_mblog-011

o_mblog-012

After test messages are processed, you can view the contents in DynamoDB.

o_mblog-015

DICOM messages

The dcm4che project is a collection of open source tools supporting the DICOM standard. When a DICOM message is passed through Mirth Connect, the text content is transformed to XML with the image included as an attachment. You convert the DICOM XML into JSON for storage in DynamoDB and store both the original DICOM message as well as the separate attachment in S3, using the AWS SDK for Java and the dcm4che project. The source code for dcm4che can be found in the dcm4che GitHub repository.

Readers can find the example dcm4che functions used in this example in the mirth-aws-dicom-app in the AWS Big Data GitHub repository.

The following screenshots are examples of the DICOM message data converted to XML in Mirth Connect, along with the view of the attached medical image. Also, you can view an example of the data stored in DynamoDB.

o_mblog-013

o_mblog-014

 

The DICOM process flow can be seen in the following diagram.

o_flow-dicom-1

Conclusion

With the combination of channels, you now have your original messages saved and available for archive via Amazon Glacier. You also have each message stored in DynamoDB as discrete data elements, with references to each message stored in Amazon RDS for use with message correlation. Finally, you have your transformed messages, processed messages, and image attachments saved in S3, making them available for direct use or archiving.

Later, you can implement analysis of the data with Amazon EMR. This would allow the processing of information directly from S3 as well as the data in Amazon RDS and DynamoDB.

With an open source health information hub, you have a centralized repository for information traveling throughout healthcare organizations, with community support and development constantly improving functionality and capabilities. Using AWS, health organizations are able to focus on building better healthcare solutions rather than building data centers.

If you have questions or suggestions, please comment below.


About the Author

joe_fontes_90Joseph Fontes is a Solutions Architect at AWS with the Emerging Partner program. He works with AWS Partners of all sizes, industries, and technologies to facilitate technical adoption of the AWS platform with emphasis on efficiency, performance, reliability, and scalability. Joe has a passion for technology and open source which is shared with his loving wife and two sons.

 

 


Related

How Eliza Corporation Moved Healthcare Data to the Cloud

o_eliza_1

Interactive Analysis of Genomic Datasets Using Amazon Athena

Post Syndicated from Aaron Friedman original https://aws.amazon.com/blogs/big-data/interactive-analysis-of-genomic-datasets-using-amazon-athena/

Aaron Friedman is a Healthcare and Life Sciences Solutions Architect with Amazon Web Services

The genomics industry is in the midst of a data explosion. Due to the rapid drop in the cost to sequence genomes, genomics is now central to many medical advances. When your genome is sequenced and analyzed, raw sequencing files are processed in a multi-step workflow to identify where your genome differs from a standard reference. Your variations are stored in a Variant Call Format (VCF) file, which is then combined with other individuals to enable population-scale analyses. Many of these datasets are publicly available, and an increasing number are hosted on AWS as part of our Open Data project.

To mine genomic data for new discoveries, researchers in both industry and academia build complex models to analyze populations at scale. When building models, they first explore the datasets-of-interest to understand what questions the data might answer. In this step, interactivity is key, as it allows them to move easily from one question to the next.

Recently, we launched Amazon Athena as an interactive query service to analyze data on Amazon S3. With Amazon Athena there are no clusters to manage and tune, no infrastructure to setup or manage, and customers pay only for the queries they run. Athena is able to query many file types straight from S3. This flexibility gives you the ability to interact easily with your datasets, whether they are in a raw text format (CSV/JSON) or specialized formats (e.g. Parquet). By being able to flexibly query different types of data sources, researchers can more rapidly progress through the data exploration phase for discovery. Additionally, researchers don’t have to know nuances of managing and running a big data system. This makes Athena an excellent complement to data warehousing on Amazon Redshift and big data analytics on Amazon EMR 

In this post, I discuss how to prepare genomic data for analysis with Amazon Athena as well as demonstrating how Athena is well-adapted to address common genomics query paradigms.  I use the Thousand Genomes dataset hosted on Amazon S3, a seminal genomics study, to demonstrate these approaches. All code that is used as part of this post is available in our GitHub repository.

Although this post is focused on genomic analysis, similar approaches can be applied to any discipline where large-scale, interactive analysis is required.

 

Select, aggregate, annotate query pattern in genomics

Genomics researchers may ask different questions of their dataset, such as:

  • What variations in a genome may increase the risk of developing disease?
  • What positions in the genome have abnormal levels of variation, suggesting issues in quality of sequencing or errors in the genomic reference?
  • What variations in a genome influence how an individual may respond to a specific drug treatment?
  • Does a group of individuals contain a higher frequency of a genomic variant known to alter response to a drug relative to the general population?

All these questions, and more, can be generalized under a common query pattern I like to call “Select, Aggregate, Annotate”. Some of our genomics customers, such as Human Longevity, Inc., routinely use this query pattern in their work.

In each of the above queries, you execute the following steps:

SELECT: Specify the cohort of individuals meeting certain criteria (disease, drug response, age, BMI, entire population, etc.).

AGGREGATE: Generate summary statistics of genomic variants across the cohort that you selected.

ANNOTATE: Assign meaning to each of the variants by joining on known information about each variant.

Dataset preparation

Properly organizing your dataset is one of the most critical decisions for enabling fast, interactive analysies. Based on the query pattern I just described, the table representing your population needs to have the following information:

  • A unique sample ID corresponding to each sample in your population
  • Information about each variant, specifically its location in the genome as well as the specific deviation from the reference
  • Information about how many times in a sample a variant occurs (0, 1, or 2 times) as well as if there are multiple variants in the same site. This is known as a genotype.

The extract, transform, load (ETL) process to generate the appropriate data representation has two main steps. First, you use ADAM, a genomics analysis platform built on top of Spark, to convert the variant information residing a VCF file to Parquet for easier downstream analytics, in a process similar to the one described in the Will Spark Power the Data behind Precision Medicine? post. Then, you use custom Python code to massage the data and select only the appropriate fields that you need for analysis with Athena.

First, spin up an EMR cluster (version 5.0.3) for the ETL process. I used a c4.xlarge for my master node and m4.4xlarges with 1 TB of scratch for my core nodes.

After you SSH into your master node, clone the git repository. You can also put this in as a bootstrap action when spinning up your cluster.

sudo yum –y install git
git clone https://github.com/awslabs/aws-big-data-blog.git 

You then need to install ADAM and configure it to run with Spark 2.0. In your terminal, enter the following:

# Install Maven
wget http://apache.claz.org/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
tar xvzf apache-maven-3.3.9-bin.tar.gz
export PATH=/home/hadoop/apache-maven-3.3.9/bin:$PATH

# Install ADAM
git clone https://github.com/bigdatagenomics/adam.git
cd adam
sed -i 's/2.6.0/2.7.2/' pom.xml
./scripts/move_to_spark_2.sh
export MAVEN_OPTS="-Xmx512m -XX:MaxPermSize=256m"
mvn clean package -DskipTests

export PATH=/home/hadoop/adam/bin:$PATH

Now that ADAM is installed, enter the working directory of the repo that you first cloned, which contains the code relevant to this post. Your next step is to convert a VCF containing all the information about your population.

For this post, you are going to focus on a single region in the genome, defined as chromosome 22, for all 2,504 samples in the Thousand Genomes dataset. As chromosome 22 was the first chromosome to be sequenced as part of the Human Genome Project, it is your first here as well. You can scale the size of your cluster depending on whether you are looking at just chromosome 22, or the entire genome. In the following lines, replace mytestbucket with your bucket name of choice.

PARQUETS3PATH=s3://<mytestbucket>/thousand_genomes/grch37/chr22.parquet/
cd ~/aws-big-data-blog/aws-blog-athena-genomics/etl/thousand_genomes/
chmod +x ./convert_vcf.sh
./convert_vcf.sh $PARQUETS3PATH

After this conversion completes, you can reduce your Parquet file to only the fields you need.

TRIMMEDS3PATH=s3://<mytestbucket>/thousand_genomes/grch37_trimmed/chr22.parquet/
spark-submit --master yarn --executor-memory 40G --deploy-mode cluster create_trimmed_parquet.py --input_s3_path $PARQUETS3PATH --output_s3_path $TRIMMEDS3PATH 

Annotation data preparation

For this post, use the ClinVar dataset, which is an archive of information about variation in genomes and health. While Parquet is a preferred format for Athena relative to raw text, using the ClinVar TXT file demonstrates the ability of Athena to read multiple file types.

In a Unix terminal, execute the following commands (replace mytestbucket with your bucket name of choice):

ANNOPATH=s3://<mytestbucket>/clinvar/

wget ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz

# Need to strip out the first line (header)
zcat variant_summary.txt.gz | sed '1d' > temp ; mv -f temp variant_summary.trim.txt ; gzip variant_summary.trim.txt

aws s3 cp variant_summary.trim.txt.gz $ANNOPATH --sse

Creating Tables using Amazon Athena

In the recent Amazon Athena – Interactive SQL Queries for Data in Amazon S3 post, Jeff Barr covered how to get started with Athena and navigate to the console. Athena uses the Hive DDL when you create, alter, or drop a table. This means you can use the standard Hive syntax for creating tables. These commands can also be found in the aws-blog-athena-genomics GitHub repo under sql/setup.

First, create your database:

CREATE DATABASE kg;

Next, define your table with the appropriate mappings from the previous ETL processes. Be sure to change to your appropriate bucket name.

For your population data:

If you want to skip the ETL process described above for your population data, you can use the following path for your analysis: s3://aws-bigdata-blog/artifacts/athena_genomics.parquet/ – be sure to substitute it in for the LOCATION below.

CREATE EXTERNAL TABLE demo.samplevariants
(
  alternateallele STRING
  ,chromosome STRING
  ,endposition BIGINT
  ,genotype0 STRING
  ,genotype1 STRING
  ,referenceallele STRING
  ,sampleid STRING
  ,startposition BIGINT
)
STORED AS PARQUET
LOCATION ' s3://<mytestbucket>/thousand_genomes/grch37_trimmed/chr22.parquet/';

For your annotation data:

CREATE EXTERNAL TABLE demo.clinvar
(
    alleleId STRING
    ,variantType STRING
    ,hgvsName STRING
    ,geneID STRING
    ,geneSymbol STRING
    ,hgncId STRING
    ,clinicalSignificance STRING
    ,clinSigSimple STRING
    ,lastEvaluated STRING
    ,rsId STRING
    ,dbVarId STRING
    ,rcvAccession STRING
    ,phenotypeIds STRING
    ,phenotypeList STRING
    ,origin STRING
    ,originSimple STRING
    ,assembly STRING
    ,chromosomeAccession STRING
    ,chromosome STRING
    ,startPosition INT
    ,endPosition INT
    ,referenceAllele STRING
    ,alternateAllele STRING
    ,cytogenetic STRING
    ,reviewStatus STRING
    ,numberSubmitters STRING
    ,guidelines STRING
    ,testedInGtr STRING
    ,otherIds STRING
    ,submitterCategories STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://<mytestbucket>/clinvar/';

You can confirm that both tables have been created by choosing the eye icon next to a table name in the console. This automatically runs a query akin to SELECT * FROM table LIMIT 10. If it succeeds, then your data has been loaded successfully.

o_athena_genomes_1

Applying the select, aggregate, annotate paradigm

In this section, I walk you through how to use the query pattern described earlier to answer two common questions in genomics. You examine the frequency of different genotypes, which is a distinct combination of all fields in the samplevariants table with the exception of sampleid. These queries can also be found in the aws-blog-athena-genomics GitHub repo under sql/queries.

Population drug response

What small molecules/drugs are most likely to affect a subpopulation of individuals (ancestry, age, etc.) based on their genomic information?

In this query, assume that you have some phenotype data about your population. In this case, also assume that all samples sharing the pattern “NA12” are part of a specific demographic.

In this query, use sampleid as your predicate pushdown. The general steps are:

  1. Filter by the samples in your subpopulation
  2. Aggregate variant frequencies for the subpopulation-of-interest
  3. Join on ClinVar dataset
  4. Filter by variants that have been implicated in drug-response
  5. Order by highest frequency variants

To answer this question, you can craft the following query:

SELECT
    count(*)/cast(numsamples AS DOUBLE) AS genotypefrequency
    ,cv.rsid
    ,cv.phenotypelist
    ,sv.chromosome
    ,sv.startposition
    ,sv.endposition
    ,sv.referenceallele
    ,sv.alternateallele
    ,sv.genotype0
    ,sv.genotype1
FROM demo.samplevariants sv
CROSS JOIN
    (SELECT count(1) AS numsamples
    FROM
        (SELECT DISTINCT sampleid
        FROM demo.samplevariants
        WHERE sampleid LIKE 'NA12%'))
JOIN demo.clinvar cv
ON sv.chromosome = cv.chromosome
    AND sv.startposition = cv.startposition - 1
    AND sv.endposition = cv.endposition
    AND sv.referenceallele = cv.referenceallele
    AND sv.alternateallele = cv.alternateallele
WHERE assembly='GRCh37'
    AND cv.clinicalsignificance LIKE '%response%'
    AND sampleid LIKE 'NA12%'
GROUP BY  sv.chromosome
          ,sv.startposition
          ,sv.endposition
          ,sv.referenceallele
          ,sv.alternateallele
          ,sv.genotype0
          ,sv.genotype1
          ,cv.clinicalsignificance
          ,cv.phenotypelist
          ,cv.rsid
          ,numsamples
ORDER BY  genotypefrequency DESC LIMIT 50

Enter the query into the console and you see something like the following:

o_athena_genomes_2

When you inspect the results, you can quickly see (often in a matter of seconds!) that this population of individuals has a high frequency of variants associated with the metabolism of debrisoquine.

o_athena_genomes_3

Quality control

Are you systematically finding locations in your reference that are called as variation? In other words, what positions in the genome have abnormal levels of variation, suggesting issues in quality of sequencing or errors in the reference? Are any of these variants implicated in clinically? If so, could this be a false positive clinical finding?

In this query, the entire population is needed so the SELECT predicate is not used. The general steps are:

  1. Aggregate variant frequencies for the entire Thousand Genomes population
  2. Join on the ClinVar dataset
  3. Filter by variants that have been implicated in disease
  4. Order by the highest frequency variants

This translates into the following query:

SELECT
    count(*)/cast(numsamples AS DOUBLE) AS genotypefrequency
    ,cv.clinicalsignificance
    ,cv.phenotypelist
    ,sv.chromosome
    ,sv.startposition
    ,sv.endposition
    ,sv.referenceallele
    ,sv.alternateallele
    ,sv.genotype0
    ,sv.genotype1
FROM demo.samplevariants sv
CROSS JOIN
    (SELECT count(1) AS numsamples
    FROM
        (SELECT DISTINCT sampleid
        FROM demo.samplevariants))
    JOIN demo.clinvar cv
    ON sv.chromosome = cv.chromosome
        AND sv.startposition = cv.startposition - 1
        AND sv.endposition = cv.endposition
        AND sv.referenceallele = cv.referenceallele
        AND sv.alternateallele = cv.alternateallele
WHERE assembly='GRCh37'
        AND cv.clinsigsimple='1'
GROUP BY  sv.chromosome ,
          sv.startposition
          ,sv.endposition
          ,sv.referenceallele
          ,sv.alternateallele
          ,sv.genotype0
          ,sv.genotype1
          ,cv.clinicalsignificance
          ,cv.phenotypelist
          ,numsamples
ORDER BY  genotypefrequency DESC LIMIT 50

As you can see in the following screenshot, the highest frequency results have conflicting information, being listed both as potentially causing disease and being benign. In genomics, variants with a higher frequency are less likely to cause disease. Your quick analysis allows you to discount the pathogenic clinical significance annotations of these high genotype frequency variants.

o_athena_genomes_4

 

Summary

I hope you have seen how you can easily integrate datasets of different types and values, and combine them to quickly derive new meaning from your data. Do you have a dataset you want to explore or want to learn more about Amazon Athena? Check out our Getting Started Guide and happy querying!

If you have questions or suggestions, please comment below.


About the author

friedman_90Dr. Aaron Friedman a Healthcare and Life Sciences Partner Solutions Architect at Amazon Web Services. He works with our ISVs and SIs to architect healthcare solutions on AWS, and bring the best possible experience to their customers. His passion is working at the intersection of science, big data, and software. In his spare time, he’s out hiking or learning a new thing to cook.

 

 


Related

Analyzing Data in S3 using Amazon Athena

o_athena_10

Human Longevity, Inc. – Changing Medicine Through Genomics Research

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/human-longevity-inc-changing-medicine-through-genomics-research/

Human Longevity, Inc. (HLI) is at the forefront of genomics research and wants to build the world’s largest database of human genomes along with related phenotype and clinical data, all in support of preventive healthcare. In today’s guest post, Yaron Turpaz,  Bryan Coon, and Ashley Van Zeeland, talk about how they are using AWS to store the massive amount of data that is being generated as part of this effort to revolutionize medicine.

Jeff;


When Human Longevity, Inc. launched in 2013, our founders recognized the challenges that lie ahead. A genome contains all the information needed to build and maintain an organism; in humans, a copy of the entire genome, which contains more than three billion DNA base pairs, is contained in all cells that have a nucleus. Our goal is to sequence one million genomes and deliver that information—along with integrated health records and disease-risk models—to researchers and physicians. They, in turn, can interpret the data to provide targeted, personalized health plans and identify the optimal treatment for cancer and other serious health risks far earlier than has been possible in the past. The intent is to transform medicine by fostering preventive healthcare and risk prevention in place of the traditional “sick care” model, when people wind up seeing their doctors only after symptoms manifest.

Our work in developing and applying large-scale computing and machine learning to genomics research entails the collection, analysis, and storage of immense amounts of data from DNA-sequencing technology provided by companies like Illumina. Raw data from a single genome consumes about 100 gigabytes; that number increases as we align the genomic information with annotation and phenotype sources and analyze it for health insights.

From the beginning, we knew our choice of compute and storage technology would have a direct impact on the success of the company. Using the cloud was clearly the best option. We’re experts in genomics, and don’t want to spend resources building and maintaining an IT infrastructure. We chose to go all in on AWS for the breadth of the platform, the critical scalability we need, and the expertise AWS has developed in big data. We also saw that the pace of innovation at AWS—and its deliberate strategy of keeping costs as low as possible for customers—would be critical in enabling our vision.

Leveraging the Range of AWS Services

 Spectral karyotype analysis / Image courtesy of Human Longevity, Inc.

Spectral karyotype analysis / Image courtesy of Human Longevity, Inc.

Today, we’re using a broad range of AWS services for all kinds of compute and storage tasks. For example, the HLI Knowledgebase leverages a distributed system infrastructure comprised of Amazon S3 storage and a large number of Amazon EC2 nodes. This helps us achieve resource isolation, scalability, speed of provisioning, and near real-time response time for our petabyte-scale database queries and dynamic cohort builder. The flexibility of AWS services makes it possible for our customized Amazon Machine Images and pre-built, BTRFS-partitioned Amazon EBS volumes to achieve turn-up time in seconds instead of minutes. We use Amazon EMR for executing Spark queries against our data lake at the scale we need. AWS Lambda is a fantastic tool for hooking into Amazon S3 events and communicating with apps, allowing us to simply drop in code with the business logic already taken care of. We use Auto Scaling based on demand, and AWS OpsWorks for managing a Docker pipeline.

We also leverage the cost controls provided by Amazon EC2 Spot and Reserved Instance types. When we first started, we used on-demand instances, but the costs started to grow significantly. With Spot and Reserved Instances, we can allocate compute resources based on specific needs and workflows. The flexibility of AWS services enables us to make extensive use of dockerized containers through the resource-management services provided by Apache Mesos. Hundreds of dynamic Amazon EC2 nodes in both our persistent and spot abstraction layers are dynamically adjusted to scale up or down based on usage demand and the latest AWS pricing information. We achieve substantial savings by sharing this dynamically scaled compute cluster with our Knowledgebase service and the internal genomic and oncology computation pipelines. This flexibility gives us the compute power we need while keeping costs down. We estimate these choices have helped us reduce our compute costs by up to 50 percent from the on-demand model.

We’ve also worked with AWS Professional Services to address a particularly hard data-storage challenge. We have genomics data in hundreds of Amazon S3 buckets, many of them in the petabyte range and containing billions of objects. Within these collections are millions of objects that are unused, or used once or twice and never to be used again. It can be overwhelming to sift through these billions of objects in search of one in particular. It presents an additional challenge when trying to identify what files or file types are candidates for the Amazon S3-Infrequent Access storage class. Professional Services helped us with a solution for indexing Amazon S3 objects that saves us time and money.

Moving Faster at Lower Cost
Our decision to use AWS came at the right time, occurring at the inflection point of two significant technologies: gene sequencing and cloud computing. Not long ago, it took a full year and cost about $100 million to sequence a single genome. Today we can sequence a genome in about three days for a few thousand dollars. This dramatic improvement in speed and lower cost, along with rapidly advancing visualization and analytics tools, allows us to collect and analyze vast amounts of data in close to real time. Users can take that data and test a hypothesis on a disease in a matter of days or hours, compared to months or years. That ultimately benefits patients.

Our business includes HLI Health Nucleus, a genomics-powered clinical research program that uses whole-genome sequence analysis, advanced clinical imaging, machine learning, and curated personal health information to deliver the most complete picture of individual health. We believe this will dramatically enhance the practice of medicine as physicians identify, treat, and prevent diseases, allowing their patients to live longer, healthier lives.

Learn More
Learn more about how AWS supports genomics in the cloud, and see how genomics innovator Illumina uses AWS for accelerated, cost-effective gene sequencing.

Yaron Turpaz (Chief Information Officer), Bryan Coon (Head of Enterprise Services), and Ashley Van Zeeland (Chief Technology Officer).

President Obama Talks About AI Risk, Cybersecurity, and More

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2016/10/president_obama_1.html

Interesting interview:

Obama: Traditionally, when we think about security and protecting ourselves, we think in terms of armor or walls. Increasingly, I find myself looking to medicine and thinking about viruses, antibodies. Part of the reason why cybersecurity continues to be so hard is because the threat is not a bunch of tanks rolling at you but a whole bunch of systems that may be vulnerable to a worm getting in there. It means that we’ve got to think differently about our security, make different investments that may not be as sexy but may actually end up being as important as anything.

What I spend a lot of time worrying about are things like pandemics. You can’t build walls in order to prevent the next airborne lethal flu from landing on our shores. Instead, what we need to be able to do is set up systems to create public health systems in all parts of the world, click triggers that tell us when we see something emerging, and make sure we’ve got quick protocols and systems that allow us to make vaccines a lot smarter. So if you take a public health model, and you think about how we can deal with, you know, the problems of cybersecurity, a lot may end up being really helpful in thinking about the AI threats.