Keeping an eye on bee life cycles is a brilliant example of how Raspberry Pi sensors help us understand the world around us, says Rosie Hattersley
The setup featuring an Arduino, RF receiver, USB cable and Raspberry Pi
Getting to design and build things for a living sounds like a dream job, especially if it also involves Raspberry Pi and wildlife. Glyn Hudson has always enjoyed making things and set up a company manufacturing open-source energy monitoring tools shortly after graduating from university. With access to several hives at his keen apiarist parents’ garden in Snowdonia, Glyn set up BeeMonitor using some of the tools he used at work to track the beehives’ inhabitants.
Glyn checking the original BeeMonitor setup
“The aim of the project was to put together a system to monitor the health of a bee colony by monitoring the temperature and humidity inside and outside the hive over multiple years,” explains Glyn. “Bees need all the help and love they can get at the moment and without them pollinating our plants, we’d struggle to grow crops. They maintain a 34°C core brood temperature (±0.5°C) even when the ambient temperature drops below freezing. Maintaining this temperature when a brood is present is a key indicator of colony health.”
Wi-Fi not spot
BeeMonitor has been tracking the hives’ population since 2012 and is one of the earliest examples of a Raspberry Pi project. Glyn built most of the parts for BeeMonitor himself. Open-source software developed for the OpenEnergyMonitor project provides a data-logging and graphing platform that can be viewed online.
BeeMonitor complete with solar panel to power it. The Snowdonia bees produce 12 to 15 kg of honey per year
The hives were too far from the house for Wi-Fi to reach, so Glyn used a low-power RF sensor connected to an Arduino, which was placed inside the hive to take readings. These were received by a Raspberry Pi connected to the internet.
Diagram showing what information BeeMonitor is trying to establish
At first, there were both a DS18B20 temperature sensor and a DHT22 humidity sensor inside the beehive, along with the Arduino (setup info can be found here). Data from these sensors was saved to an SD card, the obvious drawback being that it didn’t provide real-time readings. In his initial setup, Glyn also had to extract and analyse the CSV data himself. “This was very time-consuming but did result in some interesting data,” he says.
Sensor-y overload
Almost as soon as BeeMonitor was running successfully, Glyn realised he wanted to make the data live on the internet. This would enable him to view live beehive data from anywhere and also allow other people to engage with the data.
“This is when Raspberry Pi came into its own,” he says. He also decided to drop the DHT22 humidity sensor. “It used a lot of power and the bees didn’t like it – they kept covering the sensor in wax! Oddly, the bees don’t seem to mind the DS18B20 temperature sensor, presumably since it’s a round metal object compared to the plastic grille of the DHT22,” notes Glyn.
Unlike the humidity sensor, the bees don’t seem to mind the temperature probe
The system has been running for eight years with minimal intervention and is powered by an old car battery and a small solar PV panel. Running costs are negligible: “Raspberry Pi is perfect for getting projects like this up and running quickly and reliably using very little power,” says Glyn. He chose it because of the community behind the hardware. “That was one of Raspberry Pi’s greatest assets and what attracted me to the platform, as well as the competitive price point!” The whole setup cost him about £50.
Glyn tells us we could set up a basic monitor using Raspberry Pi, a DS18B20 temperature sensor, a battery pack, and a solar panel.
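If you want to try that yourself, here is a minimal sketch of reading a DS18B20 over the Raspberry Pi’s 1-Wire interface. It assumes the interface has been enabled (for example via raspi-config) and the sensor is wired with a pull-up resistor; the device path is how the kernel driver normally exposes these sensors.

import glob
import time

def read_temp_c():
    # Each DS18B20 shows up under /sys/bus/w1/devices/ with a 28-* ID
    device_file = glob.glob("/sys/bus/w1/devices/28-*/w1_slave")[0]
    with open(device_file) as f:
        lines = f.readlines()
    # First line ends in YES when the CRC check passes;
    # second line carries t=<temperature in millidegrees Celsius>
    if lines[0].strip().endswith("YES"):
        return int(lines[1].split("t=")[1]) / 1000.0
    return None

while True:
    temp = read_temp_c()
    if temp is not None:
        print(f"Hive temperature: {temp:.1f} C")
    time.sleep(60)  # one reading per minute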
Raspberry Pi devices are often used by scientists, especially in biology to capture and analyse data, and a particularly striking – and sobering – project has made the news this week. Researchers at UMass Amherst have created FluSense, a dictionary-sized piece of equipment comprising a cheap microphone array, a thermal sensor, an Intel Movidius 2 neural computing engine, and a Raspberry Pi. FluSense monitors crowd sounds to forecast outbreaks of viral respiratory disease like seasonal flu; naturally, the headlines about their work have focused on its potential relevance to the COVID-19 pandemic.
Forsad Al Hossain and Tauhidur Rahman with the FluSense device. Image courtesy of the University of Massachusetts Amherst
The device can distinguish coughing from other sounds. When cough data is combined with information about the size of the crowd in a location, it can provide an index predicting how many people are likely to be experiencing flu symptoms.
It was successfully tested in four health clinic waiting rooms, and now, PhD student Forsad Al Hossain and his adviser, assistant professor Tauhidur Rahman, plan to roll FluSense out in other large spaces to capture data on a larger scale and strengthen the device’s capabilities. Privacy concerns are mitigated by heavy encryption, and Al Hossain and Rahman explain that the emphasis is on aggregating data, not identifying sickness in any single patient.
The researchers believe the secret to FluSense’s success lies in how much of the processing work is done locally, via the neural computing engine and Raspberry Pi: “Symptom information is sent wirelessly to the lab for collation, of course, but the heavy lifting is accomplished at the edge.”
Image courtesy of the University of Massachusetts Amherst
FluSense offers a different set of advantages from other tools, such as the extremely popular self-reporting app developed by researchers at King’s College Hospital in London, UK, together with startup Zoe. Approaches like this rely on the public to sign up, and that’s likely to skew the data they gather, because people in some demographic groups are more likely than others to be motivated and able to participate. FluSense can be installed to capture data passively from groups across the entire population. This could be particularly helpful to underprivileged groups who are less likely to have access to healthcare.
Makers, engineers, and scientists across the world are rising to the challenge of tackling COVID-19. One notable initiative is the Montreal General Hospital Foundation’s challenge to quickly design a low-cost, easy to use ventilator which can be built locally to serve patients, with a prize of CAD $200,000 on offer. The winning designs will be made available to download for free.
There is, of course, loads of chatter on the Raspberry Pi forum about the role computing has in beating the virus. We particularly liked this PSA letting you know how to free up some of your unused processing power for those researching treatments.
Screenshot via @deeplocal on Instagram
And to end on a cheering note, we *heart* this project from @deeplocal on Instagram. They’ve created a Raspberry Pi-powered soap dispenser which will play 20 seconds of your favourite song to keep you at the sink and make sure you’re washing your hands for long enough to properly protect yourself.
Note: This blog provides an alternate solution to Visualizing Amazon GuardDuty Findings, in which the authors describe how to build an Amazon Elasticsearch Service-powered Kibana dashboard to ingest and visualize Amazon GuardDuty findings.
Amazon GuardDuty is a managed threat detection service powered by machine learning that can monitor your AWS environment with just a few clicks. GuardDuty can identify threats such as unusual API calls or potentially unauthorized users attempting to access your servers. Many customers also like to visualize their findings in order to generate additional meaningful insights. For example, you might track resources affected by security threats to see how they evolve over time.
In this post, we provide a solution to ingest, process, and visualize your GuardDuty finding logs in a completely serverless fashion. Serverless applications automatically run and scale in response to events you define, rather than requiring you to provision, scale, and manage servers. Our solution covers how to build a pipeline that ingests findings into Amazon Simple Storage Service (Amazon S3), transforms their nested JSON structure into tabular form using Amazon Athena and AWS Glue, and creates visualizations using Amazon QuickSight. We aim to provide both an easy-to-implement and cost-effective solution for consuming and analyzing your GuardDuty findings, and to more generally showcase a repeatable example for processing and visualizing many types of complex JSON logs.
Many customers already maintain centralized logging solutions using Amazon Elasticsearch Service (Amazon ES). If you want to incorporate GuardDuty findings with an existing solution, we recommend referencing this blog post to get started. If you don’t have an existing solution or previous experience with Amazon ES, if you prefer to use serverless technologies, or if you’re familiar with more traditional business intelligence tools, read on!
Before you get started
To follow along with this post, you’ll need to enable GuardDuty in order to start generating findings. See Setting Up Amazon GuardDuty for details if you haven’t already done so. Once enabled, GuardDuty will automatically generate findings as events occur. If you have public-facing compute resources in the same region in which you’ve enabled GuardDuty, you may soon find that they are being scanned quite often. All the more reason to continue reading!
You’ll also need Amazon QuickSight enabled in your account for the visualization sections of this post. You can find instructions in Setting Up Amazon QuickSight.
Architecture from end to end
Figure 1: Complete architecture from findings to visualization
Figure 1 highlights the solution architecture, from finding generation all the way through final visualization. The steps are as follows:
1. Deliver GuardDuty findings to Amazon CloudWatch Events
2. Stream the findings to Amazon S3 with Amazon Kinesis Data Firehose
3. Use AWS Lambda to reorganize the Amazon S3 folder structure
4. Catalog the GuardDuty findings using AWS Glue
5. Unpack nested JSON and configure views with Amazon Athena
6. Build a GuardDuty findings dashboard in Amazon QuickSight
Below, we’ve included an AWS CloudFormation template to launch a complete ingest pipeline (Steps 1-4) so that we can focus this post on the steps dedicated to building the actual visualizations (Steps 5-6). We cover steps 1-4 briefly in the next section to provide context, and we provide links to the pertinent pages in the documentation for those of you interested in building your own pipeline.
Ingest (Steps 1-4): Get Amazon GuardDuty findings into Amazon S3 and AWS Glue Data Catalog
Figure 2: In this section, we’ll cover the services highlighted in blue
Step 1: Deliver GuardDuty findings to Amazon CloudWatch Events
Step 2: Stream the findings to Amazon S3 with Amazon Kinesis Data Firehose
Step 3: Use AWS Lambda to reorganize Amazon S3 folder structure
Kinesis Data Firehose will automatically create a datetime-based file hierarchy to organize the findings as they come in. Due to the variability of the GuardDuty finding types, we recommend reorganizing the file hierarchy with a folder for each finding type, with separate datetime subfolders for each. This will make it easier to target findings that you want to focus on in your visualization. The provided AWS CloudFormation template uses an AWS Lambda function to rewrite the files into the new hierarchy as they are written to S3. You can use the code provided in the template, along with Using AWS Lambda with S3, to trigger your own function that reorganizes the data. Once the Lambda function has run, the S3 bucket structure should look similar to the structure we show in figure 3.
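As a rough illustration of the idea (not the template’s exact code), a Lambda handler along these lines could move each incoming object under a finding-type prefix. It assumes one finding per object, and the key layout is an assumption for this sketch:

import json
import urllib.parse
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Triggered by s3:ObjectCreated events on the Firehose delivery bucket
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        finding = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())
        # e.g. "Recon:EC2/PortProbeUnprotectedPort" -> "recon_ec2_portprobeunprotectedport"
        finding_type = (finding["detail"]["type"]
                        .replace(":", "_").replace("/", "_").replace(".", "_").lower())
        # Keep Firehose's datetime path, nested under the finding type
        new_key = f"{finding_type}/{key}"
        s3.copy_object(Bucket=bucket, Key=new_key,
                       CopySource={"Bucket": bucket, "Key": key})
        s3.delete_object(Bucket=bucket, Key=key)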
Figure 3: Sample S3 bucket structure
Step 4: Catalog the GuardDuty findings using AWS Glue
With the reorganized findings stored in S3, use an AWS Glue crawler to scan and catalog each finding type. The CloudFormation template we provided schedules the crawler to run once a day. You can also run it on demand as needed. To build your own crawler, refer to Cataloging Tables with a Crawler. Assuming GuardDuty has generated findings in your account, you can navigate to the GuardDuty findings database in the AWS Glue Data Catalog. It should look something like figure 4:
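If you’d rather trigger the crawler on demand from a script instead of waiting for the schedule, a couple of boto3 lines will do it; the crawler name here is a placeholder for whatever your template created:

import boto3

glue = boto3.client("glue")
# Placeholder name; use the crawler created by the CloudFormation template
glue.start_crawler(Name="guardduty-findings-crawler")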
Figure 4: List of finding type tables in the AWS Glue Catalog
Before moving on to the next step, take some time to explore the schema structure of the tables. Selecting one of the tables will bring you to a page that looks like what’s shown in figure 5.
Figure 5: Schema information for a single finding table
You should see that most of the columns contain basic information about each finding, but there’s a column named detail that is of type struct. Select it to expand, as shown in figure 6.
Figure 6: The “detail” column expanded
Ah, this is where the interesting information is tucked away! The tables for each finding may differ slightly, but in all cases the detail column will hold the bulk of the information you’ll want to visualize. See GuardDuty Active Finding Types for information on what you should expect to find in the logs for each finding type. In the next step, we’ll focus on unpacking detail to prepare it for visualization!
Process (Step 5): Unpack nested JSON and configure views with Amazon Athena
Figure 7: In this section, we’ll cover the services highlighted in blue
Note: This step picks up where the CloudFormation template finishes
Explore the table structure (again) in the Amazon Athena console
Begin by navigating to Athena from the AWS Management Console. Once there, you should see a drop-down menu with a list of databases. These are the same databases that are available in the AWS Glue Data Catalog. Choose the database with your GuardDuty findings and expand a table.
Figure 8: Expanded table in the Athena console
This should look very similar to the table information you explored in step 4, including the detail struct!
You’ll need a method to unpack the struct in order to effectively visualize the data. There are many methods and tools to approach this problem. One that we recommend (and will show) is to use SQL queries within Athena to construct tabular views. This approach will allow you to push the bulk of the processing work to Athena. It will also allow you to simplify building visualizations when using Amazon QuickSight by providing a more conventional tabular format.
Extract details for use in visualization using SQL
The following SQL statements extract the fields needed from the detail struct of the Recon:EC2/PortProbeUnprotectedPort finding to build the Amazon QuickSight dashboard we showcase in the next section. They also cover most of the operations you’ll need to work with the elements found in GuardDuty findings (such as deeply nested data with lists), and they serve as a good starting point for constructing your own custom queries. In general, you’ll want to traverse the nested layers (i.e. root.detail.service.count) and, using the UNNEST function, create new records for each item in an embedded list that you want to target. See this blog for even more examples of constructing queries on complex JSON data using Amazon Athena.
Simply copy the SQL statements that you want into the Athena query field to build the port_probe_geo and affected_instances views.
Note: If your account has yet to generate Recon:EC2/PortProbeUnprotectedPort findings, you can generate sample findings to follow along.
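One way to generate those sample findings is with a couple of boto3 calls; this sketch assumes a single GuardDuty detector in the region:

import boto3

guardduty = boto3.client("guardduty")
# GuardDuty uses one detector per account per region
detector_id = guardduty.list_detectors()["DetectorIds"][0]
guardduty.create_sample_findings(
    DetectorId=detector_id,
    FindingTypes=["Recon:EC2/PortProbeUnprotectedPort"],
)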
CREATE OR REPLACE VIEW "port_probe_geo" AS
WITH getportdetails AS (
SELECT id, portdetails
FROM by_finding_type
CROSS JOIN UNNEST(detail.service.action.portProbeAction.portProbeDetails) WITH ORDINALITY AS p (portdetails, portdetailsindex)
)
SELECT
root.id AS id,
root.region AS region,
root.time AS time,
root.detail.type AS type,
root.detail.service.count AS count,
portdetails.localportdetails.port AS localport,
portdetails.localportdetails.portname AS localportname,
portdetails.remoteipdetails.geolocation.lon AS longitude,
portdetails.remoteipdetails.geolocation.lat AS latitude,
portdetails.remoteipdetails.country.countryname AS country,
portdetails.remoteipdetails.city.cityname AS city
FROM
by_finding_type as root, getPortDetails
WHERE
root.id = getportdetails.id
CREATE OR REPLACE VIEW "affected_instances" AS
SELECT
max(root.detail.service.count) AS count,
date_parse(root.time,'%Y-%m-%dT%H:%i:%sZ') as time,
root.detail.resource.instancedetails.instanceid
FROM
recon_ec2_portprobeunprotectedport AS root
GROUP BY
root.detail.resource.instancedetails.instanceid,
time
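If you prefer to create the views programmatically rather than pasting them into the console, a sketch using boto3’s Athena client might look like this; the database name, output location, and file names are placeholders:

import boto3

athena = boto3.client("athena")

def run_query(sql):
    # Database and results location are placeholders for your own setup
    execution = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "<your findings database>"},
        ResultConfiguration={"OutputLocation": "s3://<your bucket>/athena-results/"},
    )
    return execution["QueryExecutionId"]

# Assumes the two CREATE VIEW statements above are saved to local files
for path in ("port_probe_geo.sql", "affected_instances.sql"):
    with open(path) as f:
        run_query(f.read())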
Visualize (Step 6): Build a GuardDuty findings dashboard in Amazon QuickSight
Figure 9: In this section we will cover the services highlighted in blue
Now that you’ve created tabular views using Athena, you can jump into Amazon QuickSight from the AWS Management Console and begin visualizing! If you haven’t already done so, enable Amazon QuickSight in your account by following the instructions for Setting Up Amazon QuickSight.
For this example, we’ll leverage the port_probe_geo view to build a geographic visualization and see the locations from which nefarious actors are launching port probes.
Creating an analysis
In the upper left-hand corner of the Amazon QuickSight console select New analysis and then New data set.
Figure 10: Create a new analysis
To utilize the views you built in the previous step, select Athena as the data source. Give your data source a name (in our example, we use “port probe geo”), and select the database that contains the views you created in the previous section. Then select Visualize.
Figure 11: Available data sources in Amazon QuickSight. Be sure to choose Athena!
Figure 12: Select the “port probe geo” view you created in step 5
Viz time!
From the Visual types menu in the bottom left corner, select the globe icon to create a map. Then select the latitude and longitude geospatial coordinates. Choose count (with a max aggregation) for size. Finally, select localportname to break the data down by color.
Figure 13: A visual containing a map of port probe scans in Amazon QuickSight
Voila! A detailed map of your environment’s attackers!
Build out a dashboard
Once you like how everything looks, you can move on to adding more visuals to create a full monitoring dashboard.
To add another visual to the analysis, select Add and then Add visual.
Figure 14: Add another visual using the ‘Add’ option from the Amazon QuickSight menu bar
If the new visual will use the same dataset, then you can immediately start selecting fields to build it. If you want to create a visual from a different data set (our example dashboard below adds the affected_instances view), follow the Creating Data Sets guide to add a new data set. Then return to the current analysis and associate the data set with the analysis by selecting the pencil icon shown below and selecting Add data set.
Figure 15: Adding a new data set to your Amazon QuickSight analysis
Repeat this process until you’ve built out everything you need in your monitoring dashboard. Once it’s completed, you can publish the dashboard by selecting Share and then Publish dashboard.
Figure 16: Publish your dashboard using the “Share” option of the Amazon QuickSight menu
Here’s an example of a dashboard we created using the port_probe_geo and affected_instances views:
Figure 17: An example dashboard created using the “port_probe_geo” and “affected_instances” views
What does something like this cost?
To get an idea of the scale of the cost, we’ve provided a small pricing example (accurate as of the writing of this blog) that assumes 10,000 GuardDuty findings per month with an average payload size of 5KB.
AWS Glue crawler: $0.44 per DPU-hour (2 DPU minimum and 10-minute minimum) = $0.15 per crawler run × 30 crawler runs = $4.50
Total processing cost: $4.53
Oh, the joys of a consumption-based model: Less than five dollars per month for all of that processing!
From here, all that remains are your visualization costs using Amazon QuickSight. This pricing is highly dependent upon your number of users and their respective usage patterns. See the Amazon QuickSight pricing page for more specific details.
Summary
In this post, we demonstrated how you can ingest your GuardDuty findings into S3, process them with AWS Glue and Amazon Athena, and visualize with Amazon QuickSight. All serverless! Each portion of what we showed can be used in tandem or on its own for this or many other data sets. Go launch the template and get started monitoring your AWS environment!
Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.
Many companies across the globe use Amazon DynamoDB to store and query historical user-interaction data. DynamoDB is a fast NoSQL database used by applications that need consistent, single-digit millisecond latency.
Often, customers want to turn their valuable data in DynamoDB into insights by analyzing a copy of their table stored in Amazon S3. Doing this separates their analytical queries from their low-latency critical paths. This data can be the primary source for understanding customers’ past behavior, predicting future behavior, and generating downstream business value. Customers often turn to DynamoDB because of its great scalability and high availability. After a successful launch, many customers want to use the data in DynamoDB to predict future behaviors or provide personalized recommendations.
DynamoDB is a good fit for low-latency reads and writes, but it’s not practical to scan all the data in a DynamoDB table to train a model. In this post, I demonstrate how you can use DynamoDB table data copied to Amazon S3 by AWS Data Pipeline to predict customer behavior. I also demonstrate how you can use this data to provide personalized recommendations for customers using Amazon SageMaker. You can also run ad hoc queries against the data using Amazon Athena. DynamoDB recently released on-demand backups to create full table backups with no performance impact. However, they’re not suitable for our purposes in this post, so I chose AWS Data Pipeline instead to create managed backups that are accessible from other services.
To do this, I describe how to read the DynamoDB backup file format in Data Pipeline. I also describe how to convert the objects in S3 to a CSV format that Amazon SageMaker can read. In addition, I show how to schedule regular exports and transformations using Data Pipeline. The sample data used in this post is from Bank Marketing Data Set of UCI.
The solution that I describe provides the following benefits:
Separates analytical queries from production traffic on your DynamoDB table, preserving your DynamoDB read capacity units (RCUs) for important production requests
Automatically updates your model to get real-time predictions
Optimizes for performance (so it doesn’t compete with DynamoDB RCUs after the export) and for cost (using data you already have)
Makes it easier for developers of all skill levels to use Amazon SageMaker
All code and data set in this post are available in this .zip file.
Solution architecture
The following diagram shows the overall architecture of the solution.
The steps that data follows through the architecture are as follows:
Data Pipeline regularly copies the full contents of a DynamoDB table as JSON into an S3 bucket.
Exported JSON files are converted to comma-separated value (CSV) format to use as a data source for Amazon SageMaker.
Amazon SageMaker renews the model artifact and updates the endpoint.
The converted CSV is available for ad hoc queries with Amazon Athena.
Data Pipeline controls this flow and repeats the cycle based on the schedule defined by customer requirements.
Building the auto-updating model
This section discusses details about how to read the DynamoDB exported data in Data Pipeline and build automated workflows for real-time prediction with a regularly updated model.
Find the automation_script.sh file and edit it for your environment. For example, you need to replace 's3://<your bucket>/<datasource path>/' with your own S3 path to the data source for Amazon SageMaker. In the script, any text enclosed by angle brackets (< and >) should be replaced with your own values.
Upload the json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar file to your S3 path so that the ADD jar command in Apache Hive can refer to it.
For this solution, import the banking.csv data into a DynamoDB table.
Export a DynamoDB table
To export the DynamoDB table to S3, open the Data Pipeline console and choose the Export DynamoDB table to S3 template. In this template, Data Pipeline creates an Amazon EMR cluster and performs an export in the EMRActivity activity. Set proper intervals for backups according to your business requirements.
One core node (m3.xlarge) provides the default capacity for the EMR cluster and should be suitable for the solution in this post. Leave the resize-before-running option enabled in the TableBackupActivity activity so that Data Pipeline can scale the cluster to match the table size. The process of converting to CSV format and renewing models happens in this EMR cluster.
For a more in-depth look at how to export data from DynamoDB, see Export Data from DynamoDB in the Data Pipeline documentation.
Add the script to an existing pipeline
After you export your DynamoDB table, you add an additional EMR step to EMRActivity by following these steps:
Open the Data Pipeline console and choose the ID for the pipeline that you want to add the script to.
For Actions, choose Edit.
In the editing console, choose the Activities category and add an EMR step using the custom script downloaded in the previous section, as shown below.
Paste the following command into the new step after the data upload step:
The element #{output.directoryPath} references the S3 path where the data pipeline exports DynamoDB data as JSON. The path should be passed to the script as an argument.
The bash script has two goals: converting data formats and renewing the Amazon SageMaker model. Subsequent sections discuss the contents of the automation script.
Automation script: Convert JSON data to CSV with Hive
We use Apache Hive to transform the data into a new format. The Hive QL script to create an external table and transform the data is included in the custom script that you added to the Data Pipeline definition.
When you run the Hive scripts, do so with the -e option. Also, define the Hive table with the 'org.openx.data.jsonserde.JsonSerDe' row format to parse and read JSON format. The SQL creates a Hive EXTERNAL table, and it reads the DynamoDB backup data on the S3 path passed to it by Data Pipeline.
Note: You should create the table with the “EXTERNAL” keyword to avoid the backup data being accidentally deleted from S3 if you drop the table.
The full conversion script is included in the .zip file linked above; add your own bucket name and data source path in the indicated areas.
After creating an external table, you need to read the data. You then use the INSERT OVERWRITE DIRECTORY … SELECT command to write CSV data to the S3 path that you designated as the data source for Amazon SageMaker.
Depending on your requirements, you can drop or transform columns in the SELECT clause in this step to optimize the data for analysis. For example, you might remove columns that have unpredictable correlations with the target value, because keeping the wrong columns might expose your model to overfitting during training. In this post, the customer_id column is removed. Overfitting can weaken your predictions; more information can be found in the topic Model Fit: Underfitting vs. Overfitting in the Amazon ML documentation.
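Putting those pieces together, a stripped-down sketch of such a conversion step might look like the following. This is illustrative only, not the actual automation script: the table definition, column list, and S3 paths are placeholders, and a real DynamoDB backup needs a table schema matching its exported JSON structure.

#!/usr/bin/env python3
# Illustrative sketch of a JSON-to-CSV conversion step run on the EMR master node.
import subprocess
import sys

backup_path = sys.argv[1]  # the #{output.directoryPath} value passed in by Data Pipeline

hql = f"""
ADD JAR s3://<your bucket>/json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar;

CREATE EXTERNAL TABLE IF NOT EXISTS dynamodb_backup (age string, job string, y string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '{backup_path}';

INSERT OVERWRITE DIRECTORY 's3://<your bucket>/<datasource path>/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT age, job, y FROM dynamodb_backup;
"""

# Run Hive with the -e option, as described above
subprocess.run(["hive", "-e", hql], check=True)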
Automation script: Renew the Amazon SageMaker model
After the CSV data has been replaced and is ready to use, create a new model artifact for Amazon SageMaker with the updated dataset on S3. Renewing the model artifact requires creating a new training job. Training jobs can be run using the AWS SDKs (for example, boto3), the Amazon SageMaker Python SDK (installable with the “pip install sagemaker” command), or the AWS CLI for Amazon SageMaker, which is what this post uses.
In addition, consider how to renew your existing model smoothly and without service impact, because your model is called by applications in real time. To do this, create a new endpoint configuration first, and then update the current endpoint with the configuration you just created.
#!/bin/bash
## Define variable
REGION=$2
DTTIME=`date +%Y-%m-%d-%H-%M-%S`
ROLE="<your AmazonSageMaker-ExecutionRole>"
# Select containers image based on region.
case "$REGION" in
"us-west-2" )
IMAGE="174872318107.dkr.ecr.us-west-2.amazonaws.com/linear-learner:latest"
;;
"us-east-1" )
IMAGE="382416733822.dkr.ecr.us-east-1.amazonaws.com/linear-learner:latest"
;;
"us-east-2" )
IMAGE="404615174143.dkr.ecr.us-east-2.amazonaws.com/linear-learner:latest"
;;
"eu-west-1" )
IMAGE="438346466558.dkr.ecr.eu-west-1.amazonaws.com/linear-learner:latest"
;;
*)
echo "Invalid Region Name"
exit 1 ;
esac
# Start training job and creating model artifact
TRAINING_JOB_NAME=TRAIN-${DTTIME}
S3OUTPUT="s3://<your bucket name>/model/"
INSTANCETYPE="ml.m4.xlarge"
INSTANCECOUNT=1
VOLUMESIZE=5
aws sagemaker create-training-job --training-job-name ${TRAINING_JOB_NAME} --region ${REGION} --algorithm-specification TrainingImage=${IMAGE},TrainingInputMode=File --role-arn ${ROLE} --input-data-config '[{ "ChannelName": "train", "DataSource": { "S3DataSource": { "S3DataType": "S3Prefix", "S3Uri": "s3://<your bucket name>/<datasource path>/", "S3DataDistributionType": "FullyReplicated" } }, "ContentType": "text/csv", "CompressionType": "None" , "RecordWrapperType": "None" }]' --output-data-config S3OutputPath=${S3OUTPUT} --resource-config InstanceType=${INSTANCETYPE},InstanceCount=${INSTANCECOUNT},VolumeSizeInGB=${VOLUMESIZE} --stopping-condition MaxRuntimeInSeconds=120 --hyper-parameters feature_dim=20,predictor_type=binary_classifier
# Wait until job completed
aws sagemaker wait training-job-completed-or-stopped --training-job-name ${TRAINING_JOB_NAME} --region ${REGION}
# Get newly created model artifact and create model
MODELARTIFACT=`aws sagemaker describe-training-job --training-job-name ${TRAINING_JOB_NAME} --region ${REGION} --query 'ModelArtifacts.S3ModelArtifacts' --output text `
MODELNAME=MODEL-${DTTIME}
aws sagemaker create-model --region ${REGION} --model-name ${MODELNAME} --primary-container Image=${IMAGE},ModelDataUrl=${MODELARTIFACT} --execution-role-arn ${ROLE}
# create a new endpoint configuration
CONFIGNAME=CONFIG-${DTTIME}
aws sagemaker create-endpoint-config --region ${REGION} --endpoint-config-name ${CONFIGNAME} --production-variants VariantName=Users,ModelName=${MODELNAME},InitialInstanceCount=1,InstanceType=ml.m4.xlarge
# create or update the endpoint
STATUS=`aws sagemaker describe-endpoint --endpoint-name ServiceEndpoint --query 'EndpointStatus' --output text --region ${REGION} `
if [[ $STATUS != "InService" ]] ;
then
aws sagemaker create-endpoint --endpoint-name ServiceEndpoint --endpoint-config-name ${CONFIGNAME} --region ${REGION}
else
aws sagemaker update-endpoint --endpoint-name ServiceEndpoint --endpoint-config-name ${CONFIGNAME} --region ${REGION}
fi
Grant permission
Before you execute the script, you must grant proper permission to Data Pipeline. Data Pipeline uses the DataPipelineDefaultResourceRole role by default. I added a policy to DataPipelineDefaultResourceRole that allows Data Pipeline to create, delete, and update the Amazon SageMaker model and data source used in the script.
After you deploy a model into production using Amazon SageMaker hosting services, your client applications use this API to get inferences from the model hosted at the specified endpoint. This approach is useful for interactive web, mobile, or desktop applications.
Following, I provide a simple Python code example that queries the Amazon SageMaker endpoint by its name (“ServiceEndpoint”) and then uses the response for real-time prediction.
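A minimal sketch along those lines, using the boto3 runtime client, follows; the feature values in the payload are placeholders, and must match the column order and count (feature_dim=20) used during training:

import boto3

# Runtime client for invoking hosted SageMaker endpoints
runtime = boto3.client("sagemaker-runtime")

# One row of comma-separated features, in the same order as the training CSV
payload = "56,1,0,0,1,0,0,1,0,1,0,0,0,1,0,0,0,1,0,0"

response = runtime.invoke_endpoint(
    EndpointName="ServiceEndpoint",
    ContentType="text/csv",
    Body=payload,
)
print(response["Body"].read().decode("utf-8"))
# Linear learner returns JSON along the lines of:
# {"predictions": [{"score": 0.02, "predicted_label": 0}]}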
Data Pipeline exports DynamoDB table data into S3. The original JSON data should be kept to recover the table in the rare event that this is needed. Data Pipeline then converts JSON to CSV so that Amazon SageMaker can read the data. Note: You should select only meaningful attributes when you convert to CSV. For example, if you judge that the “campaign” attribute is not correlated, you can eliminate this attribute from the CSV.
Train the Amazon SageMaker model with the new data source.
When a new customer comes to your site, you can judge how likely it is for this customer to subscribe to your new product based on “predictedScores” provided by Amazon SageMaker.
If the new user subscribes to your new product, your application must update the attribute “y” to the value 1 (for yes). This updated data is provided for the next model renewal as a new data source, and it serves to improve the accuracy of your predictions. With each new entry, your application can become smarter and deliver better predictions.
Running ad hoc queries using Amazon Athena
Amazon Athena is a serverless query service that makes it easy to analyze large amounts of data stored in Amazon S3 using standard SQL. Athena is useful for examining data and collecting statistics or informative summaries about data. You can also use the powerful analytic functions of Presto, as described in the topic Aggregate Functions of Presto in the Presto documentation.
With the Data Pipeline scheduled activity, recent CSV data is always located in S3 so that you can run ad hoc queries against the data using Amazon Athena. I show this with example SQL statements following. For an in-depth description of this process, see the post Interactive SQL Queries for Data in Amazon S3 on the AWS News Blog.
Creating an Amazon Athena table and running it
You can simply create an EXTERNAL table for the CSV data on S3 in the Amazon Athena Management Console.
=== Table Creation ===
CREATE EXTERNAL TABLE datasource (
    age int,
    job string,
    marital string,
    education string,
    default string,
    housing string,
    loan string,
    contact string,
    month string,
    day_of_week string,
    duration int,
    campaign int,
    pdays int,
    previous int,
    poutcome string,
    emp_var_rate double,
    cons_price_idx double,
    cons_conf_idx double,
    euribor3m double,
    nr_employed double,
    y int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' ESCAPED BY '\\' LINES TERMINATED BY '\n'
LOCATION 's3://<your bucket name>/<datasource path>/';
The following query calculates the correlation coefficient between the target attribute and other attributes using Amazon Athena.
=== Sample Query ===
SELECT corr(age,y) AS correlation_age_and_target,
corr(duration,y) AS correlation_duration_and_target,
corr(campaign,y) AS correlation_campaign_and_target,
corr(contact,y) AS correlation_contact_and_target
FROM ( SELECT age , duration , campaign , y ,
CASE WHEN contact = 'telephone' THEN 1 ELSE 0 END AS contact
FROM datasource
) datasource ;
Conclusion
In this post, I introduce an example of how to analyze DynamoDB data by using a copy of the table in Amazon S3, which preserves the read capacity of your DynamoDB table. You can then use the data as a new data source to train an Amazon SageMaker model for accurate real-time prediction. In addition, you can run ad hoc queries against the data on S3 using Amazon Athena. I also present how to automate these procedures by using Data Pipeline.
You can adapt this example to the specific use case at hand, and hopefully this post helps you accelerate your development. You can find more examples and use cases for Amazon SageMaker in the video AWS 2017: Introducing Amazon SageMaker on the AWS website.
Yong Seong Lee is a Cloud Support Engineer for AWS Big Data Services. He is interested in every technology related to data/databases and helping customers who have difficulties in using AWS services. His motto is “Enjoy life, be curious and have maximum experience.”
In the wake of the Cambridge Analytica scandal, news articles and commentators have focused on what Facebook knows about us. A lot, it turns out. It collects data from our posts, our likes, our photos, things we type and delete without posting, and things we do while not on Facebook and even when we’re offline. It buys data about us from others. And it can infer even more: our sexual orientation, political beliefs, relationship status, drug use, and other personality traits — even if we didn’t take the personality test that Cambridge Analytica developed.
But for every article about Facebook’s creepy stalker behavior, thousands of other companies are breathing a collective sigh of relief that it’s Facebook and not them in the spotlight. Because while Facebook is one of the biggest players in this space, there are thousands of other companies that spy on and manipulate us for profit.
Harvard Business School professor Shoshana Zuboff calls it “surveillance capitalism.” And as creepy as Facebook is turning out to be, the entire industry is far creepier. It has existed in secret far too long, and it’s up to lawmakers to force these companies into the public spotlight, where we can all decide if this is how we want society to operate and — if not — what to do about it.
There are 2,500 to 4,000 data brokers in the United States whose business is buying and selling our personal data. Last year, Equifax was in the news when hackers stole personal information on 150 million people, including Social Security numbers, birth dates, addresses, and driver’s license numbers.
You certainly didn’t give it permission to collect any of that information. Equifax is one of those thousands of data brokers, most of which you’ve never heard of, that sell your personal information without your knowledge or consent to pretty much anyone who will pay for it.
Surveillance capitalism takes this one step further. Companies like Facebook and Google offer you free services in exchange for your data. Google’s surveillance isn’t in the news, but it’s startlingly intimate. We never lie to our search engines. Our interests and curiosities, hopes and fears, desires and sexual proclivities, are all collected and saved. Add to that the websites we visit that Google tracks through its advertising network, our Gmail accounts, our movements via Google Maps, and what it can collect from our smartphones.
That phone is probably the most intimate surveillance device ever invented. It tracks our location continuously, so it knows where we live, where we work, and where we spend our time. It’s the first and last thing we check in a day, so it knows when we wake up and when we go to sleep. We all have one, so it knows who we sleep with. Uber used just some of that information to detect one-night stands; your smartphone provider and any app you allow to collect location data knows a lot more.
Surveillance capitalism drives much of the internet. It’s behind most of the “free” services, and many of the paid ones as well. Its goal is psychological manipulation, in the form of personalized advertising to persuade you to buy something or do something, like vote for a candidate. And while the individualized profile-driven manipulation exposed by Cambridge Analytica feels abhorrent, it’s really no different from what every company wants in the end. This is why all your personal information is collected, and this is why it is so valuable. Companies that can understand it can use it against you.
None of this is new. The media has been reporting on surveillance capitalism for years. In 2015, I wrote a book about it. Back in 2010, the Wall Street Journal published an award-winning two-year series about how people are tracked both online and offline, titled “What They Know.”
Surveillance capitalism is deeply embedded in our increasingly computerized society, and if the extent of it came to light there would be broad demands for limits and regulation. But because this industry can largely operate in secret, only occasionally exposed after a data breach or investigative report, we remain mostly ignorant of its reach.
This might change soon. In 2016, the European Union passed the comprehensive General Data Protection Regulation, or GDPR. The details of the law are far too complex to explain here, but some of the things it mandates are that personal data of EU citizens can only be collected and saved for “specific, explicit, and legitimate purposes,” and only with explicit consent of the user. Consent can’t be buried in the terms and conditions, nor can it be assumed unless the user opts in. This law will take effect in May, and companies worldwide are bracing for its enforcement.
Because pretty much all surveillance capitalism companies collect data on Europeans, this will expose the industry like nothing else. Here’s just one example. In preparation for this law, PayPal quietly published a list of over 600 companies it might share your personal data with. What will it be like when every company has to publish this sort of information, and explicitly explain how it’s using your personal data? We’re about to find out.
In the wake of this scandal, even Mark Zuckerberg said that his industry probably should be regulated, although he’s certainly not wishing for the sorts of comprehensive regulation the GDPR is bringing to Europe.
He’s right. Surveillance capitalism has operated without constraints for far too long. And advances in both big data analysis and artificial intelligence will make tomorrow’s applications far creepier than today’s. Regulation is the only answer.
The first step to any regulation is transparency. Who has our data? Is it accurate? What are they doing with it? Who are they selling it to? How are they securing it? Can we delete it? I don’t see any hope of Congress passing a GDPR-like data protection law anytime soon, but it’s not too far-fetched to demand laws requiring these companies to be more transparent in what they’re doing.
One of the responses to the Cambridge Analytica scandal is that people are deleting their Facebook accounts. It’s hard to do right, and doesn’t do anything about the data that Facebook collects about people who don’t use Facebook. But it’s a start. The market can put pressure on these companies to reduce their spying on us, but it can only do that if we force the industry out of its secret shadows.
Mid-March in the USA means millions of people watching, and betting on, college basketball (I live here but I didn’t make the rules). As the NCAA college championship continues, I wanted to briefly highlight the work of Wesley Pasfield, one of our Professional Services Machine Learning Specialists. Wesley was able to take data from kenpom.com and College Basketball Reference to build a model predicting the outcome of March Madness using the Amazon SageMaker built-in XGBoost algorithm.
Wesley walks us through grabbing the data, performing an exploratory data analysis (EDA in the data science lingo), reshaping the data for the XGBoost algorithm, using the SageMaker SDK to create a training job for two different models, and finally creating a SageMaker inference endpoint for serving predictions at https://cbbpredictions.com/. Check out part one of the post and part two.
Pretty cool, right? Why not open the notebook and give the XGBoost algorithm a try? Just know that there are a few caveats to the predictions, so don’t go making your champion prediction just yet!
Applying technology to healthcare data has the potential to produce many exciting and important outcomes. The analysis produced from healthcare data can empower clinicians to improve the health of individuals and populations by enabling them to make better decisions that enhance the care they provide.
The Observational Health Data Sciences and Informatics (OHDSI, pronounced “Odyssey”) program and community is working toward this goal by producing data standards and open-source solutions to store and analyze observational health data. Using the OHDSI tools, you can visualize the health of your entire population. You can build cohorts of patients, analyze incidence rates for various conditions, and estimate the effect of treatments on patients with certain conditions. You can also model health outcome predictions using machine learning algorithms.
One of the challenges often faced when working with big data tools is the expense of the infrastructure required to run them. Another challenge is the learning curve to implement and begin using these tools. Amazon Web Services has enabled us to address many of the classic IT challenges by making enterprise class infrastructure and technology available in an affordable, elastic, and automated way. This blog post demonstrates how to combine some of the OHDSI projects (Atlas, Achilles, WebAPI, and the OMOP Common Data Model) with AWS technologies. By doing so, you can quickly and inexpensively implement a health data science and informatics environment.
Shown following is just one example of the population health analysis that is possible with the OHDSI tools. This visualization shows the prevalence of various drugs within the given population of people. This information helps researchers and clinicians discover trends and make better informed decisions about patient health.
OHDSI application architecture on AWS
Before deploying an application on AWS that transmits, processes, or stores protected health information (PHI) or personally identifiable information (PII), address your organization’s compliance concerns. Make sure that you have worked with your internal compliance and legal team to ensure compliance with the laws and regulations that govern your organization. To understand how you can use AWS services as a part of your overall compliance program, see the AWS HIPAA Compliance whitepaper. With that said, we paid careful attention to the HIPAA control set during the design of this solution.
This blog post presents a complete OHDSI application environment, including a data warehouse with sample data. It has the following features:
Following, you can see a block diagram of how the OHDSI tools map to the services provided by AWS.
Atlas is the web application that researchers interact with to perform analysis. Atlas interacts with the underlying databases through a web services application named WebAPI. In this example, both Atlas and WebAPI are deployed and managed by AWS Elastic Beanstalk. Elastic Beanstalk is an easy-to-use service for deploying and scaling web applications. Simply upload the Atlas and WebAPI code and Elastic Beanstalk automatically handles the deployment. It covers everything from capacity provisioning, load balancing, autoscaling, and high availability, to application health monitoring. Using a feature of Elastic Beanstalk called ebextensions, the Atlas and WebAPI servers are customized to use an encrypted storage volume for the middleware application logs.
Atlas stores the state of the various patient cohorts that are analyzed in a dedicated database separate from your observational health data. This database is provided by Amazon Aurora with PostgreSQL compatibility.
Amazon Aurora is a relational database built for the cloud that combines the performance and availability of high-end commercial databases with the simplicity and cost-effectiveness of open-source databases. It provides cost-efficient and resizable capacity while automating time-consuming administration tasks such as hardware provisioning, database setup, patching, and backups. It is configured for high availability and uses encryption at rest for the database and backups, and encryption in flight for the JDBC connections.
All of your observational health data is stored inside the OHDSI Observational Medical Outcomes Partnership Common Data Model (OMOP CDM). This model also stores useful vocabulary tables that help to translate values from various data sources (like EHR systems and claims data).
The OMOP CDM schema is deployed onto Amazon Redshift. Amazon Redshift is a fast, fully managed data warehouse that allows you to run complex analytic queries against petabytes of structured data, using sophisticated query optimization, columnar storage on high-performance local disks, and massively parallel query execution. You can also resize an Amazon Redshift cluster as your requirements for it change.
The solution in this blog post automatically loads de-identified sample data of 1,000 people from the CMS 2008–2010 Data Entrepreneurs’ Synthetic Public Use File (DE-SynPUF). The data has helpful formatting from LTS Computing LLC. Vocabulary data from the OHDSI Athena project is also loaded into the OMOP CDM, and a results set is computed by OHDSI Achilles.
Following is a detailed technical diagram showing the configuration of the architecture to be deployed.
Deploying OHDSI on AWS
Everything just described is automatically deployed by using an AWS CloudFormation template. Using this template, you can quickly get started with the OHDSI project. The CloudFormation templates for this deployment as well as all of the supporting scripts and source code can be found in the AWS Labs GitHub repo.
From your AWS account, open the CloudFormation Management Console and choose Create Stack. From there, copy and paste the following URL in the Specify an Amazon S3 template URL box, and choose Next.
On the next screen, you provide a Stack Name (this can be anything you like) and a few other parameters for your OHDSI environment.
You use the DatabasePassword parameter to set the password for the master user account of the Amazon Redshift and Aurora databases.
You use the EBEndpoint name to generate a unique URL for Atlas to access the OHDSI environment. It is http://EBEndpoint.AWS-Region.elasticbeanstalk.com, where EBEndpoint.AWS-Region indicates the Elastic Beanstalk endpoint and AWS Region. You can configure this URL through Elastic Beanstalk if you want to change it in the future.
You use the KPair option to choose one of your existing Amazon EC2 key pairs to use with the instances that Elastic Beanstalk deploys. By doing this, you can gain administrative access to these instances in the future if you need to. If you don’t already have an Amazon EC2 key pair, you can generate one for free. You do this by going to the Key Pairs section of the EC2 console and choosing Generate Key Pair.
Finally, you use the UserIPRange parameter to specify a CIDR IP address range from which to access your OHDSI environment. By default, your OHDSI environment is accessible over the public internet. Use UserIPRange to limit access over the Internet to a single IP address or a range of IP addresses that represent users you want to have access. Through additional configuration, you can also make your OHDSI environment completely private and accessible only through a VPN or AWS Direct Connect private circuit.
When you’ve provided all Parameters, choose Next.
On the next screen, you can provide some other optional information like tags at your discretion, or just choose Next.
On the next screen, you can review what will be deployed. At the bottom of the screen, there is a check box for you to acknowledge that AWS CloudFormation might create IAM resources with custom names. This is correct; the template being deployed creates four custom roles that give permission for the AWS services involved to communicate with each other. Details of these permissions are inside the CloudFormation template referenced in the URL given in the first step. Check the box acknowledging this and choose Next.
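If you’d rather script the deployment than click through the console, a boto3 sketch like the following could launch the same stack; the stack name, template URL, and parameter values are placeholders for your own:

import boto3

cfn = boto3.client("cloudformation")

cfn.create_stack(
    StackName="ohdsi-environment",
    TemplateURL="https://s3.amazonaws.com/<template bucket>/<template file>",
    Parameters=[
        {"ParameterKey": "DatabasePassword", "ParameterValue": "<master password>"},
        {"ParameterKey": "EBEndpoint", "ParameterValue": "<unique endpoint name>"},
        {"ParameterKey": "KPair", "ParameterValue": "<ec2 key pair name>"},
        {"ParameterKey": "UserIPRange", "ParameterValue": "203.0.113.0/24"},
    ],
    # Required because the template creates IAM roles with custom names
    Capabilities=["CAPABILITY_NAMED_IAM"],
)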
You can watch as CloudFormation builds out your OHDSI architecture. A CloudFormation deployment is called a stack. The parent stack creates two child stacks, one containing the VPC and IAM roles and another created by Elastic Beanstalk with the Atlas and WebAPI servers. When all three stacks have reached the green CREATE_COMPLETE status, as shown in the screenshot following, then the OHDSI architecture has been deployed.
There is still some work going on behind the scenes, though. To watch the progress, browse to the Amazon Redshift section of your AWS Management Console and choose the Amazon Redshift cluster that was created for your OHDSI architecture. After you do so, you can observe the Loads and Queries tabs.
First, on the Loads tab, you can see the CMS De-SynPUF sample data and Athena vocabulary data being loaded into the OMOP Common Data Model. After you see the VOCABULARY table reach the COMPLETED status (as shown following), all of the sample and vocabulary data has been loaded.
After the data loads, the Achilles computation starts. On the Queries tab, you can watch Achilles running queries against your database to build out the Results schema. Achilles runs a large number of queries, and the entire process can take quite some time (about 20 minutes for the sample data we’ve loaded). Eventually, no new queries show up in the Queries tab, which shows that the Achilles computation is completed. The entire process from the time you executed the CloudFormation template until the Achilles computation is completed usually takes about an hour and 15 minutes.
At this point, you can browse to the Elastic Beanstalk section of the AWS Management Console. There, you can choose the OHDSI Application and Environment (green box) that was deployed by the CloudFormation template. At the top of the dashboard, as shown following, you see a link to a URL. This URL matches the name you provided in the EBEndpoint parameter of the CloudFormation template. Choose this URL, and you can start using Atlas to explore the CMS DE-SynPUF sample data!
Cost of deploying this environment
It used to be common to see healthcare data analytics environments deployed in an on-premises data center with expensive data warehouse appliances and virtualized environments. The cloud era has democratized the availability of the infrastructure required to do this type of data analysis, so that now it is within reach of even small organizations. This environment can expand to analyze petabyte-scale health data, and you only pay for what you need. See an estimated breakdown of the monthly cost components for this environment as deployed on the AWS Solution Calculator.
It’s also worth noting that this environment does not have to be run all of the time. If you are only performing analyses periodically, you can terminate the environment when you are finished and restore it from the database backups when you want to continue working. This would reduce the cost of operation even further.
Summary
Now that you have a fully functional OHDSI environment with sample data, you can use this to explore and learn the toolset and its capabilities. After learning with the sample data, you can begin gaining insights by analyzing your own organization’s health data. You can do this using an extract, transform, load (ETL) process from one or more of your health data sources.
James Wiggins is a senior healthcare solutions architect at AWS. He is passionate about using technology to help organizations positively impact world health. He also loves spending time with his wife and three children.
Data that describe processes in a spatial context are everywhere in our day-to-day lives and they dominate big data problems. Map data, for instance, whether describing networks of roads or remote sensing data from satellites, get us where we need to go. Atmospheric data from simulations and sensors underlie our weather forecasts and climate models. Devices and sensors with GPS can provide a spatial context to nearly all mobile data.
In this post, we introduce the WIND toolkit, a huge (500 TB) open weather model dataset that’s available to the world on Amazon’s cloud services. We walk through how to access this data and some of the open-source software developed to make it easily accessible. Our solution considers a subset of geospatial data that exists on a grid (raster) and explores ways to provide access to large-scale raster data from weather models. The solution uses foundational AWS services and the Hierarchical Data Format (HDF), a well-adopted format for scientific data.
The approach developed here can be extended to any data that fit in an HDF5 file, which can describe sparse and dense vectors and matrices of arbitrary dimensions. This format is already popular within the physical sciences for both experimental and simulation data. We discuss solutions to gridded data storage for a massive dataset of public weather model outputs called the Wind Integration National Dataset (WIND) toolkit. We also highlight strategies that are general to other large geospatial data management problems.
Wind Integration National Dataset
As variable renewable power penetration levels increase in power systems worldwide, the importance of renewable integration studies to ensure continued economic and reliable operation of the power grid is also increasing. The WIND toolkit is the largest freely available grid integration dataset to date.
The WIND toolkit was developed by 3TIER by Vaisala. They were under a subcontract to the National Renewable Energy Laboratory (NREL) to support studies on integration of wind energy into the existing US grid. NREL is a part of a network of national laboratories for the US Department of Energy and has a mission to advance the science and engineering of energy efficiency, sustainable transportation, and renewable power technologies.
The toolkit has been used by consultants, research groups, and universities worldwide to support grid integration studies. Less traditional uses also include resource assessments for wind plants (such as those powering Amazon data centers), and studying the effects of weather on California condor migrations in the Baja peninsula.
The diversity of applications highlights the value of accessible, open public data. Yet, there’s a catch: the dataset is huge. The WIND toolkit provides simulated atmospheric (weather) data at a two-km spatial resolution and five-minute temporal resolution at multiple heights for seven years. The entire dataset is half a petabyte (500 TB) in size and is stored in the NREL High Performance Computing data center in Golden, Colorado. Making this dataset publicly available easily and in a cost-effective manner is a major challenge.
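As a sanity check on that headline figure, a back-of-the-envelope estimate reproduces it. The grid dimensions below are assumptions (roughly 1,600 × 3,000 points at 2-km resolution over the continental US), not values taken from the dataset documentation:

```python
# Rough size estimate for the full WIND toolkit; grid dimensions are assumed.
timesteps = 7 * 365 * 24 * 12       # 5-minute resolution over seven years
grid_points = 1600 * 3000           # assumed ~2-km grid over the continental US
variables = 37                      # the "30+" atmospheric variables
bytes_per_value = 4                 # 32-bit floats, before compression

total_bytes = timesteps * grid_points * variables * bytes_per_value
print(f"~{total_bytes / 1e12:.0f} TB")  # prints ~523 TB, about half a petabyte
```

The same arithmetic also explains the 50-TB reduced dataset described later: down-sampling from 5-minute to 60-minute resolution divides the time dimension, and thus the total size, by 12.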
As other laboratories and public institutions work to release their data to the world, they may face similar challenges to those that we experienced. Some prior, well-intentioned efforts to release huge datasets as-is have resulted in data resources that are technically available but fundamentally unusable. They may be stored in an unintuitive format or indexed and organized to support only a subset of potential uses. Downloading hundreds of terabytes of data is often impractical. Most users don’t have access to a big data cluster (or super computer) to slice and dice the data as they need after it’s downloaded.
We aim to provide a large amount of data (50 terabytes) to the public in a way that is efficient, scalable, and easy to use. In many cases, researchers can access these huge cloud-located datasets using the same software and algorithms they have developed for smaller datasets stored locally. Only the pieces of data they need for their individual analysis must be downloaded. To make this work in practice, we worked with the HDF Group and have built upon their forthcoming Highly Scalable Data Service (HSDS).
In the rest of this post, we discuss how the HSDS software was developed to use Amazon EC2 and Amazon S3 resources to provide convenient and scalable access to these huge geospatial datasets. We describe how the HSDS service has been put to work for the WIND Toolkit dataset and demonstrate how to access it using the h5pyd Python library and the REST API. We conclude with information about our ongoing work to release more ‘open’ datasets to the public using AWS services, and ways to improve and extend the HSDS with newer Amazon services like Amazon ECS and AWS Lambda.
Developing a scalable service for big geospatial data
The HDF5 file format and API have been used for many years and are an effective means of storing large scientific datasets. For example, NASA’s Earth Observing System (EOS) satellites collect more than 16 TB of data per day using HDF5.
With the rise of the cloud, there are new challenges and opportunities to rethink how HDF5 can be enhanced to work effectively as a component in a cloud-native architecture. For the HDF Group, working with NREL has been a great opportunity to put ideas into practice with a production-size dataset.
An HDF5 file consists of a directed graph of group and dataset objects. Datasets can be thought of as a multidimensional array with support for user-defined metadata tags and compression. Typical operations on datasets would be reading or writing data to a regular subregion (a hyperslab) or reading and writing individual elements (a point selection). Also, group and dataset objects may each contain an arbitrary number of the user-defined metadata elements known as attributes.
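To make that model concrete, here is a minimal h5py sketch; the file, group, and dataset names are illustrative, not drawn from any real dataset:

```python
# Minimal illustration of the HDF5 object model; all names are made up.
import numpy as np
import h5py

with h5py.File("example.h5", "w") as f:
    grp = f.create_group("simulation")          # groups are nodes in the graph

    # A dataset is a multidimensional array; chunking and compression are optional.
    dset = grp.create_dataset(
        "windspeed", shape=(8760, 500, 500), dtype="f4",
        chunks=(24, 100, 100), compression="gzip"
    )

    dset.attrs["units"] = "m s-1"               # attribute: user-defined metadata

    # Hyperslab write: a regular subregion of the array.
    dset[0:24, 0:100, 0:100] = np.random.rand(24, 100, 100).astype("f4")

    # Point selection: read a single element.
    value = dset[0, 0, 0]
```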
Many people have used the HDF library in applications developed or ported to run on EC2 instances, but there are a number of constraints that often prove problematic:
The HDF5 library can’t read directly from HDF5 files stored as S3 objects. The entire file (often many GB in size) would need to be copied to local storage before the first byte can be read, and the instance must be configured with an appropriately sized EBS volume.
The HDF library only has access to the computational resources of the instance itself (as opposed to a cluster of instances), so many operations are bottlenecked by the library.
Any modifications to the HDF5 file would somehow have to be synchronized with changes that other instances have made to the same file before writing back to S3.
Using a pattern common to many offerings from AWS, the solution to these constraints is to develop a service framework around the HDF data model. Using this model, the HDF Group has created the Highly Scalable Data Service (HSDS) that provides all the functionality that traditionally was provided by the HDF5 library. By using the service, you don’t need to manage your own file volumes, but can just read and write whatever data you need.
Because the service manages the actual data persistence to a durable medium (S3, in this case), you don’t need to worry about disk management. Simply stream the data you need from the service as you need it. Secondly, putting the functionality behind a service allows some tricks to increase performance (described in more detail later). And lastly, HSDS allows any number of clients to access the data at the same time, enabling HDF5 to be used as a coordination mechanism for multiple readers and writers.
In designing the HSDS architecture, we gave much thought to how to achieve scalability of the HSDS service. For accessing HDF5 data, there are two different types of scaling to consider:
Multiple clients making many requests to the service
Single requests that require a significant amount of data processing
To deal with the first scaling challenge, as with most services, we considered how the service responds as the request rate increases. AWS provides some great tools that help in this regard:
Auto Scaling groups
Elastic Load Balancing load balancers
The ability of S3 to handle large aggregate throughput rates
By using a cluster of EC2 instances behind a load balancer, you can handle different client loads in a cost-effective manner.
The second scaling challenge concerns single requests that would take significant processing time with just one compute node. One example of this from the WIND toolkit would be extracting all the values in the seven-year time span for a given geographic point and dataset.
In HDF5, large datasets are typically stored as “chunks”; that is, a regular partition of the array. In HSDS, each chunk is stored as a binary object in S3. The sequential approach to retrieving the time series values would be for the service to read each chunk needed from S3, extract the needed elements, and go on to the next chunk. In this case, that would involve processing 2557 chunks, and would be quite slow.
Fortunately, with HSDS, you can speed this up quite a bit by exploiting the compute and I/O capabilities of the cluster. Upon receiving the request, the receiving node can use other nodes in the cluster to read different portions of the selection. With multiple nodes reading from S3 in parallel, performance improves as the cluster size increases.
The diagram below illustrates how this works in a simplified case of four chunks and four nodes.
This architecture has worked well in practice. In testing with the WIND toolkit and time series extraction, we observed a request latency of ~60 seconds using four nodes vs. ~5 seconds with 40 nodes. Performance roughly scales with the size of the cluster.
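Those figures imply a per-chunk cost of roughly 0.1 seconds. The sketch below models the sequential and parallel approaches; the per-chunk cost is an assumption inferred from the measurements above, not a measured constant:

```python
# Rough model of time series extraction across 2557 daily chunks.
# The per-chunk cost is an assumed figure inferred from the quoted timings.
chunks = 2557                      # one chunk per day over seven years
per_chunk_s = 0.1                  # assumed S3 GET + decode cost per chunk

for nodes in (1, 4, 40):
    # Nodes fetch disjoint chunk ranges in parallel, so time divides by node count.
    print(f"{nodes:>3} node(s): ~{chunks * per_chunk_s / nodes:.0f} s")
# 1 node: ~256 s, 4 nodes: ~64 s, 40 nodes: ~6 s -- close to the observed latencies.
```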
A planned enhancement to this is to use AWS Lambda for the worker processing. This enables 1000-way parallel reads at a reasonable cost, as you only pay for the milliseconds of CPU time used with AWS Lambda.
Public access to atmospheric data using HSDS and AWS
An early challenge in releasing the WIND toolkit data was deciding how to subset the data for different use cases. In general, few researchers need access to the entire 0.5 PB of data, and a great deal of efficiency and cost reduction can be gained by creating purpose-built constituent datasets.
NREL grid integration researchers initially extracted a 2-TB subset by selecting 120,000 points where the wind resource seemed appropriate for development. They also kept only the data most important for wind applications (100-m wind speed, converted to power) at the locations of greatest interest to those performing grid studies. To support the remaining users who needed more data resolution, we down-sampled the data to a 60-minute temporal resolution, keeping all the other variables and the spatial resolution intact. This reduced dataset is 50 TB, describing 30+ atmospheric variables for seven years at a 60-minute temporal resolution.
The WindViz browser-based Gridded Wind Toolkit Visualizer was created as an example implementation of the HSDS REST API in JavaScript. The visualizer is written in the style of ECMAScript 2016 using a modern development toolchain that includes webpack and Babel. The source code is available through our GitHub repository. The demo page is hosted via GitHub pages, and we use a cross-origin AJAX request to fetch data from the HSDS service running on the EC2 infrastructure. The visualizer can be used to explore the gridded wind toolkit data on a map. Achieve full spatial resolution by zooming in to a specific region.
Programmatic access is possible using the h5pyd Python library, a distributed analog to the widely used h5py library. Users interact with the datasets (variables) and slice the data from its (time x longitude x latitude) cube form as they see fit.
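As a minimal sketch of what that looks like in practice, a user could extract the full seven-year series at a single grid point as shown below. The domain and dataset names follow NREL’s published examples, and the endpoint and credentials are assumed to be configured in h5pyd’s .hscfg file; verify both against the notebooks described next.

```python
# Slice a wind speed time series from the WIND toolkit without downloading it.
# Domain and dataset names follow NREL's examples; confirm them in the notebooks.
import h5pyd

# Opens a server-side HSDS domain over the REST API; the endpoint and API
# credentials are read from the .hscfg configuration file.
f = h5pyd.File("/nrel/wtk-us.h5", "r")

dset = f["windspeed_100m"]       # a (time x longitude x latitude) data cube
print(dset.shape, dset.dtype)

# Only this hyperslab travels over the network: the full time series at one
# grid point, not the multi-terabyte dataset it lives in.
timeseries = dset[:, 800, 1500]
```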
Examples and use cases are described in a set of Jupyter notebooks available on GitHub. After following the setup instructions in the repository, you will have a Jupyter notebook server running on your EC2 server.
From your laptop, create an SSH tunnel:
$ ssh -L 8888:localhost:8888 (IP address of the EC2 server)
Now, you can browse to localhost:8888 using the correct token, and interact with the notebooks as if they were local. Within the directory, there are examples for accessing the HSDS API and plotting wind and weather data using matplotlib.
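For instance, a notebook cell along these lines plots one point’s wind speed history with matplotlib; this is a sketch that reuses the timeseries slice from the earlier h5pyd example rather than code from the notebooks themselves:

```python
# Plot the time series extracted with h5pyd; continues the earlier sketch.
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 4))
plt.plot(timeseries)                      # hourly values at one grid point
plt.xlabel("Hour of record")
plt.ylabel("100-m wind speed (m/s)")
plt.title("WIND toolkit time series at a single grid point")
plt.tight_layout()
plt.show()
```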
Controlling access and defraying costs
A final concern is rate limiting and access control. Although the HSDS service is scalable and relatively robust, we had a few practical concerns:
How can we protect from malicious or accidental use that may lead to high egress fees (for example, someone who attempts to repeatedly download the entire dataset from S3)?
How can we keep track of who is using the data both to document the value of the data resource and to justify the costs?
If costs become too high, can we charge for some or all API use to help cover the costs?
To approach these problems, we investigated using Amazon API Gateway and its simplified integration with the AWS Marketplace for SaaS monetization as well as third-party API proxies.
In the end, we chose to use API Umbrella due to its close involvement with http://data.gov. While AWS Marketplace is a compelling option for future datasets, the decision was made to keep this dataset entirely open, at least for now. As community use and associated costs grow, we’ll likely revisit Marketplace. Meanwhile, API Umbrella provides controls for rate limiting and API key registration out of the box and was simple to implement as a front-end proxy to HSDS. Applications that want to charge for API use can implement a similar strategy using Amazon API Gateway and AWS Marketplace.
Ongoing work and other resources
As NREL and other government research labs, municipalities, and organizations try to share data with the public, we expect many of you will face similar challenges to those we have tried to approach with the architecture described in this post. Providing large datasets is one challenge. Doing so in a way that is affordable and convenient for users is an even more difficult goal. Using AWS cloud-native services and the existing foundation of the HDF file format has allowed us to tackle that challenge in a meaningful way.
Dr. Caleb Phillips is a senior scientist with the Data Analysis and Visualization Group within the Computational Sciences Center at the National Renewable Energy Laboratory. Caleb comes from a background in computer science systems, applied statistics, computational modeling, and optimization. His work at NREL spans the breadth of renewable energy technologies and focuses on applying modern data science techniques to data problems at scale.
Dr. Caroline Draxl is a senior scientist at NREL. She supports the research and modeling activities of the US Department of Energy from mesoscale to wind plant scale. Caroline uses mesoscale models to research wind resources in various countries, and participates in on- and offshore boundary layer research and in the coupling of the mesoscale flow features (kilometer scale) to the microscale (tens of meters). She holds a M.S. degree in Meteorology and Geophysics from the University of Innsbruck, Austria, and a PhD in Meteorology from the Technical University of Denmark.
John Readey has been a Senior Architect at The HDF Group since he joined in June 2014. His interests include web services related to HDF, applications that support the use of HDF, and data visualization. Before joining The HDF Group, John worked at Amazon.com from 2006–2014, where he developed service-based systems for eCommerce and AWS.
Jordan Perr-Sauer is an RPP intern with the Data Analysis and Visualization Group within the Computational Sciences Center at the National Renewable Energy Laboratory. Jordan hopes to use his professional background in software engineering and his academic training in applied mathematics to solve the challenging problems facing America and the world.
Artificial intelligence technologies have the potential to upend the longstanding advantage that attack has over defense on the Internet. This has to do with the relative strengths and weaknesses of people and computers, how those all interplay in Internet security, and where AI technologies might change things.
You can divide Internet security tasks into two sets: what humans do well and what computers do well. Traditionally, computers excel at speed, scale, and scope. They can launch attacks in milliseconds and infect millions of computers. They can scan computer code to look for particular kinds of vulnerabilities, and data packets to identify particular kinds of attacks.
Humans, conversely, excel at thinking and reasoning. They can look at the data and distinguish a real attack from a false alarm, understand the attack as it’s happening, and respond to it. They can find new sorts of vulnerabilities in systems. Humans are creative and adaptive, and can understand context.
Computers — so far, at least — are bad at what humans do well. They’re not creative or adaptive. They don’t understand context. They can behave irrationally because of those things.
Humans are slow, and get bored at repetitive tasks. They’re terrible at big data analysis. They use cognitive shortcuts, and can only keep a few data points in their head at a time. They can also behave irrationally because of those things.
AI will allow computers to take over Internet security tasks from humans, and then do them faster and at scale. Here are possible AI capabilities:
Discovering new vulnerabilities — and, more importantly, new types of vulnerabilities in systems, both by the offense to exploit and by the defense to patch, and then automatically exploiting or patching them.
Reacting and adapting to an adversary’s actions, again both on the offense and defense sides. This includes reasoning about those actions and what they mean in the context of the attack and the environment.
Abstracting lessons from individual incidents, generalizing them across systems and networks, and applying those lessons to increase attack and defense effectiveness elsewhere.
Identifying strategic and tactical trends from large datasets and using those trends to adapt attack and defense tactics.
That’s an incomplete list. I don’t think anyone can predict what AI technologies will be capable of. But it’s not unreasonable to look at what humans do today and imagine a future where AIs are doing the same things, only at computer speeds, scale, and scope.
Both attack and defense will benefit from AI technologies, but I believe that AI has the capability to tip the scales more toward defense. There will be better offensive and defensive AI techniques. But here’s the thing: defense is currently in a worse position than offense precisely because of the human components. Present-day attacks pit the relative advantages of computers and humans against the relative weaknesses of computers and humans. Computers moving into what are traditionally human areas will rebalance that equation.
Roy Amara famously said that we overestimate the short-term effects of new technologies, but underestimate their long-term effects. AI is notoriously hard to predict, so many of the details I speculate about are likely to be wrong — and AI is likely to introduce new asymmetries that we can’t foresee. But AI is the most promising technology I’ve seen for bringing defense up to par with offense. For Internet security, that will change everything.
This essay previously appeared in the March/April 2018 issue of IEEE Security & Privacy.
As we head into 2018 and start looking forward to longer days in the Northern hemisphere, I thought I’d take a look back at last year’s weather using data from Raspberry Pi Oracle Weather Stations. One of the great things about the kit is that as well as uploading all its readings to the shared online Oracle database, it stores them locally on the Pi in a MySQL or MariaDB database. This means you can use the power of SQL queries coupled with Python code to do automatic data analysis.
Soggy Surrey
My Weather Station has only been installed since May, so I didn’t have a full 52 weeks of my own data to investigate. Still, my station recorded more than 70,000 measurements. Living in England, the first thing I wanted to know was: which was the wettest month? Unsurprisingly, both in terms of average daily rainfall and total rainfall, the start of the summer period — exactly when I went on a staycation — was the soggiest.
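If you want to run the same analysis, a minimal sketch looks like the following. It assumes the standard Weather Station schema (a weather database with a WEATHER_MEASUREMENT table holding RAINFALL readings and a CREATED timestamp) and placeholder credentials; adjust both to match your installation:

```python
# Rank months by rainfall from a Weather Station's local MySQL/MariaDB database.
# Table and column names assume the standard Weather Station schema.
import MySQLdb  # or pymysql, depending on what is installed on your Pi

db = MySQLdb.connect(host="localhost", user="pi",
                     passwd="your_password", db="weather")  # placeholder credentials
cursor = db.cursor()

# Total and average daily rainfall per month, wettest month first.
cursor.execute("""
    SELECT MONTH(CREATED) AS month,
           SUM(RAINFALL) AS total_mm,
           SUM(RAINFALL) / COUNT(DISTINCT DATE(CREATED)) AS avg_daily_mm
    FROM WEATHER_MEASUREMENT
    GROUP BY MONTH(CREATED)
    ORDER BY total_mm DESC
""")

for month, total_mm, avg_daily_mm in cursor.fetchall():
    print(f"Month {month:2d}: {total_mm:7.1f} mm total, {avg_daily_mm:5.2f} mm/day")

db.close()
```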
What about the global Weather Station community?
Even soggier Bavaria
Here things get slightly trickier. Although we have a shiny Oracle database full of all participating schools’ sensor readings, some of the data needs careful interpretation. Many kits are used as part of the school curriculum and do not always record genuine outdoor conditions. Nevertheless, it appears that Adalbert Stifter Gymnasium in Bavaria, Germany, had an even wetter 2017 than my home did.
The records from Robert-Dannemann Schule in Westerstede, Germany, are a good example of data most likely collected while testing and investigating the weather station sensors, rather than in genuine external conditions. Unless this school’s Weather Station was transported to a planet which suffers from extreme hurricanes, it wasn’t actually subjected to wind speeds above 1000 km/h in November. Dismissing these and all similarly suspect records, I decided to award the ‘Windiest location of the year’ prize to CEIP Noalla-Telleiro, Spain.
This school is right on the coast, and is subject to some strong and squally weather systems.
Weather Station at CEIP Noalla-Telleiro
They’ve mounted their wind vane and anemometer nice and high, so I can see how they were able to record such high wind velocities.
A couple of Weather Stations have recently been commissioned in equally exposed places — it will be interesting to see whether they will record even higher speeds during 2018.
Highs and lows
After careful analysis and a few disqualifications (a couple of Weather Stations in contention for this category were housed indoors), the ‘Hottest location’ award went to High School of Chalastra in Thessaloniki, Greece. There were a couple of Weather Stations (the one at The Marwadi Education Foundation in India, for example) that reported higher average temperatures than Chalastra’s 24.54 ºC. However, they had uploaded far fewer readings and their data coverage of 2017 was only partial.
At the other end of the thermometer, the location with the coldest average temperature was École de la Rose Sauvage in Calgary, Canada, with a very chilly 9.9 ºC.
Weather Station at École de la Rose Sauvage
I suspect this school has a good chance of retaining the title: their lowest 2017 temperature of -24 ºC is likely to be beaten in 2018 due to extreme weather currently bringing a freezing start to the year in that part of the world.
If you have an Oracle Raspberry Pi Weather Station and would like to perform an annual review of your local data, you can use this Python script as a starting point. It will display a monthly summary of the temperature and rainfall for 2017, and you should be able to customise the code to focus on other sensor data or on a particular time of year. We’d love to see your results, so please share your findings with [email protected], and we’ll send you some limited-edition Weather Station stickers.
One of the technology areas I thoroughly enjoy is the Internet of Things (IoT). Even as a child, I used to infuriate my parents by taking apart the toys they bought me to see how they worked and whether I could somehow put them back together. It seems I was somehow destined to end up in the tough and ever-changing world of technology. It’s therefore no wonder that I really enjoy learning and tinkering with IoT devices and technologies. They combine my love of development and software engineering with my curiosity around circuits, controllers, and other facets of the electrical engineering discipline, even though I can’t claim to be an electrical engineer.
Despite all of the information collected by deployed IoT devices and solutions, I honestly never thought much about the need to analyze, search, and process that data until I came up against a scenario where it was of the utmost importance to search and query through loads of sensor data for an anomaly. Of course, I understood the importance of analytics for businesses making accurate decisions and predictions to drive the organization’s direction. But it didn’t occur to me initially how important it was to make analytics an integral part of my IoT solutions. Well, I learned my lesson just in time, because this re:Invent, a service is launching that makes it easier for anyone to process and analyze IoT messages and device data.
Hello, AWS IoT Analytics! AWS IoT Analytics is a fully managed service of AWS IoT that provides advanced data analysis of data collected from your IoT devices. With the AWS IoT Analytics service, you can process messages, gather and store large amounts of device data, and query your data. The new AWS IoT Analytics service also integrates with Amazon QuickSight for visualization of your data and brings the power of machine learning through integration with Jupyter Notebooks.
Benefits of AWS IoT Analytics
Helps with predictive analysis of data by providing access to pre-built analytical functions
Provides the ability to visualize analytical output from the service
Provides tools to clean up data
Can help identify patterns in the gathered data
Be In the Know: IoT Analytics Concepts
Channel: archives the raw, unprocessed messages and collects data from MQTT topics.
Pipeline: consumes messages from channels and allows message processing.
Activities: perform transformations on your messages, including filtering attributes and invoking Lambda functions for advanced processing.
Data Store: used as a queryable repository for processed messages. Provides the ability to have multiple data stores for messages coming from different devices or locations, or filtered by message attributes.
Data Set: a data retrieval view from a data store; it can be generated on a recurring schedule.
Getting Started with AWS IoT Analytics
First, I’ll create a channel to receive incoming messages. This channel can be used to ingest data sent to the channel via MQTT or messages directed from the Rules Engine. To create a channel, I’ll select the Channels menu option and then click the Create a channel button.
I’ll name my channel, TaraIoTAnalyticsID, and give the Channel an MQTT topic filter of Temperature. To complete the creation of my channel, I will click the Create Channel button.
Now that I have my Channel created, I need to create a Data Store to receive and store the messages received on the Channel from my IoT device. Remember, you can set up multiple Data Stores for more complex solution needs, but I’ll just create one Data Store for my example. I’ll select Data Stores from the menu panel and click Create a data store.
I’ll name my Data Store, TaraDataStoreID. Once I click the Create the data store button, I will have successfully set up a Data Store to house messages coming from my Channel.
Now that I have my Channel and my Data Store, I will need to connect the two using a Pipeline. I’ll create a simple pipeline that just connects my Channel and Data Store, but you can create a more robust pipeline to process and filter messages by adding Pipeline activities like a Lambda activity.
To create a pipeline, I’ll select the Pipelines menu option and then click the Create a pipeline button.
I will not add an Attribute for this pipeline, so I will click the Next button.
As discussed, there are additional pipeline activities that I can add to my pipeline for processing and transforming messages, but I will keep my first pipeline simple and click the Next button.
The final step in creating my pipeline is for me to select my previously created Data Store and click Create Pipeline.
All that is left for me to take advantage of the AWS IoT Analytics service is to create an IoT rule that sends data to an AWS IoT Analytics channel. Wow, that was a super easy process to set up analytics for IoT devices.
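For anyone who prefers scripting the same setup, here is a hedged boto3 sketch of the steps above. The channel and data store names match the walkthrough, while the pipeline name, rule name, and IAM role ARN are placeholders I have made up:

```python
# A sketch of the console walkthrough above using boto3:
# channel -> pipeline -> data store, plus an IoT rule that feeds the channel.
import boto3

iota = boto3.client("iotanalytics")
iot = boto3.client("iot")

iota.create_channel(channelName="TaraIoTAnalyticsID")
iota.create_datastore(datastoreName="TaraDataStoreID")

# A pipeline is a list of named activities; the simplest one just connects
# the channel to the data store.
iota.create_pipeline(
    pipelineName="TaraPipelineID",  # placeholder name
    pipelineActivities=[
        {"channel": {"name": "from_channel",
                     "channelName": "TaraIoTAnalyticsID",
                     "next": "to_datastore"}},
        {"datastore": {"name": "to_datastore",
                       "datastoreName": "TaraDataStoreID"}},
    ],
)

# Route matching MQTT messages into the channel via the Rules Engine.
iot.create_topic_rule(
    ruleName="SendTemperatureToAnalytics",  # placeholder name
    topicRulePayload={
        "sql": "SELECT * FROM 'Temperature'",
        "actions": [{"iotAnalytics": {
            "channelName": "TaraIoTAnalyticsID",
            "roleArn": "arn:aws:iam::123456789012:role/IoTAnalyticsRole",  # placeholder
        }}],
    },
)
```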
If I wanted to create a Data Set as a result of queries run against my data for visualization with Amazon QuickSight, or to integrate with Jupyter Notebooks to perform more advanced analytical functions, I can choose the Analyze menu option to bring up the screens to create data sets and access the Jupyter Notebook instances.
Summary
As you can see, it was a very simple process to set up advanced data analysis for AWS IoT. With AWS IoT Analytics, you have the ability to collect, process, query, visualize, and store large amounts of data generated from your AWS IoT connected devices. Additionally, you can access the AWS IoT Analytics service in a myriad of ways: the AWS Command Line Interface (AWS CLI), the AWS IoT API, language-specific AWS SDKs, and AWS IoT Device SDKs.
AWS IoT Analytics is available today for you to dig into the analysis of your IoT data. To learn more about AWS IoT and AWS IoT Analytics go to the AWS IoT Analytics product page and/or the AWS IoT documentation.
We can’t believe that there are just a few days left before re:Invent 2017. If you are attending this year, you’ll want to check out our Big Data sessions! The Big Data and Machine Learning categories are bigger than ever. As in previous years, you can find these sessions in various tracks, including Analytics & Big Data, Deep Learning Summit, Artificial Intelligence & Machine Learning, Architecture, and Databases.
We have great sessions from organizations and companies like Vanguard, Cox Automotive, Pinterest, Netflix, FINRA, Amtrak, AmazonFresh, Sysco Foods, Twilio, American Heart Association, Expedia, Esri, Nextdoor, and many more. All sessions are recorded and made available on YouTube. In addition, all slide decks from the sessions will be available on SlideShare.net after the conference.
This post highlights the sessions that will be presented as part of the Analytics & Big Data track, as well as relevant sessions from other tracks like Architecture, Artificial Intelligence & Machine Learning, and IoT. If you’re interested in Machine Learning sessions, don’t forget to check out our Guide to Machine Learning at re:Invent 2017.
This year’s session catalog contains the following breakout sessions.
Raju Gulabani, VP of Database, Analytics, and AI at AWS, will discuss the evolution of database and analytics services in AWS, the new database and analytics services and features we launched this year, and our vision for continued innovation in this space. We are witnessing an unprecedented growth in the amount of data collected, in many different forms. Storage, management, and analysis of this data require database services that scale and perform in ways not possible before. AWS offers a collection of database and other data services—including Amazon Aurora, Amazon DynamoDB, Amazon RDS, Amazon Redshift, Amazon ElastiCache, Amazon Kinesis, and Amazon EMR—to process, store, manage, and analyze data. In this session, we provide an overview of AWS database and analytics services and discuss how customers are using these services today.
Deep dive customer use cases
ABD401 – How Netflix Monitors Applications in Near Real-Time with Amazon Kinesis Thousands of services work in concert to deliver millions of hours of video streams to Netflix customers every day. These applications vary in size, function, and technology, but they all make use of the Netflix network to communicate. Understanding the interactions between these services is a daunting challenge both because of the sheer volume of traffic and the dynamic nature of deployments. In this session, we first discuss why Netflix chose Kinesis Streams to address these challenges at scale. We then dive deep into how Netflix uses Kinesis Streams to enrich network traffic logs and identify usage patterns in real time. Lastly, we cover how Netflix uses this system to build comprehensive dependency maps, increase network efficiency, and improve failure resiliency. From this session, you will learn how to build a real-time application monitoring system using network traffic logs and get real-time, actionable insights.
In this session, learn how Nextdoor replaced their home-grown data pipeline based on a topology of Flume nodes with a completely serverless architecture based on Kinesis and Lambda. By making these changes, they improved both the reliability of their data and the delivery times of billions of records of data to their Amazon S3–based data lake and Amazon Redshift cluster. Nextdoor is a private social networking service for neighborhoods.
ABD205 – Taking a Page Out of Ivy Tech’s Book: Using Data for Student Success Data speaks. Discover how Ivy Tech, the nation’s largest singly accredited community college, uses AWS to gather, analyze, and take action on student behavioral data for the betterment of over 3,100 students. This session outlines the process from inception to implementation across the state of Indiana and highlights how Ivy Tech’s model can be applied to your own complex business problems.
ABD207 – Leveraging AWS to Fight Financial Crime and Protect National Security Banks aren’t known to share data and collaborate with one another. But that is exactly what the Mid-Sized Bank Coalition of America (MBCA) is doing to fight digital financial crime—and protect national security. Using the AWS Cloud, the MBCA developed a shared data analytics utility that processes terabytes of non-competitive customer account, transaction, and government risk data. The intelligence produced from the data helps banks increase the efficiency of their operations, cut labor and operating costs, and reduce false positive volumes. The collective intelligence also allows greater enforcement of Anti-Money Laundering (AML) regulations by helping members detect internal risks—and identify the challenges to detecting these risks in the first place. This session demonstrates how the AWS Cloud supports the MBCA to deliver advanced data analytics, provide consistent operating models across financial institutions, reduce costs, and strengthen national security.
ABD208 – Cox Automotive Empowered to Scale with Splunk Cloud & AWS and Explores New Innovation with Amazon Kinesis Firehose In this session, learn how Cox Automotive is using Splunk Cloud for real time visibility into its AWS and hybrid environments to achieve near instantaneous MTTI, reduce auction incidents by 90%, and proactively predict outages. We also introduce a highly anticipated capability that allows you to ingest, transform, and analyze data in real time using Splunk and Amazon Kinesis Firehose to gain valuable insights from your cloud resources. It’s now quicker and easier than ever to gain access to analytics-driven infrastructure monitoring using Splunk Enterprise & Splunk Cloud.
ABD209 – Accelerating the Speed of Innovation with a Data Sciences Data & Analytics Hub at Takeda Historically, silos of data, analytics, and processes across functions, stages of development, and geography created a barrier to R&D efficiency. Gathering the right data necessary for decision-making was challenging due to issues of accessibility, trust, and timeliness. In this session, learn how Takeda is undergoing a transformation in R&D to increase the speed-to-market of high-impact therapies to improve patient lives. The Data and Analytics Hub was built, with Deloitte, to address these issues and support the efficient generation of data insights for functions such as clinical operations, clinical development, medical affairs, portfolio management, and R&D finance. In the AWS hosted data lake, this data is processed, integrated, and made available to business end users through data visualization interfaces, and to data scientists through direct connectivity. Learn how Takeda has achieved significant time reductions—from weeks to minutes—to gather and provision data that has the potential to reduce cycle times in drug development. The hub also enables more efficient operations and alignment to achieve product goals through cross functional team accountability and collaboration due to the ability to access the same cross domain data.
ABD210 – Modernizing Amtrak: Serverless Solution for Real-Time Data Capabilities As the nation’s only high-speed intercity passenger rail provider, Amtrak needs to know critical information to run their business such as: Who’s onboard any train at any time? How are booking and revenue trending? Amtrak was faced with unpredictable and often slow response times from existing databases, ranging from seconds to hours; existing booking and revenue dashboards were spreadsheet-based and manual; multiple copies of data were stored in different repositories, lacking integration and consistency; and operations and maintenance (O&M) costs were relatively high. Join us as we demonstrate how Deloitte and Amtrak successfully went live with a cloud-native operational database and analytical datamart for near-real-time reporting in under six months. We highlight the specific challenges and the modernization of architecture on an AWS native Platform as a Service (PaaS) solution. The solution includes cloud-native components such as AWS Lambda for microservices, Amazon Kinesis and AWS Data Pipeline for moving data, Amazon S3 for storage, Amazon DynamoDB for a managed NoSQL database service, and Amazon Redshift for near-real time reports and dashboards. Deloitte’s solution enabled “at scale” processing of 1 million transactions/day and up to 2K transactions/minute. It provided flexibility and scalability, largely eliminated the need for system management, and dramatically reduced operating costs. Moreover, it laid the groundwork for decommissioning legacy systems, anticipated to save at least $1M over 3 years.
ABD211 – Sysco Foods: A Journey from Too Much Data to Curated Insights In this session, we detail Sysco’s journey from a company focused on hindsight-based reporting to one focused on insights and foresight. For this shift, Sysco moved from multiple data warehouses to an AWS ecosystem, including Amazon Redshift, Amazon EMR, AWS Data Pipeline, and more. As the team at Sysco worked with Tableau, they gained agile insight across their business. Learn how Sysco decided to use AWS, how they scaled, and how they became more strategic with the AWS ecosystem and Tableau.
ABD217 – From Batch to Streaming: How Amazon Flex Uses Real-time Analytics to Deliver Packages on Time Reducing the time to get actionable insights from data is important to all businesses, and customers who employ batch data analytics tools are exploring the benefits of streaming analytics. Learn best practices to extend your architecture from data warehouses and databases to real-time solutions. Learn how to use Amazon Kinesis to get real-time data insights and integrate them with Amazon Aurora, Amazon RDS, Amazon Redshift, and Amazon S3. The Amazon Flex team describes how they used streaming analytics in their Amazon Flex mobile app, used by Amazon delivery drivers to deliver millions of packages each month on time. They discuss the architecture that enabled the move from a batch processing system to a real-time system, overcoming the challenges of migrating existing batch data to streaming data, and how to benefit from real-time analytics.
ABD218 – How EuroLeague Basketball Uses IoT Analytics to Engage Fans IoT and big data have made their way out of industrial applications, general automation, and consumer goods, and are now a valuable tool for improving consumer engagement across a number of industries, including media, entertainment, and sports. The low cost and ease of implementation of AWS analytics services and AWS IoT have allowed AGT, a leader in IoT, to develop their IoTA analytics platform. Using IoTA, AGT brought a tailored solution to EuroLeague Basketball for real-time content production and fan engagement during the 2017-18 season. In this session, we take a deep dive into how this solution is architected for secure, scalable, and highly performant data collection from athletes, coaches, and fans. We also talk about how the data is transformed into insights and integrated into a content generation pipeline. Lastly, we demonstrate how this solution can be easily adapted for other industries and applications.
ABD222 – How to Confidently Unleash Data to Meet the Needs of Your Entire Organization Where are you on the spectrum of IT leaders? Are you confident that you’re providing the technology and solutions that consistently meet or exceed the needs of your internal customers? Do your peers at the executive table see you as an innovative technology leader? Innovative IT leaders understand the value of getting data and analytics directly into the hands of decision makers, and into their own. In this session, Daren Thayne, Domo’s Chief Technology Officer, shares how innovative IT leaders are helping drive a culture change at their organizations. See how transformative it can be to have real-time access to all of the data that is relevant to YOUR job (including a complete view of your entire AWS environment), as well as understand how it can help you lead the way in applying that same pattern throughout your entire company.
ABD303 – Developing an Insights Platform – Sysco’s Journey from Disparate Systems to Data Lake and Beyond Sysco has nearly 200 operating companies across its multiple lines of business throughout the United States, Canada, Central/South America, and Europe. As the global leader in food services, Sysco identified the need to streamline the collection, transformation, and presentation of data produced by the distributed units and systems, into a central data ecosystem. Sysco’s Business Intelligence and Analytics team addressed these requirements by creating a data lake with scalable analytics and query engines leveraging AWS services. In this session, Sysco will outline their journey from a hindsight reporting focused company to an insights driven organization. They will cover solution architecture, challenges, and lessons learned from deploying a self-service insights platform. They will also walk through the design patterns they used and how they designed the solution to provide predictive analytics using Amazon Redshift Spectrum, Amazon S3, Amazon EMR, AWS Glue, Amazon Elasticsearch Service and other AWS services.
ABD309 – How Twilio Scaled Its Data-Driven Culture As a leading cloud communications platform, Twilio has always been strongly data-driven. But as headcount and data volumes grew—and grew quickly—they faced many new challenges. One-off, static reports work when you’re a small startup, but how do you support a growth stage company to a successful IPO and beyond? Today, Twilio’s data team relies on AWS and Looker to provide data access to 700 colleagues. Departments have the data they need to make decisions, and cloud-based scale means they get answers fast. Data delivers real-business value at Twilio, providing a 360-degree view of their customer, product, and business. In this session, you hear firsthand stories directly from the Twilio data team and learn real-world tips for fostering a truly data-driven culture at scale.
ABD310 – How FINRA Secures Its Big Data and Data Science Platform on AWS FINRA uses big data and data science technologies to detect fraud, market manipulation, and insider trading across US capital markets. As a financial regulator, FINRA analyzes highly sensitive data, so information security is critical. Learn how FINRA secures its Amazon S3 Data Lake and its data science platform on Amazon EMR and Amazon Redshift, while empowering data scientists with tools they need to be effective. In addition, FINRA shares AWS security best practices, covering topics such as AMI updates, micro segmentation, encryption, key management, logging, identity and access management, and compliance.
ABD331 – Log Analytics at Expedia Using Amazon Elasticsearch Service Expedia uses Amazon Elasticsearch Service (Amazon ES) for a variety of mission-critical use cases, ranging from log aggregation to application monitoring and pricing optimization. In this session, the Expedia team reviews how they use Amazon ES and Kibana to analyze and visualize Docker startup logs, AWS CloudTrail data, and application metrics. They share best practices for architecting a scalable, secure log analytics solution using Amazon ES, so you can add new data sources almost effortlessly and get insights quickly.
ABD316 – American Heart Association: Finding Cures to Heart Disease Through the Power of Technology Combining disparate datasets and making them accessible to data scientists and researchers is a prevalent challenge for many organizations, not just in healthcare research. American Heart Association (AHA) has built a data science platform using Amazon EMR, Amazon Elasticsearch Service, and other AWS services, that corrals multiple datasets and enables advanced research on phenotype and genotype datasets, aimed at curing heart diseases. In this session, we present how AHA built this platform and the key challenges they addressed with the solution. We also provide a demo of the platform, and leave you with suggestions and next steps so you can build similar solutions for your use cases.
ABD319 – Tooling Up for Efficiency: DIY Solutions @ Netflix At Netflix, we have traditionally approached cloud efficiency from a human standpoint, whether it be in-person meetings with the largest service teams or manually flipping reservations. Over time, we realized that these manual processes are not scalable as the business continues to grow. Therefore, in the past year, we have focused on building out tools that allow us to make more insightful, data-driven decisions around capacity and efficiency. In this session, we discuss the DIY applications, dashboards, and processes we built to help with capacity and efficiency. We start at the ten thousand foot view to understand the unique business and cloud problems that drove us to create these products, and discuss implementation details, including the challenges encountered along the way. Tools discussed include Picsou, the successor to our AWS billing file cost analyzer; Libra, an easy-to-use reservation conversion application; and cost and efficiency dashboards that relay useful financial context to 50+ engineering teams and managers.
ABD312 – Deep Dive: Migrating Big Data Workloads to AWS Customers are migrating their analytics, data processing (ETL), and data science workloads running on Apache Hadoop, Spark, and data warehouse appliances from on-premises deployments to AWS in order to save costs, increase availability, and improve performance. AWS offers a broad set of analytics services, including solutions for batch processing, stream processing, machine learning, data workflow orchestration, and data warehousing. This session will focus on identifying the components and workflows in your current environment, and providing the best practices to migrate these workloads to the right AWS data analytics product. We will cover services such as Amazon EMR, Amazon Athena, Amazon Redshift, Amazon Kinesis, and more. We will also feature Vanguard, an American investment management company based in Malvern, Pennsylvania with over $4.4 trillion in assets under management. Ritesh Shah, Sr. Program Manager for Cloud Analytics Program at Vanguard, will describe how they orchestrated their migration to AWS analytics services, including Hadoop and Spark workloads to Amazon EMR. Ritesh will highlight the technical challenges they faced and overcame along the way, as well as share common recommendations and tuning tips to accelerate the time to production.
ABD402 – How Esri Optimizes Massive Image Archives for Analytics in the Cloud Petabyte-scale archives of satellite, plane, and drone imagery continue to grow exponentially. They mostly exist as semi-structured data, but they are only valuable when accessed and processed by a wide range of products for both visualization and analysis. This session provides an overview of how ArcGIS indexes and structures data so that any part of it can be quickly accessed, processed, and analyzed by reading only the minimum amount of data needed for the task. In this session, we share best practices for structuring and compressing massive datasets in Amazon S3, so it can be analyzed efficiently. We also review a number of different image formats, including GeoTIFF (used for the Public Datasets on AWS program, Landsat on AWS), cloud optimized GeoTIFF, MRF, and CRF as well as different compression approaches to show the effect on processing performance. Finally, we provide examples of how this technology has been used to help image processing and analysis for the response to Hurricane Harvey.
ABD329 – A Look Under the Hood – How Amazon.com Uses AWS Services for Analytics at Massive Scale Amazon’s consumer business continues to grow, and so does the volume of data and the number and complexity of the analytics done in support of the business. In this session, we talk about how Amazon.com uses AWS technologies to build a scalable environment for data and analytics. We look at how Amazon is evolving the world of data warehousing with a combination of a data lake and parallel, scalable compute engines such as Amazon EMR and Amazon Redshift.
ABD327 – Migrating Your Traditional Data Warehouse to a Modern Data Lake In this session, we discuss the latest features of Amazon Redshift and Redshift Spectrum, and take a deep dive into its architecture and inner workings. We share many of the recent availability, performance, and management enhancements and how they improve your end user experience. You also hear from 21st Century Fox, who presents a case study of their fast migration from an on-premises data warehouse to Amazon Redshift. Learn how they are expanding their data warehouse to a data lake that encompasses multiple data sources and data formats. This architecture helps them tie together siloed business units and get actionable 360-degree insights across their consumer base.
MCL202 – Ally Bank & Cognizant: Transforming Customer Experience Using Amazon Alexa Given the increasing popularity of natural language interfaces such as Voice as User technology or conversational artificial intelligence (AI), Ally® Bank was looking to interact with customers by enabling direct transactions through conversation or voice. They also needed to develop a capability that allows third parties to connect to the bank securely for information sharing and exchange, using oAuth, an authentication protocol seen as the future of secure banking technology. Cognizant’s Architecture team partnered with Ally Bank’s Enterprise Architecture group and identified the right product for oAuth integration with Amazon Alexa and third-party technologies. In this session, we discuss how building products with conversational AI helps Ally Bank offer an innovative customer experience; increase retention through improved data-driven personalization; increase the efficiency and convenience of customer service; and gain deep insights into customer needs through data analysis and predictive analytics to offer new products and services.
MCL317 – Orchestrating Machine Learning Training for Netflix Recommendations At Netflix, we use machine learning (ML) algorithms extensively to recommend relevant titles to our 100+ million members based on their tastes. Everything on the member home page is an evidence-driven, A/B-tested experience that we roll out backed by ML models. These models are trained using Meson, our workflow orchestration system. Meson distinguishes itself from other workflow engines by handling more sophisticated execution graphs, such as loops and parameterized fan-outs. Meson can schedule Spark jobs, Docker containers, bash scripts, gists of Scala code, and more. Meson also provides a rich visual interface for monitoring active workflows and inspecting execution logs. It has a powerful Scala DSL for authoring workflows as well as the REST API. In this session, we focus on how Meson trains recommendation ML models in production, and how we have re-architected it to scale up for a growing need of broad ETL applications within Netflix. As a driver for this change, we have had to evolve the persistence layer for Meson. We talk about how we migrated from Cassandra to Amazon RDS backed by Amazon Aurora.
MCL350 – Humans vs. the Machines: How Pinterest Uses Amazon Mechanical Turk’s Worker Community to Improve Machine Learning Ever since the term “crowdsourcing” was coined in 2006, it’s been a buzzword for technology companies and social institutions. In the technology sector, crowdsourcing is instrumental for verifying machine learning algorithms, which, in turn, improves the user’s experience. In this session, we explore how Pinterest adapted to an increased reliability on human evaluation to improve their product, with a focus on how they’ve integrated with Mechanical Turk’s platform. This presentation is aimed at engineers, analysts, program managers, and product managers who are interested in how companies rely on Mechanical Turk’s human evaluation platform to better understand content and improve machine learning algorithms. The discussion focuses on the analysis and product decisions related to building a high quality crowdsourcing system that takes advantage of Mechanical Turk’s powerful worker community.
ABD201 – Big Data Architectural Patterns and Best Practices on AWS In this session, we simplify big data processing as a data bus comprising various stages: collect, store, process, analyze, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, durability, and so on. Finally, we provide reference architectures, design patterns, and best practices for assembling these technologies to solve your big data problems at the right cost.
ABD202 – Best Practices for Building Serverless Big Data Applications Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. In this session, we show you how to incorporate serverless concepts into your big data architectures. We explore the concepts behind and benefits of serverless architectures for big data, looking at design patterns to ingest, store, process, and visualize your data. Along the way, we explain when and how you can use serverless technologies to streamline data processing, minimize infrastructure management, and improve agility and robustness and share a reference architecture using a combination of cloud and open source technologies to solve your big data problems. Topics include: use cases and best practices for serverless big data applications; leveraging AWS technologies such as Amazon DynamoDB, Amazon S3, Amazon Kinesis, AWS Lambda, Amazon Athena, and Amazon EMR; and serverless ETL, event processing, ad hoc analysis, and real-time analytics.
ABD206 – Building Visualizations and Dashboards with Amazon QuickSight Just as a picture is worth a thousand words, a visual is worth a thousand data points. A key aspect of our ability to gain insights from our data is to look for patterns, and these patterns are often not evident when we simply look at data in tables. The right visualization will help you gain a deeper understanding in a much quicker timeframe. In this session, we will show you how to quickly and easily visualize your data using Amazon QuickSight. We will show you how you can connect to data sources, generate custom metrics and calculations, create comprehensive business dashboards with various chart types, and setup filters and drill downs to slice and dice the data.
ABD203 – Real-Time Streaming Applications on AWS: Use Cases and Patterns To win in the marketplace and provide differentiated customer experiences, businesses need to be able to use live data in real time to facilitate fast decision making. In this session, you learn common streaming data processing use cases and architectures. First, we give an overview of streaming data and AWS streaming data capabilities. Next, we look at a few customer examples and their real-time streaming applications. Finally, we walk through common architectures and design patterns of top streaming data use cases.
ABD213 – How to Build a Data Lake with AWS Glue Data Catalog As data volumes grow and customers store more data on AWS, they often have valuable data that is not easily discoverable and available for analytics. The AWS Glue Data Catalog provides a central view of your data lake, making data readily available for analytics. We introduce key features of the AWS Glue Data Catalog and its use cases. Learn how crawlers can automatically discover your data, extract relevant metadata, and add it as table definitions to the AWS Glue Data Catalog. We will also explore the integration between AWS Glue Data Catalog and Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.
ABD214 – Real-time User Insights for Mobile and Web Applications with Amazon Pinpoint With customers demanding relevant and real-time experiences across a range of devices, digital businesses are looking to gather user data at scale, understand this data, and respond to customer needs instantly. This requires tools that can record large volumes of user data in a structured fashion, and then instantly make this data available to generate insights. In this session, we demonstrate how you can use Amazon Pinpoint to capture user data in a structured yet flexible manner. Further, we demonstrate how this data can be set up for instant consumption using services like Amazon Kinesis Firehose and Amazon Redshift. We walk through example data based on real world scenarios, to illustrate how Amazon Pinpoint lets you easily organize millions of events, record them in real-time, and store them for further analysis.
ABD223 – IT Innovators: New Technology for Leveraging Data to Enable Agility, Innovation, and Business Optimization Companies of all sizes are looking for technology to efficiently leverage data and their existing IT investments to stay competitive and understand where to find new growth. Regardless of where companies are in their data-driven journey, they face greater demands for information by customers, prospects, partners, vendors and employees. All stakeholders inside and outside the organization want information on-demand or in “real time”, available anywhere on any device. They want to use it to optimize business outcomes without having to rely on complex software tools or human gatekeepers to relevant information. Learn how IT innovators at companies such as MasterCard, Jefferson Health, and TELUS are using Domo’s Business Cloud to help their organizations more effectively leverage data at scale.
ABD301 – Analyzing Streaming Data in Real Time with Amazon Kinesis Amazon Kinesis makes it easy to collect, process, and analyze real-time, streaming data so you can get timely insights and react quickly to new information. In this session, we present an end-to-end streaming data solution using Kinesis Streams for data ingestion, Kinesis Analytics for real-time processing, and Kinesis Firehose for persistence. We review in detail how to write SQL queries using streaming data and discuss best practices to optimize and monitor your Kinesis Analytics applications. Lastly, we discuss how to estimate the cost of the entire system.
ABD302 – Real-Time Data Exploration and Analytics with Amazon Elasticsearch Service and Kibana In this session, we use Apache web logs as an example and show you how to build an end-to-end analytics solution. First, we cover how to configure an Amazon ES cluster and ingest data using Amazon Kinesis Firehose. We look at best practices for choosing instance types, storage options, shard counts, and index rotations based on the throughput of incoming data. Then we demonstrate how to set up a Kibana dashboard and build custom dashboard widgets. Finally, we review approaches for generating custom, ad-hoc reports.
ABD304 – Best Practices for Data Warehousing with Amazon Redshift & Redshift Spectrum Most companies are overrun with data, yet they lack critical insights to make timely and accurate business decisions. They are missing the opportunity to combine large amounts of new, unstructured big data that resides outside their data warehouse with trusted, structured data inside their data warehouse. In this session, we take an in-depth look at how modern data warehousing blends and analyzes all your data, inside and outside your data warehouse without moving the data, to give you deeper insights to run your business. We will cover best practices on how to design optimal schemas, load data efficiently, and optimize your queries to deliver high throughput and performance.
ABD305 – Design Patterns and Best Practices for Data Analytics with Amazon EMR Amazon EMR is one of the largest Hadoop operators in the world, enabling customers to run ETL, machine learning, real-time processing, data science, and low-latency SQL at petabyte scale. In this session, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters, and other Amazon EMR architectural best practices. We talk about lowering cost with Auto Scaling and Spot Instances, and security best practices for encryption and fine-grained access control. Finally, we dive into some of our recent launches to keep you current on our latest features.
ABD307 – Deep Analytics for Global AWS Marketing Organization To meet the needs of the global marketing organization, the AWS marketing analytics team built a scalable platform that allows the data science team to deliver custom econometric and machine learning models for end user self-service. To meet data security standards, we use end-to-end data encryption and different AWS services such as Amazon Redshift, Amazon RDS, Amazon S3, Amazon EMR with Apache Spark and Auto Scaling. In this session, you see real examples of how we have scaled and automated critical analysis, such as calculating the impact of marketing programs like re:Invent and prioritizing leads for our sales teams.
ABD311 – Deploying Business Analytics at Enterprise Scale with Amazon QuickSight One of the biggest tradeoffs customers usually make when deploying BI solutions at scale is agility versus governance. Large-scale BI implementations with the right governance structure can take months to design and deploy. In this session, learn how you can avoid making this tradeoff using Amazon QuickSight. Learn how to easily deploy Amazon QuickSight to thousands of users using Active Directory and Federated SSO, while securely accessing your data sources in Amazon VPCs or on-premises. We also cover how to control access to your datasets, implement row-level security, create scheduled email reports, and audit access to your data.
ABD315 – Building Serverless ETL Pipelines with AWS Glue Organizations need to gain insight and knowledge from a growing number of Internet of Things (IoT) devices, APIs, clickstreams, and unstructured and log data sources. However, organizations are also often limited by legacy data warehouses and ETL processes that were designed for transactional data. In this session, we introduce key ETL features of AWS Glue and cover common use cases, ranging from scheduled nightly data warehouse loads to near real-time, event-driven ETL flows for your data lake. We discuss how to build scalable, efficient, and serverless ETL pipelines using AWS Glue. Additionally, Merck will share how they built an end-to-end ETL pipeline for their application release management system, and launched it in production in less than a week using AWS Glue.
ABD318 – Architecting a data lake with Amazon S3, Amazon Kinesis, and Amazon Athena Learn how to architect a data lake where different teams within your organization can publish and consume data in a self-service manner. As organizations aim to become more data-driven, data engineering teams have to build architectures that can cater to the needs of diverse users – from developers, to business analysts, to data scientists. Each of these user groups employs different tools, has different data needs, and accesses data in different ways. In this talk, we will dive deep into assembling a data lake using Amazon S3, Amazon Kinesis, Amazon Athena, Amazon EMR, and AWS Glue. The session will feature Mohit Rao, Architect and Integration lead at Atlassian, the maker of products such as JIRA, Confluence, and Stride. First, we will look at a couple of common architectures for building a data lake. Then we will show how Atlassian built a self-service data lake, where any team within the company can publish a dataset to be consumed by a broad set of users.
Companies have valuable data that they may not be analyzing due to the complexity, scalability, and performance issues of loading the data into their data warehouse. However, with the right tools, you can extend your analytics to query data in your data lake—with no loading required. Amazon Redshift Spectrum extends the analytic power of Amazon Redshift beyond data stored in your data warehouse to run SQL queries directly against vast amounts of unstructured data in your Amazon S3 data lake. This gives you the freedom to store your data where you want, in the format you want, and have it available for analytics when you need it. Join a discussion with AWS solution architects to ask questions.
ABD330 – Combining Batch and Stream Processing to Get the Best of Both Worlds Today, many architects and developers are looking to build solutions that integrate batch and real-time data processing, and deliver the best of both approaches. Lambda architecture (not to be confused with the AWS Lambda service) is a design pattern that leverages both batch and real-time processing within a single solution to meet the latency, accuracy, and throughput requirements of big data use cases. Come join us for a discussion on how to implement Lambda architecture (batch, speed, and serving layers) and best practices for data processing, loading, and performance tuning.
ABD335 – Real-Time Anomaly Detection Using Amazon Kinesis Amazon Kinesis Analytics offers a built-in machine learning algorithm that you can use to easily detect anomalies in your VPC network traffic and improve security monitoring. Join us for an interactive discussion on how to stream your VPC Flow Logs to Amazon Kinesis Streams and identify anomalies using Kinesis Analytics.
ABD339 – Deep Dive and Best Practices for Amazon Athena Amazon Athena is an interactive query service that enables you to process data directly from Amazon S3 without the need for infrastructure. Since its launch at re:Invent 2016, several organizations have adopted Athena as the central tool to process all their data. In this talk, we dive deep into the most common use cases, including working with other AWS services. We review the best practices for creating tables and partitions and performance optimizations. We also dive into how Athena handles security, authorization, and authentication. Lastly, we hear from a customer who has reduced costs and improved time to market by deploying Athena across their organization.
We look forward to meeting you at re:Invent 2017!
About the Author
Roy Ben-Alta is a solution architect and principal business development manager at Amazon Web Services in New York. He focuses on Data Analytics and ML Technologies, working with AWS customers to build innovative data-driven products.
This post courtesy of Aaron Friedman, Healthcare and Life Sciences Partner Solutions Architect, AWS and Angel Pizarro, Genomics and Life Sciences Senior Solutions Architect, AWS
Precision medicine is tailored to individuals based on quantitative signatures, including genomics, lifestyle, and environment. It is often considered to be the driving force behind the next wave of human health. Through new initiatives and technologies such as population-scale genomics sequencing and IoT-backed wearables, researchers and clinicians in both commercial and public sectors are gaining new, previously inaccessible insights.
Many of these precision medicine initiatives are already happening on AWS. A few of these include:
PrecisionFDA – This initiative is led by the US Food and Drug Administration. The goal is to define the next-generation standard of care for genomics in precision medicine.
Deloitte ConvergeHEALTH – Gives healthcare and life sciences organizations the ability to analyze their disparate datasets on a singular real world evidence platform.
Central to many of these initiatives is genomics, which gives healthcare organizations the ability to establish a baseline for longitudinal studies. Due to its wide applicability in precision medicine initiatives—from rare disease diagnosis to improving outcomes of clinical trials—genomics data is growing at a faster rate than Moore’s law across the globe. Many expect these datasets to grow to be in the range of tens of exabytes by 2025.
Genomics data is also regularly re-analyzed by the community as researchers develop new computational methods or compare older data with newer genome references. These trends are driving innovations in data analysis methods and algorithms to address the massive increase of computational requirements.
Edico Genome, an AWS Partner Network (APN) Partner, has developed a novel solution that accelerates genomics analysis using field-programmable gate arrays, or FPGAs. Historically, Edico Genome deployed their FPGA appliances on-premises. When AWS announced the Amazon EC2 F1 FPGA-based instance family in December 2016, Edico Genome adopted a cloud-first strategy, became an F1 launch partner, and was one of the first partners to deploy FPGA-enabled applications on AWS.
On October 19, 2017, Edico Genome partnered with the Children’s Hospital of Philadelphia (CHOP) to demonstrate their FPGA-accelerated genomic pipeline software, called DRAGEN. It can significantly reduce time-to-insight for patient genomes, and analyzed 1,000 genomes from the Center for Applied Genomics Biobank in the shortest time possible. This set a Guinness World Record for the fastest analysis of 1,000 whole human genomes, and they did this using 1,000 EC2 f1.2xlarge instances in a single AWS region. Not only were they able to analyze genomes at high throughput, they did so averaging approximately $3 of AWS compute per whole human genome for the analysis.
The version of DRAGEN that Edico Genome used for this analysis was also the same one used in the precisionFDA Hidden Treasures – Warm Up challenge, where they were one of the top performers in every assessment.
In the remainder of this post, we walk through the architecture used by Edico Genome, combining EC2 F1 instances and AWS Batch to achieve this milestone.
EC2 F1 instances and Edico’s DRAGEN
EC2 F1 instances provide access to programmable hardware-acceleration using FPGAs at a cloud scale. AWS customers use F1 instances for a wide variety of applications, including big data, financial analytics and risk analysis, image and video processing, engineering simulations, AR/VR, and accelerated genomics. Edico Genome’s FPGA-backed DRAGEN Bio-IT Platform is now integrated with EC2 F1 instances. You can access the accuracy, speed, flexibility, and low compute cost of DRAGEN through a number of third-party platforms, AWS Marketplace, and Edico Genome’s own platform. The DRAGEN platform offers a scalable, accelerated, and cost-efficient secondary analysis solution for a wide variety of genomics applications. Edico Genome also provides a highly optimized mechanism for the efficient storage of genomic data.
Scaling DRAGEN on AWS
Edico Genome used 1,000 EC2 F1 instances to help their customer, the Children’s Hospital of Philadelphia (CHOP), process and analyze all 1,000 whole human genomes in parallel. They used AWS Batch to provision compute resources and orchestrate DRAGEN compute jobs across the 1,000 EC2 F1 instances. This solution successfully addressed the challenge of creating a scalable genomic processing pipeline that can easily scale to thousands of engines running in parallel.
Architecture
A simplified view of the architecture used for the analysis is shown in the following diagram:
DRAGEN’s portal uses Elastic Load Balancing and Auto Scaling groups to scale out EC2 instances that submit jobs to AWS Batch.
Job metadata is stored in their Workflow Management (WFM) database, built on top of Amazon Aurora.
The DRAGEN Workflow Manager API submits jobs to AWS Batch.
These jobs are executed in an AWS Batch managed compute environment that is responsible for launching the EC2 F1 instances.
These jobs run as Docker containers that have the requisite DRAGEN binaries for whole genome analysis.
As each job runs, it retrieves and stores genomics data that is staged in Amazon S3.
The steps listed previously can also be bucketed into the following higher-level layers:
Workflow: Edico Genome used their Workflow Management API to orchestrate the submission of AWS Batch jobs. Metadata for the jobs (such as the S3 locations of the genomes) resides in the Workflow Management Database backed by Amazon Aurora.
Batch execution: AWS Batch launches EC2 F1 instances and coordinates the execution of DRAGEN jobs on these compute resources. AWS Batch enabled Edico to quickly and easily scale up to the full number of instances they needed as jobs were submitted. They also scaled back down as each job was completed, to optimize for both cost and performance.
Compute/job: Edico Genome stored their binaries in a Docker container that AWS Batch deployed onto each of the F1 instances, giving each instance the ability to run DRAGEN without the need to pre-install the core executables. The AWS-based DRAGEN solution streams all genomics data from S3 for local computation and then writes the results to a destination bucket. They used an AWS Batch job role that specified the IAM permissions. The role ensured that DRAGEN only had access to the buckets or S3 key space it needed for the analysis. Jobs didn’t need to embed AWS credentials.
In the following sections, we dive deeper into several tasks that enabled Edico Genome’s scalable FPGA genome analysis on AWS:
Prepare your Amazon FPGA Image for AWS Batch
Create a Dockerfile and build your Docker image
Set up your AWS Batch FPGA compute environment
Prerequisites
In brief, you need a modern Linux distribution (kernel 3.10+), the Amazon ECS container agent, the awslogs log driver, and Docker configured on your image. There are additional recommendations in the Compute Resource AMI specification.
Preparing your Amazon FPGA Image for AWS Batch
You can use any Amazon Machine Image (AMI) or Amazon FPGA Image (AFI) with AWS Batch, provided that it meets the Compute Resource AMI specification. This gives you the ability to customize any workload by increasing the size of root or data volumes, adding instance stores, and connecting with the FPGA (F) and GPU (G and P) instance families.
Next, install the AWS CLI:
pip install awscli
Add any additional software required to interact with the FPGAs on the F1 instances.
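One way to do this is to install the runtime tools from the public aws-fpga GitHub repository (a minimal sketch, assuming git is available on your image):
# Install the AWS FPGA SDK, which provides the runtime tools for
# loading and managing Amazon FPGA Images (AFIs) on F1 instances
git clone https://github.com/aws/aws-fpga.git
cd aws-fpga
source sdk_setup.sh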
As a starting point, AWS publishes an FPGA Developer AMI in the AWS Marketplace. It is based on a CentOS Linux image and includes pre-integrated FPGA development tools. It also includes the runtime tools required to develop and use custom FPGAs for hardware acceleration applications.
For more information about how to set up custom AMIs for your AWS Batch managed compute environments, see Creating a Compute Resource AMI.
Building your Dockerfile
There are two common methods for connecting to AWS Batch to run FPGA-enabled algorithms. The first method, which is the route Edico Genome took, involves storing your binaries in the Docker container itself and running it on top of an F1 instance with Docker installed. The following example shows what a Dockerfile for building such a container might look like:
# DRAGEN_EXEC Docker image generator --
# Run this Dockerfile from a local directory that contains the latest release of
# - Dragen RPM and Linux DMA Driver available from Edico
# - Edico's Dragen WFMS Wrapper files
FROM centos:centos7
RUN rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
# Install Basic packages needed for Dragen
RUN yum -y install \
perl \
sos \
coreutils \
gdb \
time \
systemd-libs \
bzip2-libs \
R \
ca-certificates \
ipmitool \
smartmontools \
rsync
# Install the Dragen RPM
RUN mkdir -m777 -p /var/log/dragen /var/run/dragen
ADD . /root
RUN rpm -Uvh /root/edico_driver*.rpm || true
RUN rpm -Uvh /root/dragen-aws*.rpm || true
# Auto generate the Dragen license
RUN /opt/edico/bin/dragen_lic -i auto
#########################################################
# Now install the Edico WFMS "Wrapper" functions
# Add development tools needed for some util
RUN yum groupinstall -y "Development Tools"
# Install necessary standard packages
RUN yum -y install \
dstat \
git \
python-devel \
python-pip \
time \
tree && \
pip install --upgrade pip && \
easy_install requests && \
pip install psutil && \
pip install python-dateutil && \
pip install constants && \
easy_install boto3
# Setup Python path used by the wrapper
RUN mkdir -p /opt/workflow/python/bin
RUN ln -s /usr/bin/python /opt/workflow/python/bin/python2.7
RUN ln -s /usr/bin/python /opt/workflow/python/bin/python
# Install d_haul and dragen_job_execute wrapper functions and associated packages
RUN mkdir -p /root/wfms/trunk/scheduler/scheduler
COPY scheduler/d_haul /root/wfms/trunk/scheduler/
COPY scheduler/dragen_job_execute /root/wfms/trunk/scheduler/
COPY scheduler/scheduler/aws_utils.py /root/wfms/trunk/scheduler/scheduler/
COPY scheduler/scheduler/constants.py /root/wfms/trunk/scheduler/scheduler/
COPY scheduler/scheduler/job_utils.py /root/wfms/trunk/scheduler/scheduler/
COPY scheduler/scheduler/logger.py /root/wfms/trunk/scheduler/scheduler/
COPY scheduler/scheduler/scheduler_utils.py /root/wfms/trunk/scheduler/scheduler/
COPY scheduler/scheduler/webapi.py /root/wfms/trunk/scheduler/scheduler/
COPY scheduler/scheduler/wfms_exception.py /root/wfms/trunk/scheduler/scheduler/
RUN touch /root/wfms/trunk/scheduler/scheduler/__init__.py
# Landing directory should be where DJX is located
WORKDIR "/root/wfms/trunk/scheduler/"
# Debug print of container's directories
RUN tree /root/wfms/trunk/scheduler
# Default behaviour. Over-ride with --entrypoint on docker run cmd line
ENTRYPOINT ["/root/wfms/trunk/scheduler/dragen_job_execute"]
CMD []
Note: Edico Genome’s custom Python wrapper functions for its Workflow Management System (WFMS) in the latter part of this Dockerfile should be replaced with functions that are specific to your workflow.
The second method is to install binaries and then use Docker as a lightweight connector between AWS Batch and the AFI. For example, this might be a route you would choose to use if you were provisioning DRAGEN from the AWS Marketplace.
In this case, the Dockerfile would not contain the installation of the binaries to run DRAGEN, but would contain any other packages necessary for job completion. When you run your Docker container, you enable Docker to access the underlying file system.
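A minimal sketch of this pattern, assuming the DRAGEN binaries are installed on the host under /opt/edico (the image name and mount paths here are hypothetical):
# Run a thin connector container that reuses the host's DRAGEN install;
# --privileged gives the container access to the FPGA device files
docker run --rm --privileged \
  -v /opt/edico:/opt/edico \
  -v /var/lib/dragen:/var/lib/dragen \
  my-batch-connector:latest /opt/edico/bin/dragen --help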
Connecting to AWS Batch
AWS Batch provisions compute resources and runs your jobs, choosing the right instance types based on your job requirements and scaling down resources as work is completed. AWS Batch users submit a job, based on a template or “job definition” to an AWS Batch job queue.
Job queues are mapped to one or more compute environments that describe the quantity and types of resources that AWS Batch can provision. In this case, Edico created a managed compute environment that was able to launch 1,000 EC2 F1 instances across multiple Availability Zones in us-east-1. As jobs are submitted to a job queue, the service launches the required quantity and types of instances that are needed. As instances become available, AWS Batch then runs each job within appropriately sized Docker containers.
The Edico Genome workflow manager API submits jobs to an AWS Batch job queue. This job queue maps to an AWS Batch managed compute environment containing On-Demand F1 instances. In this section, you can set this up yourself.
To create a managed compute environment and job queue that DRAGEN can use, you can run commands along the lines of the following sketch (the environment and queue names, subnet, security group, and roles are placeholders):
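# Create a managed compute environment that can launch F1 instances
aws batch create-compute-environment \
  --compute-environment-name dragen-f1 \
  --type MANAGED \
  --state ENABLED \
  --service-role arn:aws:iam::123456789012:role/AWSBatchServiceRole \
  --compute-resources 'type=EC2,minvCpus=0,maxvCpus=8000,desiredvCpus=0,instanceTypes=f1.2xlarge,subnets=subnet-aaaa1111,securityGroupIds=sg-bbbb2222,instanceRole=ecsInstanceRole'
# Create a job queue that maps to the compute environment
aws batch create-job-queue \
  --job-queue-name dragen-queue \
  --state ENABLED \
  --priority 1 \
  --compute-environment-order order=1,computeEnvironment=dragen-f1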
An f1.2xlarge EC2 instance contains one FPGA, eight vCPUs, and 122 GiB of RAM. As DRAGEN requires an entire FPGA to run, Edico Genome needed to ensure that only one analysis executed on an instance at a time. By using the f1.2xlarge vCPUs and memory as a proxy in their AWS Batch job definition, Edico Genome could ensure that only one job runs on an instance at a time. Here’s roughly what that might look like in the AWS CLI (the job definition name and container image are placeholders):
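# Register a job definition sized to consume all of an f1.2xlarge
# (8 vCPUs, ~120 GB memory), so AWS Batch places one DRAGEN job per instance
aws batch register-job-definition \
  --job-definition-name dragen-wgs \
  --type container \
  --container-properties '{"image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/dragen:latest", "vcpus": 8, "memory": 120000, "privileged": true}'
# Submit a job against the queue created earlier; this returns the job ID
aws batch submit-job \
  --job-name genome-0001 \
  --job-queue dragen-queue \
  --job-definition dragen-wgs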
You can query the status of your DRAGEN job with the following command:
aws batch describe-jobs --jobs <the job ID from the above command>
The logs for your job are written to the /aws/batch/job CloudWatch log group.
Conclusion
In this post, we demonstrated how to set up an environment with AWS Batch that can run DRAGEN on EC2 F1 instances at scale. If you followed the walkthrough, you’ve replicated much of the architecture Edico Genome used to set the Guinness World Record.
There are several ways in which you can harness the computational power of DRAGEN to analyze genomes at scale. First, DRAGEN is available through several different genomics platforms, such as the DNAnexus Platform. DRAGEN is also available on the AWS Marketplace. You can apply the architecture presented in this post to build a scalable solution that is both performant and cost-optimized.
For more information about how AWS Batch can facilitate genomics processing at scale, be sure to check out our aws-batch-genomics GitHub repo on high-throughput genomics on AWS.
A robust analytics platform is critical for an organization’s success. However, an analytical system is only as good as the data that is used to power it. Wrong or incomplete data can derail analytical projects completely. Moreover, with varied data types and sources being added to the analytical platform, it’s important that the platform is able to keep up. This means the system has to be flexible enough to adapt to changing data types and volumes.
For a platform based on relational databases, this would mean countless data model updates, code changes, and regression testing, all of which can take a very long time. For a decision support system, it’s imperative to provide the correct answers quickly. Architects must design analytical platforms with the following criteria:
Help ingest data easily from multiple sources
Analyze it for completeness and accuracy
Use it for metrics computations
Store the data assets and scale as they grow rapidly without causing disruptions
Adapt to changes as they happen
Have a relatively short development cycle that is repeatable and easy to implement
Schema-on-read is a different approach to storing and querying datasets. It reverses the order of operations compared to schema-on-write: the data is stored as-is, and you apply a schema at the time that you read it. This approach has changed the time to value for analytical projects, because you can load data immediately and query it for exploratory activities.
In this post, I show how to build a schema-on-read analytical pipeline, similar to the one used with relational databases, using Amazon Athena. The approach is completely serverless, which allows the analytical platform to scale as more data is stored and processed via the pipeline.
The data integration challenge
Part of the reason it’s so difficult to build analytical platforms is the challenge of data integration. This is usually done as a series of Extract Transform Load (ETL) jobs that pull data from multiple varied sources and integrate them into a central repository. The multiple steps of ETL can be viewed as a pipeline where raw data is fed in one end and integrated data accumulates at the other.
Two major hurdles complicate the development of ETL pipelines:
The sheer volume of data needing to be integrated makes it very difficult to scale these jobs.
There is an immense variety of data formats from which information needs to be gathered.
Put this in context by looking at the healthcare industry. Building an analytical pipeline for healthcare usually involves working with multiple datasets that cover care quality, provider operations, patient clinical records, and patient claims data. Customers are now looking for even more data sources that may contain valuable information that can be analyzed. Examples include social media data, data from devices and sensors, and biometric and human-generated data such as notes from doctors. These new data sources do not rely on fixed schemas, so integrating them by conventional ETL jobs is much more difficult.
The volume of data available for analysis is growing as well. According to an article published by NCBI, the US healthcare system reached a collective data volume of 150 exabytes in 2011. With the current rate of growth, it is expected to quickly reach zettabyte scale, and soon grow into the yottabytes.
Why schema-on-read?
The traditional data warehouses of the early 2000s, like the one shown in the following diagram, were based on a standard four-layer approach of ingest, stage, store, and report. This involved building and maintaining huge data models that suited certain sources and analytical queries. This type of design is based on a schema-on-write approach, where you write data into a predefined schema and read data by querying it.
As we progressed to more varied datasets and analytical requirements, the predefined schemas were not able to keep up. Moreover, by the time the models were built and put into production, the requirements had changed.
Schema-on-read provides much needed flexibility to an analytical project. Not having to rely on physical schemas improves the overall performance when it comes to high volume data loads. Because there are no physical constraints for the data, it works well with datasets with undefined data structures. It gives customers the power to experiment and try out different options.
How can Athena help?
Athena is a serverless analytical query engine that allows you to start querying data stored in Amazon S3 instantly. It supports standard formats like CSV and Parquet and integrates with Amazon QuickSight, which allows you to build interactive business intelligence reports.
The Amazon Athena User Guide provides multiple best practices that should be considered when using Athena for analytical pipelines at production scale. There are various performance tuning techniques that you can apply to speed up query performance. For more information, see Top 10 Performance Tuning Tips for Amazon Athena.
Solution architecture
To demonstrate this solution, I use the healthcare dataset from the Centers for Disease Control (CDC) Behavioral Risk Factor Surveillance system (BRFSS). It gathers data via telephone surveys for health-related risk behaviors and conditions, and the use of preventive services. The dataset is available as zip files from the CDC FTP portal for general download and analysis. There is also a user guide with comprehensive details about the program and the process of collecting data.
The following diagram shows the schema-on-read pipeline that demonstrates this solution.
In this architecture, S3 is the central data repository for the CSV files, which are divided by behavioral risk factors like smoking, drinking, obesity, and high blood pressure.
Athena is used to collectively query the CSV files in S3. It uses the AWS Glue Data Catalog to maintain the schema details and applies it at the time of querying the data.
The dataset is further filtered and transformed into a subset that is specifically used for reporting with Amazon QuickSight.
As new data files become available, they are incrementally added to the S3 bucket and the subset query automatically appends them to the end of the table. The dashboards in Amazon QuickSight are refreshed with the new values of calculated metrics.
Walkthrough
To use Athena in an analytical pipeline, you have to consider how to design it for initial data ingestion and the subsequent incremental ingestions. Moreover, you also have to decide on the mechanism to trigger a particular stage in the pipeline, which can either be scheduled or event-based.
For the purposes of this post, you build a simple pipeline where the data is:
Ingested into a staging area in S3
Filtered and transformed for reporting into a different S3 location
Transformed into a reporting table in Athena
Included in a dashboard created using Amazon QuickSight
Incrementally ingested for updates
This approach is highly customizable and can be used for building more complex pipelines with many more steps.
Data staging in S3
The data ingestion into S3 is fairly straightforward. The files can be ingested via the S3 command line interface (CLI), API, or the AWS Management Console. The data files are in CSV format and already divided by behavioral condition. To improve performance, I recommend that you use partitioned data with Athena, especially when dealing with large volumes. You can use pre-partitioned data in S3 or build partitions later in the process. The example you are working with has a total of 247 CSV files storing about 205 MB of data across them, but a typical production-scale deployment would be much larger.
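For example, staging the files from a local directory is a one-line copy with the AWS CLI (the bucket and prefix names here are hypothetical):
# Stage the raw BRFSS CSV files in S3
aws s3 cp ./brfss_csv/ s3://your-analytics-bucket/staging/brfss/ --recursive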
To automate the pipeline, you can make it either event-based or schedule-based. If you take the event-based approach, you can use S3 event notifications to invoke an AWS Lambda function when files are uploaded; the function then runs the next step in the pipeline. Traditional ETL jobs have to rely on mechanisms like database triggers to enable this, which can cause additional performance overhead.
If you choose to go with a schedule-based approach, you can use Lambda with scheduled events. The schedule is managed via a cron expression and the Lambda function runs the next step of the pipeline. This is suitable for workloads similar to a scheduled batch ETL job.
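As a sketch of the schedule-based approach (the rule name, cron expression, and function ARN are hypothetical):
# Invoke the pipeline's next step every day at 02:00 UTC
aws events put-rule \
  --name nightly-etl-step \
  --schedule-expression "cron(0 2 * * ? *)"
aws events put-targets \
  --rule nightly-etl-step \
  --targets 'Id=1,Arn=arn:aws:lambda:us-east-1:123456789012:function:run-next-step'
# The Lambda function also needs a resource policy that allows
# events.amazonaws.com to invoke it (see aws lambda add-permission)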
Filter and data transformation
To filter and transform the dataset, first look at the overall counts and structure of the data. This allows you to choose the columns that are important for a given report and use the filter clause to extract a subset of the data. Transforming the data as you progress through the pipeline ensures that you are only exposing relevant data to the reporting layer, to optimize performance.
To look at the entire dataset, create a table in Athena that spans the entire data volume. This can be done using the following query:
CREATE EXTERNAL TABLE IF NOT EXISTS brfsdata(
ID STRING,
HIW STRING,
SUSA_NAME STRING,
MATCH_NAME STRING,
CHSI_NAME STRING,
NSUM STRING,
MEAN STRING,
FLAG STRING,
IND STRING,
UP_CI STRING,
LOW_CI STRING,
SEMEAN STRING,
AGE_ADJ STRING,
DATASRC STRING,
FIPS STRING,
FIPNAME STRING,
HRR STRING,
DATA_YR STRING,
UNIT STRING,
AGEGRP STRING,
GENDER STRING,
RACE STRING,
EHN STRING,
EDU STRING,
FAMINC STRING,
DISAB STRING,
METRO STRING,
SEXUAL STRING,
FAMSTRC STRING,
MARITAL STRING,
POP_SPC STRING,
POP_POLICY STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION "s3://<YourBucket/YourPrefix>"
Replace YourBucket and YourPrefix with your corresponding values.
In this case, there are a total of ~1.4 million records, which you get by running a simple COUNT(*) query on the table.
From here, you can run multiple analysis queries on the dataset. For example, you can find the number of records that fall into a certain behavioral risk, or the state that has the highest number of diabetic patients recorded. These metrics provide data points that help to determine the attributes that would be needed from a reporting perspective.
As you can see, this is much simpler than a schema-on-write approach, where analyzing the source dataset is much more difficult. This solution allows you to design the reporting platform in accordance with the questions you are looking to answer from your data. Source data analysis is the first step in designing a good analytical platform, and this approach allows customers to do it much earlier in the project lifecycle.
Reporting table
After you have completed the source data analysis, the next step is to filter out the required data and transform it to create a reporting database. This is analogous to the data mart in a standard analytical pipeline. Based on the analysis carried out in the previous step, you might notice some mismatches with the data headers. You might also identify the filter clauses to apply to the dataset to get to your reporting data.
Athena automatically saves query results in S3 for every run. The default bucket for this is typically named in the following format: aws-athena-query-results-<AWS account ID>-<region>.
Athena creates a prefix for each saved query and stores the result set as CSV files organized by date. You can use this feature to filter out result datasets and store them in an S3 bucket for reporting.
To enable this, create queries that can filter out and transform the subset of data on which to report. For this use case, create three separate queries to filter out unwanted data and fix the column headers:
Query 1:
SELECT ID, up_ci AS source, semean AS state, datasrc AS year,
       fips AS unit, fipname AS age, mean,
       current_date AS dt, current_time AS tm
FROM brfsdata
WHERE ID != '' AND hrr IS NULL AND semean NOT LIKE '%29193%'
Query 2:
SELECT ID, up_ci AS source, semean AS state, datasrc AS year,
       fips AS unit, fipname AS age, mean,
       current_date AS dt, current_time AS tm
FROM brfsdata
WHERE ID != '' AND hrr IS NOT NULL AND up_ci LIKE '%BRFSS%'
  AND semean NOT LIKE '"%' AND semean NOT LIKE '%29193%'
Query 3:
SELECT ID, low_ci AS source, age_adj AS state, fips AS year,
       fipname AS unit, hrr AS age, mean,
       current_date AS dt, current_time AS tm
FROM brfsdata
WHERE ID != '' AND hrr IS NOT NULL AND up_ci NOT LIKE '%BRFSS%'
  AND age_adj NOT LIKE '"%' AND semean NOT LIKE '%29193%' AND low_ci LIKE 'BRFSS'
You can save these queries in Athena so that you can get to the query results easily every time they are executed. The following screenshot is an example of the results when query 1 is executed four times.
The next step is to copy these results over to a new bucket for creating your reporting table. This can be done by running an S3 cp command from the CLI or API, similar to the sketch below (the account ID in the source bucket name is a placeholder):
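# Copy only the CSV result files for the saved query "Query1"
# into the prefix that backs the reporting table
aws s3 cp \
  s3://aws-athena-query-results-123456789012-us-east-1/Query1/ \
  s3://healthdataset/brfsdata/Reporting_Data/ \
  --recursive --exclude "*" --include "*.csv"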
Note the prefix structure in which Athena stores query results. It creates a separate prefix for each day on which the query is executed, and stores the corresponding CSV and metadata file for each run. Copy the result set over to a new prefix, Reporting_Data. Use the --exclude and --include options of s3 cp to copy only the CSV files, and --recursive to copy all the files from the run.
You can replace the value of the saved query name from “Query1” to “Query2” or “Query3” to copy all data resulting from those queries to the same target prefix. For pipelines that require more complicated transformations, divide the query transformation into multiple steps and execute them based on events or schedule them, as described in the earlier data staging step.
Amazon QuickSight dashboard
After the filtered results sets are copied as CSV files into the new Reporting_Data prefix, create a new table in Athena that is used specifically for BI reporting. This can be done using a create table statement similar to the one below:
CREATE EXTERNAL TABLE IF NOT EXISTS BRFSS_REPORTING(
ID varchar(100),
source varchar(100),
state varchar(100),
year int,
unit varchar(10),
age varchar(10),
mean float
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION "s3://healthdataset/brfsdata/Reporting_Data"
This table can now act as a source for a dashboard on Amazon QuickSight, which is a straightforward process to enable. When you choose a new data source, Athena shows up as an option among the data sources that Amazon QuickSight supports, and Amazon QuickSight automatically detects the tables in Athena that are exposed for querying.
After choosing Athena, give a name to the data source and choose the database. The tables available for querying automatically show up in the list.
If you choose “BRFSS_REPORTING”, you can create custom metrics using the columns in the reporting table, which can then be used in reports and dashboards.
To build a complete pipeline, think about ingesting data incrementally for new records as they become available. To enable this, make sure that the data can be incrementally ingested into the reporting schema and that the reporting metrics are refreshed on each run of the report. To demonstrate this, look at a scenario where a new dataset is ingested into S3 on a periodic basis and has to be included in the reporting schema when calculating the metrics.
Look at the number of records in the reporting table before the incremental ingestion.
SELECT count(*) FROM brfss_reporting;
The results are as follows:
_col0
713123
Use Query1 as an example transformation query that can isolate the incremental load. Here is a view of the query result bucket before Query1 runs.
After the incremental data is ingested in S3, trigger an event (or pre-schedule) an execution of Query1 in Athena, which produces a new CSV result set and its metadata file. A sketch of triggering that execution from the CLI, assuming Query1 is saved locally as query1.sql (the account ID in the output location is a placeholder):
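# Re-run the transformation query; results land in the Athena result bucket
aws athena start-query-execution \
  --query-string file://query1.sql \
  --query-execution-context Database=default \
  --result-configuration OutputLocation=s3://aws-athena-query-results-123456789012-us-east-1/Query1/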
Next, trigger (or schedule) the copy command to copy the incremental records file into the reporting prefix. This is easily automated by using the predefined structure in which Athena saves the query results on S3. On checking the records count in the reporting table after copying, you get an increased count.
_col0
810042
This shows that 96,919 records were added to our reporting table that can be used in metric calculations.
This process can be implemented programmatically to add incremental records into the reporting table every time new records are ingested into the staging area. As a result, you can simulate an end-to-end analytical pipeline that runs based on events or is scheduled to run as a batch workload.
Conclusion
Using Athena, you can benefit from the advantages of a schema-on-read analytical application. You can combine other AWS services like Lambda and Amazon QuickSight to build an end-to-end analytical pipeline.
However, it’s important to note that a schema-on-read analytical pipeline may not be the answer for all use cases. Carefully consider the choices between schema-on-read and schema-on-write. The source systems you are working with play a critical role in making that decision. For example:
Some systems have defined data structures, and the chance of variation is very low. These systems can work with a fixed relational target schema.
If the target queries mostly involve joining across normalized tables, they work better with a relational database. For this use case, schema-on-write is a good choice.
Query performance in a schema-on-write application is faster because the data is pre-structured. For fixed dashboards with little to no changes, schema-on-write is a good choice.
You can also choose to go hybrid. Offload a part of the pipeline that deals with unstructured flat datasets into a schema-on-read architecture, and integrate the output into the main schema-on-write pipeline at a later stage. AWS provides multiple options to build analytical pipelines that suit various use cases. For more information, read about the AWS big data and analytics services.
If you have questions or suggestions, please comment below.
Ujjwal Ratan is a healthcare and life sciences Solutions Architect at AWS. He has worked with organizations ranging from large enterprises to smaller startups on problems related to distributed computing, analytics and machine learning. In his free time, he enjoys listening to (and playing) music and taking unplanned road trips with his family.
Deriving insights from large datasets is central to nearly every industry, and life sciences is no exception. To combat the rising cost of bringing drugs to market, pharmaceutical companies are looking for ways to optimize their drug development processes. They are turning to big data analytics to better quantify the effect that their drug compounds have on different populations and to look for new clinical indications for existing drugs.
A real world evidence (RWE) platform is loosely defined as an integrated set of services and products that life sciences companies use to securely acquire, store, and analyze large, often disparate datasets to gain insight into the functions of a specific drug or intervention. The petabyte-scale data generated from wearables, medical devices, genomics, clinical imaging, and claims (to name a few) allows pharmaceutical and other life sciences companies to build big data platforms to analyze these datasets. And many of them are doing it on AWS.
At the center of nearly every RWE platform is a data lake that houses different data types. It also stores the related metadata to identify where each piece of data came from, who owned it, etc. Analytics engines integrate the relevant streaming (e.g., wearables), structured (e.g., claims data), and unstructured (e.g., notes in electronic health records) data.
In this post, I highlight common architectural patterns that customers are using to maximize the value of real world evidence on AWS. The architecture presented here can be reproduced in multiple regions, so you can respect local data sovereignty requirements, when applicable, while conducting global studies. This post doesn’t cover all considerations for real world evidence (such as security and authentication), but instead focuses on the areas that are related to your data flow.
A data lake is your source of truth
The ability to integrate disparate data types is critical to maximizing the utility of RWE. Life sciences companies have to be able to store, search, and retrieve data of different types and sizes, including (but not limited to) the following:
Streaming data from wearables and medical devices
Structured data from genomics and claims data
Unstructured data from notes in electronic health records
A common solution to integrate these data types is a data lake. Data lakes allow organizations to store all their data, regardless of data type, in a centralized repository. Because data can be stored as-is, there’s no need to convert it to a predefined schema. And you no longer need to know what questions you want to ask of your data beforehand. You can use data lakes for ad hoc analyses, so you can quickly explore and discover new insights without needing to structure the data first, as you would with a traditional data warehouse.
Although there are many uses for storing real world evidence data in a data lake, here are a few examples of the processes it can facilitate for pharma and biotech companies:
Quickly access all data for a given subject, from clinical images to genomics to claims data.
Associate incoming RWE data with existing data in your RWE data lake, such as by subject or study.
Select specific windows in longitudinal studies (microbiome, metabolomics, etc.).
The following diagram shows an architecture of the features within a data lake and several examples of how data enters a data lake:
This architecture, which is entirely serverless and backed by Amazon S3, lets you scale your data lake to easily accommodate any data size for your RWE platform. Additionally, the components presented in this diagram can be secured via IAM controls and service policies. This enables you to secure and protect the sensitive information that often resides within an RWE platform. Amazon S3 is the canonical source for objects in your RWE data lake. You track and search metadata that is associated with these objects in a data catalog built on Amazon Elasticsearch Service and Amazon DynamoDB.
Let’s dive deeper into what this architecture does and how data makes it into the data lake:
Streaming (e.g., wearables), structured, and unstructured data is acquired from myriad devices and sources. Depending on the data size, you might use AWS IoT (streaming), AWS Storage Gateway (mid-size/continual batch), and AWS Snowball (large legacy datasets, such as imaging). AWS IoT writes to Amazon Kinesis Firehose, which transforms the telemetry data in-flight to land both transformed and raw data in Amazon S3.
When data lands in Amazon S3 buckets, an AWS Lambda function is invoked (either by trigger or manually).
This Lambda function writes to a data catalog that is fronted by Amazon API Gateway. The data catalog contains metadata about all the object data in Amazon S3, as well as data that resides in databases, such as Amazon Redshift.
An AWS Lambda function on the other side of API Gateway writes the appropriate metadata about the objects, such as the study that the data was generated from, into Amazon Elasticsearch Service and/or Amazon DynamoDB, which I refer to as the data catalog.
The data catalog mentioned in steps 3 and 4 is central to your data management. In most analyses, the first step is to build a data manifest of where the data-of-interest lies, which is discussed in later sections.
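As a concrete sketch, registering a newly landed object in the DynamoDB side of the catalog might look like the following from the AWS CLI (the table name, keys, and attribute names here are all hypothetical):
# Record where the object lives and which study it belongs to
aws dynamodb put-item \
  --table-name rwe-data-catalog \
  --item '{"objectKey": {"S": "study-123/subject-456/scan.dcm"}, "study": {"S": "study-123"}, "dataType": {"S": "imaging"}}'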
Normalizing data for real world evidence
As I previously mentioned, data can enter your real world evidence data lake in many different formats. In many cases, you might need to normalize this data into a specific format to ease downstream analysis. For example, you might have to integrate many different electronic health records that are stored in different formats. You also might have to transform genomics data, often represented as a variant call format (VCF) file, into a columnar format such as Apache Parquet that’s easier for big data technologies to query. While you can certainly analyze this data in the format in which it entered, it might be better for querying after it’s transformed into a different format, or into a different data store.
In the architecture shown in the following diagram, AWS Step Functions orchestrates the data normalization (extract, transform, load—or ETL) process. Step Functions is a serverless workflow service that ensures that your long-running ETL jobs execute in order and complete successfully. At a high level, you first query the data catalog to get a manifest of the data to be normalized. Normalization occurs on Amazon EC2 instances, and the results are then stored back in Amazon S3 or in databases such as Amazon Redshift for future analysis. Locations of these results data are also logged in the data catalog for future querying.
Here is an example Step Functions state machine for this process:
The accompanying Amazon States Language JSON takes roughly the following shape (a minimal sketch; the Lambda function ARNs are placeholders):
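{
  "Comment": "Sketch of the data normalization workflow; ARNs are placeholders",
  "StartAt": "BuildManifest",
  "States": {
    "BuildManifest": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:build-manifest",
      "Next": "SubmitBatchJobs"
    },
    "SubmitBatchJobs": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:submit-batch-jobs",
      "Next": "LogResults"
    },
    "LogResults": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:log-results",
      "End": true
    }
  }
}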
Either a user or an application submits a POST request through Amazon API Gateway to process or transform a set of data. This request contains the query parameters that correspond to generating the data manifest. It is passed into an AWS Step Functions state machine, which orchestrates the ETL process.
A Lambda function queries the data catalog and builds a manifest that contains the location of the data of interest.
The manifest is passed to downstream Lambda functions in the state machine that orchestrate the batch workflow to process the data that’s specified in the manifest.
These Lambda functions submit jobs to AWS Batch to execute batch jobs on Amazon EC2.
Amazon EC2 processes the data (e.g., data normalization) and stores results back in Amazon S3 and/or Amazon Redshift.
Results data is then logged in the data catalog, using the process shown in the data acquisition diagram.
Analyzing and visualizing data in real world evidence
Ultimately, the value in real world evidence platforms is the value you derive from the wide variety of data that feeds into it. Big data analytics, such as Spark on EMR, machine learning with the Deep Learning AMIs, and healthcare data warehouses, are all fundamental to maximizing the value of RWE.
Given the wide array of questions you can answer, you want your architecture to be as flexible as possible. Your organization also might use business intelligence (BI) tools. Again, as with the data normalization tier, you can use Step Functions to orchestrate your workflow. You first build the data manifest and then submit each portion of your desired analysis to the relevant data location and compute options, such as Amazon EMR or Amazon Redshift. Your organization’s BI tool of choice, such as Amazon QuickSight or one offered by our AWS Big Data Competency partners, can extract and visualize the results.
Here is the workflow in more detail:
A user initiates a query on a dataset. In the context of RWE, this is largely analyzing a cohort in the context of a specific indication (drug response, etc.).
AWS Step Functions invokes a Lambda function to query the data catalog and build a manifest of data.
The manifest is passed to subsequent Lambda functions, which orchestrate the data analysis through different AWS services.
These analyses can include Amazon EMR for population-scale genomics, Amazon EC2 for HPC and machine learning, and Amazon Redshift for your healthcare data warehouse.
Results from each analysis are staged back in Amazon S3 and logged in the data catalog.
BI tools like Amazon QuickSight or Tableau, or data analysis workbooks like Jupyter can query Amazon S3 to visualize results, such as through Amazon Athena.
A practical example
Imagine that you are at a pharmaceutical company that is developing a drug to address a neurological condition. You have longitudinal data in the form of brain scans and have noticed that the drug compound under development seems to benefit a specific subpopulation of individuals. As a result, you determine to sequence the genomes of all the responders and non-responders to look for specific biomarkers that could indicate response levels. Success would result in a companion diagnostic that you could use alongside your drug to increase its overall efficacy by delivering it only to the responding population.
Let’s briefly look at how this study might play out in the steps of data acquisition, normalization, and analysis described earlier.
Data acquisition
Data from longitudinal brain imaging studies can be moved to AWS using AWS Snowball. Genomics data coming off your genome sequencers would land in Amazon S3 via AWS Storage Gateway and/or AWS Direct Connect. These datasets are stored in your data catalog where you track items such as date generated, anonymized subject ID, etc.
Data processing
Genomes are transformed from raw reads (e.g., FASTQ) to a human readable format (VCF) that identifies variations in a genome. These VCF files are subsequently extracted, transformed, and loaded into a format that is amenable to big data analytics, such as Parquet.
Data analysis
You build a data manifest of your brain scans in Amazon S3. You use the deep learning AMI and P2 instance family to build machine learning models to identify images that represent different stages in your disorder-of-interest. You then run association analyses that combine your genomics data with your brain imaging models and drug response data to identify what genomic variants associate with improved treatment outcomes. You can manage these analyses via Jupyter notebooks, or you can connect with your BI tool of choice to visualize your results.
It will always be day one for RWE
By implementing real world evidence platforms on AWS, you can quickly integrate and interrogate disparate healthcare data to advance human health. By designing your RWE platform to be flexible, you can readily incorporate new big data technologies as they become relevant to your needs. This enables you to innovate and discover at a quicker rate.
The healthcare ecosystem has chosen a variety of tools and techniques for working with big data, but one tool that comes up again and again in many of the architectures we design and review is Spark on Amazon EMR. Will Spark power the data behind precision medicine?
About the Authors
Dr. Aaron Friedman is a Healthcare and Life Sciences Partner Solutions Architect at Amazon Web Services. He works with ISVs and SIs to architect healthcare solutions on AWS, and bring the best possible experience to their customers. His passion is working at the intersection of science, big data, and software. In his spare time, he’s exploring the outdoors, learning a new thing to cook, or spending time with his wife and his dog, Macaroon.
AWS Training allows you to learn from the experts so that you can advance your knowledge with practical skills and get more out of the AWS Cloud. We are adding one of our most popular event boot camps, Building a Serverless Data Lake, to our permanent instructor-led training portfolio.
This one-day course is designed to teach you how to design, build, and operate a serverless data lake solution with AWS services. We cover topics such as ingesting data from any data source at large scale, storing the data securely and durably, enabling the capability to use the right tool to process large volumes of data, and understanding the options available for analyzing the data in near-real time.
This course is intended for solution architects, big data developers, data architects and analysts, and other hands-on data analysis practitioners.
There are two opposing models of how the Internet has changed protest movements. The first is that the Internet has made protesters mightier than ever. This comes from the successful revolutions in Tunisia (2010-11), Egypt (2011), and Ukraine (2013). The second is that it has made them more ineffectual. Derided as “slacktivism” or “clicktivism,” the ease of action without commitment can result in movements like Occupy petering out in the US without any obvious effects. Of course, the reality is more nuanced, and Zeynep Tufekci teases that out in her new book Twitter and Tear Gas.
Tufekci is a rare interdisciplinary figure. As a sociologist, programmer, and ethnographer, she studies how technology shapes society and drives social change. She has a dual appointment in both the School of Information Science and the Department of Sociology at University of North Carolina at Chapel Hill, and is a Faculty Associate at the Berkman Klein Center for Internet and Society at Harvard University. Her regular New York Times column on the social impacts of technology is a must-read.
Modern Internet-fueled protest movements are the subjects of Twitter and Tear Gas. As an observer, writer, and participant, Tufekci examines how modern protest movements have been changed by the Internet — and what that means for protests going forward. Her book combines her own ethnographic research and her usual deft analysis, with the research of others and some big data analysis from social media outlets. The result is a book that is both insightful and entertaining, and whose lessons are much broader than the book’s central topic.
“The Power and Fragility of Networked Protest” is the book’s subtitle. The power of the Internet as a tool for protest is obvious: it gives people newfound abilities to quickly organize and scale. But, according to Tufekci, it’s a mistake to judge modern protests using the same criteria we used to judge pre-Internet protests. The 1963 March on Washington might have culminated in hundreds of thousands of people listening to Martin Luther King Jr. deliver his “I Have a Dream” speech, but it was the culmination of a multi-year protest effort and the result of six months of careful planning made possible by that sustained effort. The 2011 protests in Cairo came together in mere days because they could be loosely coordinated on Facebook and Twitter.
That's the power. Tufekci describes the fragility by analogy. Nepalese Sherpas assist Mt. Everest climbers by carrying supplies, laying out ropes and ladders, and so on. This means that people with limited training and experience can make the ascent, though it is no less dangerous — sometimes with disastrous results. Says Tufekci: “The Internet similarly allows networked movements to grow dramatically and rapidly, but without prior building of formal or informal organizational and other collective capacities that could prepare them for the inevitable challenges they will face and give them the ability to respond to what comes next.” That makes them less able to respond to government counters, change their tactics — a phenomenon Tufekci calls “tactical freeze” — make movement-wide decisions, and survive over the long haul.
Tufekci isn’t arguing that modern protests are necessarily less effective, but that they’re different. Effective movements need to understand these differences, and leverage these new advantages while minimizing the disadvantages.
To that end, she develops a taxonomy for talking about social movements. Protests are an example of a “signal” that corresponds to one of several underlying “capacities.” There's narrative capacity: the ability to change the conversation, as Black Lives Matter did with police violence and Occupy did with wealth inequality. There's disruptive capacity: the ability to stop business as usual. An early Internet example is the 1999 WTO protests in Seattle. And finally, there's electoral or institutional capacity: the ability to vote, lobby, fundraise, and so on. Because of various “affordances” of modern Internet technologies, particularly social media, the same signal — a protest of a given size — reflects different underlying capacities.
This taxonomy also informs government reactions to protest movements. Smart responses target attention as a resource. The Chinese government responded to the 2014 protesters in Hong Kong by not engaging with them at all, denying them camera-phone videos that would go viral and attract the world's attention. Instead, they pulled their police back and waited for the movement to die from lack of attention.
If this all sounds dry and academic, it's not. Twitter and Tear Gas is infused with a richness of detail stemming from her personal participation in the 2013 Gezi Park protests in Turkey, as well as personal on-the-ground interviews with protesters throughout the Middle East — particularly Egypt and her native Turkey — Zapatistas in Mexico, WTO protesters in Seattle, Occupy participants worldwide, and others. Tufekci writes with a warmth and respect for the humans who are part of these powerful social movements, gently intertwining her own story with the stories of others, big data, and theory. She is adept at writing for a general audience, and despite being published by the intimidating Yale University Press, her book is more mass-market than academic. What rigor there is comes across in a way that carries readers along rather than distracting.
The synthesist in me wishes Tufekci would take some additional steps, taking the trends she describes outside the narrow world of political protest and applying them more broadly to social change. Her taxonomy is an important contribution to the more general discussion of how the Internet affects society. Furthermore, her insights on the networked public sphere have applications for understanding technology-driven social change in general. These are hard conversations for society to have. We largely prefer to allow technology to blindly steer society or — in some ways worse — leave it to unfettered for-profit corporations. When you're reading Twitter and Tear Gas, keep current and near-term future technological issues such as ubiquitous surveillance, algorithmic discrimination, and automation and employment in mind. You'll come away with new insights.
Tufekci twice quotes historian Melvin Kranzberg from 1985: “Technology is neither good nor bad; nor is it neutral.” This foreshadows her central message. For better or worse, the technologies that power the networked public sphere have changed the nature of political protest as well as government reactions to and suppressions of such protest.
I have long characterized our technological future as a battle between the quick and the strong. The quick — dissidents, hackers, criminals, marginalized groups — are the first to make use of a new technology to magnify their power. The strong are slower, but have more raw power to magnify. So while protesters are the first to use Facebook to organize, the governments eventually figure out how to use Facebook to track protesters. It’s still an open question who will gain the upper hand in the long term, but Tufekci’s book helps us understand the dynamics at work.
One of the great benefits of Amazon S3 is the ability to host, share, or consume public data sets. This provides transparency into data to which an external data scientist or developer might not normally have access. By exposing the data to the public, you can glean many insights that would have been difficult with a data silo.
The openFDA project creates easy access to the high value, high priority, and public access data of the Food and Drug Administration (FDA). The data has been formatted and documented in consumer-friendly standards. Critical data related to drugs, devices, and food has been harmonized and can easily be called by application developers and researchers via API calls. OpenFDA has published two whitepapers that drill into the technical underpinnings of the API infrastructure as well as how to properly analyze the data in R. In addition, FDA makes openFDA data available on S3 in raw format.
In this post, I show how to use S3, Amazon EMR, and Amazon Athena to analyze the drug adverse events dataset. A drug adverse event is an undesirable experience associated with the use of a drug, including serious drug side effects, product use errors, product quality problems, and therapeutic failures.
Data considerations
Keep in mind that this data has limitations. In the United States, adverse events are submitted to the FDA voluntarily by consumers, so there may not be reports for every event that occurred. There is also no certainty that a reported event was actually caused by the product: the FDA does not require that a causal relationship between a product and an event be proven, and reports do not always contain the detail necessary to evaluate an event. Because of this, there is no way to identify the true number of events. The important takeaway is that the information in this data has not been verified to establish cause-and-effect relationships. Despite this disclaimer, many interesting insights can be derived from the data to accelerate drug safety research.
Data analysis using SQL
For application developers who want to perform targeted searching and lookups, the API endpoints provided by the openFDA project are “ready to go” for software integration using a standard API powered by Elasticsearch, NodeJS, and Docker. However, for data analysis purposes, it is often easier to work with the data using SQL and statistical packages that expect a SQL table structure. For large-scale analysis, APIs often have query limits, such as 5000 records per query. This can cause extra work for data scientists who want to analyze the full dataset instead of small subsets of data.
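To make the API-versus-bulk tradeoff concrete, here is a minimal Python sketch of a targeted lookup against the openFDA drug adverse event endpoint. The search and paging parameters follow the openFDA documentation; treat the exact per-request limits as something to confirm in those docs.
import requests
# Query the openFDA drug adverse event endpoint for aspirin reports.
url = "https://api.fda.gov/drug/event.json"
params = {
    "search": 'patient.drug.medicinalproduct:"ASPIRIN"',
    "limit": 100,  # records per request; paging beyond this uses 'skip'
    "skip": 0,
}
response = requests.get(url, params=params)
response.raise_for_status()
payload = response.json()
print(payload["meta"]["results"]["total"])  # total matching reports
print(len(payload["results"]))              # records returned in this page
For a handful of targeted lookups this is perfect; for statistics across the entire dataset, paging through millions of reports this way is exactly the extra work described above.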
To make the full dataset available without those per-query limits, the openFDA project released the full 100 GB of harmonized data files that back the openFDA project onto S3. Athena is an interactive query service that makes it easy to analyze data in S3 using standard SQL. It's a quick and easy way to answer your questions about adverse events and aspirin, and it does not require you to spin up databases or servers.
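If you prefer scripting to the console, a minimal boto3 sketch of submitting an Athena query might look like the following. The database and table names match the fda.drugevents table created later in this post, and the output bucket is a placeholder for your own setup.
import time
import boto3
athena = boto3.client("athena", region_name="us-east-1")
# Athena runs queries asynchronously and writes results to S3.
submitted = athena.start_query_execution(
    QueryString="SELECT count(*) FROM fda.drugevents",
    QueryExecutionContext={"Database": "fda"},
    ResultConfiguration={"OutputLocation": "s3://{YOUR OUTPUT BUCKET HERE}/athena-results/"},
)
query_id = submitted["QueryExecutionId"]
# Poll until the query reaches a terminal state, then fetch the result rows.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)
if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(rows)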
While you could point tools directly at the openFDA S3 files, you will get greatly improved performance, and make better use of the data, by following the preparation steps later in this post.
Architecture
This post explains how to use the following architecture to take the raw data provided by openFDA, leverage several AWS services, and derive meaning from the underlying data.
Steps:
Load the openFDA /drug/event dataset into Spark and convert it to gzip to allow for streaming.
Transform the data in Spark and save the results as a Parquet file in S3.
Query the S3 Parquet file with Athena.
Perform visualization and analysis of the data in R and Python on Amazon EC2.
Optimizing public data sets: A primer on data preparation
Those who want to jump right into preparing the files for Athena may want to skip ahead to the next section.
Transforming, or pre-processing, files is a common task when working with many public data sets. Before you jump into the specific steps for transforming the openFDA data files into a format optimized for Athena, it is worth a quick exploration of the problem.
Making a dataset in S3 efficiently accessible with minimal transformation for the end user has two key elements:
Partitioning the data into objects that contain a complete part of the data (such as data created within a specific month).
Using file formats that make it easy for applications to locate subsets of data (for example, gzip, Parquet, ORC, etc.); a short sketch of both ideas follows this list.
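Here is a minimal local sketch of both ideas using pyarrow (an assumption; any Parquet-capable library behaves similarly, and the column names are made up for illustration):
import pyarrow as pa
import pyarrow.parquet as pq
# Toy adverse-event records; receipt_year is the partition column.
table = pa.table({
    "safetyreportid":   ["100001", "100002", "100003"],
    "medicinalproduct": ["ASPIRIN", "ASPIRIN", "IBUPROFEN"],
    "receipt_year":     ["2014", "2015", "2015"],
})
# Partitioning writes one directory per year (receipt_year=2014/, ...),
# so a query engine can skip entire partitions a query does not touch.
pq.write_to_dataset(table, root_path="events_parquet", partition_cols=["receipt_year"])
# Reading back only the 2015 partition touches only that subset of files.
subset = pq.read_table("events_parquet", filters=[("receipt_year", "=", "2015")])
print(subset.num_rows)  # 2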
With these two key elements in mind, you can now apply transformations to the openFDA adverse event data to prepare it for Athena. You may find the techniques employed in this post applicable to many of the questions you want to ask of the public data sets stored in Amazon S3.
Before you get started, I encourage those who are interested in doing deeper healthcare analysis on AWS to first read the AWS HIPAA Compliance whitepaper. This covers the information necessary for processing and storing protected health information (PHI).
Also, the adverse event analysis shown for aspirin is strictly for demonstration purposes and should not be used for any real decision or taken as anything other than a demonstration of AWS capabilities. However, there have been robust case studies published that have explored a causal relationship between aspirin and adverse reactions using OpenFDA data. If you are seeking research on aspirin or its risks, visit organizations such as the Centers for Disease Control and Prevention (CDC) or the Institute of Medicine (IOM).
Preparing data for Athena
For this walkthrough, you start with the FDA adverse events dataset, which is stored as JSON files within zip archives on S3, and convert it to Parquet for analysis. Why do you need to convert it? The original data download is stored in objects that are partitioned by quarter.
Here is a small sample of what you find in the adverse events (/drug/event) section of the openFDA website.
If you were looking for events that happened in a specific quarter, this is not a bad solution. For most other scenarios, such as looking across the full history of aspirin events, it requires you to access a lot of data that you won’t need. The zip file format is not ideal for using data in place because zip readers must have random access to the file, which means the data can’t be streamed. Additionally, the zip files contain large JSON objects.
To read the data in these JSON files, you must either use a streaming JSON decoder or decode the JSON on a computer with a significant amount of RAM. Opening up these files for public consumption is a great start, but you still need to prepare the data with a few lines of Spark code so that the JSON can be streamed. The short sketch below shows why gzip, unlike zip, can be consumed as a stream.
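This is a hedged Python sketch of that difference; the bucket and key are placeholders matching the Spark output of step 1 below. gzip can be decompressed from a forward-only stream, so each line of JSON can be decoded as it arrives, whereas a zip reader must first seek to the archive's central directory.
import gzip
import json
import boto3
s3 = boto3.client("s3")
# StreamingBody is a non-seekable network stream; gzip can still wrap it.
obj = s3.get_object(Bucket="{YOUR OUTPUT BUCKET HERE}", Key="output/2015q4/part-00000.gz")
with gzip.GzipFile(fileobj=obj["Body"]) as stream:
    for line in stream:
        event = json.loads(line)  # one JSON object per line after step 1
        break  # demonstrate that the first record is readable immediately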
Step 1: Convert the file types
Using Apache Spark on EMR, you can extract all of the zip files and pull out the events from the JSON files. To do this, use the Scala code below to extract the JSON from each zip archive and write it out as text. In addition, compress the output with gzip to improve Spark's performance and reduce your overall storage footprint. The Scala code can be run in either the Spark Shell or in an Apache Zeppelin notebook on your EMR cluster.
If you are unfamiliar with either Apache Zeppelin or the Spark Shell, their getting-started documentation is a good reference before continuing.
import scala.io.Source
import java.util.zip.ZipInputStream
import org.apache.spark.input.PortableDataStream
import org.apache.hadoop.io.compress.GzipCodec
// Input directory: the quarterly zip archives published by openFDA
val inputFile = "s3://download.open.fda.gov/drug/event/2015q4/*.json.zip";
// Output directory
val outputDir = "s3://{YOUR OUTPUT BUCKET HERE}/output/2015q4/";
// Read the zip archives as binary files
val zipFiles = sc.binaryFiles(inputFile);
// Extract the JSON from each zip archive as text, collapse whitespace so
// each object sits on one line, and save gzip-compressed text files
zipFiles.flatMap((file: (String, PortableDataStream)) => {
  val zipStream = new ZipInputStream(file._2.open)
  val entry = zipStream.getNextEntry
  val iter = Source.fromInputStream(zipStream).getLines
  iter
}).map(_.replaceAll("\\s+", "")).saveAsTextFile(outputDir, classOf[GzipCodec])
Step 2: Transform JSON into Parquet
With just a few more lines of Scala code, you can use Spark’s abstractions to convert the JSON into a Spark DataFrame and then export the data back to S3 in Parquet format.
Spark requires the JSON to be in JSON Lines format to be parsed correctly into a DataFrame.
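If JSON Lines is new to you, the format is simply one complete JSON document per line with no enclosing array. A tiny Python illustration, with made-up fields:
import json
records = [
    {"safetyreportid": "100001", "serious": "1"},
    {"safetyreportid": "100002", "serious": "2"},
]
# JSON Lines: each line parses on its own, which lets Spark split files
# across tasks without a streaming JSON decoder.
with open("events.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")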
// Output Parquet directory
val outputDir = "s3://{YOUR OUTPUT BUCKET NAME}/output/drugevents"
// Input JSON files (the gzip-compressed output of step 1)
val inputJson = "s3://{YOUR OUTPUT BUCKET NAME}/output/2015q4/*"
// Load a DataFrame from the JSON files
val df = spark.read.json(sc.wholeTextFiles(inputJson).values)
// Extract results from dataframe
val results = df.select("results")
// Save it to Parquet
results.write.parquet(outputDir)
Step 3: Create an Athena table
With the data cleanly prepared and stored in S3 using the Parquet format, you can now place an Athena table on top of it to get a better understanding of the underlying data.
Because the openFDA data structure incorporates several layers of nesting, it can be a complex process to try to manually derive the underlying schema in a Hive-compatible format. To shorten this process, you can load the top row of the DataFrame from the previous step into a Hive table within Zeppelin and then extract the “create table” statement from SparkSQL.
results.createOrReplaceTempView("data")
val top1 = spark.sql("select * from data tablesample(1 rows)")
top1.write.format("parquet").mode("overwrite").saveAsTable("drugevents")
spark.sql("show create table drugevents").show(1, false)
This returns a “create table” statement that you can almost paste directly into the Athena console. Make some small modifications (adding the word “external” and replacing “using” with “stored as”), and then execute the code in the Athena query editor. The table is created.
For the openFDA data, the DDL returns all string fields, because the date format used in the dataset does not conform to the yyyy-mm-dd hh:mm:ss[.f...] format required by Hive. For this analysis, the string format works fine, but you could extend this code with a Presto function such as date_parse to convert the strings into timestamps.
With the Athena table in place, you can start to explore the data by running ad hoc queries within Athena or doing more advanced statistical analysis in R.
Using SQL and R to analyze adverse events
Using the openFDA data with Athena makes it very easy to translate your questions into SQL code and perform quick analysis on the data. After you have prepared the data for Athena, you can begin to explore the relationship between aspirin and adverse drug events, as an example. One of the most common metrics to measure adverse drug events is the Proportional Reporting Ratio (PRR). It is defined as:
PRR = (m/n) / ((M-m)/(N-n))
where:
m = the number of reports with the drug and the event
n = the number of reports with the drug
M = the number of reports with the event in the database
N = the number of reports in the database
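As a sanity check of the formula, here is a small Python sketch with made-up counts; it also computes the reporting odds ratio (ROR) exactly as the R script later in this post does.
# Hypothetical counts for a single year of reports
m = 1000      # reports with the drug and the event
n = 20000     # reports with the drug
M = 15000     # reports with the event in the database
N = 1000000   # reports in the database
# Proportional reporting ratio
prr = (m / n) / ((M - m) / (N - n))
# Reporting odds ratio, as defined in the R script below
d = n - m     # reports with the drug but without the event
D = N - M     # reports without the event
ror = (m / d) / (M / D)
print(round(prr, 2), round(ror, 2))  # roughly 3.5 and 3.46 for these counts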
Gastrointestinal haemorrhage has the highest PRR of any reaction to aspirin when viewed in aggregate. One question you may want to ask is how the PRR has trended on a yearly basis for gastrointestinal haemorrhage since 2005.
Using the following query in Athena, you can see the PRR trend of “GASTROINTESTINAL HAEMORRHAGE” reactions with “ASPIRIN” since 2005:
with drug_and_event as
(select rpad(receiptdate, 4, 'NA') as receipt_year
, reactionmeddrapt
, count(distinct (concat(safetyreportid,receiptdate,reactionmeddrapt))) as reports_with_drug_and_event
from fda.drugevents
where rpad(receiptdate,4,'NA')
between '2005' and '2015'
and medicinalproduct = 'ASPIRIN'
and reactionmeddrapt= 'GASTROINTESTINAL HAEMORRHAGE'
group by reactionmeddrapt, rpad(receiptdate, 4, 'NA')
), reports_with_drug as
(
select rpad(receiptdate, 4, 'NA') as receipt_year
, count(distinct (concat(safetyreportid,receiptdate,reactionmeddrapt))) as reports_with_drug
from fda.drugevents
where rpad(receiptdate,4,'NA')
between '2005' and '2015'
and medicinalproduct = 'ASPIRIN'
group by rpad(receiptdate, 4, 'NA')
), reports_with_event as
(
select rpad(receiptdate, 4, 'NA') as receipt_year
, count(distinct (concat(safetyreportid,receiptdate,reactionmeddrapt))) as reports_with_event
from fda.drugevents
where rpad(receiptdate,4,'NA')
between '2005' and '2015'
and reactionmeddrapt= 'GASTROINTESTINAL HAEMORRHAGE'
group by rpad(receiptdate, 4, 'NA')
), total_reports as
(
select rpad(receiptdate, 4, 'NA') as receipt_year
, count(distinct (concat(safetyreportid,receiptdate,reactionmeddrapt))) as total_reports
from fda.drugevents
where rpad(receiptdate,4,'NA')
between '2005' and '2015'
group by rpad(receiptdate, 4, 'NA')
)
select drug_and_event.receipt_year,
(1.0 * drug_and_event.reports_with_drug_and_event/reports_with_drug.reports_with_drug)/ (1.0 * (reports_with_event.reports_with_event- drug_and_event.reports_with_drug_and_event)/(total_reports.total_reports-reports_with_drug.reports_with_drug)) as prr
, drug_and_event.reports_with_drug_and_event
, reports_with_drug.reports_with_drug
, reports_with_event.reports_with_event
, total_reports.total_reports
from drug_and_event
inner join reports_with_drug on drug_and_event.receipt_year = reports_with_drug.receipt_year
inner join reports_with_event on drug_and_event.receipt_year = reports_with_event.receipt_year
inner join total_reports on drug_and_event.receipt_year = total_reports.receipt_year
order by drug_and_event.receipt_year
One nice feature of Athena is that you can quickly connect to it via R or any other tool that can use a JDBC driver to visualize the data and understand it more clearly.
With this quick R script, which can be run in RStudio either locally or on an EC2 instance, you can visualize the PRR and the reporting odds ratio (ROR) for "GASTROINTESTINAL HAEMORRHAGE" reactions to "ASPIRIN" from 2005 through 2015 to better understand these trends.
# Connect to Athena ('drv' is the Athena JDBC driver, loaded earlier,
# for example via the RJDBC package)
conn <- dbConnect(drv, '<Your JDBC URL>',
                  s3_staging_dir="<Your S3 Location>",
                  user=Sys.getenv("USER_NAME"),
                  password=Sys.getenv("USER_PASSWORD"))
# Declare the adverse event of interest
adverseEvent <- "'GASTROINTESTINAL HAEMORRHAGE'"
# Build the SQL blocks shared by all four queries
sqlFirst <- "SELECT rpad(receiptdate, 4, 'NA') as receipt_year, count(DISTINCT safetyreportid) as event_count FROM fda.drugevents WHERE rpad(receiptdate,4,'NA') between '2005' and '2015'"
sqlEnd <- "GROUP BY rpad(receiptdate, 4, 'NA') ORDER BY receipt_year"
# Extract aspirin-with-adverse-event counts (m)
sql <- paste(sqlFirst,"AND medicinalproduct ='ASPIRIN' AND reactionmeddrapt=",adverseEvent, sqlEnd,sep=" ")
aspirinAdverseCount = dbGetQuery(conn,sql)
# Extract aspirin counts (n)
sql <- paste(sqlFirst,"AND medicinalproduct ='ASPIRIN'", sqlEnd,sep=" ")
aspirinCount = dbGetQuery(conn,sql)
# Extract adverse event counts (M)
sql <- paste(sqlFirst,"AND reactionmeddrapt=",adverseEvent, sqlEnd,sep=" ")
adverseCount = dbGetQuery(conn,sql)
# All drug adverse event counts (N)
sql <- paste(sqlFirst, sqlEnd,sep=" ")
allDrugCount = dbGetQuery(conn,sql)
# Select the rows whose years line up across the four result sets
selAll = allDrugCount$receipt_year == aspirinAdverseCount$receipt_year
selAspirin = aspirinCount$receipt_year == aspirinAdverseCount$receipt_year
selAdverse = adverseCount$receipt_year == aspirinAdverseCount$receipt_year
# Calculate the counts
m <- c(aspirinAdverseCount$event_count)
n <- c(aspirinCount[selAspirin,2])
M <- c(adverseCount[selAdverse,2])
N <- c(allDrugCount[selAll,2])
# Calculate the proportional reporting ratio
PRR = (m/n)/((M-m)/(N-n))
# Calculate the reporting odds ratio
d = n-m
D = N-M
ROR = (m/d)/(M/D)
# Plot the PRR and ROR by year
g_range <- range(0, PRR, ROR)
g_range[2] <- g_range[2] + 3
yearLen = length(aspirinAdverseCount$receipt_year)
ax <- aspirinAdverseCount$receipt_year
plot(PRR, type="o", col="blue", ylim=g_range, axes=FALSE, ann=FALSE)
axis(1, 1:yearLen, lab=ax)
axis(2, las=1, at=1*0:g_range[2])
box()
lines(ROR, type="o", pch=22, lty=2, col="red")
As you can see, the PRR and ROR have both remained fairly steady over this time range. With the R script above, all you need to do is change the adverseEvent variable from GASTROINTESTINAL HAEMORRHAGE to another type of reaction to analyze and compare those trends.
Summary
In this walkthrough:
You used a Scala script on EMR to convert the openFDA zip files to gzip.
You then transformed the JSON blobs into flattened Parquet files using Spark on EMR.
You created an Athena DDL so that you could query these Parquet files residing in S3.
Finally, you pointed the R package at the Athena table to analyze the data without pulling it into a database or creating your own servers.
If you have questions or suggestions, please comment below.
Ryan Hood is a Data Engineer for AWS. He works on big data projects leveraging the newest AWS offerings. In his spare time, he enjoys watching the Cubs win the World Series and attempting to sous-vide anything he can find in his refrigerator.
Vikram Anand is a Data Engineer for AWS. He works on big data projects leveraging the newest AWS offerings. In his spare time, he enjoys playing soccer and watching the NFL & European Soccer leagues.
Dave Rocamora is a Solutions Architect at Amazon Web Services on the Open Data team. Dave is based in Seattle and when he is not opening data, he enjoys biking and drinking coffee outside.
Nowadays, streaming data is seen and used everywhere—from social networks, to mobile and web applications, IoT devices, instrumentation in data centers, and many other sources. As the speed and volume of this type of data increases, the need to perform data analysis in real time with machine learning algorithms and extract a deeper understanding from the data becomes ever more important. For example, you might want a continuous monitoring system to detect sentiment changes in a social media feed so that you can react to the sentiment in near real time.
In this post, we use Amazon Kinesis Streams to collect and store streaming data. We then use Amazon Kinesis Analytics to process and analyze the streaming data continuously. Specifically, we use the Kinesis Analytics built-in RANDOM_CUT_FOREST function, a machine learning algorithm, to detect anomalies in the streaming data. Finally, we use Amazon Kinesis Firehose to export the anomalies data to Amazon Elasticsearch Service (Amazon ES). We then build a simple dashboard in the open source tool Kibana to visualize the result.
Solution overview
The following diagram depicts a high-level overview of this solution.
Amazon Kinesis Streams
You can use Amazon Kinesis Streams to build your own streaming application. Such an application can continuously capture and store terabytes of data per hour from hundreds of thousands of sources, and process and analyze the data as it arrives.
Amazon Kinesis Analytics
Kinesis Analytics provides an easy and familiar standard SQL language to analyze streaming data in real time. One of its most powerful features is that there are no new languages, processing frameworks, or complex machine learning algorithms that you need to learn.
Amazon Kinesis Firehose
Kinesis Firehose is the easiest way to load streaming data into AWS. It can capture, transform, and load streaming data into Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service.
Amazon Elasticsearch Service
Amazon ES is a fully managed service that makes it easy to deploy, operate, and scale Elasticsearch for log analytics, full text search, application monitoring, and more.
Solution summary
The following is a quick walkthrough of the solution that’s presented in the diagram:
IoT sensors send streaming data into Kinesis Streams. In this post, you use a Python script to simulate an IoT temperature sensor device that sends the streaming data.
By using the built-in RANDOM_CUT_FOREST function in Kinesis Analytics, you can detect anomalies in real time with the sensor data that is stored in Kinesis Streams. RANDOM_CUT_FOREST is also an appropriate algorithm for many other kinds of anomaly-detection use cases—for example, the media sentiment example mentioned earlier in this post.
The processed anomaly data is then loaded into the Kinesis Firehose delivery stream.
By using the built-in integration that Kinesis Firehose has with Amazon ES, you can easily export the processed anomaly data into the service and visualize it with Kibana.
Implementation steps
The following sections walk through the implementation steps in detail.
Creating the streams and the Amazon ES domain
Open the Amazon Kinesis Streams console.
Create a new Kinesis stream. Give it a name that indicates it’s for raw incoming stream data—for example, RawStreamData. For Number of shards, type 1.
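If you prefer the SDK to the console, a minimal boto3 equivalent of this step might look like the following; the stream name matches the Python script below.
import boto3
client = boto3.client("kinesis", region_name="us-east-1")
# Create the raw data stream with a single shard, as in the console step.
client.create_stream(StreamName="RawStreamData", ShardCount=1)
# Block until the stream is ACTIVE and ready to accept records.
client.get_waiter("stream_exists").wait(StreamName="RawStreamData")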
The Python code below simulates a streaming application, such as an IoT device, and writes random data points, plus occasional anomalies, into a Kinesis stream. The code generates two temperature ranges: the first is the hypothetical sensor's normal operating temperature range (10–20), and the second is the anomaly temperature range (100–120). Make sure to change the stream name in the two put_record calls and the Region in connect_to_region to match your configuration. Alternatively, you can download the Amazon Kinesis Data Generator from this repository and use it to generate the data.
import json
import random
from boto import kinesis

kinesis = kinesis.connect_to_region("us-east-1")

# Build one sensor reading with a random value in the given range
def getData(iotName, lowVal, highVal):
    data = {}
    data["iotName"] = iotName
    data["iotValue"] = random.randint(lowVal, highVal)
    return data

# Send normal readings (10-20), with a 1% chance of an anomaly (100-120)
while 1:
    rnd = random.random()
    if (rnd < 0.01):
        data = json.dumps(getData("DemoSensor", 100, 120))
        kinesis.put_record("RawStreamData", data, "DemoSensor")
        print '***************************** anomaly ************************* ' + data
    else:
        data = json.dumps(getData("DemoSensor", 10, 20))
        kinesis.put_record("RawStreamData", data, "DemoSensor")
        print data
Open the Amazon Elasticsearch Service console and create a new domain.
Give the domain a unique name. In the Configure cluster screen, use the default settings.
In the Set up access policy screen, in the Set the domain access policy list, choose Allow access to the domain from specific IP(s).
Enter the public IP address of your computer. Note: If you’re working behind a proxy or firewall, see the “Use a proxy to simplify request signing” section in this AWS Database blog post to learn how to work with a proxy. For additional information about securing access to your Amazon ES domain, see How to Control Access to Your Amazon Elasticsearch Domain in the AWS Security Blog.
After the Amazon ES domain is up and running, you can set up and configure Kinesis Firehose to export results to Amazon ES:
Open the Amazon Kinesis Firehose console and choose Create Delivery Stream.
In the Destination dropdown list, choose Amazon Elasticsearch Service.
Type a stream name, and choose the Amazon ES domain that you created in Step 4.
Provide an index name and ES type. In the S3 bucket dropdown list, choose Create New S3 bucket. Choose Next.
In the configuration, change the Elasticsearch Buffer size to 1 MB and the Buffer interval to 60s. Use the default settings for all other fields. This shortens the time for the data to reach the ES cluster.
Under IAM Role, choose Create/Update existing IAM role. The best practice is to create a new role every time. Otherwise, the console keeps adding policy documents to the same role; eventually the size of the attached policies causes IAM to reject the role, and it fails in a non-obvious way where the console basically stops functioning.
Choose Next to move to the Review page.
Review the configuration, and then choose Create Delivery Stream.
Run the Python file for 1–2 minutes, and then press Ctrl+C to stop the execution. This loads some data into the stream for you to visualize in the next step.
Analyzing the data
Now it’s time to analyze the IoT streaming data using Amazon Kinesis Analytics.
Open the Amazon Kinesis Analytics console and create a new application. Give the application a name, and then choose Create Application.
On the next screen, choose Connect to a source. Choose the raw incoming data stream that you created earlier. (Note the in-application stream name SOURCE_SQL_STREAM_001, because you will need it later.)
Use the default settings for everything else. When the schema discovery process is complete, it displays a success message with the formatted stream sample in a table as shown in the following screenshot. Review the data, and then choose Save and continue.
Next, choose Go to SQL editor. When prompted, choose Yes, start application.
Copy the following SQL code and paste it into the SQL editor window.
CREATE OR REPLACE STREAM "TEMP_STREAM" (
"iotName" varchar (40),
"iotValue" integer,
"ANOMALY_SCORE" DOUBLE);
-- Creates an output stream and defines a schema
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
"iotName" varchar(40),
"iotValue" integer,
"ANOMALY_SCORE" DOUBLE,
"created" TimeStamp);
-- Compute an anomaly score for each record in the source stream
-- using Random Cut Forest
CREATE OR REPLACE PUMP "STREAM_PUMP_1" AS INSERT INTO "TEMP_STREAM"
SELECT STREAM "iotName", "iotValue", ANOMALY_SCORE FROM
TABLE(RANDOM_CUT_FOREST(
CURSOR(SELECT STREAM * FROM "SOURCE_SQL_STREAM_001")
)
);
-- Sort records by descending anomaly score, insert into output stream
CREATE OR REPLACE PUMP "OUTPUT_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"
SELECT STREAM "iotName", "iotValue", ANOMALY_SCORE, ROWTIME FROM "TEMP_STREAM"
ORDER BY FLOOR("TEMP_STREAM".ROWTIME TO SECOND), ANOMALY_SCORE DESC;
Choose Save and run SQL. As the application runs, it displays the results as stream data arrives. If you don't see any data coming in, run the Python script again to generate some fresh data. When there is data, it appears in a grid as shown in the following screenshot.
Note that you are selecting data from the source stream SOURCE_SQL_STREAM_001 that you created previously. Also note the ANOMALY_SCORE column. This is the value that the RANDOM_CUT_FOREST function calculates based on the temperature ranges provided by the Python script; the higher (anomaly) temperature range produces a higher score.
Looking at the SQL code, note that the first two blocks create two new streams to hold temporary data and the final result. The third block (STREAM_PUMP_1) analyzes the raw source data using the RANDOM_CUT_FOREST function, calculates an anomaly score (ANOMALY_SCORE), and inserts it into the TEMP_STREAM stream. The final block loads the results stored in TEMP_STREAM into DESTINATION_SQL_STREAM.
Choose Exit (done editing) next to the Save and run SQL button to return to the application configuration page.
Load processed data into the Kinesis Firehose delivery stream
Now, you can export the result from DESTINATION_SQL_STREAM into the Amazon Kinesis Firehose stream that you created previously.
On the application configuration page, choose Connect to a destination.
Choose the stream name that you created earlier, and use the default settings for everything else. Then choose Save and Continue.
On the application configuration page, choose Exit to Kinesis Analytics applications to return to the Amazon Kinesis Analytics console.
Run the Python script again for 4–5 minutes to generate enough data to flow through Amazon Kinesis Streams, Kinesis Analytics, Kinesis Firehose, and finally into the Amazon ES domain.
Open the Kinesis Firehose console, choose the stream, and then choose the Monitoring tab.
As the processed data flows into Kinesis Firehose and Amazon ES, the metrics appear on the Delivery Stream metrics page. Keep in mind that the metrics page takes a few minutes to refresh with the latest data.
Open the Amazon Elasticsearch Service dashboard in the AWS Management Console. The count in the Searchable documents column increases as shown in the following screenshot. In addition, the domain shows a cluster health of Yellow. This is because, by default, it needs two instances to deploy redundant copies of the index. To fix this, you can deploy two instances instead of one.
Visualize the data using Kibana
Now it’s time to launch Kibana and visualize the data.
Use the ES domain link to go to the cluster detail page, and then choose the Kibana link as shown in the following screenshot. If you’re working behind a proxy or firewall, see the “Use a proxy to simplify request signing” section in this blog post to learn how to work with a proxy.
In the Kibana dashboard, choose the Discover tab to perform a query.
You can also visualize the data using the different types of charts offered by Kibana. For example, by going to the Visualize tab, you can quickly create a split bar chart that aggregates by ANOMALY_SCORE per minute.
Conclusion
In this post, you learned how to use Amazon Kinesis to collect, process, and analyze real-time streaming data, and then export the results to Amazon ES for analysis and visualization with Kibana. If you have comments about this post, add them to the “Comments” section below. If you have questions or issues with implementing this solution, please open a new thread on the Amazon Kinesis or Amazon ES discussion forums.
Tristan Li is a Solutions Architect with Amazon Web Services. He works with enterprise customers in the US, helping them adopt cloud technology to build scalable and secure solutions on AWS.
Are you tired of hearing me talk about Amazon WorkSpaces yet? I hope not, because we have a lot of customer-driven additions on the roadmap! Our customers in the developer and analyst community have been asking for a workstation-class machine that will allow them to take advantage of the low cost and flexibility of WorkSpaces. Developers want to run Visual Studio, IntelliJ, Eclipse, and other IDEs. Analysts want to run complex simulations and statistical analysis using MATLAB, GNU Octave, R, and Stata.
New Power Bundle
Today we are extending the current set of WorkSpaces bundles with a new Power bundle. With four vCPUs, 16 GiB of memory, and 275 GB of storage (175 GB on the system volume and another 100 GB on the user volume), this bundle is designed to make developers, analysts, (and me) smile. You can launch them in all of the usual ways: Console, CLI (create-workspaces), or API (CreateWorkSpaces):
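For the API route, a minimal boto3 sketch might look like this; the directory, user, and bundle IDs are placeholders, and you can look up the Power bundle ID for your Region with describe_workspace_bundles.
import boto3
client = boto3.client("workspaces", region_name="us-east-1")
# Launch one WorkSpace from the Power bundle (all IDs are placeholders).
response = client.create_workspaces(
    Workspaces=[
        {
            "DirectoryId": "d-0123456789",
            "UserName": "jdoe",
            "BundleId": "wsb-0123456789",
        }
    ]
)
print(response["PendingRequests"])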
One really interesting benefit to using a cloud-based virtual desktop for simulations and statistical analysis is the ease of access to data that’s already stored in the cloud. Analysts can mine and analyze petabytes of data stored in S3 that is effectively local (with respect to access time) to the WorkSpace. This low-latency access will boost productivity and also simplifies the use of other AWS data analysis tools such as Amazon Redshift, Amazon Redshift Spectrum, Amazon QuickSight, and Amazon Athena.
Like the existing bundles, the new Power bundle can be used in either billing configuration, AlwaysOn or AutoStop (read Amazon WorkSpaces Update – Hourly Usage and Expanded Root Volume to learn more). The bundle is available in all AWS Regions where WorkSpaces is available and you can launch one today! Visit the WorkSpaces Pricing page for pricing in your region.