Tag Archives: data analysis

New AWS Training: Building a Serverless Data Lake

Post Syndicated from Sara Snedeker original https://aws.amazon.com/blogs/big-data/new-aws-training-building-a-serverless-data-lake/

AWS Training allows you to learn from the experts so that you can advance your knowledge with practical skills and get more out of the AWS Cloud. We are adding one of our most popular event boot camps, Building a Serverless Data Lake, to our permanent instructor-led training portfolio.

This one-day course is designed to teach you how to design, build, and operate a serverless data lake solution with AWS services. We cover topics such as ingesting data from any data source at large scale, storing the data securely and durably, enabling the capability to use the right tool to process large volumes of data, and understanding the options available for analyzing the data in near-real time.

This course is intended for solution architects, big data developers, data architects and analysts, and other hands-on data analysis practitioners.

You can explore our complete course catalog, or search for a public class near you. You can also request a private onsite training for your team by contacting AWS Training.


CyberChef – Cyber Swiss Army Knife

Post Syndicated from Darknet original http://feedproxy.google.com/~r/darknethackers/~3/SOhld_nebGs/

CyberChef is a simple, intuitive web app for carrying out all manner of “cyber” operations within a web browser. These operations include simple encoding like XOR or Base64, more complex encryption like AES, DES and Blowfish, creating binary and hexdumps, compression and decompression of data, calculating hashes and checksums, IPv6 and X.509…

Read the full post at darknet.org.uk

Book Review: Twitter and Tear Gas, by Zeynep Tufekci

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2017/07/book_review_twi.html

There are two opposing models of how the Internet has changed protest movements. The first is that the Internet has made protesters mightier than ever. This comes from the successful revolutions in Tunisia (2010-11), Egypt (2011), and Ukraine (2013). The second is that it has made them more ineffectual. Derided as “slacktivism” or “clicktivism,” the ease of action without commitment can result in movements like Occupy petering out in the US without any obvious effects. Of course, the reality is more nuanced, and Zeynep Tufekci teases that out in her new book Twitter and Tear Gas.

Tufekci is a rare interdisciplinary figure. As a sociologist, programmer, and ethnographer, she studies how technology shapes society and drives social change. She has a dual appointment in both the School of Information Science and the Department of Sociology at University of North Carolina at Chapel Hill, and is a Faculty Associate at the Berkman Klein Center for Internet and Society at Harvard University. Her regular New York Times column on the social impacts of technology is a must-read.

Modern Internet-fueled protest movements are the subjects of Twitter and Tear Gas. As an observer, writer, and participant, Tufekci examines how modern protest movements have been changed by the Internet, and what that means for protests going forward. Her book combines her own ethnographic research and her usual deft analysis, with the research of others and some big data analysis from social media outlets. The result is a book that is both insightful and entertaining, and whose lessons are much broader than the book’s central topic.

“The Power and Fragility of Networked Protest” is the book’s subtitle. The power of the Internet as a tool for protest is obvious: it gives people newfound abilities to quickly organize and scale. But, according to Tufekci, it’s a mistake to judge modern protests using the same criteria we used to judge pre-Internet protests. The 1963 March on Washington might have culminated in hundreds of thousands of people listening to Martin Luther King Jr. deliver his “I Have a Dream” speech, but it was the culmination of a multi-year protest effort and the result of six months of careful planning made possible by that sustained effort. The 2011 protests in Cairo came together in mere days because they could be loosely coordinated on Facebook and Twitter.

That’s the power. Tufekci describes the fragility by analogy. Nepalese Sherpas assist Mt. Everest climbers by carrying supplies, laying out ropes and ladders, and so on. This means that people with limited training and experience can make the ascent, which is no less dangerous, to sometimes disastrous results. Says Tufekci: “The Internet similarly allows networked movements to grow dramatically and rapidly, but without prior building of formal or informal organizational and other collective capacities that could prepare them for the inevitable challenges they will face and give them the ability to respond to what comes next.” That makes them less able to respond to government counters, change their tactics (a phenomenon Tufekci calls “tactical freeze”), make movement-wide decisions, and survive over the long haul.

Tufekci isn’t arguing that modern protests are necessarily less effective, but that they’re different. Effective movements need to understand these differences, and leverage these new advantages while minimizing the disadvantages.

To that end, she develops a taxonomy for talking about social movements. Protests are an example of a “signal” that corresponds to one of several underlying “capacities.” There’s narrative capacity: the ability to change the conversation, as Black Lives Matter did with police violence and Occupy did with wealth inequality. There’s disruptive capacity: the ability to stop business as usual. An early Internet example is the 1999 WTO protests in Seattle. And finally, there’s electoral or institutional capacity: the ability to vote, lobby, fund raise, and so on. Because of various “affordances” of modern Internet technologies, particularly social media, the same signal — a protest of a given size — reflects different underlying capacities.

This taxonomy also informs government reactions to protest movements. Smart responses target attention as a resource. The Chinese government responded to 2015 protesters in Hong Kong by not engaging with them at all, denying them camera-phone videos that would go viral and attract the world’s attention. Instead, they pulled their police back and waited for the movement to die from lack of attention.

If this all sounds dry and academic, it’s not. Twitter and Tear Gas is infused with a richness of detail stemming from her personal participation in the 2013 Gezi Park protests in Turkey, as well as personal on-the-ground interviews with protesters throughout the Middle East (particularly Egypt and her native Turkey), Zapatistas in Mexico, WTO protesters in Seattle, Occupy participants worldwide, and others. Tufekci writes with a warmth and respect for the humans that are part of these powerful social movements, gently intertwining her own story with the stories of others, big data, and theory. She is adept at writing for a general audience, and, despite being published by the intimidating Yale University Press, her book is more mass-market than academic. What rigor is there is presented in a way that carries readers along rather than distracting.

The synthesist in me wishes Tufekci would take some additional steps, taking the trends she describes outside of the narrow world of political protest and applying them more broadly to social change. Her taxonomy is an important contribution to the more-general discussion of how the Internet affects society. Furthermore, her insights on the networked public sphere have applications for understanding technology-driven social change in general. These are hard conversations for society to have. We largely prefer to allow technology to blindly steer society or, in some ways worse, leave it to unfettered for-profit corporations. When you’re reading Twitter and Tear Gas, keep current and near-term future technological issues such as ubiquitous surveillance, algorithmic discrimination, and automation and employment in mind. You’ll come away with new insights.

Tufekci twice quotes historian Melvin Kranzberg from 1985: “Technology is neither good nor bad; nor is it neutral.” This foreshadows her central message. For better or worse, the technologies that power the networked public sphere have changed the nature of political protest as well as government reactions to and suppressions of such protest.

I have long characterized our technological future as a battle between the quick and the strong. The quick — dissidents, hackers, criminals, marginalized groups — are the first to make use of a new technology to magnify their power. The strong are slower, but have more raw power to magnify. So while protesters are the first to use Facebook to organize, the governments eventually figure out how to use Facebook to track protesters. It’s still an open question who will gain the upper hand in the long term, but Tufekci’s book helps us understand the dynamics at work.

This essay originally appeared on Vice Motherboard.

The book on Amazon.com.

Analyze OpenFDA Data in R with Amazon S3 and Amazon Athena

Post Syndicated from Ryan Hood original https://aws.amazon.com/blogs/big-data/analyze-openfda-data-in-r-with-amazon-s3-and-amazon-athena/

One of the great benefits of Amazon S3 is the ability to host, share, or consume public data sets. This provides transparency into data to which an external data scientist or developer might not normally have access. By exposing the data to the public, you can glean many insights that would have been difficult with a data silo.

The openFDA project creates easy access to the high value, high priority, and public access data of the Food and Drug Administration (FDA). The data has been formatted and documented in consumer-friendly standards. Critical data related to drugs, devices, and food has been harmonized and can easily be called by application developers and researchers via API calls. OpenFDA has published two whitepapers that drill into the technical underpinnings of the API infrastructure as well as how to properly analyze the data in R. In addition, FDA makes openFDA data available on S3 in raw format.

In this post, I show how to use S3, Amazon EMR, and Amazon Athena to analyze the drug adverse events dataset. A drug adverse event is an undesirable experience associated with the use of a drug, including serious drug side effects, product use errors, product quality problems, and therapeutic failures.

Data considerations

Keep in mind that this data has limitations. In the United States, these adverse events are submitted to the FDA voluntarily by consumers, so there may not be reports for all events that occurred. There is no certainty that the reported event was actually due to the product. The FDA does not require that a causal relationship between a product and event be proven, and reports do not always contain the detail necessary to evaluate an event. Because of this, there is no way to identify the true number of events. The important takeaway is that the information in this data has not been verified to establish cause-and-effect relationships. Despite this disclaimer, many interesting insights and value can be derived from the data to accelerate drug safety research.

Data analysis using SQL

For application developers who want to perform targeted searching and lookups, the API endpoints provided by the openFDA project are “ready to go” for software integration using a standard API powered by Elasticsearch, NodeJS, and Docker. However, for data analysis purposes, it is often easier to work with the data using SQL and statistical packages that expect a SQL table structure. For large-scale analysis, APIs often have query limits, such as 5000 records per query. This can cause extra work for data scientists who want to analyze the full dataset instead of small subsets of data.

To address the concern of requiring all the data in a single dataset, the openFDA project released the full 100 GB of harmonized data files that back the openFDA project onto S3. Athena is an interactive query service that makes it easy to analyze data in S3 using standard SQL. It’s a quick and easy way to answer your questions about adverse events and aspirin that does not require you to spin up databases or servers.

While you could point tools directly at the openFDA S3 files, you can find greatly improved performance and use of the data by following some of the preparation steps later in this post.


This post explains how to use the following architecture to take the raw data provided by openFDA, leverage several AWS services, and derive meaning from the underlying data.


  1. Load the openFDA /drug/event dataset into Spark and convert it to gzip to allow for streaming.
  2. Transform the data in Spark and save the results as a Parquet file in S3.
  3. Query the S3 Parquet file with Athena.
  4. Perform visualization and analysis of the data in R and Python on Amazon EC2.

Optimizing public data sets: A primer on data preparation

Those who want to jump right into preparing the files for Athena may want to skip ahead to the next section.

Transforming, or pre-processing, files is a common task for using many public data sets. Before you jump into the specific steps for transforming the openFDA data files into a format optimized for Athena, I thought it would be worthwhile to provide a quick exploration on the problem.

Making a dataset in S3 efficiently accessible with minimal transformation for the end user has two key elements:

  1. Partitioning the data into objects that contain a complete part of the data (such as data created within a specific month).
  2. Using file formats that make it easy for applications to locate subsets of data (for example, gzip, Parquet, ORC, etc.).
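To make the first element concrete, here is a minimal Python sketch of a month-partitioned key layout. The `year=/month=` prefix convention, the `drugevents` prefix, and the file names are assumptions chosen for illustration; they are not the layout that openFDA uses. The point is only that a reader interested in one month can select its objects by prefix without touching the rest.

```python
from datetime import date


def partition_key(prefix: str, event_date: date, filename: str) -> str:
    """Build an S3-style object key partitioned by year and month."""
    return f"{prefix}/year={event_date.year:04d}/month={event_date.month:02d}/{filename}"


def keys_for_month(keys, year, month):
    """Select only the keys that belong to one month's partition."""
    marker = f"/year={year:04d}/month={month:02d}/"
    return [k for k in keys if marker in k]


# Hypothetical objects spread across two monthly partitions.
keys = [
    partition_key("drugevents", date(2015, 10, 3), "part-0000.parquet"),
    partition_key("drugevents", date(2015, 11, 17), "part-0001.parquet"),
    partition_key("drugevents", date(2015, 11, 28), "part-0002.parquet"),
]

# Only the November objects need to be read to answer a November question.
november = keys_for_month(keys, 2015, 11)
```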

With these two key elements in mind, you can now apply transformations to the openFDA adverse event data to prepare it for Athena. You might find the data techniques employed in this post to be applicable to many of the questions you might want to ask of the public data sets stored in Amazon S3.

Before you get started, I encourage those who are interested in doing deeper healthcare analysis on AWS to make sure that you first read the AWS HIPAA Compliance whitepaper. This covers the information necessary for processing and storing patient health information (PHI).

Also, the adverse event analysis shown for aspirin is strictly for demonstration purposes and should not be used for any real decision or taken as anything other than a demonstration of AWS capabilities. However, there have been robust case studies published that have explored a causal relationship between aspirin and adverse reactions using OpenFDA data. If you are seeking research on aspirin or its risks, visit organizations such as the Centers for Disease Control and Prevention (CDC) or the Institute of Medicine (IOM).

Preparing data for Athena

For this walkthrough, you will start with the FDA adverse events dataset, which is stored as JSON files within zip archives on S3. You then convert it to Parquet for analysis. Why do you need to convert it? The original data download is stored in objects that are partitioned by quarter.

Here is a small sample of what you find in the adverse events (/drug/event) section of the openFDA website.

If you were looking for events that happened in a specific quarter, this is not a bad solution. For most other scenarios, such as looking across the full history of aspirin events, it requires you to access a lot of data that you won’t need. The zip file format is not ideal for using data in place because zip readers must have random access to the file, which means the data can’t be streamed. Additionally, the zip files contain large JSON objects.

To read the data in these JSON files, a streaming JSON decoder must be used or a computer with a significant amount of RAM must decode the JSON. Opening up these files for public consumption is a great start. However, you still need to prepare the data with a few lines of Spark code so that the JSON can be streamed.
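As a rough illustration of why gzip suits streaming, the following Python sketch compresses a few made-up records and then reads them back one line at a time from a byte stream, never holding the whole decoded file in memory. The record fields and the in-memory buffer standing in for a network stream are assumptions for the example.

```python
import gzip
import io
import json

# Three small, hypothetical event records written one per line.
records = [{"safetyreportid": str(i), "serious": "1"} for i in range(3)]
raw = "\n".join(json.dumps(r) for r in records).encode("utf-8")
compressed = gzip.compress(raw)


def stream_records(fileobj):
    """Yield one decoded record at a time from a gzip-compressed byte stream."""
    with gzip.open(fileobj, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if line.strip():
                yield json.loads(line)


# BytesIO stands in for a stream that is read from front to back.
streamed_ids = [r["safetyreportid"] for r in stream_records(io.BytesIO(compressed))]
```

Because each record sits on its own line, the reader can decode and discard records as they arrive, which is the property the zip-wrapped JSON objects lack.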

Step 1:  Convert the file types

Using Apache Spark on EMR, you can extract all of the zip files and pull out the events from the JSON files. To do this, use the Scala code below to unpack each zip file and create a text file. In addition, compress the JSON files with gzip to improve Spark’s performance and reduce your overall storage footprint. The Scala code can be run in either the Spark Shell or in an Apache Zeppelin notebook on your EMR cluster.

If you are unfamiliar with either Apache Zeppelin or the Spark Shell, the following posts serve as great references:


import scala.io.Source
import java.util.zip.ZipInputStream
import org.apache.spark.input.PortableDataStream
import org.apache.hadoop.io.compress.GzipCodec

// Input Directory
val inputFile = "s3://download.open.fda.gov/drug/event/2015q4/*.json.zip";

// Output Directory
val outputDir = "s3://{YOUR OUTPUT BUCKET HERE}/output/2015q4/";

// Read each zip archive as a binary stream
val zipFiles = sc.binaryFiles(inputFile);

// Process zip file to extract the json as text file and save it
// in the output directory 
val rdd = zipFiles.flatMap((file: (String, PortableDataStream)) => {
    val zipStream = new ZipInputStream(file._2.open)
    // Position the stream at the first entry in the archive
    val entry = zipStream.getNextEntry
    val iter = Source.fromInputStream(zipStream).getLines
    // Return the line iterator so that flatMap emits one record per line
    iter
}).map(_.replaceAll("\\s+","")).saveAsTextFile(outputDir, classOf[GzipCodec])

Step 2:  Transform JSON into Parquet

With just a few more lines of Scala code, you can use Spark’s abstractions to convert the JSON into a Spark DataFrame and then export the data back to S3 in Parquet format.

Spark requires the JSON to be in JSON Lines format to be parsed correctly into a DataFrame.

// Output Parquet directory
val outputDir = "s3://{YOUR OUTPUT BUCKET NAME}/output/drugevents"
// Input json file
val inputJson = "s3://{YOUR OUTPUT BUCKET NAME}/output/2015q4/*"
// Load dataframe from json file multiline 
val df = spark.read.json(sc.wholeTextFiles(inputJson).values)
// Extract results from dataframe
val results = df.select("results")
// Save it to Parquet
results.write.parquet(outputDir)
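To show what the JSON Lines requirement mentioned above means in practice, here is a small Python sketch that rewrites one pretty-printed document as one JSON object per line. The sample document is made up and heavily trimmed; it assumes only that the events live in a top-level "results" array, as the Spark code in this step selects.

```python
import json

# A hypothetical, trimmed document: one large object whose "results"
# array holds the individual events.
document = """
{
  "meta": {"last_updated": "2016-01-01"},
  "results": [
    {"safetyreportid": "1", "serious": "1"},
    {"safetyreportid": "2", "serious": "2"}
  ]
}
"""


def to_json_lines(text: str) -> str:
    """Re-serialize each element of "results" as a single-line JSON object."""
    events = json.loads(text)["results"]
    return "\n".join(json.dumps(e, separators=(",", ":")) for e in events)


json_lines = to_json_lines(document)
```

Each output line is a complete JSON value on its own, so a reader can parse the file record by record instead of loading the enclosing object.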

Step 3:  Create an Athena table

With the data cleanly prepared and stored in S3 using the Parquet format, you can now place an Athena table on top of it to get a better understanding of the underlying data.

Because the openFDA data structure incorporates several layers of nesting, it can be a complex process to try to manually derive the underlying schema in a Hive-compatible format. To shorten this process, you can load the top row of the DataFrame from the previous step into a Hive table within Zeppelin and then extract the “create  table” statement from SparkSQL.


val top1 = spark.sql("select * from data tablesample(1 rows)")


val show_cmd = spark.sql("show create table drugevents").show(1, false)

This returns a “create table” statement that you can almost paste directly into the Athena console. Make some small modifications (adding the word “external” and replacing “using” with “stored as”), and then execute the code in the Athena query editor. The table is created.

For the openFDA data, the DDL returns all string fields, as the date format used in your dataset does not conform to the yyyy-mm-dd hh:mm:ss[.f…] format required by Hive. For your analysis, the string format works well enough, but it would be possible to extend this code to use a Presto function to convert the strings into time stamps.

CREATE EXTERNAL TABLE fda.drugevents (
   companynumb  string, 
   safetyreportid  string, 
   safetyreportversion  string, 
   receiptdate  string, 
   patientagegroup  string, 
   patientdeathdate  string, 
   patientsex  string, 
   patientweight  string, 
   serious  string, 
   seriousnesscongenitalanomali  string, 
   seriousnessdeath  string, 
   seriousnessdisabling  string, 
   seriousnesshospitalization  string, 
   seriousnesslifethreatening  string, 
   seriousnessother  string, 
   actiondrug  string, 
   activesubstancename  string, 
   drugadditional  string, 
   drugadministrationroute  string, 
   drugcharacterization  string, 
   drugindication  string, 
   drugauthorizationnumb  string, 
   medicinalproduct  string, 
   drugdosageform  string, 
   drugdosagetext  string, 
   reactionoutcome  string, 
   reactionmeddrapt  string, 
   reactionmeddraversionpt  string)
STORED AS parquet
LOCATION
  's3://{YOUR TARGET BUCKET}/output/drugevents'
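If you later want real timestamps instead of strings, the conversion itself is small. The Python sketch below is a stand-in for the idea, not the Presto function mentioned above. It assumes that date fields such as receiptdate arrive as compact yyyymmdd strings, which this post does not state explicitly, and the helper name is hypothetical.

```python
from datetime import datetime
from typing import Optional


def parse_receipt_date(value: Optional[str]) -> Optional[datetime]:
    """Parse a compact date string, returning None when it does not fit."""
    if not value:
        return None
    try:
        # Assumed layout: four-digit year, two-digit month, two-digit day.
        return datetime.strptime(value, "%Y%m%d")
    except ValueError:
        return None


parsed = parse_receipt_date("20151103")

# Render the value in the yyyy-mm-dd hh:mm:ss layout that Hive expects.
hive_style = parsed.strftime("%Y-%m-%d %H:%M:%S")
```

Returning None for values that do not parse keeps malformed or missing dates from failing the whole conversion.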

With the Athena table in place, you can start to explore the data by running ad hoc queries within Athena or doing more advanced statistical analysis in R.

Using SQL and R to analyze adverse events

Using the openFDA data with Athena makes it very easy to translate your questions into SQL code and perform quick analysis on the data. After you have prepared the data for Athena, you can begin to explore the relationship between aspirin and adverse drug events, as an example. One of the most common metrics to measure adverse drug events is the Proportional Reporting Ratio (PRR). It is defined as:

PRR = (m/n)/( (M-m)/(N-n) )
m = #reports with drug and event
n = #reports with drug
M = #reports with event in database
N = #reports in database
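As a quick worked example of the definition above, the following Python sketch computes the PRR from the four counts. The counts are made up and chosen only to make the arithmetic easy to follow; the function name is hypothetical.

```python
def proportional_reporting_ratio(m: int, n: int, M: int, N: int) -> float:
    """PRR = (m/n) / ((M-m)/(N-n)), using the definitions above."""
    if n == 0 or N == n or M == m:
        raise ValueError("PRR is undefined for these counts")
    return (m / n) / ((M - m) / (N - n))


# Made-up counts: 20 of 100 drug reports mention the event, and
# 100 of the remaining 1000 reports mention it.
prr = proportional_reporting_ratio(m=20, n=100, M=120, N=1100)
```

Here the event appears in 20% of the reports for the drug and in 10% of all other reports, so the PRR works out to 2.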

Gastrointestinal haemorrhage has the highest PRR of any reaction to aspirin when viewed in aggregate. One question you may want to ask is how the PRR has trended on a yearly basis for gastrointestinal haemorrhage since 2005.

Using the following query in Athena, you can see the PRR trend of “GASTROINTESTINAL HAEMORRHAGE” reactions with “ASPIRIN” since 2005:

with drug_and_event as 
(select rpad(receiptdate, 4, 'NA') as receipt_year
    , reactionmeddrapt
    , count(distinct (concat(safetyreportid,receiptdate,reactionmeddrapt))) as reports_with_drug_and_event 
from fda.drugevents
where rpad(receiptdate,4,'NA') 
     between '2005' and '2015' 
     and medicinalproduct = 'ASPIRIN'
     and reactionmeddrapt= 'GASTROINTESTINAL HAEMORRHAGE'
group by reactionmeddrapt, rpad(receiptdate, 4, 'NA') 
), reports_with_drug as 
(select rpad(receiptdate, 4, 'NA') as receipt_year
    , count(distinct (concat(safetyreportid,receiptdate,reactionmeddrapt))) as reports_with_drug 
 from fda.drugevents 
 where rpad(receiptdate,4,'NA') 
     between '2005' and '2015' 
     and medicinalproduct = 'ASPIRIN'
group by rpad(receiptdate, 4, 'NA') 
), reports_with_event as 
   (select rpad(receiptdate, 4, 'NA') as receipt_year
    , count(distinct (concat(safetyreportid,receiptdate,reactionmeddrapt))) as reports_with_event 
   from fda.drugevents
   where rpad(receiptdate,4,'NA') 
     between '2005' and '2015' 
     and reactionmeddrapt= 'GASTROINTESTINAL HAEMORRHAGE'
   group by rpad(receiptdate, 4, 'NA')
), total_reports as 
   (select rpad(receiptdate, 4, 'NA') as receipt_year
    , count(distinct (concat(safetyreportid,receiptdate,reactionmeddrapt))) as total_reports 
   from fda.drugevents
   where rpad(receiptdate,4,'NA') 
     between '2005' and '2015' 
   group by rpad(receiptdate, 4, 'NA')
)
select  drug_and_event.receipt_year, 
(1.0 * drug_and_event.reports_with_drug_and_event/reports_with_drug.reports_with_drug)/ (1.0 * (reports_with_event.reports_with_event- drug_and_event.reports_with_drug_and_event)/(total_reports.total_reports-reports_with_drug.reports_with_drug)) as prr
, drug_and_event.reports_with_drug_and_event
, reports_with_drug.reports_with_drug
, reports_with_event.reports_with_event
, total_reports.total_reports
from drug_and_event
    inner join reports_with_drug on  drug_and_event.receipt_year = reports_with_drug.receipt_year   
    inner join reports_with_event on  drug_and_event.receipt_year = reports_with_event.receipt_year
    inner join total_reports on  drug_and_event.receipt_year = total_reports.receipt_year
order by  drug_and_event.receipt_year

One nice feature of Athena is that you can quickly connect to it via R or any other tool that can use a JDBC driver to visualize the data and understand it more clearly.

With this quick R script that can be run in R Studio either locally or on an EC2 instance, you can create a visualization of the PRR and Reporting Odds Ratio (RoR) for “GASTROINTESTINAL HAEMORRHAGE” reactions from “ASPIRIN” since 2005 to better understand these trends.

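# Load the JDBC bridge (assumption: RJDBC); drv is assumed to be a driver
# object created beforehand with JDBC() for the Athena JDBC driver
library(RJDBC)

# connect to ATHENA
conn <- dbConnect(drv, '<Your JDBC URL>', s3_staging_dir="<Your S3 Location>", user=Sys.getenv("USER_NAME"), password=Sys.getenv("USER_PASSWORD"))

# Declare Adverse Event (single-quoted so it can be pasted into the SQL text)
adverseEvent <- "'GASTROINTESTINAL HAEMORRHAGE'"
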
# Build SQL Blocks
sqlFirst <- "SELECT rpad(receiptdate, 4, 'NA') as receipt_year, count(DISTINCT safetyreportid) as event_count FROM fda.drugevents WHERE rpad(receiptdate,4,'NA') between '2005' and '2015'"
sqlEnd <- "GROUP BY rpad(receiptdate, 4, 'NA') ORDER BY receipt_year"

# Extract Aspirin with adverse event counts
sql <- paste(sqlFirst,"AND medicinalproduct ='ASPIRIN' AND reactionmeddrapt=",adverseEvent, sqlEnd,sep=" ")
aspirinAdverseCount = dbGetQuery(conn,sql)

# Extract Aspirin counts
sql <- paste(sqlFirst,"AND medicinalproduct ='ASPIRIN'", sqlEnd,sep=" ")
aspirinCount = dbGetQuery(conn,sql)

# Extract adverse event counts
sql <- paste(sqlFirst,"AND reactionmeddrapt=",adverseEvent, sqlEnd,sep=" ")
adverseCount = dbGetQuery(conn,sql)

# All Drug Adverse event Counts
sql <- paste(sqlFirst, sqlEnd,sep=" ")
allDrugCount = dbGetQuery(conn,sql)

# Select the rows whose year also appears in the aspirin adverse event counts
selAll =  allDrugCount$receipt_year %in% aspirinAdverseCount$receipt_year
selAspirin = aspirinCount$receipt_year %in% aspirinAdverseCount$receipt_year
selAdverse = adverseCount$receipt_year %in% aspirinAdverseCount$receipt_year

# Calculate Numbers
m <- c(aspirinAdverseCount$event_count)
n <- c(aspirinCount[selAspirin,2])
M <- c(adverseCount[selAdverse,2])
N <- c(allDrugCount[selAll,2])

# Calculate proportional reporting ratio
PRR = (m/n)/((M-m)/(N-n))

# Calculate reporting Odds Ratio
d = n-m
D = N-M
ROR = (m/d)/(M/D)

# Plot the PRR and ROR
g_range <- range(0, PRR,ROR)
g_range[2] <- g_range[2] + 3
yearLen = length(aspirinAdverseCount$receipt_year)
plot(PRR, type="o", col="blue", ylim=g_range,axes=FALSE, ann=FALSE)
axis(2, las=1, at=1*0:g_range[2])
lines(ROR, type="o", pch=22, lty=2, col="red")

As you can see, the PRR and RoR have both remained fairly steady over this time range. With the R Script above, all you need to do is change the adverseEvent variable from GASTROINTESTINAL HAEMORRHAGE to another type of reaction to analyze and compare those trends.


In this walkthrough:

  • You used a Scala script on EMR to convert the openFDA zip files to gzip.
  • You then transformed the JSON blobs into flattened Parquet files using Spark on EMR.
  • You created an Athena DDL so that you could query these Parquet files residing in S3.
  • Finally, you pointed the R package at the Athena table to analyze the data without pulling it into a database or creating your own servers.

If you have questions or suggestions, please comment below.

Next Steps

Take your skills to the next level. Learn how to optimize Amazon S3 for an architecture commonly used to enable genomic data analysis. Also, be sure to read more about running R on Amazon Athena.






About the Authors

Ryan Hood is a Data Engineer for AWS. He works on big data projects leveraging the newest AWS offerings. In his spare time, he enjoys watching the Cubs win the World Series and attempting to Sous-vide anything he can find in his refrigerator.



Vikram Anand is a Data Engineer for AWS. He works on big data projects leveraging the newest AWS offerings. In his spare time, he enjoys playing soccer and watching the NFL & European Soccer leagues.



Dave Rocamora is a Solutions Architect at Amazon Web Services on the Open Data team. Dave is based in Seattle and when he is not opening data, he enjoys biking and drinking coffee outside.





Perform Near Real-time Analytics on Streaming Data with Amazon Kinesis and Amazon Elasticsearch Service

Post Syndicated from Tristan Li original https://aws.amazon.com/blogs/big-data/perform-near-real-time-analytics-on-streaming-data-with-amazon-kinesis-and-amazon-elasticsearch-service/

Nowadays, streaming data is seen and used everywhere—from social networks, to mobile and web applications, IoT devices, instrumentation in data centers, and many other sources. As the speed and volume of this type of data increases, the need to perform data analysis in real time with machine learning algorithms and extract a deeper understanding from the data becomes ever more important. For example, you might want a continuous monitoring system to detect sentiment changes in a social media feed so that you can react to the sentiment in near real time.

In this post, we use Amazon Kinesis Streams to collect and store streaming data. We then use Amazon Kinesis Analytics to process and analyze the streaming data continuously. Specifically, we use the Kinesis Analytics built-in RANDOM_CUT_FOREST function, a machine learning algorithm, to detect anomalies in the streaming data. Finally, we use Amazon Kinesis Firehose to export the anomalies data to Amazon Elasticsearch Service (Amazon ES). We then build a simple dashboard in the open source tool Kibana to visualize the result.

Solution overview

The following diagram depicts a high-level overview of this solution.

Amazon Kinesis Streams

You can use Amazon Kinesis Streams to build your own streaming application. This application can process and analyze streaming data by continuously capturing and storing terabytes of data per hour from hundreds of thousands of sources.

Amazon Kinesis Analytics

Kinesis Analytics provides an easy and familiar standard SQL language to analyze streaming data in real time. One of its most powerful features is that there are no new languages, processing frameworks, or complex machine learning algorithms that you need to learn.

Amazon Kinesis Firehose

Kinesis Firehose is the easiest way to load streaming data into AWS. It can capture, transform, and load streaming data into Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service.

Amazon Elasticsearch Service

Amazon ES is a fully managed service that makes it easy to deploy, operate, and scale Elasticsearch for log analytics, full text search, application monitoring, and more.

Solution summary

The following is a quick walkthrough of the solution that’s presented in the diagram:

  1. IoT sensors send streaming data into Kinesis Streams. In this post, you use a Python script to simulate an IoT temperature sensor device that sends the streaming data.
  2. By using the built-in RANDOM_CUT_FOREST function in Kinesis Analytics, you can detect anomalies in real time with the sensor data that is stored in Kinesis Streams. RANDOM_CUT_FOREST is also an appropriate algorithm for many other kinds of anomaly-detection use cases—for example, the media sentiment example mentioned earlier in this post.
  3. The processed anomaly data is then loaded into the Kinesis Firehose delivery stream.
  4. By using the built-in integration that Kinesis Firehose has with Amazon ES, you can easily export the processed anomaly data into the service and visualize it with Kibana.
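To get a feel for what the anomaly-detection step in this flow produces, here is a small local Python sketch. It does not implement RANDOM_CUT_FOREST; it swaps in a much simpler rule based on the median absolute deviation (MAD), and it uses a deterministic sequence in place of random sensor readings. The function name, the threshold, and the injected values are assumptions for illustration.

```python
import statistics


def find_anomalies(readings, threshold=5.0):
    """Return (index, value) pairs far from the median, measured in MADs."""
    median = statistics.median(readings)
    mad = statistics.median(abs(r - median) for r in readings)
    if mad == 0:
        return []
    return [(i, r) for i, r in enumerate(readings) if abs(r - median) / mad > threshold]


# Deterministic stand-in for a sensor cycling through the normal range 10-20.
readings = [10 + (i * 7) % 11 for i in range(200)]

# Inject two readings from the anomaly range 100-120.
readings[50] = 110
readings[120] = 105

anomalies = find_anomalies(readings)
```

In the actual solution, Kinesis Analytics assigns the anomaly scores as records arrive; this sketch only mirrors the shape of the result, a short list of readings that stand out from the normal operating range.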

Implementation steps

The following sections walk through the implementation steps in detail.

Creating the delivery stream

  1. Open the Amazon Kinesis Streams console.
  2. Create a new Kinesis stream. Give it a name that indicates it’s for raw incoming stream data—for example, RawStreamData. For Number of shards, type 1.
  3. The Python code provided below simulates a streaming application, such as an IoT device, and generates random data and anomalies into a Kinesis stream. The code generates two temperature ranges, where the first range is the hypothetical sensor’s normal operating temperature range (10–20), and the second is the anomaly temperature range (100–120). Make sure to change the stream name on lines 16 and 20 and the Region on line 6 to match your configuration. Alternatively, you can download the Amazon Kinesis Data Generator from this repository and use it to generate the data.
    import json
    import datetime
    import random
    import testdata
    from boto import kinesis
    kinesis = kinesis.connect_to_region("us-east-1")
    def getData(iotName, lowVal, highVal):
       data = {}
       data["iotName"] = iotName
       data["iotValue"] = random.randint(lowVal, highVal) 
       return data
    while 1:
       rnd = random.random()
       if (rnd < 0.01):
          data = json.dumps(getData("DemoSensor", 100, 120))  
          kinesis.put_record("RawStreamData", data, "DemoSensor")
          print('***************************** anomaly ************************* ' + data)
       else:
          data = json.dumps(getData("DemoSensor", 10, 20))  
          kinesis.put_record("RawStreamData", data, "DemoSensor")
          print(data)

  4. Open the Amazon Elasticsearch Service console and create a new domain.
    1. Give the domain a unique name. In the Configure cluster screen, use the default settings.
    2. In the Set up access policy screen, in the Set the domain access policy list, choose Allow access to the domain from specific IP(s).
    3. Enter the public IP address of your computer.
      Note: If you’re working behind a proxy or firewall, see the “Use a proxy to simplify request signing” section in this AWS Database blog post to learn how to work with a proxy. For additional information about securing access to your Amazon ES domain, see How to Control Access to Your Amazon Elasticsearch Domain in the AWS Security Blog.
  5. After the Amazon ES domain is up and running, you can set up and configure Kinesis Firehose to export results to Amazon ES:
    1. Open the Amazon Kinesis Firehose console and choose Create Delivery Stream.
    2. In the Destination dropdown list, choose Amazon Elasticsearch Service.
    3. Type a stream name, and choose the Amazon ES domain that you created in Step 4.
    4. Provide an index name and ES type. In the S3 bucket dropdown list, choose Create New S3 bucket. Choose Next.
    5. In the configuration, change the Elasticsearch Buffer size to 1 MB and the Buffer interval to 60s. Use the default settings for all other fields. This shortens the time for the data to reach the ES cluster.
    6. Under IAM Role, choose Create/Update existing IAM role.
      The best practice is to create a new role every time. Otherwise, the console keeps adding policy documents to the same role. Eventually the attached policies grow large enough that IAM rejects the role, and the failure is not obvious: the console simply stops responding.
    7. Choose Next to move to the Review page.
  6. Review the configuration, and then choose Create Delivery Stream.
  7. Run the Python file for 1–2 minutes, and then press Ctrl+C to stop the execution. This loads some data into the stream for you to visualize in the next step.
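
The simulator in step 3 mixes record generation with the network call, which makes it hard to check the data before sending anything. The following is a minimal sketch, not the script used in this post, that separates the two. The names make_reading and generate are hypothetical, and the send argument stands in for whatever client call you use to put a record on the stream.

```python
import json
import random


def make_reading(rng, anomaly_rate=0.01, name="DemoSensor"):
    """Return one simulated reading as a dict.

    With probability `anomaly_rate` the value falls in the anomaly
    range (100-120); otherwise it falls in the normal range (10-20).
    """
    if rng.random() < anomaly_rate:
        low, high = 100, 120
    else:
        low, high = 10, 20
    return {"iotName": name, "iotValue": rng.randint(low, high)}


def generate(count, send, seed=None):
    """Build `count` readings, pass each JSON payload to `send`,
    and return how many of them were anomalies."""
    rng = random.Random(seed)
    anomalies = 0
    for _ in range(count):
        reading = make_reading(rng)
        if reading["iotValue"] >= 100:
            anomalies += 1
        send(json.dumps(reading))
    return anomalies


# Dry run: collect the payloads in a list instead of sending them.
payloads = []
anomaly_count = generate(1000, payloads.append, seed=42)
```

To send real data, pass a function that wraps your client's put-record call in place of payloads.append.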

Analyzing the data

Now it’s time to analyze the IoT streaming data using Amazon Kinesis Analytics.

  1. Open the Amazon Kinesis Analytics console and create a new application. Give the application a name, and then choose Create Application.
  2. On the next screen, choose Connect to a source. Choose the raw incoming data stream that you created earlier. (Note the in-application stream name SOURCE_SQL_STREAM_001 because you will need it later.)
  3. Use the default settings for everything else. When the schema discovery process is complete, it displays a success message with the formatted stream sample in a table as shown in the following screenshot. Review the data, and then choose Save and continue.
  4. Next, choose Go to SQL editor. When prompted, choose Yes, start application.
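  5. Copy the following SQL code and paste it into the SQL editor window.
    CREATE OR REPLACE STREAM "TEMP_STREAM" (
       "iotName"        varchar (40),
       "iotValue"   integer,
       "ANOMALY_SCORE"  DOUBLE);
    -- Creates an output stream and defines a schema
    CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
       "iotName"       varchar(40),
       "iotValue"       integer,
       "ANOMALY_SCORE"  DOUBLE,
       "created" TimeStamp);
    -- Compute an anomaly score for each record in the source stream
    -- using Random Cut Forest
    CREATE OR REPLACE PUMP "STREAM_PUMP_1" AS INSERT INTO "TEMP_STREAM"
       SELECT STREAM "iotName", "iotValue", "ANOMALY_SCORE"
       FROM TABLE(RANDOM_CUT_FOREST(CURSOR(SELECT STREAM * FROM "SOURCE_SQL_STREAM_001")));
    -- Sort records by descending anomaly score, insert into output stream
    CREATE OR REPLACE PUMP "OUTPUT_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"  -- pump name assumed
       SELECT STREAM "iotName", "iotValue", "ANOMALY_SCORE", ROWTIME FROM "TEMP_STREAM"
       ORDER BY FLOOR("TEMP_STREAM".ROWTIME TO SECOND), "ANOMALY_SCORE" DESC;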


  6. Choose Save and run SQL.
    As the application is running, it displays the results as stream data arrives. If you don’t see any data coming in, run the Python script again to generate some fresh data. When there is data, it appears in a grid as shown in the following screenshot.
    Note that you are selecting data from the source stream SOURCE_SQL_STREAM_001 that you noted previously. Also note the ANOMALY_SCORE column. This is the value that the RANDOM_CUT_FOREST function calculates based on the temperature ranges provided by the Python script. Higher (anomaly) temperature ranges have a higher score.
    Looking at the SQL code, note that the first two blocks of code create two new streams to store temporary data and the final result. The third block of code defines a pump (STREAM_PUMP_1) that analyzes the raw source data using the RANDOM_CUT_FOREST function. It calculates an anomaly score (ANOMALY_SCORE) and inserts it into the TEMP_STREAM stream. The final code block loads the result stored in TEMP_STREAM into DESTINATION_SQL_STREAM.
  7. Choose Exit (done editing) next to the Save and run SQL button to return to the application configuration page.
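
To build some intuition for why readings in the 100–120 range get a higher ANOMALY_SCORE than readings in the 10–20 range, here is a rough sketch. It uses a much simpler stand-in, a z-score against the mean and standard deviation of the values, and is not the Random Cut Forest algorithm that Kinesis Analytics runs. The function name and the sample readings are made up for illustration.

```python
import statistics


def zscore_scores(values):
    """Score each value by its distance from the mean, in standard
    deviations. A simple stand-in for intuition, not Random Cut Forest."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        # All values are identical, so nothing stands out.
        return [0.0 for _ in values]
    return [abs(v - mean) / stdev for v in values]


# Nine normal readings and one anomaly (hypothetical sample data).
readings = [12, 15, 18, 11, 110, 14, 19, 13, 16, 17]
scores = zscore_scores(readings)

# Rank by descending score, mirroring the sort in the final SQL block.
ranked = sorted(zip(readings, scores), key=lambda pair: pair[1], reverse=True)
```

The reading of 110 ends up first in ranked because it sits far from the other values, which is the same qualitative behavior you see in the ANOMALY_SCORE column.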

Load processed data into the Kinesis Firehose delivery stream

Now, you can export the result from DESTINATION_SQL_STREAM into the Amazon Kinesis Firehose stream that you created previously.

  1. On the application configuration page, choose Connect to a destination.
  2. Choose the stream name that you created earlier, and use the default settings for everything else. Then choose Save and Continue.
  3. On the application configuration page, choose Exit to Kinesis Analytics applications to return to the Amazon Kinesis Analytics console.
  4. Run the Python script again for 4–5 minutes to generate enough data to flow through Amazon Kinesis Streams, Kinesis Analytics, Kinesis Firehose, and finally into the Amazon ES domain.
  5. Open the Kinesis Firehose console, choose the stream, and then choose the Monitoring tab.
  6. As the processed data flows into Kinesis Firehose and Amazon ES, the metrics appear on the Delivery Stream metrics page. Keep in mind that the metrics page takes a few minutes to refresh with the latest data.
  7. Open the Amazon Elasticsearch Service dashboard in the AWS Management Console. The count in the Searchable documents column increases as shown in the following screenshot. In addition, the domain shows a cluster health of Yellow. This is because, by default, it needs two instances to deploy redundant copies of the index. To fix this, you can deploy two instances instead of one.

Visualize the data using Kibana

Now it’s time to launch Kibana and visualize the data.

  1. Use the ES domain link to go to the cluster detail page, and then choose the Kibana link as shown in the following screenshot.

    If you’re working behind a proxy or firewall, see the “Use a proxy to simplify request signing” section in this blog post to learn how to work with a proxy.
  2. In the Kibana dashboard, choose the Discover tab to perform a query.
  3. You can also visualize the data using the different types of charts offered by Kibana. For example, by going to the Visualize tab, you can quickly create a split bar chart that aggregates by ANOMALY_SCORE per minute.
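
Kibana builds the query behind such a chart for you. For reference, an equivalent Elasticsearch aggregation request body might look like the following sketch. It assumes that the created field is indexed as a date and ANOMALY_SCORE as a number, which depends on your index mapping, and the aggregation names per_minute and max_score are arbitrary labels.

```python
import json

# Field names follow the output stream columns; "created" is assumed to be
# indexed as a date and "ANOMALY_SCORE" as a number.
per_minute_scores = {
    "size": 0,  # return only aggregation buckets, no individual hits
    "aggs": {
        "per_minute": {
            # One bucket per minute of the "created" timestamp.
            "date_histogram": {"field": "created", "interval": "minute"},
            "aggs": {
                # Highest anomaly score seen within each minute.
                "max_score": {"max": {"field": "ANOMALY_SCORE"}},
            },
        }
    },
}

# JSON body for a search request against the index you chose in Firehose.
request_body = json.dumps(per_minute_scores)
```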


In this post, you learned how to use Amazon Kinesis to collect, process, and analyze real-time streaming data, and then export the results to Amazon ES for analysis and visualization with Kibana. If you have comments about this post, add them to the “Comments” section below. If you have questions or issues with implementing this solution, please open a new thread on the Amazon Kinesis or Amazon ES discussion forums.

Next Steps

Take your skills to the next level. Learn real-time clickstream anomaly detection with Amazon Kinesis Analytics.


About the Author

Tristan Li is a Solutions Architect with Amazon Web Services. He works with enterprise customers in the US, helping them adopt cloud technology to build scalable and secure solutions on AWS.





New Power Bundle for Amazon WorkSpaces – More vCPUs, Memory, and Storage

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-power-bundle-for-amazon-workspaces-more-vcpus-memory-and-storage/

Are you tired of hearing me talk about Amazon WorkSpaces yet? I hope not, because we have a lot of customer-driven additions on the roadmap! Our customers in the developer and analyst community have been asking for a workstation-class machine that will allow them to take advantage of the low cost and flexibility of WorkSpaces. Developers want to run Visual Studio, IntelliJ, Eclipse, and other IDEs. Analysts want to run complex simulations and statistical analysis using MatLab, GNU Octave, R, and Stata.

New Power Bundle
Today we are extending the current set of WorkSpaces bundles with a new Power bundle. With four vCPUs, 16 GiB of memory, and 275 GB of storage (175 GB on the system volume and another 100 GB on the user volume), this bundle is designed to make developers, analysts, and me smile. You can launch WorkSpaces with this bundle in all of the usual ways: Console, CLI (create-workspaces), or API (CreateWorkSpaces).

One really interesting benefit to using a cloud-based virtual desktop for simulations and statistical analysis is the ease of access to data that’s already stored in the cloud. Analysts can mine and analyze petabytes of data stored in S3 that is effectively local (with respect to access time) to the WorkSpace. This low-latency access boosts productivity and also simplifies the use of other AWS data analysis tools such as Amazon Redshift, Amazon Redshift Spectrum, Amazon QuickSight, and Amazon Athena.

Like the existing bundles, the new Power bundle can be used in either billing configuration, AlwaysOn or AutoStop (read Amazon WorkSpaces Update – Hourly Usage and Expanded Root Volume to learn more). The bundle is available in all AWS Regions where WorkSpaces is available and you can launch one today! Visit the WorkSpaces Pricing page for pricing in your region.


New AWS Certification Specialty Exam for Big Data

Post Syndicated from Sara Snedeker original https://aws.amazon.com/blogs/big-data/new-aws-certification-specialty-exam-for-big-data/

AWS Certifications validate technical knowledge with an industry-recognized credential. Today, the AWS Certification team released the AWS Certified Big Data – Specialty exam. This new exam validates technical skills and experience in designing and implementing AWS services to derive value from data. The exam requires a current Associate AWS Certification and is intended for individuals who perform complex big data analyses.

Individuals who are interested in sitting for this exam should know how to do the following:

  • Implement core AWS big data services according to basic architectural best practices
  • Design and maintain big data
  • Leverage tools to automate data analysis

To prepare for the exam, we recommend the Big Data on AWS course, plus AWS whitepapers and documentation that are focused on big data.

This credential can help you stand out from the crowd, get recognized, and provide more evidence of your unique technical skills.

The AWS Certification team also released an AWS Certified Advanced Networking – Specialty exam and new AWS Certification Benefits. You can read more about these new releases on the AWS Blog.

Have more questions about AWS Certification? See our AWS Certification FAQ.

Amazon QuickSight Now Supports Federated Single Sign-On Using SAML 2.0

Post Syndicated from Jose Kunnackal original https://aws.amazon.com/blogs/big-data/amazon-quicksight-now-supports-federated-single-sign-on-using-saml-2-0/

Since launch, Amazon QuickSight has enabled business users to quickly and easily analyze data from a wide variety of data sources with superfast visualization capabilities enabled by SPICE (Superfast, Parallel, In-memory Calculation Engine). When setting up Amazon QuickSight access for business users, administrators have a choice of authentication mechanisms. These include Amazon QuickSight–specific credentials, AWS credentials, or in the case of Amazon QuickSight Enterprise Edition, existing Microsoft Active Directory credentials. Although each of these mechanisms provides a reliable, secure authentication process, they all require end users to input their credentials every time they log in to Amazon QuickSight. In addition, the invitation model for user onboarding currently in place requires administrators to add users to Amazon QuickSight accounts either via email invitations or via AD-group membership, which can contribute to delays in user provisioning.

Today, we are happy to announce two new features that will make user authentication and provisioning simpler – Federated Single-Sign-On (SSO) and just-in-time (JIT) user creation.

Federated Single Sign-On

Federated SSO authentication to web applications (including the AWS Management Console) and Software-as-a-Service products has become increasingly popular, because Federated SSO lets organizations consolidate end-user authentication to external applications.

Traditionally, SSO involves the use of a centralized identity store (such as Active Directory or LDAP) to authenticate the user against applications within a corporate network. The growing popularity of SaaS and web applications created the need to authenticate users outside corporate networks. Federated SSO makes this scenario possible. It provides a mechanism for external applications to direct authentication requests to the centralized identity store and receive an authentication token back with the response and validity. SAML is the most common protocol used as a basis for Federated SSO capabilities today.

With Federated SSO in place, business users sign in to their Identity Provider portals with existing credentials and access QuickSight with a single click, without having to enter any QuickSight-specific passwords or account names. This makes it simple for users to access Amazon QuickSight for data analysis needs.

Federated SSO also enables administrators to impose additional security requirements for Amazon QuickSight access (through the identity provider portal) depending on details such as where the user is accessing from or what device is used for access. This access control lets administrators comply with corporate policies regarding data access and also enforce additional security for sensitive data handling in Amazon QuickSight.

Setting up federated authentication in Amazon QuickSight is straightforward. You follow the same sequence of steps you would to set up federated access for the AWS Management Console and then set up redirection to ensure that users land directly on Amazon QuickSight.

Let’s take a look at how this works. The following diagram illustrates the authentication flow between Amazon QuickSight and a third-party identity provider with Federated SSO in place with SAML 2.0.

  1. The Amazon QuickSight user browses to the organization’s identity provider portal, and authenticates using existing credentials.
  2. The federation service requests user authentication from the organization’s identity store, based on credentials provided.
  3. The identity store authenticates the user, and returns the authentication response to the federation service.
  4. The federation service posts the SAML assertion to the user’s browser.
  5. The user’s browser posts the SAML assertion to the AWS Sign-In SAML endpoint. AWS Sign-In processes the SAML request, authenticates the user, and forwards the authentication token to Amazon QuickSight.
  6. Amazon QuickSight uses the authentication token from AWS Sign-In, and authorizes user access.

Federated SSO using SAML 2.0 is now available for Amazon QuickSight Standard Edition, with support for Enterprise Edition coming shortly. You can enable federated access by using any identity provider compliant with SAML 2.0. These identity providers include Microsoft Active Directory Federation Services, Okta, Ping Identity, and Shibboleth. To set up your Amazon QuickSight account for Federated SSO, follow the guidance here.

Just-in-time user creation

With this release, we are also launching a new permissions-based user provisioning model in Amazon QuickSight. Administrators can use the existing AWS permissions management mechanisms in place to enable Amazon QuickSight permissions for their users. Once these required permissions are in place, users can onboard themselves to QuickSight without any additional administrator intervention. This approach simplifies user provisioning and enables onboarding of thousands of users by simply granting the right permissions.

Administrators can choose to assign either of the permissions below, which will result in the user being able to sign up to QuickSight either as a user or an administrator.
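
As a minimal sketch of what granting such a permission could look like, the following builds an IAM policy document for either case. It assumes the IAM actions quicksight:CreateUser and quicksight:CreateAdmin; the function name is hypothetical, and the wildcard Resource is used only for brevity, so scope it down for real use.

```python
import json


def quicksight_signup_policy(as_admin=False):
    """Return an IAM policy document that lets the holder sign up to
    QuickSight as a user (default) or as an administrator."""
    action = "quicksight:CreateAdmin" if as_admin else "quicksight:CreateUser"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [action],
                # Assumption: wildcard for brevity; restrict in production.
                "Resource": "*",
            }
        ],
    }


user_policy_json = json.dumps(quicksight_signup_policy(), indent=2)
admin_policy_json = json.dumps(quicksight_signup_policy(as_admin=True), indent=2)
```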


If you have an AWS account that is already signed up for QuickSight, and you would like to add yourself as a new user, add one of the permissions above and access https://quicksight.aws.amazon.com.

You will see a screen that requests your email address. Once you provide this, you will be added to the QuickSight account as a user or administrator, as specified by your permissions!

Switch to a Federated SSO user: If you are already an Amazon QuickSight Standard Edition user using authentication based on user name and password, and you want to switch to using Federated SSO, follow these steps:

  1. Sign in using the Federated SSO option to the AWS Management console as you do today. Ensure that you have the permissions for QuickSight user/admin creation assigned to you.
  2. Access https://quicksight.aws.amazon.com.
  3. Provide your email address, and sign up for Amazon QuickSight as an Amazon QuickSight user or admin.
  4. Delete the existing Amazon QuickSight user that you no longer want to use.
  5. Assign resources and data to the new role-based user from step 1. (Amazon QuickSight will prompt you to do this when you delete a user. For more information, see Deleting a User Account.)
  6. Continue as the new, role-based user.

Learn more

To learn more about these capabilities and start using them with your identity provider, see [Managing-SSO-user-guide-topic] in the Amazon QuickSight User Guide.

Stay engaged

If you have questions and suggestions, you can post them on the Amazon QuickSight Discussion Forum.

Not an Amazon QuickSight user?

See the Amazon QuickSight page to get started for free.



Amazon Elasticsearch Service support for Elasticsearch 5.1

Post Syndicated from Tara Walker original https://aws.amazon.com/blogs/aws/amazon-elasticsearch-service-support-for-es-5-1/

The Amazon Elasticsearch Service is a fully managed service that provides easier deployment, operation, and scale for the Elasticsearch open-source search and analytics engine. We are excited to announce that Amazon Elasticsearch Service now supports Elasticsearch 5.1 and Kibana 5.1.

Elasticsearch 5 comes with a ton of new features and enhancements that customers can now take advantage of in Amazon Elasticsearch Service. Elements of the Elasticsearch 5 release are as follows:

  • Indexing performance: Improved Indexing throughput with updates to lock implementation & async translog fsyncing
  • Ingestion Pipelines: Incoming data can be sent to a pipeline that applies a series of ingestion processors, allowing transformation to the exact data you want to have in your search index. There are twenty processors included, from simple appending to complex regex applications
  • Painless scripting: Amazon Elasticsearch Service supports Painless, a new secure and performant scripting language for Elasticsearch 5. You can use scripting to change the precedence of search results, delete index fields by query, modify search results to return specific fields, and more.
  • New data structures: Lucene 6 data structures, new data types; half_float, text, keyword, and more complete support for dots-in-fieldnames
  • Search and Aggregations: Refactored search API, BM25 relevance calculations, Instant Aggregations, improvements to histogram aggregations & terms aggregations, and rewritten percolator & completion suggester
  • User experience: Strict settings and body & query string parameter validation, index management improvement, default deprecation logging, new shard allocation API, and new indices efficiency pattern for rollover & shrink APIs
  • Java REST client: simple HTTP/REST Java client that works with Java 7 and handles retry on node failure, as well as round-robin, sniffing, and logging of requests
  • Other improvements: Lazy unicast hosts DNS lookup, automatic parallel tasking of reindex, update-by-query, delete-by-query, and search cancellation by task management API

The compelling new enhancements of Elasticsearch 5 are meant to make the service faster and easier to use while providing better security. Amazon Elasticsearch Service is a managed service designed to aid customers in building, developing and deploying solutions with Elasticsearch by providing the following capabilities:

  • Multiple configurations of instance types
  • Amazon EBS volumes for data storage
  • Cluster stability improvement with dedicated master nodes
  • Zone awareness – Cluster node allocation across two Availability Zones in the region
  • Access Control & Security with AWS Identity and Access Management (IAM)
  • Various geographical locations/regions for resources
  • Amazon Elasticsearch domain snapshots for replication, backup and restore
  • Integration with Amazon CloudWatch for monitoring Amazon Elasticsearch domain metrics
  • Integration with AWS CloudTrail for configuration auditing
  • Integration with other AWS Services like Kinesis Firehose and DynamoDB for loading of real-time streaming data into Amazon Elasticsearch Service

Amazon Elasticsearch Service allows dynamic changes with zero downtime. You can add instances, remove instances, change instance sizes, change storage configuration, and make other changes dynamically.

The best way to highlight some of the aforementioned capabilities is with an example.

During a presentation at the IT/Dev conference, I demonstrated how to build a serverless employee onboarding system using Express.js, AWS Lambda, Amazon DynamoDB, and Amazon S3. In the demo, the information collected was personnel data stored in DynamoDB about an employee going through a fictional onboarding process. Imagine if the collected employee data could be searched, queried, and analyzed as needed by the company’s HR department. We can easily augment the onboarding system to add these capabilities by enabling the employee table to use DynamoDB Streams to trigger Lambda and store the desired employee attributes in Amazon Elasticsearch Service.

The result is the following solution architecture:

We will focus solely on how to dynamically store and index employee data to Amazon Elasticsearch Service each time an employee record is entered and subsequently stored in the database.
To add this enhancement to the existing onboarding solution, we will implement the solution as shown in the detailed cloud architecture diagram below:

Let’s look at how to implement the employee load process to the Amazon Elasticsearch Service, which is the first process flow shown in the diagram above.

Amazon Elasticsearch Service: Domain Creation

Let’s now visit the AWS Console to check out Amazon Elasticsearch Service with Elasticsearch 5 in action. As you probably guessed, from the AWS Console home, we select Elasticsearch Service under the Analytics group.

The first step in creating an Elasticsearch solution is to create a domain. When creating an Amazon Elasticsearch Service domain, you now have the option to choose Elasticsearch version 5.1. Since we are discussing the launch of support for Elasticsearch 5, we will, of course, choose the 5.1 Elasticsearch engine version when creating our domain in the Amazon Elasticsearch Service.

After clicking Next, we will now set up our Elasticsearch domain by configuring our instance and storage settings. The instance type and the number of instances for your cluster should be determined based upon your application’s availability, network volume, and data needs. A recommended best practice is to choose two or more instances in order to avoid possible data inconsistencies or split brain failure conditions with Elasticsearch. Therefore, I will choose two instances/data nodes for my cluster and set up EBS as my storage device.

To understand how many instances you will need for your specific application, please review the blog post, Get Started with Amazon Elasticsearch Service: How Many Data Instances Do I Need, on the AWS Database blog.

All that is left for me is to set up the access policy and deploy the service. Once I create my service, the domain will be initialized and deployed.

Now that I have my Elasticsearch service running, I need a mechanism to populate it with data. I will implement a dynamic data load process of the employee data to Amazon Elasticsearch Service using DynamoDB Streams.

Amazon DynamoDB: Table and Streams

Before I head to the DynamoDB console, I will quickly cover the basics.

Amazon DynamoDB is a scalable, distributed NoSQL database service. DynamoDB Streams provide an ordered, time-based sequence of every CRUD operation to the items in a DynamoDB table. Each stream record has information about the primary attribute modification for an individual item in the table. Streams execute asynchronously and can write stream records in practically real time. Additionally, a stream can be enabled when a table is created or can be enabled and modified on an existing table. You can learn more about DynamoDB Streams in the DynamoDB developer guide.

Now we will head to the DynamoDB console and view the OnboardingEmployeeData table.

This table has a primary partition key, UserID, that is a string data type and a primary sort key, Username, which is also of a string data type. We will use the UserID as the document ID in Elasticsearch. You will also notice that on this table, streams are enabled and the stream view type is New image. A stream that is set to a New image view type will have stream records that display the entire item record after it has been updated. You also have the option to have the stream present records that provide data items before modification, provide only the items’ key attributes, or provide old and new item information.  If you opt to use the AWS CLI to create your DynamoDB table, the key information to capture is the Latest Stream ARN shown underneath the Stream Details section. A DynamoDB stream has a unique ARN identifier that is outside of the ARN of the DynamoDB table. The stream ARN will be needed to create the IAM policy for access permissions between the stream and the Lambda function.
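
A stream record with the New image view type carries the item in DynamoDB's typed attribute format, for example {"S": "..."} for a string. The following sketch shows one way to turn such a record into a plain document keyed by UserID. It is an illustration only: the helper names are hypothetical, and the BadgeIssued and Floor attributes are invented sample fields that do not come from the onboarding table.

```python
def unmarshal(attr):
    """Convert one DynamoDB-typed attribute value to a plain Python value."""
    (type_tag, value), = attr.items()
    if type_tag == "S":
        return value
    if type_tag == "N":
        # Numbers arrive as strings; prefer int, fall back to float.
        try:
            return int(value)
        except ValueError:
            return float(value)
    if type_tag == "BOOL":
        return value
    if type_tag == "NULL":
        return None
    if type_tag == "L":
        return [unmarshal(v) for v in value]
    if type_tag == "M":
        return {k: unmarshal(v) for k, v in value.items()}
    raise ValueError("unsupported DynamoDB type: " + type_tag)


def to_document(record):
    """Return (document_id, document) for a stream record with a NewImage."""
    image = record["dynamodb"]["NewImage"]
    doc = {name: unmarshal(attr) for name, attr in image.items()}
    return doc["UserID"], doc


# Hypothetical stream record for illustration.
sample_record = {
    "eventName": "INSERT",
    "dynamodb": {
        "Keys": {"UserID": {"S": "u-001"}, "Username": {"S": "jdoe"}},
        "NewImage": {
            "UserID": {"S": "u-001"},
            "Username": {"S": "jdoe"},
            "BadgeIssued": {"BOOL": False},
            "Floor": {"N": "3"},
        },
    },
}
doc_id, document = to_document(sample_record)
```

With a document ID in hand, repeated updates to the same employee can overwrite one Elasticsearch document instead of creating a new one each time.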

IAM Policy

The first thing that is essential for any service implementation is getting the correct permissions in place. Therefore, I will first go to the IAM console to create a role and a policy for my Lambda function that will provide permissions for DynamoDB and Elasticsearch.

First, I will create a policy based upon an existing managed policy for Lambda execution with DynamoDB Streams.

This will take us to the Review Policy screen, which will have the selected managed policy details. I’ll name this policy, Onboarding-LambdaDynamoDB-toElasticsearch, and then customize the policy for my solution. The first thing you should notice is that the current policy allows access to all streams. However, the best practice is to have this policy access only the specific DynamoDB stream by adding the Latest Stream ARN. Hence, I will alter the policy, add the stream ARN for the DynamoDB table, OnboardingEmployeeData, and validate the policy. The altered policy is as shown below.

The only thing left is to add the Amazon Elasticsearch Service permissions in the policy. The core policy for Amazon Elasticsearch Service access permissions is as shown below:


I will use this policy and add the specific Elasticsearch domain ARN as the Resource for the policy. This ensures that I have a policy that enforces the Least Privilege security best practice for policies. With the Amazon Elasticsearch Service domain added as shown, I can validate and save the policy.

The best way to create a custom policy is to use the IAM Policy Simulator or view the examples of the AWS service permissions from the service documentation. You can also find some examples of policies for a subset of AWS Services here. Remember that you should add only the ES permissions that are needed, following the Least Privilege security best practice. The policy shown above is used only as an example.
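
To make the shape of such a least-privilege policy concrete, here is a sketch that builds one as a Python dictionary. It is not the exact policy used in this post. The DynamoDB and CloudWatch Logs actions follow the managed Lambda execution policy for DynamoDB Streams, es:ESHttpPost covers the POST that the Lambda function makes, and both ARNs below are made-up placeholders that you would replace with your own.

```python
import json


def lambda_policy(stream_arn, es_domain_arn):
    """Build a policy limited to one DynamoDB stream and one ES domain."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Read records from the one stream only.
                "Effect": "Allow",
                "Action": [
                    "dynamodb:DescribeStream",
                    "dynamodb:GetRecords",
                    "dynamodb:GetShardIterator",
                ],
                "Resource": stream_arn,
            },
            {   # Listing streams is not tied to a single stream.
                "Effect": "Allow",
                "Action": "dynamodb:ListStreams",
                "Resource": "*",
            },
            {   # Let the function write its CloudWatch logs.
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                ],
                "Resource": "*",
            },
            {   # Allow HTTP POST requests to the one ES domain only.
                "Effect": "Allow",
                "Action": ["es:ESHttpPost"],
                "Resource": es_domain_arn + "/*",
            },
        ],
    }


# Placeholder ARNs for illustration only.
policy = lambda_policy(
    "arn:aws:dynamodb:us-east-1:123456789012:table/OnboardingEmployeeData/stream/2017-01-01T00:00:00.000",
    "arn:aws:es:us-east-1:123456789012:domain/onboarding-demo",
)
policy_json = json.dumps(policy, indent=2)
```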

We will create the role for our Lambda function to use to grant access and attach the aforementioned policy to the role.

AWS Lambda: DynamoDB triggered Lambda function

AWS Lambda is the core of Amazon Web Services’ serverless computing offering. With Lambda, you can write and run code using supported languages for almost any type of application or backend service. Lambda will run your code in response to events from AWS services or from HTTP requests. Lambda will dynamically scale based upon workload and you only pay for your code execution.

We will have DynamoDB streams trigger a Lambda function that will create an index and send data to Elasticsearch. Another option for this is to use the Logstash plugin for DynamoDB. However, since several of the Logstash processors are now included in Elasticsearch 5.1 core and with the improved performance optimizations, I will opt to use Lambda to process my DynamoDB stream and load data to Amazon Elasticsearch Service.
Now let us head over to the AWS Lambda console and create the Lambda function for loading employee data to Amazon Elasticsearch Service.

Once in the console, I will create a new Lambda function by selecting the Blank Function blueprint that will take me to the Configure Trigger page. Once on the trigger page, I will select DynamoDB as the AWS service which will trigger Lambda, and I provide the following trigger related options:

  • Table: OnboardingEmployeeData
  • Batch size: 100 (default)
  • Starting position: Trim Horizon

I choose Next, and I am on the Configure Function screen. The name of my function will be ESEmployeeLoad and I will write this function in Node.js 4.3.

The Lambda function code is as follows:

var AWS = require('aws-sdk');
var path = require('path');

//Object for all the ElasticSearch Domain Info
var esDomain = {
    region: process.env.RegionForES,
    endpoint: process.env.EndpointForES,
    index: process.env.IndexForES,
    doctype: 'onboardingrecords'
};
//AWS Endpoint from created ES Domain Endpoint
var endpoint = new AWS.Endpoint(esDomain.endpoint);
//The AWS credentials are picked up from the environment.
var creds = new AWS.EnvironmentCredentials('AWS');

console.log('Loading function');
//Handler: posts each DynamoDB stream record in the batch to Elasticsearch
exports.handler = (event, context, callback) => {
    //console.log('Received event:', JSON.stringify(event, null, 2));
    event.Records.forEach((record) => {
        console.log('DynamoDB Record: %j', record.dynamodb);
        var dbRecord = JSON.stringify(record.dynamodb);
        postToES(dbRecord, context, callback);
    });
};

//Signs the request with Signature Version 4 and POSTs the document to the index
function postToES(doc, context, lambdaCallback) {
    var req = new AWS.HttpRequest(endpoint);

    req.method = 'POST';
    req.path = path.join('/', esDomain.index, esDomain.doctype);
    req.region = esDomain.region;
    req.headers['presigned-expires'] = false;
    req.headers['Host'] = endpoint.host;
    req.body = doc;

    var signer = new AWS.Signers.V4(req , 'es');  // es: service code
    signer.addAuthorization(creds, new Date());

    var send = new AWS.NodeHttpClient();
    send.handleRequest(req, null, function(httpResp) {
        var respBody = '';
        httpResp.on('data', function (chunk) {
            respBody += chunk;
        });
        httpResp.on('end', function (chunk) {
            console.log('Response: ' + respBody);
            lambdaCallback(null,'Lambda added document ' + doc);
        });
    }, function(err) {
        console.log('Error: ' + err);
        lambdaCallback('Lambda failed with error ' + err);
    });
}

The Lambda function reads three environment variables: RegionForES, EndpointForES, and IndexForES.

I will select an Existing role option and choose the ESOnboardingSystem IAM role I created earlier.

Upon completing my IAM role permissions for the Lambda function, I can review the Lambda function details and complete the creation of ESEmployeeLoad function.

I have completed the process of building my Lambda function to talk to Elasticsearch, and now I test my function by simulating data changes to my database.
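One way to simulate a change without touching the table is to pass the function a hand-built event shaped like a DynamoDB Streams batch. The sketch below builds such an event and repeats the handler's serialization step (JSON.stringify of record.dynamodb) so the documents can be inspected locally. The attribute names EmployeeId and Name are invented for illustration, and nothing here calls AWS.

```javascript
// Build a minimal event shaped like a DynamoDB Streams batch.
// The attribute names (EmployeeId, Name) are invented for illustration.
function buildSampleEvent() {
    return {
        Records: [{
            eventName: 'INSERT',
            eventSource: 'aws:dynamodb',
            dynamodb: {
                Keys: { EmployeeId: { S: '1001' } },
                NewImage: {
                    EmployeeId: { S: '1001' },
                    Name: { S: 'Example Employee' }
                },
                StreamViewType: 'NEW_IMAGE'
            }
        }]
    };
}

// Mirror what the handler does with each record: serialize record.dynamodb.
function documentsFromEvent(event) {
    return event.Records.map(function (record) {
        return JSON.stringify(record.dynamodb);
    });
}

var docs = documentsFromEvent(buildSampleEvent());
console.log(docs.length + ' document(s) would be posted');
```

The same sample event could be pasted into the Lambda console as a test event.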

Now my function, ESEmployeeLoad, will execute upon changes to the data in my database from my onboarding system. Additionally, I can review the processing of the Lambda function to Elasticsearch by reviewing the CloudWatch logs.

Now I can alter my Lambda function to take advantage of the new features, or go directly to Elasticsearch and use the new Ingest Node. An example of this would be to implement a pipeline for my employee record documents.
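As a rough sketch of what such a pipeline could look like, the snippet below builds a request for the Elasticsearch ingest API (PUT _ingest/pipeline/<id>). The pipeline ID, the field name, and the choice of the built-in uppercase processor are illustrative assumptions and are not part of the onboarding system described here.

```javascript
// Describe a PUT _ingest/pipeline/<id> request as a plain object.
// pipelineId and fieldName are hypothetical values chosen by the caller.
function buildPipelineRequest(pipelineId, fieldName) {
    return {
        method: 'PUT',
        path: '/_ingest/pipeline/' + pipelineId,
        body: JSON.stringify({
            description: 'Normalize employee records before indexing',
            processors: [
                { uppercase: { field: fieldName } }  // built-in ingest processor
            ]
        })
    };
}

var pipelineReq = buildPipelineRequest('employee-pipeline', 'lastname');
console.log(pipelineReq.method + ' ' + pipelineReq.path);
```

Documents indexed with the pipeline query parameter set to that ID would then pass through the processor before being stored.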

I can replicate this function for handling the badge updates to the employee record, and/or leverage other preprocessors against the employee data. For instance, if I wanted to do a search of data based upon a data parameter in the Elasticsearch document, I could use the Search API and get records from the dataset.
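A search of that kind might be expressed as follows. This is a minimal sketch that builds a match query for the Search API (the _search endpoint); the index name, field name, and date value are hypothetical, and the request would still need to be signed and sent as in the Lambda function.

```javascript
// Describe a search request against the Elasticsearch Search API.
// index, field, and value are supplied by the caller; the ones used below are made up.
function buildSearchRequest(index, field, value) {
    var match = {};
    match[field] = value;
    return {
        method: 'POST',
        path: '/' + index + '/_search',
        body: JSON.stringify({ query: { match: match } })
    };
}

var searchReq = buildSearchRequest('employees', 'startdate', '2017-01-30');
console.log(searchReq.path + ' ' + searchReq.body);
```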

The possibilities are endless, and you can get as creative as your data needs dictate while maintaining great performance.

Amazon Elasticsearch Service: Kibana 5.1

All Amazon Elasticsearch Service domains using Elasticsearch 5.1 are bundled with Kibana 5.1, the latest version of the open-source visualization tool.

The companion visualization and analytics platform, Kibana, has also been enhanced in the Kibana 5.1 release. Kibana is used to view, search, and interact with Elasticsearch data through a myriad of different charts, tables, and maps. In addition, Kibana performs advanced data analysis on large volumes of data. Key enhancements of the Kibana release are as follows:

  • Visualization tool new design: updated color scheme and maximization of screen real estate
  • Timelion: visualization tool with a time-based query DSL
  • Console: formerly known as Sense, now part of the core, using the same configuration for free-form requests to Elasticsearch
  • Scripted field language: ability to use the new Painless scripting language in the Elasticsearch cluster
  • Tag Cloud Visualization: 5.1 adds a word-based graphical view of data sized by importance
  • More Charts: return of previously removed charts and addition of advanced view for X-Pack
  • Profiler UI: provides an enhancement to the Profile API with a tree view
  • Rendering performance improvement: Discover performance fixes, decrease of CPU load


As you can see, this release is expansive, with many enhancements to assist customers in building Elasticsearch solutions. Amazon Elasticsearch Service now supports 15 new Elasticsearch APIs and 6 new plugins. Amazon Elasticsearch Service supports the following operations for Elasticsearch 5.1:

You can read more about the supported operations for Elasticsearch in the Amazon Elasticsearch Developer Guide, and you can get started by visiting the Amazon Elasticsearch Service website or signing in to the AWS Management Console.



The state of Jupyter (O’Reilly)

Post Syndicated from corbet original http://lwn.net/Articles/712677/rss

Here’s an
O’Reilly article
describing the Jupyter project and what it has
Project Jupyter aims to create an ecosystem of open source tools for
interactive computation and data analysis, where the direct participation
of humans in the computational loop—executing code to understand a problem
and iteratively refine their approach—is the primary consideration.

New – GPU-Powered Amazon Graphics WorkSpaces

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-gpu-powered-amazon-graphics-workspaces/

As you can probably tell from my I Love My Amazon WorkSpace post I am kind of a fan-boy!

Since writing that post I have found out that I am not alone, and that there are many other WorkSpaces fan-boys and fan-girls out there. Many AWS customers are enjoying their fully managed, secure desktop computing environments almost as much as I am. From their perspective as users, they like to be able to access their WorkSpace from a multitude of supported devices including Windows and Mac computers, PCoIP Zero Clients, Chromebooks, iPads, Fire tablets, and Android tablets. As administrators, they appreciate the ability to deploy high-quality cloud desktops for any number of users. And, finally, as business leaders they like the ability to pay hourly or monthly for the WorkSpaces that they launch.

New Graphics Bundle
These fans already have access to several different hardware choices: the Value, Standard, and Performance bundles. With 1 or 2 vCPUs (virtual CPUs) and 2 to 7.5 GiB of memory, these bundles are a good fit for many office productivity use cases.

Today we are expanding the WorkSpaces family by adding a new GPU-powered Graphics bundle. This bundle offers a high-end virtual desktop that is a great fit for 3D application developers, 3D modelers, and engineers that use CAD, CAM, or CAE tools at the office. Here are the specs:

  • Display – NVIDIA GPU with 1,536 CUDA cores and 4 GiB of graphics memory.
  • Processing – 8 vCPUs.
  • Memory – 15 GiB.
  • System volume – 100 GB.
  • User volume – 100 GB.

This new bundle is available in all regions where WorkSpaces currently operates, and can be used with any of the devices that I mentioned above. You can run the license-included operating system (Windows Server 2008 with Windows 7 Desktop Experience), or you can bring your own licenses for Windows 7 or 10. Applications that make use of OpenGL 4.x, DirectX, CUDA, OpenCL, and the NVIDIA GRID SDK will be able to take advantage of the GPU.

As you start to think about your petabyte-scale data analysis and visualization, keep in mind that these instances are located just light-feet away from EC2, RDS, Amazon Redshift, S3, and Kinesis. You can do your compute-intensive analysis server-side, and then render it in a visually compelling way on an adjacent WorkSpace. I am highly confident that you can use this combination of AWS services to create compelling applications that would simply not be cost-effective or achievable in any other way.

There is one important difference between the Graphics Bundle and the other bundles. Due to the way that the underlying hardware operates, WorkSpaces that run this bundle do not save the local state (running applications and open documents) when used in conjunction with the AutoStop running mode that I described in my Amazon WorkSpaces Update – Hourly Usage and Expanded Root Volume post. We recommend saving open documents and closing applications before disconnecting from your WorkSpace or stepping away from it for an extended period of time.

I don’t build 3D applications or use CAD, CAM, or CAE tools. However, I do like to design and build cool things with LEGO® bricks! I fired up the latest version of LEGO Digital Designer (LDD) and spent some time enhancing a design. Although I was not equipped to do any benchmarks, the GPU-enhanced version definitely ran more quickly and produced a higher quality finished product. Here’s a little design study I’ve been working on:

With my design all set up it was time to start building. Instead of trying to re-position my monitor so that it would be visible from my building table, I simply logged in to my Graphics WorkSpace from my Fire tablet. I was able to scale and rotate my design very quickly, even though I had very modest local computing power. Here’s what I saw on my Fire:

As you can see, the two screens (desktop and Fire) look identical! I stepped over to my building table and was able to set things up so that I could see my design and find my bricks:

Graphics WorkSpaces are available with an hourly billing option. You pay a small, fixed monthly fee to cover infrastructure costs and storage, and an hourly rate for each hour that the WorkSpace is used during the month. Prices start at $22/month + $1.75 per hour in the US East (Northern Virginia) Region; see the WorkSpaces Pricing page for more information.
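To get a feel for the hourly billing model, the arithmetic can be sketched as a fixed fee plus a usage charge. The function below uses the US East prices quoted above; the function name and the 40-hour usage figure are just an example.

```javascript
// Hourly billing: fixed monthly fee plus an hourly rate (US East prices quoted above).
var MONTHLY_FEE = 22.00;
var HOURLY_RATE = 1.75;

function monthlyCost(hoursUsed) {
    return MONTHLY_FEE + HOURLY_RATE * hoursUsed;
}

console.log('$' + monthlyCost(40).toFixed(2));  // 40 hours of use in a month
```

For 40 hours of use, that works out to $22 + 40 × $1.75 = $92 for the month.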



Readmission Prediction Through Patient Risk Stratification Using Amazon Machine Learning

Post Syndicated from Ujjwal Ratan original https://blogs.aws.amazon.com/bigdata/post/Tx1Z7AR9QTXIWA1/Readmission-Prediction-Through-Patient-Risk-Stratification-Using-Amazon-Machine

Ujjwal Ratan is a Solutions Architect with Amazon Web Services

The Hospital Readmission Reduction Program (HRRP) was included as part of the Affordable Care Act to improve quality of care and lower healthcare spending. A patient visit to a hospital may be classified as a readmission if the patient in question is admitted to a hospital within 30 days after being discharged from an earlier hospital stay. This should be easy to measure, right? Wrong.

Unfortunately, it gets more complicated than this. Not all readmissions can be prevented, as some of them are part of an overall care plan for the patient. There are also factors beyond the hospital’s control that may cause a readmission. The Centers for Medicare & Medicaid Services (CMS) recognized the complexities of measuring readmission rates and came up with a set of measures to evaluate providers.
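The naive 30-day rule from the definition above can be sketched in a few lines. This is only the simple date check, assuming both dates are known; it deliberately ignores planned readmissions and the other adjustments that the CMS measures account for, and the function name is made up for illustration.

```javascript
var MS_PER_DAY = 24 * 60 * 60 * 1000;

// Naive rule: a stay counts as a readmission when the patient is admitted
// within 30 days after the previous discharge. Planned readmissions and
// other exclusions are deliberately ignored here.
function isReadmission(previousDischarge, nextAdmission) {
    var days = (nextAdmission.getTime() - previousDischarge.getTime()) / MS_PER_DAY;
    return days >= 0 && days <= 30;
}

var discharge = new Date('2016-01-01T00:00:00Z');
console.log(isReadmission(discharge, new Date('2016-01-20T00:00:00Z')));  // 19 days later
console.log(isReadmission(discharge, new Date('2016-03-01T00:00:00Z')));  // 60 days later
```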

There is still a long way to go for hospitals to be effective in preventing unplanned readmissions. Recognizing factors affecting readmissions is an important first step, but it is also important to draw out patterns in readmission data by aggregating information from multiple clinical and non-clinical hospital systems.

Moreover, most analysis algorithms rely on financial data which omit the clinical nuances applicable to a readmission pattern. The data sets contain a lot of redundant information like patient demographics and historical data. All this creates a massive data analysis challenge that may take months to solve using conventional means.

In this post, I show how to apply advanced analytics concepts like pattern analysis and machine learning to do risk stratification for patient cohorts.

The role of Amazon ML

There have been multiple global scientific studies on scalable models for predicting readmissions with high accuracy. Some of them, like comparison of models for predicting early hospital readmissions and predicting hospital readmissions in the Medicare population, are great examples.

Readmission records demonstrate patterns in data that can be used in a prediction algorithm. These patterns can be separated as outliers that are used to identify patient cohorts with high risk. Attribute correlation helps to identify the significant features that affect readmission risk in a patient. This risk stratification in patients is enabled by categorizing patient attributes into numerical, categorical, and text attributes and applying statistical methods like standard deviation, median analysis, and the chi-squared test. These data sets are used to build statistical models to identify patients demonstrating certain characteristics consistent with readmissions so necessary steps can be taken to prevent them.
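As an illustration of one of these methods, the following sketch computes the Pearson chi-squared statistic for a contingency table of a categorical attribute against the readmission outcome. The counts are made up, and the function only returns the statistic; deciding significance would still require comparing it against a chi-squared distribution with the appropriate degrees of freedom.

```javascript
// Pearson chi-squared statistic for a contingency table given as rows of counts.
function chiSquared(table) {
    var rowTotals = table.map(function (row) {
        return row.reduce(function (a, b) { return a + b; }, 0);
    });
    var colTotals = table[0].map(function (_, j) {
        return table.reduce(function (sum, row) { return sum + row[j]; }, 0);
    });
    var total = rowTotals.reduce(function (a, b) { return a + b; }, 0);

    var statistic = 0;
    table.forEach(function (row, i) {
        row.forEach(function (observed, j) {
            // Expected count if the attribute and the outcome were independent
            var expected = rowTotals[i] * colTotals[j] / total;
            statistic += Math.pow(observed - expected, 2) / expected;
        });
    });
    return statistic;
}

// Made-up counts: rows = attribute present/absent, columns = readmitted yes/no.
var counts = [[10, 20], [30, 40]];
console.log(chiSquared(counts).toFixed(4));
```

A larger statistic suggests a stronger association between the attribute and readmission, which is the kind of signal used to pick significant features.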

Amazon Machine Learning (Amazon ML) provides visual tools and wizards that guide users in creating complex ML models in minutes. You can also interact with it using the AWS CLI and API to integrate the power of ML with other applications. Based on the chosen target attribute in Amazon ML, you can build ML models like a binary classification model that predicts between states of 0 or 1 or a numeric regression model that predicts numerical values based on certain correlated attributes.

Creating an ML model for readmission prediction

The following diagram represents a reference architecture for building a scalable ML platform on AWS.

  1. The first step is to get the data into Amazon S3, the object storage service from AWS.
  2. Amazon Redshift acts as the database for the huge amounts of structured clinical data. The data is loaded into Amazon Redshift tables and is massaged to make it more meaningful as a data source for an ML model.
  3. A binary classification ML model is created using Amazon ML, with Amazon Redshift as the data source. A real-time endpoint is also created to allow real-time querying for the ML model.
  4. Amazon Cognito is used for secure federated access to the Amazon ML real-time endpoint.
  5. A static website is created on S3. This website hosts the end-user-facing application, which queries the Amazon ML endpoint in real time.

The architecture above is just one of the ways in which you can use AWS for building machine learning applications. You can vary this architecture and add services such as Amazon Elastic MapReduce (Amazon EMR) if your use case involves large volumes of unstructured data, or build a business intelligence (BI) reporting interface for analysis of predicted metrics. AWS provides a range of services that act as building blocks for the use case you want to build.


Prerequisite: Start with a data set

The first step in creating an accurate model is to choose the right data set to build and train the model. For the purposes of this post, I am using a publicly available diabetes data set from the University of California, Irvine (UCI).  The data set consists of 101,766 rows and represents 10 years of clinical care records from 130 US hospitals and integrated delivery networks. It includes over 50 features (attributes) representing patient and hospital outcomes. The data set can be downloaded from the UCI website. The hosted zip file consists of two csv files. The first file, diabetic_data.csv, is the actual data set and the second file, IDs_mapping.csv is the master data for admission_type_id, discharge_disposition_id, and admission_source_id.

Amazon ML automatically splits source data sets into two parts. The first part is used to train the ML model and the second part is used to evaluate the ML model’s accuracy. In this case, seventy percent of the source data is used to train the ML model and thirty percent is used to evaluate it. This is represented in the data rearrangement attribute as shown below:

ML model training data set:

{
  "splitting": {
    "percentBegin": 0,
    "percentEnd": 70,
    "strategy": "random",
    "complement": false,
    "strategyParams": {
      "randomSeed": ""
    }
  }
}

ML model evaluation data set:

{
  "splitting": {
    "percentBegin": 70,
    "percentEnd": 100,
    "strategy": "random",
    "complement": false,
    "strategyParams": {
      "randomSeed": ""
    }
  }
}
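To make the effect of percentBegin and percentEnd concrete, here is a small sketch of a seeded random split. It assigns each row a pseudo-random number between 0 and 100 and keeps the rows that fall inside the requested range, so the same seed yields complementary training and evaluation sets. This is an illustration of the idea only, not Amazon ML's internal implementation; the numeric seed, the helper names, and the small PRNG are assumptions.

```javascript
// Small deterministic PRNG (mulberry32-style) so a given seed always yields the same draws.
function seededRandom(seed) {
    var a = seed | 0;
    return function () {
        a = (a + 0x6D2B79F5) | 0;
        var t = Math.imul(a ^ (a >>> 15), 1 | a);
        t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
        return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
    };
}

// Keep the rows whose draw (0 to 100) falls in [percentBegin, percentEnd).
function splitRows(rows, percentBegin, percentEnd, seed) {
    var random = seededRandom(seed);
    return rows.filter(function () {
        var draw = random() * 100;
        return draw >= percentBegin && draw < percentEnd;
    });
}

var rows = [];
for (var i = 1; i <= 1000; i++) {
    rows.push(i);
}
var training = splitRows(rows, 0, 70, 42);
var evaluation = splitRows(rows, 70, 100, 42);
console.log(training.length + ' training rows, ' + evaluation.length + ' evaluation rows');
```

Because both calls replay the same sequence of draws, every row lands in exactly one of the two sets, although the sizes are only approximately 70 and 30 percent.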

ML models generally become more accurate when more data is used to train them. The data set I’m using in this post is very limited for building a comprehensive ML model, but this methodology can be replicated with larger data sets.


Prepare the data and move it into Amazon S3

For an ML model to be effective, you should prepare the data so that it provides the right patterns to the model. The data set should have good coverage for relevant features, be low in unwanted “noise” or variance, and be as complete as possible with correct labels.

Use the Amazon Redshift database to prepare the data set. To begin, copy the data into an S3 bucket named diabetesdata. The bucket consists of four CSV files:

You can LIST the bucket contents by running the following command in the AWS CLI:

aws s3 ls s3://diabetesdata

Following this, create the necessary tables in Amazon Redshift to process the data in the CSV files: three master tables and one transaction table.

The transaction table consists of lookup IDs which act as foreign keys (FK) from the above master tables. It also has a primary key “encounter_id” and multiple columns that act as features for the ML model. The createredshifttables.sql script is executed to create the above tables.         

After the necessary tables are created, start loading them with data. You can make use of the Amazon Redshift COPY command to copy the data from the files on S3 into the respective Amazon Redshift tables. The following script template details the format of the copy command used:

COPY diabetes_data from 's3://<S3 file path>' credentials 'aws_access_key_id=<AWS Access Key ID>;aws_secret_access_key=<AWS Secret Access Key>' delimiter ',' IGNOREHEADER 1;

The loaddata.sql script is executed for the data loading step.


Modify the data set in Amazon Redshift

The next step is to make some changes to the data set to make it less noisy and suitable for the ML model that you create later. There are various things you can do as part of this clean up, such as updating incomplete values and grouping attributes into categories. For example, age can be grouped into young, adult or old based on age ranges.

For the target attribute for your ML model, create a custom attribute called readmission_result, with a value of “Yes” or “No” based on conditions in the readmitted attribute. To see all the changes made to the data, see the ModifyData.sql script.
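The post performs these transformations in SQL on Amazon Redshift (see the ModifyData.sql script), but the logic can be illustrated in a few lines. The sketch below assumes that age arrives as a bracket string such as "[70-80)" and that the readmitted attribute takes values like "<30", ">30", and "NO". The cut-offs for young, adult, and old, and the rule that only "<30" maps to "Yes", are assumptions made for illustration.

```javascript
// Map an age bracket such as "[70-80)" to a coarse category.
// The cut-offs (30 and 60) are illustrative assumptions.
function ageCategory(bracket) {
    var match = /\[(\d+)-(\d+)\)/.exec(bracket);
    if (!match) {
        return 'unknown';
    }
    var lowerBound = parseInt(match[1], 10);
    if (lowerBound < 30) { return 'young'; }
    if (lowerBound < 60) { return 'adult'; }
    return 'old';
}

// Derive the binary target. Treating only "<30" as a readmission is an assumption.
function readmissionResult(readmitted) {
    return readmitted === '<30' ? 'Yes' : 'No';
}

console.log(ageCategory('[70-80)') + ', ' + readmissionResult('<30'));
```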

Finally, the complete modified data set is dumped into a new table, diabetes_data_modified, which acts as a source for the ML model. Notice the new custom column readmission_result, which is your target attribute for the ML model.


Create a data source for Amazon ML and build the ML model

Next, create an Amazon ML data source, choosing Amazon Redshift as the source. This can be easily done through the console or through the CreateDataSourceFromRedshift API operation by specifying the Redshift parameters like Cluster Name, Database Name, username, password, role and the SQL query. The IAM role for Amazon Redshift as a data source is easily populated, as shown in the screenshot below.

You need the entire data set for the ML model, so use the following query for the data source:

SELECT * FROM diabetes_data_modified

This can be modified with column names and WHERE clauses to build different data sets for training the ML model.

The steps to create a binary classification ML model are covered in detail in the Building a Binary Classification Model with Amazon Machine Learning and Amazon Redshift blog post.

Amazon ML provides two types of predictions that you can try. The first is a batch prediction, which can be generated through the console or the CreateBatchPrediction API operation. The result of the batch prediction is stored in an Amazon S3 bucket and can be used to build reports for end users (such as a monthly report of actual versus predicted values).

You can also use the ML model to generate a real-time prediction. To enable real-time predictions, create an endpoint for the ML model either through the console or using the CreateRealtimeEndpoint API operation.

After it’s created, you can query this endpoint in real time to get a response from Amazon ML, as shown in the following CLI screenshot.



Build the end user application

The Amazon ML endpoint created earlier can be invoked using an API call. This is very handy for building an application for end users who can interact with the ML model in real time.

Create a similar application and host it as a static website on Amazon S3. This feature of S3 allows you to host websites without any web servers and takes away the complexities of scaling hardware based on traffic routed to your application. The following is a screenshot from the application:

The application allows end users to select certain patient parameters and then makes a call to the predict API. The results are displayed in real time in the results pane.

I made use of the AWS SDK for JavaScript to build this application. The SDK can be added to your script using the following code:

<script src="https://sdk.amazonaws.com/js/aws-sdk-2.3.3.min.js"></script>


Use Amazon Cognito for secure access

To authenticate the Amazon ML API request, you can make use of Amazon Cognito, which allows for secure access to the Amazon ML endpoint without making use of the AWS security credentials. To enable this, create an identity pool in Amazon Cognito.

Amazon Cognito creates a new role in IAM. You need to allow this new IAM role to interact with Amazon ML by attaching the AmazonMachineLearningRealTimePredictionOnlyAccess policy to the role. This IAM policy allows the application to query the Amazon ML endpoint.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["machinelearning:Predict"],
      "Resource": "*"
    }
  ]
}

Next, initialize credential objects, as shown in the code below:

var parameters = {
    AccountId: "AWS Account ID",
    RoleArn: "ARN for the role created by Amazon Cognito",
    IdentityPoolId: "The identity pool ID created in Amazon Cognito"
};
// set the Amazon Cognito region
AWS.config.region = 'us-east-1';
// initialize the Credentials object with the parameters
AWS.config.credentials = new AWS.CognitoIdentityCredentials(parameters);


Call the Amazon ML endpoint using the API

Create the function callApi() to make a call to the Amazon ML endpoint. The steps in the callApi() function involve building the object that forms part of the parameters sent to the Amazon ML endpoint, as shown in the code below:

var machinelearning = new AWS.MachineLearning({apiVersion: '2014-12-12'});
var params = {
    MLModelId: '<ML model ID>',
    PredictEndpoint: '<ML model real-time endpoint>',
    Record: record  // placeholder: the object of patient attributes built from the form
};
var request = machinelearning.predict(params);
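The Predict operation also takes a Record parameter, a map of attribute names to string values that describes the observation to score. A minimal sketch of building it from form input might look like the following. The helper name toRecord and the sample attribute values are assumptions for illustration, and empty fields are simply left out.

```javascript
// Amazon ML's Predict operation expects Record to be a map of strings to strings.
// Convert every supplied value to a string and skip empty fields.
function toRecord(formValues) {
    var record = {};
    Object.keys(formValues).forEach(function (name) {
        var value = formValues[name];
        if (value !== null && value !== undefined && value !== '') {
            record[name] = String(value);
        }
    });
    return record;
}

// Example input: attribute names and values are illustrative.
var record = toRecord({ age: 'old', time_in_hospital: 5, num_medications: 12, race: '' });
console.log(record);
```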

The API call returns a JSON object that includes, among other things, the predictedLabel and predictedScores parameters, as shown in the code below:

{
    "Prediction": {
        "details": {
            "Algorithm": "SGD",
            "PredictiveModelType": "BINARY"
        },
        "predictedLabel": "1",
        "predictedScores": {
            "1": 0.5548262000083923
        }
    }
}

The predictedScores parameter generates a score between 0 and 1 which you can convert into a percentage:

finalScore = Math.round(predictedScore * 100);
resultMessage = finalScore + "%";
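Putting the response handling together, a small helper could pull the label and its score out of the response and format the percentage. This sketch assumes the response has the shape of the sample shown earlier, with the score keyed by the predicted label; the helper name is made up.

```javascript
// Extract the predicted label and its score from a Predict response
// and express the score as a whole-number percentage.
function formatPrediction(response) {
    var prediction = response.Prediction;
    var label = prediction.predictedLabel;
    var score = prediction.predictedScores[label];
    return { label: label, percent: Math.round(score * 100) + '%' };
}

// Sample response copied from the shape shown earlier in this post.
var sampleResponse = {
    Prediction: {
        details: { Algorithm: 'SGD', PredictiveModelType: 'BINARY' },
        predictedLabel: '1',
        predictedScores: { '1': 0.5548262000083923 }
    }
};
console.log(formatPrediction(sampleResponse));
```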

The complete code for this sample application is available in the PredictReadmission_AML GitHub repo for reference and can be used to create more sophisticated machine learning applications using Amazon ML.



The power of machine learning opens new avenues for advanced analytics in healthcare. With new means of gathering data that range from sensors mounted on medical devices to medical images and everything in between, the complexities demonstrated by these varied data sets are pushing the boundaries of conventional analysis techniques.

The advent of cloud computing has made it possible for researchers to take up the challenging task of synthesizing these data sets and draw insights that are providing us with information that we never knew existed.

We are still at the beginning of this journey and there are, of course, challenges that we have to overcome. The availability of quality data sets, which is the starting point of any good analysis, is still a major hurdle. Regulations like the Health Insurance Portability and Accountability Act of 1996 (HIPAA) make it difficult to obtain medical records with Protected Health Information (PHI). The good news is that this is changing with initiatives like AWS Public Data Sets, which hosts a variety of public data sets that anyone can use.

At the end of the day, all this analysis and research is for one cause: To improve the quality of human lives. I hope this is, and will continue to be, the greatest motivation to overcome any challenge.

If you have any questions or suggestions, please comment below.
_ _ _ _ _

Do you want to be part of the conversation? Join AWS developers, enthusiasts, and healthcare professionals as we discuss building smart healthcare applications on AWS in Seattle on August 31.

Seattle AWS Big Data Meetup (Wednesday, August 31, 2016)



Building a Multi-Class ML Model with Amazon Machine Learning


Month in Review: July 2016

Post Syndicated from Derek Young original https://blogs.aws.amazon.com/bigdata/post/Tx3PZZPH7CK6QOB/Month-in-Review-July-2016

July was a busy month of big data solutions on the Big Data Blog. The month started with our most popular story yet, Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE. It was a great post to start a spectacular month. Take a look at our summaries below. Learn, comment, and share. Thank you for reading the AWS Big Data Blog!

Installing and Running JobServer for Apache Spark on Amazon EMR
In this blog post, learn how to install JobServer on EMR using a bootstrap action (BA) derived from the JobServer GitHub repository. Then, run JobServer using a sample dataset.

Process Large DynamoDB Streams Using Multiple Amazon Kinesis Client Library (KCL) Workers
A previous post described how you can use the Amazon Kinesis Client Library (KCL) and DynamoDB Streams Kinesis Adapter to efficiently process DynamoDB streams. This post focuses on the KCL configurations that are likely to have an impact on the performance of your application when processing a large DynamoDB stream.

Simplify Management of Amazon Redshift Snapshots using AWS Lambda
In this blog post, learn about the new Amazon Redshift Utils module that helps you manage the Snapshots that your cluster creates. You supply a simple configuration, and then AWS Lambda ensures that you have cluster snapshots as frequently as required to meet your RPO.

How SmartNews Built a Lambda Architecture on AWS to Analyze Customer Behavior and Recommend Content
In this post, SmartNews shows you how they built their data platform on AWS. Their current system generates tens of GBs of data from multiple data sources, and runs daily aggregation queries or machine learning algorithms on datasets with hundreds of GBs. Some outputs by machine learning algorithms are joined on data streams for gathering user feedback in near real-time (e.g. the last 5 minutes). It lets them adapt their product for users with minimum latency.

Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Managing a hybrid cluster of both CPU and GPU instances poses challenges because cluster managers such as Yarn/Mesos do not natively support GPUs. Even if they did have native GPU support, the open source deep learning libraries would have to be re-written to work with the cluster manager API. This post discusses an alternate solution; namely, running separate CPU and GPU clusters, and driving the end-to-end modeling process from Apache Spark.


Will Spark Power the Data behind Precision Medicine? (March 2016)
Spark is already known for being a major player in big data analysis, but it is also uniquely capable of advancing genomics algorithms, given the complex nature of genomics research. This post introduces gene analysis using Spark on EMR and ADAM, for those new to precision medicine.


Want to learn more about Big Data or Streaming Data? Check out our Big Data and Streaming data educational pages.

Leave a comment below to let us know what big data topics you’d like to see next on the AWS Big Data Blog.