The adoption of Apache Spark has increased significantly over the past few years, and running Spark-based application pipelines is the new normal. Spark jobs in an ETL (extract, transform, and load) pipeline have additional requirements—you must handle dependencies between jobs, maintain order during execution, and run multiple jobs in parallel. In most of these cases, you can use workflow scheduler tools like Apache Oozie, Apache Airflow, or even Cron to fulfill these requirements.
Apache Oozie is a widely used workflow scheduler system for Hadoop-based jobs. However, its limited UI capabilities, lack of integration with other services, and heavy XML dependency might not be suitable for some users. On the other hand, Apache Airflow comes with a lot of neat features, along with powerful UI and monitoring capabilities and integration with several AWS and third-party services. However, with Airflow, you do need to provision and manage the Airflow server. The Cron utility is a powerful job scheduler. But it doesn’t give you much visibility into the job details, and creating a workflow using Cron jobs can be challenging.
What if you have a simple use case, in which you want to run a few Spark jobs in a specific order, but you don’t want to spend time orchestrating those jobs or maintaining a separate application? You can do that today in a serverless fashion using AWS Step Functions. You can create the entire workflow in AWS Step Functions and interact with Spark on Amazon EMR through Apache Livy.
In this post, I walk you through a list of steps to orchestrate a serverless Spark-based ETL pipeline using AWS Step Functions and Apache Livy.
Input data
For the source data for this post, I use the New York City Taxi and Limousine Commission (TLC) trip record data. For a description of the data, see this detailed dictionary of the taxi data. In this example, we’ll work mainly with the following three columns for the Spark jobs.
RateCodeID – Represents the rate code in effect at the end of the trip (for example, 1 for standard rate, 2 for JFK airport, 3 for Newark airport, and so on).
FareAmount – Represents the time-and-distance fare calculated by the meter.
TripDistance – Represents the elapsed trip distance in miles reported by the taxi meter.
The trip data is in comma-separated values (CSV) format with the first row as a header. To shorten the Spark execution time, I trimmed the large input data to only 20,000 rows. During the deployment phase, the input file tripdata.csv is stored in Amazon S3 in the <<your-bucket>>/emr-step-functions/input/ folder.
The following image shows a sample of the trip data:
Solution overview
The next few sections describe how Spark jobs are created for this solution, how you can interact with Spark using Apache Livy, and how you can use AWS Step Functions to create orchestrations for these Spark applications.
At a high level, the solution includes the following steps:
Trigger the AWS Step Functions state machine by passing the input file path.
The first stage in the state machine triggers an AWS Lambda function.
The Lambda function interacts with Apache Spark running on Amazon EMR using Apache Livy, and submits a Spark job.
The state machine waits a few seconds before checking the Spark job status.
Based on the job status, the state machine moves to the success or failure state.
Subsequent Spark jobs are submitted using the same approach.
The state machine waits a few seconds for the job to finish.
The job finishes, and the state machine updates with its final status.
Let’s take a look at the Spark application that is used for this solution.
Spark jobs
For this example, I built a Spark jar named spark-taxi.jar. It has two different Spark applications:
MilesPerRateCode – The first job that runs on the Amazon EMR cluster. This job reads the trip data from an input source and computes the total trip distance for each rate code. The output of this job consists of two columns and is stored in Apache Parquet format in the output path.
The following are the expected output columns:
rate_code – Represents the rate code for the trip.
total_distance – Represents the total trip distance for that rate code (that is, sum(trip_distance)).
RateCodeStatus – The second job that runs on the EMR cluster, but only if the first job finishes successfully. This job depends on two different input sets:
csv – The same trip data that is used for the first Spark job.
miles-per-rate – The output of the first job.
This job first reads the tripdata.csv file and aggregates the fare_amount by the rate_code. After this point, you have two different datasets, both aggregated by rate_code. Finally, the job joins the two datasets on the rate_code field and writes the combined rate code status to a single CSV file.
The output columns are as follows:
rate_code_id – Represents the rate code type.
total_distance – Derived from the first Spark job and represents the total trip distance.
total_fare_amount – A new field that is generated during the second Spark application, representing the total fare amount by the rate code type.
Note that in this case, you don’t need to run two different Spark jobs to generate that output. The goal of setting up the jobs in this way is just to create a dependency between the two jobs and use them within AWS Step Functions.
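To make those two transformations concrete, the following is an illustrative PySpark sketch of the same logic. The jobs in this post are packaged in spark-taxi.jar, so treat the column names, paths, and output layout below as assumptions based on the descriptions above rather than the actual application code.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("taxi-etl-sketch").getOrCreate()
root_path = "s3://your-bucket"  # placeholder for the rootPath argument

trips = (spark.read.option("header", "true").option("inferSchema", "true")
         .csv(root_path + "/emr-step-functions/input/tripdata.csv"))

# MilesPerRateCode: total trip distance per rate code, written as Parquet
miles_per_rate = (trips.groupBy("RateCodeID")
                  .agg(F.sum("TripDistance").alias("total_distance"))
                  .withColumnRenamed("RateCodeID", "rate_code"))
miles_per_rate.write.mode("overwrite").parquet(root_path + "/emr-step-functions/miles-per-rate")

# RateCodeStatus: total fare per rate code, joined with the first job's output, written as CSV
fares = (trips.groupBy("RateCodeID")
         .agg(F.sum("FareAmount").alias("total_fare_amount"))
         .withColumnRenamed("RateCodeID", "rate_code_id"))
rate_code_status = (fares.join(miles_per_rate, fares.rate_code_id == miles_per_rate.rate_code)
                    .select("rate_code_id", "total_distance", "total_fare_amount"))
(rate_code_status.coalesce(1).write.mode("overwrite")
 .option("header", "true").csv(root_path + "/emr-step-functions/rate-code-status"))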
Both Spark applications take one input argument called rootPath. It’s the S3 location where the Spark job is stored along with input and output data. Here is a sample of the final output:
The next section discusses how you can use Apache Livy to interact with Spark applications that are running on Amazon EMR.
Using Apache Livy to interact with Apache Spark
Apache Livy provides a REST interface to interact with Spark running on an EMR cluster. Livy is included in Amazon EMR release version 5.9.0 and later. In this post, I use Livy to submit Spark jobs and retrieve job status. When Amazon EMR is launched with Livy installed, the EMR master node becomes the endpoint for Livy, and it starts listening on port 8998 by default. Livy provides APIs to interact with Spark.
Let’s look at a couple of examples of how you can interact with Spark running on Amazon EMR using Livy.
To list active running jobs, you can execute the following from the EMR master node:
curl localhost:8998/sessions
If you want to do the same from a remote instance, just change localhost to the EMR master node’s hostname (port 8998 must be open to that remote instance through the security group).
Through spark-submit, you can pass multiple arguments for the Spark job and Spark configuration settings. You can also do that using Livy, by passing the S3 path through the args parameter, as in the sketch that follows.
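The following is a sketch of that call using the Python requests library; the host, jar path, class name, and configuration values are placeholders rather than the exact values used later in this post.

import json
import requests

livy_url = "http://<emr-master-dns>:8998/batches"  # placeholder host
payload = {
    "file": "s3://your-bucket/emr-step-functions/spark-taxi.jar",
    "className": "MilesPerRateCode",            # assumed class name
    "args": ["s3://your-bucket"],               # the S3 root path passed to the job
    "conf": {"spark.driver.memory": "2g"},      # optional Spark configuration settings
}
response = requests.post(livy_url, data=json.dumps(payload),
                         headers={"Content-Type": "application/json"})
print(response.json())  # the returned id identifies the batch session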
For a detailed list of Livy APIs, see the Apache Livy REST API page. This post uses GET /batches and POST /batches.
In the next section, you create a state machine and orchestrate Spark applications using AWS Step Functions.
Using AWS Step Functions to create a Spark job workflow
AWS Step Functions automatically triggers and tracks each step and retries when it encounters errors, so your application executes in order and as expected every time. To create a Spark job workflow using AWS Step Functions, you first build a state machine that uses different types of states to compose the entire workflow.
First, you use the Task state—a simple state in AWS Step Functions that performs a single unit of work. You also use the Wait state to delay the state machine from continuing for a specified time. Later, you use the Choice state to add branching logic to a state machine.
The following is a quick summary of how to use different states in the state machine to create the Spark ETL pipeline:
Task state – Invokes a Lambda function. The first Task state submits the Spark job on Amazon EMR, and the next Task state is used to retrieve the previous Spark job status.
Wait state – Pauses the state machine until a job completes execution.
Choice state – Each Spark job execution can return a failure, an error, or a success state. So, in the state machine, you use the Choice state to create a rule that specifies the next action or step based on the success or failure of the previous step.
Here is one of my Task states, MilesPerRateCode, which simply submits a Spark job:
"MilesPerRate Job": {
"Type": "Task",
"Resource":"arn:aws:lambda:us-east-1:xxxxxx:function:blog-miles-per-rate-job-submit-function",
"ResultPath": "$.jobId",
"Next": "Wait for MilesPerRate job to complete"
}
This Task state configuration specifies the Lambda function to execute. Inside the Lambda function, it submits a Spark job through Livy using Livy’s POST API. Using ResultPath, it tells the state machine where to place the result of the executing task. As discussed in the previous section, the Livy submission returns a session ID, which is captured with $.jobId and used in a later state.
The following code section shows the Lambda function, which is used to submit the MilesPerRateCode job. It uses the Python requests library to submit a POST against the Livy endpoint hosted on Amazon EMR and passes the required parameters in JSON format through the payload. It then parses the response, grabs id from the response, and returns it. The Next field tells the state machine which state to go to next.
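The function packaged with the CloudFormation template isn’t reproduced here, but a minimal sketch of such a handler looks like the following. The Livy endpoint, jar location, and class name are placeholders.

import json
import requests  # bundled with the Lambda deployment package

LIVY_ENDPOINT = "http://<emr-master-dns>:8998/batches"  # placeholder

def lambda_handler(event, context):
    root_path = event["rootPath"]
    payload = {
        "file": root_path + "/emr-step-functions/spark-taxi.jar",
        "className": "MilesPerRateCode",  # assumed class name
        "args": [root_path],
    }
    # Submit the Spark job as a Livy batch session
    response = requests.post(LIVY_ENDPOINT, data=json.dumps(payload),
                             headers={"Content-Type": "application/json"})
    # Return the Livy batch id; ResultPath places it at $.jobId for later states
    return response.json()["id"]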
Just like in the MilesPerRate job, another state submits the RateCodeStatus job, but it executes only when all previous jobs have completed successfully.
Here is the Task state in the state machine that checks the Spark job status:
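(The exact definition built by the CloudFormation template isn’t reproduced here; the following is a representative sketch, with the function ARN and state names as placeholders.)

"Query MilesPerRate job status": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-east-1:xxxxxx:function:blog-spark-job-status-function",
  "ResultPath": "$.jobStatus",
  "Next": "MilesPerRate job complete?"
}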
Just like other states, the preceding Task executes a Lambda function, captures the result (represented by jobStatus), and passes it to the next state. The following is the Lambda function that checks the Spark job status based on a given session ID:
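Again, this is a minimal sketch rather than the packaged function; the Livy endpoint is a placeholder.

import requests  # bundled with the Lambda deployment package

LIVY_ENDPOINT = "http://<emr-master-dns>:8998/batches"  # placeholder

def lambda_handler(event, context):
    # The submit state stored the Livy batch id at $.jobId, so it arrives in the event
    job_id = event["jobId"]
    response = requests.get("{}/{}".format(LIVY_ENDPOINT, job_id))
    # Livy reports states such as "running", "success", or "dead"
    return response.json()["state"]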
In the Choice state, the state machine checks the Spark job status value, compares it with a predefined status, and transitions based on the result. For example, if the status is success, it moves to the next state (the RateCodeStatus Job state), and if the status is dead, it moves to the MilesPerRate job failed state.
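A representative Choice state for this rule might look like the following sketch (the state names are placeholders that match the status-check sketch above):

"MilesPerRate job complete?": {
  "Type": "Choice",
  "Choices": [
    { "Variable": "$.jobStatus", "StringEquals": "success", "Next": "RateCodeStatus Job" },
    { "Variable": "$.jobStatus", "StringEquals": "dead", "Next": "MilesPerRate job failed" }
  ],
  "Default": "Wait for MilesPerRate job to complete"
}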
To set up this entire solution, you need to create a few AWS resources. To make it easier, I have created an AWS CloudFormation template. This template creates all the required AWS resources and configures all the resources that are needed to create a Spark-based ETL pipeline on AWS Step Functions.
This CloudFormation template requires you to pass the following four parameters during initiation.
ClusterSubnetID – The subnet where the Amazon EMR cluster is deployed; Lambda is configured to talk to this subnet.
KeyName – The name of the existing EC2 key pair used to access the Amazon EMR cluster.
VPCID – The ID of the virtual private cloud (VPC) where the EMR cluster is deployed; Lambda is configured to talk to this VPC.
S3RootPath – The Amazon S3 path where all required files (input file, Spark job, and so on) are stored and where the resulting data is written.
IMPORTANT: These templates are designed only to show how you can create a Spark-based ETL pipeline on AWS Step Functions using Apache Livy. They are not intended for production use without modification. And if you try this solution outside of the us-east-1 Region, download the necessary files from s3://aws-data-analytics-blog/emr-step-functions, upload the files to the buckets in your Region, edit the script as appropriate, and then run it.
To launch the CloudFormation stack, choose Launch Stack:
Launching this stack creates the following list of AWS resources.
StepFunctionsStateExecutionRole (IAM role) – IAM role that executes the state machine and has a trust relationship with the states service.
SparkETLStateMachine (AWS Step Functions state machine) – State machine in AWS Step Functions for the Spark ETL workflow.
LambdaSecurityGroup (Amazon EC2 security group) – Security group that the Lambda functions use to call the Livy API.
RateCodeStatusJobSubmitFunction (AWS Lambda function) – Lambda function that submits the RateCodeStatus job.
MilesPerRateJobSubmitFunction (AWS Lambda function) – Lambda function that submits the MilesPerRate job.
SparkJobStatusFunction (AWS Lambda function) – Lambda function that checks the Spark job status.
LambdaStateMachineRole (IAM role) – IAM role that all Lambda functions use, with the lambda service trust relationship.
EMRCluster (Amazon EMR cluster) – EMR cluster where Livy is running and where the jobs are submitted.
During the AWS CloudFormation deployment phase, the template sets up S3 paths for input and output. Input files are stored in the <<s3-root-path>>/emr-step-functions/input/ path, whereas spark-taxi.jar is copied under <<s3-root-path>>/emr-step-functions/.
The following screenshot shows how the S3 paths are configured after deployment. In this example, I passed a bucket that I created in the AWS account s3://tm-app-demos for the S3 root path.
If the CloudFormation template completed successfully, you will see Spark-ETL-State-Machine in the AWS Step Functions dashboard, as follows:
Choose the Spark-ETL-State-Machine state machine to take a look at this implementation. The AWS CloudFormation template built the entire state machine along with its dependent Lambda functions, which are now ready to be executed.
On the dashboard, choose the newly created state machine, and then choose New execution to initiate the state machine. It asks you to pass input in JSON format. This input goes to the first state MilesPerRate Job, which eventually executes the Lambda function blog-miles-per-rate-job-submit-function.
Pass the S3 root path as input:
{
"rootPath": "s3://tm-app-demos"
}
Then choose Start Execution:
The rootPath value is the same value that was passed when creating the CloudFormation stack. It can be an S3 bucket location or a bucket with prefixes, but it should be the same value that is used for AWS CloudFormation. This value tells the state machine where it can find the Spark jar and input file, and where it will write output files. After the state machine starts, each state/task is executed based on its definition in the state machine.
At a high level, the following represents the flow of events:
Execute the first Spark job, MilesPerRate.
The Spark job reads the input file from the location <<rootPath>>/emr-step-functions/input/tripdata.csv. If the job finishes successfully, it writes the output data to <<rootPath>>/emr-step-functions/miles-per-rate.
If the Spark job fails, it transitions to the error state MilesPerRate job failed, and the state machine stops. If the Spark job finishes successfully, it transitions to the RateCodeStatus Job state, and the second Spark job is executed.
If the second Spark job fails, it transitions to the error state RateCodeStatus job failed, and the state machine stops with the Failed status.
If this Spark job completes successfully, it writes the final output data to the <<rootPath>>/emr-step-functions/rate-code-status/ folder. It also transitions to the RateCodeStatus job finished state, and the state machine ends its execution with the Success status.
The following screenshot shows a successfully completed Spark ETL state machine:
The right side of the state machine diagram shows the details of individual states with their input and output.
When you execute the state machine for the second time, it fails because the S3 path already exists. The state machine turns red and stops at MilesPerRate job failed. The following image represents that failed execution of the state machine:
You can also check your Spark application status and logs by going to the Amazon EMR console and viewing the Application history tab:
I hope this walkthrough paints a picture of how you can create a serverless solution for orchestrating Spark jobs on Amazon EMR using AWS Step Functions and Apache Livy. In the next section, I share some ideas for making this solution even more elegant.
Next steps
The goal of this post is to show a simple example that uses AWS Step Functions to create an orchestration for Spark-based jobs in a serverless fashion. To make this solution robust and production ready, you can explore the following options:
In this example, I manually initiated the state machine by passing the rootPath as input. You can instead trigger the state machine automatically. To run the ETL pipeline as soon as the files arrive in your S3 bucket, you can pass the new file path to the state machine. Because CloudWatch Events supports AWS Step Functions as a target, you can create a CloudWatch rule for an S3 event. You can then set AWS Step Functions as a target and pass the new file path to your state machine. You’re all set!
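As a rough sketch of that wiring with boto3 (the names, ARNs, and event pattern below are placeholders, and S3 object-level events reach CloudWatch Events through CloudTrail data events):

import json
import boto3

events = boto3.client("events")

# Match PutObject calls on the input bucket (requires CloudTrail data events for the bucket)
events.put_rule(
    Name="trigger-spark-etl-on-upload",
    EventPattern=json.dumps({
        "source": ["aws.s3"],
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {
            "eventName": ["PutObject"],
            "requestParameters": {"bucketName": ["tm-app-demos"]}
        }
    }),
    State="ENABLED",
)

# Point the rule at the state machine and shape the input it receives
events.put_targets(
    Rule="trigger-spark-etl-on-upload",
    Targets=[{
        "Id": "spark-etl-state-machine",
        "Arn": "arn:aws:states:us-east-1:123456789012:stateMachine:Spark-ETL-State-Machine",
        "RoleArn": "arn:aws:iam::123456789012:role/cwe-start-execution-role",  # placeholder role
        "InputTransformer": {
            "InputPathsMap": {"bucket": "$.detail.requestParameters.bucketName"},
            "InputTemplate": "{\"rootPath\": \"s3://<bucket>\"}"
        },
    }],
)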
You can also improve this solution by adding an alerting mechanism in case of failures. To do this, create a Lambda function that sends an alert email and invoke it before transitioning to a Fail state. That way, when any part of your state machine fails, it triggers an email and notifies the user.
If you want to submit multiple Spark jobs in parallel, you can use the Parallel state type in AWS Step Functions. The Parallel state is used to create parallel branches of execution in your state machine.
With Lambda and AWS Step Functions, you can create a very robust serverless orchestration for your big data workload.
Cleaning up
When you’ve finished testing this solution, remember to clean up all those AWS resources that you created using AWS CloudFormation. Use the AWS CloudFormation console or AWS CLI to delete the stack named Blog-Spark-ETL-Step-Functions.
Summary
In this post, I showed you how to use AWS Step Functions to orchestrate your Spark jobs that are running on Amazon EMR. You used Apache Livy to submit jobs to Spark from a Lambda function and created a workflow for your Spark jobs, maintaining a specific order for job execution and triggering different AWS events based on your job’s outcome. Go ahead—give this solution a try, and share your experience with us!
Tanzir Musabbir is an EMR Specialist Solutions Architect with AWS. He is an early adopter of open source Big Data technologies. At AWS, he works with our customers to provide them architectural guidance for running analytics solutions on Amazon EMR, Amazon Athena & AWS Glue. Tanzir is a big Real Madrid fan and he loves to travel in his free time.
Hadoop User Experience (Hue) is an open-source, web-based, graphical user interface for use with Amazon EMR and Apache Hadoop. The Hue database stores things like users, groups, authorization permissions, Apache Hive queries, Apache Oozie workflows, and so on.
There might come a time when you want to migrate your Hue database to a new EMR cluster. For example, you might want to upgrade from an older version of the Amazon EMR AMI (Amazon Machine Image), but your Hue application and its database have had a lot of customization. You can avoid re-creating these user entities and retain query/workflow histories in Hue by migrating the existing Hue database, or the remote database in Amazon RDS, to a new cluster.
By default, Hue user information and query histories are stored in a local MySQL database on the EMR cluster’s master node. However, you can create one or more Hue-enabled clusters using a configuration stored in Amazon S3 and a remote MySQL database in Amazon RDS. This allows you to preserve user information and query history that Hue creates without keeping your Amazon EMR cluster running.
This post describes the step-by-step process for migrating the Hue database from an existing EMR cluster.
Note: Amazon EMR supports different Hue versions across different AMI releases. Keep in mind the compatibility of Hue versions between the old and new clusters in this migration activity. Currently, Hue 3.x.x versions are not compatible with Hue 4.x.x versions, and therefore a migration between these two Hue versions might create issues. In addition, Hue 3.10.0 is not backward compatible with its previous 3.x.x versions.
Before you begin
First, let’s create a new testUser in Hue on an existing EMR cluster, as shown following:
You will use these credentials later to log in to Hue on the new EMR cluster and validate whether you have successfully migrated the Hue database.
Let’s get started!
Migration how-to
Follow these steps to migrate your database to a new EMR cluster and then validate the migration process.
1.) Make a backup of the existing Hue database.
Use SSH to connect to the master node of the old cluster, as shown following (if you are using Linux/Unix/macOS), and dump the Hue database to a JSON file.
Edit the hue-mysql.json output file by removing all JSON objects that have useradmin.userprofile in the model field, and save the file.
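Instead of editing the dump by hand, a small helper like the following can strip those objects. This script is not part of the original procedure, just a convenience sketch; the file name matches the dump created in the previous step.

import json

with open("hue-mysql.json") as f:
    objects = json.load(f)

# Drop the user profile objects so they don't conflict on the new cluster
filtered = [obj for obj in objects if obj.get("model") != "useradmin.userprofile"]

with open("hue-mysql.json", "w") as f:
    json.dump(filtered, f, indent=2)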
2.) Store the hue-mysql.json file on persistent storage like Amazon S3.
You can copy the file from the old EMR cluster to Amazon S3 using the AWS CLI or Secure Copy (SCP) client. For example, the following uses the AWS CLI:
b.) Using the mysql client, connect to the Hue database for your cluster (either the local MySQL database or the remote database in Amazon RDS), as shown following.
$ mysql -h HOST -u USER -pPASSWORD
For a local MySQL database, you can find the hostname, user name, and password for connecting to the database in the /etc/hue/conf/hue.ini file on the master node.
[[database]]
engine = mysql
name = huedb
case_insensitive_collation = utf8_unicode_ci
test_charset = utf8
test_collation = utf8_bin
host = ip-172-31-37-133.us-west-2.compute.internal
user = hue
test_name = test_huedb
password = QdWbL3Ai6GcBqk26
port = 3306
Based on the preceding example configuration, the sample command is as follows. (Replace the host, user, and password details based on your EMR cluster settings.)
$ mysql -h ip-172-31-37-133.us-west-2.compute.internal -u hue -pQdWbL3Ai6GcBqk26
c.) Drop the existing Hue database with the name huedb from the MySQL server.
mysql> DROP DATABASE IF EXISTS huedb;
d.) Create a new empty database with the same name huedb.
mysql> CREATE DATABASE huedb DEFAULT CHARACTER SET utf8 DEFAULT COLLATE=utf8_bin;
i.) In MySQL, add the foreign key content_type_id back to the auth_permission table.
mysql> use huedb;
mysql> ALTER TABLE huedb.auth_permission ADD FOREIGN KEY (`content_type_id`) REFERENCES `django_content_type` (`id`);
j.) Start the Hue service again.
$ sudo start hue
hue start/running, process XXXX
That’s it! Now, verify whether you can successfully access the Hue UI, and sign in using your existing testUser credentials.
After a successful sign in to Hue on the new EMR cluster, you should see a similar Hue homepage as shown following with testUser as the user signed in:
Conclusion
You have now learned how to migrate an existing Hue database to a new Amazon EMR cluster and validate the migration process. If you have any similar Amazon EMR administration topics that you want to see covered in a future post, please let us know in the comments below.
Anvesh Ragi is a Big Data Support Engineer with Amazon Web Services. He works closely with AWS customers to provide them architectural and engineering assistance for their data processing workflows. In his free time, he enjoys traveling and going for hikes.
Amazon EMR empowers many customers to build big data processing applications quickly and cost-effectively, using popular distributed frameworks such as Apache Spark, Apache HBase, Presto, and Apache Flink. For organizations that are crafting their analytical applications on Amazon EMR, there is a growing need to keep their data assets organized in an automated fashion. Because datasets tend to grow exponentially, using cataloging tools is essential to automating data discovery and organizing data assets.
AWS Glue Data Catalog provides this essential capability, allowing you to automatically discover and catalog metadata about your data stores in a central repository. Since Amazon EMR 5.8.0, customers have been using the AWS Glue Data Catalog as a metadata store for Apache Hive and Spark SQL applications that are running on Amazon EMR. Starting with Amazon EMR 5.10.0, you can catalog datasets using AWS Glue and run queries using Presto on Amazon EMR from the Hue (Hadoop User Experience) and Apache Zeppelin UIs.
You might wonder what scenarios warrant using Presto running on Amazon EMR and when to choose Amazon Athena (which uses Presto as the query engine under the hood). It is important to note that both are excellent tools for querying massive amounts of data and addressing different needs and use cases.
Amazon Athena provides the easiest way to run interactive queries for data in Amazon S3 without needing to set up or manage any servers. Presto running on Amazon EMR gives you much more flexibility in how you configure and run your queries, providing the ability to federate to other data sources if needed. For example, you might have a use case that requires LDAP authentication for clients such as the Presto CLI or JDBC/ODBC drivers. Or you might have a workflow where you need to join data between different systems like MySQL/Amazon Redshift/Apache Cassandra and Hive. In these examples, Presto running on Amazon EMR is the right tool to use because it can be configured to enable LDAP authentication in addition to the desired database connectors at cluster launch.
Now, let’s look at how metadata management for Presto works with AWS Glue.
Using an AWS Glue crawler to discover datasets
The AWS Glue Data Catalog is a reference to the location, schema, and runtime metrics of your datasets. To create this reference metadata, AWS Glue needs to crawl your datasets. In this exercise, we use an AWS Glue crawler to populate tables in the Data Catalog for the NYC taxi rides dataset.
The following are the steps for adding a crawler:
Sign in to the AWS Management Console, and open the AWS Glue console. In the navigation pane, choose Crawlers. Then choose Add crawler.
On the Add a data store page, specify the location of the NYC taxi rides dataset.
In the next step, choose an existing IAM role if one is available, or create a new role. Then choose Next.
On the scheduling page, for Frequency, choose Run on demand.
On the Configure the crawler’s output page, choose Add database. Specify blog-db as the database name. (You can specify a name of your choice, but be sure to choose the correct database name when running queries.)
Follow the remaining steps using the default values to create a crawler.
When the crawler displays the Ready state, navigate to the Databases page. (Choose blog-db from the list of databases, or search for it by specifying it as a filter, as shown in the following screenshot.) Then choose Tables. You should see the three tables created by the crawler, as follows.
(Optional) The discovered data is classified as CSV. You can convert this data into Parquet format for better response times on your queries.
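One way to do the conversion (not covered step by step in this post) is a short Spark job on the same EMR cluster; the bucket names and prefixes below are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read the crawled CSV data and rewrite it as Parquet for faster queries
taxi = (spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("s3://your-bucket/nyc-taxi/csv/"))

taxi.write.mode("overwrite").parquet("s3://your-bucket/nyc-taxi/parquet/")

# Re-run the crawler (or add a table manually) so the Parquet location appears in the Data Catalog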
Launching an Amazon EMR cluster
With the dataset discovered and organized, we can now walk through different options for launching Presto on an Amazon EMR cluster to use the AWS Glue Data Catalog.
After you’ve set up the Amazon EMR cluster with Presto, the AWS Glue Data Catalog is available through a default “hive” catalog. To change between the Hive and Glue metastores, you have to manually update hive.properties and restart the Presto server. Connect to the master node on your EMR cluster using SSH, and run the Presto CLI to start running queries interactively.
$ presto-cli --catalog hive
Begin with a simple query to sample a few rows:
presto> SELECT * FROM "blog-db".taxi limit 10;
The query shows a few sample rows as follows:
Query the average fare for trips at each hour of the day and for each day of the month on the Parquet version of the taxi dataset.
presto> SELECT EXTRACT (HOUR FROM pickup_datetime) AS hour, avg(fare_amount) AS average_fare FROM "blog-db".taxi_parquet GROUP BY 1 ORDER BY 1;
The following image shows the results:
More interestingly, you can compute the number of trips that gave tips in the 10 percent, 15 percent, or higher percentage range:
presto> -- Tip Percent Category
SELECT TipPrctCtgry
, COUNT (DISTINCT TripID) TripCt
FROM
(SELECT TripID
, (CASE
WHEN fare_prct < 0.7 THEN 'FL70'
WHEN fare_prct < 0.8 THEN 'FL80'
WHEN fare_prct < 0.9 THEN 'FL90'
ELSE 'FL100'
END) FarePrctCtgry
, (CASE
WHEN tip_prct < 0.1 THEN 'TL10'
WHEN tip_prct < 0.15 THEN 'TL15'
WHEN tip_prct < 0.2 THEN 'TL20'
ELSE 'TG20'
END) TipPrctCtgry
FROM
(SELECT TripID
, (fare_amount / total_amount) as fare_prct
, (extra / total_amount) as extra_prct
, (mta_tax / total_amount) as tip_prct
, (tolls_amount / total_amount) as mta_taxprct
, (tip_amount / total_amount) as tolls_prct
, (improvement_surcharge / total_amount) as imprv_suchrgprct
, total_amount
FROM
(SELECT *
, (cast(pickup_longitude AS VARCHAR(100)) || '_' || cast(pickup_latitude AS VARCHAR(100))) as TripID
from "blog-db”.taxi_parquet
WHERE total_amount > 0
) as t
) as t
) ct
GROUP BY TipPrctCtgry;
The results are as follows:
While the preceding query is running, navigate to the web interface for Presto on Amazon EMR at http://master-public-dns-name:8889/. Here you can look into the query metrics, such as active worker nodes, number of rows read per second, reserved memory, and parallelism.
Running queries in the Presto Editor on Hue
If you installed Hue with your Amazon EMR launch, you can also run queries on Hue’s Presto Editor. On the Amazon EMR Cluster console, choose Enable Web Connection, and follow the instructions to access the web interfaces for Hue and Zeppelin.
After the web connection is enabled, choose the Hue link to open the web interface. At the login screen, if you are the administrator logging in for the first time, type a user name and password to create your Hue superuser account. Then choose Create account. Otherwise, type your user name and password and choose Create account, or type the credentials provided by your administrator.
Choose the Presto Editor from the menu. You can run Presto queries against your tables in the AWS Glue Data Catalog.
Conclusion
Having a shared data catalog for applications on Amazon EMR alleviates a myriad of data-related challenges that organizations face today—including discovery, governance, auditability, and collaboration. In this post, we explored how the AWS Glue Data Catalog addresses discoverability and manageability for table metadata for Presto on Amazon EMR. Go ahead, give this a try, and share your experience with us!
Radhika Ravirala is a Solutions Architect at Amazon Web Services, where she helps customers craft distributed big data applications on the AWS platform. Prior to her cloud journey, she worked as a software engineer and designer for technology companies in Silicon Valley. She holds an M.S. in computer science from San Jose State University.
This is a customer post by Ajay Rathod, a Staff Data Engineer at Realtor.com.
Realtor.com, in their own words: Realtor.com®, operated by Move, Inc., is a trusted resource for home buyers, sellers, and dreamers. It offers the most comprehensive database of for-sale properties, among competing national sites, and the information, tools, and professional expertise to help people move confidently through every step of their home journey.
Move, Inc. processes hundreds of terabytes of data partitioned by day and hour. Various teams run hundreds of queries on this data. Using AWS services, Move, Inc. has built an infrastructure for gathering and analyzing data:
To increase the effectiveness of the storage and subsequent querying, the data is converted into a Parquet format, and stored again in S3.
Amazon Athena is used as the SQL (Structured Query Language) engine to query the data in S3. Athena is easy to use and is often quickly adopted by various teams.
Teams visualize query results in Amazon QuickSight. Amazon QuickSight is a business analytics service that allows you to quickly and easily visualize data and collaborate with other users in your account.
This architecture is known as the data platform and is shared by the data science, data engineering, and the data operations teams within the organization. Move, Inc. also enables other cross-functional teams to use Athena. When many users use Athena, it helps to monitor its usage to ensure cost-effectiveness. This leads to a strong need for Athena metrics that can give details about the following:
Users
Amount of data scanned (to monitor the cost of AWS service usage)
The databases used for queries
Actual queries that teams run
Currently, the Move, Inc. team does not have an easy way of obtaining all these metrics from a single tool. Having a way to do this would greatly simplify monitoring efforts. For example, the data operations team wants to collect several metrics every day obtained from queries run on Athena for their data. They require the following metrics:
Amount of data scanned by each user
Number of queries by each user
Databases accessed by each user
In this post, I discuss how to build a solution for monitoring Athena usage. To build this solution, you rely on AWS CloudTrail. CloudTrail is a web service that records AWS API calls for your AWS account and delivers log files to an S3 bucket.
Solution
Here is the high-level overview:
Use the CloudTrail API to audit the user queries, and then use Athena to create a table from the CloudTrail logs.
Query the Athena API with the AWS CLI to gather metrics about the data scanned by the user queries and put this information into another table in Athena.
Combine the information from these two sources by joining the two tables.
Use the resulting data to analyze, build insights, and create a dashboard that shows the usage of Athena by users within different teams in the organization.
The architecture of this solution is shown in the following diagram.
Take a look at this solution step by step.
IAM and permissions setup
This solution uses CloudTrail, Athena, and S3. Make sure that the users who run the following scripts and steps have the appropriate IAM roles and policies. For more information, see Tutorial: Delegate Access Across AWS Accounts Using IAM Roles.
Step 1: Create a table in Athena for data in CloudTrail
The CloudTrail API records all Athena queries run by different teams within the organization. These logs are saved in S3. The fields of most interest are:
User identity
Start time of the API call
Source IP address
Request parameters
Response elements returned by the service
When end users make queries in Athena, these queries are recorded by CloudTrail as responses from Athena web service calls. In these responses, each query is represented as a JSON (JavaScript Object Notation) string.
You can use the following CREATE TABLE statement to create the cloudtrail_logs table in Athena. For more information, see Querying CloudTrail Logs in the Athena documentation.
Step 2: Create a table in Amazon Athena for data from API output
Athena provides an API that can be queried to obtain information of a specific query ID. It also provides an API to obtain information of a batch of query IDs, with a batch size of up to 50 query IDs.
You can use this API call to obtain information about the Athena queries that you are interested in and store this information in an S3 location. Create an Athena table to represent this data in S3. For the purpose of this post, the response fields that are of interest are as follows:
QueryExecutionId
Database
EngineExecutionTimeInMillis
DataScannedInBytes
Status
SubmissionDateTime
CompletionDateTime
The CREATE TABLE statement for athena_api_output is as follows:
CREATE EXTERNAL TABLE IF NOT EXISTS athena_api_output(
queryid string,
querydatabase string,
executiontime bigint,
datascanned bigint,
status string,
submissiondatetime string,
completiondatetime string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://<s3 location of the output from the API calls>'
TBLPROPERTIES ('has_encrypted_data'='false')
You can inspect the query IDs and user information for the last day. The query is as follows:
with data AS (
SELECT
json_extract(responseelements,
'$.queryExecutionId') AS query_id,
(useridentity.arn) AS uid,
(useridentity.sessioncontext.sessionIssuer.userName) AS role,
from_iso8601_timestamp(eventtime) AS dt
FROM cloudtrail_logs
WHERE eventsource='athena.amazonaws.com'
AND eventname='StartQueryExecution'
AND json_extract(responseelements, '$.queryExecutionId') is NOT null)
SELECT *
FROM data
WHERE dt > date_add('day',-1,now() )
Step 3: Obtain query statistics from the Athena API
You can write a simple Python script to loop through queries in batches of 50 and query the Athena API for query statistics. You can use the Boto library for these lookups. Boto is a library that provides you with an easy way to interact with and automate your AWS development. The response from the Boto API can be parsed to extract the fields that you need as described in Step 2.
An example Python script is available in the AthenaMetrics GitHub repo.
Format these fields, for each query ID, as CSV strings and store them for the entire batch response in an S3 bucket. This S3 bucket is represented by the table created in Step 2, athena_api_output.
In your Python code, create a variable named sql_query, and assign it a string representing the SQL query defined in Step 2. The s3_query_folder is the location in S3 that is used by Athena for storing results of the query. The code is as follows:
import uuid
import boto3

sql_query = """
with data AS (
SELECT
json_extract(responseelements,
'$.queryExecutionId') AS query_id,
(useridentity.arn) AS uid,
(useridentity.sessioncontext.sessionIssuer.userName) AS role,
from_iso8601_timestamp(eventtime) AS dt
FROM cloudtrail_logs
WHERE eventsource='athena.amazonaws.com'
AND eventname='StartQueryExecution'
AND json_extract(responseelements, '$.queryExecutionId') is NOT null)
SELECT *
FROM data
WHERE dt > date_add('day',-1,now() )
"""
athena_client = boto3.client('athena')
query_execution = athena_client.start_query_execution(
    QueryString=sql_query,
    ClientRequestToken=str(uuid.uuid4()),
    ResultConfiguration={
        'OutputLocation': s3_query_folder,
    }
)
query_execution_id = query_execution['QueryExecutionId']
### Allow query to complete, check for status response["QueryExecution"]["Status"]["State"]
response = athena_client.get_query_execution(QueryExecutionId=query_execution_id)
if response["QueryExecution"]["Status"]["State"] == "SUCCEEDED":
    results = athena_client.get_query_results(QueryExecutionId=query_execution_id)
You can iterate through the results in the response object and consolidate them in batches of 50 results. For each batch, you can invoke the Athena API, batch-get-query-execution.
Store the output in the S3 location pointed to by the CREATE TABLE definition for the table athena_api_output, in Step 2. The SQL statement above returns only queries run in the last 24 hours. You may want to increase that to get usage over a longer period of time. The code snippet for this API call is as follows:
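The snippet from the original script isn’t shown here; the following boto3 sketch illustrates the batch call and one way to flatten its response into the CSV fields from Step 2 (the exact formatting in the original script may differ).

# batchqueryids is a list of up to 50 query IDs gathered from the SELECT query above
batch_response = athena_client.batch_get_query_execution(
    QueryExecutionIds=batchqueryids
)
for qe in batch_response["QueryExecutions"]:
    row = [
        qe["QueryExecutionId"],
        qe.get("QueryExecutionContext", {}).get("Database", ""),
        str(qe.get("Statistics", {}).get("EngineExecutionTimeInMillis", "")),
        str(qe.get("Statistics", {}).get("DataScannedInBytes", "")),
        qe["Status"]["State"],
        str(qe["Status"]["SubmissionDateTime"]),
        str(qe["Status"].get("CompletionDateTime", "")),
    ]
    print(",".join(row))  # one CSV line per query, in the column order of athena_api_output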
The batchqueryids value is an array of 50 query IDs extracted from the result set of the SELECT query. This script creates the data needed by your second table, athena_api_output, and you are now ready to join both tables in Athena.
Step 4: Join the CloudTrail and Athena API data
Now that the two tables are available with the data that you need, you can run the following Athena query to look at the usage by user. You can limit the output of this query to the most recent five days.
SELECT
c.useridentity.arn,
json_extract(c.responseelements, '$.queryExecutionId') qid,
a.datascanned,
a.querydatabase,
a.executiontime,
a.submissiondatetime,
a.completiondatetime,
a.status
FROM cloudtrail_logs c
JOIN athena_api_output a
ON cast(json_extract(c.responseelements, '$.queryExecutionId') as varchar) = a.queryid
WHERE eventsource = 'athena.amazonaws.com'
AND eventname = 'StartQueryExecution'
AND from_iso8601_timestamp(eventtime) > date_add('day',-5 ,now() )
Step 5: Analyze and visualize the results
In this step, using Amazon QuickSight, you can create a dashboard that shows metrics such as the following:
Average amount of data scanned (MB) by a user and database
Using the solution described in this post, you can continuously monitor the usage of Athena by various teams. Taking this a step further, you can automate and set user limits for how much data the Athena users in your team can query within a given period of time. You may also choose to add notifications when the usage by a particular user crosses a specified threshold. This helps you manage costs incurred by different teams in your organization.
Realtor.com would like to acknowledge the tremendous support and guidance provided by Hemant Borole, Senior Consultant, Big Data & Analytics with AWS Professional Services in helping to author this post.
Ajay Rathod is Staff Data Engineer at Realtor.com. With a deep background in AWS Cloud Platform and Data Infrastructure, Ajay leads the Data Engineering and automation aspect of Data Operations at Realtor.com. He has designed and deployed many ETL pipelines and workflows for the Realtor Data Analytics Platform using AWS services like Data Pipeline, Athena, Batch, Glue and Boto3. He has created various operational metrics to monitor ETL Pipelines and Resource Usage.
This is a customer post by Stephen Borg, the Head of Big Data and BI at Cerberus Technologies.
Cerberus Technologies, in their own words: Cerberus is a company founded in 2017 by a team of visionary iGaming veterans. Our mission is simple – to offer the best tech solutions through a data-driven and a customer-first approach, delivering innovative solutions that go against traditional forms of working and process. This mission is based on the solid foundations of reliability, flexibility and security, and we intend to fundamentally change the way iGaming and other industries interact with technology.
Over the years, I have developed and created a number of data warehouses from scratch. Recently, I built a data warehouse for the iGaming industry single-handedly. To do it, I used the power and flexibility of Amazon Redshift and the wider AWS data management ecosystem. In this post, I explain how I was able to build a robust and scalable data warehouse without the large team of experts typically needed.
In two of my recent projects, I ran into challenges when scaling our data warehouse using on-premises infrastructure. Data was growing at many tens of gigabytes per day, and query performance was suffering. Scaling required major capital investment for hardware and software licenses, and also significant operational costs for maintenance and technical staff to keep it running and performing well. Unfortunately, I couldn’t get the resources needed to scale the infrastructure with data growth, and these projects were abandoned. Thanks to cloud data warehousing, the bottlenecks of infrastructure resources, capital expense, and operational costs have been significantly reduced or have totally gone away. There is no more excuse for allowing obstacles of the past to delay delivering timely insights to decision makers, no matter how much data you have.
With Amazon Redshift and AWS, I delivered a cloud data warehouse to the business very quickly, and with a small team: me. I didn’t have to order hardware or software, and I no longer needed to install, configure, tune, or keep up with patches and version updates. Instead, I easily set up a robust data processing pipeline and we were quickly ingesting and analyzing data. Now, my data warehouse team can be extremely lean, and focus more time on bringing in new data and delivering insights. In this post, I show you the AWS services and the architecture that I used.
Handling data feeds
I have several different data sources that provide everything needed to run the business. The data includes activity from our iGaming platform, social media posts, clickstream data, marketing and campaign performance, and customer support engagements.
To handle the diversity of data feeds, I developed abstract integration applications using Docker that run on Amazon EC2 Container Service (Amazon ECS) and feed data to Amazon Kinesis Data Streams. These data streams can be used for real-time analytics. In my system, each record in Kinesis is preprocessed by an AWS Lambda function to cleanse and aggregate information, and Amazon Kinesis Data Firehose then stores it where I need it on Amazon S3. Suppose that you used an on-premises architecture to accomplish the same task. A team of data engineers would be required to maintain and monitor a Kafka cluster, develop applications to stream data, and maintain a Hadoop cluster and the infrastructure underneath it for data storage. With my stream processing architecture, there are no servers to manage, no disk drives to replace, and no service monitoring to write.
Setting up a Kinesis stream can be done with a few clicks, and the same for Kinesis Firehose. Firehose can be configured to automatically consume data from a Kinesis Data Stream, and then write compressed data every N minutes to Amazon S3. When I want to process a Kinesis data stream, it’s very easy to set up a Lambda function to be executed on each message received. I can just set a trigger from the AWS Lambda Management Console, as shown following.
Regardless of the format in which I receive the data from our partners, I can send it to Kinesis as JSON data using my own formatters. After Firehose writes this to Amazon S3, I have everything in nearly the same structure I received, but compressed, encrypted, and optimized for reading.
This data is automatically crawled by AWS Glue and placed into the AWS Glue Data Catalog. This means that I can immediately query the data directly on S3 using Amazon Athena or through Amazon Redshift Spectrum. Previously, I used Amazon EMR and an Amazon RDS–based metastore in Apache Hive for catalog management. Now I can avoid the complexity of maintaining Hive Metastore catalogs. Glue takes care of high availability and the operations side so that I know that end users can always be productive.
Working with Amazon Athena and Amazon Redshift for analysis
I found Amazon Athena extremely useful out of the box for ad hoc analysis. Our engineers (me) use Athena to understand new datasets that we receive and to understand what transformations will be needed for long-term query efficiency.
For our data analysts and data scientists, we’ve selected Amazon Redshift. Amazon Redshift has proven to be the right tool for us over and over again. It easily processes 20+ million transactions per day, regardless of the footprint of the tables and the type of analytics required by the business. Latency is low and query performance expectations have been more than met. We use Redshift Spectrum for long-term data retention, which enables me to extend the analytic power of Amazon Redshift beyond local data to anything stored in S3, and without requiring me to load any data. Redshift Spectrum gives me the freedom to store data where I want, in the format I want, and have it available for processing when I need it.
To load data directly into Amazon Redshift, I use AWS Data Pipeline to orchestrate data workflows. I create Amazon EMR clusters on an intra-day basis, which I can easily adjust to run more or less frequently as needed throughout the day. EMR clusters are used together with Amazon RDS, Apache Spark 2.0, and S3 storage. The data pipeline application loads ETL configurations from Spring RESTful services hosted on AWS Elastic Beanstalk. The application then loads data from S3 into memory, aggregates and cleans the data, and then writes the final version of the data to Amazon Redshift. This data is then ready to use for analysis. Spark on EMR also helps with recommendations and personalization use cases for various business users, and I find this easy to set up and deliver what users want. Finally, business users use Amazon QuickSight for self-service BI to slice, dice, and visualize the data depending on their requirements.
Each AWS service in this architecture plays its part in saving precious time that’s crucial for delivery and getting different departments in the business on board. I found the services easy to set up and use, and all have proven to be highly reliable in our production environments. When the architecture was in place, scaling out was either completely handled by the service, or a matter of a simple API call, and crucially doesn’t require me to change one line of code. Increasing shards for Kinesis can be done in a minute by editing a stream. Increasing capacity for Lambda functions can be accomplished by editing the megabytes allocated for processing, and concurrency is handled automatically. EMR cluster capacity can easily be increased by changing the master and slave node types in Data Pipeline, or by using Auto Scaling. Lastly, RDS and Amazon Redshift can be easily upgraded without any major tasks to be performed by our team (again, me).
In the end, using AWS services including Kinesis, Lambda, Data Pipeline, and Amazon Redshift allows me to keep my team lean and highly productive. I eliminated the cost and delays of capital infrastructure, as well as the late night and weekend calls for support. I can now give maximum value to the business while keeping operational costs down. My team pushed out an agile and highly responsive data warehouse solution in record time and we can handle changing business requirements rapidly, and quickly adapt to new data and new user requests.
Stephen Borg is the Head of Big Data and BI at Cerberus Technologies. He has a background in platform software engineering, and first became involved in data warehousing using the typical RDBMS, SQL, ETL, and BI tools. He quickly became passionate about providing insight to help others optimize the business and add personalization to products. He is now the Head of Big Data and BI at Cerberus Technologies.
One of the challenges faced by our customers—especially those in highly regulated industries—is balancing the need for security with flexibility. In this post, we cover how to enable multi-tenancy and increase security by using EMRFS (EMR File System) authorization, the Amazon S3 storage-level authorization on Amazon EMR.
Amazon EMR is an easy, fast, and scalable analytics platform enabling large-scale data processing. EMRFS authorization provides Amazon S3 storage-level authorization by configuring EMRFS with multiple IAM roles. With this functionality enabled, different users and groups can share the same cluster and assume their own IAM roles respectively.
Simply put, on Amazon EMR, we can now have an Amazon EC2 role per user assumed at run time instead of one general EC2 role at the cluster level. When the user is trying to access Amazon S3 resources, Amazon EMR evaluates against a predefined mappings list in EMRFS configurations and picks up the right role for the user.
In this post, we discuss what EMRFS authorization is (Amazon S3 storage-level access control) and show how to configure the role mappings with detailed examples. You will then have the desired permissions in a multi-tenant environment. We also demo Amazon S3 access from the HDFS command line, Apache Hive on Hue, and Apache Spark.
EMRFS authorization for Amazon S3
There are two prerequisites for using this feature:
Users must be authenticated, because EMRFS needs to map the current user/group/prefix to a predefined user/group/prefix. There are several authentication options. In this post, we launch a Kerberos-enabled cluster that manages the Key Distribution Center (KDC) on the master node, and enable a one-way trust from the KDC to a Microsoft Active Directory domain.
The application must support accessing Amazon S3 via EMRFS. Applications that have their own S3FileSystem APIs (for example, Presto) are not supported at this time.
EMRFS supports three types of mapping entries: user, group, and Amazon S3 prefix. Let’s use an example to show how this works.
Assume that you have the following three identities in your organization, and they are defined in the Active Directory:
To enable all these groups and users to share the EMR cluster, you need to define the following IAM roles:
In this case, you create a separate Amazon EC2 role that doesn’t give any permission to Amazon S3. Let’s call this role the base role (the EC2 role attached to the EMR cluster), which in this example is named EMR_EC2_RestrictedRole. Then, you define all the Amazon S3 permissions for each specific user or group in their own roles. The restricted role serves as the fallback role when the user doesn’t belong to any of the listed users or groups and isn’t trying to access any of the Amazon S3 prefixes defined in the list.
Important: For all other roles, like emrfs_auth_group_role_data_eng, you need to add the base role (EMR_EC2_RestrictedRole) as the trusted entity so that it can assume other roles. See the following example:
This role grants all Amazon S3 permissions to the emrfs-auth-data-science-bucket-demo bucket and all the objects in it. Similarly, the policy for the role emrfs_auth_group_role_data_eng is shown below:
To configure EMRFS authorization, you use an EMR security configuration. Here is the configuration we use in this post:
Consider the following scenario.
First, the admin user admin1 tries to log in and run a command to access Amazon S3 data through EMRFS. The first role emrfs_auth_user_role_admin_user on the mapping list, which is a user role, is mapped and picked up. Then admin1 has access to the Amazon S3 locations that are defined in this role.
Then a user from the data engineer group (grp_data_engineering) tries to access a data bucket to run some jobs. When EMRFS sees that the user is a member of the grp_data_engineering group, the group role emrfs_auth_group_role_data_eng is assumed, and the user has proper access to Amazon S3 that is defined in the emrfs_auth_group_role_data_eng role.
Next, the third user comes, who is not an admin and doesn’t belong to any of the groups. After failing evaluation of the top three entries, EMRFS evaluates whether the user is trying to access a certain Amazon S3 prefix defined in the last mapping entry. This type of mapping entry is called the prefix type. If the user is trying to access s3://emrfs-auth-default-bucket-demo/, then the prefix mapping is in effect, and the prefix role emrfs_auth_prefix_role_default_s3_prefix is assumed.
If the user is not trying to access any of the Amazon S3 paths that are defined on the list—which means it failed the evaluation of all the entries—it only has the permissions defined in the EMR_EC2_RestrictedRole. This role is assumed by the EC2 instances in the cluster.
In this process, all the mappings defined are evaluated in the defined order, and the first role that is mapped is assumed, and the rest of the list is skipped.
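The following is illustrative pseudocode only (not Amazon EMR’s implementation); it simply restates the evaluation order described above.

def resolve_role(user, groups, s3_path, mappings, base_role="EMR_EC2_RestrictedRole"):
    for entry in mappings:  # evaluated strictly in the order they are defined
        if entry["type"] == "user" and entry["value"] == user:
            return entry["role"]  # first match wins; the rest of the list is skipped
        if entry["type"] == "group" and entry["value"] in groups:
            return entry["role"]
        if entry["type"] == "prefix" and s3_path.startswith(entry["value"]):
            return entry["role"]
    return base_role  # fallback: the restricted EC2 role attached to the cluster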
Setting up an EMR cluster and mapping Active Directory users and groups
Now that we know how EMRFS authorization role mapping works, the next thing we need to think about is how we can use this feature in an easy and manageable way.
Active Directory setup
Many customers manage their users and groups using Microsoft Active Directory or other tools like OpenLDAP. In this post, we create the Active Directory on an Amazon EC2 instance running Windows Server and create the users and groups we will be using in the example below. After setting up Active Directory, we use the Amazon EMR Kerberos auto-join capability to establish a one-way trust from the KDC running on the EMR master node to the Active Directory domain on the EC2 instance. You can use your own directory services as long as it talks to the LDAP (Lightweight Directory Access Protocol).
After configuring Active Directory, you can create all the users and groups using the Active Directory tools and add users to appropriate groups. In this example, we created users like admin1, dataeng1, and datascientist1, and groups like grp_data_engineering and grp_data_science, and then added the users to the right groups.
Join the EMR cluster to an Active Directory domain
For clusters with Kerberos, Amazon EMR now supports automated Active Directory domain joins. You can use the security configuration to configure the one-way trust from the KDC to the Active Directory domain. You also configure the EMRFS role mappings in the same security configuration.
The following is an example of the EMR security configuration with a trusted Active Directory domain EMRKRB.TEST.COM and the EMRFS role mappings as we discussed earlier:
The EMRFS role mapping configuration is shown in this example:
We will also provide an example AWS CLI command that you can run.
Launching the EMR cluster and running the tests
Now you have configured Kerberos and EMRFS authorization for Amazon S3.
Additionally, you need to configure Hue with Active Directory using the Amazon EMR configuration API in order to log in using the AD users created before. The following is an example of Hue AD configuration.
Note: In the preceding configuration JSON file, change the values as required before pasting it into the software setting section in the Amazon EMR console.
Now let’s use this configuration and the security configuration you created before to launch the cluster.
In the Amazon EMR console, choose Create cluster. Then choose Go to advanced options. On the Step 1: Software and Steps page, under Edit software settings (optional), paste the configuration in the box.
The rest of the setup is the same as an ordinary cluster setup, except for the security options. In Step 4: Security, under Permissions, choose Custom, and then choose the restricted role (EMR_EC2RestrictedRole) that you created before.
Choose the appropriate subnets (they must meet the base requirements for a successful Active Directory join; see the Amazon EMR Management Guide for more details) and the appropriate security groups so that the cluster can communicate with Active Directory. Also choose an EC2 key pair so that you can log in and configure the cluster.
Most importantly, choose the security configuration that you created earlier to enable Kerberos and EMRFS authorization for Amazon S3.
You can also create the cluster from the AWS CLI with the aws emr create-cluster command, passing the security configuration and the Hue software configuration.
Note: If you create the cluster using the CLI, you need to save the JSON configuration for Hue into a file named hue-config.json and place it on the server where you run the CLI command.
After the cluster gets into the Waiting state, try to connect to the cluster over SSH using an Active Directory user name and password.
It successfully returns the listing results. Next we will test Apache Hive and then Apache Spark.
To run jobs successfully, you need to create a home directory in HDFS under /user/<username> for every user, for staging data. You can configure a step that creates a home directory at cluster launch time for every user who has access to the cluster. In this example, you use Hue, because Hue creates the HDFS home directory for a user at first login. For this to work, Hue must be integrated with the same Active Directory, as shown in the example configuration described earlier.
First, log in to Hue as a data engineer user, and open a Hive Notebook in Hue. Then run a query to create a new table pointing to the data engineer bucket, s3://emrfs-auth-data-engineering-bucket-demo/table1_data_eng/.
You can see that the table was created successfully. Now try to create another table pointing to the data science group’s bucket, where the data engineer group doesn’t have access.
It failed and threw an Amazon S3 Access Denied error.
Now insert one line of data into the successfully created table.
Next, log out, switch to a data science group user, and create another table, test2_datasci_tb.
The creation is successful.
The last task is to test Spark (Spark requires the user’s HDFS home directory, but Hue already created one in the previous step).
Now let’s come back to the command line and run some Spark commands.
Log in to the master node as the datascientist1 user:
Start the SparkSQL interactive shell by typing spark-sql, and run the show tables command. It should list the tables that you created using Hive.
As a data science group user, try running a SELECT on both tables. You will find that you can query only the table whose location your group has access to.
Conclusion
EMRFS authorization for Amazon S3 enables you to have multiple roles on the same cluster, providing flexibility to configure a shared cluster for different teams to achieve better efficiency. The Active Directory integration and group mapping make it much easier for you to manage your users and groups, and provides better auditability in a multi-tenant environment.
Songzhi Liu is a Big Data Consultant with AWS Professional Services. He works closely with AWS customers to provide them Big Data & Machine Learning solutions and best practices on the Amazon cloud.
Microservice-application requirements have changed dramatically in recent years. These days, applications operate with petabytes of data, need almost 100% uptime, and end users expect sub-second response times. Typical N-tier applications can’t deliver on these requirements.
The Reactive Manifesto, published in 2014, describes the essential characteristics of reactive systems: responsiveness, resiliency, elasticity, and being message driven.
Being message driven is perhaps the most important characteristic of reactive systems. Asynchronous messaging helps in the design of loosely coupled systems, which is a key factor for scalability. In order to build a highly decoupled system, it is important to isolate services from each other. As already described, isolation is an important aspect of the microservices pattern. Indeed, reactive systems and microservices are a natural fit.
Implemented Use Case
This reference architecture illustrates a typical ad-tracking implementation.
Many ad-tracking companies collect massive amounts of data in near-real time. In many cases, these workloads are very spiky and heavily depend on the success of the ad-tech companies’ customers. Typically, an ad-tracking data use case can be separated into a real-time part and a non-real-time part. In the real-time part, it is important to collect data as fast as possible and ask several questions, including: “Is this a valid combination of parameters?” “Does this program exist?” “Is this program still valid?”
Because response time has a huge impact on conversion rate in advertising, it is important for advertisers to respond as fast as possible. The data needed for validation should be kept in memory to reduce communication overhead with the caching infrastructure. The tracking application itself should be as lightweight and scalable as possible. For example, the application shouldn’t have any shared mutable state, and it should use reactive paradigms. In our implementation, one main application is responsible for this real-time part. It collects and validates data, responds to the client as fast as possible, and asynchronously sends events to backend systems.
The non-real-time part of the application consumes the generated events and persists them in a NoSQL database. In a typical tracking implementation, clicks, cookie information, and transactions are matched asynchronously and persisted in a data store. The matching part is not implemented in this reference architecture. Many ad-tech architectures use frameworks like Hadoop for the matching implementation.
The system can be logically divided into the data collection part and the core data update part. The data collection part is responsible for collecting, validating, and persisting the data. In the core data update part, the data that is used for validation gets updated, and all subscribers are notified of new data.
Components and Services
Main Application
The main application is implemented using Java 8 and uses Vert.x as the main framework. Vert.x is an event-driven, reactive, non-blocking, polyglot framework for implementing microservices. It runs on the Java virtual machine (JVM) by using the low-level IO library Netty. You can write applications in Java, JavaScript, Groovy, Ruby, Kotlin, Scala, and Ceylon. The framework offers a simple and scalable actor-like concurrency model. Vert.x calls handlers by using a thread known as an event loop. To use this model, you write code units known as “verticles.” Verticles share certain similarities with actors in the actor model. To use them, you have to implement the verticle interface. Verticles communicate with each other by sending messages on a single event bus. Those messages are sent to a specific address on the event bus, and verticles can register handlers for that address.
With only a few exceptions, none of the APIs in Vert.x block the calling thread. Similar to Node.js, Vert.x uses the reactor pattern. However, in contrast to Node.js, Vert.x uses several event loops. Unfortunately, not all APIs in the Java ecosystem are written asynchronously; the JDBC API is one example. Vert.x offers the possibility to run these blocking APIs without blocking the event loop by using special verticles called worker verticles. You don’t execute worker verticles on the standard Vert.x event loops, but on a dedicated thread from a worker pool. This way, the worker verticles don’t block the event loop.
Our application consists of five different verticles covering different aspects of the business logic. The main entry point for our application is the HttpVerticle, which exposes an HTTP endpoint to consume HTTP requests and to support health checking. Data from HTTP requests, such as parameters and user-agent information, is collected and transformed into a JSON message. In order to validate the input data (to ensure that the program exists and is still valid), the message is sent to the CacheVerticle.
This verticle implements an LRU cache with a TTL of 10 minutes and a capacity of 100,000 entries. Instead of adding additional functionality to a standard JDK map implementation, we use Google Guava, which has all the features we need. If the data is not in the L1 cache, the message is sent to the RedisVerticle. This verticle is responsible for data residing in Amazon ElastiCache and uses the Vert.x Redis client to read data from Redis. In our example, Redis is the central data store. However, in a typical production implementation, Redis would just be the L2 cache with a central data store like Amazon DynamoDB. One of the most important paradigms of a reactive system is to switch from a pull-based to a push-based model. To achieve this and reduce network overhead, we use Redis pub/sub to push core data changes to our main application.
Vert.x also supports direct Redis pub/sub integration. The following code shows our subscriber implementation:
// Forward messages received on the Vert.x event bus address for Redis pub/sub
// to the cache verticle, which stores the payload in the L1 cache.
vertx.eventBus().<JsonObject>consumer(REDIS_PUBSUB_CHANNEL_VERTX, received -> {
    JsonObject value = received.body().getJsonObject("value");
    String message = value.getString("message");
    JsonObject jsonObject = new JsonObject(message);
    eb.send(CACHE_REDIS_EVENTBUS_ADDRESS, jsonObject);
});

// Subscribe to the Redis pub/sub channel that carries core data updates.
redis.subscribe(Constants.REDIS_PUBSUB_CHANNEL, res -> {
    if (res.succeeded()) {
        LOGGER.info("Subscribed to " + Constants.REDIS_PUBSUB_CHANNEL);
    } else {
        LOGGER.info(res.cause());
    }
});
The verticle subscribes to the appropriate Redis pub/sub channel. If a message is sent over this channel, the payload is extracted and forwarded to the cache verticle, which stores the data in the L1 cache. After the data is stored and enriched, a response is sent back to the HttpVerticle, which responds to the HTTP request that initially hit this verticle. In addition, the message is converted to a ByteBuffer, wrapped in protocol buffers, and sent to an Amazon Kinesis data stream.
The following example shows a stripped-down version of the KinesisVerticle:
Kinesis Consumer
This AWS Lambda function consumes data from an Amazon Kinesis data stream and persists the data in an Amazon DynamoDB table. To improve testability, the invocation code is separated from the business logic. The invocation code is implemented in the class KinesisConsumerHandler and iterates over the Kinesis events pulled from the Kinesis stream by AWS Lambda. Each Kinesis event is unwrapped, transformed from ByteBuffer to protocol buffers, and converted into a Java object. Those Java objects are passed to the business logic, which persists the data in a DynamoDB table. To reduce the duration of successive Lambda calls, the DynamoDB client is instantiated lazily and reused if possible.
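The production consumer is written in Java and uses protocol buffers. As a rough, language-agnostic sketch of the same flow, a Python Lambda handler might look like the following; the table name and record fields are placeholders, and plain JSON stands in for the protobuf decoding.

import base64
import json
from decimal import Decimal

import boto3

# Created outside the handler so it is reused across warm invocations,
# mirroring the lazy client reuse described above. Table name is a placeholder.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("TrackingEvents")

def handler(event, context):
    # AWS Lambda delivers Kinesis records base64-encoded under Records[].kinesis.data.
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        # Real implementation: protobuf deserialization; here we assume JSON payloads.
        item = json.loads(payload, parse_float=Decimal)
        table.put_item(Item=item)  # persist the tracking event
    return {"processed": len(event["Records"])}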
Redis Updater
From time to time, it is necessary to update core data in Redis. A very efficient implementation for this requirement uses AWS Lambda and Amazon Kinesis. New core data is sent over an Amazon Kinesis stream in JSON format and consumed by a Lambda function. This function iterates over the Kinesis events pulled from the Kinesis stream by AWS Lambda. Each Kinesis event is unwrapped, transformed from ByteBuffer to String, and converted into a Java object. The Java object is passed to the business logic and stored in Redis. In addition, the new core data is also sent to the main application using Redis pub/sub in order to reduce network overhead and convert from a pull-based to a push-based model.
The following example shows the source code to store data in Redis and notify all subscribers:
public void updateRedisData(final TrackingMessage trackingMessage, final Jedis jedis, final LambdaLogger logger) {
    try {
        ObjectMapper mapper = new ObjectMapper();
        String jsonString = mapper.writeValueAsString(trackingMessage);
        Map<String, String> map = marshal(jsonString);
        String statusCode = jedis.hmset(trackingMessage.getProgramId(), map);
    }
    catch (Exception exc) {
        if (null == logger)
            exc.printStackTrace();
        else
            logger.log(exc.getMessage());
    }
}

public void notifySubscribers(final TrackingMessage trackingMessage, final Jedis jedis, final LambdaLogger logger) {
    try {
        ObjectMapper mapper = new ObjectMapper();
        String jsonString = mapper.writeValueAsString(trackingMessage);
        jedis.publish(Constants.REDIS_PUBSUB_CHANNEL, jsonString);
    }
    catch (final IOException e) {
        log(e.getMessage(), logger);
    }
}
Similar to the Kinesis consumer, the Redis client is instantiated lazily and reused where possible.
Infrastructure as Code
As already outlined, latency and response time are a very critical part of any ad-tracking solution because response time has a huge impact on conversion rate. In order to reduce latency for customers worldwide, it is common practice to roll out the infrastructure in different AWS Regions to be as close to the end customer as possible. AWS CloudFormation can help you model and set up your AWS resources so that you can spend less time managing those resources and more time focusing on the applications that run in AWS.
You create a template that describes all the AWS resources that you want (for example, Amazon EC2 instances or Amazon RDS DB instances), and AWS CloudFormation takes care of provisioning and configuring those resources for you. Our reference architecture can be rolled out in different Regions using an AWS CloudFormation template, which sets up the complete infrastructure (for example, Amazon Virtual Private Cloud (Amazon VPC), Amazon Elastic Container Service (Amazon ECS) cluster, Lambda functions, DynamoDB table, Amazon ElastiCache cluster, etc.).
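As a rough sketch of that multi-Region rollout, assuming the template has already been uploaded to Amazon S3, a script could create the same stack in each target Region. The Region list, template URL, and stack name below are placeholders.

import boto3

REGIONS = ["us-east-1", "eu-west-1", "ap-southeast-1"]                     # placeholder Region list
TEMPLATE_URL = "https://s3.amazonaws.com/my-bucket/reactive-stack.yaml"    # placeholder template URL

for region in REGIONS:
    cfn = boto3.client("cloudformation", region_name=region)
    cfn.create_stack(
        StackName="ad-tracking",                 # placeholder stack name
        TemplateURL=TEMPLATE_URL,
        Capabilities=["CAPABILITY_IAM"],         # assumes the stack creates IAM resources
    )
    print("Stack creation started in " + region)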
Conclusion
In this blog post, we described reactive principles and an example architecture with a common use case. We leveraged the capabilities of different frameworks in combination with several AWS services in order to implement reactive principles, not only at the application level but also at the system level. I hope I’ve given you ideas for creating your own reactive applications and systems on AWS.
About the Author
Sascha Moellering is a Senior Solution Architect. Sascha is primarily interested in automation, infrastructure as code, distributed computing, containers and JVM. He can be reached at [email protected]
Kuhu Shukla (bottom center) and team at the 2017 DataWorks Summit
By Kuhu Shukla
This post first appeared here on the Apache Software Foundation blog as part of ASF’s “Success at Apache” monthly blog series.
As I sit at my desk on a rather frosty morning with my coffee, looking up new JIRAs from the previous day in the Apache Tez project, I feel rather pleased. The latest community release vote is complete, the bug fixes that we so badly needed are in and the new release that we tested out internally on our many thousand strong cluster is looking good. Today I am looking at a new stack trace from a different Apache project process and it is hard to miss how much of the exceptional code I get to look at every day comes from people all around the globe. A contributor leaves a JIRA comment before he goes on to pick up his kid from soccer practice while someone else wakes up to find that her effort on a bug fix for the past two months has finally come to fruition through a binding +1.
Yahoo – which joined AOL, HuffPost, Tumblr, Engadget, and many more brands to form the Verizon subsidiary Oath last year – has been at the frontier of open source adoption and contribution since before I was in high school. So while I have no historical trajectories to share, I do have a story on how I found myself in an epic journey of migrating all of Yahoo jobs from Apache MapReduce to Apache Tez, a then-new DAG based execution engine.
Oath grid infrastructure is driven through and through by Apache technologies, be it storage through HDFS, resource management through YARN, job execution frameworks with Tez, or user interface engines such as Hive, Hue, Pig, Sqoop, Spark, and Storm. Our grid solution is specifically tailored to Oath’s business-critical data pipeline needs using the polymorphic technologies hosted, developed, and maintained by the Apache community.
On the third day of my job at Yahoo in 2015, I received a YouTube link to An Introduction to Apache Tez. I watched it carefully, trying to keep up with all the questions I had, and recognized a few names from my academic readings of YARN ACM papers. I continued to ramp up on YARN and HDFS, the foundational Apache technologies Oath heavily contributes to even today. For the first few weeks I spent time picking out my favorite (necessary) mailing lists to subscribe to and getting started on setting up a pseudo-distributed Hadoop cluster. I continued to find my footing with newbie contributions, being ever more careful with whitespace in my patches. One thing was clear: Tez was the next big thing for us. By the time I could truly call myself a contributor in the Hadoop community, nearly 80-90% of the Yahoo jobs were running with Tez. But just like hiking up the Grand Canyon, the last 20% is where all the pain was. Being a part of the solution to this challenge was a happy prospect, and thankfully contributing to Tez became a goal in my next quarter.
The next sprint planning meeting ended with me getting my first major Tez assignment: progress reporting. The progress reporting in Tez was non-existent. “Just needs an API fix,” I thought. Like almost all bugs in this ecosystem, it was not easy. How do you define progress? How is it different for different kinds of outputs in a graph? The questions were many.
I, however, did not have to go far to get answers. The Tez community actively came to a newbie’s rescue, finding answers and posing important questions. I started attending the bi-weekly Tez community sync-up calls and asking existing contributors and committers for course correction. Suddenly the team was much bigger, the goals much more chiseled. This was new to anyone like me who came from the networking industry, where the most open parts of the code are the RFCs and the implementation details are often hidden. These meetings served as a clean room for our coding ideas and experiments. Ideas were shared, down to which data structure we should pick and what a future user of Tez would take from it. In between the usual status updates, extensive knowledge transfers took place.
Oath uses Apache Pig and Apache Hive extensively and most of the urgent requirements and requests came from Pig and Hive developers and users. Each issue led to a community JIRA and as we started running Tez at Oath scale, new feature ideas and bugs around performance and resource utilization materialized. Every year most of the Hadoop team at Oath travels to the Hadoop Summit where we meet our cohorts from the Apache community and we stand for hours discussing the state of the art and what is next for the project. One such discussion set the course for the next year and a half for me.
We needed an innovative way to shuffle data. Frameworks like MapReduce and Tez have a shuffle phase in their processing lifecycle wherein the data from upstream producers is made available to downstream consumers. Even though Apache Tez was designed with a feature set corresponding to optimization requirements in Pig and Hive, the Shuffle Handler Service was retrofitted from MapReduce at the time of the project’s inception. With several thousand jobs on our clusters leveraging these features in Tez, the Shuffle Handler Service became a clear performance bottleneck. So as we stood talking about our experience with Tez with our friends from the community, we decided to implement a new Shuffle Handler for Tez. All the conversation points were now tracked through an umbrella JIRA, TEZ-3334, and the to-do list was long. I picked a few JIRAs, and as I started reading through them I realized that this was all new code I would get to contribute to and review. There might be a better way to put this, but to be honest it was just a lot of fun! All the whiteboards were full, the team took walks post lunch and discussed how to go about defining the API. Countless hours were spent debugging hangs while fetching data and looking at stack traces and Wireshark captures from our test runs. Six months in and we had the feature on our sandbox clusters. There were moments ranging from sheer frustration to absolute exhilaration, with high fives, as we continued to address review comments and fix big and small issues with this evolving feature.
As much as owning your code is valued everywhere in the software community, I would never go on to say “I did this!” In fact, “we did!” It is this strong sense of shared ownership and fluid team structure that makes the open source experience at Apache truly rewarding. This is just one example. A lot of the work that was done in Tez was leveraged by the Hive and Pig community and cross Apache product community interaction made the work ever more interesting and challenging. Triaging and fixing issues with the Tez rollout led us to hit a 100% migration score last year and we also rolled the Tez Shuffle Handler Service out to our research clusters. As of last year we have run around 100 million Tez DAGs with a total of 50 billion tasks over almost 38,000 nodes.
In 2018, as I move on to explore Hadoop 3.0 as our future release, I hope that if someone outside the Apache community is reading this, it will inspire and intrigue them to contribute to a project of their choice. As an astronomy aficionado, going from a newbie Apache contributor to a newbie Apache committer was very much like looking through my telescope: it has endless possibilities and challenges you to be your best.
About the Author:
Kuhu Shukla is a software engineer at Oath and did her Masters in Computer Science at North Carolina State University. She works on the Big Data Platforms team on Apache Tez, YARN and HDFS with a lot of talented Apache PMCs and Committers in Champaign, Illinois. A recent Apache Tez Committer herself she continues to contribute to YARN and HDFS and spoke at the 2017 Dataworks Hadoop Summit on “Tez Shuffle Handler: Shuffling At Scale With Apache Hadoop”. Prior to that she worked on Juniper Networks’ router and switch configuration APIs. She likes to participate in open source conferences and women in tech events. In her spare time she loves singing Indian classical and jazz, laughing, whale watching, hiking and peering through her Dobsonian telescope.
Many enterprises use Microsoft Active Directory to manage users, groups, and computers in a network. One question comes up frequently: How can Active Directory users access big data workloads running on Amazon EMR with the same single sign-on (SSO) experience they have when accessing resources in the Active Directory network?
This post walks you through the process of using AWS CloudFormation to set up a cross-realm trust and extend authentication from an Active Directory network into an Amazon EMR cluster with Kerberos enabled. By establishing a cross-realm trust, Active Directory users can use their Active Directory credentials to access an Amazon EMR cluster and run jobs as themselves.
Walkthrough overview
In this example, you build a solution that allows Active Directory users to seamlessly access Amazon EMR clusters and run big data jobs. Here’s what you need before setting up this solution:
A possible limit increase for your account (Note: Usually a limit increase will not be necessary. See the AWS Service Limits documentation if you encounter a limit error while building the solution.)
To make it easier for you to get started, I created AWS CloudFormation templates that automatically configure and deploy the solution for you. The following steps and resources are involved in setting up the solution:
Note: If you want to manually create and configure the components for this solution without using AWS CloudFormation, refer to the Amazon EMR cross-realm documentation. IMPORTANT: The AWS CloudFormation templates used in this post are designed to work only in the us-east-1 (N. Virginia) Region. They are not intended for production use without modification.
Single-step solution deployment
If you don’t want to set up each component individually, you can use the single-step AWS CloudFormation template. The single-step template is a master template that uses nested stacks (additional templates) to launch and configure all the resources for the solution in one go.
To deploy the single-step template into your account, choose Launch Stack:
This takes you to the Create stack wizard in the AWS CloudFormation console. The template is launched in the US East (N. Virginia) Region by default. Do not change to a different Region because the template is designed to work only in us-east-1 (N. Virginia).
On the Select Template page, keep the default URL for the AWS CloudFormation template, and then choose Next.
On the Specify Details page, review the parameters for the template. Provide values for the parameters that require input (for more information, see the parameters table that follows).
The following parameters are available in this template.
Domain Controller name (default: DC1): NetBIOS (hostname) name of the Active Directory server. This name can be up to 15 characters long.
Active Directory domain (default: example.com): Fully qualified domain name (FQDN) of the forest root domain (for example, example.com).
Domain NetBIOS name (default: EXAMPLE): NetBIOS name of the domain for users of earlier versions of Windows. This name can be up to 15 characters long.
Domain admin user (default: CrossRealmAdmin): User name for the account that is added as domain administrator. This account is separate from the default administrator account.
Domain admin password (requires input): Password for the domain admin user. Must be at least eight characters including letters, numbers, and symbols.
Key pair name (requires input): Name of an existing key pair, which enables you to connect securely to your instance after it launches.
Instance type (default: m4.xlarge): Instance type for the domain controller and the Amazon EMR cluster.
Allowed IP address (default: 10.0.0.0/16): The client IP address range that can reach your cluster. Specify an IP address range in CIDR notation (for example, 203.0.113.5/32). By default, only the VPC CIDR (10.0.0.0/16) can reach the cluster. Be sure to add your client IP range so that you can connect to the cluster using SSH.
EMR Kerberos realm (default: EC2.INTERNAL): Cluster’s Kerberos realm name. By default, the realm name is derived from the cluster’s VPC domain name in uppercase letters (for example, EC2.INTERNAL is the default VPC domain name in the us-east-1 Region).
Trusted AD domain (default: EXAMPLE.COM): The Active Directory (AD) domain that you want to trust. This is the same as the “Active Directory domain.” However, it must use all uppercase letters (for example, EXAMPLE.COM).
Cross-realm trust password (requires input): Password that you want to use for your cross-realm trust.
Instance count (default: 2): The number of instances (core nodes) for the cluster.
EMR applications (default: Hadoop, Spark, Ganglia, Hive): Comma-separated list of applications to install on the cluster.
After you specify the template details, choose Next. On the Options page, choose Next again. On the Review page, select the I acknowledge that AWS CloudFormation might create IAM resources with custom names check box, and then choose Create.
It takes approximately 45 minutes for the deployment to complete. When the stack launch is complete, it returns outputs with information about the resources that were created. Note the outputs and skip to the Managing and testing the solution section. You can view the stack outputs on the AWS Management Console or with the AWS CLI (aws cloudformation describe-stacks).
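If you prefer to retrieve the outputs from a script, a minimal boto3 sketch looks like this; the stack name is a placeholder for whatever you chose when launching the template.

import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")
stack = cfn.describe_stacks(StackName="emr-crossrealm")["Stacks"][0]  # placeholder stack name

for output in stack.get("Outputs", []):
    print(output["OutputKey"] + ": " + output["OutputValue"])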
This section describes how to use AWS CloudFormation templates to perform each step separately in the solution.
Create and configure an Amazon VPC
In order for you to establish a cross-realm trust between an Amazon EMR Kerberos realm and an Active Directory domain, your Amazon VPC must meet the following requirements:
The subnet used for the Amazon EMR cluster must have a CIDR block of fewer than nine digits (for example, 10.0.1.0/24).
Both DNS resolution and DNS hostnames must be enabled (set to “yes”).
The Active Directory domain controller must be the DNS server for instances in the Amazon VPC (this is configured in the next step).
To use the AWS CloudFormation template to create and configure an Amazon VPC with the prerequisites listed previously, choose Launch Stack:
Note: If you want to create the VPC manually (without using AWS CloudFormation), see Set Up the VPC and Subnet in the Amazon EMR documentation.
Launching this stack creates the following AWS resources:
Amazon VPC with CIDR block 10.0.0.0/16 (Name: CrossRealmVPC)
Internet Gateway (Name: CrossRealmGateway)
Public subnet with CIDR block 10.0.1.0/24 (Name: CrossRealmSubnet)
Security group allowing inbound access from the VPC’s subnets (Name tag: CrossRealmSecurityGroup)
When the stack launch is complete, it should return outputs similar to the following.
SubnetID (value example: subnet-xxxxxxxx): The subnet for the Active Directory domain controller and the EMR cluster.
SecurityGroup (value example: sg-xxxxxxxx): The security group for the Active Directory domain controller.
VPCID (value example: vpc-xxxxxxxx): The Active Directory domain controller and EMR cluster will be launched on this VPC.
Note the outputs because they are used in the next step. You can view the stack outputs on the AWS Management Console or with the AWS CLI (aws cloudformation describe-stacks).
Launch and configure an Active Directory domain controller
In this step, you use an AWS CloudFormation template to automatically launch and configure a new Active Directory domain controller and cross-realm trust.
Note: There are various ways to install and configure an Active Directory domain controller. For details on manually launching and installing a domain controller without AWS CloudFormation, see Step 2: Launch and Install the AD Domain Controller in the Amazon EMR documentation.
In addition to launching and configuring an Active Directory domain controller and cross-realm trust, this AWS CloudFormation template also sets the domain controller as the DNS server (name server) for your Amazon VPC. In other words, the template creates a new DHCP option set for the VPC where it is deployed, and it sets the private IP of the domain controller as the name server for that new DHCP option set.
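Roughly, that DHCP change corresponds to EC2 calls along these lines; the domain name, domain controller IP, and VPC ID are placeholders.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create a DHCP option set that points DNS to the domain controller's private IP.
dhcp_options_id = ec2.create_dhcp_options(
    DhcpConfigurations=[
        {"Key": "domain-name", "Values": ["example.com"]},        # placeholder domain
        {"Key": "domain-name-servers", "Values": ["10.0.1.10"]},  # placeholder DC private IP
    ]
)["DhcpOptions"]["DhcpOptionsId"]

# Associate the new option set with the VPC used for this solution.
ec2.associate_dhcp_options(DhcpOptionsId=dhcp_options_id, VpcId="vpc-xxxxxxxx")  # placeholder VPC ID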
IMPORTANT: You should not use this template on a production VPC with existing resources like Amazon EC2 instances. When you launch this stack, make sure that you use the new environment and resources (Amazon VPC, subnet, and security group) that were created in the Create and configure an Amazon VPC step.
To launch this stack, choose Launch Stack:
The following table contains information about the parameters available in this template. Review the parameters for the template and provide values for those that require input.
VPC ID (requires input): Launch the domain controller on this VPC (for example, use the VPC created in the Create and configure an Amazon VPC step).
Subnet ID (requires input): Subnet used for the domain controller (for example, use the subnet created in the Create and configure an Amazon VPC step).
Security group ID (requires input): Security group (SG) for the domain controller (for example, use the SG created in the Create and configure an Amazon VPC step).
Domain Controller name (default: DC1): NetBIOS name of the Active Directory server (up to 15 characters).
Active Directory domain (default: example.com): Fully qualified domain name (FQDN) of the forest root domain (for example, example.com).
Domain NetBIOS name (default: EXAMPLE): NetBIOS name of the domain for users of earlier versions of Windows. This name can be up to 15 characters long.
Domain admin user (default: CrossRealmAdmin): User name for the account that is added as domain administrator. This account is separate from the default administrator account.
Domain admin password (requires input): Password for the domain admin user. Must be at least eight characters including letters, numbers, and symbols.
Key pair name (requires input): Name of an existing EC2 key pair to enable access to the domain controller instance.
Instance type (default: m4.xlarge): Instance type for the domain controller.
EMR Kerberos realm (default: EC2.INTERNAL): Cluster’s Kerberos realm name. By default, the realm name is derived from the cluster’s VPC domain name in uppercase letters (for example, EC2.INTERNAL is the default VPC domain name in the us-east-1 Region).
Cross-realm trust password (requires input): Password that you want to use for your cross-realm trust.
It takes 25–30 minutes for this stack to be created. When it’s complete, note the stack’s outputs, and then move to the next step: Launch an EMR cluster with Kerberos enabled.
Create a security configuration and launch an Amazon EMR cluster with Kerberos enabled
To launch a kerberized Amazon EMR cluster, you first need to create a security configuration containing the cross-realm trust configuration. You then specify cluster-specific Kerberos attributes when launching the cluster.
In this step, you use AWS CloudFormation to launch and configure a kerberized Amazon EMR cluster with a cross-realm trust. If you want to manually launch and configure a cluster with Kerberos enabled, see Step 6: Launch a Kerberized EMR Cluster in the Amazon EMR documentation.
Note: At the time of this writing, AWS CloudFormation does not yet support launching Amazon EMR clusters with Kerberos authentication enabled. To overcome this limitation, I created a template that uses an AWS Lambda-backed custom resource to launch and configure the Amazon EMR cluster with Kerberos enabled. If you use this template, there’s nothing else that you need to do. Just keep in mind that the template creates and invokes an AWS Lambda function (custom resource) to launch the cluster.
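If you would rather launch the kerberized cluster from your own script instead of the Lambda-backed custom resource, the same settings map roughly onto the EMR run_job_flow API, as in the sketch below. The cluster name, release label, security configuration name, subnet, key pair, passwords, and realm values are all placeholders.

import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="kerberized-emr",                               # placeholder cluster name
    ReleaseLabel="emr-5.12.0",                           # placeholder release label
    Applications=[{"Name": a} for a in ["Hadoop", "Spark", "Ganglia", "Hive"]],
    SecurityConfiguration="kerberos-crossrealm-config",  # placeholder security configuration name
    KerberosAttributes={
        "Realm": "EC2.INTERNAL",
        "KdcAdminPassword": "<kdc-admin-password>",
        "CrossRealmTrustPrincipalPassword": "<trust-password>",
        "ADDomainJoinUser": "CrossRealmAdmin",
        "ADDomainJoinPassword": "<joiner-password>",
    },
    Instances={
        "Ec2SubnetId": "subnet-xxxxxxxx",                # placeholder subnet
        "Ec2KeyName": "<key-pair-name>",                 # placeholder key pair
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m4.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m4.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
)
print(response["JobFlowId"])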
To create a cross-realm trust security configuration and launch a kerberized Amazon EMR cluster using AWS CloudFormation, choose Launch Stack:
The following table lists and describes the template parameters for deploying a kerberized Amazon EMR cluster and configuring a cross-realm trust.
Active Directory domain (default: example.com): The Active Directory domain that you want to establish the cross-realm trust with.
Domain admin user (joiner user) (default: CrossRealmAdmin): The user name of an Active Directory domain user with privileges to join domains/computers to the Active Directory domain (joiner user).
Domain admin password (requires input): Password of the joiner user.
Cross-realm trust password (requires input): Password of your cross-realm trust.
EC2 key pair name (requires input): Name of an existing key pair, which enables you to connect securely to your cluster after it launches.
Subnet ID (requires input): Subnet that you want to use for your Amazon EMR cluster (for example, choose the subnet created in the Create and configure an Amazon VPC step).
Security group ID (requires input): Security group that you want to use for your Amazon EMR cluster (for example, choose the security group created in the Create and configure an Amazon VPC step).
Instance type (default: m4.xlarge): The instance type that you want to use for the cluster nodes.
Instance count (default: 2): The number of instances (core nodes) for the cluster.
Allowed IP address (default: 10.0.0.0/16): The client IP address range that can reach your cluster. Specify an IP address range in CIDR notation (for example, 203.0.113.5/32). By default, only the VPC CIDR (10.0.0.0/16) can reach the cluster. Be sure to add your client IP range so that you can connect to the cluster using SSH.
EMR applications (default: Hadoop, Spark, Ganglia, Hive): Comma-separated list of the applications that you want installed on the cluster.
EMR Kerberos realm (default: EC2.INTERNAL): Cluster’s Kerberos realm name. By default, the realm name is derived from the cluster’s VPC domain name in uppercase letters (for example, EC2.INTERNAL is the default VPC domain name in the us-east-1 Region).
Trusted AD domain (default: EXAMPLE.COM): The Active Directory domain that you want to trust. This name is the same as the “Active Directory domain.” However, it must use all uppercase letters (for example, EXAMPLE.COM).
It takes 10–15 minutes for this stack to be created. When it’s complete, note the stack’s outputs, and then move to the next section: Managing and testing the solution.
Managing and testing the solution
Now that you’ve configured and built the solution, it’s time to test it by connecting to a cluster using Active Directory credentials.
SSH to a cluster using Active Directory credentials (single sign-on)
After you launch a kerberized Amazon EMR cluster, if you used the AWS CloudFormation templates and added your client IP range to the Allowed IP address parameter, you should be able to connect to the cluster using an SSH client and your Active Directory user credentials. If you have trouble connecting to the cluster using SSH, check the cluster’s security group to make sure that it allows inbound SSH connection (TCP port 22) from your client’s IP address (source).
The following steps assume that you’re using a client such as OpenSSH. If you’re using a different SSH application (for example, PuTTY), consult the application-specific documentation.
Note: Because the cluster was launched with a cross-realm trust configuration, you don’t need to use a private key (.pem file) when you connect to it as a domain user using SSH.
To connect to your Amazon EMR cluster as an Active Directory user using SSH, run the following command. Replace ad_user with the domain admin user that you created while setting up the domain controller and replace master_node_URL with the cluster’s URL (see the stack’s outputs to find this information):
$ ssh -l <ad_user> <master_node_URL>
If your SSH client is configured to use a key as the preferred authentication method, the login might fail. If that’s the case, you can add options such as -o PreferredAuthentications=password -o PubkeyAuthentication=no to your SSH command to force the connection to use password authentication.
After a domain user connects to the cluster using SSH for the first time, a local home directory is created for that user. In addition, if you used the create-hdfs-home-ba.sh bootstrap action when launching the cluster (done by default if you used the AWS CloudFormation template to launch a kerberized cluster), an HDFS user home directory is also created automatically.
Note: If you manually launched the cluster and did not use the create-hdfs-home-ba.sh bootstrap action, then you’ll need to manually create HDFS user home directories for your users.
When you connect to the cluster using SSH for the first time (as a domain user), you should see the following messages if the HDFS home directory for your domain user was successfully created:
Running jobs on a kerberized Amazon EMR cluster
To run a job on a kerberized cluster, the user submitting the job must first be authenticated. If you followed the previous section to connect to your cluster as an Active Directory user using SSH, the user should be authenticated automatically.
If running the klist command returns a “No credentials cache found” message, it means that the user is not authenticated (the user doesn’t have a Kerberos ticket). You can re-authenticate a user at any time by running the following command (be sure to use all uppercase letters for the Active Directory domain):
$ kinit <username>@<AD_DOMAIN>
When the user is authenticated, they can submit jobs just like they would on a non-kerberized cluster.
Auditing jobs
Another advantage that Kerberos can provide is that you can easily tell which user ran a particular job. For example, connect (using SSH) to a kerberized cluster with an Active Directory user, and submit the SparkPi sample application:
$ spark-example SparkPi
After running the SparkPi application, go to the Amazon EMR console and choose your cluster. Then choose the Application history tab. There you can see information about the application, including the user that submitted the job:
Common issues
Although it would be hard to cover every possible Kerberos issue, this section covers some of the more common issues that might occur and ways to fix them.
Issue 1: You can successfully connect and get authenticated on a cluster. However, whenever you try running a job, it fails with an error similar to the following:
Solution: Make sure that an HDFS home directory for the user was created and that it has the right permissions.
Issue 2: You can successfully connect to the cluster, but you can’t run any Hadoop or HDFS commands.
Solution: Use the klist command to confirm whether the user is authenticated and has a valid Kerberos ticket. Use the kinit command to re-authenticate a user.
Issue 3: You can’t connect (using SSH) to the cluster using Active Directory user credentials, but you can manually authenticate the user with kinit.
Solution: Make sure that the Active Directory domain controller is the DNS server (name server) for the cluster nodes.
Cleaning up
After completing and testing this solution, remember to clean up the resources. If you used the AWS CloudFormation templates to create the resources, then use the AWS CloudFormation console or AWS CLI/SDK to delete the stacks. Deleting a stack also deletes the resources created by that stack.
If one of your stacks does not delete, make sure that there are no dependencies on the resources created by that stack. For example, if you deployed an Amazon VPC using AWS CloudFormation and then deployed a domain controller into that VPC using a different AWS CloudFormation stack, you must first delete the domain controller stack before the VPC stack can be deleted.
Summary
The ability to authenticate users and services with Kerberos not only allows you to secure your big data applications, but it also enables you to easily integrate Amazon EMR clusters with an Active Directory environment. This post showed how you can use Kerberos on Amazon EMR to create a single sign-on solution where Active Directory domain users can seamlessly access Amazon EMR clusters and run big data applications. We also showed how you can use AWS CloudFormation to automate the deployment of this solution.
AWS Glue has a transform called Relationalize that simplifies the extract, transform, load (ETL) process by converting nested JSON into columns that you can easily import into relational databases. Relationalize transforms the nested JSON into key-value pairs at the outermost level of the JSON document. The transformed data maintains a list of the original keys from the nested JSON separated by periods.
Let’s look at how Relationalize can help you with a sample use case.
An example of Relationalize in action
Suppose that the developers of a video game want to use a data warehouse like Amazon Redshift to run reports on player behavior based on data that is stored in JSON. Sample 1 shows example user data from the game. The player named “user1” has characteristics such as race, class, and location in nested JSON data. Further down, the player’s arsenal information includes additional nested JSON data. If the developers want to ETL this data into their data warehouse, they might have to resort to nested loops or recursive functions in their code.
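To make the structure concrete, a single player record might look roughly like the following. The nesting is reconstructed from the flattened column names that appear later in this post, and all values are placeholders.

# Approximate shape of one player record, inferred from the flattened column
# names (characteristics_race, kinetic_name, armor_head, and so on).
player_record = {
    "username": "user1",
    "characteristics": {
        "race": "human",
        "class": "Hunter",
        "subclass": "placeholder",
        "power": 300,
        "playercountry": "US",
    },
    # Arsenal slots: each one is another nested object.
    "kinetic": {"name": "placeholder", "type": "auto rifle", "power": 300, "element": "kinetic"},
    "energy": {"name": "placeholder", "type": "sidearm", "power": 290, "element": "solar"},
    "power": {"name": "placeholder", "type": "sword", "power": 295, "element": "arc"},
    "armor": {"head": "placeholder", "arms": "placeholder", "chest": "placeholder",
              "leg": "placeholder", "classItem": "placeholder"},
    "map": "placeholder",
    "waypoint": "placeholder",
}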
After Relationalize flattens the nested structures, you can write the data to a database or to a data warehouse. You can also write it to delimited text files, such as in comma-separated value (CSV) format, or columnar file formats such as Optimized Row Columnar (ORC) format. You can use either of these format types for long-term storage in Amazon S3. Storing the transformed files in S3 provides the additional benefit of being able to query this data using Amazon Athena or Amazon Redshift Spectrum. You can further extend the usefulness of the data by performing joins between data stored in S3 and the data stored in an Amazon Redshift data warehouse.
Before we get started…
In my example, I took two preparatory steps that save some time in your ETL code development:
I stored my data in an Amazon S3 bucket and used an AWS Glue crawler to make my data available in the AWS Glue data catalog. You can find instructions on how to do that in Cataloging Tables with a Crawler in the AWS Glue documentation. The AWS Glue database name I used was “blog,” and the table name was “players.” You can see these values in use in the sample code that follows.
I deployed a Zeppelin notebook using the automated deployment available within AWS Glue. If you already used an AWS Glue development endpoint to deploy a Zeppelin notebook, you can skip the deployment instructions. Otherwise, let’s quickly review how to deploy Zeppelin.
Deploying a Zeppelin notebook with AWS Glue
The following steps are outlined in the AWS Glue documentation, and I include a few screenshots here for clarity.
Next, in the AWS Glue Management Console, choose Dev endpoints, and then choose Add endpoint.
Specify a name for the endpoint and the AWS Glue IAM role that you created.
On the networking screen, choose Skip Networking because our code only communicates with S3.
Complete the development endpoint process by providing a Secure Shell (SSH) public key and confirming your settings.
When your new development endpoint’s Provisioning status changes from PROVISIONING to READY, choose your endpoint, and then for Actions choose Create notebook server.
Enter the notebook server details, including the role you previously created and a security group with inbound access allowed on TCP port 443.
Doing this automatically launches an AWS CloudFormation template. The output specifies the URL that you can use to access your Zeppelin notebook with the username and password you specified in the wizard.
How do we flatten nested JSON?
With my data loaded and my notebook server ready, I accessed Zeppelin, created a new note, and set my interpreter to spark. I used some Python code that AWS Glue previously generated for another job that outputs to ORC. Then I added the Relationalize transform. You can see the resulting Python code in Sample 3.
Sample 3: Python code to transform the nested JSON and output it to ORC
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
#from awsglue.transforms import Relationalize
# Begin variables to customize with your information
glue_source_database = "blog"
glue_source_table = "players"
glue_temp_storage = "s3://blog-example-edz/temp"
glue_relationalize_output_s3_path = "s3://blog-example-edz/output-flat"
dfc_root_table_name = "root" #default value is "roottable"
# End variables to customize with your information
glueContext = GlueContext(spark.sparkContext)
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = glue_source_database, table_name = glue_source_table, transformation_ctx = "datasource0")
dfc = Relationalize.apply(frame = datasource0, staging_path = glue_temp_storage, name = dfc_root_table_name, transformation_ctx = "dfc")
blogdata = dfc.select(dfc_root_table_name)
blogdataoutput = glueContext.write_dynamic_frame.from_options(frame = blogdata, connection_type = "s3", connection_options = {"path": glue_relationalize_output_s3_path}, format = "orc", transformation_ctx = "blogdataoutput")
What exactly is going on in this script?
After the import statements, we instantiate a GlueContext object, which allows us to work with the data in AWS Glue. Next, we create a DynamicFrame (datasource0) from the “players” table in the AWS Glue “blog” database. We use this DynamicFrame to perform any necessary operations on the data structure before it’s written to our desired output format. The source files remain unchanged.
We then run the Relationalize transform (Relationalize.apply()) with our datasource0 as one of the parameters. Another important parameter is the name parameter, which is a key that identifies our data after the transformation completes.
The Relationalize.apply() method returns a DynamicFrameCollection, and this is stored in the dfc variable. Before we can write our data to S3, we need to select the DynamicFrame from the DynamicFrameCollection object. We do this with the dfc.select() method. The correct DynamicFrame is stored in the blogdata variable.
You might be curious why a DynamicFrameCollection was returned when we started with a single DynamicFrame. This return value comes from the way Relationalize treats arrays in the JSON document: A DynamicFrame is created for each array. Together with the root data structure, each generated DynamicFrame is added to a DynamicFrameCollection when Relationalize completes its work. Although we didn’t have any arrays in our data, it’s good to keep this in mind. Finally, we output (blogdataoutput) the root DynamicFrame to ORC files in S3.
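If your JSON does contain arrays, you can write every frame in the collection, not just the root. A minimal sketch, reusing the variables from the script above (the output prefix layout is an assumption):

# Write every DynamicFrame produced by Relationalize, including the frames
# generated for nested arrays, to its own S3 prefix named after the frame.
for frame_name in dfc.keys():
    frame = dfc.select(frame_name)
    glueContext.write_dynamic_frame.from_options(
        frame=frame,
        connection_type="s3",
        connection_options={"path": glue_relationalize_output_s3_path + "/" + frame_name},
        format="orc",
        transformation_ctx="write_" + frame_name,
    )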
Using the transformed data
One of the use cases we discussed earlier was using Amazon Athena or Amazon Redshift Spectrum to query the ORC files.
I used the following SQL DDL statements to create external tables in both services to enable queries of my data stored in Amazon S3.
-- Create a Schema
-- A single schema can be used with multiple external tables.
-- This step is only required once for the external tables you create.
create external schema spectrum
from data catalog
database 'blog'
iam_role 'arn:aws:iam::0123456789:role/redshift-role'
create external database if not exists;
-- Create an external table in the schema
create external table spectrum.blog(
username VARCHAR,
characteristics_race VARCHAR,
characteristics_class VARCHAR,
characteristics_subclass VARCHAR,
characteristics_power INTEGER,
characteristics_playercountry VARCHAR,
kinetic_name VARCHAR,
kinetic_type VARCHAR,
kinetic_power INTEGER,
kinetic_element VARCHAR,
energy_name VARCHAR,
energy_type VARCHAR,
energy_power INTEGER,
energy_element VARCHAR,
power_name VARCHAR,
power_type VARCHAR,
power_power INTEGER,
power_element VARCHAR,
armor_head VARCHAR,
armor_arms VARCHAR,
armor_chest VARCHAR,
armor_leg VARCHAR,
armor_classItem VARCHAR,
map VARCHAR,
waypoint VARCHAR)
stored as orc
location 's3://blog-example-edz/output-flat';
I even ran a query, shown in Sample 6, that joined my Redshift Spectrum table (spectrum.playerdata) with data in an Amazon Redshift table (public.raids) to generate advanced reports. In the where clause, I join the two tables based on the username values that are common to both data sources.
Sample 6: Select statement with a join of Redshift Spectrum data with Amazon Redshift data
-- Get Total Raid Completions for the Hunter Class.
select spectrum.playerdata.characteristics_class as class, sum(public.raids."completions.val.raids.leviathan") as "Total Hunter Leviathan Raid Completions" from spectrum.playerdata, public.raids
where spectrum.playerdata.username = public.raids."completions.val.username"
and spectrum.playerdata.characteristics_class = 'Hunter'
group by spectrum.playerdata.characteristics_class;
Summary
This post demonstrated how simple it can be to flatten nested JSON data with AWS Glue, using the Relationalize transform to automate the conversion of nested JSON. AWS Glue also automates the deployment of Zeppelin notebooks that you can use to develop your Python automation script. Finally, AWS Glue can output the transformed data directly to a relational database, or to files in Amazon S3 for further analysis with tools such as Amazon Athena and Amazon Redshift Spectrum.
As great as Relationalize is, it’s not the only transform available with AWS Glue. You can see a complete list of available transforms in Built-In Transforms in the AWS Glue documentation. Try them out today!
Trevor Roberts Jr is a Solutions Architect with AWS. He provides architectural guidance to help customers achieve success in the cloud. In his spare time, Trevor enjoys traveling to new places and spending time with family.
Genomics analysis has taken off in recent years as organizations continue to adopt the cloud for its elasticity, durability, and cost. With the AWS Cloud, customers have a number of performant options to choose from. These options include AWS Batch in conjunction with AWS Lambda and AWS Step Functions; AWS Glue, a serverless extract, transform, and load (ETL) service; and of course, the AWS big data and machine learning workhorse Amazon EMR.
For this task, we use Hail, an open source framework for exploring and analyzing genomic data that uses the Apache Spark framework. In this post, we use Amazon EMR to run Hail. We walk through the setup, configuration, and data processing. Finally, we generate an Apache Parquet–formatted variant dataset and explore it using Amazon Athena.
Compiling Hail
Because Hail is still under active development, you must compile it before you can start using it. To help simplify the process, you can launch the following AWS CloudFormation template that creates an EMR cluster, compiles Hail, and installs a Jupyter Notebook so that you’re ready to go with Hail.
There are a few things to note about the AWS CloudFormation template. You must provide a password for the Jupyter Notebook. Also, you must provide a virtual private cloud (VPC) to launch Amazon EMR in, and make sure that you select a subnet from within that VPC. Next, update the cluster resources to fit your needs. Lastly, the HailBuildOutputS3Path parameter should be an Amazon S3 bucket/prefix, where you should save the compiled Hail binaries for later use. Leave the Hail and Spark versions as is, unless you’re comfortable experimenting with more recent versions.
When you’ve completed these steps, the following files are saved locally on the cluster to be used when running the Apache Spark Python API (PySpark) shell.
The files are also copied to the Amazon S3 location defined by the AWS CloudFormation template so that you can include them when running jobs using the Amazon EMR Step API.
Collecting genome data
To get started with Hail, use the 1000 Genome Project dataset available on AWS. The data that you will use is located at s3://1000genomes/release/20130502/.
For Hail to process these files efficiently, they need to be block compressed. In many cases, files that use gzip compression are already compressed in blocks, so you don’t need to recompress; you can just rename the file extension from “.gz” to “.bgz”. Hail can process .gz files, but it’s much slower and not recommended. The simplest way to accomplish this is to copy the data files from the public S3 bucket to your own bucket and rename them as you copy.
The following is the Bash command line to copy the first five genome Variant Call Format (VCF) files and rename them appropriately using the AWS CLI.
for i in $(seq 5); do aws s3 cp s3://1000genomes/release/20130502/ALL.chr$i.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz s3://your_bucket/prefix/ALL.chr$i.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.bgz; done
Now that you have some data files containing variants in the Variant Call Format, you need to get the sample annotations that go along with them. These annotations provide more information about each sample, such as the population they are a part of.
In this section, you use the data collected in the previous section to explore genome variations interactively using a Jupyter Notebook. You then create a simple ETL job to convert these variations into Parquet format. Finally, you query it using Amazon Athena.
Let’s open the Jupyter notebook. To start, sign in to the AWS Management Console, and open the AWS CloudFormation console. Choose the stack that you created, and then choose the Outputs tab. There you see the JupyterURL. Open this URL in your browser.
Go ahead and download the Jupyter Notebook that is provided to your local machine. Log in to Jupyter with the password that you provided during stack creation. Choose Upload on the right side, and then choose the notebook from your local machine.
After the notebook is uploaded, choose it from the list on the left to open it.
Select the first cell, update the S3 bucket location to point to the bucket where you saved the compiled Hail libraries, and then choose Run. This code imports the Hail modules that you compiled at the beginning. When the cell is executing, you will see In [*]. When the process is complete, the asterisk (*) is replaced by a number, for example, In [1].
Next, run the subsequent two cells, which import the Hail module into PySpark and initiate the Hail context.
The next cell imports a single VCF file from the bucket where you saved your data in the previous section. If you change the Amazon S3 path to not include a file name, it imports all the VCF files in that directory. Depending on your cluster size, it might take a few minutes.
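If you are rebuilding these cells by hand, they look roughly like the following sketch. It assumes the Hail 0.1 API that was current when this post was written, along with a hypothetical bucket name; the notebook provided with the CloudFormation stack is the authoritative version.

from hail import HailContext

# Initiate the Hail context on top of the PySpark context already running in the notebook.
hc = HailContext()

# Import a single block-compressed VCF from the bucket populated in the previous section.
# Dropping the file name from the path would import every VCF under that prefix instead.
vds = hc.import_vcf('s3://your_bucket/prefix/'
                    'ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.bgz')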
Remember that in the previous section, you also copied an annotation file. Now you use it to annotate the VCF files that you’ve loaded with Hail. Execute the next cell—as a shortcut, you can select the cell and press Shift+Enter.
The import_table API takes a path to the annotation file in TSV (tab-separated values) format and a parameter named impute that attempts to infer the schema of the file, as shown in the output below the cell.
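The cell itself looks roughly like the following sketch. The annotation file name, the sample key column, and the annotation root are assumptions for illustration, not necessarily the exact ones used in the notebook.

# Import the sample annotations as a key table, letting Hail infer the column types.
annotations = hc.import_table('s3://your_bucket/prefix/sample_annotations.tsv',
                              impute=True).key_by('Sample')

# Attach the annotations to the loaded variant dataset under the sample annotation schema.
vds = vds.annotate_samples_table(annotations, root='sa.pheno')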
At this point, you can interactively explore the data. For example, you can count the number of samples you have and group them by population.
You can also calculate the standard quality control (QC) metrics on your variants and samples.
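In Hail 0.1 those QC annotations come from two chained calls, roughly as in the following sketch; the results land under sa.qc for samples and va.qc for variants.

# Compute per-sample and per-variant quality control metrics.
vds = vds.sample_qc().variant_qc()

# A quick sanity check on the number of samples and variants.
print(vds.count())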
What if you want to query this data outside of Hail and Spark, for example, using Amazon Athena? To start, you need to change the column names to lowercase because Athena currently supports only lowercase names. To do that, use the two functions provided in the notebook and call them on your variant dataset (VDS), as shown in the following image. Note that you’re only changing the case of the variants and samples schemas. If you’ve further augmented your VDS, you might need to modify the lowercase functions to do the same for those schemas.
In the current version of Hail, the sample annotations are not stored in the exported Parquet VDS, so you need to save them separately. As noted by the Hail maintainers, in future versions, the data represented by the VDS Parquet output will change, and it is recommended that you also export the variant annotations. So let’s do that.
Note that both of these lines are similar in that they export a table representation of the sample and variant annotations, convert them to a Spark DataFrame, and finally save them to Amazon S3 in Parquet file format.
Finally, it is beneficial to save the VDS file back to Amazon S3 so that next time you need to analyze your data, you can load it without having to start from the raw VCF. Note that when Hail saves your data, it requires a path and a file name.
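A sketch of those export cells, assuming hypothetical output paths under your own bucket, looks like the following.

# Export the sample and variant annotations as Parquet by way of Spark DataFrames.
vds.samples_table().to_dataframe().write.parquet(
    's3://your_output_bucket/hail_data/sample_annotations/')
vds.variants_table().to_dataframe().write.parquet(
    's3://your_output_bucket/hail_data/variant_annotations/')

# Save the VDS itself so that future sessions can skip the raw VCF import.
vds.write('s3://your_output_bucket/hail_data/genotypes.vds')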
After you run these cells, expect it to take some time as it writes out the data.
Discovering table metadata
Before you can query your data, you need to tell Athena the schema of your data. You have a couple of options. The first is to use AWS Glue to crawl the S3 bucket, infer the schema from the data, and create the appropriate table. Before proceeding, you might need to migrate your Athena database to use the AWS Glue Data Catalog.
Creating tables in AWS Glue
To use the AWS Glue crawler, open the AWS Glue console and choose Crawlers in the left navigation pane.
Then choose Add crawler to create a new crawler.
Next, give your crawler a name and assign the appropriate IAM role. Leave Amazon S3 as the data source, and select the S3 bucket where you saved the data and the sample annotations. When you set the crawler’s Include path, be sure to include the entire path, for example: s3://output_bucket/hail_data/sample_annotations/
Under the Exclusion Paths, type _SUCCESS, so that you don’t crawl that particular file.
Continue forward with the default settings until you are asked if you want to add another source. Choose Yes, and add the Amazon S3 path to the variant annotations (for example, s3://your_output_bucket/hail_data/variant_annotations/) so that the crawler can build your variant annotation table. Give it an existing database name, or create a new one.
Provide a table prefix and choose Next. Then choose Finish. At this point, assuming that the data is finished writing, you can go ahead and run the crawler. When it finishes, you have two new tables in the database you created that should look something like the following:
You can explore the schema of these tables by choosing their name and then choosing Edit Schema on the right side of the table view; for example:
Creating tables in Amazon Athena
If you cannot or do not want to use AWS Glue crawlers, you can add the tables via the Athena console by typing the following statements:
In the Amazon Athena console, choose the database in which your tables were created. In this case, it looks something like the following:
To verify that you have data, choose the three dots on the right, and then choose Preview table.
Indeed, you can see some data.
You can further explore the sample and variant annotations along with the calculated QC metrics that you calculated previously using Hail.
Summary
To summarize, this post demonstrated the ease with which you can install, configure, and use Hail, an open source, highly scalable framework for exploring and analyzing genomics data, on Amazon EMR. We demonstrated setting up a Jupyter Notebook to make our exploration easy. We also used the power of Hail to calculate quality control metrics for variants and samples. We exported them to Amazon S3 and allowed a broader range of users and analysts to explore them on demand in a serverless environment using Amazon Athena.
Roy Hasson is a Global Business Development Manager for AWS Analytics. He works with customers around the globe to design solutions to meet their data processing, analytics, and business intelligence needs. Roy is a big Manchester United fan who enjoys cheering his team on and hanging out with his family.
We can’t believe that there are just a few days left before re:Invent 2017. If you are attending this year, you’ll want to check out our Big Data sessions! The Big Data and Machine Learning categories are bigger than ever. As in previous years, you can find these sessions in various tracks, including Analytics & Big Data, Deep Learning Summit, Artificial Intelligence & Machine Learning, Architecture, and Databases.
We have great sessions from organizations and companies like Vanguard, Cox Automotive, Pinterest, Netflix, FINRA, Amtrak, AmazonFresh, Sysco Foods, Twilio, American Heart Association, Expedia, Esri, Nextdoor, and many more. All sessions are recorded and made available on YouTube. In addition, all slide decks from the sessions will be available on SlideShare.net after the conference.
This post highlights the sessions that will be presented as part of the Analytics & Big Data track, as well as relevant sessions from other tracks like Architecture, Artificial Intelligence & Machine Learning, and IoT. If you’re interested in Machine Learning sessions, don’t forget to check out our Guide to Machine Learning at re:Invent 2017.
This year’s session catalog contains the following breakout sessions.
Raju Gulabani, VP of Database, Analytics, and AI at AWS, will discuss the evolution of database and analytics services in AWS, the new database and analytics services and features we launched this year, and our vision for continued innovation in this space. We are witnessing an unprecedented growth in the amount of data collected, in many different forms. Storage, management, and analysis of this data require database services that scale and perform in ways not possible before. AWS offers a collection of database and other data services—including Amazon Aurora, Amazon DynamoDB, Amazon RDS, Amazon Redshift, Amazon ElastiCache, Amazon Kinesis, and Amazon EMR—to process, store, manage, and analyze data. In this session, we provide an overview of AWS database and analytics services and discuss how customers are using these services today.
Deep dive customer use cases
ABD401 – How Netflix Monitors Applications in Near Real-Time with Amazon Kinesis Thousands of services work in concert to deliver millions of hours of video streams to Netflix customers every day. These applications vary in size, function, and technology, but they all make use of the Netflix network to communicate. Understanding the interactions between these services is a daunting challenge both because of the sheer volume of traffic and the dynamic nature of deployments. In this session, we first discuss why Netflix chose Kinesis Streams to address these challenges at scale. We then dive deep into how Netflix uses Kinesis Streams to enrich network traffic logs and identify usage patterns in real time. Lastly, we cover how Netflix uses this system to build comprehensive dependency maps, increase network efficiency, and improve failure resiliency. From this session, you will learn how to build a real-time application monitoring system using network traffic logs and get real-time, actionable insights.
In this session, learn how Nextdoor replaced their home-grown data pipeline based on a topology of Flume nodes with a completely serverless architecture based on Kinesis and Lambda. By making these changes, they improved both the reliability of their data and the delivery times of billions of records of data to their Amazon S3–based data lake and Amazon Redshift cluster. Nextdoor is a private social networking service for neighborhoods.
ABD205 – Taking a Page Out of Ivy Tech’s Book: Using Data for Student Success Data speaks. Discover how Ivy Tech, the nation’s largest singly accredited community college, uses AWS to gather, analyze, and take action on student behavioral data for the betterment of over 3,100 students. This session outlines the process from inception to implementation across the state of Indiana and highlights how Ivy Tech’s model can be applied to your own complex business problems.
ABD207 – Leveraging AWS to Fight Financial Crime and Protect National Security Banks aren’t known to share data and collaborate with one another. But that is exactly what the Mid-Sized Bank Coalition of America (MBCA) is doing to fight digital financial crime—and protect national security. Using the AWS Cloud, the MBCA developed a shared data analytics utility that processes terabytes of non-competitive customer account, transaction, and government risk data. The intelligence produced from the data helps banks increase the efficiency of their operations, cut labor and operating costs, and reduce false positive volumes. The collective intelligence also allows greater enforcement of Anti-Money Laundering (AML) regulations by helping members detect internal risks—and identify the challenges to detecting these risks in the first place. This session demonstrates how the AWS Cloud supports the MBCA to deliver advanced data analytics, provide consistent operating models across financial institutions, reduce costs, and strengthen national security.
ABD208 – Cox Automotive Empowered to Scale with Splunk Cloud & AWS and Explores New Innovation with Amazon Kinesis Firehose In this session, learn how Cox Automotive is using Splunk Cloud for real time visibility into its AWS and hybrid environments to achieve near instantaneous MTTI, reduce auction incidents by 90%, and proactively predict outages. We also introduce a highly anticipated capability that allows you to ingest, transform, and analyze data in real time using Splunk and Amazon Kinesis Firehose to gain valuable insights from your cloud resources. It’s now quicker and easier than ever to gain access to analytics-driven infrastructure monitoring using Splunk Enterprise & Splunk Cloud.
ABD209 – Accelerating the Speed of Innovation with a Data Sciences Data & Analytics Hub at Takeda Historically, silos of data, analytics, and processes across functions, stages of development, and geography created a barrier to R&D efficiency. Gathering the right data necessary for decision-making was challenging due to issues of accessibility, trust, and timeliness. In this session, learn how Takeda is undergoing a transformation in R&D to increase the speed-to-market of high-impact therapies to improve patient lives. The Data and Analytics Hub was built, with Deloitte, to address these issues and support the efficient generation of data insights for functions such as clinical operations, clinical development, medical affairs, portfolio management, and R&D finance. In the AWS hosted data lake, this data is processed, integrated, and made available to business end users through data visualization interfaces, and to data scientists through direct connectivity. Learn how Takeda has achieved significant time reductions—from weeks to minutes—to gather and provision data that has the potential to reduce cycle times in drug development. The hub also enables more efficient operations and alignment to achieve product goals through cross functional team accountability and collaboration due to the ability to access the same cross domain data.
ABD210 – Modernizing Amtrak: Serverless Solution for Real-Time Data Capabilities As the nation’s only high-speed intercity passenger rail provider, Amtrak needs to know critical information to run its business, such as: Who’s onboard any train at any time? How are booking and revenue trending? Amtrak was faced with unpredictable and often slow response times from existing databases, ranging from seconds to hours; existing booking and revenue dashboards were spreadsheet-based and manual; multiple copies of data were stored in different repositories, lacking integration and consistency; and operations and maintenance (O&M) costs were relatively high. Join us as we demonstrate how Deloitte and Amtrak successfully went live with a cloud-native operational database and analytical datamart for near-real-time reporting in under six months. We highlight the specific challenges and the modernization of architecture on an AWS native Platform as a Service (PaaS) solution. The solution includes cloud-native components such as AWS Lambda for microservices, Amazon Kinesis and AWS Data Pipeline for moving data, Amazon S3 for storage, Amazon DynamoDB for a managed NoSQL database service, and Amazon Redshift for near-real time reports and dashboards. Deloitte’s solution enabled “at scale” processing of 1 million transactions/day and up to 2K transactions/minute. It provided flexibility and scalability, largely eliminated the need for system management, and dramatically reduced operating costs. Moreover, it laid the groundwork for decommissioning legacy systems, anticipated to save at least $1M over 3 years.
ABD211 – Sysco Foods: A Journey from Too Much Data to Curated Insights In this session, we detail Sysco’s journey from a company focused on hindsight-based reporting to one focused on insights and foresight. For this shift, Sysco moved from multiple data warehouses to an AWS ecosystem, including Amazon Redshift, Amazon EMR, AWS Data Pipeline, and more. As the team at Sysco worked with Tableau, they gained agile insight across their business. Learn how Sysco decided to use AWS, how they scaled, and how they became more strategic with the AWS ecosystem and Tableau.
ABD217 – From Batch to Streaming: How Amazon Flex Uses Real-time Analytics to Deliver Packages on Time Reducing the time to get actionable insights from data is important to all businesses, and customers who employ batch data analytics tools are exploring the benefits of streaming analytics. Learn best practices to extend your architecture from data warehouses and databases to real-time solutions. Learn how to use Amazon Kinesis to get real-time data insights and integrate them with Amazon Aurora, Amazon RDS, Amazon Redshift, and Amazon S3. The Amazon Flex team describes how they used streaming analytics in their Amazon Flex mobile app, used by Amazon delivery drivers to deliver millions of packages each month on time. They discuss the architecture that enabled the move from a batch processing system to a real-time system, overcoming the challenges of migrating existing batch data to streaming data, and how to benefit from real-time analytics.
ABD218 – How EuroLeague Basketball Uses IoT Analytics to Engage Fans IoT and big data have made their way out of industrial applications, general automation, and consumer goods, and are now a valuable tool for improving consumer engagement across a number of industries, including media, entertainment, and sports. The low cost and ease of implementation of AWS analytics services and AWS IoT have allowed AGT, a leader in IoT, to develop their IoTA analytics platform. Using IoTA, AGT brought a tailored solution to EuroLeague Basketball for real-time content production and fan engagement during the 2017-18 season. In this session, we take a deep dive into how this solution is architected for secure, scalable, and highly performant data collection from athletes, coaches, and fans. We also talk about how the data is transformed into insights and integrated into a content generation pipeline. Lastly, we demonstrate how this solution can be easily adapted for other industries and applications.
ABD222 – How to Confidently Unleash Data to Meet the Needs of Your Entire Organization Where are you on the spectrum of IT leaders? Are you confident that you’re providing the technology and solutions that consistently meet or exceed the needs of your internal customers? Do your peers at the executive table see you as an innovative technology leader? Innovative IT leaders understand the value of getting data and analytics directly into the hands of decision makers, and into their own. In this session, Daren Thayne, Domo’s Chief Technology Officer, shares how innovative IT leaders are helping drive a culture change at their organizations. See how transformative it can be to have real-time access to all of the data that is relevant to YOUR job (including a complete view of your entire AWS environment), as well as understand how it can help you lead the way in applying that same pattern throughout your entire company.
ABD303 – Developing an Insights Platform – Sysco’s Journey from Disparate Systems to Data Lake and Beyond Sysco has nearly 200 operating companies across its multiple lines of business throughout the United States, Canada, Central/South America, and Europe. As the global leader in food services, Sysco identified the need to streamline the collection, transformation, and presentation of data produced by the distributed units and systems, into a central data ecosystem. Sysco’s Business Intelligence and Analytics team addressed these requirements by creating a data lake with scalable analytics and query engines leveraging AWS services. In this session, Sysco will outline their journey from a hindsight reporting focused company to an insights driven organization. They will cover solution architecture, challenges, and lessons learned from deploying a self-service insights platform. They will also walk through the design patterns they used and how they designed the solution to provide predictive analytics using Amazon Redshift Spectrum, Amazon S3, Amazon EMR, AWS Glue, Amazon Elasticsearch Service and other AWS services.
ABD309 – How Twilio Scaled Its Data-Driven Culture As a leading cloud communications platform, Twilio has always been strongly data-driven. But as headcount and data volumes grew—and grew quickly—they faced many new challenges. One-off, static reports work when you’re a small startup, but how do you support a growth stage company to a successful IPO and beyond? Today, Twilio’s data team relies on AWS and Looker to provide data access to 700 colleagues. Departments have the data they need to make decisions, and cloud-based scale means they get answers fast. Data delivers real-business value at Twilio, providing a 360-degree view of their customer, product, and business. In this session, you hear firsthand stories directly from the Twilio data team and learn real-world tips for fostering a truly data-driven culture at scale.
ABD310 – How FINRA Secures Its Big Data and Data Science Platform on AWS FINRA uses big data and data science technologies to detect fraud, market manipulation, and insider trading across US capital markets. As a financial regulator, FINRA analyzes highly sensitive data, so information security is critical. Learn how FINRA secures its Amazon S3 Data Lake and its data science platform on Amazon EMR and Amazon Redshift, while empowering data scientists with tools they need to be effective. In addition, FINRA shares AWS security best practices, covering topics such as AMI updates, micro segmentation, encryption, key management, logging, identity and access management, and compliance.
ABD331 – Log Analytics at Expedia Using Amazon Elasticsearch Service Expedia uses Amazon Elasticsearch Service (Amazon ES) for a variety of mission-critical use cases, ranging from log aggregation to application monitoring and pricing optimization. In this session, the Expedia team reviews how they use Amazon ES and Kibana to analyze and visualize Docker startup logs, AWS CloudTrail data, and application metrics. They share best practices for architecting a scalable, secure log analytics solution using Amazon ES, so you can add new data sources almost effortlessly and get insights quickly
ABD316 – American Heart Association: Finding Cures to Heart Disease Through the Power of Technology Combining disparate datasets and making them accessible to data scientists and researchers is a prevalent challenge for many organizations, not just in healthcare research. American Heart Association (AHA) has built a data science platform using Amazon EMR, Amazon Elasticsearch Service, and other AWS services, that corrals multiple datasets and enables advanced research on phenotype and genotype datasets, aimed at curing heart diseases. In this session, we present how AHA built this platform and the key challenges they addressed with the solution. We also provide a demo of the platform, and leave you with suggestions and next steps so you can build similar solutions for your use cases
ABD319 – Tooling Up for Efficiency: DIY Solutions @ Netflix At Netflix, we have traditionally approached cloud efficiency from a human standpoint, whether it be in-person meetings with the largest service teams or manually flipping reservations. Over time, we realized that these manual processes are not scalable as the business continues to grow. Therefore, in the past year, we have focused on building out tools that allow us to make more insightful, data-driven decisions around capacity and efficiency. In this session, we discuss the DIY applications, dashboards, and processes we built to help with capacity and efficiency. We start at the ten thousand foot view to understand the unique business and cloud problems that drove us to create these products, and discuss implementation details, including the challenges encountered along the way. Tools discussed include Picsou, the successor to our AWS billing file cost analyzer; Libra, an easy-to-use reservation conversion application; and cost and efficiency dashboards that relay useful financial context to 50+ engineering teams and managers.
ABD312 – Deep Dive: Migrating Big Data Workloads to AWS Customers are migrating their analytics, data processing (ETL), and data science workloads running on Apache Hadoop, Spark, and data warehouse appliances from on-premise deployments to AWS in order to save costs, increase availability, and improve performance. AWS offers a broad set of analytics services, including solutions for batch processing, stream processing, machine learning, data workflow orchestration, and data warehousing. This session will focus on identifying the components and workflows in your current environment; and providing the best practices to migrate these workloads to the right AWS data analytics product. We will cover services such as Amazon EMR, Amazon Athena, Amazon Redshift, Amazon Kinesis, and more. We will also feature Vanguard, an American investment management company based in Malvern, Pennsylvania with over $4.4 trillion in assets under management. Ritesh Shah, Sr. Program Manager for Cloud Analytics Program at Vanguard, will describe how they orchestrated their migration to AWS analytics services, including Hadoop and Spark workloads to Amazon EMR. Ritesh will highlight the technical challenges they faced and overcame along the way, as well as share common recommendations and tuning tips to accelerate the time to production.
ABD402 – How Esri Optimizes Massive Image Archives for Analytics in the Cloud Petabyte-scale archives of satellite, plane, and drone imagery continue to grow exponentially. They mostly exist as semi-structured data, but they are only valuable when accessed and processed by a wide range of products for both visualization and analysis. This session provides an overview of how ArcGIS indexes and structures data so that any part of it can be quickly accessed, processed, and analyzed by reading only the minimum amount of data needed for the task. In this session, we share best practices for structuring and compressing massive datasets in Amazon S3, so they can be analyzed efficiently. We also review a number of different image formats, including GeoTIFF (used for the Public Datasets on AWS program, Landsat on AWS), cloud optimized GeoTIFF, MRF, and CRF as well as different compression approaches to show the effect on processing performance. Finally, we provide examples of how this technology has been used to help image processing and analysis for the response to Hurricane Harvey.
ABD329 – A Look Under the Hood – How Amazon.com Uses AWS Services for Analytics at Massive Scale Amazon’s consumer business continues to grow, and so does the volume of data and the number and complexity of the analytics done in support of the business. In this session, we talk about how Amazon.com uses AWS technologies to build a scalable environment for data and analytics. We look at how Amazon is evolving the world of data warehousing with a combination of a data lake and parallel, scalable compute engines such as Amazon EMR and Amazon Redshift.
ABD327 – Migrating Your Traditional Data Warehouse to a Modern Data Lake In this session, we discuss the latest features of Amazon Redshift and Redshift Spectrum, and take a deep dive into its architecture and inner workings. We share many of the recent availability, performance, and management enhancements and how they improve your end user experience. You also hear from 21st Century Fox, who presents a case study of their fast migration from an on-premises data warehouse to Amazon Redshift. Learn how they are expanding their data warehouse to a data lake that encompasses multiple data sources and data formats. This architecture helps them tie together siloed business units and get actionable 360-degree insights across their consumer base.
MCL202 – Ally Bank & Cognizant: Transforming Customer Experience Using Amazon Alexa Given the increasing popularity of natural language interfaces such as Voice as User technology or conversational artificial intelligence (AI), Ally® Bank was looking to interact with customers by enabling direct transactions through conversation or voice. They also needed to develop a capability that allows third parties to connect to the bank securely for information sharing and exchange, using oAuth, an authentication protocol seen as the future of secure banking technology. Cognizant’s Architecture team partnered with Ally Bank’s Enterprise Architecture group and identified the right product for oAuth integration with Amazon Alexa and third-party technologies. In this session, we discuss how building products with conversational AI helps Ally Bank offer an innovative customer experience; increase retention through improved data-driven personalization; increase the efficiency and convenience of customer service; and gain deep insights into customer needs through data analysis and predictive analytics to offer new products and services.
MCL317 – Orchestrating Machine Learning Training for Netflix Recommendations At Netflix, we use machine learning (ML) algorithms extensively to recommend relevant titles to our 100+ million members based on their tastes. Everything on the member home page is an evidence-driven, A/B-tested experience that we roll out backed by ML models. These models are trained using Meson, our workflow orchestration system. Meson distinguishes itself from other workflow engines by handling more sophisticated execution graphs, such as loops and parameterized fan-outs. Meson can schedule Spark jobs, Docker containers, bash scripts, gists of Scala code, and more. Meson also provides a rich visual interface for monitoring active workflows and inspecting execution logs. It has a powerful Scala DSL for authoring workflows as well as the REST API. In this session, we focus on how Meson trains recommendation ML models in production, and how we have re-architected it to scale up for a growing need of broad ETL applications within Netflix. As a driver for this change, we have had to evolve the persistence layer for Meson. We talk about how we migrated from Cassandra to Amazon RDS backed by Amazon Aurora
MCL350 – Humans vs. the Machines: How Pinterest Uses Amazon Mechanical Turk’s Worker Community to Improve Machine Learning Ever since the term “crowdsourcing” was coined in 2006, it’s been a buzzword for technology companies and social institutions. In the technology sector, crowdsourcing is instrumental for verifying machine learning algorithms, which, in turn, improves the user’s experience. In this session, we explore how Pinterest adapted to an increased reliability on human evaluation to improve their product, with a focus on how they’ve integrated with Mechanical Turk’s platform. This presentation is aimed at engineers, analysts, program managers, and product managers who are interested in how companies rely on Mechanical Turk’s human evaluation platform to better understand content and improve machine learning algorithms. The discussion focuses on the analysis and product decisions related to building a high quality crowdsourcing system that takes advantage of Mechanical Turk’s powerful worker community.
ABD201 – Big Data Architectural Patterns and Best Practices on AWS In this session, we simplify big data processing as a data bus comprising various stages: collect, store, process, analyze, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, durability, and so on. Finally, we provide reference architectures, design patterns, and best practices for assembling these technologies to solve your big data problems at the right cost
ABD202 – Best Practices for Building Serverless Big Data Applications Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. In this session, we show you how to incorporate serverless concepts into your big data architectures. We explore the concepts behind and benefits of serverless architectures for big data, looking at design patterns to ingest, store, process, and visualize your data. Along the way, we explain when and how you can use serverless technologies to streamline data processing, minimize infrastructure management, and improve agility and robustness and share a reference architecture using a combination of cloud and open source technologies to solve your big data problems. Topics include: use cases and best practices for serverless big data applications; leveraging AWS technologies such as Amazon DynamoDB, Amazon S3, Amazon Kinesis, AWS Lambda, Amazon Athena, and Amazon EMR; and serverless ETL, event processing, ad hoc analysis, and real-time analytics.
ABD206 – Building Visualizations and Dashboards with Amazon QuickSight Just as a picture is worth a thousand words, a visual is worth a thousand data points. A key aspect of our ability to gain insights from our data is to look for patterns, and these patterns are often not evident when we simply look at data in tables. The right visualization will help you gain a deeper understanding in a much quicker timeframe. In this session, we will show you how to quickly and easily visualize your data using Amazon QuickSight. We will show you how you can connect to data sources, generate custom metrics and calculations, create comprehensive business dashboards with various chart types, and setup filters and drill downs to slice and dice the data.
ABD203 – Real-Time Streaming Applications on AWS: Use Cases and Patterns To win in the marketplace and provide differentiated customer experiences, businesses need to be able to use live data in real time to facilitate fast decision making. In this session, you learn common streaming data processing use cases and architectures. First, we give an overview of streaming data and AWS streaming data capabilities. Next, we look at a few customer examples and their real-time streaming applications. Finally, we walk through common architectures and design patterns of top streaming data use cases.
ABD213 – How to Build a Data Lake with AWS Glue Data Catalog As data volumes grow and customers store more data on AWS, they often have valuable data that is not easily discoverable and available for analytics. The AWS Glue Data Catalog provides a central view of your data lake, making data readily available for analytics. We introduce key features of the AWS Glue Data Catalog and its use cases. Learn how crawlers can automatically discover your data, extract relevant metadata, and add it as table definitions to the AWS Glue Data Catalog. We will also explore the integration between AWS Glue Data Catalog and Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.
ABD214 – Real-time User Insights for Mobile and Web Applications with Amazon Pinpoint With customers demanding relevant and real-time experiences across a range of devices, digital businesses are looking to gather user data at scale, understand this data, and respond to customer needs instantly. This requires tools that can record large volumes of user data in a structured fashion, and then instantly make this data available to generate insights. In this session, we demonstrate how you can use Amazon Pinpoint to capture user data in a structured yet flexible manner. Further, we demonstrate how this data can be set up for instant consumption using services like Amazon Kinesis Firehose and Amazon Redshift. We walk through example data based on real world scenarios, to illustrate how Amazon Pinpoint lets you easily organize millions of events, record them in real-time, and store them for further analysis.
ABD223 – IT Innovators: New Technology for Leveraging Data to Enable Agility, Innovation, and Business Optimization Companies of all sizes are looking for technology to efficiently leverage data and their existing IT investments to stay competitive and understand where to find new growth. Regardless of where companies are in their data-driven journey, they face greater demands for information by customers, prospects, partners, vendors and employees. All stakeholders inside and outside the organization want information on-demand or in “real time”, available anywhere on any device. They want to use it to optimize business outcomes without having to rely on complex software tools or human gatekeepers to relevant information. Learn how IT innovators at companies such as MasterCard, Jefferson Health, and TELUS are using Domo’s Business Cloud to help their organizations more effectively leverage data at scale.
ABD301 – Analyzing Streaming Data in Real Time with Amazon Kinesis Amazon Kinesis makes it easy to collect, process, and analyze real-time, streaming data so you can get timely insights and react quickly to new information. In this session, we present an end-to-end streaming data solution using Kinesis Streams for data ingestion, Kinesis Analytics for real-time processing, and Kinesis Firehose for persistence. We review in detail how to write SQL queries using streaming data and discuss best practices to optimize and monitor your Kinesis Analytics applications. Lastly, we discuss how to estimate the cost of the entire system
ABD302 – Real-Time Data Exploration and Analytics with Amazon Elasticsearch Service and Kibana In this session, we use Apache web logs as example and show you how to build an end-to-end analytics solution. First, we cover how to configure an Amazon ES cluster and ingest data using Amazon Kinesis Firehose. We look at best practices for choosing instance types, storage options, shard counts, and index rotations based on the throughput of incoming data. Then we demonstrate how to set up a Kibana dashboard and build custom dashboard widgets. Finally, we review approaches for generating custom, ad-hoc reports.
ABD304 – Best Practices for Data Warehousing with Amazon Redshift & Redshift Spectrum Most companies are over-run with data, yet they lack critical insights to make timely and accurate business decisions. They are missing the opportunity to combine large amounts of new, unstructured big data that resides outside their data warehouse with trusted, structured data inside their data warehouse. In this session, we take an in-depth look at how modern data warehousing blends and analyzes all your data, inside and outside your data warehouse without moving the data, to give you deeper insights to run your business. We will cover best practices on how to design optimal schemas, load data efficiently, and optimize your queries to deliver high throughput and performance.
ABD305 – Design Patterns and Best Practices for Data Analytics with Amazon EMR Amazon EMR is one of the largest Hadoop operators in the world, enabling customers to run ETL, machine learning, real-time processing, data science, and low-latency SQL at petabyte scale. In this session, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters, and other Amazon EMR architectural best practices. We talk about lowering cost with Auto Scaling and Spot Instances, and security best practices for encryption and fine-grained access control. Finally, we dive into some of our recent launches to keep you current on our latest features.
ABD307 – Deep Analytics for Global AWS Marketing Organization To meet the needs of the global marketing organization, the AWS marketing analytics team built a scalable platform that allows the data science team to deliver custom econometric and machine learning models for end user self-service. To meet data security standards, we use end-to-end data encryption and different AWS services such as Amazon Redshift, Amazon RDS, Amazon S3, Amazon EMR with Apache Spark and Auto Scaling. In this session, you see real examples of how we have scaled and automated critical analysis, such as calculating the impact of marketing programs like re:Invent and prioritizing leads for our sales teams.
ABD311 – Deploying Business Analytics at Enterprise Scale with Amazon QuickSight One of the biggest tradeoffs customers usually make when deploying BI solutions at scale is agility versus governance. Large-scale BI implementations with the right governance structure can take months to design and deploy. In this session, learn how you can avoid making this tradeoff using Amazon QuickSight. Learn how to easily deploy Amazon QuickSight to thousands of users using Active Directory and Federated SSO, while securely accessing your data sources in Amazon VPCs or on-premises. We also cover how to control access to your datasets, implement row-level security, create scheduled email reports, and audit access to your data.
ABD315 – Building Serverless ETL Pipelines with AWS Glue Organizations need to gain insight and knowledge from a growing number of Internet of Things (IoT), APIs, clickstreams, unstructured and log data sources. However, organizations are also often limited by legacy data warehouses and ETL processes that were designed for transactional data. In this session, we introduce key ETL features of AWS Glue, cover common use cases ranging from scheduled nightly data warehouse loads to near real-time, event-driven ETL flows for your data lake. We discuss how to build scalable, efficient, and serverless ETL pipelines using AWS Glue. Additionally, Merck will share how they built an end-to-end ETL pipeline for their application release management system, and launched it in production in less than a week using AWS Glue.
ABD318 – Architecting a data lake with Amazon S3, Amazon Kinesis, and Amazon Athena Learn how to architect a data lake where different teams within your organization can publish and consume data in a self-service manner. As organizations aim to become more data-driven, data engineering teams have to build architectures that can cater to the needs of diverse users – from developers, to business analysts, to data scientists. Each of these user groups employs different tools, has different data needs, and accesses data in different ways. In this talk, we will dive deep into assembling a data lake using Amazon S3, Amazon Kinesis, Amazon Athena, Amazon EMR, and AWS Glue. The session will feature Mohit Rao, Architect and Integration lead at Atlassian, the maker of products such as JIRA, Confluence, and Stride. First, we will look at a couple of common architectures for building a data lake. Then we will show how Atlassian built a self-service data lake, where any team within the company can publish a dataset to be consumed by a broad set of users.
Companies have valuable data that they may not be analyzing due to the complexity, scalability, and performance issues of loading the data into their data warehouse. However, with the right tools, you can extend your analytics to query data in your data lake—with no loading required. Amazon Redshift Spectrum extends the analytic power of Amazon Redshift beyond data stored in your data warehouse to run SQL queries directly against vast amounts of unstructured data in your Amazon S3 data lake. This gives you the freedom to store your data where you want, in the format you want, and have it available for analytics when you need it. Join a discussion with AWS solution architects to ask questions.
ABD330 – Combining Batch and Stream Processing to Get the Best of Both Worlds Today, many architects and developers are looking to build solutions that integrate batch and real-time data processing, and deliver the best of both approaches. Lambda architecture (not to be confused with the AWS Lambda service) is a design pattern that leverages both batch and real-time processing within a single solution to meet the latency, accuracy, and throughput requirements of big data use cases. Come join us for a discussion on how to implement Lambda architecture (batch, speed, and serving layers) and best practices for data processing, loading, and performance tuning
ABD335 – Real-Time Anomaly Detection Using Amazon Kinesis Amazon Kinesis Analytics offers a built-in machine learning algorithm that you can use to easily detect anomalies in your VPC network traffic and improve security monitoring. Join us for an interactive discussion on how to stream your VPC flow Logs to Amazon Kinesis Streams and identify anomalies using Kinesis Analytics.
ABD339 – Deep Dive and Best Practices for Amazon Athena Amazon Athena is an interactive query service that enables you to process data directly from Amazon S3 without the need for infrastructure. Since its launch at re:invent 2016, several organizations have adopted Athena as the central tool to process all their data. In this talk, we dive deep into the most common use cases, including working with other AWS services. We review the best practices for creating tables and partitions and performance optimizations. We also dive into how Athena handles security, authorization, and authentication. Lastly, we hear from a customer who has reduced costs and improved time to market by deploying Athena across their organization.
We look forward to meeting you at re:Invent 2017!
About the Author
Roy Ben-Alta is a solution architect and principal business development manager at Amazon Web Services in New York. He focuses on Data Analytics and ML Technologies, working with AWS customers to build innovative data-driven products.
Data security is paramount in many industries. Organizations that shift their IT infrastructure to the cloud must ensure that their data is protected and that the attack surface is minimized. This post focuses on a method of securely loading a subset of data from one Amazon Redshift cluster to another Amazon Redshift cluster that is located in a different AWS account. You can accomplish this by dynamically controlling the security group ingress rules that are attached to the clusters.
Disclaimer: This solution is just one component of creating a secure infrastructure for loading and migrating data. It should not be thought of as an all-encompassing security solution. This post doesn’t touch on other important security topics, such as connecting to Amazon Redshift using Secure Sockets Layer (SSL) and encrypting data at rest.
The case for creating a segregated data loading account
From a security perspective, it is easier to restrict access to sensitive infrastructure if the respective stages (dev, QA, staging, and prod) are each located in their own isolated AWS account. Another common method for isolating resources is to set up separate virtual private clouds (VPCs) for each stage, all within a single AWS account. Because many services live outside the VPC (for example, Amazon S3, Amazon DynamoDB, and Amazon Kinesis), it requires careful thought to isolate the resources that should be associated with dev, QA, staging, and prod.
The segregated account model setup does create more overhead. But it gives administrators more control without them having to create tags and use cumbersome naming conventions to define a logical stage. In the segregated account model, all the data and infrastructure that are located in an account belong to that particular stage of the release pipeline (dev, QA, staging, or prod).
But where should you put infrastructure that does not belong to one particular stage?
Infrastructure to support deployments or to load data across accounts is best located in another segregated account. By deploying infrastructure or loading data from a separate account, you can’t depend on any existing roles, VPCs, subnets, etc. Any information that is necessary to deploy your infrastructure or load the data must be captured up front. This allows you to perform repeatable processes in a predictable and secure manner. With the recent addition of the StackSets feature in AWS CloudFormation, you can provision and manage infrastructure in multiple AWS accounts and Regions from a single template. This four-part blog series discusses different ways of automating the creation of cross-account roles and capturing account-specific information.
Loading OpenFDA data into Amazon Redshift
Before you get started with loading data from one Amazon Redshift cluster to another, you first need to create an Amazon Redshift cluster and load some data into it. You can use the following AWS CloudFormation template to create an Amazon Redshift cluster. You need to create Amazon Redshift clusters in both the source and target accounts.
AWSTemplateFormatVersion: '2010-09-09'
Description: This template creates a Redshift cluster with the supplied username and password.
Parameters:
  username:
    Type: String
    Description: The master username for the Redshift cluster.
  password:
    Type: String
    Description: The master password for the Redshift cluster.
    MinLength: 8
Resources:
  DataWarehouse:
    Type: AWS::Redshift::Cluster
    Properties:
      ClusterType: single-node
      DBName: openfda
      MasterUsername:
        Ref: username
      MasterUserPassword:
        Ref: password
      NodeType: dc1.large
Outputs:
  RedshiftEndpoint:
    Description: The endpoint address of the Redshift cluster.
    Value:
      Fn::GetAtt:
      - DataWarehouse
      - Endpoint.Address
After you create your Amazon Redshift clusters, you can go ahead and load some data into the cluster that is located in your source account. One of the great benefits of AWS is the ability to host and share public datasets on Amazon S3. When you test different architectures, these datasets serve as useful resources to get up and running without a lot of effort. For this post, we use the OpenFDA food enforcement dataset because it is a relatively small file and is easy to work with.
In the source account, you need to spin up an Amazon EMR cluster with Apache Spark so that you can unzip the file and format it properly before loading it into Amazon Redshift. The following AWS CloudFormation template provides the EMR cluster that you need.
AWSTemplateFormatVersion: '2010-09-09'
Description: This template creates an EMR cluster to load OpenFDA data into the source Redshift cluster.
Parameters:
  keyPairName:
    Type: String
    Description: The name of the KeyPair to SSH into the EMR instances.
Resources:
  Cluster:
    Type: AWS::EMR::Cluster
    Properties:
      Applications:
      - Name: Hadoop
      - Name: Spark
      - Name: Zeppelin
      - Name: Livy
      Instances:
        MasterInstanceGroup:
          InstanceCount: 1
          InstanceType: m3.xlarge
          Market: ON_DEMAND
          Name: Master
        CoreInstanceGroup:
          InstanceCount: 2
          InstanceType: m3.xlarge
          Market: ON_DEMAND
          Name: Core
        TerminationProtected: false
        Ec2KeyName:
          Ref: keyPairName
      JobFlowRole: EMR_EC2_DefaultRole
      Name: OpenFDALoader
      ServiceRole: EMR_DefaultRole
      ReleaseLabel: emr-5.9.0
      VisibleToAllUsers: true
Note: As an alternative, you can load the data using AWS Glue, which now supports Scala.
Now that your EMR cluster is up and running, you can submit this Scala code over a REST API call to Apache Livy. You also have the option of running this code inside of an Apache Zeppelin notebook.
To be able to run this Scala code, you must install the spark-redshift driver on your EMR cluster.
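The mechanics of submitting that code to Livy are straightforward. The following is a minimal sketch in Python; the EMR master DNS name and the placeholder Scala statement are assumptions that you would replace with your own values and with the actual OpenFDA loading code.

import json
import time

import requests

LIVY_URL = 'http://<emr-master-dns>:8998'   # Livy listens on port 8998 on the EMR master node
HEADERS = {'Content-Type': 'application/json'}

# Create an interactive Spark (Scala) session.
session = requests.post(LIVY_URL + '/sessions', headers=HEADERS,
                        data=json.dumps({'kind': 'spark'})).json()
session_url = '{}/sessions/{}'.format(LIVY_URL, session['id'])

# Wait until the session is idle before submitting any code.
while requests.get(session_url, headers=HEADERS).json()['state'] != 'idle':
    time.sleep(5)

# Submit the Scala code as a statement. Replace the placeholder string with the loading code.
scala_code = 'println(spark.version)  // placeholder for the OpenFDA loading code'
statement = requests.post(session_url + '/statements', headers=HEADERS,
                          data=json.dumps({'code': scala_code})).json()
print(statement)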
Connect to your source Amazon Redshift cluster in your source account, and verify that the data is present by running a quick query:
select count(*) from public.food_enforcement;
Opening up the security groups
Now that the data has been loaded in the source Amazon Redshift cluster, it can be moved over to the target Amazon Redshift cluster. Because the security groups that are associated with the two clusters are very restrictive, there is no way to load the data from the centralized data loading AWS account without modifying the ingress rules on both security groups. Here are a few possible options:
Add an ingress rule to allow all traffic to port 5439 (the default Amazon Redshift port).
This option is not recommended because you are widening your attack surface significantly and exposing yourself to a potential attack.
Peer the VPC in the data loader account to the source and target Amazon Redshift VPCs, and modify the ingress rule to allow all traffic from the private IP range of the data loader VPC.
This solution is reasonably secure but does require some manual setup. Because the ingress rules in the source and target Amazon Redshift clusters allow access from the VPC private IP range, any resources in the data loader account can access both clusters, which is suboptimal.
Leave long-running Amazon EC2 instances or EMR clusters in the data loader AWS account and manually create specific ingress rules in the source and target Amazon Redshift security groups to allow for those specific IPs.
This option creates a lot of wasted cost because it requires leaving EC2 instances or an EMR cluster running indefinitely whether or not they are actually being used.
None of these three options is ideal, so let’s explore another option. One of the more powerful features of running EC2 instances in the cloud is the ability to dynamically manage and configure your environment using instance metadata. The AWS Cloud is dynamic by nature and incentivizes you to reduce costs by terminating instances when they are not being used. Therefore, instance metadata can serve as the glue to performing repeatable processes to these dynamic instances.
To load the data from the source Amazon Redshift cluster to the target Amazon Redshift cluster, perform the following steps:
Spin up an EC2 instance in the data loader account.
Use instance metadata to look up the IP of the EC2 instance.
Assume roles in the source and target accounts using the AWS Security Token Service (AWS STS), and create ingress rules for the IP that was retrieved from instance metadata (a sketch of steps 2, 3, and 5 appears after this list).
Run a simple Python or Java program to perform a simple transformation and unload the data from the source Amazon Redshift cluster. Then load the results into the target Amazon Redshift cluster.
UNLOAD('select case when product_description ilike ''%milk%'' then 1 else 0 end as milk_flag
, *
from dev.public.food_enforcement
where left(recall_initiation_date, 4) >= 2016')
TO 's3://<Your S3 Bucket>/milk-food-enforcement.csv'
IAM_ROLE '<Your Redshift Role>'
ADDQUOTES;

COPY public.milk_food_enforcement
FROM 's3://<Your S3 Bucket>/milk-food-enforcement.csv'
IAM_ROLE '<Your Redshift Role>'
REMOVEQUOTES
DELIMITER '|';
Assume roles in the source and target accounts using AWS STS, and remove the ingress rules that were created in step 3.
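The following is a minimal sketch of steps 2, 3, and 5 using boto3. The role ARNs, security group IDs, and region are placeholders for your own values, and error handling is omitted.

import boto3
import requests

# Step 2: look up the IP of this EC2 instance from instance metadata.
# Use public-ipv4 or local-ipv4 depending on how your clusters are reachable.
my_ip = requests.get('http://169.254.169.254/latest/meta-data/public-ipv4').text

def ec2_client_for(role_arn):
    """Assume a role in the source or target account and return an EC2 client."""
    creds = boto3.client('sts').assume_role(
        RoleArn=role_arn, RoleSessionName='data-loader')['Credentials']
    return boto3.client('ec2', region_name='us-east-1',
                        aws_access_key_id=creds['AccessKeyId'],
                        aws_secret_access_key=creds['SecretAccessKey'],
                        aws_session_token=creds['SessionToken'])

accounts = [
    ('arn:aws:iam::111111111111:role/DataLoaderRole', 'sg-0123456789abcdef0'),  # source cluster
    ('arn:aws:iam::222222222222:role/DataLoaderRole', 'sg-0fedcba9876543210'),  # target cluster
]

# Step 3: open the Redshift port (5439) to this instance only.
for role_arn, sg_id in accounts:
    ec2_client_for(role_arn).authorize_security_group_ingress(
        GroupId=sg_id, IpProtocol='tcp', FromPort=5439, ToPort=5439,
        CidrIp=my_ip + '/32')

# Step 4 runs here: connect to the clusters and execute the UNLOAD and COPY shown above.

# Step 5: remove the ingress rules again once the load is done.
for role_arn, sg_id in accounts:
    ec2_client_for(role_arn).revoke_security_group_ingress(
        GroupId=sg_id, IpProtocol='tcp', FromPort=5439, ToPort=5439,
        CidrIp=my_ip + '/32')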
Once step 5 is completed, you should see that the security groups for both Amazon Redshift clusters don’t allow traffic from any IP. Manually add your IP as an ingress rule to the target Amazon Redshift cluster’s security group on port 5439. When you run the following query, you should see that the data has been populated within the target Amazon Redshift cluster.
select *
from milk_food_enforcement;
Recap
This post highlighted the importance of loading data in a secure manner across accounts. It mentioned reasons why you might want to provision infrastructure and load data from a centralized account. Several candidate solutions were discussed. Ultimately, the solution that we chose involved opening up security groups for a single IP and then closing them back up after the data was loaded. This solution minimizes the attack surface to a single IP and can be completely automated.
Ryan Hood is a Data Engineer for AWS. He works on big data projects leveraging the newest AWS offerings. In his spare time, he attempts to sous-vide anything he can find in his refrigerator.
This is a guest post by Robin Cottiss, strategic customer consultant, Russell Christopher, staff product manager, and Vaidy Krishnan, senior manager of product marketing, at Tableau. Tableau, in their own words, “helps anyone quickly analyze, visualize, and share information. More than 61,000 customer accounts get rapid results with Tableau in the office and on the go. Over 300,000 people use Tableau Public to share public data in their blogs and websites.”
We’re excited to announce today an update to our Amazon Redshift connector with support for Amazon Redshift Spectrum to analyze data in external Amazon S3 tables. This feature, the direct result of joint engineering and testing work performed by the teams at Tableau and AWS, was released as part of Tableau 10.3.3 and will be available broadly in Tableau 10.4.1. With this update, you can quickly and directly connect Tableau to data in Amazon Redshift and analyze it in conjunction with data in Amazon S3—all with drag-and-drop ease.
This connector is yet another in a series of market-leading integrations of Tableau with AWS’s analytics platform, with services such as Amazon Redshift, Amazon EMR, and Amazon Athena. These integrations have allowed Tableau to become the natural choice of tool for analyzing data stored on AWS. Beyond this, Tableau Server runs seamlessly in the AWS Cloud infrastructure. If you prefer to deploy all your applications inside AWS, you have a complete solution offering from Tableau.
How does support for Amazon Redshift Spectrum help you?
If you’re like many Tableau customers, you have large buckets of data stored in Amazon S3. You might need to access this data frequently and store it in a consistent, highly structured format. If so, you can provision it to a data warehouse like Amazon Redshift. You might also want to explore this S3 data on an ad hoc basis. For example, you might want to determine whether or not to provision the data, and where—options might be Hadoop, Impala, Amazon EMR, or Amazon Redshift. To do so, you can use Amazon Athena, a serverless interactive query service from AWS that requires no infrastructure setup and management.
But what if you want to analyze both the frequently accessed data stored locally in Amazon Redshift AND your full datasets stored cost-effectively in Amazon S3? What if you want the throughput of disk and sophisticated query optimization of Amazon Redshift AND a service that combines a serverless scale-out processing capability with the massively reliable and scalable S3 infrastructure? What if you want the super-fast performance of Amazon Redshift AND support for open storage formats (for example, Parquet or ORC) in S3?
To enable these ANDs and resolve the tyranny of ORs, AWS launched Amazon Redshift Spectrum earlier this year.
Amazon Redshift Spectrum gives you the freedom to store your data where you want, in the format you want, and have it available for processing when you need it. Since the Amazon Redshift Spectrum launch, Tableau has worked tirelessly to provide best-in-class support for this new service. With Tableau and Redshift Spectrum, you can extend your Amazon Redshift analyses out to the entire universe of data in your S3 data lakes.
This latest update has been tested by many customers with very positive feedback. One such customer is the world’s largest food product distributor, Sysco; you can watch their session referencing the Amazon Redshift Spectrum integration at Tableau Conference 2017. Sysco also plans to reprise its “Tableau on AWS” story in a month’s time at AWS re:Invent.
Now, I’d like to use a concrete example to demonstrate how Tableau works with Amazon Redshift Spectrum. In this example, I also show you how and why you might want to connect to your AWS data in different ways.
The setup
I use the pipeline described following to ingest, process, and analyze data with Tableau on an AWS stack. The source data is the New York City Taxi dataset, which has 9 years’ worth of taxi rides activity (including pick-up and drop-off location, amount paid, payment type, and so on) captured in 1.2 billion records.
In this pipeline, this data lands in S3, is cleansed and partitioned by using Amazon EMR, and is then converted to a columnar Parquet format that is analytically optimized. You can point Tableau to the raw data in S3 by using Amazon Athena. You can also access the cleansed data with Tableau using Presto through your Amazon EMR cluster.
Why use Tableau this early in the pipeline? Because sometimes you want to understand what’s there and what questions are worth asking before you even start the analysis.
After you find out what those questions are and determine if this sort of analysis has long-term usefulness, you can automate and optimize that pipeline. You do this to add new data as soon as it arrives and get it to the processes and people that need it. You might also want to provision this data to a highly performant “hotter” layer (Amazon Redshift or Tableau Extract) for repeated access.
In the illustration preceding, S3 contains the raw denormalized ride data at the timestamp level of granularity. This S3 data is the fact table. Amazon Redshift has the time dimensions broken out by date, month, and year, and also has the taxi zone information.
Now imagine I want to know where and when taxi pickups happen on a certain date in a certain borough. With support for Amazon Redshift Spectrum, I can now join the S3 tables with the Amazon Redshift dimensions, as shown following.
I can next analyze the data in Tableau to produce a borough-by-borough view of New York City ride density on Christmas Day 2015.
Or I can home in on just Manhattan and identify pickup hotspots, with ride charges way above the average!
With Amazon Redshift Spectrum, you now have a fast, cost-effective engine that minimizes data processed with dynamic partition pruning. You can further improve query performance by reducing the data scanned. You do this by partitioning and compressing data and by using a columnar format for storage.
At the end of the day, which engine you use behind Tableau is a function of what you want to optimize for. Some possible engines are Amazon Athena, Amazon Redshift, and Redshift Spectrum, or you can bring a subset of data into Tableau Extract. Factors in planning optimization include these:
Are you comfortable with the serverless cost model of Amazon Athena and the potential for full scans, given its advantage of requiring no setup?
Do you want the throughput of local disk?
Effort and time of setup: are you okay with the lead time of setting up an Amazon Redshift cluster, as opposed to just bringing everything into a Tableau Extract?
To meet the many needs of our customers, Tableau’s approach is simple: it’s all about choice, the choice of how you want to connect to and analyze your data. Throughout the history of our product and into the future, we have empowered choice for customers and will continue to do so.
For more on how to deal with choice, as you go about making architecture decisions for your enterprise, watch this big data strategy session my friend Robin Cottiss and I delivered at Tableau Conference 2017. This session includes several customer examples leveraging the Tableau on AWS platform, and also a run-through of the aforementioned demonstration.
If you’re curious to learn more about analyzing data with Tableau on Amazon Redshift, we encourage you to check out the following resources:
Check out the Tableau AWS Modern Data Warehouse Quickstart for an automated, AWS-certified reference deployment of Tableau on Amazon Redshift. The Quickstart simplifies the process of launching, configuring, and running projects with the required AWS resources for compute, network, storage, and other services, while following best practices for security, availability, and optimum query performance.
Hear Sysco and Expedia talk about their experiences deploying Tableau on AWS to achieve self-service analytics at AWS re:Invent 2017 in Las Vegas
In 2015, the Centers for Medicare and Medicaid Services (CMS) reported that healthcare spending made up 17.8% of the U.S. GDP – that’s almost $3.2 trillion or $9,990 per person. By 2025, the CMS estimates this number will increase to nearly 20%. As cloud technology evolves in the healthcare and life science industries, we are seeing how companies of all sizes are using AWS to provide powerful and innovative solutions to customers across the globe. This month we are excited to feature the following startups:
ClearCare – helping home care agencies operate efficiently and grow their business.
DNAnexus – providing a cloud-based global network for sharing and managing genomic data.
ClearCare (San Francisco, CA)
ClearCare envisions a future where home care is the only choice for aging in place. Home care agencies play a critical role in the economy and their communities by significantly lowering the overall cost of care, reducing the number of hospital admissions, and bending the cost curve of aging. Patients receiving home care typically have multiple chronic conditions and functional limitations, driving over $190 billion in healthcare spending in the U.S. each year. To offset these costs, health insurance payers are developing in-home care management programs for patients. ClearCare’s goal is to help home care agencies leverage technology to improve costs, outcomes, and quality of life for the aging population. The company’s powerful software platform is specifically designed for use by non-medical, in-home care agencies to manage their businesses.
Founder and CEO Geoff Nudd created ClearCare because of his own grandmother’s need for care. Keeping family members and caregivers up to date on a loved one’s well-being can be difficult, so Geoff created what is now ClearCare’s Family Room, which enables caregivers and agency staff to check schedules and receive real-time updates about what’s happening in the home. Since then, agencies have provided feedback on other areas of their businesses that could be streamlined. ClearCare has now built over 20 modules to help home care agencies optimize operations with services including a telephony service, billing and payroll, and more. ClearCare now serves over 4,000 home care agencies, representing 500,000 caregivers and 400,000 seniors.
Using AWS, ClearCare is able to spin up reliable infrastructure for proofs of concept and iterate on those systems to quickly get value to market. The company runs many AWS services including Amazon Elasticsearch Service, Amazon RDS, and Amazon CloudFront. Amazon EMR and Amazon Athena have enabled ClearCare to build a Hadoop-based ETL and data warehousing system that processes terabytes of data each day. By utilizing these managed services, ClearCare has been able to go from concept to customer delivery in less than three months.
DNAnexus (Mountain View, CA)
DNAnexus is accelerating the application of genomic data in precision medicine by providing a cloud-based platform for sharing and managing genomic and biomedical data and analysis tools. The company was founded in 2009 by Stanford graduate student Andreas Sundquist and two Stanford professors, Arend Sidow and Serafim Batzoglou, to address the need for scaling secondary analysis of next-generation sequencing (NGS) data in the cloud. The founders quickly learned that users needed a flexible solution to build complex analysis workflows and tools that enable them to share and manage large volumes of data. DNAnexus is optimized to address the challenges of security, scalability, and collaboration for organizations that are pursuing genomic-based approaches to health, both in clinics and research labs. DNAnexus has a global customer base – spanning North America, Europe, Asia-Pacific, South America, and Africa – that runs a million jobs each month and is doubling their storage year-over-year. The company currently stores more than 10 petabytes of biomedical and genomic data. That is equivalent to approximately 100,000 genomes, or in simpler terms, over 50 billion Facebook photos!
DNAnexus is working with its customers to help expand their translational informatics research, which includes expanding into clinical trial genomic services. This will help companies developing different medicines to better stratify clinical trial populations and develop companion tests that enable the right patient to get the right medicine. In collaboration with Janssen Human Microbiome Institute, DNAnexus is also launching Mosaic – a community platform for microbiome research.
AWS provides DNAnexus and its customers the flexibility to grow and scale research programs. Building the technology infrastructure required to manage these projects in-house is expensive and time-consuming. DNAnexus removes that barrier for labs of any size by using AWS scalable cloud resources. The company deploys its customers’ genomic pipelines on Amazon EC2, using Amazon S3 for high-performance, high-durability storage, and Amazon Glacier for low-cost data archiving. DNAnexus is also an AWS Life Sciences Competency Partner.
A data lake is an increasingly popular way to store and analyze data that addresses the challenges of dealing with massive volumes of heterogeneous data. A data lake allows organizations to store all their data—structured and unstructured—in one centralized repository. Because data can be stored as-is, there is no need to convert it to a predefined schema.
Many organizations understand the benefits of using AWS as their data lake. For example, Amazon S3 is a highly durable, cost-effective object store that supports open data formats while decoupling storage from compute, and it works with all the AWS analytic services. Although Amazon S3 provides the foundation of a data lake, you can add other services to tailor the data lake to your business needs. For more information about building data lakes on AWS, see What is a Data Lake?
Because one of the main challenges of using a data lake is finding the data and understanding the schema and data format, Amazon recently introduced AWS Glue. AWS Glue significantly reduces the time and effort that it takes to derive business insights quickly from an Amazon S3 data lake by discovering the structure and form of your data. AWS Glue automatically crawls your Amazon S3 data, identifies data formats, and then suggests schemas for use with other AWS analytic services.
This post walks you through the process of using AWS Glue to crawl your data on Amazon S3 and build a metadata store that can be used with other AWS offerings.
AWS Glue features
AWS Glue is a fully managed data catalog and ETL (extract, transform, and load) service that simplifies and automates the difficult and time-consuming tasks of data discovery, conversion, and job scheduling. AWS Glue crawls your data sources and constructs a data catalog using pre-built classifiers for popular data formats and data types, including CSV, Apache Parquet, JSON, and more.
The AWS Glue Data Catalog is compatible with Apache Hive Metastore and supports popular tools such as Hive, Presto, Apache Spark, and Apache Pig. It also integrates directly with Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.
In addition, the AWS Glue Data Catalog features the following extensions for ease-of-use and data-management functionality:
AWS Glue is an essential component of an Amazon S3 data lake, providing the data catalog and transformation services for modern data analytics.
In the preceding figure, data is staged for different analytic use cases. Initially, the data is ingested in its raw format, which is the immutable copy of the data. The data is then transformed and enriched to make it more valuable for each use case. In this example, the raw CSV files are transformed into Apache Parquet for use by Amazon Athena to improve performance and reduce cost.
The data can also be enriched by blending it with other datasets to provide additional insights. An AWS Glue crawler creates a table for each stage of the data based on a job trigger or a predefined schedule. In this example, an AWS Lambda function is used to trigger the ETL process every time a new file is added to the Raw Data S3 bucket. The tables can be used by Amazon Athena, Amazon Redshift Spectrum, and Amazon EMR to query the data at any stage using standard SQL or Apache Hive. This configuration is a popular design pattern that delivers Agile Business Intelligence to derive business value from a variety of data quickly and easily.
Walkthrough
In this walkthrough, you define a database, configure a crawler to explore data in an Amazon S3 bucket, create a table, transform the CSV file into Parquet, create a table for the Parquet data, and query the data with Amazon Athena.
Discover the data
Sign in to the AWS Management Console and open the AWS Glue console. You can find AWS Glue in the Analytics section. AWS Glue is currently available in US East (N. Virginia), US East (Ohio), and US West (Oregon). Additional AWS Regions are added frequently.
The first step to discovering the data is to add a database. A database is a collection of tables.
In the console, choose Add database. In Database name, type nycitytaxi, and choose Create.
Choose Tables in the navigation pane. A table consists of the names of columns, data type definitions, and other metadata about a dataset.
Add a table to the database nycitytaxi. You can add a table manually or by using a crawler. A crawler is a program that connects to a data store and progresses through a prioritized list of classifiers to determine the schema for your data. AWS Glue provides classifiers for common file types like CSV, JSON, Avro, and others. You can also write your own classifier using a grok pattern.
To add a crawler, enter the data source: an Amazon S3 bucket named s3://aws-bigdata-blog/artifacts/glue-data-lake/data/. This S3 bucket contains the data file consisting of all the rides for the green taxis for the month of January 2017.
Choose Next.
For IAM role, choose the default role AWSGlueServiceRoleDefault in the drop-down list.
For Frequency, choose Run on demand. The crawler can be run on demand or set to run on a schedule.
For Database, choose nycitytaxi. It is important to understand how AWS Glue deals with schema changes so that you can select the appropriate method. In this example, the table is updated with any change. For more information about schema changes, see Cataloging Tables with a Crawler in the AWS Glue Developer Guide.
Review the steps, and choose Finish. The crawler is ready to run. Choose Run it now. When the crawler has finished, one table has been added.
Choose Tables in the left navigation pane, and then choose data. This screen describes the table, including schema, properties, and other valuable information.
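If you prefer the AWS CLI over the console, the crawler can also be started and its results inspected with commands similar to the following sketch; the crawler name is a placeholder for whatever you named it above:
# start the crawler and list the tables it creates in the nycitytaxi database
$ aws glue start-crawler --name <your-crawler-name>
$ aws glue get-tables --database-name nycitytaxi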
Transform the data from CSV to Parquet format
Now you can configure and run a job to transform the data from CSV to Parquet. Parquet is a columnar format that is well suited for AWS analytics services like Amazon Athena and Amazon Redshift Spectrum.
Under ETL in the left navigation pane, choose Jobs, and then choose Add job.
For the Name, type nytaxi-csv-parquet.
For the IAM role, choose AWSGlueServiceRoleDefault.
For This job runs, choose A proposed script generated by AWS Glue.
Provide a unique Amazon S3 path to store the scripts.
Provide a unique Amazon S3 directory for a temporary directory.
Choose Next.
Choose data as the data source.
Choose Create tables in your data target.
Choose Parquet as the format.
Choose a new location (a new prefix location without any existing objects) to store the results.
Verify the schema mapping, and choose Finish.
View the job. This screen provides a complete view of the job and allows you to edit, save, and run the job. AWS Glue created this script. However, if required, you can create your own.
Choose Save, and then choose Run job.
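The same job can also be started and monitored from the AWS CLI; a rough sketch, assuming the job name used above:
# start the ETL job and check the status of its runs (RUNNING, SUCCEEDED, FAILED)
$ aws glue start-job-run --job-name nytaxi-csv-parquet
$ aws glue get-job-runs --job-name nytaxi-csv-parquet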
Add the Parquet table and crawler
When the job has finished, add a new table for the Parquet data using a crawler.
For Crawler name, type nytaxiparquet.
Choose S3 as the Data store.
Include the Amazon S3 path that you chose as the target for the ETL job.
For the IAM role, choose AWSGlueServiceRoleDefault.
For Database, choose nycitytaxi.
For Frequency, choose Run on demand.
After the crawler has finished, there are two tables in the nycitytaxi database: a table for the raw CSV data and a table for the transformed Parquet data.
Analyze the data with Amazon Athena
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is capable of querying CSV data. However, the Parquet file format significantly reduces the time and cost of querying the data. For more information, see the blog post Analyzing Data in Amazon S3 using Amazon Athena.
To use AWS Glue with Amazon Athena, you must upgrade your Athena data catalog to the AWS Glue Data Catalog. For more information about upgrading your Athena data catalog, see this step-by-step guide.
Open the AWS Management Console for Athena. The Query Editor displays both tables in the nycitytaxi database.
You can query the data using standard SQL.
Choose the nytaxigreenparquet table.
Type Select * From "nycitytaxi"."data" limit 10;
Choose Run Query.
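The same query can also be submitted programmatically; a minimal AWS CLI sketch, where the S3 output location is a placeholder you would replace with your own bucket:
$ aws athena start-query-execution \
    --query-string 'SELECT * FROM "nycitytaxi"."data" limit 10;' \
    --result-configuration OutputLocation=s3://<your-bucket>/athena-results/
# retrieve the results using the QueryExecutionId returned above
$ aws athena get-query-results --query-execution-id <query-execution-id>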
Conclusion
This post demonstrates how easy it is to build the foundation of a data lake using AWS Glue and Amazon S3. By using AWS Glue to crawl your data on Amazon S3 and build an Apache Hive-compatible metadata store, you can use the metadata across the AWS analytic services and popular Hadoop ecosystem tools. This combination of AWS services is powerful and easy to use, allowing you to get to business insights faster.
If you have questions or suggestions, please comment below.
Additional reading
See the following blog posts for more information:
Gordon Heinrich is a Solutions Architect working with global systems integrators. He works with our partners and customers to provide them architectural guidance for building data lakes and using AWS analytic services. In his spare time, he enjoys spending time with his family, skiing, hiking, and mountain biking in Colorado.
Success in the popular music industry is typically measured in terms of the number of Top 10 hits artists have to their credit. The music industry is a highly competitive multi-billion dollar business, and record labels incur various costs in exchange for a percentage of the profits from sales and concert tickets.
Predicting the success of an artist’s release in the popular music industry can be difficult. One release may be extremely popular, resulting in widespread play on TV, radio and social media, while another single may turn out quite unpopular, and therefore unprofitable. Record labels need to be selective in their decision making, and predictive analytics can help them with decision making around the type of songs and artists they need to promote.
In this walkthrough, you leverage H2O.ai, Amazon Athena, and RStudio to make predictions on whether a song might make it to the Top 10 Billboard charts. You explore the GLM, GBM, and deep learning modeling techniques using H2O’s rapid, distributed and easy-to-use open source parallel processing engine. RStudio is a popular IDE, licensed either commercially or under AGPLv3, for working with R. This is ideal if you don’t want to connect to a server via SSH and use code editors such as vi to do analytics. RStudio is available in a desktop version, or a server version that allows you to access R via a web browser. RStudio’s Notebooks feature is used to demonstrate the execution of code and output. In addition, this post showcases how you can leverage Athena for query and interactive analysis during the modeling phase. A working knowledge of statistics and machine learning would be helpful to interpret the analysis being performed in this post.
Walkthrough
Your goal is to predict whether a song will make it to the Top 10 Billboard charts. For this purpose, you will be using multiple modeling techniques―namely GLM, GBM and deep learning―and choose the model that is the best fit.
This solution involves the following steps:
Install and configure RStudio with Athena
Log in to RStudio
Install R packages
Connect to Athena
Create a dataset
Create models
Install and configure RStudio with Athena
Use the following AWS CloudFormation stack to install, configure, and connect RStudio on an Amazon EC2 instance with Athena.
Launching this stack creates all required resources and prerequisites:
Amazon EC2 instance with Amazon Linux (minimum size of t2.large is recommended)
Provisioning of the EC2 instance in an existing VPC and public subnet
Installation of Java 8
Assignment of an IAM role to the EC2 instance with the required permissions for accessing Athena and Amazon S3
Security group allowing access to the RStudio and SSH ports from the internet (I recommend restricting access to these ports)
S3 staging bucket required for Athena (referenced within RStudio as ATHENABUCKET)
RStudio username and password
Setup logs in Amazon CloudWatch Logs (if needed for additional troubleshooting)
Amazon EC2 Systems Manager agent, which makes the instance easy to manage and patch
All AWS resources are created in the US-East-1 Region. To avoid cross-region data transfer fees, launch the CloudFormation stack in the same region. To check the availability of Athena in other regions, see Region Table.
Log in to RStudio
The instance security group has been automatically configured to allow incoming connections on the RStudio port 8787 from any source internet address. You can edit the security group to restrict source IP access. If you have trouble connecting, ensure that port 8787 isn’t blocked by subnet network ACLs or by your outgoing proxy/firewall.
In the CloudFormation stack, choose Outputs, Value, and then open the RStudio URL. You might need to wait for a few minutes until the instance has been launched.
Log in to RStudio with the username and password you provided during setup.
Install R packages
Next, install the required R packages from the RStudio console. You can download the R notebook file containing just the code.
#install pacman – a handy package manager for managing installs
if("pacman" %in% rownames(installed.packages()) == FALSE)
{install.packages("pacman")}
library(pacman)
p_load(h2o,rJava,RJDBC,awsjavasdk)
h2o.init(nthreads = -1)
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 2 hours 42 minutes
## H2O cluster version: 3.10.4.6
## H2O cluster version age: 4 months and 4 days !!!
## H2O cluster name: H2O_started_from_R_rstudio_hjx881
## H2O cluster total nodes: 1
## H2O cluster total memory: 3.30 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## R Version: R version 3.3.3 (2017-03-06)
## Warning in h2o.clusterInfo():
## Your H2O cluster version is too old (4 months and 4 days)!
## Please download and install the latest version from http://h2o.ai/download/
#install aws sdk if not present (pre-requisite for using Athena with an IAM role)
if (!aws_sdk_present()) {
install_aws_sdk()
}
load_sdk()
## NULL
Connect to Athena
Next, establish a connection to Athena from RStudio, using an IAM role associated with your EC2 instance. Use ATHENABUCKET to specify the S3 staging directory.
URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.1.jar'
fil <- basename(URL)
#download the file into current working directory
if (!file.exists(fil)) download.file(URL, fil)
#verify that the file has been downloaded successfully
list.files()
## [1] "AthenaJDBC41-1.0.1.jar"
drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")
con <- jdbcConnection <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
s3_staging_dir=Sys.getenv("ATHENABUCKET"),
aws_credentials_provider_class="com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
Verify the connection. The results returned depend on your specific Athena setup.
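For example, a quick way to verify the connection from RStudio is to list the databases visible to Athena (a minimal check; any lightweight query works):
#list the databases available through the Athena connection
dbGetQuery(con, "show databases")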
For this analysis, you use a sample dataset combining information from Billboard and Wikipedia with Echo Nest data in the Million Songs Dataset. Upload this dataset into your own S3 bucket. The table below provides a description of the fields used in this dataset.
Field
Description
year
Year that song was released
songtitle
Title of the song
artistname
Name of the song artist
songid
Unique identifier for the song
artistid
Unique identifier for the song artist
timesignature
Variable estimating the time signature of the song
timesignature_confidence
Confidence in the estimate for the timesignature
loudness
Continuous variable indicating the average amplitude of the audio in decibels
tempo
Variable indicating the estimated beats per minute of the song
tempo_confidence
Confidence in the estimate for tempo
key
Variable with twelve levels indicating the estimated key of the song (C, C#, ..., B)
key_confidence
Confidence in the estimate for key
energy
Variable that represents the overall acoustic energy of the song, using a mix of features such as loudness
pitch
Continuous variable that indicates the pitch of the song
timbre_0_min thru timbre_11_min
Variables that indicate the minimum values over all segments for each of the twelve values in the timbre vector
timbre_0_max thru timbre_11_max
Variables that indicate the maximum values over all segments for each of the twelve values in the timbre vector
top10
Indicator for whether or not the song made it to the Top 10 of the Billboard charts (1 if it was in the top 10, and 0 if not)
Create an Athena table based on the dataset
In the Athena console, select the default database, sampledb, or create a new database.
Run the following create table statement.
create external table if not exists billboard
(
year int,
songtitle string,
artistname string,
songID string,
artistID string,
timesignature int,
timesignature_confidence double,
loudness double,
tempo double,
tempo_confidence double,
key int,
key_confidence double,
energy double,
pitch double,
timbre_0_min double,
timbre_0_max double,
timbre_1_min double,
timbre_1_max double,
timbre_2_min double,
timbre_2_max double,
timbre_3_min double,
timbre_3_max double,
timbre_4_min double,
timbre_4_max double,
timbre_5_min double,
timbre_5_max double,
timbre_6_min double,
timbre_6_max double,
timbre_7_min double,
timbre_7_max double,
timbre_8_min double,
timbre_8_max double,
timbre_9_min double,
timbre_9_max double,
timbre_10_min double,
timbre_10_max double,
timbre_11_min double,
timbre_11_max double,
Top10 int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://aws-bigdata-blog/artifacts/predict-billboard/data'
;
Inspect the table definition for the ‘billboard’ table that you have created. If you chose a database other than sampledb, replace that value with your choice.
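One way to inspect the definition from RStudio is to ask Athena for it directly (a minimal sketch; a DESCRIBE statement works as well):
#show the DDL for the billboard table
dbGetQuery(con, "show create table sampledb.billboard")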
Next, run a sample query to obtain a list of all songs from Janet Jackson that made it to the Billboard Top 10 charts.
dbGetQuery(con, " SELECT songtitle,artistname,top10 FROM sampledb.billboard WHERE lower(artistname) = 'janet jackson' AND top10 = 1")
## songtitle artistname top10
## 1 Runaway Janet Jackson 1
## 2 Because Of Love Janet Jackson 1
## 3 Again Janet Jackson 1
## 4 If Janet Jackson 1
## 5 Love Will Never Do (Without You) Janet Jackson 1
## 6 Black Cat Janet Jackson 1
## 7 Come Back To Me Janet Jackson 1
## 8 Alright Janet Jackson 1
## 9 Escapade Janet Jackson 1
## 10 Rhythm Nation Janet Jackson 1
Determine how many songs in this dataset are specifically from the year 2010.
dbGetQuery(con, " SELECT count(*) FROM sampledb.billboard WHERE year = 2010")
## _col0
## 1 373
The sample dataset provides certain song properties of interest that can be analyzed to gauge the impact to the song’s overall popularity. Look at one such property, timesignature, and determine the value that is the most frequent among songs in the database. Timesignature is a measure of the number of beats and the type of note involved.
Running the query directly may result in an error, as shown in the commented lines below. This error is a result of trying to retrieve a large result set over a JDBC connection, which can cause out-of-memory issues at the client level. To address this, reduce the fetch size and run again.
#t<-dbGetQuery(con, " SELECT timesignature FROM sampledb.billboard")
#Note: Running the preceding query results in the following error:
#Error in .jcall(rp, "I", "fetch", stride, block): java.sql.SQLException: The requested #fetchSize is more than the allowed value in Athena. Please reduce the fetchSize and try #again. Refer to the Athena documentation for valid fetchSize values.
# Use the dbSendQuery function, reduce the fetch size, and run again
r <- dbSendQuery(con, " SELECT timesignature FROM sampledb.billboard")
dftimesignature<- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
table(dftimesignature)
## dftimesignature
## 0 1 3 4 5 7
## 10 143 503 6787 112 19
nrow(dftimesignature)
## [1] 7574
From the results, observe that 6787 songs have a timesignature of 4.
Next, determine the song with the highest tempo.
dbGetQuery(con, " SELECT songtitle,artistname,tempo FROM sampledb.billboard WHERE tempo = (SELECT max(tempo) FROM sampledb.billboard) ")
## songtitle artistname tempo
## 1 Wanna Be Startin' Somethin' Michael Jackson 244.307
Create the training dataset
Your model needs to be trained such that it can learn and make accurate predictions. Split the data into training and test datasets, and create the training dataset first. This dataset contains all observations from the year 2009 and earlier. You may face the same JDBC connection issue pointed out earlier, so this query uses a fetch size.
#BillboardTrain <- dbGetQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
#Running the preceding query results in the following error:-
#Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", : Unable to retrieve #JDBC result set for SELECT * FROM sampledb.billboard WHERE year <= 2009 (Internal error)
#Follow the same approach as before to address this issue.
r <- dbSendQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
BillboardTrain <- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
BillboardTrain[1:2,c(1:3,6:10)]
## year songtitle artistname timesignature
## 1 2009 The Awkward Goodbye Athlete 3
## 2 2009 Rubik's Cube Athlete 3
## timesignature_confidence loudness tempo tempo_confidence
## 1 0.732 -6.320 89.614 0.652
## 2 0.906 -9.541 117.742 0.542
nrow(BillboardTrain)
## [1] 7201
Create the test dataset
BillboardTest <- dbGetQuery(con, "SELECT * FROM sampledb.billboard where year = 2010")
BillboardTest[1:2,c(1:3,11:15)]
## year songtitle artistname key
## 1 2010 This Is the House That Doubt Built A Day to Remember 11
## 2 2010 Sticks & Bricks A Day to Remember 10
## key_confidence energy pitch timbre_0_min
## 1 0.453 0.9666556 0.024 0.002
## 2 0.469 0.9847095 0.025 0.000
nrow(BillboardTest)
## [1] 373
Convert the training and test datasets into H2O dataframes
You need to designate the independent and dependent variables prior to applying your modeling algorithms. Because you’re trying to predict the ‘top10’ field, this would be your dependent variable and everything else would be independent.
Create your first model using GLM. Because GLM works best with numeric data, you create your model by dropping non-numeric variables. You only use the variables in the dataset that describe the numerical attributes of the song in the logistic regression model. You won’t use these variables: “year”, “songtitle”, “artistname”, “songid”, or “artistid”.
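The conversion and variable-selection code isn’t shown above, so here is a minimal sketch of how the H2O frames and variable lists referenced in the modeling calls (train.h2o, test.h2o, y.dep, x.indep) can be created; the object names mirror the ones used below and the column names assume the Athena table definition:
#convert the R data frames into H2O frames
train.h2o <- as.h2o(BillboardTrain)
test.h2o <- as.h2o(BillboardTest)
#designate the dependent variable and drop the non-numeric identifier columns
y.dep <- "top10"
x.indep <- setdiff(colnames(train.h2o), c("year", "songtitle", "artistname", "songid", "artistid", "top10"))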
Create Model 1 with the training dataset, using GLM as the modeling algorithm and H2O’s built-in h2o.glm function.
modelh1 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
##
|
| | 0%
|
|===== | 8%
|
|=================================================================| 100%
Measure the performance of Model 1, using H2O’s built-in performance function.
h2o.performance(model=modelh1,newdata=test.h2o)
## H2OBinomialMetrics: glm
##
## MSE: 0.09924684
## RMSE: 0.3150347
## LogLoss: 0.3220267
## Mean Per-Class Error: 0.2380168
## AUC: 0.8431394
## Gini: 0.6862787
## R^2: 0.254663
## Null Deviance: 326.0801
## Residual Deviance: 240.2319
## AIC: 308.2319
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 255 59 0.187898 =59/314
## 1 17 42 0.288136 =17/59
## Totals 272 101 0.203753 =76/373
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.192772 0.525000 100
## 2 max f2 0.124912 0.650510 155
## 3 max f0point5 0.416258 0.612903 23
## 4 max accuracy 0.416258 0.879357 23
## 5 max precision 0.813396 1.000000 0
## 6 max recall 0.037579 1.000000 282
## 7 max specificity 0.813396 1.000000 0
## 8 max absolute_mcc 0.416258 0.455251 23
## 9 max min_per_class_accuracy 0.161402 0.738854 125
## 10 max mean_per_class_accuracy 0.124912 0.765006 155
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.auc(h2o.performance(modelh1,test.h2o))
## [1] 0.8431394
The AUC metric provides insight into how well the classifier is able to separate the two classes. In this case, the value of 0.8431394 indicates that the classification is good. (A value of 0.5 indicates a worthless test, while a value of 1.0 indicates a perfect test.)
Next, inspect the coefficients of the variables in the dataset.
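A minimal way to pull the coefficients from the fitted GLM (either form returns the same information):
#coefficients of the GLM model
h2o.coef(modelh1)
modelh1@model$coefficients_table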
Typically, songs with heavier instrumentation tend to be louder (have higher values in the variable “loudness”) and more energetic (have higher values in the variable “energy”). This knowledge is helpful for interpreting the modeling results.
You can make the following observations from the results:
The coefficient estimates for the confidence values associated with the time signature, key, and tempo variables are positive. This suggests that higher confidence leads to a higher predicted probability of a Top 10 hit.
The coefficient estimate for loudness is positive, meaning that mainstream listeners prefer louder songs with heavier instrumentation.
The coefficient estimate for energy is negative, meaning that mainstream listeners prefer songs that are less energetic, which are those songs with light instrumentation.
These coefficients lead to contradictory conclusions for Model 1. This could be due to multicollinearity issues. Inspect the correlation between the variables “loudness” and “energy” in the training set.
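A one-line check of that correlation on the training data:
#correlation between loudness and energy in the training set
cor(BillboardTrain$loudness, BillboardTrain$energy)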
This number indicates that these two variables are highly correlated, and Model 1 does indeed suffer from multicollinearity. Typically, you associate a value of -1.0 to -0.5 or 0.5 to 1.0 with strong correlation, and a value of -0.1 to 0.1 with weak correlation. To avoid this correlation issue, omit one of these two variables and re-create the models.
You build two variations of the original model:
Model 2, in which you keep “energy” and omit “loudness”
Model 3, in which you keep “loudness” and omit “energy”
You compare these two models and choose the model with a better fit for this use case.
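A minimal sketch of how these two variations can be built, reusing the predictor list from Model 1; the object names modelh2 and modelh3 are assumptions:
x.indep2 <- setdiff(x.indep, "loudness") #Model 2: keep energy, drop loudness
x.indep3 <- setdiff(x.indep, "energy") #Model 3: keep loudness, drop energy
modelh2 <- h2o.glm(y = y.dep, x = x.indep2, training_frame = train.h2o, family = "binomial")
modelh3 <- h2o.glm(y = y.dep, x = x.indep3, training_frame = train.h2o, family = "binomial")
#compare performance on the test set
h2o.performance(model = modelh2, newdata = test.h2o)
h2o.performance(model = modelh3, newdata = test.h2o)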
Inspecting the coefficient of the variable energy, Model 2 suggests that songs with high energy levels tend to be more popular. This is in line with expectations.
However, H2O orders variables by significance, and the variable energy is not significant in this model.
You can conclude that Model 2 is not ideal for this use case, as energy is not significant.
From the confusion matrix, the model correctly predicts that 33 songs will be top 10 hits (true positives). However, it has 26 false positives (songs that the model predicted would be Top 10 hits, but ended up not being Top 10 hits).
Loudness has a positive coefficient estimate, meaning that this model predicts that songs with heavier instrumentation tend to be more popular. This is the same conclusion from Model 2.
Loudness is significant in this model.
Overall, Model 3 predicts a higher number of top 10 hits with an accuracy rate that is acceptable. To choose the best fit for production runs, record labels should consider the following factors:
Desired model accuracy at a given threshold
Number of correct predictions for top10 hits
Tolerable number of false positives or false negatives
Next, make predictions using Model 3 on the test dataset.
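A minimal sketch of generating those predictions; h2o.predict returns the predicted class along with the p0/p1 probabilities:
#score the test set with Model 3 and summarize the predictions
predict.m3 <- as.data.frame(h2o.predict(modelh3, test.h2o))
head(predict.m3) #probabilities and predicted class for the first few songs
table(predict.m3$predict) #how many songs the model labels as Top 10 hits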
The first set of output results specifies the probabilities associated with each predicted observation. For example, observation 1 is 96.54739% likely to not be a Top 10 hit, and 3.4526052% likely to be a Top 10 hit (predict=1 indicates a Top 10 hit and predict=0 indicates not a Top 10 hit). The second set of results lists the actual predictions made. From the third set of results, this model predicts that 61 songs will be Top 10 hits.
Compute the baseline accuracy, by assuming that the baseline predicts the most frequent outcome, which is that most songs are not Top 10 hits.
table(BillboardTest$top10)
##
## 0 1
## 314 59
Now observe that the baseline model would get 314 observations correct, and 59 wrong, for an accuracy of 314/(314+59) = 0.8418231.
It seems that Model 3, with an accuracy of 0.8552, provides you with a small improvement over the baseline model. But is this model useful for record labels?
View the two models from an investment perspective:
A production company is interested in investing in songs that are more likely to make it to the Top 10. The company’s objective is to minimize the risk of financial losses attributed to investing in songs that end up unpopular.
How many songs does Model 3 correctly predict as a Top 10 hit in 2010? Looking at the confusion matrix, you see that it predicts 33 Top 10 hits correctly at the optimal threshold, which is more than half the number of songs that actually made the Top 10 in 2010.
It will be more useful to the record label if you can provide the production company with a list of songs that are highly likely to end up in the Top 10.
The baseline model is not useful, as it simply does not label any song as a hit.
Considering the three models built so far, you can conclude that Model 3 proves to be the best investment choice for the record label.
GBM model
H2O provides you with the ability to explore other learning models, such as GBM and deep learning. Explore building a model using the GBM technique, using the built-in h2o.gbm function.
Before you do this, you need to convert the target variable to a factor for multinomial classification techniques.
train.h2o$top10=as.factor(train.h2o$top10)
gbm.modelh <- h2o.gbm(y=y.dep, x=x.indep, training_frame = train.h2o, ntrees = 500, max_depth = 4, learn_rate = 0.01, seed = 1122,distribution="multinomial")
##
|
| | 0%
|
|=== | 5%
|
|===== | 7%
|
|====== | 9%
|
|======= | 10%
|
|====================== | 33%
|
|===================================== | 56%
|
|==================================================== | 79%
|
|================================================================ | 98%
|
|=================================================================| 100%
perf.gbmh<-h2o.performance(gbm.modelh,test.h2o)
perf.gbmh
## H2OBinomialMetrics: gbm
##
## MSE: 0.09860778
## RMSE: 0.3140188
## LogLoss: 0.3206876
## Mean Per-Class Error: 0.2120263
## AUC: 0.8630573
## Gini: 0.7261146
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 266 48 0.152866 =48/314
## 1 16 43 0.271186 =16/59
## Totals 282 91 0.171582 =64/373
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.189757 0.573333 90
## 2 max f2 0.130895 0.693717 145
## 3 max f0point5 0.327346 0.598802 26
## 4 max accuracy 0.442757 0.876676 14
## 5 max precision 0.802184 1.000000 0
## 6 max recall 0.049990 1.000000 284
## 7 max specificity 0.802184 1.000000 0
## 8 max absolute_mcc 0.169135 0.496486 104
## 9 max min_per_class_accuracy 0.169135 0.796610 104
## 10 max mean_per_class_accuracy 0.169135 0.805948 104
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.sensitivity(perf.gbmh,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501205344484314. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.1355932
h2o.auc(perf.gbmh)
## [1] 0.8630573
This model correctly predicts 43 top 10 hits, which is 10 more than the number predicted by Model 3. Moreover, the AUC metric is higher than the one obtained from Model 3.
As seen above, H2O’s API provides the ability to obtain key statistical measures required to analyze the models easily, using several built-in functions. The record label can experiment with different parameters to arrive at the model that predicts the maximum number of Top 10 hits at the desired level of accuracy and threshold.
H2O also allows you to experiment with deep learning models. Deep learning models have the ability to learn features implicitly, but can be more expensive computationally.
Now, create a deep learning model with the h2o.deeplearning function, using the same training and test datasets created before. The time taken to run this model depends on the type of EC2 instance chosen for this purpose. For models that require more computation, consider using accelerated computing instances such as the P2 instance type.
system.time(
dlearning.modelh <- h2o.deeplearning(y = y.dep,
x = x.indep,
training_frame = train.h2o,
epoch = 250,
hidden = c(250,250),
activation = "Rectifier",
seed = 1122,
distribution="multinomial"
)
)
##
|
| | 0%
|
|=== | 4%
|
|===== | 8%
|
|======== | 12%
|
|========== | 16%
|
|============= | 20%
|
|================ | 24%
|
|================== | 28%
|
|===================== | 32%
|
|======================= | 36%
|
|========================== | 40%
|
|============================= | 44%
|
|=============================== | 48%
|
|================================== | 52%
|
|==================================== | 56%
|
|======================================= | 60%
|
|========================================== | 64%
|
|============================================ | 68%
|
|=============================================== | 72%
|
|================================================= | 76%
|
|==================================================== | 80%
|
|======================================================= | 84%
|
|========================================================= | 88%
|
|============================================================ | 92%
|
|============================================================== | 96%
|
|=================================================================| 100%
## user system elapsed
## 1.216 0.020 166.508
perf.dl<-h2o.performance(model=dlearning.modelh,newdata=test.h2o)
perf.dl
## H2OBinomialMetrics: deeplearning
##
## MSE: 0.1678359
## RMSE: 0.4096778
## LogLoss: 1.86509
## Mean Per-Class Error: 0.3433013
## AUC: 0.7568822
## Gini: 0.5137644
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 290 24 0.076433 =24/314
## 1 36 23 0.610169 =36/59
## Totals 326 47 0.160858 =60/373
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.826267 0.433962 46
## 2 max f2 0.000000 0.588235 239
## 3 max f0point5 0.999929 0.511811 16
## 4 max accuracy 0.999999 0.865952 10
## 5 max precision 1.000000 1.000000 0
## 6 max recall 0.000000 1.000000 326
## 7 max specificity 1.000000 1.000000 0
## 8 max absolute_mcc 0.999929 0.363219 16
## 9 max min_per_class_accuracy 0.000004 0.662420 145
## 10 max mean_per_class_accuracy 0.000000 0.685334 224
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.sensitivity(perf.dl,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.496293348880151. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.3898305
h2o.auc(perf.dl)
## [1] 0.7568822
The AUC metric for this model is 0.7568822, which is lower than what you got from the earlier models. I recommend further experimentation using different hyperparameters, such as the learning rate, the number of epochs, or the number of hidden layers.
H2O’s built-in functions provide many key statistical measures that can help measure model performance. Here are some of these key terms.
Metric
Description
Sensitivity
Measures the proportion of positives that have been correctly identified. It is also called the true positive rate, or recall.
Specificity
Measures the proportion of negatives that have been correctly identified. It is also called the true negative rate.
Threshold
Cutoff point that maximizes specificity and sensitivity. While the model may not provide the highest prediction at this point, it would not be biased towards positives or negatives.
Precision
The fraction of the documents retrieved that are relevant to the information needed, for example, how many of the positively classified are relevant
AUC
Provides insight into how well the classifier is able to separate the two classes. The implicit goal is to deal with situations where the sample distribution is highly skewed, with a tendency to overfit to a single class.
0.90 – 1 = excellent (A)
0.8 – 0.9 = good (B)
0.7 – 0.8 = fair (C)
0.6 – 0.7 = poor (D)
0.5 – 0.6 = fail (F)
Here’s a summary of the metrics generated from H2O’s built-in functions for the three models that produced useful results.
Metric                Model 3                   GBM Model                 Deep Learning Model
Accuracy (max)        0.882038 (t=0.435479)     0.876676 (t=0.442757)     0.865952 (t=0.999999)
Precision (max)       1.0 (t=0.821606)          1.0 (t=0.802184)          1.0 (t=1.0)
Recall (max)          1.0                       1.0                       1.0 (t=0)
Specificity (max)     1.0                       1.0                       1.0 (t=1)
Sensitivity (t=0.5)   0.2033898                 0.1355932                 0.3898305
AUC                   0.8492389                 0.8630573                 0.756882
Note: ‘t’ denotes threshold.
Your options at this point could be narrowed down to Model 3 and the GBM model, based on the AUC and accuracy metrics observed earlier. If the slightly lower accuracy of the GBM model is deemed acceptable, the record label can choose to go to production with the GBM model, as it can predict a higher number of Top 10 hits. The AUC metric for the GBM model is also higher than that of Model 3.
Record labels can experiment with different learning techniques and parameters before arriving at a model that proves to be the best fit for their business. Because deep learning models can be computationally expensive, record labels can choose more powerful EC2 instances on AWS to run their experiments faster.
Conclusion
In this post, I showed how the popular music industry can use analytics to predict the type of songs that make the Top 10 Billboard charts. By running H2O’s scalable machine learning platform on AWS, data scientists can easily experiment with multiple modeling techniques and interactively query the data using Amazon Athena, without having to manage the underlying infrastructure. This helps record labels make critical decisions on the type of artists and songs to promote in a timely fashion, thereby increasing sales and revenue.
If you have questions or suggestions, please comment below.
Gopal Wunnava is a Partner Solution Architect with the AWS GSI Team. He works with partners and customers on big data engagements, and is passionate about building analytical solutions that drive business capabilities and decision making. In his spare time, he loves all things sports and movies related and is fond of old classics like Asterix, Obelix comics and Hitchcock movies.
Bob Strahan, a Senior Consultant with AWS Professional Services, contributed to this post.
Oath, parent company of Yahoo, has announced that it has released Vespa as an open source project on GitHub. “Building applications increasingly means dealing with huge amounts of data. While developers can use the Hadoop stack to store and batch process big data, and Storm to stream-process data, these technologies do not help with serving results to end users. Serving is challenging at large scale, especially when it is necessary to make computations quickly over data while a user is waiting, as with applications that feature search, recommendation, and personalization.
By releasing Vespa, we are making it easy for anyone to build applications that can compute responses to user requests, over large datasets, at real time and at internet scale – capabilities that up until now, have been within reach of only a few large companies.” (Thanks to Paul Wise)
Amazon EMR lets you have complete control over your cluster, giving you the flexibility to customize a cluster and install additional applications easily. EMR customers often use bootstrap actions to install and configure custom software in a cluster. However, bootstrap actions only run during the cluster or node startup. This makes it difficult for you to make configuration changes after a cluster is already running.
EMR clusters can also use a custom Amazon Machine Image (AMI). With the new support for launching clusters with custom Amazon Linux AMIs, customizing an EMR cluster is now even easier. However, the task of creating and managing custom AMIs can become increasingly difficult as the number of AMIs in your environment starts to increase.
Amazon EC2 Systems Manager helps you automate management tasks such as creating AMIs or running a command or script across hundreds of instances. In this post, I show how Systems Manager Automation can be used to automate the creation and patching of custom Amazon Linux AMIs for EMR.
Systems Manager Run Command lets you remotely manage the configuration of Amazon EC2 instances or on-premises machines. Run Command can be used to help you perform the following types of tasks on your EMR cluster nodes: install applications, restart daemons (HDFS, YARN, Presto, etc.), and make configuration changes. I also show how you can use Run Command to send commands to all nodes of a running EMR cluster.
Benefits of using a custom AMI
Although you can easily customize an EMR cluster using bootstrap actions, there can be benefits to using a custom AMI.
Reduction of cluster start time: There are certain scenarios where a bootstrap action may affect your cluster start time. For example, your bootstrap action could be doing something like downloading a large program over the internet and delaying the time for your cluster to be ready. Adding and installing the program directly in the AMI can reduce the time it takes to complete a cluster launch.
Prevent unexpected bootstrap action failures: There are also scenarios where installing and configuring custom software directly in the AMI reduces the risk of unexpected failures. For example, a mirror or repo used by your bootstrap action to download a program might be offline or inaccessible. This could cause your bootstrap action to fail, which could cause a cluster launch failure.
Support for Amazon EBS root volume encryption: A number of security and encryption features are available with EMR security configurations. This includes the ability to encrypt data at rest for HDFS (local volumes/Amazon EBS) and Amazon S3. However, certain regulatory/compliance policies may require that the root (boot) volume is also encrypted. By bringing your own Amazon Linux AMI, you can create AMIs that use encrypted EBS root volumes and use those AMIs for your EMR clusters.
Bring your own AMI requirements
Custom AMIs for EMR must meet the following requirements:
Must not have users with the same name as applications (example: hadoop, hdfs, yarn, or spark)
It is not necessary for you to own the custom AMI, but your service role must have launch permissions. In practice, the AMI must either be owned by your account or shared with your account by its owner.
For best practices and considerations for EMR custom AMIs, see Using a Custom AMI.
Walkthrough
For the examples in this post, I show how you can set up the following solutions:
Automate a workflow of creating custom AMIs with pre-installed software
Run commands or make application configuration changes on all nodes of a running EMR cluster
Before you begin
In this post, the AWS CLI is used to execute the examples and steps shown. However, having the AWS CLI installed is not a requirement and the AWS Management Console can be used to perform the same tasks.
The region used for the examples is us-east-1 (N. Virginia).
Building a custom AMI with Systems Manager Automation
In this section, I show how you can use Automation to create a custom AMI. The following diagram shows an overview of the actions that the Automation will perform:
1) Configure roles for Automation
Before getting started, you have to configure an IAM instance profile role and a service role that Automation can use. The instance profile role gives Automation permission to perform actions on your instances, such as executing commands or starting and stopping services. The service role (or assume role) gives Automation permissions to perform actions on your behalf.
Configuring the required IAM roles for Automation is usually one of the hardest parts of setting up Automation. Luckily, you only do this step one time. We also have an AWS CloudFormation template that can be used to create and configure the required roles for Automation. For more information, see Method 1: Using AWS CloudFormation to Configure Roles for Automation.
2) Create an Automation document
An Automation document defines the actions that Systems Manager performs. In this step, you create a custom Automation document (customEmrAmiDocument) that performs the following steps:
Launch an EC2 instance from a base Amazon Linux AMI
Update installed software on the instance
Run additional Linux commands (optional)
Shut down the instance
Create an AMI of the instance
Terminate the instance
To create a custom Automation document, first download the customEmrAmiDocument.json document to your local machine. You can then use the console, AWS CLI, or AWS SDKs to create (upload) that Automation document in your account. The following example shows how to create an Automation document called “customEmrAmiDocument” using the AWS CLI:
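A minimal sketch of that call, assuming the downloaded JSON file is in your current directory:
$ aws ssm create-document --name "customEmrAmiDocument" --document-type "Automation" --content file://customEmrAmiDocument.json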
Note: Creating an Automation document does not cause that document to be executed. You execute this document in the next step. Also note that file:// must be referenced followed by the path of the content file.
The “customEmrAmiDocument” Automation document created in the previous step has a list of parameters (SourceAmiId, InstanceIamRole, etc.), along with the description of each parameter. To describe the document parameters, run the following command:
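A sketch of that call, trimming the output down to the parameter list with a --query filter:
$ aws ssm describe-document --name "customEmrAmiDocument" --query "Document.Parameters"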
The preceding command returns an output similar to the following:
[
{
"Type": "String",
"Description": "(Required) The source Amazon Machine Image ID.",
"Name": "SourceAmiId"
},
{
"Type": "String",
"Description": "(Required) The name of the role that enables Systems Manager (SSM) to manage the instance.",
"DefaultValue": "ManagedInstanceProfile",
"Name": "InstanceIamRole"
},
…
3) Execute the Automation document
When you start an Automation execution, you must pass the required parameters (SourceAmiId) along with any additional parameters whose default values you want to overwrite. For example, if you used CloudFormation to create the required IAM roles, you do not need to specify the InstanceIamRole and AutomationAssumeRole parameters.
To execute the document without including the InstanceIamRole and AutomationAssumeRole parameters, run the following command:
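A sketch of that execution, passing only the required source AMI ID (the same AMI referenced later in this section):
$ aws ssm start-automation-execution --document-name "customEmrAmiDocument" --parameters "SourceAmiId=ami-4fffc834"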
If your role names or ARNs have different values than the defaults, make sure that you specify those parameters accordingly. For example, if your instance profile/role is called “MyManagedInstanceProfile” and the Automation service role ARN is “arn:aws:iam::012345678910:role/MyAutomationServiceRole”, then your parameters to execute the Automation should be similar to the following:
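In that case, the call might look like the following sketch, with the example role name and ARN substituted in:
$ aws ssm start-automation-execution --document-name "customEmrAmiDocument" \
    --parameters "SourceAmiId=ami-4fffc834,InstanceIamRole=MyManagedInstanceProfile,AutomationAssumeRole=arn:aws:iam::012345678910:role/MyAutomationServiceRole"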
To start an Automation execution that creates a custom Amazon Linux AMI with Python 3.5 and additional Python libraries (boto3) installed, use the following command:
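A sketch of one way to express that. The PostUpdateScript parameter name here is an assumption for illustration; check the parameter list returned by describe-document and use whichever optional commands parameter your copy of customEmrAmiDocument exposes:

$ aws ssm start-automation-execution \
    --document-name "customEmrAmiDocument" \
    --parameters '{
        "SourceAmiId": ["ami-4fffc834"],
        "PostUpdateScript": ["sudo yum install -y python35 python35-pip; sudo pip-3.5 install boto3"]
    }'
# PostUpdateScript is a placeholder parameter name; the yum package names shown are for Amazon Linux and may need adjusting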
I chose “ami-4fffc834” for the SourceAmiId parameter because it’s the latest Amazon Linux AMI in the us-east-1 (N. Virginia) region at the time of publication, and it meets all the requirements for EMR custom AMIs. If you’re running your Automation document in a different region, set the SourceAmiId parameter to an AMI that’s available in that region (for example, “ami-aa5ebdd2” for us-west-2).
4) Finding details about the Automation execution
After the Automation execution is complete, you can view the steps that were executed in addition to the status of each step and their output. To view all Automation executions that used the “customEmrAmiDocument” document, you can run the following command:
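For example (a sketch; the DocumentNamePrefix filter limits the results to executions of this particular document):

$ aws ssm describe-automation-executions \
    --filters "Key=DocumentNamePrefix,Values=customEmrAmiDocument"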
The output of the preceding command contains details about each step executed by the Automation execution. To easily find the ID (ImageId) of the AMI created during the Automation createImage step, run the following command:
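One way to do that (a sketch; the execution ID placeholder is the AutomationExecutionId shown in the output of the previous command):

$ aws ssm get-automation-execution \
    --automation-execution-id <your-automation-execution-id> \
    --query "AutomationExecution.Outputs"
# the AMI ID typically appears in the Outputs map under a key such as createImage.ImageId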
If the Automation execution fails or stops before reaching the final instance-termination step, you might need to stop instances manually or disable services that were started during the Automation execution. See the Automation CLI walkthrough and the troubleshooting Systems Manager Automation guide for more information.
5) Launch an EMR cluster with a custom AMI
After completing the preceding steps, you should now have a custom Amazon Linux AMI that can be used for EMR. For more information, see Using a Custom AMI.
You can use the following command to launch an EMR cluster with your custom AMI via the AWS CLI:
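A sketch of such a call; the AMI ID and key pair are placeholders, and custom AMIs require EMR release 5.7.0 or later:

$ aws emr create-cluster \
    --name "emr-custom-ami-cluster" \
    --release-label emr-5.9.0 \
    --custom-ami-id <your-custom-ami-id> \
    --use-default-roles \
    --instance-count 3 \
    --instance-type m4.large \
    --applications Name=Hadoop Name=Spark \
    --ec2-attributes KeyName=<your-key-pair>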
For information about how to find the AMI ID of the custom AMI created by Automation, see step 4.
Using Run Command with EMR
In this section, I show how you can use Run Command to send commands to the nodes of a running EMR cluster. The following diagram shows an overview of a Run Command execution:
1) Configure the instance IAM role for Systems Manager
EC2 instances (EMR cluster nodes) need an IAM role to be able to communicate with the Systems Manager API. Because EMR already assigns an IAM role (usually called EMR_EC2_DefaultRole) to each cluster node, you can attach an additional managed policy (Systems Manager policy) to that role.
The following command attaches the “AmazonEC2RoleforSSM” managed policy to the EMR_EC2_DefaultRole role:
$ aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM --role-name EMR_EC2_DefaultRole
If you’re not using the default EC2 role, replace the --role-name parameter value with the name of the role that you’re using.
2) Install the SSM agent on the cluster nodes
The Systems Manager (SSM) agent processes Systems Manager requests and configures your instances as specified in each request. For more information, see Installing SSM Agent on Linux.
Skip this step if your custom AMI was created by Automation: the customEmrAmiDocument Automation document used to create the custom AMI installs the SSM agent by default.
3) Running a command with Run Command
You should now be able to run commands or Linux scripts on the instances that have the SSM agent running and the IAM role for SSM configured (Step 1 in this section). To view a list of instances that are ready to receive commands, run the following command:
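For example (a sketch; the --query filter just narrows the output to a few useful fields):

$ aws ssm describe-instance-information \
    --query "InstanceInformationList[*].{Instance:InstanceId,PingStatus:PingStatus,Platform:PlatformName}"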
The easiest way to send a command to all cluster nodes is by using a resource tag as the target for Run Command. If you didn’t add any tags to your EMR cluster during launch, you can add tags using the following command:
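A sketch of that call; the cluster ID is a placeholder for your cluster's j-XXXXXXXXXXXXX identifier:

$ aws emr add-tags \
    --resource-id <your-cluster-id> \
    --tags "environment=emr-ssm"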
The preceding command adds a tag to an EMR cluster. The key of the tag is “environment” and the value is “emr-ssm”. You can now send a command using the tags as the target:
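A sketch of that call, using the AWS-RunShellScript document to run the two commands described in the next step:

$ aws ssm send-command \
    --targets "Key=tag:environment,Values=emr-ssm" \
    --document-name "AWS-RunShellScript" \
    --parameters '{"commands":["hostname -f","python3 -V"]}' \
    --comment "check hostname and Python 3 version on all EMR nodes"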
The preceding command is sent to (and executed on) all EC2 instances that have the tag environment=emr-ssm.
4) Finding details on a Run Command execution
The send-command call in the previous step runs two commands on each targeted instance: one that prints the instance's fully qualified hostname (hostname -f) and one that prints its Python 3 version (python3 -V).
The send-command call also returns a "CommandId" field in its output. You can use that command ID to see which instances the command was sent to and to view the status of the command execution:
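For example (a sketch; the command ID placeholder is the CommandId value returned by send-command):

$ aws ssm list-command-invocations \
    --command-id <your-command-id>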
You can also view the output of the commands (‘hostname -f’ and ‘python3 -V’ for our example) that were executed by Run Command in a specific EC2 instance:
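One way to do that (a sketch; the placeholders come from the earlier send-command and describe-instance-information output, and --details includes the command output for each plugin):

$ aws ssm list-command-invocations \
    --command-id <your-command-id> \
    --instance-id <your-instance-id> \
    --details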
This post showed you some of the benefits of using custom AMIs for Amazon EMR and how you can use Automation to automate the management and creation of custom AMIs. I also showed how Run Command can be used to send commands and make configuration changes on all nodes of a running EMR cluster.
If you have questions or suggestions, please comment below.
Bruno Faria is an EMR Solution Architect with AWS. He works with our customers to provide them architectural guidance for running complex applications on Amazon EMR. In his spare time, he enjoys spending time with his family and learning about new big data solutions.
As a school supply aficionado, the month of September has always held a special place in my heart. Nothing sets the tone for success like getting a killer deal on pens and a crisp college ruled notebook. Even if back-to-school shopping trips have secured a seat in your distant memory, this is still a perfect time of year to stock up on office supplies and set aside some time for flexing those learning muscles. A great way to get started: scan through our September Tech Talks and check out the ones that pique your interest. This month we are covering re:Invent, AI, and much more.
September 2017 – Schedule
Noted below are the upcoming scheduled live, online technical sessions being held during the month of September. Make sure to register ahead of time so you won’t miss out on these free talks conducted by AWS subject matter experts.
The AWS Online Tech Talks series covers a broad range of topics at varying technical levels. These sessions feature live demonstrations & customer examples led by AWS engineers and Solution Architects. Check out the AWS YouTube channel for more on-demand webinars on AWS technologies.
– Sara