Computers can easily process vast amounts of data in raw formats such as databases or binary files, but humans require visualizations to derive facts from that data. The plethora of tools and services for designing visualizations from a data source, such as Kibana (as part of Amazon ES) or Amazon QuickSight, is a testament to this need.
Such tools often provide out-of-the-box templates for designing simple graphs from appropriately pre-processed data, but applying these to production-grade, complex visualizations can be challenging for several reasons:
The raw data upon which a visualization is built may contain encoded attributes that aren’t understandable to the viewer. For example, the following layered bar chart could be built on raw data about people in which gender is encoded as a number, but the visualization should still display human-readable attribute values.
Visualizations are often built on aggregated summaries of raw data. Storing and maintaining different aggregations for each visualization may be unfeasible. For instance, if you classify raw data items in multiple dimensions (for example, classifying cars by color or engine size) and build visualizations on these different dimensions, you need to store different aggregated materialized views of the same data.
Different data views used in a single visualization may require an ad hoc computation over the underlying data to generate an appropriate foundational data source.
This post shows how to implement Vega visualizations included in Kibana, which is part of Amazon Elasticsearch Service (Amazon ES), using a real-world clickstream data sample. Vega visualizations are an integrated scripting mechanism of Kibana for performing on-the-fly computations on raw data to generate D3.js visualizations. For this post, we use a fully automated setup based on AWS CloudFormation to show how to build a customized histogram for a web analytics use case. This example implements an ad hoc, map-reduce-like aggregation of the underlying data for a histogram.
Use case
For this post, we use Online Shopping Store – Web Server Logs published by Harvard Dataverse. The 3.3 GB dataset contains 10,365,152 access logs of an online shopping store. For example, see the following data sample:
Each entry contains a source IP address (for the preceding data sample, 207.46.13.136), the HTTP request and return code (“GET /product/14926 HTTP/1.1” 404), the size of the response in bytes (33617), and additional metadata, such as a timestamp. For this use case, we assume the role of a web server administrator who wants to visualize the response message size in bytes for the traffic over a specific time period. We want to generate a histogram of the response message sizes over a given period that looks like the following screenshot.
In the preceding screenshot, the majority of the response sizes are less than 1 MB. Based on this visualization, the administrator of the online shop could identify the following:
Requests that result in high bandwidth use
Distributed denial of service (DDoS) attacks caused by repeatedly requesting pages, which causes a high amount of response traffic
The built-in Vertical / Horizontal Bar visualization options in Kibana can’t produce this histogram over this Elasticsearch index without storing the raw data. In addition, the visualization must be able to perform complex transformations and aggregations over the data to generate the preceding histogram. Storing the data in a histogram-friendly way in Amazon ES and building a visualization with Vertical / Horizontal Bar components would require an ETL (extract, transform, load) of the data into a different index. Such an ETL creates unnecessary storage and compute costs. To avoid these costs and complex workflows, Vega provides a flexible approach that can execute such transformations on the fly on the server log data.
To keep the code presented in this post concise and readable, we built some of the necessary transformations outside of our Vega code; the complete transformation could also have been done in Vega.
Overview of solution
To avoid unnecessary costs and focus on the Vega visualization creation task in this post, we use an AWS Lambda function to stream the access logs out of an Amazon Simple Storage Service (Amazon S3) bucket instead of serving them from a web server (this Lambda function contains some of the transformations that we moved out of the Vega code to improve the readability of this post). In a production deployment, you would replace the function and S3 bucket with a real web server and the Amazon Kinesis Agent, but the rest of the architecture remains unchanged. AWS CloudFormation deploys the following architecture into your AWS account.
The Lambda function reads and transforms the access logs out of an S3 bucket and streams them into Amazon Kinesis Data Firehose, which forwards the data to Amazon ES. Kinesis Data Firehose is the easiest way to reliably load streaming data into data lakes, data stores, and analytics tools, which makes it suitable for this task. The available data is automatically transferred to Amazon ES after the deployment.
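The following is a minimal Python sketch of what such a Lambda function could look like. It is not the function deployed by the CloudFormation template; the bucket, key, and delivery stream names are assumptions for illustration only:

import boto3

s3 = boto3.client("s3")
firehose = boto3.client("firehose")

DELIVERY_STREAM = "vega-visu-blog-stream"  # assumed delivery stream name

def handler(event, context):
    # read one access-log object and forward its lines to Kinesis Data Firehose
    body = s3.get_object(Bucket="access-log-bucket", Key="access.log")["Body"]
    batch = []
    for line in body.iter_lines():
        # apply any per-line transformation here before forwarding
        batch.append({"Data": line + b"\n"})
        if len(batch) == 500:  # PutRecordBatch accepts at most 500 records per call
            firehose.put_record_batch(DeliveryStreamName=DELIVERY_STREAM, Records=batch)
            batch = []
    if batch:
        firehose.put_record_batch(DeliveryStreamName=DELIVERY_STREAM, Records=batch)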
Prerequisites and deploying your CloudFormation template
To follow the content of this post, including a step-by-step walkthrough to build a customized histogram with Vega visualizations on Amazon ES, you need an AWS account and access rights to deploy a CloudFormation template. The Elasticsearch cluster and other resources the template creates result in charges to your AWS account. During a test-run in eu-west-1, we measured costs of $0.50 per hour. Complete the following steps:
Depending on which Region you want to use, sign in to the AWS Management Console and choose one of the following Regions to deploy the necessary resources in your account:
Select the check boxes I acknowledge that AWS CloudFormation might create IAM resources with custom names and I acknowledge that AWS CloudFormation might require the following capability: CAPABILITY_AUTO_EXPAND.
Choose Create Stack to start the deployment.
The deployment takes approximately 20–25 minutes and is finished when the status switches to CREATE_COMPLETE.
Connecting to Kibana
After you deploy the stack, complete the following steps to log in to Kibana and build the visualization inside Kibana:
On the Outputs tab of your CloudFormation stack, choose the URL for KibanaLoginURL.
This opens the Kibana login prompt.
Unless you changed these at the start of the CloudFormation stack deployment process, the username and password are as follows:
Amazon Cognito requires you to change the password, which Kibana prompts you for.
Enter a new password and a name for yourself.
Remember this password; there is no recovery option in this setup if you need to re-authenticate.
Upon completing these steps, you should arrive at the Kibana home screen (see the following screenshot).
You’re ready to build a Vega visualization in Kibana. We guide you through this process in the following sections.
Creating an index pattern and exploring the data
Kibana visualizations are based on data stored in your Elasticsearch cluster, and the data is stored in an Elasticsearch index called vega-visu-blog-index. To create an index pattern, complete the following steps:
On the Kibana home screen, choose Discover.
For Index Pattern, enter “vega*”.
Choose Next step.
For Time Filter field name, choose timestamp.
Choose Create index pattern.
After a few seconds, a summary page with details about the created index pattern appears.
The index pattern is now created and you can use it for queries and visualizations. When you choose Discover, you see a page that shows your index pattern. By default, Kibana displays data of the last 15 minutes, but given that the data is historical, you have to properly select the time range. For this use case, the data spans January 1, 2019 to February 1, 2019. After you configure the time range, choose Update.
In addition to a graphical representation of the tuple count over time, Kibana shows the raw data in the following format:
size_in_bytes: { "size": 13000, "freq": 11 }, …
The following screenshot shows both the visualization and raw data.
As noted before, we executed a pre-aggregation step on the data, which counted the number of requests in the log file with a given size. The message sizes were truncated into steps of 100 bytes. Therefore, the preceding screenshot indicates that there were 11 requests with 13,000–13,099 bytes in size during the minute starting at 13:59 on January 26, 2019. The next section shows how to create the aggregated histogram from this data using a Vega visualization that applies a series of data transformations implicitly and on the fly, which Amazon ES runs without changing the underlying data stored in the index.
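For reference, the following small Python sketch illustrates this pre-aggregation logic (our assumption of how it works, not the exact code used by the Lambda function): response sizes are truncated to 100-byte buckets, and the requests per bucket are counted for each minute of traffic:

from collections import Counter

def bucket_sizes(sizes_for_one_minute):
    # truncate each response size to its 100-byte bucket and count occurrences
    counts = Counter((size // 100) * 100 for size in sizes_for_one_minute)
    # one entry per bucket, matching the size_in_bytes documents shown above
    return [{"size": size, "freq": freq} for size, freq in sorted(counts.items())]

print(bucket_sizes([13000, 13050, 13099, 214]))
# [{'size': 200, 'freq': 1}, {'size': 13000, 'freq': 3}]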
Vega on-the-fly data transformation
To compute the desired histogram, you need to transform the data, because the format and temporal granularity of the data stored in the Elasticsearch index don’t match the format required by the visualization. For this use case, we use the scripted metric aggregation pattern, which calls stored scripts written in Painless to implement the following four steps (the sketch after the list shows how these steps plug into a single aggregation):
Variable initialization
Mapping documents to a key-value structure for the histogram
Grouping values into arrays by keys
Aggregation of array content into the histogram
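Elasticsearch ties these four steps together through a scripted_metric aggregation that references the stored scripts by ID. The following Python snippet sketches what such an aggregation body could look like; the initialize_variables script ID and the field names passed as params are assumptions based on the scripts and sample documents shown in this post:

import json

# search body with a scripted_metric aggregation referencing the stored Painless scripts
histogram_aggregation = {
    "size": 0,
    "aggs": {
        "histogram_buckets": {
            "scripted_metric": {
                "init_script": {"id": "initialize_variables"},        # assumed script ID
                "map_script": {"id": "map_documents"},
                "combine_script": {"id": "combine_documents"},
                "reduce_script": {"id": "aggregate_histogram_buckets"},
                "params": {
                    "bucketField": "size_in_bytes.size",               # assumed field name
                    "countField": "size_in_bytes.freq",                # assumed field name
                },
            }
        }
    },
}
print(json.dumps(histogram_aggregation, indent=2))

In the Vega spec created later in this post, a body like this goes into the data url section so that Kibana issues it as the Elasticsearch query.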
To implement the first step for initializing the variables used by the other steps, choose Dev Tools on the left side of the screen. The development console allows you to define scripts that are later called by the Vega visualization. The following code for the first step is very simple because it initializes an empty array variable “state.test”:
Enter the preceding code into the left-hand input section of the Kibana Dev Tools page, select the code, and choose the green triangular button. Kibana acknowledges the loading of the script in the output (see the following screenshot).
To improve the readability of the rest of this section, we will show the result of each step based on the following initial input JSON:
The next step of the transformation maps the request size in each document of the index to its count of occurrences as a key-value pair; for example, with the preceding example data, the first size_in_bytes field is transformed into "100": 369. As in the previous step, implement this part of the transformation by entering the following code into the Kibana Dev Tools page and choosing the green button to load the script.
#map_script
POST _scripts/map_documents
{
  "script": {
    "lang": "painless",
    "source": """
      if (doc.containsKey(params.bucketField))
        for (int i = 0; i < doc[params.bucketField].length; i++)
        {
          // map each size bucket of this document to its request count
          def key = doc[params.bucketField][i];
          def value = doc[params.countField][i];
          state.test[(key).toString()] = value;
        }
    """
  }
}
Applying this step to our example input data produces the following output:
As shown above, the script’s outputs are JSON documents. Because the desired histogram aggregates data over all documents, you need to combine or merge the data by key in the next step. Enter the following code:
With the example intermediate output from the previous step, the following intermediate output is computed:
{
  "100": [369, 386],
  "200": [62, 60], …
}
Finally, you can aggregate the count value arrays returned by the combine_documents script into one scalar value per message size, as implemented by the aggregate_histogram_buckets script. See the following code:
This concludes the implementation of the on-the-fly data transformation used for the Vega visualization of this post. The documents are transformed into buckets with two fields, messagesize and messagecount, which contain the corresponding data for the histogram.
Creating a Vega-Lite visualization
The Vega visualization generates D3.js representations of the data using the on-the-fly transformation discussed earlier. You can create this histogram visualization using Vega or Vega-Lite visualization grammars, which are both supported by Kibana. Vega-Lite is sufficient to implement our use case. To make use of the transformation, create a new visualization in Kibana and choose Vega.
A Vega visualization is created by a JSON document that describes the content and transformations required to generate the visual output. In our use case, we use the vega-visu-blog-index index and the four transformation steps in a scripted metric aggregation operation to generate the data property, which provides the content to visualize as a histogram. The second part of the JSON document specifies the graph type, axis labels, and binning to format the visualization as required for our use case. The full Vega-Lite visualization JSON is as follows:
Replace all text in the code pane of the Vega visualization designer with the preceding code and choose Apply changes. Kibana computes the visualization by calling the four stored scripts (mentioned in the previous section) for on-the-fly data transformations and displays the desired histogram. Afterwards, you can use the visualization just like the other Kibana visualizations to create Kibana dashboards.
To use the visualization in dashboards, save it by choosing Save. Changing the parameters of the visualization automatically results in a re-computation, including the data transformation. You can test this by changing the time period.
Exploring and debugging scripted metric aggregation
The debugger included in most modern browsers allows you to view and test the transformation result of the scripted metric aggregation. To view the transformation result, open the developer tools in your browser, choose Console, and enter the following code:
VEGA_DEBUG.view.data('table')
The Vega Debugger shows the data as a tree, which you can easily explore.
The developer tools are useful when writing transformation scripts to test the functionality of the scripts and manually explore their output.
Cleaning up
To delete all resources and stop incurring costs to your AWS account, complete the following steps:
On the AWS CloudFormation console, from the list of deployed stacks, choose vega-visu-blog.
Choose Delete.
The process can take up to 15 minutes, and removes all the resources you deployed for following this post.
Conclusion
Although Kibana itself provides powerful built-in visualization methods, these approaches require the underlying data to have the right format. This creates issues when the same data is used for different visualizations or is simply not available in an ideal format for your visualization. Because storing different aggregations or views of the same data isn’t a cost-effective approach, this post showed how to generate customized visualizations using Amazon ES, Kibana, and Vega visualizations with on-the-fly data transformations.
About the authors
Markus Bestehorn is a Principal Prototyping Engagement Manager at AWS. He is responsible for building business-critical prototypes with AWS customers, and is a specialist for IoT and machine learning. His “career” started as a 7-year-old when he got his hands on a computer with two 5.25” floppy disks, no hard disk, and no mouse, on which he started writing BASIC, and later C, as well as C++ programs. He holds a PhD in computer science and all currently available AWS certifications. When he’s not on the computer, he runs or climbs mountains.
Anil Sener is a Data Prototyping Architect at AWS. He builds prototypes on Big Data Analytics, Streaming, and Machine Learning, which accelerates the production journey on AWS for top EMEA customers. He has two Master Degrees in MIS and Data Science. He likes to read about history and philosophy in his free time.
We’re now expanding the Kinesis Data Firehose delivery destinations to include generic HTTP endpoints. This enables you to use a fully managed delivery service to HTTP endpoints without building custom applications or worrying about operating and managing the data delivery infrastructure. Additionally, the HTTP endpoint enhancement opens up a number of key integration opportunities between Kinesis Data Firehose and other AWS services, such as Amazon DynamoDB, Amazon SNS, or Amazon RDS, using Amazon API Gateway’s AWS Service integrations.
All the existing Kinesis Data Firehose features are fully supported, including AWS Lambda service integration, retry option, data protection on delivery failure, and cross-account and cross-Region data delivery. Traffic between Kinesis Data Firehose and the HTTP endpoint is encrypted in transit using HTTPS. Kinesis Data Firehose incorporates error handling, automatic scaling, transformation, conversion, aggregation, and compression functionality to help you accelerate the deployment of data streams across your organization. There is no additional cost to use this feature.
In this post, we walk through setting up a Kinesis Data Firehose HTTP endpoint. Our example use case ingests data into a Firehose delivery stream and sends it to an Amazon API Gateway REST API endpoint that loads the data to a DynamoDB table using the DynamoDB AWS Service integration.
Configuring a delivery stream to an HTTP endpoint
To set this up, you provide the URL for your HTTP endpoint application or service, add an optional access key as a header to the HTTP calls made to your HTTP endpoint, and include optional key-value parameters. You can configure the endpoint on the AWS Management Console, the AWS Command Line Interface (AWS CLI), or the AWS SDK.
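If you prefer to automate this setup with the AWS SDK instead of the console, the following boto3 sketch shows roughly what the call could look like; the endpoint URL, role and bucket ARNs, and attribute values are assumptions, and the console walkthrough that follows achieves the same result:

import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="http-endpoint-stream",
    DeliveryStreamType="DirectPut",
    HttpEndpointDestinationConfiguration={
        "EndpointConfiguration": {
            # assumed API Gateway endpoint acting as the HTTP destination
            "Url": "https://abc123.execute-api.us-east-2.amazonaws.com/prod/firehose",
            "Name": "api-gateway-dynamodb",
        },
        "RequestConfiguration": {
            "CommonAttributes": [
                {"AttributeName": "TableName", "AttributeValue": "Table1"},
                {"AttributeName": "Region", "AttributeValue": "us-east-2"},
            ]
        },
        "RetryOptions": {"DurationInSeconds": 300},
        "S3BackupMode": "FailedDataOnly",
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
            "BucketARN": "arn:aws:s3:::http-endpoint-backup-bucket",
        },
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
    },
)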
On the console, under Analytics, choose Kinesis.
Choose Create delivery stream.
For Delivery stream name, enter a name.
For Choose a source, select Direct PUT or other sources; this uses the Kinesis Data Firehose PutRecord API as the source.
Leave the rest of the settings at their default and choose Next.
Leave all settings at their default in Step 2: Process records and choose Next.
You can optionally choose to encode and compress your request body before posting it to your HTTP endpoint.
For Parameters, you can pass optional key-value parameters as needed.
In this post, we pass two key-value pairs that serve as inputs to an Amazon API Gateway integration with our Amazon DynamoDB API endpoint.
TableName as the Key and Table1 as the Value.
Region as the Key and us-east-2 as the Value.
For Retry duration, leave at its default of 300 seconds.
For S3 backup mode, select Failed data only.
For S3 bucket, enter an S3 bucket as a backup for the delivery stream to store data that failed delivery to the HTTP API endpoint.
Choose Next.
Leave everything at its default on the Configure settings page and choose Next.
Review your settings and choose Create delivery stream.
When the delivery stream is active, your source can start streaming data to it.
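For example, a producer could push records with the AWS SDK. The following sketch assumes the delivery stream name used earlier and an arbitrary sample payload:

import json
import boto3

firehose = boto3.client("firehose")

# send one JSON record to the delivery stream (name and payload are assumptions)
payload = {"ticker_symbol": "AMZN", "sector": "TECHNOLOGY", "price": 1902.83}
firehose.put_record(
    DeliveryStreamName="http-endpoint-stream",
    Record={"Data": (json.dumps(payload) + "\n").encode("utf-8")},
)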
Testing the delivery stream
For this post, we use the Test with demo data feature available in Kinesis Data Firehose to stream sample data to the newly created delivery stream.
On the Kinesis Data Firehose console, choose the delivery stream you just created.
Choose Test with demo data.
The delivery stream delivers the demo data to the API Gateway REST API that is configured as the HTTP endpoint. The integration endpoint reads the key-value header attributes to determine the DynamoDB table name and Region, and loads the payload into the specified table.
Monitoring the delivery stream
You can view Amazon CloudWatch metrics on the Monitoring tab. Pertinent metrics to observe are Delivery to HTTP Endpoint data freshness and HTTP Endpoint delivery success.
Conclusion
This post demonstrated how to create a delivery stream to an HTTP endpoint, which eliminates the need to develop custom applications or manage the corresponding infrastructure. Kinesis Data Firehose provides a fully managed service that helps you reduce complexities, so you can expand and accelerate the use of data streams throughout your organization.
About the Authors
Imtiaz (Taz) Sayed is the World Wide Tech Leader for Data Analytics at AWS. He is an ardent data engineer and relishes connecting with the data-analytics community. He likes roller-coasters, good heist movies, and is undecided between “The Godfather” and “The Shawshank Redemption” as the greatest movie of all time.
Masudur Rahaman Sayem is a Specialist Solution Architect for Analytics at AWS. He is passionate about distributed systems. He also likes to read, especially the classic comic books.
This post is courtesy of Rajdeep Tarat, Solutions Architect, and Venugopal Pai, Solutions Architect.
Customers across industry verticals collect, analyze, and derive insights from end-user application analytics using solutions such as Google Analytics and MixPanel. While these solutions provide built-in dashboards for marketing analytics, it can be difficult to reuse the raw event data.
Setting up a pipeline to move the raw event data into AWS opens up possibilities for various rule-based, statistical, and machine learning algorithms to derive deep insights about end-user behavior. Additionally, the raw event data can be enriched with other transactional data points available within the customer’s AWS environment.
This post uses the Segment Partner integration in Amazon EventBridge to pipe the data into your AWS environment. Segment allows you to collect, unify, and connect end-user application analytics into AWS using Amazon EventBridge as a destination.
Segment already supports direct, optimized connections to many AWS services such as Amazon Redshift, Amazon Personalize, Amazon Kinesis, Amazon Kinesis Data Firehose, AWS Lambda, and Amazon S3. The EventBridge destination is a good choice for customers who want the flexibility and centralization that EventBridge offers.
EventBridge makes it easy to build scalable event-driven application pipelines by handling event ingestion, delivery, security, authorization, and error handling for you. The architecture of this pipeline is shown below:
In the diagram, end-user applications send the data into Segment, which is routed to each of the configured destinations (for example, EventBridge). Once the data reaches EventBridge, it is again routed to multiple targets. With this approach you can continue using existing solutions supported by Segment in parallel to the Amazon EventBridge destination.
This architecture makes the pipeline highly extensible and modular. Firstly, you can configure multiple Segment destinations to fan out the event data into other existing solutions in parallel to EventBridge. Marketing teams can continue to use their existing tools without any disruptions while the data is seamlessly pumped into AWS. Within the AWS Cloud, EventBridge can again route the event data to up to five targets per rule.
The following section provides a walkthrough of setting up the Segment integration with EventBridge, and configuring two targets within the AWS Cloud.
The first target uses an Amazon Kinesis Data Firehose to deliver the data to an S3 bucket. From the S3 bucket, multiple AWS services can use the data (learn more about using S3 as a data lake).
The second target posts the event data to an SNS topic. From here, the data can be consumed by subscribers for the topic.
Walkthrough
To set up the pipeline, you must configure the Segment partner integration in EventBridge, and then set up the targets where analytics data is sent.
Amazon EventBridge – Segment partner integration:
From the Amazon EventBridge console, navigate to the Partner Event Sources > Segment setup page. Copy your AWS Account ID from here.
On the Segment destination setup page, use the Amazon EventBridge integration. Enter the AWS Account ID and select a Region (learn more about setting up a Segment destination).
Create the event bus:
After linking the Segment destination with the AWS Account ID, send a test event from the Event Tester in the Segment dashboard. This creates a Partner Event Source and an event bus with the same name.
After the first test event is fired, the Partner Event Source and the corresponding event bus is created in the EventBridge console.
Create rules:
A rule watches for incoming events and routes them to specific targets that are configured. Start by creating a new rule and entering a name.
For Event Pattern, select the Predefined pattern by Service, and select Service Partners > Segment.
Under Select event bus, select the Custom or partner event bus, and the name of the event bus created.
Configuring multiple targets for the event bus:
For Kinesis streams, select Kinesis stream from the Target dropdown, and the name of the stream. For more details on creating a Kinesis data stream, read this documentation.
For SNS topic, choose Add Target and repeat the same steps to add an SNS topic instead. For more details on creating an SNS topic, read this documentation.
You can optionally tag the resource, then choose Create.
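If you want to script the rule and targets instead of using the console, a boto3 sketch might look like the following; the event bus name, rule name, and target ARNs are placeholders, and the prefix-based event pattern is our assumption of what the console generates for the Segment service partner:

import json
import boto3

events = boto3.client("events")

EVENT_BUS = "aws.partner/segment.com/abc123"  # placeholder partner event bus name

events.put_rule(
    Name="segment-analytics-rule",
    EventBusName=EVENT_BUS,
    EventPattern=json.dumps({"source": [{"prefix": "aws.partner/segment.com"}]}),
    State="ENABLED",
)
events.put_targets(
    Rule="segment-analytics-rule",
    EventBusName=EVENT_BUS,
    Targets=[
        {   # Kinesis data stream target (needs a role that allows kinesis:PutRecord)
            "Id": "analytics-stream",
            "Arn": "arn:aws:kinesis:us-east-1:123456789012:stream/segment-events",
            "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-to-kinesis",
        },
        {   # SNS topic target
            "Id": "analytics-topic",
            "Arn": "arn:aws:sns:us-east-1:123456789012:segment-events",
        },
    ],
)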
This post shows how customers can capture end-user application analytics using the partner solution Segment in real time, and ingest data into Amazon EventBridge. The data routing is made extensible using multiple Segment destinations (for third-party solutions), and using multiple rules in EventBridge (for multiple destinations within the AWS Cloud).
This is a guest post by Arvind Godbole, Lead Software Engineer with FactSet and Tarik Makota, AWS Principal Solutions Architect. In their own words “FactSet creates flexible, open data and software solutions for tens of thousands of investment professionals around the world, which provides instant access to financial data and analytics that investors use to make crucial decisions. At FactSet, we are always working to improve the value that our products provide.”
One area that we’ve been looking into is the relevancy of search results for our clients. Given the wide variety of client use cases and the large number of searches per day, we needed a platform to store anonymized usage data and allow us to analyze that data to boost results using our custom scoring algorithm. Amazon EMR was the obvious choice to host the calculations, but the question arose on how to get our anonymized data into a form that Amazon EMR could use. We worked with AWS and chose to use Amazon DynamoDB to prepare the data for usage in Amazon EMR.
This post walks you through how FactSet takes data from a DynamoDB table and converts that data into Apache Parquet. We store the Parquet files in Amazon S3 to enable near real-time analysis with Amazon EMR. Along the way, we encountered challenges related to data type conversion; we explain these challenges and show how we overcame them.
Workflow overview
Our workflow contained the following steps:
Anonymized log data is stored into DynamoDB tables. These entries have different fields, depending on how the logs were generated. Whenever we create items in the tables, we use DynamoDB Streams to write out a record. The stream records contain information from a single item in a DynamoDB table.
An AWS Lambda function is hooked into the DynamoDB stream to capture the new items stored in a DynamoDB table. We built our Lambda function off of the lambda-streams-to-firehose project on GitHub to convert the DynamoDB stream image to JSON, which we stringify and push to Amazon Kinesis Data Firehose.
Kinesis Data Firehose transforms the JSON data into Parquet using data contained within an AWS Glue Data Catalog table.
Kinesis Data Firehose stores the Parquet files in S3.
An AWS Glue crawler discovers the schema of DynamoDB items and stores the associated metadata into the Data Catalog.
The following diagram illustrates this workflow.
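The following is a minimal Python sketch of the Lambda function in step 2 (the lambda-streams-to-firehose project we built on is Node.js, so this is an illustration rather than our production code); the delivery stream name is an assumption:

import json
import boto3
from boto3.dynamodb.types import TypeDeserializer

firehose = boto3.client("firehose")
deserializer = TypeDeserializer()

def _default(obj):
    # DynamoDB sets and Decimals are not JSON-serializable by default
    return list(obj) if isinstance(obj, set) else str(obj)

def lambda_handler(event, context):
    records = []
    for record in event.get("Records", []):
        image = record["dynamodb"].get("NewImage")
        if not image:
            continue  # ignore records without a new image (for example, deletes)
        # convert the DynamoDB stream image into a plain JSON document
        item = {k: deserializer.deserialize(v) for k, v in image.items()}
        records.append({"Data": (json.dumps(item, default=_default) + "\n").encode("utf-8")})
    if records:
        firehose.put_record_batch(DeliveryStreamName="ddb-to-parquet", Records=records)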
AWS Glue provides tools to help with data preparation and analysis. A crawler can run on a DynamoDB table to take inventory of the table data and store that information in a Data Catalog. Other services can use the Data Catalog as an index to the location, schema, and types of the table data. There are other ways to add metadata into a Data Catalog, but the key idea is that you can update and modify the metadata easily. For more information, see Populating the AWS Glue Data Catalog.
Problem: Data type disparities
Using a variety of technologies to build a solution often requires mapping and converting data types between these technologies. The cloud is no exception. In our case, log items stored in DynamoDB contained attributes of type String Set. String Set values caused data conversion exceptions when Kinesis tried to transform the data to Parquet. After investigating the problem, we found the following:
As the crawler indexes the DynamoDB table, Set data types (StringSet, NumberSet) are stored in the Glue metadata catalog as set<string> and set<bigint>.
Kinesis Data Firehose uses that same catalog when it performs the conversion to Apache Parquet. The conversion requires valid Hive data types.
set<string> and set<bigint> are not valid Hive data types, so the conversion fails, and an exception is generated. The exception looks similar to the following code:
[{
"lastErrorCode": "DataFormatConversion.InvalidSchema",
"lastErrorMessage": "The schema is invalid. Error parsing the schema: Error: type expected at the position 38 of 'array,used:bigint>>' but 'set' is found."
}]
Solution: Construct data mapping
While working with the AWS team, we confirmed that the Kinesis Data Firehose converter needs valid Hive data types in the Data Catalog to succeed. When it comes to complex data types, Hive doesn’t support set<data_type>, but it does support complex types such as array<data_type>, map<primitive_type, data_type>, and struct<col_name : data_type, …>.
In our case, this meant that we must convert set<string> and set<bigint> into array<string> and array<bigint>. Our first step was to manually change the types directly in the Data Catalog. After we updated the Data Catalog to change all occurrences of set<data_type> to array<data_type>, the Kinesis transformation to Parquet completed successfully.
Our business case calls for a data store that can store items with different attributes in the same table and the addition of new attributes on-the-fly. We took advantage of DynamoDB’s schema-less nature and ability to scale up and down on-demand so we could focus on our functionality and not the management of the underlying infrastructure. For more information, see Should Your DynamoDB Table Be Normalized or Denormalized?
If our data had a static schema, a manual change would be good enough. Given our business case, a manual solution wasn’t going to scale. Every time we introduced new attributes to the DynamoDB table, we needed to run the crawler, which re-created the metadata and overwrote the change.
Serverless event architecture
To automate the data type updates to the Data Catalog, we used Amazon EventBridge and Lambda to implement the modifications to the data type mapping. EventBridge is a serverless event bus that connects applications using events. An event is a signal that a system’s state has changed, such as the status of a Data Catalog table.
The following diagram shows the previous workflow with the new architecture.
The crawler stays as-is and crawls the DynamoDB table to obtain the metadata.
The metadata obtained by the crawler is stored in the Data Catalog. Previous metadata is updated or removed, and changes (manual or automated) are overwritten.
The GlueTableChanged rule in EventBridge listens for any changes to the Data Catalog tables. When we receive the event that a table has changed, we trigger the Lambda function.
The Lambda function uses the AWS SDK to update the Glue Data Catalog table, calling the glue.update_table() API to replace occurrences of set<data_type> with array<data_type>.
To set up EventBridge, we set Event pattern to Pre-defined pattern by service. For Service provider, we selected AWS, and for Service name, we selected Glue. For Event type, we selected Glue Data Catalog Table State Change. The following screenshot shows the EventBridge configuration that sends events to the Lambda function that updates the Data Catalog.
The following is the baseline Lambda code:
# This is NOT production-worthy code; modify it and implement error-handling routines as appropriate
import json
import logging
import boto3

glue = boto3.client('glue')

logger = logging.getLogger()
logger.setLevel(logging.INFO)


# Check the table's columns for set<> types and replace them with array<>
def table_contains_set(databaseName, tableName):
    # returns Glue Catalog description for Table structure
    response = glue.get_table(DatabaseName=databaseName, Name=tableName)
    logger.info(response)

    # loop thru all the Columns of the table
    isModified = False
    for i in response['Table']['StorageDescriptor']['Columns']:
        logger.info("## Column: " + str(i['Name']))
        # if Column datatype starts with set< then change it to array<
        if i['Type'].find("set<") != -1:
            i['Type'] = i['Type'].replace("set<", "array<")
            isModified = True
            logger.info(i['Type'])

    if isModified:
        # following 3 statements simply clean up the response JSON so that the update_table API call works
        del response['Table']['DatabaseName']
        del response['Table']['CreateTime']
        del response['Table']['UpdateTime']
        glue.update_table(DatabaseName=databaseName, TableInput=response['Table'], SkipArchive=True)

    logger.info("============ ### =============")
    logger.info(response)
    return True


def lambda_handler(event, context):
    logger.info('## EVENT')
    # logger.info(event)
    # This is a sample of the event payload that would be received
    # { 'version': '0',
    #   'id': '2b402842-21f5-1d76-1a9a-c90076d1d7da',
    #   'detail-type': 'Glue Data Catalog Table State Change',
    #   'source': 'aws.glue',
    #   'account': '1111111111',
    #   'time': '2019-08-18T02:53:41Z',
    #   'region': 'us-east-1',
    #   'resources': ['arn:aws:glue:us-east-1:111111111:table/ddb-glue-fh/ddb_glu_fh_sample'],
    #   'detail': {
    #     'databaseName': 'ddb-glue-fh',
    #     'changedPartitions': [],
    #     'typeOfChange': 'UpdateTable',
    #     'tableName': 'ddb_glu_fh_sample'
    #   }
    # }

    # get the database and table name of the Glue table that triggered the event
    databaseName = event['detail']['databaseName']
    tableName = event['detail']['tableName']
    logger.info("DB: " + databaseName + " | Table: " + tableName)

    table_contains_set(databaseName, tableName)
    # TODO implement and modify
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!')
    }
The Lambda function is straightforward; this post provides a basic skeleton. You can use this as a template to implement your own functionality for your specific data.
Conclusion
Simple things such as data type conversion and mapping can create unexpected outcomes and challenges when data crosses service boundaries. One of the advantages of AWS is the wide variety of tools with which you can create robust and scalable solutions tailored to your needs. Using event-driven architecture, we solved our data type conversion errors and automated the process to eliminate the issue as we move forward.
About the Authors
Arvind Godbole is a Lead Software Engineer at FactSet Research Systems. He has experience in building high-performance, high-availability client facing products and services, ranging from real-time financial applications to search infrastructure. He is currently building an analytics platform to gain insights into client workflows. He holds a B.S. in Computer Engineering from the University of California, San Diego
Tarik Makota is a Principal Solutions Architect with Amazon Web Services. He provides technical guidance, design advice, and thought leadership to AWS customers across the US Northeast. He holds an M.S. in Software Development and Management from Rochester Institute of Technology.
AWS re:Invent 2018 is almost here! This post includes a list of Amazon Kinesis sessions, chalk talks, and workshops at AWS re:Invent 2018. You can choose the link next to each session description for the session schedule. Use the information to help schedule your conference week in Las Vegas to learn more about Amazon Kinesis.
Sessions
ANT208 – Serverless Video Ingestion & Analytics with Amazon Kinesis Video Streams
Amazon Kinesis Video Streams makes it easy to capture live video, play it back, and store it for real-time and batch-oriented ML-driven analytics. In this session, we first dive deep into the top five best practices for getting started and scaling with Amazon Kinesis Video Streams. Next, we demonstrate streaming video from a standard USB camera connected to a laptop, and we perform live playback in a standard browser within minutes. We also have on stage members of Amazon Go, who are building the next generation of physical retail store experiences powered by their “just walk out” technology. They walk through the technical details of their integration with Kinesis Video Streams and highlight their successes and difficulties along the way.
ANT310 – Architecting for Real-Time Insights with Amazon Kinesis
Amazon Kinesis makes it easy to speed up the time it takes for you to get valuable, real-time insights from your streaming data. In this session, we walk through the most popular applications that customers implement using Amazon Kinesis, including streaming extract-transform-load, continuous metric generation, and responsive analytics. Our customer Autodesk joins us to describe how they created real-time metrics generation and analytics using Amazon Kinesis and Amazon Elasticsearch Service. They walk us through their architecture and the best practices they learned in building and deploying their real-time analytics solution.
ANT322-R – High Performance Data Streaming with Amazon Kinesis: Best Practices
Amazon Kinesis makes it easy to collect, process, and analyze real-time, streaming data so you can get timely insights and react quickly to new information. In this session, we dive deep into best practices for Kinesis Data Streams and Kinesis Data Firehose to get the most performance out of your data streaming applications. Our customer NICE inContact joins us to discuss how they utilize Amazon Kinesis Data Streams to make real-time decisions on customer contact routing and agent assignments for its Call Center as a Service (CCaaS) Platform. NICE inContact walks through their architecture and requirements for low-latency, accurate processing to be as responsive as possible to changes.
ANT322-R1 – High Performance Data Streaming with Amazon Kinesis: Best Practices
Amazon Kinesis makes it easy to collect, process, and analyze real-time, streaming data so you can get timely insights and react quickly to new information. In this session, we dive deep into best practices for Kinesis Data Streams and Kinesis Data Firehose to get the most performance out of your data streaming applications. Comcast uses Amazon Kinesis Data Streams to build a Streaming Data Platform that centralizes data exchanges. It is foundational to the way our data analysts and data scientists derive real-time insights from the data. In the second part of this talk, Comcast zooms into how to properly scale a Kinesis stream. We first list the factors to consider to avoid scaling issues with standard Kinesis stream consumption, and then we see how the new fan-out feature changes these scaling considerations.
SRV316-R & SRV316-R1 – Serverless Stream Processing Pipeline Best Practices
Real-time data has traditionally been analyzed using batch processing in DWH/Hadoop environments. Common use cases use data lakes, data science, and machine learning (ML). Creating serverless data-driven architecture and serverless streaming solutions with services like Amazon Kinesis, AWS Lambda, and Amazon Athena can solve real-time ingestion, storage, and analytics challenges, and help you focus on application logic without managing infrastructure. In this session, we introduce design patterns and best practices, and share customer journeys from batch to real-time insights in building modern serverless data-driven architecture applications. Hear how Intel built the Intel Pharma Analytics Platform using a serverless architecture. This AI cloud-based offering enables remote monitoring of patients using an array of sensors, wearable devices, and ML algorithms to objectively quantify the impact of interventions and power clinical studies in various therapeutic conditions.
SEC402-R – AWS, I Choose You: Pokemon’s Battle against the Bots
Join us for this advanced-level talk to learn about Pokemon’s journey defending against DDoS attacks and bad bots with AWS WAF, AWS Shield, and other AWS services. We go through their initial challenges and the evolution of their bot mitigation solution, which includes offline log analysis and dynamic updates of badbot IPs along with rate-based rules. This is an advanced talk and assumes some knowledge of Amazon DynamoDB, Amazon Kinesis Data Firehose, Amazon Kinesis Data Analytics, AWS Firewall Manager, AWS Shield, and AWS WAF.
Streaming data ingestion and near real-time analysis gives you immediate insights into your data. By using AWS Lambda with Amazon Kinesis, you can obtain these insights without the need to manage servers. But are you doing this in the most optimal way? In this interactive session, we review the best practices for using Lambda with Kinesis, and how to avoid common pitfalls.
ANT359 – Considerations for Building Your First Streaming Application
Do you want to increase your knowledge of AWS big data web services and launch your first big data application on the cloud? In this chalk talk, we provide an overview of many of the AWS analytics services, including Amazon EMR, Amazon Kinesis, Amazon Athena, and Amazon Redshift. We discuss how they are architected together to solve common big data problems, such as ingestion, ETL, and real-time analytics.
ANT360 – Don’t Wait Until Tomorrow: From Batch to Streaming
In recent years, there has been explosive growth in the number of connected devices and real-time data sources. Data is being produced continuously and its production rate is accelerating. Businesses can no longer wait for hours or days to use this data. To gain the most valuable insights, they must use this data immediately so they can react quickly to new information. In this chalk talk, we discuss how to take advantage of streaming data sources to analyze and react in near-real time. In addition, we present different options for how to solve a real-world scenario and walk through those solutions.
ANT361 – Using Amazon Kinesis Data Streams as a Low-Latency Message Bus
Amazon Kinesis makes it easy to collect, process, and analyze real-time, streaming data so you can get timely insights and react quickly to new information. In this chalk talk, we dive deep into best practices for Kinesis Data Streams and how to optimize for low-latency, multi-consumer solutions.
BAP328-R & BAP328-R1 – Architectures for Gaining Data Insights into Your Contact Center Experience
Join us for a deep dive into using Amazon Kinesis Data Analytics for insight into what’s happening with the contacts and agents in your Amazon Connect contact center. Learn how to leverage AWS analytics and ML services to inspect, transform, and gain insight into the customer’s journey through your contact center. We also show you how to use Alexa for Business to receive timely voice-activated business intelligence on your contact center’s performance.
Workshops
Before starting a workshop, you should have a basic understanding of Amazon Kinesis. Please bring your laptop and power supply to the workshop.
ANT213-R – Build Your First Big Data Application on AWS
Do you want to increase your knowledge of AWS big data web services and launch your first big data application on the cloud? In this session, we walk you through simplifying big data processing as a data bus comprising ingest, store, process, and visualize. You will build a big data application using AWS managed services, including Amazon Athena, Amazon Kinesis, Amazon DynamoDB, and Amazon S3. Along the way, we review architecture design patterns for big data applications and give you access to a take-home lab so you can rebuild and customize the application yourself.
ANT213-R1 – Build Your First Big Data Application on AWS
Do you want to increase your knowledge of AWS big data web services and launch your first big data application on the cloud? In this session, we walk you through simplifying big data processing as a data bus comprising ingest, store, process, and visualize. You will build a big data application using AWS managed services, including Amazon Athena, Amazon Kinesis, Amazon DynamoDB, and Amazon S3. Along the way, we review architecture design patterns for big data applications and give you access to a take-home lab so you can rebuild and customize the application yourself.
ANT357 – Stream Video, Analyze It in Real Time, and Share It in Real Time
Video is ‘big data.’ Image sensors—in our smartphones, smart home devices, and traffic cameras—are getting Internet-connected. Massive streams of video data are generated, but currently not mined for real-time insights to drive businesses forward. In this workshop, learn to capture, process, and analyze video streams. Build and configure your camera device’s media pipeline to start streaming video into the AWS Cloud using Amazon Kinesis Video Streams. Next, build and deploy your own machine learning (ML) model in Amazon SageMaker to generate inferences about objects or activities in your video stream. Finally, build a browser-based web player to view the video in Live and On-Demand modes, including the analyzed video stream. In this workshop, you use Amazon Kinesis Video Streams, Amazon SageMaker, Amazon Rekognition Video, and Amazon ECS.
ANT362 – Use Streaming Data to Gain Real-Time Insights into Your Business
In recent years, there has been an explosive growth in the number of connected devices and real-time data sources. Because of this, data is being continuously produced, and its production rate is accelerating. Businesses can no longer wait for hours or days to use this data. To gain the most valuable insights, they must use this data immediately so they can react quickly to new information. In this workshop, you will learn how to take advantage of streaming data sources to analyze and react in near real time. We provide several requirements for a real-world streaming data scenario, and you’re tasked with creating a solution that successfully satisfies the requirements using services such as Amazon Kinesis, AWS Lambda, and Amazon SNS.
ANT318-R – Build, Deploy and Serve Machine learning models on streaming data using Amazon SageMaker, Apache Spark on Amazon EMR and Amazon Kinesis
As data exponentially grows in organizations, there is an increasing need to use machine learning (ML) to gather insights from this data at scale and to use those insights to perform real-time predictions on incoming data. In this workshop, we walk you through how to train an Apache Spark model using Amazon SageMaker that points to Apache Livy and running on an Amazon EMR Spark cluster. We also show you how to host the Spark model on Amazon SageMaker to serve a RESTful inference API. Finally, we show you how to use the RESTful API to serve real-time predictions on streaming data from Amazon Kinesis Data Streams.
ANT318-R1 – Build, Deploy and Serve Machine learning models on streaming data using Amazon SageMaker, Apache Spark on Amazon EMR and Amazon Kinesis
As data exponentially grows in organizations, there is an increasing need to use machine learning (ML) to gather insights from this data at scale and to use those insights to perform real-time predictions on incoming data. In this workshop, we walk you through how to train an Apache Spark model using Amazon SageMaker that points to Apache Livy and running on an Amazon EMR Spark cluster. We also show you how to host the Spark model on Amazon SageMaker to serve a RESTful inference API. Finally, we show you how to use the RESTful API to serve real-time predictions on streaming data from Amazon Kinesis Data Streams.
In this hands-on workshop, you learn best practices and architectural patterns for building streaming data processing pipelines without servers. Using Amazon Kinesis, AWS Lambda, and other services, you have the opportunity to build, deploy, and monitor an application to ingest and process high-velocity data at scale. This advanced workshop assumes that you have experience writing Lambda functions and understand the basics of the AWS serverless platform, so come ready to dive into the deep end. Bring your laptop with a full keyboard. We provide a sandbox AWS account for you to use during the workshop.
MAE309 – Build an AWS Analytics Solution to Monitor the Video Streaming Experience
In this workshop, we build and deploy an end-to-end analytics solution for monitoring the video streaming experience. We integrate an open source video player with Amazon Kinesis Data Streams to capture events in real time. We explore the data available for capture and a variety of use cases: from generating alerts on poor experience to content recommendations based on user behavior. We also show you how this real-time data can be archived in a data lake and further used to generate reports of aggregate performance and experience across a number of dimensions.
ADT401 – Real-Time Web Analytics with Amazon Kinesis Data Analytics
Knowing what users are doing on your websites in real time provides insights you can act on without waiting for delayed batch processing of clickstream data. Watching the immediate impact on user behavior after new releases, detecting and responding to anomalies, situational awareness, and evaluating trends are all benefits of real-time website analytics. In this workshop, we build a cost-optimized platform to capture web beacon traffic, analyze it for interesting metrics, and display it on a customized dashboard. We start by deploying the Web Analytics Solution Accelerator; once the core is complete, we extend the solution to capture new and interesting metrics, process those with Amazon Kinesis Data Analytics, and display new graphs on the custom dashboard. Participants come away with a fully functional system for capturing, analyzing, and displaying valuable website metrics in real time.
GAM305 – Dynamic Encounters for Veteran Players Using Machine Learning
Are you trying to keep your game fresh for long-time players but don’t have the resources to keep building new handcrafted content? Join this session to learn how to launch dynamic content for groups of players without relying on static techniques like instancing or spawn points. We dive into Amazon Kinesis and Amazon Kinesis Data Analytics for real-time data collection and hotspot detection, and machine learning for encounter-building based on observed player behavior.
Conclusion
We look forward to seeing you at AWS re:Invent 2018 in Las Vegas. In addition to the sessions described in this blog post, please stop by the Analytics booth during Expo hours to learn more about Amazon Kinesis.
About the Author
Larry Heathcote is a Principal Product Marketing Manager at Amazon Web Services. Larry is passionate about seeing the results of data-driven insights on business outcomes. He enjoys family time, home projects, grilling out, and classic barbeque.
Amazon Kinesis Data Firehose is the easiest way to capture and stream data into a data lake built on Amazon S3. This data can be anything—from AWS service logs like AWS CloudTrail log files, Amazon VPC Flow Logs, Application Load Balancer logs, and others. It can also be IoT events, game events, and much more. To efficiently query this data, a time-consuming ETL (extract, transform, and load) process is required to massage and convert the data to an optimal file format, which increases the time to insight. This situation is less than ideal, especially for real-time data that loses its value over time.
To solve this common challenge, Kinesis Data Firehose can now save data to Amazon S3 in Apache Parquet or Apache ORC format. These are optimized columnar formats that are highly recommended for best performance and cost-savings when querying data in S3. This feature directly benefits you if you use Amazon Athena, Amazon Redshift, AWS Glue, Amazon EMR, or any other big data tools that are available from the AWS Partner Network and through the open-source community.
Amazon Connect is a simple-to-use, cloud-based contact center service that makes it easy for any business to provide a great customer experience at a lower cost than common alternatives. Its open platform design enables easy integration with other systems. One of those systems is Amazon Kinesis—in particular, Kinesis Data Streams and Kinesis Data Firehose.
What’s really exciting is that you can now save events from Amazon Connect to S3 in Apache Parquet format. You can then perform analytics using Amazon Athena and Amazon Redshift Spectrum in real time, taking advantage of this key performance and cost optimization. Of course, Amazon Connect is only one example. This new capability opens the door for a great deal of opportunity, especially as organizations continue to build their data lakes.
Amazon Connect includes an array of analytics views in the Administrator dashboard. But you might want to run other types of analysis. In this post, I describe how to set up a data stream from Amazon Connect through Kinesis Data Streams and Kinesis Data Firehose and out to S3, and then perform analytics using Athena and Amazon Redshift Spectrum. I focus primarily on the Kinesis Data Firehose support for Parquet and its integration with the AWS Glue Data Catalog, Amazon Athena, and Amazon Redshift.
Solution overview
Here is how the solution is laid out:
The following sections walk you through each of these steps to set up the pipeline.
1. Define the schema
When Kinesis Data Firehose processes incoming events and converts the data to Parquet, it needs to know which schema to apply. The reason is that incoming events often contain only some of the expected fields, depending on which values the producers send. A typical process is to normalize the schema during a batch ETL job so that you end up with a consistent schema that can easily be understood and queried. Doing this introduces latency due to the nature of the batch process. To overcome this issue, Kinesis Data Firehose requires the schema to be defined in advance.
To see the available columns and structures, see Amazon Connect Agent Event Streams. For the purpose of simplicity, I opted to make all the columns of type String rather than create the nested structures. But you can definitely do that if you want.
The simplest way to define the schema is to create a table in the Amazon Athena console. Open the Athena console, and paste the following create table statement, substituting your own S3 bucket and prefix for where your event data will be stored. A Data Catalog database is a logical container that holds the different tables that you can create. The default database name shown here should already exist. If it doesn’t, you can create it or use another database that you’ve already created.
That’s all you have to do to prepare the schema for Kinesis Data Firehose.
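If you would rather script this step than use the Athena console, a rough equivalent with the AWS Glue API might look like the following sketch; the column list is abbreviated (see the Agent Event Streams documentation for the full set), and the database, table, and bucket names are assumptions:

import boto3

glue = boto3.client("glue")

# abbreviated schema: every column is a string, matching the simplification described above
columns = [
    {"Name": "awsaccountid", "Type": "string"},
    {"Name": "eventid", "Type": "string"},
    {"Name": "eventtimestamp", "Type": "string"},
    {"Name": "eventtype", "Type": "string"},
    {"Name": "currentagentsnapshot", "Type": "string"},
    {"Name": "previousagentsnapshot", "Type": "string"},
]

glue.create_table(
    DatabaseName="default",
    TableInput={
        "Name": "connect_agent_events",
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": columns,
            "Location": "s3://your-connect-events-bucket/agent-events/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)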
2. Define the data streams
Next, you need to define the Kinesis data streams that will be used to stream the Amazon Connect events. Open the Kinesis Data Streams console and create two streams. You can configure them with only one shard each because you don’t have a lot of data right now.
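If you prefer to script this, creating the streams takes a couple of boto3 calls (the stream names are assumptions):

import boto3

kinesis = boto3.client("kinesis")

# one shard per stream is enough for the small amount of data in this walkthrough
for stream_name in ["connect-agent-events", "connect-contact-events"]:
    kinesis.create_stream(StreamName=stream_name, ShardCount=1)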
3. Define the Kinesis Data Firehose delivery stream for Parquet
Let’s configure the Data Firehose delivery stream using the data stream as the source and Amazon S3 as the output. Start by opening the Kinesis Data Firehose console and creating a new data delivery stream. Give it a name, and associate it with the Kinesis data stream that you created in Step 2.
As shown in the following screenshot, enable Record format conversion (1) and choose Apache Parquet (2). As you can see, Apache ORC is also supported. Scroll down and provide the AWS Glue Data Catalog database name (3) and table names (4) that you created in Step 1. Choose Next.
To make things easier, the output S3 bucket and prefix fields are automatically populated using the values that you defined in the LOCATION parameter of the create table statement from Step 1. Pretty cool. Additionally, you have the option to save the raw events into another location as defined in the Source record S3 backup section. Don’t forget to add a trailing forward slash (“/”) so that Kinesis Data Firehose creates the date partitions inside that prefix.
On the next page, in the S3 buffer conditions section, there is a note about configuring a large buffer size. The Parquet file format is highly efficient in how it stores and compresses data. Increasing the buffer size allows you to pack more rows into each output file, which is preferred and gives you the most benefit from Parquet.
Compression using Snappy is automatically enabled for both Parquet and ORC. You can modify the compression algorithm by using the Kinesis Data Firehose API and update the OutputFormatConfiguration.
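For example, a boto3 sketch of switching the Parquet compression to GZIP could look like the following; the delivery stream name is an assumption, and the version and destination IDs come from describe_delivery_stream:

import boto3

firehose = boto3.client("firehose")
stream_name = "connect-agent-events-parquet"  # assumed delivery stream name

description = firehose.describe_delivery_stream(DeliveryStreamName=stream_name)["DeliveryStreamDescription"]
firehose.update_destination(
    DeliveryStreamName=stream_name,
    CurrentDeliveryStreamVersionId=description["VersionId"],
    DestinationId=description["Destinations"][0]["DestinationId"],
    ExtendedS3DestinationUpdate={
        "DataFormatConversionConfiguration": {
            "OutputFormatConfiguration": {
                # switch the Parquet serializer from the default Snappy to GZIP
                "Serializer": {"ParquetSerDe": {"Compression": "GZIP"}}
            }
        }
    },
)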
Be sure to also enable Amazon CloudWatch Logs so that you can debug any issues that you might run into.
Lastly, finalize the creation of the Firehose delivery stream, and continue on to the next section.
4. Set up the Amazon Connect contact center
After setting up the Kinesis pipeline, you now need to set up a simple contact center in Amazon Connect. The Getting Started page provides clear instructions on how to set up your environment, acquire a phone number, and create an agent to accept calls.
After setting up the contact center, in the Amazon Connect console, choose your Instance Alias, and then choose Data Streaming. Under Agent Event, choose the Kinesis data stream that you created in Step 2, and then choose Save.
At this point, your pipeline is complete. Agent events from Amazon Connect are generated as agents go about their day. Events are sent via Kinesis Data Streams to Kinesis Data Firehose, which converts the event data from JSON to Parquet and stores it in S3. Athena and Amazon Redshift Spectrum can simply query the data without any additional work.
So let’s generate some data. Go back into the Administrator console for your Amazon Connect contact center, and create an agent to handle incoming calls. In this example, I creatively named mine Agent One. After it is created, Agent One can get to work and log into their console and set their availability to Available so that they are ready to receive calls.
To make the data a bit more interesting, I also created a second agent, Agent Two. I then made some incoming and outgoing calls and caused some failures to occur, so I now have enough data available to analyze.
5. Analyze the data with Athena
Let’s open the Athena console and run some queries. One thing you’ll notice is that when we created the schema for the dataset, we defined some of the fields as strings even though the documentation describes them as complex structures. The reason was simply to show the flexibility of Athena in parsing JSON data. However, you can define nested structures in your table schema so that Kinesis Data Firehose applies the appropriate schema to the Parquet file.
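For instance, a hypothetical variant of the table could model part of the agent snapshot as a nested type. The field names below are inferred from the JSON paths used in the following queries, so treat this as a sketch rather than the schema used in this walkthrough:

-- Hypothetical nested variant (field names inferred from the JSON paths queried below;
-- bucket and prefix are placeholders)
CREATE EXTERNAL TABLE default.kfhconnectblog_nested (
  eventid string,
  eventtype string,
  eventtimestamp string,
  currentagentsnapshot struct<
    agentstatus: struct<name: string, starttimestamp: string>,
    configuration: struct<firstname: string, lastname: string, username: string>
  >
)
STORED AS parquet
LOCATION 's3://your-bucket/kfhconnectblog-nested/';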
Let’s run the first query to see which agents have logged into the system.
The query might look complex, but it’s fairly straightforward:
WITH dataset AS (
SELECT
from_iso8601_timestamp(eventtimestamp) AS event_ts,
eventtype,
-- CURRENT STATE
json_extract_scalar(
currentagentsnapshot,
'$.agentstatus.name') AS current_status,
from_iso8601_timestamp(
json_extract_scalar(
currentagentsnapshot,
'$.agentstatus.starttimestamp')) AS current_starttimestamp,
json_extract_scalar(
currentagentsnapshot,
'$.configuration.firstname') AS current_firstname,
json_extract_scalar(
currentagentsnapshot,
'$.configuration.lastname') AS current_lastname,
json_extract_scalar(
currentagentsnapshot,
'$.configuration.username') AS current_username,
json_extract_scalar(
currentagentsnapshot,
'$.configuration.routingprofile.defaultoutboundqueue.name') AS current_outboundqueue,
json_extract_scalar(
currentagentsnapshot,
'$.configuration.routingprofile.inboundqueues[0].name') as current_inboundqueue,
-- PREVIOUS STATE
json_extract_scalar(
previousagentsnapshot,
'$.agentstatus.name') as prev_status,
from_iso8601_timestamp(
json_extract_scalar(
previousagentsnapshot,
'$.agentstatus.starttimestamp')) as prev_starttimestamp,
json_extract_scalar(
previousagentsnapshot,
'$.configuration.firstname') as prev_firstname,
json_extract_scalar(
previousagentsnapshot,
'$.configuration.lastname') as prev_lastname,
json_extract_scalar(
previousagentsnapshot,
'$.configuration.username') as prev_username,
json_extract_scalar(
previousagentsnapshot,
'$.configuration.routingprofile.defaultoutboundqueue.name') as prev_outboundqueue,
json_extract_scalar(
previousagentsnapshot,
'$.configuration.routingprofile.inboundqueues[0].name') as prev_inboundqueue
from kfhconnectblog
where eventtype <> 'HEART_BEAT'
)
SELECT
current_status as status,
current_username as username,
event_ts
FROM dataset
WHERE eventtype = 'LOGIN' AND current_username <> ''
ORDER BY event_ts DESC
The query output looks something like this:
Here is another query that shows the sessions each of the agents engaged in. It tells us whether calls were incoming or outgoing, whether they were completed, and whether there were missed or failed calls.
WITH src AS (
SELECT
eventid,
json_extract_scalar(currentagentsnapshot, '$.configuration.username') as username,
cast(json_extract(currentagentsnapshot, '$.contacts') AS ARRAY(JSON)) as c,
cast(json_extract(previousagentsnapshot, '$.contacts') AS ARRAY(JSON)) as p
from kfhconnectblog
),
src2 AS (
SELECT *
FROM src CROSS JOIN UNNEST (c, p) AS contacts(c_item, p_item)
),
dataset AS (
SELECT
eventid,
username,
json_extract_scalar(c_item, '$.contactid') as c_contactid,
json_extract_scalar(c_item, '$.channel') as c_channel,
json_extract_scalar(c_item, '$.initiationmethod') as c_direction,
json_extract_scalar(c_item, '$.queue.name') as c_queue,
json_extract_scalar(c_item, '$.state') as c_state,
from_iso8601_timestamp(json_extract_scalar(c_item, '$.statestarttimestamp')) as c_ts,
json_extract_scalar(p_item, '$.contactid') as p_contactid,
json_extract_scalar(p_item, '$.channel') as p_channel,
json_extract_scalar(p_item, '$.initiationmethod') as p_direction,
json_extract_scalar(p_item, '$.queue.name') as p_queue,
json_extract_scalar(p_item, '$.state') as p_state,
from_iso8601_timestamp(json_extract_scalar(p_item, '$.statestarttimestamp')) as p_ts
FROM src2
)
SELECT
username,
c_channel as channel,
c_direction as direction,
p_state as prev_state,
c_state as current_state,
c_ts as current_ts,
c_contactid as id
FROM dataset
WHERE c_contactid = p_contactid
ORDER BY id DESC, current_ts ASC
The query output looks similar to the following:
6. Analyze the data with Amazon Redshift Spectrum
With Amazon Redshift Spectrum, you can query data directly in S3 using your existing Amazon Redshift data warehouse cluster. Because the data is already in Parquet format, Redshift Spectrum gets the same great benefits that Athena does.
Note that to query this data from Amazon Redshift, you first need to create an external schema in Amazon Redshift that points to the AWS Glue Data Catalog.
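A statement along the following lines creates that schema (a sketch only; the schema name matches the query below, while the IAM role ARN is a placeholder):

-- Sketch: map the AWS Glue Data Catalog database into Amazon Redshift
-- (IAM role ARN is a placeholder)
CREATE EXTERNAL SCHEMA default_schema
FROM DATA CATALOG
DATABASE 'default'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole';

With the external schema in place, here is a simple query that retrieves the same data from Amazon Redshift: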
SELECT
eventtype,
json_extract_path_text(currentagentsnapshot,'agentstatus','name') AS current_status,
json_extract_path_text(currentagentsnapshot, 'configuration','firstname') AS current_firstname,
json_extract_path_text(currentagentsnapshot, 'configuration','lastname') AS current_lastname,
json_extract_path_text(
currentagentsnapshot,
'configuration','routingprofile','defaultoutboundqueue','name') AS current_outboundqueue
FROM default_schema.kfhconnectblog
The following shows the query output:
Summary
In this post, I showed you how to use Kinesis Data Firehose to ingest data and convert it to a columnar file format, enabling real-time analysis using Athena and Amazon Redshift. This feature enables a level of optimization in both cost and performance that you need when storing and analyzing large amounts of data, and it is equally important if you are investing in building data lakes on AWS.
Roy Hasson is a Global Business Development Manager for AWS Analytics. He works with customers around the globe to design solutions to meet their data processing, analytics, and business intelligence needs. Roy is a big Manchester United fan who enjoys cheering his team on and hanging out with his family.
This blog post was co-authored by Ujjwal Ratan, a senior AI/ML solutions architect on the global life sciences team.
Healthcare data is generated at an ever-increasing rate and is predicted to reach 35 zettabytes by 2020. Being able to cost-effectively and securely manage this data, whether for patient care, research, or legal reasons, is increasingly important for healthcare providers.
Healthcare providers must have the ability to ingest, store and protect large volumes of data including clinical, genomic, device, financial, supply chain, and claims. AWS is well-suited to this data deluge with a wide variety of ingestion, storage and security services (e.g. AWS Direct Connect, Amazon Kinesis Streams, Amazon S3, Amazon Macie) for customers to handle their healthcare data. In a recent Healthcare IT News article, healthcare thought-leader, John Halamka, noted, “I predict that five years from now none of us will have datacenters. We’re going to go out to the cloud to find EHRs, clinical decision support, analytics.”
I realize simply storing this data is challenging enough. Magnifying the problem is the fact that healthcare data is increasingly attractive to cyber attackers, making security a top priority. According to Mariya Yao in her Forbes column, it is estimated that individual medical records can be worth hundreds or even thousands of dollars on the black market.
In this first post of a two-part series, I address the value that AWS can bring to customers for ingesting, storing, and protecting providers’ healthcare data. I describe key components of any cloud-based healthcare workload and the services AWS provides to meet these requirements. In part 2, we dive deep into the AWS services used for advanced analytics, artificial intelligence, and machine learning.
The data tsunami is upon us
So where is this data coming from? In addition to the ubiquitous electronic health record (EHR), the sources of this data include:
genomic sequencers
devices such as MRIs, x-rays and ultrasounds
sensors and wearables for patients
medical equipment telemetry
mobile applications
Additional sources of data come from non-clinical, operational systems such as:
human resources
finance
supply chain
claims and billing
Data from these sources can be structured (e.g., claims data) as well as unstructured (e.g., clinician notes). Some data arrives in streams, such as feeds from patient monitors, while other data arrives in batches, and still other data arrives in near real time, such as HL7 messages. All of this data has retention policies dictating how long it must be stored, and much of it is retained in perpetuity because many systems in use today have no purge mechanism. AWS has services to manage all of these data types as well as their retention, security, and access policies.
Imaging is a significant contributor to this data tsunami. Increasing demand for early-stage diagnoses along with aging populations drive increasing demand for images from CT, PET, MRI, ultrasound, digital pathology, X-ray and fluoroscopy. For example, a thin-slice CT image can be hundreds of megabytes. Increasing demand and strict retention policies make storage costly.
Due to the plummeting cost of gene sequencing, molecular diagnostics (including liquid biopsy) is a large contributor to this data deluge. Many predict that as the value of molecular testing becomes more identifiable, the reimbursement models will change and it will increasingly become the standard of care. According to the Washington Post article “Sequencing the Genome Creates so Much Data We Don’t Know What to do with It,”
“Some researchers predict that up to one billion people will have their genome sequenced by 2025 generating up to 40 exabytes of data per year.”
Although genomics is primarily used for oncology diagnostics today, it is also used for other purposes, such as pharmacogenomics, which helps predict how an individual will metabolize a medication.
Reference Architecture
It is increasingly challenging for the typical hospital, clinic or physician practice to securely store, process and manage this data without cloud adoption.
AWS has a variety of ingestion techniques depending on the nature of the data, including its size, frequency, and structure. AWS Snowball and AWS Snowmobile are appropriate for extremely large, secure data transfers, whether one-time or episodic. AWS Glue is a fully managed ETL service for securely moving data from on-premises systems to AWS, and Amazon Kinesis can be used for ingesting streaming data.
Amazon S3, Amazon S3 IA, and Amazon Glacier are economical, data-storage services with a pay-as-you-go pricing model that expand (or shrink) with the customer’s requirements.
The above architecture has four distinct components: ingestion, storage, security, and analytics. In this post, I dive deeper into the first three components: ingestion, storage, and security. In part 2, I look at how to use AWS analytics services to derive value from, and optimize, your healthcare data.
Ingestion
A typical provider data center will consist of many systems with varied datasets. AWS provides multiple tools and services to effectively and securely connect to these data sources and ingest data in various formats. The customers can choose from a range of services and use them in accordance with the use case.
For use cases involving one-time (or periodic) very large data migrations into AWS, customers can take advantage of AWS Snowball devices. These devices come in two sizes, 50 TB and 80 TB, and can be combined to create a petabyte-scale data transfer solution.
The devices are easy to connect and load, and they are shipped to AWS, avoiding the network bottlenecks associated with such large-scale data migrations. The devices are extremely secure, supporting 256-bit encryption, and come in a tamper-resistant enclosure. AWS Snowball imports data into Amazon S3, which can then interface with other AWS compute services to process that data in a scalable manner.
For use cases that require keeping a portion of a dataset on premises for active use while offloading the rest to AWS, the AWS Storage Gateway service can be used. The service allows you to seamlessly integrate on-premises applications via standard storage protocols such as iSCSI or NFS mounted on a gateway appliance. It supports a file interface, a volume interface, and a tape interface, which can be used for a range of use cases such as disaster recovery, backup and archiving, cloud bursting, storage tiering, and migration.
The AWS Storage Gateway appliance can use AWS Direct Connect to establish a dedicated network connection from the on-premises data center to AWS.
Specific Industry Use Cases
By using the AWS proposed reference architecture for disaster recovery, healthcare providers can ensure their data assets are securely stored on the cloud and are easily accessible in the event of a disaster. The “AWS Disaster Recovery” whitepaper includes details on options available to customers based on their desired recovery time objective (RTO) and recovery point objective (RPO).
AWS is an ideal destination for offloading large volumes of less-frequently accessed data. These datasets are rarely used in active compute operations but are exceedingly important to retain for reasons such as compliance. By storing these datasets on AWS, customers can take advantage of a highly durable platform to securely store their data and also retrieve it easily when they need to. For more details on how AWS enables customers to run backup and archival use cases on AWS, see the following set of whitepapers.
A healthcare provider may have a variety of databases spread throughout the hospital system supporting critical applications such as EHR, PACS, finance, and many more. These datasets often need to be aggregated to derive information and calculate metrics to optimize business processes. AWS Glue is a fully managed Extract, Transform, and Load (ETL) service that can read data from a JDBC-enabled, on-premises database and transfer the datasets into AWS services such as Amazon S3, Amazon Redshift, and Amazon RDS. This allows customers to create transformation workflows that integrate smaller datasets from multiple sources and aggregate them on AWS.
Healthcare providers deal with a variety of streaming datasets, which often have to be analyzed in near real time. These datasets come from a variety of sources, such as sensors, messaging buses, and social media, and often do not adhere to an industry standard. The Amazon Kinesis suite of services, which includes Amazon Kinesis Streams, Amazon Kinesis Firehose, and Amazon Kinesis Analytics, is an ideal set of services for deriving value from streaming data.
Example: Using AWS Glue to de-identify and ingest healthcare data into S3
Let’s consider a scenario in which a provider maintains patient records in a database that they want to ingest into S3. The provider also wants to de-identify the data by stripping personally identifiable attributes and storing the non-identifiable information in an S3 bucket separate from the one that contains identifiable information. Doing this allows the healthcare provider to isolate sensitive information and apply more restrictive S3 bucket policies to it.
To ingest records into S3, we create an AWS Glue job that reads from the source database using a Glue connection. The connection is also used by a Glue crawler to populate the AWS Glue Data Catalog with the schema of the source database. We use a Glue development endpoint and an Apache Zeppelin notebook server on EC2 to develop and run the job.
Step 1: Import the necessary libraries and set up a GlueContext, which is a wrapper around the SparkContext:
Step 2: Create a dataframe from the source data. I call the dataframe “readmissionsdata”. Here is what the schema would look like:
Step 3: Now select the columns that contain identifiable information and store them in a new dataframe. Call the new dataframe “phi”.
Step 4: Non-PHI columns are stored in a separate dataframe. Call this dataframe “nonphi”.
Step 5: Write the two dataframes into two separate S3 buckets.
Once successfully executed, the PHI and non-PHI attributes are stored in two separate files in two separate buckets that can be individually maintained.
Storage
In 2016, 327 healthcare providers reported a protected health information (PHI) breach, affecting 16.4 million patient records.[1] In 2017, 342 data breaches were reported, involving 3.2 million patient records.[2]
To date, AWS has released 51 HIPAA-eligible services to help customers address security challenges and is in the process of making many more services HIPAA-eligible. These HIPAA-eligible services (along with all other AWS services) help customers build solutions that comply with HIPAA security and auditing requirements. A catalog of HIPAA-eligible services can be found at AWS HIPAA Eligible Services. It is important to note that AWS manages physical and logical access controls for the AWS boundary. However, the overall security of your workloads is a shared responsibility, where you are responsible for controlling user access to content on your AWS accounts.
AWS storage services allow you to store data efficiently while maintaining high durability and scalability. By using Amazon S3 as the central storage layer, you can take advantage of the Amazon S3 storage management features to get operational metrics on your datasets and transition them between storage classes to save costs. By tagging objects on Amazon S3, you can build a governance layer that grants role-based access to objects using AWS IAM and Amazon S3 bucket policies.
To learn more about the Amazon S3 storage management features, see the following link.
Security
In the example above, we store the PHI information in a bucket named “phi.” Now we want to protect this information: make sure it is encrypted, prevent unauthorized access, and log all access requests to the data.
Encryption: S3 provides settings to enable default encryption on a bucket. This ensures any object in the bucket is encrypted by default.
Logging: S3 provides object-level logging that can be used to capture all API calls on the objects. The API calls are logged in AWS CloudTrail for easy access and consolidation. S3 also supports event notifications, which can proactively alert customers of read and write operations.
Access control: Customers can use S3 bucket policies and IAM policies to restrict access to the phi bucket. They can also enforce multi-factor authentication on the bucket. For example, the following policy enforces multi-factor authentication on the phi bucket:
In Part 1 of this blog, we detailed the ingestion, storage, security and management of healthcare data on AWS. Stay tuned for part two where we are going to dive deep into optimizing the data for analytics and machine learning.
This is a customer post by Stephen Borg, the Head of Big Data and BI at Cerberus Technologies.
Cerberus Technologies, in their own words: Cerberus is a company founded in 2017 by a team of visionary iGaming veterans. Our mission is simple – to offer the best tech solutions through a data-driven and a customer-first approach, delivering innovative solutions that go against traditional forms of working and process. This mission is based on the solid foundations of reliability, flexibility and security, and we intend to fundamentally change the way iGaming and other industries interact with technology.
Over the years, I have developed and created a number of data warehouses from scratch. Recently, I built a data warehouse for the iGaming industry single-handedly. To do it, I used the power and flexibility of Amazon Redshift and the wider AWS data management ecosystem. In this post, I explain how I was able to build a robust and scalable data warehouse without the large team of experts typically needed.
In two of my recent projects, I ran into challenges when scaling our data warehouse using on-premises infrastructure. Data was growing at many tens of gigabytes per day, and query performance was suffering. Scaling required major capital investment for hardware and software licenses, and also significant operational costs for maintenance and technical staff to keep it running and performing well. Unfortunately, I couldn’t get the resources needed to scale the infrastructure with data growth, and these projects were abandoned. Thanks to cloud data warehousing, the bottleneck of infrastructure resources, capital expense, and operational costs have been significantly reduced or have totally gone away. There is no more excuse for allowing obstacles of the past to delay delivering timely insights to decision makers, no matter how much data you have.
With Amazon Redshift and AWS, I delivered a cloud data warehouse to the business very quickly, and with a small team: me. I didn’t have to order hardware or software, and I no longer needed to install, configure, tune, or keep up with patches and version updates. Instead, I easily set up a robust data processing pipeline and we were quickly ingesting and analyzing data. Now, my data warehouse team can be extremely lean, and focus more time on bringing in new data and delivering insights. In this post, I show you the AWS services and the architecture that I used.
Handling data feeds
I have several different data sources that provide everything needed to run the business. The data includes activity from our iGaming platform, social media posts, clickstream data, marketing and campaign performance, and customer support engagements.
To handle the diversity of data feeds, I developed abstract integration applications using Docker that run on Amazon EC2 Container Service (Amazon ECS) and feed data to Amazon Kinesis Data Streams. These data streams can be used for real-time analytics. In my system, each record in Kinesis is preprocessed by an AWS Lambda function to cleanse and aggregate information. Amazon Kinesis Data Firehose then stores the data where I need it on Amazon S3. Suppose that you used an on-premises architecture to accomplish the same task. A team of data engineers would be required to maintain and monitor a Kafka cluster, develop applications to stream data, and maintain a Hadoop cluster and the infrastructure underneath it for data storage. With my stream processing architecture, there are no servers to manage, no disk drives to replace, and no service monitoring to write.
Setting up a Kinesis stream can be done with a few clicks, and the same for Kinesis Firehose. Firehose can be configured to automatically consume data from a Kinesis Data Stream, and then write compressed data every N minutes to Amazon S3. When I want to process a Kinesis data stream, it’s very easy to set up a Lambda function to be executed on each message received. I can just set a trigger from the AWS Lambda Management Console, as shown following.
Regardless of the format in which I receive data from our partners, I can send it to Kinesis as JSON data using my own formatters. After Firehose writes this to Amazon S3, I have everything in nearly the same structure I received it in, but compressed, encrypted, and optimized for reading.
This data is automatically crawled by AWS Glue and placed into the AWS Glue Data Catalog. This means that I can immediately query the data directly on S3 using Amazon Athena or through Amazon Redshift Spectrum. Previously, I used Amazon EMR and an Amazon RDS–based metastore in Apache Hive for catalog management. Now I can avoid the complexity of maintaining Hive Metastore catalogs. Glue takes care of high availability and the operations side so that I know that end users can always be productive.
Working with Amazon Athena and Amazon Redshift for analysis
I found Amazon Athena extremely useful out of the box for ad hoc analysis. Our engineers (me) use Athena to understand new datasets that we receive and to understand what transformations will be needed for long-term query efficiency.
For our data analysts and data scientists, we’ve selected Amazon Redshift. Amazon Redshift has proven to be the right tool for us over and over again. It easily processes 20+ million transactions per day, regardless of the footprint of the tables and the type of analytics required by the business. Latency is low and query performance expectations have been more than met. We use Redshift Spectrum for long-term data retention, which enables me to extend the analytic power of Amazon Redshift beyond local data to anything stored in S3, and without requiring me to load any data. Redshift Spectrum gives me the freedom to store data where I want, in the format I want, and have it available for processing when I need it.
To load data directly into Amazon Redshift, I use AWS Data Pipeline to orchestrate data workflows. I create Amazon EMR clusters on an intra-day basis, which I can easily adjust to run more or less frequently as needed throughout the day. EMR clusters are used together with Amazon RDS, Apache Spark 2.0, and S3 storage. The data pipeline application loads ETL configurations from Spring RESTful services hosted on AWS Elastic Beanstalk. The application then loads data from S3 into memory, aggregates and cleans the data, and then writes the final version of the data to Amazon Redshift. This data is then ready to use for analysis. Spark on EMR also helps with recommendations and personalization use cases for various business users, and I find this easy to set up and deliver what users want. Finally, business users use Amazon QuickSight for self-service BI to slice, dice, and visualize the data depending on their requirements.
Each AWS service in this architecture plays its part in saving precious time that’s crucial for delivery and for getting different departments in the business on board. I found the services easy to set up and use, and all have proven to be highly reliable in our production environments. When the architecture was in place, scaling out was either handled completely by the service or was a matter of a simple API call, and crucially it didn’t require me to change a single line of code. Increasing shards for Kinesis can be done in a minute by editing a stream. Increasing capacity for Lambda functions can be accomplished by editing the megabytes allocated for processing, and concurrency is handled automatically. EMR cluster capacity can easily be increased by changing the master and core node types in Data Pipeline, or by using Auto Scaling. Lastly, RDS and Amazon Redshift can be easily upgraded without any major tasks to be performed by our team (again, me).
In the end, using AWS services including Kinesis, Lambda, Data Pipeline, and Amazon Redshift allows me to keep my team lean and highly productive. I eliminated the cost and delays of capital infrastructure, as well as the late night and weekend calls for support. I can now give maximum value to the business while keeping operational costs down. My team pushed out an agile and highly responsive data warehouse solution in record time and we can handle changing business requirements rapidly, and quickly adapt to new data and new user requests.
Stephen Borg is the Head of Big Data and BI at Cerberus Technologies. He has a background in platform software engineering, and first became involved in data warehousing using the typical RDBMS, SQL, ETL, and BI tools. He quickly became passionate about providing insight to help others optimize the business and add personalization to products. He is now the Head of Big Data and BI at Cerberus Technologies.
This is a guest post by Yukinori Koide, the head of development for the Newspass department at Gunosy.
Gunosy is a news curation application that covers a wide range of topics, such as entertainment, sports, politics, and gourmet news. The application has been installed more than 20 million times.
Gunosy aims to provide people with the content they want without the stress of dealing with a large influx of information. We analyze user attributes, such as gender and age, and past activity logs like click-through rate (CTR). We combine this information with article attributes to provide trending, personalized news articles to users.
Users need fresh and personalized news. There are two constraints to consider when delivering appropriate articles:
Time: Articles are only fresh for a short while; they lose value over time, so new articles need to reach users as soon as possible.
Frequency (volume): Only a limited number of articles can be shown. It’s unreasonable to display all articles in the application, and users can’t read all of them anyway.
To deliver fresh articles with a high probability that the user is interested in them, it’s necessary to include not only past user activity logs and some feature values of articles, but also the most recent (real-time) user activity logs.
We optimize the delivery of articles with these two steps.
Personalization: Deliver articles based on each user’s attributes, past activity logs, and feature values of each article—to account for each user’s interests.
Trends analysis/identification: Optimize delivering articles using recent (real-time) user activity logs—to incorporate the latest trends from all users.
Optimizing the delivery of articles always starts cold. Initially, we deliver articles based on past logs, and we then use real-time data to optimize delivery as quickly as possible. In addition, news has a short shelf life: day-old news is old news, and even news that is three hours old is old news. Therefore, shortening the time between step 1 and step 2 is important.
To tackle this issue, we chose AWS for processing streaming data because of its fully managed services and cost-effectiveness.
Solution
The following diagrams depict the architecture for optimizing article delivery by processing real-time user activity logs.
There are three processing flows:
Process real-time user activity logs.
Store and process all user-based and article-based logs.
Execute ad hoc or heavy queries.
In this post, I focus on the first processing flow and explain how it works.
Process real-time user activity logs
The following are the steps for processing user activity logs in real time using Kinesis Data Streams and Kinesis Data Analytics.
The Fluentd server sends the following user activity logs to Kinesis Data Streams:
b. Insert the joined source stream and application reference data source into the temporary stream.
CREATE OR REPLACE PUMP "TMP_PUMP" AS
INSERT INTO "TMP_SQL_STREAM"
SELECT STREAM
R.GENDER, R.SEGMENT_ID, S.ARTICLE_ID, S.ACTION
FROM "SOURCE_SQL_STREAM_001" S
LEFT JOIN "REFERENCE_DATA_SOURCE" R
ON S.USER_ID = R.USER_ID;
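The pump above writes into a temporary in-application stream named TMP_SQL_STREAM. A minimal sketch of its definition (column types are assumptions based on the destination stream defined next) looks like this:

-- Sketch: temporary in-application stream used by TMP_PUMP
-- (column types are assumptions)
CREATE OR REPLACE STREAM "TMP_SQL_STREAM" (
    GENDER VARCHAR(32), SEGMENT_ID INTEGER, ARTICLE_ID INTEGER, ACTION VARCHAR(16)
);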
c. Define the destination stream named DESTINATION_SQL_STREAM.
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
TIME TIMESTAMP, GENDER VARCHAR(32), SEGMENT_ID INTEGER, ARTICLE_ID INTEGER,
IMPRESSION INTEGER, CLICK INTEGER
);
d. Insert the processed temporary stream, using a tumbling window, into the destination stream per minute.
CREATE OR REPLACE PUMP "STREAM_PUMP" AS
INSERT INTO "DESTINATION_SQL_STREAM"
SELECT STREAM
ROWTIME AS TIME,
GENDER, SEGMENT_ID, ARTICLE_ID,
SUM(CASE ACTION WHEN 'impression' THEN 1 ELSE 0 END) AS IMPRESSION,
SUM(CASE ACTION WHEN 'click' THEN 1 ELSE 0 END) AS CLICK
FROM "TMP_SQL_STREAM"
GROUP BY
GENDER, SEGMENT_ID, ARTICLE_ID,
FLOOR("TMP_SQL_STREAM".ROWTIME TO MINUTE);
Batch servers get results from Amazon ES every minute. They then optimize article delivery using other data sources and a proprietary optimization algorithm.
How to connect a stream to another stream in another AWS Region
When we built the solution, Kinesis Data Analytics was not available in the Asia Pacific (Tokyo) Region, so we used the US West (Oregon) Region. The following shows how we connected a data stream to another data stream in the other Region.
You don’t need to keep all components in a single AWS Region unless a response difference at the millisecond level is critical to the service.
Benefits
The solution provides benefits for both our company and for our users. Benefits for the company are cost savings—including development costs, operational costs, and infrastructure costs—and reducing delivery time. Users can now find articles of interest more quickly. The solution can process more than 500,000 records per minute, and it enables fast and personalized news curating for our users.
Conclusion
In this post, I showed you how we optimize trending user activities to personalize news using Amazon Kinesis Data Firehose, Amazon Kinesis Data Analytics, and related AWS services in Gunosy.
AWS gives us a quick and economical solution and a good experience.
If you have questions or suggestions, please comment below.
Yukinori Koide is the head of development for the Newspass department at Gunosy. He is working on standardization of provisioning and deployment flow, promoting the utilization of serverless and containers for machine learning and AI services. His favorite AWS services are DynamoDB, Lambda, Kinesis, and ECS.
Akihiro Tsukada is a start-up solutions architect with AWS. He supports start-up companies in Japan technically at many levels, ranging from seed to later-stage.
Yuta Ishii is a solutions architect with AWS. He works with our customers to provide architectural guidance for building media & entertainment services, helping them improve the value of their services when using AWS.
In late September, during the annual Splunk .conf, Splunk and Amazon Web Services (AWS) jointly announced that Amazon Kinesis Data Firehose now supports Splunk Enterprise and Splunk Cloud as a delivery destination. This native integration between Splunk Enterprise, Splunk Cloud, and Amazon Kinesis Data Firehose is designed to make AWS data ingestion setup seamless, while offering a secure and fault-tolerant delivery mechanism. We want to enable customers to monitor and analyze machine data from any source and use it to deliver operational intelligence and optimize IT, security, and business performance.
With Kinesis Data Firehose, customers can use a fully managed, reliable, and scalable data streaming solution to Splunk. In this post, we tell you a bit more about the Kinesis Data Firehose and Splunk integration. We also show you how to ingest large amounts of data into Splunk using Kinesis Data Firehose.
Push vs. Pull data ingestion
Presently, customers use a combination of two ingestion patterns, based primarily on data source and volume, in addition to existing company infrastructure and expertise:
Pull-based approach: Polling data from AWS sources using dedicated pollers, such as Splunk heavy forwarders, commonly running on Amazon EC2 instances.
Push-based approach: Streaming data directly from AWS to the Splunk HTTP Event Collector (HEC) by using AWS Lambda. Examples of applicable data sources include CloudWatch Logs and Amazon Kinesis Data Streams.
The pull-based approach offers data delivery guarantees such as retries and checkpointing out of the box. However, it requires more operational effort to manage and orchestrate the dedicated pollers, which commonly run on Amazon EC2 instances. With this setup, you pay for the infrastructure even when it’s idle.
On the other hand, the push-based approach offers a low-latency scalable data pipeline made up of serverless resources like AWS Lambda sending directly to Splunk indexers (by using Splunk HEC). This approach translates into lower operational complexity and cost. However, if you need guaranteed data delivery then you have to design your solution to handle issues such as a Splunk connection failure or Lambda execution failure. To do so, you might use, for example, AWS Lambda Dead Letter Queues.
How about getting the best of both worlds?
Let’s go over the new integration’s end-to-end solution and examine how Kinesis Data Firehose and Splunk together expand the push-based approach into a native AWS solution for applicable data sources.
By using a managed service like Kinesis Data Firehose for data ingestion into Splunk, we provide out-of-the-box reliability and scalability. One of the pain points of the old approach was the overhead of managing the data collection nodes (Splunk heavy forwarders). With the new Kinesis Data Firehose to Splunk integration, there are no forwarders to manage or set up. Data producers (1) are configured through the AWS Management Console to drop data into Kinesis Data Firehose.
You can also create your own data producers. For example, you can drop data into a Firehose delivery stream by using Amazon Kinesis Agent, or by using the Firehose API (PutRecord(), PutRecordBatch()), or by writing to a Kinesis Data Stream configured to be the data source of a Firehose delivery stream. For more details, refer to Sending Data to an Amazon Kinesis Data Firehose Delivery Stream.
You might need to transform the data before it goes into Splunk for analysis. For example, you might want to enrich it, filter it, or anonymize sensitive data. You can do so using AWS Lambda. In this scenario, Kinesis Data Firehose buffers the incoming source data, sends it to the specified Lambda function (2), and then rebuffers the transformed data before sending it to the Splunk cluster. Kinesis Data Firehose provides Lambda blueprints that you can use to create a Lambda function for data transformation.
Systems fail all the time. Let’s see how this integration handles outside failures to guarantee data durability. In cases when Kinesis Data Firehose can’t deliver data to the Splunk Cluster, data is automatically backed up to an S3 bucket. You can configure this feature while creating the Firehose delivery stream (3). You can choose to back up all data or only the data that’s failed during delivery to Splunk.
In addition to using S3 for data backup, this Firehose integration with Splunk supports Splunk Indexer Acknowledgments to guarantee event delivery. This feature is configured on Splunk’s HTTP Event Collector (HEC) (4). It ensures that HEC returns an acknowledgment to Kinesis Data Firehose only after data has been indexed and is available in the Splunk cluster (5).
Now let’s look at a hands-on exercise that shows how to forward VPC flow logs to Splunk.
How-to guide
To process VPC flow logs, we implement the following architecture.
Amazon Virtual Private Cloud (Amazon VPC) delivers flow log files into an Amazon CloudWatch Logs group. Using a CloudWatch Logs subscription filter, we set up real-time delivery of CloudWatch Logs to a Kinesis Data Firehose delivery stream.
Data coming from CloudWatch Logs is compressed with gzip compression. To work with this compression, we need to configure a Lambda-based data transformation in Kinesis Data Firehose to decompress the data and deposit it back into the stream. Firehose then delivers the raw logs to the Splunk HTTP Event Collector (HEC).
If delivery to the Splunk HEC fails, Firehose deposits the logs into an Amazon S3 bucket. You can then ingest the events from S3 using an alternate mechanism such as a Lambda function.
When data reaches Splunk (Enterprise or Cloud), Splunk parsing configurations (packaged in the Splunk Add-on for Kinesis Data Firehose) extract and parse all fields. They make data ready for querying and visualization using Splunk Enterprise and Splunk Cloud.
Walkthrough
Install the Splunk Add-on for Amazon Kinesis Data Firehose
The Splunk Add-on for Amazon Kinesis Data Firehose enables Splunk (be it Splunk Enterprise, Splunk App for AWS, or Splunk Enterprise Security) to use data ingested from Amazon Kinesis Data Firehose. Install the Add-on on all the indexers with an HTTP Event Collector (HEC). The Add-on is available for download from Splunkbase.
HTTP Event Collector (HEC)
Before you can use Kinesis Data Firehose to deliver data to Splunk, set up the Splunk HEC to receive the data. From Splunk web, go to the Setting menu, choose Data Inputs, and choose HTTP Event Collector. Choose Global Settings, ensure All tokens is enabled, and then choose Save. Then choose New Token to create a new HEC endpoint and token. When you create a new token, make sure that Enable indexer acknowledgment is checked.
When prompted to select a source type, select aws:cloudwatch:vpcflow.
Create an S3 backsplash bucket
To provide for situations in which Kinesis Data Firehose can’t deliver data to the Splunk Cluster, we use an S3 bucket to back up the data. You can configure this feature to back up all data or only the data that’s failed during delivery to Splunk.
Note: Bucket names are unique. Thus, you can’t use tmak-backsplash-bucket.
Create an IAM role for the Lambda transform function
Firehose triggers an AWS Lambda function that transforms the data in the delivery stream. Let’s first create a role for the Lambda function called LambdaBasicRole.
Note: You can also set this role up when creating your Lambda function.
$ aws iam create-role --role-name LambdaBasicRole --assume-role-policy-document file://TrustPolicyForLambda.json
After the role is created, attach the managed Lambda basic execution policy to it.
$ aws iam attach-role-policy \
    --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole \
    --role-name LambdaBasicRole
Create a Firehose Stream
On the AWS console, open the Amazon Kinesis service, go to the Firehose console, and choose Create Delivery Stream.
In the next section, you can specify whether you want to use an inline Lambda function for transformation. Because incoming CloudWatch Logs are gzip compressed, choose Enabled for Record transformation, and then choose Create new.
From the list of the available blueprint functions, choose Kinesis Data Firehose CloudWatch Logs Processor. This function unzips the data and places it back into the Firehose stream in compliance with the record transformation output model.
Enter a name for the Lambda function, choose Choose an existing role, and then choose the role you created earlier. Then choose Create Function.
Go back to the Firehose Stream wizard, choose the Lambda function you just created, and then choose Next.
Select Splunk as the destination, and enter your Splunk HTTP Event Collector information.
Note: Amazon Kinesis Data Firehose requires the Splunk HTTP Event Collector (HEC) endpoint to be terminated with a valid CA-signed certificate matching the DNS hostname used to connect to your HEC endpoint. You receive delivery errors if you are using a self-signed certificate.
In this example, we only back up logs that fail during delivery.
To monitor your Firehose delivery stream, enable error logging. Doing this means that you can monitor record delivery errors.
Create an IAM role for the Firehose stream by choosing Create new, or Choose. Doing this brings you to a new screen. Choose Create a new IAM role, give the role a name, and then choose Allow.
If you look at the policy document, you can see that the role gives Kinesis Data Firehose permission to publish error logs to CloudWatch, execute your Lambda function, and put records into your S3 backup bucket.
You now get a chance to review and adjust the Firehose stream settings. When you are satisfied, choose Create Stream. You get a confirmation once the stream is created and active.
Create a VPC Flow Log
To send events from Amazon VPC, you need to set up a VPC flow log. If you already have a VPC flow log you want to use, you can skip to the “Publish CloudWatch to Kinesis Data Firehose” section.
On the AWS console, open the Amazon VPC service. Then choose VPC, Your VPC, and choose the VPC you want to send flow logs from. Choose Flow Logs, and then choose Create Flow Log. If you don’t have an IAM role that allows your VPC to publish logs to CloudWatch, choose Set Up Permissions and Create new role. Use the defaults when presented with the screen to create the new IAM role.
Once active, your VPC flow log should look like the following.
Publish CloudWatch to Kinesis Data Firehose
When you generate traffic to or from your VPC, the log group is created in Amazon CloudWatch. The new log group has no subscription filter, so set up a subscription filter. Setting this up establishes a real-time data feed from the log group to your Firehose delivery stream.
At present, you have to use the AWS Command Line Interface (AWS CLI) to create a CloudWatch Logs subscription to a Kinesis Data Firehose stream. However, you can use the AWS console to create subscriptions to Lambda and Amazon Elasticsearch Service.
To allow CloudWatch to publish to your Firehose stream, you need to give it permissions.
$ aws iam create-role --role-name CWLtoKinesisFirehoseRole --assume-role-policy-document file://TrustPolicyForCWLToFireHose.json
Here is the content for TrustPolicyForCWLToFireHose.json.
When you run the preceding AWS CLI command, you don’t get any acknowledgment. To validate that your CloudWatch Logs group is subscribed to your Firehose stream, check the CloudWatch console.
As soon as the subscription filter is created, the real-time log data from the log group goes into your Firehose delivery stream. Your stream then delivers it to your Splunk Enterprise or Splunk Cloud environment for querying and visualization. The screenshot following is from Splunk Enterprise.
In addition, you can monitor and view metrics associated with your delivery stream using the AWS console.
Conclusion
Although our walkthrough uses VPC Flow Logs, the pattern can be used in many other scenarios. These include ingesting data from AWS IoT, other CloudWatch logs and events, and Kinesis streams or other data sources using the Kinesis Agent or Kinesis Producer Library. We also used the Kinesis Data Firehose CloudWatch Logs Processor Lambda blueprint to transform streaming records from Kinesis Data Firehose. However, you might need to use a different Lambda blueprint or disable record transformation entirely, depending on your use case. For an additional use case using Kinesis Data Firehose, check out the This Is My Architecture video, which discusses how to securely centralize cross-account data analytics using Kinesis and Splunk.
Tarik Makota is a solutions architect with the Amazon Web Services Partner Network. He provides technical guidance, design advice, and thought leadership to AWS’s most strategic software partners. His career includes work in a broad range of software development and architecture roles across ERP, financial printing, benefit delivery and administration, and financial services. He holds an M.S. in Software Development and Management from Rochester Institute of Technology.
Roy Arsan is a solutions architect in the Splunk Partner Integrations team. He has a background in product development, cloud architecture, and building consumer and enterprise cloud applications. More recently, he has architected Splunk solutions on major cloud providers, including an AWS Quick Start for Splunk that enables AWS users to easily deploy distributed Splunk Enterprise straight from their AWS console. He’s also the co-author of the AWS Lambda blueprints for Splunk. He holds an M.S. in Computer Science Engineering from the University of Michigan.
This is a guest post by Rafi Ton, founder and CEO of NUVIAD. NUVIAD is, in their own words, “a mobile marketing platform providing professional marketers, agencies and local businesses state of the art tools to promote their products and services through hyper targeting, big data analytics and advanced machine learning tools.”
At NUVIAD, we’ve been using Amazon Redshift as our main data warehouse solution for more than 3 years.
We store massive amounts of ad transaction data that our users and partners analyze to determine ad campaign strategies. When running real-time bidding (RTB) campaigns in large scale, data freshness is critical so that our users can respond rapidly to changes in campaign performance. We chose Amazon Redshift because of its simplicity, scalability, performance, and ability to load new data in near real time.
Over the past three years, our customer base grew significantly and so did our data. We saw our Amazon Redshift cluster grow from three nodes to 65 nodes. To balance cost and analytics performance, we looked for a way to store large amounts of less-frequently analyzed data at a lower cost. Yet, we still wanted to have the data immediately available for user queries and to meet their expectations for fast performance. We turned to Amazon Redshift Spectrum.
In this post, I explain the reasons why we extended Amazon Redshift with Redshift Spectrum as our modern data warehouse. I cover how our data growth and the need to balance cost and performance led us to adopt Redshift Spectrum. I also share key performance metrics in our environment, and discuss the additional AWS services that provide a scalable and fast environment, with data available for immediate querying by our growing user base.
Amazon Redshift as our foundation
The ability to provide fresh, up-to-the-minute data to our customers and partners was always a main goal with our platform. We saw other solutions provide data that was a few hours old, but this was not good enough for us. We insisted on providing the freshest data possible. For us, that meant loading Amazon Redshift in frequent micro batches and allowing our customers to query Amazon Redshift directly to get results in near real time.
The benefits were immediately evident. Our customers could see how their campaigns performed faster than with other solutions, and react sooner to the ever-changing media supply pricing and availability. They were very happy.
However, this approach required Amazon Redshift to store a lot of data for long periods, and our data grew substantially. At our peak, we maintained a cluster running 65 DC1.large nodes. The impact on our Amazon Redshift cluster was evident: we saw our CPU utilization grow to 90%.
Why we extended Amazon Redshift to Redshift Spectrum
Redshift Spectrum gives us the ability to run SQL queries using the powerful Amazon Redshift query engine against data stored in Amazon S3, without needing to load the data. With Redshift Spectrum, we store data where we want, at the cost that we want. We have the data available for analytics when our users need it with the performance they expect.
Seamless scalability, high performance, and unlimited concurrency
Scaling Redshift Spectrum is a simple process. First, it allows us to leverage Amazon S3 as the storage engine and get practically unlimited data capacity.
Second, if we need more compute power, we can leverage Redshift Spectrum’s distributed compute engine over thousands of nodes to provide superior performance – perfect for complex queries running against massive amounts of data.
Third, all Redshift Spectrum clusters access the same data catalog so that we don’t have to worry about data migration at all, making scaling effortless and seamless.
Lastly, since Redshift Spectrum distributes queries across potentially thousands of nodes, they are not affected by other queries, providing much more stable performance and unlimited concurrency.
Keeping it SQL
Redshift Spectrum uses the same query engine as Amazon Redshift. This means that we did not need to change our BI tools or query syntax, whether we used complex queries across a single table or joins across multiple tables.
An interesting capability introduced recently is the ability to create a view that spans both Amazon Redshift and Redshift Spectrum external tables. With this feature, you can query frequently accessed data in your Amazon Redshift cluster and less-frequently accessed data in Amazon S3, using a single view.
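A sketch of such a view is shown below; the table and column names are placeholders, and the WITH NO SCHEMA BINDING clause is what allows the view to reference an external (Spectrum) table:

-- Sketch: late-binding view spanning a local table and a Spectrum external table
-- (table and column names are placeholders)
CREATE VIEW public.all_events AS
SELECT user_id, campaign_id, ts FROM public.recent_events
UNION ALL
SELECT user_id, campaign_id, ts FROM spectrum.historical_events
WITH NO SCHEMA BINDING;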
Leveraging Parquet for higher performance
Parquet is a columnar data format that provides superior performance and allows Redshift Spectrum (or Amazon Athena) to scan significantly less data. With less I/O, queries run faster and we pay less per query. You can read all about Parquet at https://parquet.apache.org/ or https://en.wikipedia.org/wiki/Apache_Parquet.
Lower cost
From a cost perspective, we pay standard rates for our data in Amazon S3, and only small amounts per query to analyze data with Redshift Spectrum. Using the Parquet format, we can significantly reduce the amount of data scanned. Our costs are now lower, and our users get fast results even for large complex queries.
What we learned about Amazon Redshift vs. Redshift Spectrum performance
When we first started looking at Redshift Spectrum, we wanted to put it to the test. We wanted to know how it would compare to Amazon Redshift, so we looked at two key questions:
What is the performance difference between Amazon Redshift and Redshift Spectrum on simple and complex queries?
Does the data format impact performance?
During the migration phase, we had our dataset stored in Amazon Redshift and S3 as CSV/GZIP and as Parquet file formats. We tested three configurations:
Amazon Redshift cluster with 28 DC1.large nodes
Redshift Spectrum using CSV/GZIP
Redshift Spectrum using Parquet
We performed benchmarks for simple and complex queries on one month’s worth of data. We tested how much time it took to perform the query, and how consistent the results were when running the same query multiple times. The data we used for the tests was already partitioned by date and hour. Properly partitioning the data improves performance significantly and reduces query times.
Simple query
First, we tested a simple query aggregating billing data across a month:
SELECT
user_id,
count(*) AS impressions,
SUM(billing)::decimal /1000000 AS billing
FROM <table_name>
WHERE
date >= '2017-08-01' AND
date <= '2017-08-31'
GROUP BY
user_id;
We ran the same query seven times and measured the response times:
Execution Time (seconds)
Run       Amazon Redshift    Redshift Spectrum CSV    Redshift Spectrum Parquet
Run #1    39.65              45.11                    11.92
Run #2    15.26              43.13                    12.05
Run #3    15.27              46.47                    13.38
Run #4    21.22              51.02                    12.74
Run #5    17.27              43.35                    11.76
Run #6    16.67              44.23                    13.67
Run #7    25.37              40.39                    12.75
Average   21.53              44.82                    12.61
For simple queries, Amazon Redshift performed better than Redshift Spectrum, as we thought, because the data is local to Amazon Redshift.
What was surprising was that using Parquet data format in Redshift Spectrum significantly beat ‘traditional’ Amazon Redshift performance. For our queries, using Parquet data format with Redshift Spectrum delivered an average 40% performance gain over traditional Amazon Redshift. Furthermore, Redshift Spectrum showed high consistency in execution time with a smaller difference between the slowest run and the fastest run.
Comparing the amount of data scanned when using CSV/GZIP and Parquet, the difference was also significant:
Data Scanned (GB)
CSV (GZIP)    135.49
Parquet       2.83
Because we pay only for the data scanned by Redshift Spectrum, the cost saving of using Parquet is evident and substantial.
Complex query
Next, we compared the same three configurations with a complex query.
Execution Time (seconds)
Run       Amazon Redshift    Redshift Spectrum CSV    Redshift Spectrum Parquet
Run #1    329.80             84.20                    42.40
Run #2    167.60             65.30                    35.10
Run #3    165.20             62.20                    23.90
Run #4    273.90             74.90                    55.90
Run #5    167.70             69.00                    58.40
Average   220.84             71.12                    43.14
This time, Redshift Spectrum using Parquet cut the average query time by 80% compared to traditional Amazon Redshift!
Bottom line: For complex queries, Redshift Spectrum provided a 67% performance gain over Amazon Redshift. Using the Parquet data format, Redshift Spectrum delivered an 80% performance improvement over Amazon Redshift. For us, this was substantial.
Optimizing the data structure for different workloads
Because S3 storage is relatively inexpensive and we pay only for the data scanned by each query, we believe it makes sense to keep our data in different formats for different workloads and different analytics engines. It is important to note that we can have any number of tables pointing to the same data on S3; it all depends on how we partition the data and update the table partitions.
Data permutations
For example, we have a process that runs every minute and generates statistics for the last minute of data collected. With Amazon Redshift, this would be done by running a query like the following against the table:
SELECT
user,
COUNT(*)
FROM
events_table
WHERE
ts BETWEEN '2017-08-01 14:00:00' AND '2017-08-01 14:00:59'
GROUP BY
user;
(Assuming ‘ts’ is your column storing the time stamp for each event.)
With Redshift Spectrum, we pay for the data scanned in each query. If the data is partitioned by the minute instead of the hour, a query looking at one minute would be 1/60th the cost. If we use a temporary table that points only to the data of the last minute, we save that unnecessary cost.
Creating Parquet data efficiently
On average, we have 800 instances that process our traffic. Each instance sends events that are eventually loaded into Amazon Redshift. When we started three years ago, we would offload data from each server to S3 and then perform a periodic COPY command from S3 to Amazon Redshift.
Recently, Amazon Kinesis Firehose added the capability to offload data directly to Amazon Redshift. While this is now a viable option, we kept the same collection process that worked flawlessly and efficiently for three years.
This changed, however, when we incorporated Redshift Spectrum. With Redshift Spectrum, we needed to find a way to:
Collect the event data from the instances.
Save the data in Parquet format.
Partition the data effectively.
To accomplish this, we save the data as CSV and then transform it to Parquet. The most effective method to generate the Parquet files is to:
Send the data in one-minute intervals from the instances to Kinesis Firehose with an S3 temporary bucket as the destination.
Aggregate hourly data and convert it to Parquet using AWS Lambda and AWS Glue.
Add the Parquet data to S3 by updating the table partitions.
With this new process, we had to give more attention to validating the data before we sent it to Kinesis Firehose, because a single corrupted record in a partition fails queries on that partition.
Data validation
To store our click data in a table, we considered the following SQL create table command:
CREATE EXTERNAL TABLE spectrum.blog_clicks (
user_id varchar(50),
campaign_id varchar(50),
os varchar(50),
ua varchar(255),
ts bigint,
billing float
)
PARTITIONED BY (date date, hour smallint)
STORED AS PARQUET
LOCATION 's3://nuviad-temp/blog/clicks/';
The preceding statement defines a new external table (all Redshift Spectrum tables are external tables) with a few attributes. We store ts as a Unix timestamp rather than a Timestamp type, and billing as float rather than decimal (more on that later). The data is partitioned by date and hour, and stored as Parquet on S3.
First, we need to get the table definitions. This can be achieved by running the following query:
SELECT
*
FROM
svv_external_columns
WHERE
tablename = 'blog_clicks';
This query lists all the columns in the table with their respective definitions:
schemaname | tablename | columnname | external_type | columnnum | part_key
spectrum | blog_clicks | user_id | varchar(50) | 1 | 0
spectrum | blog_clicks | campaign_id | varchar(50) | 2 | 0
spectrum | blog_clicks | os | varchar(50) | 3 | 0
spectrum | blog_clicks | ua | varchar(255) | 4 | 0
spectrum | blog_clicks | ts | bigint | 5 | 0
spectrum | blog_clicks | billing | double | 6 | 0
spectrum | blog_clicks | date | date | 7 | 1
spectrum | blog_clicks | hour | smallint | 8 | 2
Now we can use this data to create a validation schema for our data:
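The schema itself isn't reproduced here, but a minimal sketch of what it could look like follows, derived from the column definitions listed above. The numeric bounds and the clickSchema name are our illustrative assumptions, not the exact values we use in production:

// Illustrative validation schema derived from the blog_clicks column definitions above.
// Bounds are assumptions for the sake of the example.
const clickSchema = [
  { name: 'user_id',     type: 'string',  max_length: 50 },
  { name: 'campaign_id', type: 'string',  max_length: 50 },
  { name: 'os',          type: 'string',  max_length: 50 },
  { name: 'ua',          type: 'string',  max_length: 255 },
  { name: 'ts',          type: 'integer', min_value: 0, max_value: Number.MAX_SAFE_INTEGER },
  { name: 'billing',     type: 'float',   min_value: 0, max_value: 1000 }
];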
Next, we create a function that uses this schema to validate data:
// Validates a single value against its column schema definition.
function valueIsValid(value, schema) {
  if (schema.type == 'string') {
    return (typeof value == 'string' && value.length <= schema.max_length);
  }
  else if (schema.type == 'integer') {
    return (typeof value == 'number' && value >= schema.min_value && value <= schema.max_value);
  }
  else if (schema.type == 'float' || schema.type == 'double') {
    return (typeof value == 'number' && value >= schema.min_value && value <= schema.max_value);
  }
  else if (schema.type == 'boolean') {
    return typeof value == 'boolean';
  }
  else if (schema.type == 'timestamp') {
    return (new Date(value)).getTime() > 0;
  }
  else {
    return true;
  }
}
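To tie the schema and the function together, validating a full event might look like the following. This is a hedged sketch: itemIsValid, the sample field values, and clickSchema are our illustrations, assuming an event is an array of values in column order:

// Validate every field of an event (an array of values in column order) against the schema.
function itemIsValid(item, schema) {
  for (let i = 0; i < schema.length; i++) {
    if (!valueIsValid(item[i], schema[i])) {
      return false;
    }
  }
  return true;
}

let item = ['user-123', 'campaign-456', 'iOS', 'Mozilla/5.0', 1501596366, 0.000123]; // illustrative event
let validated = itemIsValid(item, clickSchema);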
Near real-time data loading with Kinesis Firehose
On Kinesis Firehose, we created a new delivery stream to handle the events as follows:
Delivery stream name: events
Source: Direct PUT
S3 bucket: nuviad-events
S3 prefix: rtb/
IAM role: firehose_delivery_role_1
Data transformation: Disabled
Source record backup: Disabled
S3 buffer size (MB): 100
S3 buffer interval (sec): 60
S3 Compression: GZIP
S3 Encryption: No Encryption
Status: ACTIVE
Error logging: Enabled
This delivery stream aggregates event data every minute, or up to 100 MB, and writes the data to an S3 bucket as a CSV/GZIP compressed file. Next, after we have the data validated, we can safely send it to our Kinesis Firehose API:
const AWS = require('aws-sdk');
const firehose = new AWS.Firehose();

if (validated) {
  let itemString = item.join('|') + '\n'; // CSV delimited by pipe, with a trailing newline
  let params = {
    DeliveryStreamName: 'events',
    Record: {
      Data: itemString
    }
  };
  firehose.putRecord(params, function(err, data) {
    if (err) {
      console.error(err, err.stack);
    }
    else {
      // Continue to your next step
    }
  });
}
Now, we have a single CSV file representing one minute of event data stored in S3. The files are named automatically by Kinesis Firehose by adding a UTC time prefix in the format YYYY/MM/DD/HH before writing objects to S3. Because we use the date and hour as partitions, we need to change the file naming and location to fit our Redshift Spectrum schema.
Automating data distribution using AWS Lambda
We created a simple Lambda function triggered by an S3 put event that copies the file to a different location (or locations), while renaming it to fit our data structure and processing flow. As mentioned before, the files generated by Kinesis Firehose are structured in a pre-defined hierarchy, such as:
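For example, an object key written by Kinesis Firehose looks like the following (the same structure that the Lambda code below parses):

your-prefix/2017/08/01/20/event-4-2017-08-01-20-06-06-536f5c40-6893-4ee4-907d-81e4d3b09455.gz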
All we need to do is parse the object name and restructure it as we see fit. In our case, we did the following (the event is an object received in the Lambda function with all the data about the object written to S3):
/*
object key structure in the event object:
your-prefix/2017/08/01/20/event-4-2017-08-01-20-06-06-536f5c40-6893-4ee4-907d-81e4d3b09455.gz
*/
let key_parts = event.Records[0].s3.object.key.split('/');
let event_type = key_parts[0];
let date = key_parts[1] + '-' + key_parts[2] + '-' + key_parts[3];
let hour = key_parts[4];
// Strip a leading zero (for example, '06' becomes '6') so the value matches our partition naming
if (hour.indexOf('0') == 0) {
  hour = parseInt(hour, 10) + '';
}
let parts1 = key_parts[5].split('-');
let minute = parts1[7];
if (minute.indexOf('0') == 0) {
  minute = parseInt(minute, 10) + '';
}
Now, we can redistribute the file to the two destinations we need—one for the minute processing task and the other for hourly aggregation:
Kinesis Firehose stores the data in a temporary folder. We copy the object to another folder that holds the data for the last processed minute. This folder is connected to a small Redshift Spectrum table where the data is being processed without needing to scan a much larger dataset. We also copy the data to a folder that holds the data for the entire hour, to be later aggregated and converted to Parquet.
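As a rough sketch of this step (our illustration, not the exact code; the destination bucket and folder names below are assumptions), the copy can be done with the AWS SDK for JavaScript in the same Lambda function:

const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// Source object written by Kinesis Firehose (from the S3 event shown earlier)
let sourceBucket = event.Records[0].s3.bucket.name;
let sourceKey = event.Records[0].s3.object.key;

// Illustrative destinations: one folder for the last processed minute, one for the full hour
let destinations = [
  { Bucket: 'nuviad-temp', Key: 'events-minute/' + date + '-' + hour + '-' + minute + '/' + key_parts[5] },
  { Bucket: 'nuviad-temp', Key: 'events-csv/' + date + '/' + hour + '/' + key_parts[5] }
];

destinations.forEach(function(dest) {
  s3.copyObject({ Bucket: dest.Bucket, Key: dest.Key, CopySource: sourceBucket + '/' + sourceKey },
    function(err, data) {
      if (err) console.error(err, err.stack);
    });
});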
Because we partition the data by date and hour, we create a new partition on the Redshift Spectrum table whenever the processed minute is the first minute of the hour (that is, minute 0). We run the following:
ALTER TABLE
spectrum.events
ADD partition
(date='2017-08-01', hour=0)
LOCATION 's3://nuviad-temp/events/2017-08-01/0/';
After the data is processed and added to the table, we delete the processed data from the temporary Kinesis Firehose storage and from the minute storage folder.
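The cleanup can be a couple of deleteObject calls; again, this is a hedged sketch, and the keys mirror the illustrative layout above rather than our exact folder structure:

// Delete the temporary Firehose object and the copy in the minute folder once processed
[
  { Bucket: sourceBucket, Key: sourceKey },
  { Bucket: 'nuviad-temp', Key: 'events-minute/' + date + '-' + hour + '-' + minute + '/' + key_parts[5] }
].forEach(function(obj) {
  s3.deleteObject({ Bucket: obj.Bucket, Key: obj.Key }, function(err, data) {
    if (err) console.error(err, err.stack);
  });
});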
Migrating CSV to Parquet using AWS Glue and Amazon EMR
The simplest way we found to run an hourly job converting our CSV data to Parquet is using Lambda and AWS Glue (and thanks to the awesome AWS Big Data team for their help with this).
Creating AWS Glue jobs
What this simple AWS Glue script does:
Gets parameters for the job, date, and hour to be processed
Creates a Spark EMR context allowing us to run Spark code
Reads CSV data into a DataFrame
Writes the data as Parquet to the destination S3 bucket
Adds or modifies the Redshift Spectrum / Amazon Athena table partition for the table
Note: Because Redshift Spectrum and Athena both use the AWS Glue Data Catalog, we could use the Athena client to add the partition to the table.
Here are a few words about float, decimal, and double. Using decimal proved to be more challenging than we expected, as it seems that Redshift Spectrum and Spark use them differently. Whenever we used decimal in Redshift Spectrum and in Spark, we kept getting errors, such as:
S3 Query Exception (Fetch). Task failed due to an internal error. File 'https://s3-external-1.amazonaws.com/nuviad-temp/events/2017-08-01/hour=2/part-00017-48ae5b6b-906e-4875-8cde-bc36c0c6d0ca.c000.snappy.parquet has an incompatible Parquet schema for column 's3://nuviad-events/events.lat'. Column type: DECIMAL(18, 8), Parquet schema:\noptional float lat [i:4 d:1 r:0]\n (https://s3-external-1.amazonaws.com/nuviad-temp/events/2017-08-01/hour=2/part-00017-48ae5b6b-906e-4875-8cde-bc36c0c6d0ca.c000.snappy.parq
We had to experiment with a few floating-point formats until we found that the only combination that worked was to define the column as double in the Spark code and float in Spectrum. This is the reason you see billing defined as float in Spectrum and double in the Spark code.
Creating a Lambda function to trigger conversion
Next, we created a small Lambda function that triggers the AWS Glue script hourly using a few lines of Python:
Using Amazon CloudWatch Events, we trigger this function hourly. This function triggers an AWS Glue job named ‘convertEventsParquetHourly’ and runs it for the previous hour, passing job names and values of the partitions to process to AWS Glue.
Redshift Spectrum and Node.js
Our development stack is based on Node.js, which is well-suited for high-speed, light servers that need to process a huge number of transactions. However, a few limitations of the Node.js environment required us to create workarounds and use other tools to complete the process.
Node.js and Parquet
The lack of Parquet modules for Node.js required us to implement an AWS Glue/Amazon EMR process to effectively migrate data from CSV to Parquet. We would rather save directly to Parquet, but we couldn’t find an effective way to do it.
One interesting project in the works is the development of a Parquet NPM by Marc Vertes called node-parquet (https://www.npmjs.com/package/node-parquet). It is not in a production state yet, but we think it would be well worth following the progress of this package.
Timestamp data type
According to the Parquet documentation, Timestamp data are stored in Parquet as 64-bit integers. However, JavaScript does not support 64-bit integers, because the native number type is a 64-bit double, giving only 53 bits of integer range.
The result is that you cannot store Timestamp correctly in Parquet using Node.js. The solution is to store Timestamp as string and cast the type to Timestamp in the query. Using this method, we did not witness any performance degradation whatsoever.
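For completeness, here is a minimal sketch (our illustration, not the original code) of turning an epoch timestamp in milliseconds into a string that can later be cast to a timestamp in the query:

// Convert epoch milliseconds to a 'YYYY-MM-DD HH:MM:SS' string (UTC)
function timestampToString(epochMillis) {
  return new Date(epochMillis).toISOString().replace('T', ' ').substring(0, 19);
}
// timestampToString(1501596366000) returns '2017-08-01 14:06:06'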
Lessons learned
You can benefit from our trial-and-error experience.
Lesson #1: Data validation is critical
As mentioned earlier, a single corrupt entry in a partition can fail queries running against this partition, especially when using Parquet, which is harder to edit than a simple CSV file. Make sure that you validate your data before scanning it with Redshift Spectrum.
Lesson #2: Structure and partition data effectively
One of the biggest benefits of using Redshift Spectrum (or Athena for that matter) is that you don’t need to keep nodes up and running all the time. You pay only for the queries you perform and only for the data scanned per query.
Keeping different permutations of your data for different queries makes a lot of sense in this case. For example, you can partition your data by date and hour to run time-based queries, and also have another set partitioned by user_id and date to run user-based queries. This results in faster and more efficient performance of your data warehouse.
Storing data in the right format
Use Parquet whenever you can. The benefits of Parquet are substantial: faster performance, less data to scan, and a much more efficient columnar format. However, it is not supported out of the box by Kinesis Firehose, so you need to implement your own ETL. AWS Glue is a great option.
Creating small tables for frequent tasks
When we started using Redshift Spectrum, we saw our Amazon Redshift costs jump by hundreds of dollars per day. Then we realized that we were unnecessarily scanning a full day’s worth of data every minute. Take advantage of the ability to define multiple tables on the same S3 bucket or folder, and create temporary and small tables for frequent queries.
Lesson #3: Combine Athena and Redshift Spectrum for optimal performance
Moving to Redshift Spectrum also allowed us to take advantage of Athena as both use the AWS Glue Data Catalog. Run fast and simple queries using Athena while taking advantage of the advanced Amazon Redshift query engine for complex queries using Redshift Spectrum.
Redshift Spectrum excels when running complex queries. It can push many compute-intensive tasks, such as predicate filtering and aggregation, down to the Redshift Spectrum layer, so that queries use much less of your cluster’s processing capacity.
Lesson #4: Sort your Parquet data within the partition
We achieved another performance improvement by sorting data within each partition using sortWithinPartitions(sort_field).
We have been extremely pleased using Amazon Redshift as our core data warehouse for over three years. But as our client base and volume of data grew substantially, we extended Amazon Redshift to take advantage of the scalability, performance, and cost benefits of Redshift Spectrum.
Redshift Spectrum lets us scale to virtually unlimited storage, scale compute transparently, and deliver super-fast results for our users. With Redshift Spectrum, we store data where we want at the cost we want, and have the data available for analytics when our users need it with the performance they expect.
About the Author
With 7 years of experience in the AdTech industry and 15 years in leading technology companies, Rafi Ton is the founder and CEO of NUVIAD. He enjoys exploring new technologies and putting them to use in cutting-edge products and services that generate real money in the real world. As an experienced entrepreneur, Rafi believes in practical programming and fast adoption of new technologies to achieve a significant market advantage.
We can’t believe that there are just a few days left before re:Invent 2017. If you are attending this year, you’ll want to check out our Big Data sessions! The Big Data and Machine Learning categories are bigger than ever. As in previous years, you can find these sessions in various tracks, including Analytics & Big Data, Deep Learning Summit, Artificial Intelligence & Machine Learning, Architecture, and Databases.
We have great sessions from organizations and companies like Vanguard, Cox Automotive, Pinterest, Netflix, FINRA, Amtrak, AmazonFresh, Sysco Foods, Twilio, American Heart Association, Expedia, Esri, Nextdoor, and many more. All sessions are recorded and made available on YouTube. In addition, all slide decks from the sessions will be available on SlideShare.net after the conference.
This post highlights the sessions that will be presented as part of the Analytics & Big Data track, as well as relevant sessions from other tracks like Architecture, Artificial Intelligence & Machine Learning, and IoT. If you’re interested in Machine Learning sessions, don’t forget to check out our Guide to Machine Learning at re:Invent 2017.
This year’s session catalog contains the following breakout sessions.
Raju Gulabani, VP, Database, Analytics and AI at AWS will discuss the evolution of database and analytics services in AWS, the new database and analytics services and features we launched this year, and our vision for continued innovation in this space. We are witnessing an unprecedented growth in the amount of data collected, in many different forms. Storage, management, and analysis of this data require database services that scale and perform in ways not possible before. AWS offers a collection of database and other data services—including Amazon Aurora, Amazon DynamoDB, Amazon RDS, Amazon Redshift, Amazon ElastiCache, Amazon Kinesis, and Amazon EMR—to process, store, manage, and analyze data. In this session, we provide an overview of AWS database and analytics services and discuss how customers are using these services today.
Deep dive customer use cases
ABD401 – How Netflix Monitors Applications in Near Real-Time with Amazon Kinesis Thousands of services work in concert to deliver millions of hours of video streams to Netflix customers every day. These applications vary in size, function, and technology, but they all make use of the Netflix network to communicate. Understanding the interactions between these services is a daunting challenge both because of the sheer volume of traffic and the dynamic nature of deployments. In this session, we first discuss why Netflix chose Kinesis Streams to address these challenges at scale. We then dive deep into how Netflix uses Kinesis Streams to enrich network traffic logs and identify usage patterns in real time. Lastly, we cover how Netflix uses this system to build comprehensive dependency maps, increase network efficiency, and improve failure resiliency. From this session, you will learn how to build a real-time application monitoring system using network traffic logs and get real-time, actionable insights.
In this session, learn how Nextdoor replaced their home-grown data pipeline based on a topology of Flume nodes with a completely serverless architecture based on Kinesis and Lambda. By making these changes, they improved both the reliability of their data and the delivery times of billions of records of data to their Amazon S3–based data lake and Amazon Redshift cluster. Nextdoor is a private social networking service for neighborhoods.
ABD205 – Taking a Page Out of Ivy Tech’s Book: Using Data for Student Success Data speaks. Discover how Ivy Tech, the nation’s largest singly accredited community college, uses AWS to gather, analyze, and take action on student behavioral data for the betterment of over 3,100 students. This session outlines the process from inception to implementation across the state of Indiana and highlights how Ivy Tech’s model can be applied to your own complex business problems.
ABD207 – Leveraging AWS to Fight Financial Crime and Protect National Security Banks aren’t known to share data and collaborate with one another. But that is exactly what the Mid-Sized Bank Coalition of America (MBCA) is doing to fight digital financial crime—and protect national security. Using the AWS Cloud, the MBCA developed a shared data analytics utility that processes terabytes of non-competitive customer account, transaction, and government risk data. The intelligence produced from the data helps banks increase the efficiency of their operations, cut labor and operating costs, and reduce false positive volumes. The collective intelligence also allows greater enforcement of Anti-Money Laundering (AML) regulations by helping members detect internal risks—and identify the challenges to detecting these risks in the first place. This session demonstrates how the AWS Cloud supports the MBCA to deliver advanced data analytics, provide consistent operating models across financial institutions, reduce costs, and strengthen national security.
ABD208 – Cox Automotive Empowered to Scale with Splunk Cloud & AWS and Explores New Innovation with Amazon Kinesis Firehose In this session, learn how Cox Automotive is using Splunk Cloud for real time visibility into its AWS and hybrid environments to achieve near instantaneous MTTI, reduce auction incidents by 90%, and proactively predict outages. We also introduce a highly anticipated capability that allows you to ingest, transform, and analyze data in real time using Splunk and Amazon Kinesis Firehose to gain valuable insights from your cloud resources. It’s now quicker and easier than ever to gain access to analytics-driven infrastructure monitoring using Splunk Enterprise & Splunk Cloud.
ABD209 – Accelerating the Speed of Innovation with a Data Sciences Data & Analytics Hub at Takeda Historically, silos of data, analytics, and processes across functions, stages of development, and geography created a barrier to R&D efficiency. Gathering the right data necessary for decision-making was challenging due to issues of accessibility, trust, and timeliness. In this session, learn how Takeda is undergoing a transformation in R&D to increase the speed-to-market of high-impact therapies to improve patient lives. The Data and Analytics Hub was built, with Deloitte, to address these issues and support the efficient generation of data insights for functions such as clinical operations, clinical development, medical affairs, portfolio management, and R&D finance. In the AWS hosted data lake, this data is processed, integrated, and made available to business end users through data visualization interfaces, and to data scientists through direct connectivity. Learn how Takeda has achieved significant time reductions—from weeks to minutes—to gather and provision data that has the potential to reduce cycle times in drug development. The hub also enables more efficient operations and alignment to achieve product goals through cross functional team accountability and collaboration due to the ability to access the same cross domain data.
ABD210 – Modernizing Amtrak: Serverless Solution for Real-Time Data Capabilities As the nation’s only high-speed intercity passenger rail provider, Amtrak needs to know critical information to run their business, such as: Who’s onboard any train at any time? How are booking and revenue trending? Amtrak was faced with unpredictable and often slow response times from existing databases, ranging from seconds to hours; existing booking and revenue dashboards were spreadsheet-based and manual; multiple copies of data were stored in different repositories, lacking integration and consistency; and operations and maintenance (O&M) costs were relatively high. Join us as we demonstrate how Deloitte and Amtrak successfully went live with a cloud-native operational database and analytical datamart for near-real-time reporting in under six months. We highlight the specific challenges and the modernization of architecture on an AWS native Platform as a Service (PaaS) solution. The solution includes cloud-native components such as AWS Lambda for microservices, Amazon Kinesis and AWS Data Pipeline for moving data, Amazon S3 for storage, Amazon DynamoDB for a managed NoSQL database service, and Amazon Redshift for near-real time reports and dashboards. Deloitte’s solution enabled “at scale” processing of 1 million transactions/day and up to 2K transactions/minute. It provided flexibility and scalability, largely eliminated the need for system management, and dramatically reduced operating costs. Moreover, it laid the groundwork for decommissioning legacy systems, anticipated to save at least $1M over 3 years.
ABD211 – Sysco Foods: A Journey from Too Much Data to Curated Insights In this session, we detail Sysco’s journey from a company focused on hindsight-based reporting to one focused on insights and foresight. For this shift, Sysco moved from multiple data warehouses to an AWS ecosystem, including Amazon Redshift, Amazon EMR, AWS Data Pipeline, and more. As the team at Sysco worked with Tableau, they gained agile insight across their business. Learn how Sysco decided to use AWS, how they scaled, and how they became more strategic with the AWS ecosystem and Tableau.
ABD217 – From Batch to Streaming: How Amazon Flex Uses Real-time Analytics to Deliver Packages on Time Reducing the time to get actionable insights from data is important to all businesses, and customers who employ batch data analytics tools are exploring the benefits of streaming analytics. Learn best practices to extend your architecture from data warehouses and databases to real-time solutions. Learn how to use Amazon Kinesis to get real-time data insights and integrate them with Amazon Aurora, Amazon RDS, Amazon Redshift, and Amazon S3. The Amazon Flex team describes how they used streaming analytics in their Amazon Flex mobile app, used by Amazon delivery drivers to deliver millions of packages each month on time. They discuss the architecture that enabled the move from a batch processing system to a real-time system, overcoming the challenges of migrating existing batch data to streaming data, and how to benefit from real-time analytics.
ABD218 – How EuroLeague Basketball Uses IoT Analytics to Engage Fans IoT and big data have made their way out of industrial applications, general automation, and consumer goods, and are now a valuable tool for improving consumer engagement across a number of industries, including media, entertainment, and sports. The low cost and ease of implementation of AWS analytics services and AWS IoT have allowed AGT, a leader in IoT, to develop their IoTA analytics platform. Using IoTA, AGT brought a tailored solution to EuroLeague Basketball for real-time content production and fan engagement during the 2017-18 season. In this session, we take a deep dive into how this solution is architected for secure, scalable, and highly performant data collection from athletes, coaches, and fans. We also talk about how the data is transformed into insights and integrated into a content generation pipeline. Lastly, we demonstrate how this solution can be easily adapted for other industries and applications.
ABD222 – How to Confidently Unleash Data to Meet the Needs of Your Entire Organization Where are you on the spectrum of IT leaders? Are you confident that you’re providing the technology and solutions that consistently meet or exceed the needs of your internal customers? Do your peers at the executive table see you as an innovative technology leader? Innovative IT leaders understand the value of getting data and analytics directly into the hands of decision makers, and into their own. In this session, Daren Thayne, Domo’s Chief Technology Officer, shares how innovative IT leaders are helping drive a culture change at their organizations. See how transformative it can be to have real-time access to all of the data that is relevant to YOUR job (including a complete view of your entire AWS environment), as well as understand how it can help you lead the way in applying that same pattern throughout your entire company.
ABD303 – Developing an Insights Platform – Sysco’s Journey from Disparate Systems to Data Lake and Beyond Sysco has nearly 200 operating companies across its multiple lines of business throughout the United States, Canada, Central/South America, and Europe. As the global leader in food services, Sysco identified the need to streamline the collection, transformation, and presentation of data produced by the distributed units and systems, into a central data ecosystem. Sysco’s Business Intelligence and Analytics team addressed these requirements by creating a data lake with scalable analytics and query engines leveraging AWS services. In this session, Sysco will outline their journey from a hindsight reporting focused company to an insights driven organization. They will cover solution architecture, challenges, and lessons learned from deploying a self-service insights platform. They will also walk through the design patterns they used and how they designed the solution to provide predictive analytics using Amazon Redshift Spectrum, Amazon S3, Amazon EMR, AWS Glue, Amazon Elasticsearch Service and other AWS services.
ABD309 – How Twilio Scaled Its Data-Driven Culture As a leading cloud communications platform, Twilio has always been strongly data-driven. But as headcount and data volumes grew—and grew quickly—they faced many new challenges. One-off, static reports work when you’re a small startup, but how do you support a growth stage company to a successful IPO and beyond? Today, Twilio’s data team relies on AWS and Looker to provide data access to 700 colleagues. Departments have the data they need to make decisions, and cloud-based scale means they get answers fast. Data delivers real-business value at Twilio, providing a 360-degree view of their customer, product, and business. In this session, you hear firsthand stories directly from the Twilio data team and learn real-world tips for fostering a truly data-driven culture at scale.
ABD310 – How FINRA Secures Its Big Data and Data Science Platform on AWS FINRA uses big data and data science technologies to detect fraud, market manipulation, and insider trading across US capital markets. As a financial regulator, FINRA analyzes highly sensitive data, so information security is critical. Learn how FINRA secures its Amazon S3 Data Lake and its data science platform on Amazon EMR and Amazon Redshift, while empowering data scientists with tools they need to be effective. In addition, FINRA shares AWS security best practices, covering topics such as AMI updates, micro segmentation, encryption, key management, logging, identity and access management, and compliance.
ABD331 – Log Analytics at Expedia Using Amazon Elasticsearch Service Expedia uses Amazon Elasticsearch Service (Amazon ES) for a variety of mission-critical use cases, ranging from log aggregation to application monitoring and pricing optimization. In this session, the Expedia team reviews how they use Amazon ES and Kibana to analyze and visualize Docker startup logs, AWS CloudTrail data, and application metrics. They share best practices for architecting a scalable, secure log analytics solution using Amazon ES, so you can add new data sources almost effortlessly and get insights quickly
ABD316 – American Heart Association: Finding Cures to Heart Disease Through the Power of Technology Combining disparate datasets and making them accessible to data scientists and researchers is a prevalent challenge for many organizations, not just in healthcare research. American Heart Association (AHA) has built a data science platform using Amazon EMR, Amazon Elasticsearch Service, and other AWS services, that corrals multiple datasets and enables advanced research on phenotype and genotype datasets, aimed at curing heart diseases. In this session, we present how AHA built this platform and the key challenges they addressed with the solution. We also provide a demo of the platform, and leave you with suggestions and next steps so you can build similar solutions for your use cases
ABD319 – Tooling Up for Efficiency: DIY Solutions @ Netflix At Netflix, we have traditionally approached cloud efficiency from a human standpoint, whether it be in-person meetings with the largest service teams or manually flipping reservations. Over time, we realized that these manual processes are not scalable as the business continues to grow. Therefore, in the past year, we have focused on building out tools that allow us to make more insightful, data-driven decisions around capacity and efficiency. In this session, we discuss the DIY applications, dashboards, and processes we built to help with capacity and efficiency. We start at the ten thousand foot view to understand the unique business and cloud problems that drove us to create these products, and discuss implementation details, including the challenges encountered along the way. Tools discussed include Picsou, the successor to our AWS billing file cost analyzer; Libra, an easy-to-use reservation conversion application; and cost and efficiency dashboards that relay useful financial context to 50+ engineering teams and managers.
ABD312 – Deep Dive: Migrating Big Data Workloads to AWS Customers are migrating their analytics, data processing (ETL), and data science workloads running on Apache Hadoop, Spark, and data warehouse appliances from on-premise deployments to AWS in order to save costs, increase availability, and improve performance. AWS offers a broad set of analytics services, including solutions for batch processing, stream processing, machine learning, data workflow orchestration, and data warehousing. This session will focus on identifying the components and workflows in your current environment; and providing the best practices to migrate these workloads to the right AWS data analytics product. We will cover services such as Amazon EMR, Amazon Athena, Amazon Redshift, Amazon Kinesis, and more. We will also feature Vanguard, an American investment management company based in Malvern, Pennsylvania with over $4.4 trillion in assets under management. Ritesh Shah, Sr. Program Manager for Cloud Analytics Program at Vanguard, will describe how they orchestrated their migration to AWS analytics services, including Hadoop and Spark workloads to Amazon EMR. Ritesh will highlight the technical challenges they faced and overcame along the way, as well as share common recommendations and tuning tips to accelerate the time to production.
ABD402 – How Esri Optimizes Massive Image Archives for Analytics in the Cloud Petabyte-scale archives of satellite, plane, and drone imagery continue to grow exponentially. They mostly exist as semi-structured data, but they are only valuable when accessed and processed by a wide range of products for both visualization and analysis. This session provides an overview of how ArcGIS indexes and structures data so that any part of it can be quickly accessed, processed, and analyzed by reading only the minimum amount of data needed for the task. In this session, we share best practices for structuring and compressing massive datasets in Amazon S3, so they can be analyzed efficiently. We also review a number of different image formats, including GeoTIFF (used for the Public Datasets on AWS program, Landsat on AWS), cloud optimized GeoTIFF, MRF, and CRF as well as different compression approaches to show the effect on processing performance. Finally, we provide examples of how this technology has been used to help image processing and analysis for the response to Hurricane Harvey.
ABD329 – A Look Under the Hood – How Amazon.com Uses AWS Services for Analytics at Massive Scale Amazon’s consumer business continues to grow, and so does the volume of data and the number and complexity of the analytics done in support of the business. In this session, we talk about how Amazon.com uses AWS technologies to build a scalable environment for data and analytics. We look at how Amazon is evolving the world of data warehousing with a combination of a data lake and parallel, scalable compute engines such as Amazon EMR and Amazon Redshift.
ABD327 – Migrating Your Traditional Data Warehouse to a Modern Data Lake In this session, we discuss the latest features of Amazon Redshift and Redshift Spectrum, and take a deep dive into its architecture and inner workings. We share many of the recent availability, performance, and management enhancements and how they improve your end user experience. You also hear from 21st Century Fox, who presents a case study of their fast migration from an on-premises data warehouse to Amazon Redshift. Learn how they are expanding their data warehouse to a data lake that encompasses multiple data sources and data formats. This architecture helps them tie together siloed business units and get actionable 360-degree insights across their consumer base.
MCL202 – Ally Bank & Cognizant: Transforming Customer Experience Using Amazon Alexa Given the increasing popularity of natural language interfaces such as Voice as User technology or conversational artificial intelligence (AI), Ally® Bank was looking to interact with customers by enabling direct transactions through conversation or voice. They also needed to develop a capability that allows third parties to connect to the bank securely for information sharing and exchange, using oAuth, an authentication protocol seen as the future of secure banking technology. Cognizant’s Architecture team partnered with Ally Bank’s Enterprise Architecture group and identified the right product for oAuth integration with Amazon Alexa and third-party technologies. In this session, we discuss how building products with conversational AI helps Ally Bank offer an innovative customer experience; increase retention through improved data-driven personalization; increase the efficiency and convenience of customer service; and gain deep insights into customer needs through data analysis and predictive analytics to offer new products and services.
MCL317 – Orchestrating Machine Learning Training for Netflix Recommendations At Netflix, we use machine learning (ML) algorithms extensively to recommend relevant titles to our 100+ million members based on their tastes. Everything on the member home page is an evidence-driven, A/B-tested experience that we roll out backed by ML models. These models are trained using Meson, our workflow orchestration system. Meson distinguishes itself from other workflow engines by handling more sophisticated execution graphs, such as loops and parameterized fan-outs. Meson can schedule Spark jobs, Docker containers, bash scripts, gists of Scala code, and more. Meson also provides a rich visual interface for monitoring active workflows and inspecting execution logs. It has a powerful Scala DSL for authoring workflows as well as the REST API. In this session, we focus on how Meson trains recommendation ML models in production, and how we have re-architected it to scale up for a growing need of broad ETL applications within Netflix. As a driver for this change, we have had to evolve the persistence layer for Meson. We talk about how we migrated from Cassandra to Amazon RDS backed by Amazon Aurora
MCL350 – Humans vs. the Machines: How Pinterest Uses Amazon Mechanical Turk’s Worker Community to Improve Machine Learning Ever since the term “crowdsourcing” was coined in 2006, it’s been a buzzword for technology companies and social institutions. In the technology sector, crowdsourcing is instrumental for verifying machine learning algorithms, which, in turn, improves the user’s experience. In this session, we explore how Pinterest adapted to an increased reliability on human evaluation to improve their product, with a focus on how they’ve integrated with Mechanical Turk’s platform. This presentation is aimed at engineers, analysts, program managers, and product managers who are interested in how companies rely on Mechanical Turk’s human evaluation platform to better understand content and improve machine learning algorithms. The discussion focuses on the analysis and product decisions related to building a high quality crowdsourcing system that takes advantage of Mechanical Turk’s powerful worker community.
ABD201 – Big Data Architectural Patterns and Best Practices on AWS In this session, we simplify big data processing as a data bus comprising various stages: collect, store, process, analyze, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, durability, and so on. Finally, we provide reference architectures, design patterns, and best practices for assembling these technologies to solve your big data problems at the right cost
ABD202 – Best Practices for Building Serverless Big Data Applications Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. In this session, we show you how to incorporate serverless concepts into your big data architectures. We explore the concepts behind and benefits of serverless architectures for big data, looking at design patterns to ingest, store, process, and visualize your data. Along the way, we explain when and how you can use serverless technologies to streamline data processing, minimize infrastructure management, and improve agility and robustness and share a reference architecture using a combination of cloud and open source technologies to solve your big data problems. Topics include: use cases and best practices for serverless big data applications; leveraging AWS technologies such as Amazon DynamoDB, Amazon S3, Amazon Kinesis, AWS Lambda, Amazon Athena, and Amazon EMR; and serverless ETL, event processing, ad hoc analysis, and real-time analytics.
ABD206 – Building Visualizations and Dashboards with Amazon QuickSight Just as a picture is worth a thousand words, a visual is worth a thousand data points. A key aspect of our ability to gain insights from our data is to look for patterns, and these patterns are often not evident when we simply look at data in tables. The right visualization will help you gain a deeper understanding in a much quicker timeframe. In this session, we will show you how to quickly and easily visualize your data using Amazon QuickSight. We will show you how you can connect to data sources, generate custom metrics and calculations, create comprehensive business dashboards with various chart types, and setup filters and drill downs to slice and dice the data.
ABD203 – Real-Time Streaming Applications on AWS: Use Cases and Patterns To win in the marketplace and provide differentiated customer experiences, businesses need to be able to use live data in real time to facilitate fast decision making. In this session, you learn common streaming data processing use cases and architectures. First, we give an overview of streaming data and AWS streaming data capabilities. Next, we look at a few customer examples and their real-time streaming applications. Finally, we walk through common architectures and design patterns of top streaming data use cases.
ABD213 – How to Build a Data Lake with AWS Glue Data Catalog As data volumes grow and customers store more data on AWS, they often have valuable data that is not easily discoverable and available for analytics. The AWS Glue Data Catalog provides a central view of your data lake, making data readily available for analytics. We introduce key features of the AWS Glue Data Catalog and its use cases. Learn how crawlers can automatically discover your data, extract relevant metadata, and add it as table definitions to the AWS Glue Data Catalog. We will also explore the integration between AWS Glue Data Catalog and Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.
ABD214 – Real-time User Insights for Mobile and Web Applications with Amazon Pinpoint With customers demanding relevant and real-time experiences across a range of devices, digital businesses are looking to gather user data at scale, understand this data, and respond to customer needs instantly. This requires tools that can record large volumes of user data in a structured fashion, and then instantly make this data available to generate insights. In this session, we demonstrate how you can use Amazon Pinpoint to capture user data in a structured yet flexible manner. Further, we demonstrate how this data can be set up for instant consumption using services like Amazon Kinesis Firehose and Amazon Redshift. We walk through example data based on real world scenarios, to illustrate how Amazon Pinpoint lets you easily organize millions of events, record them in real-time, and store them for further analysis.
ABD223 – IT Innovators: New Technology for Leveraging Data to Enable Agility, Innovation, and Business Optimization Companies of all sizes are looking for technology to efficiently leverage data and their existing IT investments to stay competitive and understand where to find new growth. Regardless of where companies are in their data-driven journey, they face greater demands for information by customers, prospects, partners, vendors and employees. All stakeholders inside and outside the organization want information on-demand or in “real time”, available anywhere on any device. They want to use it to optimize business outcomes without having to rely on complex software tools or human gatekeepers to relevant information. Learn how IT innovators at companies such as MasterCard, Jefferson Health, and TELUS are using Domo’s Business Cloud to help their organizations more effectively leverage data at scale.
ABD301 – Analyzing Streaming Data in Real Time with Amazon Kinesis Amazon Kinesis makes it easy to collect, process, and analyze real-time, streaming data so you can get timely insights and react quickly to new information. In this session, we present an end-to-end streaming data solution using Kinesis Streams for data ingestion, Kinesis Analytics for real-time processing, and Kinesis Firehose for persistence. We review in detail how to write SQL queries using streaming data and discuss best practices to optimize and monitor your Kinesis Analytics applications. Lastly, we discuss how to estimate the cost of the entire system
ABD302 – Real-Time Data Exploration and Analytics with Amazon Elasticsearch Service and Kibana In this session, we use Apache web logs as example and show you how to build an end-to-end analytics solution. First, we cover how to configure an Amazon ES cluster and ingest data using Amazon Kinesis Firehose. We look at best practices for choosing instance types, storage options, shard counts, and index rotations based on the throughput of incoming data. Then we demonstrate how to set up a Kibana dashboard and build custom dashboard widgets. Finally, we review approaches for generating custom, ad-hoc reports.
ABD304 – Best Practices for Data Warehousing with Amazon Redshift & Redshift Spectrum Most companies are over-run with data, yet they lack critical insights to make timely and accurate business decisions. They are missing the opportunity to combine large amounts of new, unstructured big data that resides outside their data warehouse with trusted, structured data inside their data warehouse. In this session, we take an in-depth look at how modern data warehousing blends and analyzes all your data, inside and outside your data warehouse without moving the data, to give you deeper insights to run your business. We will cover best practices on how to design optimal schemas, load data efficiently, and optimize your queries to deliver high throughput and performance.
ABD305 – Design Patterns and Best Practices for Data Analytics with Amazon EMR Amazon EMR is one of the largest Hadoop operators in the world, enabling customers to run ETL, machine learning, real-time processing, data science, and low-latency SQL at petabyte scale. In this session, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters, and other Amazon EMR architectural best practices. We talk about lowering cost with Auto Scaling and Spot Instances, and security best practices for encryption and fine-grained access control. Finally, we dive into some of our recent launches to keep you current on our latest features.
ABD307 – Deep Analytics for Global AWS Marketing Organization To meet the needs of the global marketing organization, the AWS marketing analytics team built a scalable platform that allows the data science team to deliver custom econometric and machine learning models for end user self-service. To meet data security standards, we use end-to-end data encryption and different AWS services such as Amazon Redshift, Amazon RDS, Amazon S3, Amazon EMR with Apache Spark and Auto Scaling. In this session, you see real examples of how we have scaled and automated critical analysis, such as calculating the impact of marketing programs like re:Invent and prioritizing leads for our sales teams.
ABD311 – Deploying Business Analytics at Enterprise Scale with Amazon QuickSight One of the biggest tradeoffs customers usually make when deploying BI solutions at scale is agility versus governance. Large-scale BI implementations with the right governance structure can take months to design and deploy. In this session, learn how you can avoid making this tradeoff using Amazon QuickSight. Learn how to easily deploy Amazon QuickSight to thousands of users using Active Directory and Federated SSO, while securely accessing your data sources in Amazon VPCs or on-premises. We also cover how to control access to your datasets, implement row-level security, create scheduled email reports, and audit access to your data.
ABD315 – Building Serverless ETL Pipelines with AWS Glue Organizations need to gain insight and knowledge from a growing number of Internet of Things (IoT), APIs, clickstreams, unstructured and log data sources. However, organizations are also often limited by legacy data warehouses and ETL processes that were designed for transactional data. In this session, we introduce key ETL features of AWS Glue, cover common use cases ranging from scheduled nightly data warehouse loads to near real-time, event-driven ETL flows for your data lake. We discuss how to build scalable, efficient, and serverless ETL pipelines using AWS Glue. Additionally, Merck will share how they built an end-to-end ETL pipeline for their application release management system, and launched it in production in less than a week using AWS Glue.
ABD318 – Architecting a data lake with Amazon S3, Amazon Kinesis, and Amazon Athena Learn how to architect a data lake where different teams within your organization can publish and consume data in a self-service manner. As organizations aim to become more data-driven, data engineering teams have to build architectures that can cater to the needs of diverse users – from developers, to business analysts, to data scientists. Each of these user groups employs different tools, have different data needs and access data in different ways. In this talk, we will dive deep into assembling a data lake using Amazon S3, Amazon Kinesis, Amazon Athena, Amazon EMR, and AWS Glue. The session will feature Mohit Rao, Architect and Integration lead at Atlassian, the maker of products such as JIRA, Confluence, and Stride. First, we will look at a couple of common architectures for building a data lake. Then we will show how Atlassian built a self-service data lake, where any team within the company can publish a dataset to be consumed by a broad set of users.
Companies have valuable data that they may not be analyzing due to the complexity, scalability, and performance issues of loading the data into their data warehouse. However, with the right tools, you can extend your analytics to query data in your data lake—with no loading required. Amazon Redshift Spectrum extends the analytic power of Amazon Redshift beyond data stored in your data warehouse to run SQL queries directly against vast amounts of unstructured data in your Amazon S3 data lake. This gives you the freedom to store your data where you want, in the format you want, and have it available for analytics when you need it. Join a discussion with AWS solution architects to ask questions.
ABD330 – Combining Batch and Stream Processing to Get the Best of Both Worlds Today, many architects and developers are looking to build solutions that integrate batch and real-time data processing, and deliver the best of both approaches. Lambda architecture (not to be confused with the AWS Lambda service) is a design pattern that leverages both batch and real-time processing within a single solution to meet the latency, accuracy, and throughput requirements of big data use cases. Come join us for a discussion on how to implement Lambda architecture (batch, speed, and serving layers) and best practices for data processing, loading, and performance tuning
ABD335 – Real-Time Anomaly Detection Using Amazon Kinesis Amazon Kinesis Analytics offers a built-in machine learning algorithm that you can use to easily detect anomalies in your VPC network traffic and improve security monitoring. Join us for an interactive discussion on how to stream your VPC flow Logs to Amazon Kinesis Streams and identify anomalies using Kinesis Analytics.
ABD339 – Deep Dive and Best Practices for Amazon Athena Amazon Athena is an interactive query service that enables you to process data directly from Amazon S3 without the need for infrastructure. Since its launch at re:invent 2016, several organizations have adopted Athena as the central tool to process all their data. In this talk, we dive deep into the most common use cases, including working with other AWS services. We review the best practices for creating tables and partitions and performance optimizations. We also dive into how Athena handles security, authorization, and authentication. Lastly, we hear from a customer who has reduced costs and improved time to market by deploying Athena across their organization.
We look forward to meeting you at re:Invent 2017!
About the Author
Roy Ben-Alta is a solution architect and principal business development manager at Amazon Web Services in New York. He focuses on Data Analytics and ML Technologies, working with AWS customers to build innovative data-driven products.
Contributed by Otavio Ferreira, Manager, Software Development, AWS Messaging
Like other developers around the world, you may be tackling increasingly complex business problems. A key success factor, in that case, is the ability to break down a large project scope into smaller, more manageable components. A service-oriented architecture guides you toward designing systems as a collection of loosely coupled, independently scaled, and highly reusable services. Microservices take this even further. To improve performance and scalability, they promote fine-grained interfaces and lightweight protocols.
However, the communication among isolated microservices can be challenging. Services are often deployed onto independent servers and don’t share any compute or storage resources. Also, you should avoid hard dependencies among microservices, to preserve maintainability and reusability.
If you apply the pub/sub design pattern, you can effortlessly decouple and independently scale out your microservices and serverless architectures. A pub/sub messaging service, such as Amazon SNS, promotes event-driven computing that statically decouples event publishers from subscribers, while dynamically allowing for the exchange of messages between them. An event-driven architecture also introduces the responsiveness needed to deal with complex problems, which are often unpredictable and asynchronous.
What is event-driven computing?
Given the context of microservices, event-driven computing is a model in which subscriber services automatically perform work in response to events triggered by publisher services. This paradigm can be applied to automate workflows while decoupling the services that collectively and independently work to fulfil these workflows. Amazon SNS is an event-driven computing hub, in the AWS Cloud, that has native integration with several AWS publisher and subscriber services.
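As a minimal sketch of what a publisher side can look like with the AWS SDK for JavaScript (not from the original post; the topic ARN and message payload are hypothetical), a service can emit an event to an SNS topic like this:

// A publisher service emitting an event to an SNS topic
const AWS = require('aws-sdk');
const sns = new AWS.SNS();

sns.publish({
  TopicArn: 'arn:aws:sns:us-east-1:123456789012:order-events', // hypothetical topic
  Message: JSON.stringify({ eventType: 'ORDER_PLACED', orderId: 42 })
}, function(err, data) {
  if (err) console.error(err, err.stack);
  else console.log('Event published, message ID:', data.MessageId);
});

Any subscriber attached to the topic (an SQS queue, a Lambda function, or an HTTP endpoint) then receives the message without the publisher knowing who the subscribers are.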
Which AWS services publish events to SNS natively?
Several AWS services have been integrated as SNS publishers and, therefore, can natively trigger event-driven computing for a variety of use cases. In this post, I specifically cover AWS compute, storage, database, and networking services, as depicted below.
Compute services
Auto Scaling: Helps you ensure that you have the correct number of Amazon EC2 instances available to handle the load for your application. You can configure Auto Scaling lifecycle hooks to trigger events, as Auto Scaling resizes your EC2 cluster.As an example, you may want to warm up the local cache store on newly launched EC2 instances, and also download log files from other EC2 instances that are about to be terminated. To make this happen, set an SNS topic as your Auto Scaling group’s notification target, then subscribe two Lambda functions to this SNS topic. The first function is responsible for handling scale-out events (to warm up cache upon provisioning), whereas the second is in charge of handling scale-in events (to download logs upon termination).
AWS Elastic Beanstalk: An easy-to-use service for deploying and scaling web applications and web services developed in a number of programming languages. You can configure event notifications for your Elastic Beanstalk environment so that notable events can be automatically published to an SNS topic, then pushed to topic subscribers. As an example, you may use this event-driven architecture to coordinate your continuous integration pipeline (such as Jenkins CI). That way, whenever an environment is created, Elastic Beanstalk publishes this event to an SNS topic, which triggers a subscribing Lambda function, which then kicks off a CI job against your newly created Elastic Beanstalk environment.
Elastic Load Balancing: Automatically distributes incoming application traffic across Amazon EC2 instances, containers, or other resources identified by IP addresses. You can configure CloudWatch alarms on Elastic Load Balancing metrics, to automate the handling of events derived from Classic Load Balancers. As an example, you may leverage this event-driven design to automate latency profiling in an Amazon ECS cluster behind a Classic Load Balancer. In this example, whenever your ECS cluster breaches your load balancer latency threshold, an event is posted by CloudWatch to an SNS topic, which then triggers a subscribing Lambda function. This function runs a task on your ECS cluster to trigger a latency profiling tool, hosted on the cluster itself. This can enhance your latency troubleshooting exercise by making it timely.
Amazon S3: Object storage built to store and retrieve any amount of data. You can enable S3 event notifications, and automatically get them posted to SNS topics, to automate a variety of workflows. For instance, imagine that you have an S3 bucket to store incoming resumes from candidates, and a fleet of EC2 instances to encode these resumes from their original format (such as Word or text) into a portable format (such as PDF). In this example, whenever new files are uploaded to your input bucket, S3 publishes these events to an SNS topic, which in turn pushes these messages into subscribing SQS queues. Then, encoding workers running on EC2 instances poll these messages from the SQS queues; retrieve the original files from the input S3 bucket; encode them into PDF; and finally store them in an output S3 bucket.
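As a rough sketch of how the resume-encoding example might be wired up with boto3, the S3-to-SNS hookup could look like the following. The bucket and topic names are placeholders, and the topic's access policy must already allow S3 to publish to it.

import boto3

s3 = boto3.client('s3')

# Publish an event to the SNS topic whenever an object is created in the bucket.
s3.put_bucket_notification_configuration(
    Bucket='incoming-resumes',  # placeholder bucket name
    NotificationConfiguration={
        'TopicConfigurations': [
            {
                'TopicArn': 'arn:aws:sns:us-east-1:123456789012:new-resumes',
                'Events': ['s3:ObjectCreated:*'],
            }
        ]
    },
)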
Amazon EFS: Provides simple and scalable file storage, for use with Amazon EC2 instances, in the AWS Cloud. You can configure CloudWatch alarms on EFS metrics, to automate the management of your EFS systems. For example, consider a highly parallelized genomics analysis application that runs against an EFS system. By default, this file system is instantiated on the “General Purpose” performance mode. Although this performance mode allows for lower latency, it might eventually impose a scaling bottleneck. Therefore, you may leverage an event-driven design to handle it automatically. Basically, as soon as the EFS metric “Percent I/O Limit” breaches 95%, CloudWatch could post this event to an SNS topic, which in turn would push this message into a subscribing Lambda function. This function automatically creates a new file system, this time on the “Max I/O” performance mode, then switches the genomics analysis application to this new file system. As a result, your application starts experiencing higher I/O throughput rates.
Amazon Glacier: A secure, durable, and low-cost cloud storage service for data archiving and long-term backup. You can set a notification configuration on an Amazon Glacier vault so that when a job completes, a message is published to an SNS topic. Retrieving an archive from Amazon Glacier is a two-step asynchronous operation, in which you first initiate a job, and then download the output after the job completes. Therefore, SNS helps you eliminate polling your Amazon Glacier vault to check whether your job has been completed, or not. As usual, you may subscribe SQS queues, Lambda functions, and HTTP endpoints to your SNS topic, to be notified when your Amazon Glacier job is done.
AWS Snowball: A petabyte-scale data transport solution that uses secure appliances to transfer large amounts of data. You can leverage Snowball notifications to automate workflows related to importing data into and exporting data from AWS. More specifically, whenever your Snowball job status changes, Snowball can publish this event to an SNS topic, which in turn can broadcast the event to all its subscribers. As an example, imagine a Geographic Information System (GIS) that distributes high-resolution satellite images to users via a web browser. In this example, the GIS vendor could capture up to 80 TB of satellite images; create a Snowball job to import these files from an on-premises system to an S3 bucket; and provide an SNS topic ARN to be notified upon job status changes in Snowball. After Snowball changes the job status from “Importing” to “Completed”, Snowball publishes this event to the specified SNS topic, which delivers this message to a subscribing Lambda function, which finally creates a CloudFront web distribution for the target S3 bucket, to serve the images to end users.
Amazon RDS: Makes it easy to set up, operate, and scale a relational database in the cloud. RDS leverages SNS to broadcast notifications when RDS events occur. As usual, these notifications can be delivered via any protocol supported by SNS, including SQS queues, Lambda functions, and HTTP endpoints. As an example, imagine that you own a social network website that has experienced organic growth, and needs to scale its compute and database resources on demand. In this case, you could provide an SNS topic to listen to RDS DB instance events. When the “Low Storage” event is published to the topic, SNS pushes this event to a subscribing Lambda function, which in turn leverages the RDS API to increase the storage capacity allocated to your DB instance. The provisioning itself takes place within the specified DB maintenance window.
Amazon ElastiCache: A web service that makes it easy to deploy, operate, and scale an in-memory data store or cache in the cloud. ElastiCache can publish messages using Amazon SNS when significant events happen on your cache cluster. This feature can be used to refresh the list of servers on client machines connected to individual cache node endpoints of a cache cluster. For instance, an ecommerce website fetches product details from a cache cluster, with the goal of offloading a relational database and speeding up page load times. Ideally, you want to make sure that each web server always has an updated list of cache servers to which to connect. To automate this node discovery process, you can get your ElastiCache cluster to publish events to an SNS topic. Thus, when ElastiCache event “AddCacheNodeComplete” is published, your topic then pushes this event to all subscribing HTTP endpoints that serve your ecommerce website, so that these HTTP servers can update their list of cache nodes.
Amazon Redshift: A fully managed data warehouse that makes it simple to analyze data using standard SQL and BI (Business Intelligence) tools. Amazon Redshift uses SNS to broadcast relevant events so that data warehouse workflows can be automated. As an example, imagine a news website that sends clickstream data to a Kinesis Firehose stream, which then loads the data into Amazon Redshift, so that popular news and reading preferences might be surfaced on a BI tool. At some point though, this Amazon Redshift cluster might need to be resized, and the cluster enters a read-only mode. Hence, this Amazon Redshift event is published to an SNS topic, which delivers this event to a subscribing Lambda function, which finally deletes the corresponding Kinesis Firehose delivery stream, so that clickstream data uploads can be put on hold. At a later point, after Amazon Redshift publishes the event that the maintenance window has been closed, SNS notifies a subscribing Lambda function accordingly, so that this function can re-create the Kinesis Firehose delivery stream, and resume clickstream data uploads to Amazon Redshift.
AWS DMS: Helps you migrate databases to AWS quickly and securely. The source database remains fully operational during the migration, minimizing downtime to applications that rely on the database. DMS also uses SNS to provide notifications when DMS events occur, which can automate database migration workflows. As an example, you might create data replication tasks to migrate an on-premises MS SQL database, composed of multiple tables, to MySQL. Thus, if replication tasks fail due to incompatible data encoding in the source tables, these events can be published to an SNS topic, which can push these messages into a subscribing SQS queue. Then, encoders running on EC2 can poll these messages from the SQS queue, encode the source tables into a compatible character set, and restart the corresponding replication tasks in DMS. This is an event-driven approach to a self-healing database migration process.
Amazon Route 53: A highly available and scalable cloud-based DNS (Domain Name System). Route 53 health checks monitor the health and performance of your web applications, web servers, and other resources. You can set CloudWatch alarms and get automated Amazon SNS notifications when the status of your Route 53 health check changes. As an example, imagine an online payment gateway that reports the health of its platform to merchants worldwide, via a status page. This page is hosted on EC2 and fetches platform health data from DynamoDB. In this case, you could configure a CloudWatch alarm for your Route 53 health check, so that when the alarm threshold is breached, and the payment gateway is no longer considered healthy, then CloudWatch publishes this event to an SNS topic, which pushes this message to a subscribing Lambda function, which finally updates the DynamoDB table that populates the status page. This event-driven approach avoids any kind of manual update to the status page visited by merchants.
AWS Direct Connect (AWS DX): Makes it easy to establish a dedicated network connection from your premises to AWS, which can reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than Internet-based connections. You can monitor physical DX connections using CloudWatch alarms, and send SNS messages when alarms change their status. As an example, when a DX connection state shifts to 0 (zero), indicating that the connection is down, this event can be published to an SNS topic, which can fan out this message to impacted servers through HTTP endpoints, so that they might reroute their traffic through a different connection instead. This is an event-driven approach to connectivity resilience.
In addition to SNS, event-driven computing is also addressed by Amazon CloudWatch Events, which delivers a near real-time stream of system events that describe changes in AWS resources. With CloudWatch Events, you can route each event type to one or more targets, such as AWS Lambda functions, Amazon SNS topics, Amazon SQS queues, and Amazon Kinesis streams.
Many AWS services publish events to CloudWatch. As an example, you can get CloudWatch Events to capture events on your ETL (Extract, Transform, Load) jobs running on AWS Glue and push failed ones to an SQS queue, so that you can retry them later.
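A hedged boto3 sketch of that Glue retry example follows. The rule and queue names are placeholders, the event pattern reflects how Glue job state changes are typically emitted to CloudWatch Events, and the SQS queue policy is assumed to allow CloudWatch Events to send messages.

import json
import boto3

events = boto3.client('events')

# Match AWS Glue job runs that end in the FAILED state.
events.put_rule(
    Name='glue-etl-failures',
    State='ENABLED',
    EventPattern=json.dumps({
        'source': ['aws.glue'],
        'detail-type': ['Glue Job State Change'],
        'detail': {'state': ['FAILED']},
    }),
)

# Send the matched events to an SQS queue so the failed jobs can be retried later.
events.put_targets(
    Rule='glue-etl-failures',
    Targets=[{'Id': 'retry-queue',
              'Arn': 'arn:aws:sqs:us-east-1:123456789012:glue-retry-queue'}],
)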
Conclusion
Amazon SNS is a pub/sub messaging service that can be used as an event-driven computing hub for AWS customers worldwide. By capturing events natively triggered by AWS services, such as EC2, S3, and RDS, you can automate and optimize all kinds of workflows, namely scaling, testing, encoding, profiling, broadcasting, discovery, failover, and much more. Business use cases presented in this post ranged from recruiting websites, to scientific research, geographic systems, social networks, retail websites, and news portals.
So many launches and cloud innovations that you simply may not believe. In order to catch up on some service launches and features, this post will be a round-up of some cool releases that happened this summer and through the end of September.
The launches and features I want to share with you today are:
AWS IAM for Authenticating Database Users for RDS MySQL and Amazon Aurora
Amazon SES Reputation Dashboard
Amazon SES Open and Click Tracking Metrics
Serverless Image Handler by the Solutions Builder Team
AWS Ops Automator by the Solutions Builder Team
Let’s dive in, shall we?
AWS IAM for Authenticating Database Users for RDS MySQL and Amazon Aurora
Wished you could manage access to your Amazon RDS database instances and clusters using AWS IAM? Well, wish no longer. Amazon RDS has launched the ability for you to use IAM to manage database access for Amazon RDS for MySQL and Amazon Aurora DB.
What I like most about this new service feature is that it’s very easy to get started. To enable database user authentication using IAM, you simply select the Enable IAM DB Authentication check box when creating, modifying, or restoring your DB instance or cluster. You can enable IAM access using the RDS console, the AWS CLI, and/or the Amazon RDS API.
After configuring the database for IAM authentication, client applications authenticate to the database engine by providing temporary security credentials generated by the IAM Security Token Service. These credentials can be used instead of providing a password to the database engine.
You can learn more about using IAM to provide targeted permissions and authentication to MySQL and Aurora by reviewing the Amazon RDS user guide.
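For a programmatic flavor of the same setup, here is a minimal boto3 sketch. The instance identifier, endpoint, and user name are placeholders, and it assumes the database user has already been created for IAM authentication and that your IAM identity is allowed to connect to it.

import boto3

rds = boto3.client('rds', region_name='us-east-1')

# Turn on IAM database authentication for an existing instance.
rds.modify_db_instance(
    DBInstanceIdentifier='mydb-instance',  # placeholder identifier
    EnableIAMDatabaseAuthentication=True,
    ApplyImmediately=True,
)

# Generate a short-lived authentication token to use in place of a password.
token = rds.generate_db_auth_token(
    DBHostname='mydb-instance.abc123.us-east-1.rds.amazonaws.com',  # placeholder endpoint
    Port=3306,
    DBUsername='app_user',  # placeholder database user
)
# Pass `token` as the password to your MySQL client library over an SSL connection.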
Amazon SES Reputation Dashboard
To help Amazon Simple Email Service customers follow best practice guidelines for sending email, I am thrilled to announce that we have launched the Reputation Dashboard to provide comprehensive reporting on email sending health. To aid in proactively managing emails being sent, customers now have visibility into overall account health, sending metrics, and compliance or enforcement status.
The Reputation Dashboard will provide the following information:
Account status: A description of your account health status.
Healthy – No issues currently impacting your account.
Probation – Your account is on probation; the issues causing probation must be resolved to prevent suspension.
Pending end of probation decision – Your account is on probation. An Amazon SES team member must review your account before further action is taken.
Shutdown – Your account has been shut down. You can no longer send email using Amazon SES.
Pending shutdown – Your account is on probation and issues causing probation are unresolved.
Bounce Rate: Percentage of emails sent that have bounced and bounce rate status messages.
Complaint Rate: Percentage of emails sent that recipients have reported as spam and complaint rate status messages.
Notifications: Messages about other account reputation issues.
Amazon SES Open and Click Tracking Metrics
Another exciting feature recently added to Amazon SES is support for Email Open and Click Tracking Metrics. With the Email Open and Click Tracking Metrics feature, SES customers can now track when the email they’ve sent has been opened and when links within that email have been clicked. Using this SES feature will allow you to better track email campaign engagement and effectiveness.
How does this work?
When using the email open tracking feature, SES adds a transparent, miniature image to the emails that you choose to track. When the email is opened, the mail client loads the tracking image, which triggers an open event with Amazon SES. For email click (link) tracking, links in the email and/or email templates are replaced with a custom link. When the custom link is clicked, a click event is recorded in SES and the custom link redirects the email recipient to the link destination of the original email.
You can take advantage of the new open tracking and click tracking features by creating a new configuration set or altering an existing configuration set within SES. After choosing Amazon SNS, Amazon CloudWatch, or Amazon Kinesis Firehose as the AWS service to receive the open and click metrics, you only need to select that configuration set for any emails you want to send to enable these new features.
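If you prefer to script this, a minimal boto3 sketch might look like the following; the configuration set name and the CloudWatch dimension are assumptions for illustration only.

import boto3

ses = boto3.client('ses')

# Create a configuration set to attach to outgoing email.
ses.create_configuration_set(ConfigurationSet={'Name': 'engagement-tracking'})

# Publish open and click events for that configuration set to CloudWatch.
ses.create_configuration_set_event_destination(
    ConfigurationSetName='engagement-tracking',
    EventDestination={
        'Name': 'cloudwatch-destination',
        'Enabled': True,
        'MatchingEventTypes': ['open', 'click'],
        'CloudWatchDestination': {
            'DimensionConfigurations': [{
                'DimensionName': 'campaign',            # hypothetical dimension name
                'DimensionValueSource': 'messageTag',
                'DefaultDimensionValue': 'none',
            }]
        },
    },
)

You would then pass the configuration set name in the ConfigurationSetName parameter when sending email, so that open and click events for those messages flow to the chosen destination.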
The AWS Solution Builder team has been hard at work helping to make it easier for you all to find answers to common architectural questions to aid in building and running applications on AWS. You can find these solutions on the AWS Answers page. Two new solutions released earlier this fall on AWS Answers are the Serverless Image Handler and the AWS Ops Automator. Serverless Image Handler was developed to provide a solution to help customers dynamically process, manipulate, and optimize the handling of images on the AWS Cloud. The solution combines Amazon CloudFront for caching, AWS Lambda to dynamically retrieve images and make image modifications, and an Amazon S3 bucket to store images. Additionally, the Serverless Image Handler leverages the open source image-processing suite, Thumbor, for additional image manipulation, processing, and optimization.
The AWS Ops Automator solution helps you automate manual tasks, such as snapshot scheduling, using time-based or event-based triggers. It provides a framework for automated tasks and includes task audit trails, logging, resource selection, scaling, concurrency handling, task completion handling, and API request retries. The solution includes the following AWS services:
AWS CloudFormation: a template that launches the core framework of microservices and the solution-generated task configurations
Amazon DynamoDB: a table that stores task configuration data defining the event triggers and resources, and that saves the results of actions and any errors
Amazon CloudWatch Logs: provides logging to track warning and error messages
Amazon SNS: a topic that sends the solution’s logging information to a subscribed email address
Starting today, you can connect to your Amazon Elasticsearch Service domains from within an Amazon VPC without the need for NAT instances or Internet gateways. VPC support for Amazon ES is easy to configure, reliable, and offers an extra layer of security. With VPC support, traffic between other services and Amazon ES stays entirely within the AWS network, isolated from the public Internet. You can manage network access using existing VPC security groups, and you can use AWS Identity and Access Management (IAM) policies for additional protection. VPC support for Amazon ES domains is available at no additional charge.
Getting Started
Creating an Amazon Elasticsearch Service domain in your VPC is easy. Follow the steps you would normally take to create your cluster, and then select “VPC access”.
That’s it. There are no additional steps. You can now access your domain from within your VPC!
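If you would rather create the domain programmatically, a minimal boto3 sketch looks roughly like this. The domain name, subnet ID, and security group ID are placeholders, and the instance and volume sizing are arbitrary values chosen for illustration.

import boto3

es = boto3.client('es')

es.create_elasticsearch_domain(
    DomainName='my-vpc-domain',  # placeholder name
    ElasticsearchVersion='5.5',
    ElasticsearchClusterConfig={
        'InstanceType': 'm4.large.elasticsearch',
        'InstanceCount': 2,
    },
    EBSOptions={'EBSEnabled': True, 'VolumeType': 'gp2', 'VolumeSize': 20},
    # Place the domain inside a VPC instead of exposing a public endpoint.
    VPCOptions={
        'SubnetIds': ['subnet-0123456789abcdef0'],
        'SecurityGroupIds': ['sg-0123456789abcdef0'],
    },
)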
Things To Know
To support VPCs, Amazon ES places an endpoint into at least one subnet of your VPC. Amazon ES places an Elastic Network Interface (ENI) into the VPC for each data node in the cluster. Each ENI uses a private IP address from the IPv4 range of your subnet and receives a public DNS hostname. If you enable zone awareness, Amazon ES creates endpoints in two subnets in different availability zones, which provides greater data durability.
You need to set aside three times the number of IP addresses as the number of nodes in your cluster. You can divide that number by two if Zone Awareness is enabled. Ideally, you would create separate subnets just for Amazon ES.
A few notes:
Currently, you cannot move existing domains to a VPC or vice-versa. To take advantage of VPC support, you must create a new domain and migrate your data.
Currently, Amazon ES does not support Amazon Kinesis Firehose integration for domains inside a VPC.
The Big Data Lambda Architecture seeks to provide data engineers and architects with a scalable, fault-tolerant data processing architecture and framework using loosely coupled, distributed systems. At a high level, the Lambda Architecture is designed to handle both real-time and historically aggregated batched data in an integrated fashion. It separates the duties of real-time and batch processing so purpose-built engines, processes, and storage can be used for each, while serving and query layers present a unified view of all of the data.
Historically, the Lambda Architecture demanded the use of various complex systems to achieve the outcomes of uniting batch and real-time views. Data platform engineers and architects were required to implement services running on Amazon EC2 for data collection and ingestion, batch processing, stream processing, serving layers, and dashboards/reporting. As time has gone on, AWS customers have continued to ask for managed solutions that scale seamlessly and put less focus on infrastructure, allowing teams to focus on what really matters: the data and the resulting insights.
In this post, I show you how you can use AWS services like AWS Glue to build a Lambda Architecture completely without servers. I use a practical demonstration to examine the tight integration between serverless services on AWS and create a robust data processing Lambda Architecture system.
New: AWS Glue
With the launch of AWS Glue, AWS provides a portfolio of services to architect a Big Data platform without managing any servers or clusters. AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. You can create and run an ETL job with a few clicks in the AWS Management Console. You simply point AWS Glue to your data stored on AWS, and AWS Glue discovers your data and stores the associated metadata (for example, the table definition and schema) in the AWS Glue Data Catalog. After it’s cataloged, your data is immediately searchable, queryable, and available for ETL.
AWS Glue generates the code to execute your data transformations and data loading processes. Furthermore, AWS Glue provides a managed Spark execution environment to run ETL jobs against a data lake in Amazon S3. In short, you can now run a Lambda Architecture in AWS in a 100% serverless fashion!
“Serverless” applications allow you to build and run applications without thinking about servers. What this means is that you can now stream data in real-time, process huge volumes of data in S3, and run SQL queries and visualizations against that data without managing server provisioning, installation, patching, or capacity scaling. This frees up your users to spend more time interpreting the data and deriving business value for your organization.
Solution overview
The AWS services involved in this solution include Amazon Kinesis Firehose, Amazon Kinesis Analytics, Amazon S3, AWS Glue, Amazon Athena, and Amazon QuickSight.
The following diagram explains how the services work together:
None of the included services require the creation, configuration, or installation of servers, clusters, and databases.
In this example, you use these services to send and process simulated streaming data of sensor devices to Kinesis Firehose and store the raw data in S3. Using AWS Glue, you analyze the raw data from S3 in batch-oriented fashion to look at the thermostat efficiency over time against the historical data, and store results back in S3.
Using Amazon Kinesis Analytics, you analyze and filter the data to detect inefficient sensors in real time. In this example, you detect inefficiencies in thermostat devices by comparing their temperature settings against the temperature they are reading (where thermostats are typically set between 70-72º F).
Finally, you use Athena and Amazon QuickSight to query and visualize the data and build a dashboard that can be shared with other users of Amazon QuickSight in your organization.
Walkthrough
You use the AWS CLI and AWS CloudFormation throughout this tutorial, therefore a basic understanding of these services is assumed.
At the time of writing this post, AWS Glue is only available in the US East (N. Virginia) AWS Region. Complete all the steps in that region.
Step 1: Set up baseline AWS resources
I’ve created a CloudFormation template that sets up the Kinesis streaming resources for you.
Before you continue with the rest of the walkthrough, make sure that the Status value is CREATE_COMPLETE. This can take anywhere between 3 and 5 minutes, so grab coffee or take a moment to prepare and review the next steps.
Step 2: Set up the Amazon Kinesis Data Generator
You will be leveraging the Amazon Kinesis Data Generator (KDG) to simulate devices pushing data into Kinesis Firehose.
Make sure that you can log in to the KDG producer page before you continue.
Step 3: Send data to Kinesis Firehose using the KDG
In this step, you configure the KDG to send simulated records to simulate devices streaming data into Kinesis Firehose. In the data generator template below, notice the complex weighting. This demonstrates outliers in device temperature readings later on. Kinesis Firehose sends all the raw data to S3 in addition to Kinesis Analytics for real-time processing.
On the KDG producer page, for Region, select “us-east-1”. For Stream/delivery stream, select AWSBigDataBlog-RawDeliveryStream. For Records per second, type 100.
Paste the template shown below into the KDG template window:
Without leaving the KDG page, navigate to the Kinesis Analytics console to view the status of the application processing the data in real time. When you are happy with the amount of data sent, choose Stop Sending Data to Kinesis. I recommend waiting until at least 20,000 records are sent.
Step 4: Submit AWS Glue crawlers to interpret the table definition for Kinesis Firehose outputs in S3
In this step, you use AWS Glue crawlers to crawl and generate table definitions against the produced data in S3. This allows you to analyze data in aggregate over a historical context instead of just using the latest data.
Use CLI commands to define and run AWS Glue crawlers:
Replace the <bucketName> text in the command with the bucket name that you specified in the CloudFormation template in step 1.
Replace the <RoleArn> text in the command with an IAM role ARN with permissions to use AWS Glue. For more information, see Setting up IAM Permissions for AWS Glue.
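The exact CLI commands are not reproduced here, but an equivalent boto3 sketch of defining and starting the crawler might look like the following. The database name and S3 path are assumptions based on the tables described below; keep the <RoleArn> and <bucketName> placeholders until you substitute your own values.

import boto3

glue = boto3.client('glue', region_name='us-east-1')

# Define a crawler that catalogs the Kinesis Firehose output in S3.
glue.create_crawler(
    Name='thermostat-data-crawler',
    Role='<RoleArn>',                                        # IAM role with AWS Glue permissions
    DatabaseName='awsblogsgluedemo',
    Targets={'S3Targets': [{'Path': 's3://<bucketName>/'}]},  # bucket from step 1
)

# Equivalent to choosing Run crawler in the console.
glue.start_crawler(Name='thermostat-data-crawler')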
Select the crawler that you created earlier, “thermostat-data-crawler”, and choose Run crawler.
After just a few seconds, the crawler shows as “Ready”, and will have automatically inferred the schema of three tables:
“raw”—Data being sent from KDG to Kinesis Firehose, delivered to S3.
“results”—Kinesis Analytics results data on real-time processing for percent inefficiency of devices, stored in S3.
“sampledata”—Simulated user device settings table for 1,000 thermostats in S3. This data is joined to the raw data to calculate the “percent inefficiency” (absolute value of temperature reading minus temperature setting, divided by the temperature setting).
Zooming in on the architecture diagram, I’ve included the names of these tables. Take a moment to see how the tables fit in the overall Lambda Architecture:
Step 5: Submit an AWS Glue job to pre-process daily thermostat efficiency
In this step, you use an AWS Glue job to join data and pre-process calculated views to assess the efficiency of the thermostat devices. The results are stored in S3 to be queried using Athena.
Replace the <bucketName> text in the command with the bucket name that you specified in the CloudFormation template in step 1.
Replace the <RoleName> text in the command with an IAM role with permissions to use AWS Glue. For more information, see Setting up IAM Permissions for AWS Glue.
Select the job named “daily_avg”, and choose Action, Run job.
In the Parameters (optional) dialog box, choose Run job.
This job can take between 10 and 14 minutes to run. Take a moment to view the script definition in the console. Take care not to save any changes to the script at this time:
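If you would rather kick the job off programmatically than through the console, a minimal boto3 sketch, assuming the job name shown in the console, is:

import boto3

glue = boto3.client('glue', region_name='us-east-1')

# Start the pre-processing job and check its status.
run = glue.start_job_run(JobName='daily_avg')
status = glue.get_job_run(JobName='daily_avg', RunId=run['JobRunId'])
print(status['JobRun']['JobRunState'])   # e.g., RUNNING, SUCCEEDED, FAILED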
Step 6: Query and analyze the data with Athena
Try out each of the individual queries below to analyze the data in Athena.
Query and analyze the raw time-series data in the “master dataset” Lambda Architecture layer
--"Events by Device ID"
SELECT uuid,
devicets,
deviceid,
temp
FROM awsblogsgluedemo."raw"
WHERE deviceid = 1
ORDER BY devicets DESC;
--"Top 20 most active thermostats"
SELECT deviceid,
COUNT(*) AS num_events
FROM awsblogsgluedemo."raw"
GROUP BY deviceid
ORDER BY num_events DESC LIMIT 20;
Query and analyze the data processed by AWS Glue at the “Serving” layer:
--"Top 10 Most inefficient devices"
SELECT deviceid,
daily_avg_inefficiency
FROM awsblogsgluedemo.daily_avg_inefficiency
ORDER BY daily_avg_inefficiency DESC LIMIT 10;
--"KPI - Overall device daily inefficiency"
SELECT ( SUM(daily_avg_inefficiency)/COUNT(*) ) AS all_device_avg_inefficiency,
date
FROM awsblogsgluedemo.daily_avg_inefficiency
GROUP BY date;
Query and analyze the data filtered by Kinesis Analytics, the outputs of the “speed” layer:
--"Top 10 most inefficient devices - event-level granularity"
SELECT col0 AS "uuid",
col1 AS "deviceid",
col2 AS "devicets",
col3 AS "temp",
col4 AS "settemp",
col5 AS "pct_inefficiency"
FROM awsblogsgluedemo.results
ORDER BY pct_inefficiency DESC limit 10;
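You can also run these queries outside the console. The following is a hedged boto3 sketch that submits one of the queries above to Athena and prints the result rows; the results location is a placeholder.

import time
import boto3

athena = boto3.client('athena', region_name='us-east-1')

query = """
SELECT deviceid, daily_avg_inefficiency
FROM awsblogsgluedemo.daily_avg_inefficiency
ORDER BY daily_avg_inefficiency DESC LIMIT 10
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={'Database': 'awsblogsgluedemo'},
    ResultConfiguration={'OutputLocation': 's3://<bucketName>/athena-results/'},
)

# Wait for the query to finish, then fetch the result rows.
while True:
    state = athena.get_query_execution(
        QueryExecutionId=execution['QueryExecutionId']
    )['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

if state == 'SUCCEEDED':
    rows = athena.get_query_results(QueryExecutionId=execution['QueryExecutionId'])
    for row in rows['ResultSet']['Rows']:
        print([col.get('VarCharValue') for col in row['Data']])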
Step 7: Visualize with Amazon QuickSight
Now you have the data in S3 and table definitions in the AWS Glue Data Catalog, and you can query the data directly in S3 with Athena. You are now ready to begin creating visualizations with Amazon QuickSight.
In these steps, I demonstrate how you can visualize your data in S3, again without provisioning any infrastructure for databases, clusters, or BI tools.
Tie in all the data sources into Amazon QuickSight
Choose Connect to another data source or upload a file, Athena.
Give the data source a name, like awsblogsgluedemo
Choose daily_avg_inefficiency, Select.
Choose Directly query your data, Edit/Preview data. On the next screen, leave the settings unchanged and choose Save.
Choose New data set in the top right corner.
Under From existing data sources, choose awsblogsgluedemo.
Choose Create data set, raw, and Select.
Choose Directly query your data, Edit/Preview data. On the next screen, leave the settings unchanged and choose Save.
Repeat steps 6–9 to add the “results” dataset. At step 9, rename the column “col1” to “deviceid” by choosing the pencil icon.
Create an analysis that includes all data sources (raw, processed, real-time results)
Return to the Amazon QuickSight main page by choosing QuickSight.
Choose New analysis, daily_avg_inefficiency, and Create analysis.
Choose Fields, Edit analysis data sets.
Choose Add data set, results, and Select.
Repeat steps 3 and 4 to add the “raw” dataset.
Create a KPI chart on AWS Glue processed data
Choose Fields, daily_avg_inefficiency.
In the Visual types pane, choose the Key Performance Indicator (KPI) icon, shown below.
Drag daily_avg_inefficiency from Fields into the Value well.
Drag date into the Trend group well.
For daily_avg_inefficiency (Sum), choose the down arrow, hover over Aggregate: Sum, and choose Average. Choose the down arrow again, hover over Show as: Number, and choose Percent.
Click daily_avg_inefficiency (Average) in the Value well to modify the presentation to fewer decimal points.
You should have a visualization similar to the following:
To create a scatter plot of all devices:
At the top right of the screen, choose Add, Add visual.
For Visual type, choose Scatter plot.
Drag deviceid into the X axis well and the Group/Color well.
Drag daily_avg_inefficiency into the Y axis well and the Size well.
You should now have a scatter plot visualization similar to the following figure. See if you can find a poorly functioning device with a high percentage inefficiency!
To create a line chart against the raw time-series dataset
Choose Fields, raw.
Choose Add, Add visual.
For Visual type, choose Line chart.
Drag devicets into the X axis well. Drag temp into the Value well.
For temp (Sum), choose the down arrow, hover over Aggregate: Sum, and then choose Average.
Choose the down-arrow at the top right, and choose Format visual.
Expand the Y-Axis menu on the left, and choose Range, Auto (based on data range).
Drag the slider to expand the time-series line characters.
You should now have a line chart similar to the following image. The spikes represent an increase in average temperature readings across the devices.
To create a heat map of the Kinesis Analytics results dataset
Choose Fields, results.
Choose Add, Add visual.
For Visual type, choose Heat map.
Drag deviceid into the Columns and Values wells. For Aggregate, choose Count.
Choose deviceid (Count). For Sort, choose the descending icon.
You should have a visualization like the following graph, which shows deviceid values on the far top left that have the most occurrences of an inefficient temperature reading. Notice that the heat map also colors them darker shades for easy visualization. Look for the devices at the top right that have a high frequency of inefficiency readings.
Finally, you can resize the charts and save them as a dashboard to share to other Amazon QuickSight users in your organization:
Resize the charts by choosing the bottom right corner of a visual.
Choose Share, Create dashboard.
For Create new dashboard as, type a dashboard name, such as “Thermostats Performance.” Choose Create dashboard.
You should now have a dashboard containing all your visuals, similar to the following:
Conclusion
In this post, you saw how to process and analyze data from both streaming and batch sources together in a 100% serverless fashion. The Lambda Architecture principles guided a system that separates the ingestion and processing mechanisms, built entirely on fully managed, serverless AWS services.
You went from collecting real-time data, to processing and joining it with a user settings file, to querying and visualizing the data, all without provisioning or managing a single server.
I hope you found this post useful. In the comments, please share your thoughts on how you might extend this architecture to suit your needs.
Laith Al-Saadoon is a Solutions Architect with a focus on data analytics at Amazon Web Services. He spends his days obsessing over designing customer architectures to process enormous amounts of data at scale. In his free time, he follows the latest trends in artificial intelligence.
Deriving insights from large datasets is central to nearly every industry, and life sciences is no exception. To combat the rising cost of bringing drugs to market, pharmaceutical companies are looking for ways to optimize their drug development processes. They are turning to big data analytics to better quantify the effect that their drug compounds have on different populations and to look for new clinical indications for existing drugs.
A real world evidence (RWE) platform is loosely defined as an integrated set of services and products that life sciences companies use to securely acquire, store, and analyze large, often disparate datasets to gain insight into the functions of a specific drug or intervention. The petabyte-scale data generated from wearables, medical devices, genomics, clinical imaging, and claims (to name a few) allows pharmaceutical and other life sciences companies to build big data platforms to analyze these datasets. And many of them are doing it on AWS.
At the center of nearly every RWE platform is a data lake that houses different data types. It also stores the related metadata to identify where each piece of data came from, who owned it, etc. Analytics engines integrate the relevant streaming (e.g., wearables), structured (e.g., claims data), and unstructured (e.g., notes in electronic health records) data.
In this post, I highlight common architectural patterns that customers are using to maximize the value of real world evidence on AWS. The architecture presented here can be reproduced in multiple regions, so you can respect local data sovereignty requirements, when applicable, while conducting global studies. This post doesn’t cover all considerations for real world evidence (such as security and authentication), but instead focuses on the areas that are related to your data flow.
A data lake is your source of truth
The ability to integrate disparate data types is critical to maximizing the utility of RWE. Life sciences companies have to be able to store, search, and retrieve data of different types and sizes, including (but not limited to) the following:
Streaming data from wearables and medical devices
Structured data from genomics and claims data
Unstructured data from notes in electronic health records
A common solution to integrate these data types is a data lake. Data lakes allow organizations to store all their data, regardless of data type, in a centralized repository. Because data can be stored as-is, there’s no need to convert it to a predefined schema. And you no longer need to know what questions you want to ask of your data beforehand. You can use data lakes for ad hoc analyses, so you can quickly explore and discover new insights without needing to structure the data first, as you would with a traditional data warehouse.
Although there are many uses for storing real world evidence data in a data lake, here are a few examples of the processes it can facilitate for pharma and biotech companies:
Quickly access all data for a given subject, from clinical images to genomics to claims data.
Associate incoming RWE data with existing data in your RWE data lake, such as by subject or study.
Select specific windows in longitudinal studies (microbiome, metabolomics, etc.).
The following diagram shows an architecture of the features within a data lake and several examples of how data enters a data lake:
This architecture, which is entirely serverless and backed by Amazon S3, lets you scale your data lake to easily accommodate any data size for your RWE platform. Additionally, the components presented in this diagram can be secured via IAM controls and service policies. This enables you to secure and protect the sensitive information that often resides within an RWE platform. Amazon S3 is the canonical source for objects in your RWE data lake. You track and search metadata that is associated with these objects in a data catalog built on Amazon Elasticsearch Service and Amazon DynamoDB.
Let’s dive deeper into what this architecture does and how data makes it into the data lake:
Streaming (e.g., wearables), structured, and unstructured data is acquired from myriad devices and sources. Depending on the data size, you might use AWS IoT (streaming), AWS Storage Gateway (mid-size/continual batch), and AWS Snowball (large legacy datasets, such as imaging). AWS IoT writes to Amazon Kinesis Firehose, which transforms the telemetry data in-flight to land both transformed and raw data in Amazon S3.
When data lands in Amazon S3 buckets, an AWS Lambda function is invoked (either by trigger or manually).
This Lambda function writes to a data catalog that is fronted by Amazon API Gateway. The data catalog contains metadata about all the object data in Amazon S3, as well as data that resides in databases, such as Amazon Redshift.
An AWS Lambda function on the other side of API Gateway writes the appropriate metadata about the objects, such as the study that the data was generated from, into Amazon Elasticsearch Service and/or Amazon DynamoDB, which I refer to as the data catalog.
The data catalog mentioned in steps 3 and 4 is central to your data management. In most analyses, the first step is to build a data manifest of where the data-of-interest lies, which is discussed in later sections.
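To make steps 2 through 4 more concrete, here is a simplified sketch of the Lambda function that reacts to an S3 upload and registers the object with the data catalog API. The endpoint URL, path, and metadata fields are hypothetical, and the sketch assumes the API accepts unsigned JSON requests (a production catalog API would typically require IAM-signed calls).

import json
import urllib.request

CATALOG_API = 'https://abc123.execute-api.us-east-1.amazonaws.com/prod/catalog'  # hypothetical endpoint

def handler(event, context):
    # One invocation can carry several S3 object-created records.
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']

        metadata = {
            'bucket': bucket,
            'key': key,
            'size': record['s3']['object'].get('size'),
            'study_id': key.split('/')[0],  # hypothetical convention: key prefix encodes the study
        }

        request = urllib.request.Request(
            CATALOG_API,
            data=json.dumps(metadata).encode('utf-8'),
            headers={'Content-Type': 'application/json'},
            method='POST',
        )
        urllib.request.urlopen(request)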
Normalizing data for real world evidence
As I previously mentioned, data can enter your real world evidence data lake in many different formats. In many cases, you might need to normalize this data into a specific format to ease downstream analysis. For example, you might have to integrate many different electronic health records that are stored in different formats. You also might have to transform genomics data, often represented as a variant call format (VCF) file, into a format that’s easier for big data technologies like Apache Parquet to query. While you can certainly analyze this data in the format in which it entered, it might be better for querying after it’s transformed into a different format, or into a different data store.
In the architecture shown in the following diagram, AWS Step Functions orchestrates the data normalization (extract, transform, load—or ETL) process. Step Functions is a serverless workflow service that ensures that your long-running ETL jobs execute in order and complete successfully. At a high level, you first query the data catalog to get a manifest of the data to be normalized. Normalization occurs on Amazon EC2 instances, and the results are then stored back in Amazon S3 or in databases such as Amazon Redshift for future analysis. Locations of these results data are also logged in the data catalog for future querying.
Here is an example Step Functions state machine for this process:
The following is the accompanying JSON that produces it:
Either a user or an application submits a POST request through Amazon API Gateway to process or transform a set of data. This request contains the query parameters that correspond to generating the data manifest. It is passed into an AWS Step Functions state machine, which orchestrates the ETL process.
A Lambda function queries the data catalog and builds a manifest that contains the location of the data of interest.
The manifest is passed to downstream Lambda functions in the state machine that orchestrate the batch workflow to process the data that’s specified in the manifest.
These Lambda functions submit jobs to AWS Batch to execute batch jobs on Amazon EC2.
Amazon EC2 processes the data (e.g., data normalization) and stores results back in Amazon S3 and/or Amazon Redshift.
Results data is then logged in the data catalog, using the process shown in the data acquisition diagram.
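A rough boto3 sketch of step 4, the Lambda function that hands work to AWS Batch, might look like this. The queue name, job definition, and state-machine input field are placeholders.

import boto3

batch = boto3.client('batch')

def handler(event, context):
    # The Step Functions state passes in the manifest produced earlier.
    manifest_uri = event['manifest_s3_uri']        # hypothetical state-machine field

    response = batch.submit_job(
        jobName='normalize-vcf',                   # placeholder job name
        jobQueue='rwe-etl-queue',                  # placeholder job queue
        jobDefinition='vcf-to-parquet:1',          # placeholder job definition
        containerOverrides={
            'environment': [
                {'name': 'MANIFEST_S3_URI', 'value': manifest_uri},
            ]
        },
    )
    return {'jobId': response['jobId']}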
Analyzing and visualizing data in real world evidence
Ultimately, the value in real world evidence platforms is the value you derive from the wide variety of data that feeds into it. Big data analytics, such as Spark on EMR, machine learning with the Deep Learning AMIs, and healthcare data warehouses, are all fundamental to maximizing the value of RWE.
Given the wide array of questions you can answer, you want your architecture to be as flexible as possible. Your organization also might use business intelligence (BI) tools. Again, as with the data normalization tier, you can use Step Functions to orchestrate your workflow. You first build the data manifest and then submit each portion of your desired analysis to the relevant data location and compute options, such as Amazon EMR or Amazon Redshift. Your organization’s BI tool of choice, such as Amazon QuickSight or one offered by our AWS Big Data Competency partners, can extract and visualize the results.
Here is the workflow in more detail:
A user initiates a query on a dataset. In the context of RWE, this is largely analyzing a cohort in the context of a specific indication (drug response, etc.).
AWS Step Functions invokes a Lambda function to query the data catalog and build a manifest of data.
The manifest is passed to subsequent Lambda functions, which orchestrate the data analysis through different AWS services.
These analyses can include Amazon EMR for population-scale genomics, Amazon EC2 for HPC and machine learning, and Amazon Redshift for your healthcare data warehouse.
Results from each analysis are staged back in Amazon S3 and logged in the data catalog.
BI tools like Amazon QuickSight or Tableau, or data analysis workbooks like Jupyter can query Amazon S3 to visualize results, such as through Amazon Athena.
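Kicking off such an analysis from code can be as simple as the following sketch, assuming a state machine ARN and input fields of your own design.

import json
import boto3

sfn = boto3.client('stepfunctions')

sfn.start_execution(
    stateMachineArn='arn:aws:states:us-east-1:123456789012:stateMachine:rwe-analysis',  # placeholder ARN
    name='cohort-drug-response-2017-10-01',        # execution names must be unique
    input=json.dumps({
        'cohort': 'responders-vs-nonresponders',   # hypothetical manifest query parameters
        'indication': 'drug-response',
    }),
)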
A practical example
Imagine that you are at a pharmaceutical company that is developing a drug to address a neurological condition. You have longitudinal data in the form of brain scans and have noticed that the drug compound under development seems to benefit a specific subpopulation of individuals. As a result, you determine to sequence the genomes of all the responders and non-responders to look for specific biomarkers that could indicate response levels. Success would result in a companion diagnostic that you could use alongside your drug to increase its overall efficacy by delivering it only to the responding population.
Let’s briefly look at how this study might play out in the steps of data acquisition, normalization, and analysis described earlier.
Data acquisition
Data from longitudinal brain imaging studies can be moved to AWS using AWS Snowball. Genomics data coming off your genome sequencers would land in Amazon S3 via AWS Storage Gateway and/or AWS Direct Connect. These datasets are stored in your data catalog where you track items such as date generated, anonymized subject ID, etc.
Data processing
Genomes are transformed from raw reads (e.g., FASTQ) to a human readable format (VCF) that identifies variations in a genome. These VCF files are subsequently extracted, transformed, and loaded into a format that is amenable to big data analytics, such as Parquet.
Data analysis
You build a data manifest of your brain scans in Amazon S3. You use the deep learning AMI and P2 instance family to build machine learning models to identify images that represent different stages in your disorder-of-interest. You then run association analyses that combine your genomics data with your brain imaging models and drug response data to identify what genomic variants associate with improved treatment outcomes. You can manage these analyses via Jupyter notebooks, or you can connect with your BI tool of choice to visualize your results.
It will always be day one for RWE
By implementing real world evidence platforms on AWS, you can quickly integrate and interrogate disparate healthcare data to advance human health. By designing your RWE platform to be flexible, you can readily incorporate new big data technologies as they become relevant to your needs. This enables you to innovate and discover at a quicker rate.
The healthcare ecosystem has chosen a variety of tools and techniques for working with big data, but one tool that comes up again and again in many of the architectures we design and review is Spark on Amazon EMR. Will Spark power the data behind precision medicine?
About the Authors
Dr. Aaron Friedman is a Healthcare and Life Sciences Partner Solutions Architect at Amazon Web Services. He works with ISVs and SIs to architect healthcare solutions on AWS, and bring the best possible experience to their customers. His passion is working at the intersection of science, big data, and software. In his spare time, he’s exploring the outdoors, learning a new thing to cook, or spending time with his wife and his dog, Macaroon.
Nowadays, streaming data is seen and used everywhere—from social networks, to mobile and web applications, IoT devices, instrumentation in data centers, and many other sources. As the speed and volume of this type of data increases, the need to perform data analysis in real time with machine learning algorithms and extract a deeper understanding from the data becomes ever more important. For example, you might want a continuous monitoring system to detect sentiment changes in a social media feed so that you can react to the sentiment in near real time.
In this post, we use Amazon Kinesis Streams to collect and store streaming data. We then use Amazon Kinesis Analytics to process and analyze the streaming data continuously. Specifically, we use the Kinesis Analytics built-in RANDOM_CUT_FOREST function, a machine learning algorithm, to detect anomalies in the streaming data. Finally, we use Amazon Kinesis Firehose to export the anomalies data to Amazon Elasticsearch Service (Amazon ES). We then build a simple dashboard in the open source tool Kibana to visualize the result.
Solution overview
The following diagram depicts a high-level overview of this solution.
Amazon Kinesis Streams
You can use Amazon Kinesis Streams to build your own streaming application. This application can process and analyze streaming data by continuously capturing and storing terabytes of data per hour from hundreds of thousands of sources.
Amazon Kinesis Analytics
Kinesis Analytics provides an easy and familiar standard SQL language to analyze streaming data in real time. One of its most powerful features is that there are no new languages, processing frameworks, or complex machine learning algorithms that you need to learn.
Amazon Kinesis Firehose
Kinesis Firehose is the easiest way to load streaming data into AWS. It can capture, transform, and load streaming data into Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service.
Amazon Elasticsearch Service
Amazon ES is a fully managed service that makes it easy to deploy, operate, and scale Elasticsearch for log analytics, full text search, application monitoring, and more.
Solution summary
The following is a quick walkthrough of the solution that’s presented in the diagram:
IoT sensors send streaming data into Kinesis Streams. In this post, you use a Python script to simulate an IoT temperature sensor device that sends the streaming data.
By using the built-in RANDOM_CUT_FOREST function in Kinesis Analytics, you can detect anomalies in real time with the sensor data that is stored in Kinesis Streams. RANDOM_CUT_FOREST is also an appropriate algorithm for many other kinds of anomaly-detection use cases—for example, the media sentiment example mentioned earlier in this post.
The processed anomaly data is then loaded into the Kinesis Firehose delivery stream.
By using the built-in integration that Kinesis Firehose has with Amazon ES, you can easily export the processed anomaly data into the service and visualize it with Kibana.
Implementation steps
The following sections walk through the implementation steps in detail.
Creating the delivery stream
Open the Amazon Kinesis Streams console.
Create a new Kinesis stream. Give it a name that indicates it’s for raw incoming stream data—for example, RawStreamData. For Number of shards, type 1.
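If you prefer to create the stream from code instead of the console, a minimal boto3 sketch is shown below (the generator script that follows uses the older boto library, so this is only an alternative setup path):

import boto3

kinesis = boto3.client('kinesis', region_name='us-east-1')

kinesis.create_stream(StreamName='RawStreamData', ShardCount=1)

# Wait until the stream is ACTIVE before sending records to it.
kinesis.get_waiter('stream_exists').wait(StreamName='RawStreamData')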
The Python code provided below simulates a streaming application, such as an IoT device, and generates random data and anomalies into a Kinesis stream. The code generates two temperature ranges, where the first range is the hypothetical sensor’s normal operating temperature range (10–20), and the second is the anomaly temperature range (100–120). Make sure to change the stream name on lines 16 and 20 and the Region on line 6 to match your configuration. Alternatively, you can download the Amazon Kinesis Data Generator from this repository and use it to generate the data.
import json
import datetime
import random
import testdata
from boto import kinesis
kinesis = kinesis.connect_to_region("us-east-1")
def getData(iotName, lowVal, highVal):
data = {}
data["iotName"] = iotName
data["iotValue"] = random.randint(lowVal, highVal)
return data
while 1:
rnd = random.random()
if (rnd < 0.01):
data = json.dumps(getData("DemoSensor", 100, 120))
kinesis.put_record("RawStreamData", data, "DemoSensor")
print '***************************** anomaly ************************* ' + data
else:
data = json.dumps(getData("DemoSensor", 10, 20))
kinesis.put_record("RawStreamData", data, "DemoSensor")
print data
Open the Amazon Elasticsearch Service console and create a new domain.
Give the domain a unique name. In the Configure cluster screen, use the default settings.
In the Set up access policy screen, in the Set the domain access policy list, choose Allow access to the domain from specific IP(s).
Enter the public IP address of your computer. Note: If you’re working behind a proxy or firewall, see the “Use a proxy to simplify request signing” section in this AWS Database blog post to learn how to work with a proxy. For additional information about securing access to your Amazon ES domain, see How to Control Access to Your Amazon Elasticsearch Domain in the AWS Security Blog.
After the Amazon ES domain is up and running, you can set up and configure Kinesis Firehose to export results to Amazon ES:
Open the Amazon Kinesis Firehose console and choose Create Delivery Stream.
In the Destination dropdown list, choose Amazon Elasticsearch Service.
Type a stream name, and choose the Amazon ES domain that you created in Step 4.
Provide an index name and ES type. In the S3 bucket dropdown list, choose Create New S3 bucket. Choose Next.
In the configuration, change the Elasticsearch Buffer size to 1 MB and the Buffer interval to 60s. Use the default settings for all other fields. This shortens the time for the data to reach the ES cluster.
Under IAM Role, choose Create/Update existing IAM role. The best practice is to create a new role every time. Otherwise, the console keeps adding policy documents to the same role. Eventually the size of the attached policies causes IAM to reject the role, but it does it in a non-obvious way, where the console basically quits functioning.
Choose Next to move to the Review page.
Review the configuration, and then choose Create Delivery Stream.
Run the Python file for 1–2 minutes, and then press Ctrl+C to stop the execution. This loads some data into the stream for you to visualize in the next step.
Analyzing the data
Now it’s time to analyze the IoT streaming data using Amazon Kinesis Analytics.
Open the Amazon Kinesis Analytics console and create a new application. Give the application a name, and then choose Create Application.
On the next screen, choose Connect to a source. Choose the raw incoming data stream that you created earlier. (Note the stream name Source_SQL_STREAM_001 because you will need it later.)
Use the default settings for everything else. When the schema discovery process is complete, it displays a success message with the formatted stream sample in a table as shown in the following screenshot. Review the data, and then choose Save and continue.
Next, choose Go to SQL editor. When prompted, choose Yes, start application.
Copy the following SQL code and paste it into the SQL editor window.
CREATE OR REPLACE STREAM "TEMP_STREAM" (
"iotName" varchar (40),
"iotValue" integer,
"ANOMALY_SCORE" DOUBLE);
-- Creates an output stream and defines a schema
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
"iotName" varchar(40),
"iotValue" integer,
"ANOMALY_SCORE" DOUBLE,
"created" TimeStamp);
-- Compute an anomaly score for each record in the source stream
-- using Random Cut Forest
CREATE OR REPLACE PUMP "STREAM_PUMP_1" AS INSERT INTO "TEMP_STREAM"
SELECT STREAM "iotName", "iotValue", ANOMALY_SCORE FROM
TABLE(RANDOM_CUT_FOREST(
CURSOR(SELECT STREAM * FROM "SOURCE_SQL_STREAM_001")
)
);
-- Sort records by descending anomaly score, insert into output stream
CREATE OR REPLACE PUMP "OUTPUT_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"
SELECT STREAM "iotName", "iotValue", ANOMALY_SCORE, ROWTIME FROM "TEMP_STREAM"
ORDER BY FLOOR("TEMP_STREAM".ROWTIME TO SECOND), ANOMALY_SCORE DESC;
Choose Save and run SQL. As the application is running, it displays the results as stream data arrives. If you don’t see any data coming in, run the Python script again to generate some fresh data. When there is data, it appears in a grid as shown in the following screenshot. Note that you are selecting data from the source stream name Source_SQL_STREAM_001 that you created previously. Also note the ANOMALY_SCORE column. This is the value that the Random_Cut_Forest function calculates based on the temperature ranges provided by the Python script. Higher (anomaly) temperature ranges have a higher score. Looking at the SQL code, note that the first two blocks of code create two new streams to store temporary data and the final result. The third block of code analyzes the raw source data (Stream_Pump_1) using the Random_Cut_Forest function. It calculates an anomaly score (ANOMALY_SCORE) and inserts it into the TEMP_STREAM stream. The final code block loads the result stored in the TEMP_STREAM into DESTINATION_SQL_STREAM.
Choose Exit (done editing) next to the Save and run SQL button to return to the application configuration page.
Load processed data into the Kinesis Firehose delivery stream
Now, you can export the result from DESTINATION_SQL_STREAM into the Amazon Kinesis Firehose stream that you created previously.
On the application configuration page, choose Connect to a destination.
Choose the stream name that you created earlier, and use the default settings for everything else. Then choose Save and Continue.
On the application configuration page, choose Exit to Kinesis Analytics applications to return to the Amazon Kinesis Analytics console.
Run the Python script again for 4–5 minutes to generate enough data to flow through Amazon Kinesis Streams, Kinesis Analytics, Kinesis Firehose, and finally into the Amazon ES domain.
Open the Kinesis Firehose console, choose the stream, and then choose the Monitoring tab.
As the processed data flows into Kinesis Firehose and Amazon ES, the metrics appear on the Delivery Stream metrics page. Keep in mind that the metrics page takes a few minutes to refresh with the latest data.
Open the Amazon Elasticsearch Service dashboard in the AWS Management Console. The count in the Searchable documents column increases as shown in the following screenshot. In addition, the domain shows a cluster health of Yellow. This is because the index is created with replica shards by default, and a single-instance domain has no second instance to host those redundant copies. To fix this, you can deploy two instances instead of one.
Visualize the data using Kibana
Now it’s time to launch Kibana and visualize the data.
Use the ES domain link to go to the cluster detail page, and then choose the Kibana link as shown in the following screenshot. If you’re working behind a proxy or firewall, see the “Use a proxy to simplify request signing” section in this blog post to learn how to work with a proxy.
In the Kibana dashboard, choose the Discover tab to perform a query.
You can also visualize the data using the different types of charts offered by Kibana. For example, by going to the Visualize tab, you can quickly create a split bar chart that aggregates by ANOMALY_SCORE per minute.
Conclusion
In this post, you learned how to use Amazon Kinesis to collect, process, and analyze real-time streaming data, and then export the results to Amazon ES for analysis and visualization with Kibana. If you have comments about this post, add them to the “Comments” section below. If you have questions or issues with implementing this solution, please open a new thread on the Amazon Kinesis or Amazon ES discussion forums.
Tristan Li is a Solutions Architect with Amazon Web Services. He works with enterprise customers in the US, helping them adopt cloud technology to build scalable and secure solutions on AWS.
If you run an operation that continuously generates a large amount of data, you may want to know what kind of data is being inserted by your application. The ability to analyze data intake quickly can be very valuable for business units, such as operations and marketing. For many operations, it’s important to see what is driving the business at any particular moment. For retail companies, for example, understanding which products are currently popular can aid in planning for future growth. Similarly, for PR companies, understanding the impact of an advertising campaign can help them market their products more effectively.
This post covers an architecture that helps you analyze your streaming data. You’ll build a solution using Amazon DynamoDB Streams, AWS Lambda, Amazon Kinesis Firehose, and Amazon Athena to analyze data intake at a frequency that you choose. And because this is a serverless architecture, you can use all of the services here without the need to provision or manage servers.
The data source
You’ll collect a random sampling of tweets via Twitter’s API and store a variety of attributes in your DynamoDB table, such as: Twitter handle, tweet ID, hashtags, location, and Time-To-Live (TTL) value.
In DynamoDB, the partition key value is used as input to an internal hash function. The output from this function determines the partition in which the item will be stored. When using a composite primary key (a partition key plus a sort key) as a DynamoDB schema, you need to make sure that no single partition key contains many more items than the other partition keys, because this can cause partition-level throttling. For the demonstration in this blog, the Twitter handle is the partition key and the tweet ID is the sort key. This allows you to group and sort tweets from each user.
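To make the item shape concrete, here is a small boto3 sketch of how a single tweet could be written to such a table; the table name, attribute names, and the 24-hour TTL are illustrative assumptions rather than the exact schema used by the provided scripts.

import time

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("TwitterFeed")  # placeholder table name

table.put_item(
    Item={
        "user_id": "some_handle",             # partition key: Twitter handle
        "tweet_id": 862135876541,             # sort key: tweet ID
        "hashtags": ["aws", "serverless"],
        "location": "Seattle, WA",
        "ttl": int(time.time()) + 24 * 3600,  # expire the item after roughly 24 hours
    }
)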
To help you get started, I have written a script that pulls a live Twitter stream that you can use to generate your data. All you need to do is provide your own Twitter Apps credentials, and it should generate the data immediately. Alternatively, I have also provided a script that you can use to generate random Tweets with little effort.
You can find both scripts in the Github repository:
To get your own Twitter credentials, go to https://www.twitter.com/ and sign up for a free account, if you don’t already have one. After your account is set up, go to https://apps.twitter.com/. On the main landing page, choose the Create New App button. After the application is created, go to Keys and Access Tokens to get your credentials for the Twitter API. You’ll need the Consumer Key/Secret and will need to generate an Access Token/Secret. All four values are used to authenticate your requests.
Architecture overview
Before we begin, let’s take a look at what the overall flow of information will look like, from data ingestion into DynamoDB to visualization of the results in Amazon QuickSight.
As illustrated in the architecture diagram above, any changes made to the items in DynamoDB will be captured and processed using DynamoDB Streams. Next, a Lambda function will be invoked by a trigger that is configured to respond to events in DynamoDB Streams. The Lambda function processes the data prior to pushing it to Amazon Kinesis Firehose, which will output to Amazon S3. Finally, you use Amazon Athena to analyze the streaming data landing in Amazon S3. The result can be explored and visualized in Amazon QuickSight for your company’s business analytics.
You’ll need to implement a custom Lambda function to transform the raw <key, value> data stored in DynamoDB into a JSON format that Athena can digest, but I provide sample code that you are free to modify.
Implementation
In the following sections, I’ll walk through how you can set up the architecture discussed earlier.
Create your DynamoDB table
First, let’s create a DynamoDB table and enable DynamoDB Streams. This will enable data to be copied out of this table. From the console, use the user_id as the partition key and tweet_id as the sort key:
After the table is ready, you can enable DynamoDB Streams. This process operates asynchronously, so there is no performance impact on the table when you enable this feature. The easiest way to manage DynamoDB Streams is also through the DynamoDB console.
In the Overview tab of your newly created table, click Manage Stream. In the window, choose the information that will be written to the stream whenever data in the table is added or modified. In this example, you can choose either New image or New and old images.
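If you prefer to script this step rather than use the console, a minimal boto3 sketch could look like the following; the table name and throughput values are placeholders, and the stream is enabled with new and old images as described above.

import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="TwitterFeed",  # placeholder table name
    AttributeDefinitions=[
        {"AttributeName": "user_id", "AttributeType": "S"},
        {"AttributeName": "tweet_id", "AttributeType": "N"},
    ],
    KeySchema=[
        {"AttributeName": "user_id", "KeyType": "HASH"},    # partition key
        {"AttributeName": "tweet_id", "KeyType": "RANGE"},  # sort key
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
    StreamSpecification={
        "StreamEnabled": True,
        "StreamViewType": "NEW_AND_OLD_IMAGES",
    },
)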
For more details on this process, check out our documentation:
Before creating the Lambda function, you need to configure the Kinesis Firehose delivery stream so that it’s ready to accept data from Lambda. Open the Firehose console and choose Create Firehose Delivery Stream. From here, choose S3 as the destination and use the following information to configure the resource. Note the Delivery stream name because you will use it in the next step.
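The same delivery stream can also be created with the SDK; in this sketch the stream name, IAM role ARN, and bucket ARN are placeholders to replace with your own values.

import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="twitter-to-s3",  # placeholder delivery stream name
    DeliveryStreamType="DirectPut",
    S3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose_delivery_role",  # placeholder
        "BucketARN": "arn:aws:s3:::myBucket",                                # placeholder
        "Prefix": "dynamodb/streams/transactions/",
        "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 300},
        "CompressionFormat": "UNCOMPRESSED",
    },
)

Kinesis Firehose appends a YYYY/MM/DD/HH/ (UTC) prefix to the objects it delivers to S3, which is what the Athena partitions later in this post map to.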
For more details on this process, check out our documentation:
Now that Kinesis Firehose is ready to accept data, you can create your Lambda function.
From the AWS Lambda console, choose the Create a Lambda function button and use the Blank Function blueprint. Enter a name and description, and choose Python 2.7 as the Runtime. Note your Lambda function name because you’ll need it in the next step.
In the Lambda function code field, you can paste the script that I have written for this purpose. All this function needs is your Firehose delivery stream name, set as an environment variable.
The handler should be set to lambda_function.lambda_handler and you can use the existing lambda_dynamodb_streams role that’s been created by default.
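If you want to see the general shape of that function before downloading the sample, the following sketch processes DynamoDB Streams records and forwards them to Kinesis Firehose as newline-delimited JSON; the environment variable name DELIVERY_STREAM_NAME and the output fields (which mirror the Athena table created later in this post) are my assumptions, so the sample code in the repository may differ in detail.

import json
import os
from datetime import datetime

import boto3

firehose = boto3.client("firehose")
DELIVERY_STREAM_NAME = os.environ["DELIVERY_STREAM_NAME"]  # assumed environment variable name


def lambda_handler(event, context):
    records = []
    for record in event["Records"]:
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue
        image = record["dynamodb"]["NewImage"]  # DynamoDB-typed JSON, e.g. {"user_id": {"S": "..."}}
        row = {
            "table_name": record["eventSourceARN"].split("/")[1],
            "user_id": image["user_id"]["S"],
            "tweet_id": int(image["tweet_id"]["N"]),
            "approx_post_time": datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S"),
        }
        records.append({"Data": (json.dumps(row) + "\n").encode("utf-8")})
    if records:
        firehose.put_record_batch(DeliveryStreamName=DELIVERY_STREAM_NAME, Records=records)
    return "Processed {0} records".format(len(records))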
Enable DynamoDB trigger and start collecting data
Everything is ready to go. Open your table using the DynamoDB console and go to the Triggers tab. Select the Create trigger drop down list and choose Existing Lambda function. In the pop-up window, select the function that you just created, and choose the Create button.
At this point, you can start collecting data with the Python scripts that I’ve provided. The first script pulls public Twitter data, and the other generates fake tweets using Lorem Ipsum text.
Configure Amazon Athena to read the data
Next, you will configure Amazon Athena so that it can read the data Kinesis Firehose outputs to Amazon S3 and allow you to analyze the data as needed. You can connect to Athena directly from the Athena console, and you can establish a connection using JDBC or the Athena API. In this example, I’m going to demonstrate what this looks like on the Athena console.
First, create a new database and a new table. You can do this by running the following two queries. The first query creates a new database:
CREATE DATABASE IF NOT EXISTS ddbtablestats
And the second query creates a new table:
CREATE EXTERNAL TABLE IF NOT EXISTS ddbtablestats.twitterfeed (
`table_name` string,
`user_id` string,
`tweet_id` bigint,
`approx_post_time` timestamp
) PARTITIONED BY (
year string,
month string,
day string,
hour string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('serialization.format' = '1')
LOCATION 's3://myBucket/dynamodb/streams/transactions/'
Note that this table is created using partitions. Partitioning separates your data into logical parts based on certain criteria, such as date, location, language, etc. This allows Athena to selectively pull your data without needing to process the entire data set. This effectively minimizes the query execution time, and it also allows you to have greater control over the data that you want to query.
After the query has completed, you should be able to see the table in the left side pane of the Athena dashboard.
After the database and table have been created, execute the ALTER TABLE query to populate the partitions in your table. Replace the date with the date on which the script was executed.
ALTER TABLE ddbtablestats.TwitterFeed ADD IF NOT EXISTS
PARTITION (year='2017',month='05',day='17',hour='01') location 's3://myBucket/dynamodb/streams/transactions/2017/05/17/01/'
Using the Athena console, you’ll need to manually add each additional partition that you’d like to analyze. However, you can automate this process programmatically by using the JDBC driver or any AWS SDK of your choice.
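For example, a scheduled job could add the partition for the current hour with a few lines of boto3; the database, table, bucket, and query result location below are placeholders that match the examples above.

from datetime import datetime

import boto3

athena = boto3.client("athena")
now = datetime.utcnow()

ddl = (
    "ALTER TABLE ddbtablestats.twitterfeed ADD IF NOT EXISTS "
    "PARTITION (year='{y}', month='{m}', day='{d}', hour='{h}') "
    "LOCATION 's3://myBucket/dynamodb/streams/transactions/{y}/{m}/{d}/{h}/'"
).format(y=now.strftime("%Y"), m=now.strftime("%m"), d=now.strftime("%d"), h=now.strftime("%H"))

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "ddbtablestats"},
    ResultConfiguration={"OutputLocation": "s3://myBucket/athena-query-results/"},  # placeholder
)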
For more information on partitioning in Athena, check out our documentation:
That’s it! Let’s run the following query to see the top 10 most active Twitter users for a given day. You can do this from the Athena console:
SELECT user_id, COUNT(DISTINCT tweet_id) tweets FROM ddbTableStats.TwitterFeed
WHERE year='2017' AND month='05' AND day='17'
GROUP BY user_id
ORDER BY tweets DESC
LIMIT 10
The result should look similar to the following:
Linking Athena to Amazon QuickSight
Finally, to make this data available to a larger audience, let’s visualize this data in Amazon QuickSight. Amazon QuickSight provides native connectivity to AWS data sources such as Amazon Redshift, Amazon RDS, and Amazon Athena. Amazon QuickSight can also connect to on-premises databases, Excel, or CSV files, and it can connect to cloud data sources such as Salesforce.com. For this solution, we will connect Amazon QuickSight to the Athena table we just created.
Amazon QuickSight has a free tier that provides 1 user and 1GB of SPICE (Superfast Parallel In-memory Calculated Engine) capacity free. So you can sign up and use QuickSight free of charge.
When you are signing up for Amazon QuickSight, ensure that you grant permissions for QuickSight to connect to Athena and the S3 bucket where the data is stored.
After you’ve signed up, choose New analysis, choose New data set, and then select the Athena data source option. Enter a name for your data source and proceed to the next prompt. At this point, you should see the Athena table you created earlier.
Choose the option to import the data to SPICE for quicker analysis. SPICE is an in-memory optimized calculation engine designed for fast data visualization through parallel processing. SPICE also enables you to refresh your data sets at a regular interval or on demand.
In the dialog box, confirm this data set creation, and you’ll arrive on the landing page where you can start building your graph. The X-axis will represent the user_id and the Value will be used to represent the SUM total of the tweets from each user.
The Amazon QuickSight report looks like this:
Through this visualization, I can easily see that there are 3 users that tweeted over 20 times that day and that the majority of the users have fewer than 10 tweets that day. I can also set up a scheduled refresh of my SPICE dataset so that I have a dashboard that is regularly updated with the latest data.
Closing thoughts
Here are the benefits that you can gain from using this architecture:
You can optimize your DynamoDB schema design to follow AWS best practice recommendations.
You can run analysis and business intelligence to understand current customer demand for your business.
You can store incremental backups for future auditing.
The flexibility of AWS services invites you to create and design the ideal workflow for your production environment at any scale, and, as always, if you ever need guidance, don’t hesitate to reach out to us. I hope this has been helpful to you! Please leave any questions and comments below.
Additional Reading
Learn how to analyze VPC Flow Logs with Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight.
About the Author
Rendy Oka is a Big Data Support Engineer for Amazon Web Services. He provides consultations and architectural designs and partners with the TAMs, Solutions Architects, and AWS product teams to help develop solutions for our customers. He is also a team lead for the big data support team in Seattle. Rendy has traveled to dozens of countries around the world and takes every opportunity to experience the local culture wherever he goes.
I am an AWS Professional Services consultant, which has me working directly with AWS customers on a daily basis. One of my customers recently asked me to provide a solution to help them fulfill their security requirements by having the flow log data from VPC Flow Logs sent to a central AWS account. This is a common requirement in some companies so that logs can be available for in-depth analysis. In addition, my customers regularly request a simple, scalable, and serverless solution that doesn’t require them to create and maintain custom code.
In this blog post, I demonstrate how to configure your AWS accounts to send flow log data from VPC Flow Logs to an Amazon S3 bucket located in a central AWS account by using only fully managed AWS services. The benefit of using fully managed services is that you can lower or even completely eliminate operational costs because AWS manages the resources and scales the resources automatically.
Solution overview
The solution in this post uses VPC Flow Logs, which is configured in each source account to send flow logs to an Amazon CloudWatch Logs log group. To receive the logs from multiple accounts, this solution uses a CloudWatch Logs destination in the central account. Finally, the solution uses fully managed Amazon Kinesis Firehose, which delivers streaming data to scalable and durable S3 object storage automatically, without the need to write custom applications or manage resources. When the logs are processed and stored in an S3 bucket, they can be tiered into a lower-cost, long-term storage solution (such as Amazon Glacier) automatically to help meet any company-specific or industry-specific requirements for data retention.
The following diagram illustrates the process and components of the solution described in this post.
As numbered in the preceding diagram, these are the high-level steps for implementing this solution:
Enable VPC Flow Logs to send data to a CloudWatch Logs log group in each source account.
In the central account, create the resources that will receive the logs: an S3 bucket, a Kinesis Firehose delivery stream, and a CloudWatch Logs destination.
Finally, in the source accounts, set a subscription filter on the CloudWatch Logs log group to send data to the CloudWatch Logs destination.
Configure the solution by using AWS CloudFormation and the AWS CLI
Now that I have explained the solution, its benefits, and the components involved, I will show how to configure a source account by using the AWS CLI and the central account using a CloudFormation template. To implement this, you need two separate AWS accounts. If you need to set up a new account, navigate to the AWS home page, choose Create an AWS Account, and follow the instructions. Alternatively, you can use AWS Organizations to create your accounts. See AWS Organizations – Policy-Based Management for Multiple AWS Accounts for more details and some step-by-step instructions about how to use this service.
Note your source and target account IDs, source account VPC ID number, and target account region where you create your resources with the CloudFormation template. You will need these values as part of implementing this solution.
Gather the required configuration information
To find your AWS account ID number in the AWS Management Console, choose Support in the navigation bar, and then choose Support Center. Your currently signed-in, 12-digit account ID number appears in the upper-right corner under the Support menu. Sign in to both the source and target accounts and take note of their respective account numbers.
To find your VPC ID number in the AWS Management Console, sign in to your source account and choose Services. In the search box, type VPC and then choose VPC (Isolated Cloud Resources).
In the VPC console, click Your VPCs, as shown in the following screenshot.
Note your VPC ID.
Finally, take a note of the region in which you are creating your resources in the target account. The region name is shown in the region selector in the navigation bar of the AWS Management Console (see the following screenshot), and the region is shown in your browser’s address bar. Sign in to your target account, choose Services, and type CloudFormation. In the following screenshot, the region name is Ireland and the region is eu-west-1.
Create the resources in the central account
In this example, I use a CloudFormation template to create all related resources in the central account. You can use CloudFormation to create and manage a collection of AWS resources called a stack. CloudFormation also takes care of the resource provisioning for you.
The provided CloudFormation template creates an S3 bucket that will store all the VPC Flow Logs, IAM roles with associated IAM policies used by Kinesis Firehose and the CloudWatch Logs destination, a Kinesis Firehose delivery stream, and a CloudWatch Logs destination.
To configure the resources in the central account:
Sign in to the AWS Management Console, navigate to CloudFormation, and then choose Create Stack. On the Select Template page, type the URL of the CloudFormation template (https://s3.amazonaws.com/awsiammedia/public/sample/VPCFlowLogsCentralAccount/targetaccount.template) for the target account and choose Next.
The CloudFormation template defines a parameter, paramDestinationPolicy, which sets the IAM policy on the CloudWatch Logs destination. This policy governs which AWS accounts can create subscription filters against this destination.
Change the Principal to your SourceAccountID, and in the Resource section, change TargetAccountRegion and TargetAccountID to the values you noted in the previous section.
After you have updated the policy, choose Next. On the Options page, choose Next.
On the Review page, scroll down and choose the I acknowledge that AWS CloudFormation might create IAM resources with custom names check box. Finally, choose Create to create the stack.
After you are done creating the stack, verify the status of the resources. Choose the check box next to the Stack Name and choose the Resources tab, where you can see a list of all the resources created with this template and their status.
This completes the configuration of the central account.
Configure a source account
Now that you have created all the necessary resources in the central account, I will show you how to configure a source account by using the AWS CLI, a tool for managing your AWS resources. For more information about installing and configuring the AWS CLI, see Configuring the AWS CLI.
To configure a source account:
Create the IAM role that will grant VPC Flow Logs the permissions to send data to the CloudWatch Logs log group. To start, create a trust policy in a file named TrustPolicyForCWL.json by using the following policy document.
Use the following command to create the IAM role, specifying the trust policy file you just created. Note the returned Arn value because that will also be passed to VPC Flow Logs later.
aws iam create-role \
--role-name PublishFlowLogs \
--assume-role-policy-document file://~/TrustPolicyForCWL.json
Now, create a permissions policy to define which actions VPC Flow Logs can perform on the source account. Start by creating the permissions policy in a file named PermissionsForVPCFlowLogs.json. The following set of permissions (in the Action element) is the minimum requirement for VPC Flow Logs to be able to send data to the CloudWatch Logs log group.
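The two JSON policy documents referenced above aren’t reproduced here. As a rough, equivalent alternative to the CLI steps, the following boto3 sketch creates the role with the commonly documented trust relationship for the VPC Flow Logs service and attaches an inline policy with the minimum CloudWatch Logs permissions; verify the exact policy text against the current VPC Flow Logs documentation before using it.

import json

import boto3

iam = boto3.client("iam")

# Trust policy: allow the VPC Flow Logs service to assume this role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "vpc-flow-logs.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# Minimum permissions for VPC Flow Logs to publish to CloudWatch Logs
permissions_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "logs:CreateLogGroup",
            "logs:CreateLogStream",
            "logs:PutLogEvents",
            "logs:DescribeLogGroups",
            "logs:DescribeLogStreams",
        ],
        "Resource": "*",
    }],
}

role = iam.create_role(
    RoleName="PublishFlowLogs",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.put_role_policy(
    RoleName="PublishFlowLogs",
    PolicyName="PermissionsForVPCFlowLogs",
    PolicyDocument=json.dumps(permissions_policy),
)
print(role["Role"]["Arn"])  # pass this role ARN to VPC Flow Logs in the next step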
Now that you have created an IAM role with required permissions and a CloudWatch Logs log group, change the placeholder values of the role in the following command and run it to enable VPC Flow Logs.
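The original CLI command isn’t shown above; a rough boto3 equivalent looks like this, with the VPC ID, log group name, and role ARN as placeholders.

import boto3

ec2 = boto3.client("ec2")

ec2.create_flow_logs(
    ResourceType="VPC",
    ResourceIds=["vpc-0123456789abcdef0"],  # placeholder: your source VPC ID
    TrafficType="ALL",
    LogGroupName="vpc-flow-logs",           # placeholder: your CloudWatch Logs log group
    DeliverLogsPermissionArn="arn:aws:iam::111111111111:role/PublishFlowLogs",  # placeholder ARN
)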
Finally, change the destination arn in the following command to reflect your targetaccountregion and targetaccountID, and run the command to subscribe your CloudWatch Logs log group to CloudWatch Logs in the central account.
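Again, the command itself isn’t reproduced here; a rough boto3 equivalent is shown below, with the log group name, region, account ID, and destination name as placeholders.

import boto3

logs = boto3.client("logs")

logs.put_subscription_filter(
    logGroupName="vpc-flow-logs",   # placeholder: the source account log group
    filterName="FlowLogsToCentralAccount",
    filterPattern="",               # an empty pattern forwards every log event
    destinationArn="arn:aws:logs:eu-west-1:222222222222:destination:VPCFlowLogsDestination",  # placeholder
)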
This completes the configuration of this central logging solution. Within a few minutes, you should start seeing compressed logs sent from your source account to the S3 bucket in your central account (see the following screenshot). You can process and analyze these logs with the analytics tool of your choice. In addition, you can tier the data into a lower cost, long-term storage solution such as Amazon Glacier and archive the data to meet the data retention requirements of your company.
Summary
In this post, I demonstrated how to configure AWS accounts to send VPC Flow Logs to an S3 bucket located in a central AWS account by using only fully managed AWS services. If you have comments about this post, submit them in the “Comments” section below. If you have questions about implementing this solution, start a new thread on the CloudWatch forum.
– Tomasz