Tag Archives: serverless architecture

Take the Journey: Build Your First Serverless Web Application

Post Syndicated from Tara Walker original https://aws.amazon.com/blogs/aws/build-your-first-serverless-application/

I realized at a young age that I really liked writing those special statements that would control the computer and make it work in the manner in which I desired. This technique of controlling the computer and building things on the machine, I learned from my teachers was called writing code, and it fascinated me. Even now, what seems like centuries later, I still get the thrill of writing code, building cool solutions, and tackling all the associated challenges of this craft. It is no wonder then, that I am a huge fan of serverless computing and serverless architectures.

Serverless Computing allows me to do what I enjoy, which is write code, without having to provision and/or configure servers. Using the AWS Serverless Platform means that all the heavy lifting of server management is handled by AWS, allowing you to focus on building your application.

If you enjoy coding like I do and have yet to dive into building serverless applications, boy do I have some sensational news for you. You can build your own serverless web application with our new Serverless Web Application Guide, which provides step-by-step instructions for you to create and deploy your serverless web application on AWS.

 

The Serverless Web Application Guide is a hands-on tutorial that will assist you in building a fully scalable, serverless web application using the following AWS Services:

  • AWS Lambda: a managed service for serverless compute that allows you to run code without provisioning or managing servers
  • Amazon S3: a managed service that provides simple, durable, scalable object storage
  • Amazon Cognito: a managed service that allows you to add user sign-up, and data synchronization to your application
  • Amazon API Gateway: a managed service which you can create, publish, and maintain secure APIs
  • Amazon DynamoDB: a fast and flexible NoSQL managed cloud database with support for various document and key-value storage models

The application you will build is a simple web application designed for a fictional transportation service. The application will enable users to register and login into the website to request rides from a very unique transportation fleet. You will accomplish this by using the aforementioned AWS services with the serverless application architecture shown in the diagram below.

 
The guide breaks up the each step to build your serverless web application into five separate modules.

 

  1. Static Web Hosting: Amazon S3 hosts static web resources including HTML, CSS, JavaScript, and image files that are loaded in the user’s browser.
  2. User Management: Amazon Cognito provides user management and authentication functions to secure the backend API.
  3. Serverless Backend: Amazon DynamoDB provides a persistence layer where data can be stored by the API’s Lambda function.
  4. RESTful APIs: JavaScript executed in the browser sends and receives data from a public backend API built using AWS Lambda and API Gateway.
  5. Resource Cleanup: All the resources created throughout the tutorial will be terminated.

To be successful in building the application, you must remember to complete each module in sequential order, as the modules are dependent on resources created in the previous one. Some of the guide’s modules provide CloudFormation templates to aid you in generating the necessary resources to build the application if you do not wish to create them manually.

 

Summary

Now that you know all about this fantastic new guide for building a serverless web application, you are ready to journey into the world of AWS serverless computing and have some fun writing the code to build the application. The guide is great for beginners and yet still has cool features that even seasoned serverless computing developers will enjoy building. And to top it off, you don’t have to worry about the cost. Each service used is eligible for the AWS Free Tier and is only estimated to cost less than $0.25 if you are outside of Free Tier usage limits.

Take the plunge today and dive into building serverless applications on the AWS serverless platform with this new and exciting Serverless Web Application Guide.

 

Tara

Analysis of Top-N DynamoDB Objects using Amazon Athena and Amazon QuickSight

Post Syndicated from Rendy Oka original https://aws.amazon.com/blogs/big-data/analysis-of-top-n-dynamodb-objects-using-amazon-athena-and-amazon-quicksight/

If you run an operation that continuously generates a large amount of data, you may want to know what kind of data is being inserted by your application. The ability to analyze data intake quickly can be very valuable for business units, such as operations and marketing. For many operations, it’s important to see what is driving the business at any particular moment. For retail companies, for example, understanding which products are currently popular can aid in planning for future growth. Similarly, for PR companies, understanding the impact of an advertising campaign can help them market their products more effectively.

This post covers an architecture that helps you analyze your streaming data. You’ll build a solution using Amazon DynamoDB Streams, AWS Lambda, Amazon Kinesis Firehose, and Amazon Athena to analyze data intake at a frequency that you choose. And because this is a serverless architecture, you can use all of the services here without the need to provision or manage servers.

The data source

You’ll collect a random sampling of tweets via Twitter’s API and store a variety of attributes in your DynamoDB table, such as: Twitter handle, tweet ID, hashtags, location, and Time-To-Live (TTL) value.

In DynamoDB, the primary key is used as an input to an internal hash function. The output from this function determines the partition in which the data will be stored. When using a combination of primary key and sort key as a DynamoDB schema, you need to make sure that no single partition key contains many more objects than the other partition keys because this can cause partition level throttling. For the demonstration in this blog, the Twitter handle will be the primary key and the tweet ID will be the sort key. This allows you to group and sort tweets from each user.

To help you get started, I have written a script that pulls a live Twitter stream that you can use to generate your data. All you need to do is provide your own Twitter Apps credentials, and it should generate the data immediately. Alternatively, I have also provided a script that you can use to generate random Tweets with little effort.

You can find both scripts in the Github repository:

https://github.com/awslabs/aws-blog-dynamodb-analysis

There are some modules that you may need to install to run these scripts. You can find them in Python’s module repository:

To get your own Twitter credentials, go to https://www.twitter.com/ and sign up for a free account, if you don’t already have one. After your account is set up, go to https://apps.twitter.com/. On the main landing page, choose the Create New App button. After the application is created, go to Keys and Access Tokens to get your credentials to use the Twitter API. You’ll need to generate Customer Tokens/Secret and Access Token/Secret. All four keys will be used to authenticate your request.

Architecture overview

Before we begin, let’s take a look at the overall flow of information will look like, from data ingestion into DynamoDB to visualization of results in Amazon QuickSight.

As illustrated in the architecture diagram above, any changes made to the items in DynamoDB will be captured and processed using DynamoDB Streams. Next, a Lambda function will be invoked by a trigger that is configured to respond to events in DynamoDB Streams. The Lambda function processes the data prior to pushing to Amazon Kinesis Firehose, which will output to Amazon S3. Finally, you use Amazon Athena to analyze the streaming data landing in Amazon S3. The result can be explored and visualized in Amazon QuickSight for your company’s business analytics.

You’ll need to implement your custom Lambda function to help transform the raw <key, value> data stored in DynamoDB to a JSON format for Athena to digest, but I can help you with a sample code that you are free to modify.

Implementation

In the following sections, I’ll walk through how you can set up the architecture discussed earlier.

Create your DynamoDB table

First, let’s create a DynamoDB table and enable DynamoDB Streams. This will enable data to be copied out of this table. From the console, use the user_id as the partition key and tweet_id as the sort key:

After the table is ready, you can enable DynamoDB Streams. This process operates asynchronously, so there is no performance impact on the table when you enable this feature. The easiest way to manage DynamoDB Streams is also through the DynamoDB console.

In the Overview tab of your newly created table, click Manage Stream. In the window, choose the information that will be written to the stream whenever data in the table is added or modified. In this example, you can choose either New image or New and old images.

For more details on this process, check out our documentation:

http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html

Configure Kinesis Firehose

Before creating the Lambda function, you need to configure Kinesis Firehose delivery stream so that it’s ready to accept data from Lambda. Open the Firehose console and choose Create Firehose Delivery Stream. From here, choose S3 as the destination and use the following to information to configure the resource. Note the Delivery stream name because you will use it in the next step.

For more details on this process, check out our documentation:

http://docs.aws.amazon.com/firehose/latest/dev/basic-create.html#console-to-s3

Create your Lambda function

Now that Kinesis Firehose is ready to accept data, you can create your Lambda function.

From the AWS Lambda console, choose the Create a Lambda function button and use the Blank Function. Enter a name and description, and choose Python 2.7 as the Runtime. Note your Lambda function name because you’ll need it in the next step.

In the Lambda function code field, you can paste the script that I have written for this purpose. All this function needs is the name of your Firehose stream name set as an environment variable.

import boto3
import json
import os

# Initiate Firehose client
firehose_client = boto3.client('firehose')

def lambda_handler(event, context):
    records = []
    batch   = []
    try :
        for record in event['Records']:
            tweet = {}
            t_stats = '{ "table_name":"%s", "user_id":"%s", "tweet_id":"%s", "approx_post_time":"%d" }\n' \
                      % ( record['eventSourceARN'].split('/')[1], \
                          record['dynamodb']['Keys']['user_id']['S'], \
                          record['dynamodb']['Keys']['tweet_id']['N'], \
                          int(record['dynamodb']['ApproximateCreationDateTime']) )
            tweet["Data"] = t_stats
            records.append(tweet)
        batch.append(records)
        res = firehose_client.put_record_batch(
            DeliveryStreamName = os.environ['firehose_stream_name'],
            Records = batch[0]
        )
        return 'Successfully processed {} records.'.format(len(event['Records']))
    except Exception :
        pass

The handler should be set to lambda_function.lambda_handler and you can use the existing lambda_dynamodb_streams role that’s been created by default.

Enable DynamoDB trigger and start collecting data

Everything is ready to go. Open your table using the DynamoDB console and go to the Triggers tab. Select the Create trigger drop down list and choose Existing Lambda function. In the pop-up window, select the function that you just created, and choose the Create button.

At this point, you can start collecting data with the Python script that I’ve provided. The first one will create a script that will pull public Twitter data and the other will generate fake tweets using Lorem Ipsum text.

Configure Amazon Athena to read the data

Next, you will configure Amazon Athena so that it can read the data Kinesis Firehose outputs to Amazon S3 and allow you to analyze the data as needed. You can connect to Athena directly from the Athena console, and you can establish a connection using JDBC or the Athena API. In this example, I’m going to demonstrate what this looks like on the Athena console.

First, create a new database and a new table. You can do this by running the following two queries. The first query creates a new database:

CREATE DATABASE IF NOT EXISTS ddbtablestats

And the second query creates a new table:

CREATE EXTERNAL TABLE IF NOT EXISTS ddbtablestats.twitterfeed (
    `table_name` string,
    `user_id` string,
    `tweet_id` bigint,
    `approx_post_time` timestamp 
) PARTITIONED BY (
    year string,
    month string,
    day string,
    hour string 
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('serialization.format' = '1')
LOCATION 's3://myBucket/dynamodb/streams/transactions/'

Note that this table is created using partitions. Partitioning separates your data into logical parts based on certain criteria, such as date, location, language, etc. This allows Athena to selectively pull your data without needing to process the entire data set. This effectively minimizes the query execution time, and it also allows you to have greater control over the data that you want to query.

After the query has completed, you should be able to see the table in the left side pane of the Athena dashboard.

After the database and table have been created, execute the ALTER TABLE query to populate the partitions in your table. Replace the date with the current date when the script was executed.

ALTER TABLE ddbtablestats.TwitterFeed ADD IF NOT EXISTS
PARTITION (year='2017',month='05',day='17',hour='01') location 's3://myBucket/dynamodb/streams/transactions/2017/05/17/01/'

Using the Athena console, you’ll need to manually populate each partition for each additional partition that you’d like to analyze, however you can programmatically automate this process by using the JDBC driver or any AWS SDK of your choice.

For more information on partitioning in Athena, check out our documentation:

http://docs.aws.amazon.com/athena/latest/ug/partitions.html

Querying the data in Amazon Athena

This is it! Let’s run this query to see the top 10 most active Twitter users in the last 24 hours. You can do this from the Athena console:

SELECT user_id, COUNT(DISTINCT tweet_id) tweets FROM ddbTableStats.TwitterFeed
WHERE year='2017' AND month='05' AND day='17'
GROUP BY user_id
ORDER BY tweets DESC
LIMIT 10

The result should look similar to the following:

Linking Athena to Amazon QuickSight

Finally, to make this data available to a larger audience, let’s visualize this data in Amazon QuickSight. Amazon QuickSight provides native connectivity to AWS data sources such as Amazon Redshift, Amazon RDS, and Amazon Athena. Amazon QuickSight can also connect to on-premises databases, Excel, or CSV files, and it can connect to cloud data sources such as Salesforce.com. For this solution, we will connect Amazon QuickSight to the Athena table we just created.

Amazon QuickSight has a free tier that provides 1 user and 1GB of SPICE (Superfast Parallel In-memory Calculated Engine) capacity free. So you can sign up and use QuickSight free of charge.

When you are signing up for Amazon QuickSight, ensure that you grant permissions for QuickSight to connect to Athena and the S3 bucket where the data is stored.

After you’ve signed up, navigate to the new analysis button, and choose new data set, and then select the Athena data source option. Create a new name for your data source and proceed to the next prompt. At this point, you should see the Athena table you created earlier.

Choose the option to import the data to SPICE for a quicker analysis. SPICE is an in-memory optimized calculation engine that is designed for quick data visualization through parallel processing. SPICE also enables you to refresh your data sets at a regular interval or on-demand as you want.

In the dialog box, confirm this data set creation, and you’ll arrive on the landing page where you can start building your graph. The X-axis will represent the user_id and the Value will be used to represent the SUM total of the tweets from each user.

The Amazon QuickSight report looks like this:

Through this visualization, I can easily see that there are 3 users that tweeted over 20 times that day and that the majority of the users have fewer than 10 tweets that day. I can also set up a scheduled refresh of my SPICE dataset so that I have a dashboard that is regularly updated with the latest data.

Closing thoughts

Here are the benefits that you can gain from using this architecture:

  1. You can optimize the design of your DynamoDB schema that follows AWS best practice recommendations.
  1. You can run analysis and data intelligence in order to understand the current customer demands for your business.
  1. You can store incremental backup for future auditing.

The flexibility of our AWS services invites you to create and design the ideal workflow for your production at any scale, and, as always, if you ever need some guidance, don’t hesitate to reach out to us.I  hope this has been helpful to you! Please leave any questions and comments below.

 


Additional Reading

Learn how to analyze VPC Flow Logs with Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight.


About the Author

Rendy Oka is a Big Data Support Engineer for Amazon Web Services. He provides consultations and architectural designs and partners with the TAMs, Solution Architects, and AWS product teams to help develop solutions for our customers. He is also a team lead for the big data support team in Seattle. Rendy has traveled to dozens of countries around the world and takes every opportunity to experience the local culture wherever he goes

 

 

 

 

Introducing Our NEW AWS Community Heroes (Summer 2017 Edition)

Post Syndicated from Tara Walker original https://aws.amazon.com/blogs/aws/introducing-our-new-aws-community-heroes-summer-2017-edition/

The AWS Community Heroes program seeks to recognize and honor the most engaged Amazon Web Services developers who have had a positive impact in the global community.  If you are interested in learning more about the AWS Community Heroes program or curious about ways to get involved with your local AWS community, please click the graphic below to see the AWS Heroes talk directly about the program.

Now that you know more about the AWS Community Hero program, I am elated to introduce to you all the latest AWS Heroes to join the fold:

These guys and gals impart their passion for AWS and cloud technologies with the technical community by sharing their time and knowledge across social media and via in-person events.

Ben Kehoe

Ben Kehoe works in the field of Cloud Robotics—using the internet to enable robots to do more and better things—an area of IoT involving computation in the cloud and at the edge, Big Data, and machine learning. Approaching cloud computing from this angle, Ben focuses on developing business value rapidly through serverless (and service full) applications.

At iRobot, Ben guided the transition to a serverless architecture on AWS based on AWS Lambda and AWS IoT to support iRobot’s connected robot fleet. This architecture enables iRobot to focus on its core mission of building amazing robots with a minimum of development and operations effort.

Ben seeks to amplify voices from dev, operations, and security to help the community shape the evolution of serverless and event-driven designs for IoT and cloud computing more broadly.

 

 

Marcia Villalba

Marcia is a Senior Full-stack Developer at Rovio, the creators of Angry Birds. She is originally from Uruguay but has been living in Finland for almost a decade.

She has been designing and developing software professionally for over 10 years. For more than four years she has been working with AWS, including the past year which she’s worked mostly with serverless technologies.

Marcia runs her own YouTube channel, in which she publishes at least one new video every week. In her channel, she focuses on teaching how to use AWS serverless technologies and managed services. In addition to her professional work, she is the Tech Lead in “Girls in Tech” Helsinki, helping to inspire more women to enter into technology and programming.

 

 

Joshua Levy

Joshua Levy is an entrepreneur, engineer, writer, and serial startup technologist and advisor in cloud, AI, search, and startup scaling.

He co-founded the Open Guide to AWS, which is one of the most popular AWS resources and communities on the web. The collaborative project welcomes new contributors or editors, and anyone who wishes to ask or answer questions.

Josh has years of experience in hands-on software engineering and leadership at fast-growing consumer and enterprise startups, including Viv Labs (acquired by Samsung) and BloomReach (where he led engineering and AWS infrastructure), and a background in AI and systems research at SRI and mathematics at Berkeley. He has a passion for improving how we share knowledge on complex engineering, product, or business topics. If you share any of these interests, reach out on Twitter or find his contact details on GitHub.

 

Michael Ezzell

Michael Ezzell is a frequent contributor of detailed, in-depth solutions to questions spanning a wide variety of AWS services on Stack Overflow and other sites on the Stack Exchange Network.

Michael is the resident DBA and systems administrator for Online Rewards, a leading provider of web-based employee recognition, channel incentive, and customer loyalty programs, where he was a key player in the company’s full transition to the AWS platform.

Based in Cincinnati, and known to coworkers and associates as “sqlbot,” he also provides design, development, and support services to freelance consulting clients for AWS services and MySQL, as well as, broadcast & cable television and telecommunications technologies.

 

 

 

Thanos Baskous

Thanos Baskous is a San Francisco-based software engineer and entrepreneur who is passionate about designing and building scalable and robust systems.

He co-founded the Open Guide to AWS, which is one of the most popular AWS resources and communities on the web.

At Twitter, he built infrastructure that allows engineers to seamlessly deploy and run their applications across private data centers and public cloud environments. He previously led a team at TellApart (acquired by Twitter) that built an internal platform-as-a-service (Docker, Apache Aurora, Mesos on AWS) in support of a migration from a monolithic application architecture to a microservice-based architecture. Before TellApart, he co-founded AWS-hosted AdStack (acquired by TellApart) in order to automatically personalize and improve the quality of content in marketing emails and email newsletters.

 

 

Rob Gruhl

Rob is a senior engineering manager located in Seattle, WA. He supports a team of talented engineers at Nordstrom Technology exploring and deploying a variety of serverless systems to production.

From the beginning of the serverless era, Rob has been exclusively using serverless architectures to allow a small team of engineers to deliver incredible solutions that scale effortlessly and wake them in the middle of the night rarely. In addition to a number of production services, together with his team Rob has created and released two major open source projects and accompanying open source workshops using a 100% serverless approach. He’d love to talk with you about serverless, event-sourcing, and/or occasionally-connected distributed data layers.

 

Feel free to follow these great AWS Heroes on Twitter and check out their blogs. It is exciting to have them all join the AWS Community Heroes program.

–  Tara

Visualize and Monitor Amazon EC2 Events with Amazon CloudWatch Events and Amazon Kinesis Firehose

Post Syndicated from Karan Desai original https://aws.amazon.com/blogs/big-data/visualize-and-monitor-amazon-ec2-events-with-amazon-cloudwatch-events-and-amazon-kinesis-firehose/

Monitoring your AWS environment is important for security, performance, and cost control purposes. For example, by monitoring and analyzing API calls made to your Amazon EC2 instances, you can trace security incidents and gain insights into administrative behaviors and access patterns. The kinds of events you might monitor include console logins, Amazon EBS snapshot creation/deletion/modification, VPC creation/deletion/modification, and instance reboots, etc.

In this post, I show you how to build a near real-time API monitoring solution for EC2 events using Amazon CloudWatch Events and Amazon Kinesis Firehose. Please be sure to have Amazon CloudTrail enabled in your account.

  • CloudWatch Events offers a near real-time stream of system events that describe changes in AWS resources. CloudWatch Events now supports Kinesis Firehose as a target.
  • Kinesis Firehose is a fully managed service for continuously capturing, transforming, and delivering data in minutes to storage and analytics destinations such as Amazon S3, Amazon Kinesis Analytics, Amazon Redshift, and Amazon Elasticsearch Service.

Walkthrough

For this walkthrough, you create a CloudWatch event rule that matches specific EC2 events such as:

  • Starting, stopping, and terminating an instance
  • Creating and deleting VPC route tables
  • Creating and deleting a security group
  • Creating, deleting, and modifying instance volumes and snapshots

Your CloudWatch event target is a Kinesis Firehose delivery stream that delivers this data to an Elasticsearch cluster, where you set up Kibana for visualization. Using this solution, you can easily load and visualize EC2 events in minutes without setting up complicated data pipelines.

Set up the Elasticsearch cluster

Create the Amazon ES domain in the Amazon ES console, or by using the create-elasticsearch-domain command in the AWS CLI.

This example uses the following configuration:

  • Domain Name: esLogSearch
  • Elasticsearch Version: 1
  • Instance Count: 2
  • Instance type:elasticsearch
  • Enable dedicated master: true
  • Enable zone awareness: true
  • Restrict Amazon ES to an IP-based access policy

Other settings are left as the defaults.

Create a Kinesis Firehose delivery stream

In the Kinesis Firehose console, create a new delivery stream with Amazon ES as the destination. For detailed steps, see Create a Kinesis Firehose Delivery Stream to Amazon Elasticsearch Service.

Set up CloudWatch Events

Create a rule, and configure the event source and target. You can choose to configure multiple event sources with several AWS resources, along with options to specify specific or multiple event types.

In the CloudWatch console, choose Events.

For Service Name, choose EC2.

In Event Pattern Preview, choose Edit and copy the pattern below. For this walkthrough, I selected events that are specific to the EC2 API, but you can modify it to include events for any of your AWS resources.

 

{
	"source": [
		"aws.ec2"
	],
	"detail-type": [
		"AWS API Call via CloudTrail"
	],
	"detail": {
		"eventSource": [
			"ec2.amazonaws.com"
		],
		"eventName": [
			"RunInstances",
			"StopInstances",
			"StartInstances",
			"CreateFlowLogs",
			"CreateImage",
			"CreateNatGateway",
			"CreateVpc",
			"DeleteKeyPair",
			"DeleteNatGateway",
			"DeleteRoute",
			"DeleteRouteTable",
"CreateSnapshot",
"DeleteSnapshot",
			"DeleteVpc",
			"DeleteVpcEndpoints",
			"DeleteSecurityGroup",
			"ModifyVolume",
			"ModifyVpcEndpoint",
			"TerminateInstances"
		]
	}
}

The following screenshot shows what your event looks like in the console.

Next, choose Add target and select the delivery stream that you just created.

Set up Kibana on the Elasticsearch cluster

Amazon ES provides a default installation of Kibana with every Amazon ES domain. You can find the Kibana endpoint on your domain dashboard in the Amazon ES console. You can restrict Amazon ES access to an IP-based access policy.

In the Kibana console, for Index name or pattern, type log. This is the name of the Elasticsearch index.

For Time-field name, choose @time.

To view the events, choose Discover.

The following chart demonstrates the API operations and the number of times that they have been triggered in the past 12 hours.

Summary

In this post, you created a continuous, near real-time solution to monitor various EC2 events such as starting and shutting down instances, creating VPCs, etc. Likewise, you can build a continuous monitoring solution for all the API operations that are relevant to your daily AWS operations and resources.

With Kinesis Firehose as a new target for CloudWatch Events, you can retrieve, transform, and load system events to the storage and analytics destination of your choice in minutes, without setting up complicated data pipelines.

If you have any questions or suggestions, please comment below.


Additional Reading

Learn how to build a serverless architecture to analyze Amazon CloudFront access logs using AWS Lambda, Amazon Athena, and Amazon Kinesis Analytics

 

 

 

AWS Online Tech Talks – June 2017

Post Syndicated from Tara Walker original https://aws.amazon.com/blogs/aws/aws-online-tech-talks-june-2017/

As the sixth month of the year, June is significant in that it is not only my birth month (very special), but it contains the summer solstice in the Northern Hemisphere, the day with the most daylight hours, and the winter solstice in the Southern Hemisphere, the day with the fewest daylight hours. In the United States, June is also the month in which we celebrate our dads with Father’s Day and have month-long celebrations of music, heritage, and the great outdoors.

Therefore, the month of June can be filled with lots of excitement. So why not add even more delight to the month, by enhancing your cloud computing skills. This month’s AWS Online Tech Talks features sessions on Artificial Intelligence (AI), Storage, Big Data, and Compute among other great topics.

June 2017 – Schedule

Noted below are the upcoming scheduled live, online technical sessions being held during the month of June. Make sure to register ahead of time so you won’t miss out on these free talks conducted by AWS subject matter experts. All schedule times for the online tech talks are shown in the Pacific Time (PDT) time zone.

Webinars featured this month are:

Thursday, June 1

Storage

9:00 AM – 10:00 AM: Deep Dive on Amazon Elastic File System

Big Data

10:30 AM – 11:30 AM: Migrating Big Data Workloads to Amazon EMR

Serverless

12:00 Noon – 1:00 PM: Building AWS Lambda Applications with the AWS Serverless Application Model (AWS SAM)

 

Monday, June 5

Artificial Intelligence

9:00 AM – 9:40 AM: Exploring the Business Use Cases for Amazon Lex

 

Tuesday, June 6

Management Tools

9:00 AM – 9:40 AM: Automated Compliance and Governance with AWS Config and AWS CloudTrail

 

Wednesday, June 7

Storage

9:00 AM – 9:40 AM: Backing up Amazon EC2 with Amazon EBS Snapshots

Big Data

10:30 AM – 11:10 AM: Intro to Amazon Redshift Spectrum: Quickly Query Exabytes of Data in S3

DevOps

12:00 Noon – 12:40 PM: Introduction to AWS CodeStar: Quickly Develop, Build, and Deploy Applications on AWS

 

Thursday, June 8

Artificial Intelligence

9:00 AM – 9:40 AM: Exploring the Business Use Cases for Amazon Polly

10:30 AM – 11:10 AM: Exploring the Business Use Cases for Amazon Rekognition

 

Monday, June 12

Artificial Intelligence

9:00 AM – 9:40 AM: Exploring the Business Use Cases for Amazon Machine Learning

 

Tuesday, June 13

Compute

9:00 AM – 9:40 AM: DevOps with Visual Studio, .NET and AWS

IoT

10:30 AM – 11:10 AM: Create, with Intel, an IoT Gateway and Establish a Data Pipeline to AWS IoT

Big Data

12:00 Noon – 12:40 PM: Real-Time Log Analytics using Amazon Kinesis and Amazon Elasticsearch Service

 

Wednesday, June 14

Containers

9:00 AM – 9:40 AM: Batch Processing with Containers on AWS

Security & Identity

12:00 Noon – 12:40 PM: Using Microsoft Active Directory across On-premises and Cloud Workloads

 

Thursday, June 15

Big Data

12:00 Noon – 1:00 PM: Building Big Data Applications with Serverless Architectures

 

Monday, June 19

Artificial Intelligence

9:00 AM – 9:40 AM: Deep Learning for Data Scientists: Using Apache MxNet and R on AWS

 

Tuesday, June 20

Storage

9:00 AM – 9:40 AM: Cloud Backup & Recovery Options with AWS Partner Solutions

Artificial Intelligence

10:30 AM – 11:10 AM: An Overview of AI on the AWS Platform

 

The AWS Online Tech Talks series covers a broad range of topics at varying technical levels. These sessions feature live demonstrations & customer examples led by AWS engineers and Solution Architects. Check out the AWS YouTube channel for more on-demand webinars on AWS technologies.

Tara

Build a Serverless Architecture to Analyze Amazon CloudFront Access Logs Using AWS Lambda, Amazon Athena, and Amazon Kinesis Analytics

Post Syndicated from Rajeev Srinivasan original https://aws.amazon.com/blogs/big-data/build-a-serverless-architecture-to-analyze-amazon-cloudfront-access-logs-using-aws-lambda-amazon-athena-and-amazon-kinesis-analytics/

Nowadays, it’s common for a web server to be fronted by a global content delivery service, like Amazon CloudFront. This type of front end accelerates delivery of websites, APIs, media content, and other web assets to provide a better experience to users across the globe.

The insights gained by analysis of Amazon CloudFront access logs helps improve website availability through bot detection and mitigation, optimizing web content based on the devices and browser used to view your webpages, reducing perceived latency by caching of popular object closer to its viewer, and so on. This results in a significant improvement in the overall perceived experience for the user.

This blog post provides a way to build a serverless architecture to generate some of these insights. To do so, we analyze Amazon CloudFront access logs both at rest and in transit through the stream. This serverless architecture uses Amazon Athena to analyze large volumes of CloudFront access logs (on the scale of terabytes per day), and Amazon Kinesis Analytics for streaming analysis.

The analytic queries in this blog post focus on three common use cases:

  1. Detection of common bots using the user agent string
  2. Calculation of current bandwidth usage per Amazon CloudFront distribution per edge location
  3. Determination of the current top 50 viewers

However, you can easily extend the architecture described to power dashboards for monitoring, reporting, and trigger alarms based on deeper insights gained by processing and analyzing the logs. Some examples are dashboards for cache performance, usage and viewer patterns, and so on.

Following we show a diagram of this architecture.

Prerequisites

Before you set up this architecture, install the AWS Command Line Interface (AWS CLI) tool on your local machine, if you don’t have it already.

Setup summary

The following steps are involved in setting up the serverless architecture on the AWS platform:

  1. Create an Amazon S3 bucket for your Amazon CloudFront access logs to be delivered to and stored in.
  2. Create a second Amazon S3 bucket to receive processed logs and store the partitioned data for interactive analysis.
  3. Create an Amazon Kinesis Firehose delivery stream to batch, compress, and deliver the preprocessed logs for analysis.
  4. Create an AWS Lambda function to preprocess the logs for analysis.
  5. Configure Amazon S3 event notification on the CloudFront access logs bucket, which contains the raw logs, to trigger the Lambda preprocessing function.
  6. Create an Amazon DynamoDB table to look up partition details, such as partition specification and partition location.
  7. Create an Amazon Athena table for interactive analysis.
  8. Create a second AWS Lambda function to add new partitions to the Athena table based on the log delivered to the processed logs bucket.
  9. Configure Amazon S3 event notification on the processed logs bucket to trigger the Lambda partitioning function.
  10. Configure Amazon Kinesis Analytics application for analysis of the logs directly from the stream.

ETL and preprocessing

In this section, we parse the CloudFront access logs as they are delivered, which occurs multiple times in an hour. We filter out commented records and use the user agent string to decipher the browser name, the name of the operating system, and whether the request has been made by a bot. For more details on how to decipher the preceding information based on the user agent string, see user-agents 1.1.0 in the Python documentation.

We use the Lambda preprocessing function to perform these tasks on individual rows of the access log. On successful completion, the rows are pushed to an Amazon Kinesis Firehose delivery stream to be persistently stored in an Amazon S3 bucket, the processed logs bucket.

To create a Firehose delivery stream with a new or existing S3 bucket as the destination, follow the steps described in Create a Firehose Delivery Stream to Amazon S3 in the S3 documentation. Keep most of the default settings, but select an AWS Identity and Access Management (IAM) role that has write access to your S3 bucket and specify GZIP compression. Name the delivery stream CloudFrontLogsToS3.

Another pre-requisite for this setup is to create an IAM role that provides the necessary permissions our AWS Lambda function to get the data from S3, process it, and deliver it to the CloudFrontLogsToS3 delivery stream.

Let’s use the AWS CLI to create the IAM role using the following the steps:

  1. Create the IAM policy (lambda-exec-policy) for the Lambda execution role to use.
  2. Create the Lambda execution role (lambda-cflogs-exec-role) and assign the service to use this role.
  3. Attach the policy created in step 1 to the Lambda execution role.

To download the policy document to your local machine, type the following command.

aws s3 cp s3://aws-bigdata-blog/artifacts/Serverless-CF-Analysis/preprocessiong-lambda/lambda-exec-policy.json  <path_on_your_local_machine>

To download the assume policy document to your local machine, type the following command.

aws s3 cp s3://aws-bigdata-blog/artifacts/Serverless-CF-Analysis/preprocessiong-lambda/assume-lambda-policy.json  <path_on_your_local_machine>

Following is the lambda-exec-policy.json file, which is the IAM policy used by the Lambda execution role.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "CloudWatchAccess",
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:*:*:*"
        },
        {
            "Sid": "S3Access",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::*"
            ]
        },
        {
            "Sid": "FirehoseAccess",
            "Effect": "Allow",
            "Action": [
                "firehose:ListDeliveryStreams",
                "firehose:PutRecord",
                "firehose:PutRecordBatch"
            ],
            "Resource": [
                "arn:aws:firehose:*:*:deliverystream/CloudFrontLogsToS3"
            ]
        }
    ]
}

To create the IAM policy used by Lambda execution role, type the following command.

aws iam create-policy --policy-name lambda-exec-policy --policy-document file://<path>/lambda-exec-policy.json

To create the AWS Lambda execution role and assign the service to use this role, type the following command.

aws iam create-role --role-name lambda-cflogs-exec-role --assume-role-policy-document file://<path>/assume-lambda-policy.json

Following is the assume-lambda-policy.json file, to grant Lambda permission to assume a role.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "lambda.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

To attach the policy (lambda-exec-policy) created to the AWS Lambda execution role (lambda-cflogs-exec-role), type the following command.

aws iam attach-role-policy --role-name lambda-cflogs-exec-role --policy-arn arn:aws:iam::<your-account-id>:policy/lambda-exec-policy

Now that we have created the CloudFrontLogsToS3 Firehose delivery stream and the lambda-cflogs-exec-role IAM role for Lambda, the next step is to create a Lambda preprocessing function.

This Lambda preprocessing function parses the CloudFront access logs delivered into the S3 bucket and performs a few transformation and mapping operations on the data. The Lambda function adds descriptive information, such as the browser and the operating system that were used to make this request based on the user agent string found in the logs. The Lambda function also adds information about the web distribution to support scenarios where CloudFront access logs are delivered to a centralized S3 bucket from multiple distributions. With the solution in this blog post, you can get insights across distributions and their edge locations.

Use the Lambda Management Console to create a new Lambda function with a Python 2.7 runtime and the s3-get-object-python blueprint. Open the console, and on the Configure triggers page, choose the name of the S3 bucket where the CloudFront access logs are delivered. Choose Put for Event type. For Prefix, type the name of the prefix, if any, for the folder where CloudFront access logs are delivered, for example cloudfront-logs/. To invoke Lambda to retrieve the logs from the S3 bucket as they are delivered, select Enable trigger.

Choose Next and provide a function name to identify this Lambda preprocessing function.

For Code entry type, choose Upload a file from Amazon S3. For S3 link URL, type https.amazonaws.com//preprocessing-lambda/pre-data.zip. In the section, also create an environment variable with the key KINESIS_FIREHOSE_STREAM and a value with the name of the Firehose delivery stream as CloudFrontLogsToS3.

Choose lambda-cflogs-exec-role as the IAM role for the Lambda function, and type prep-data.lambda_handler for the value for Handler.

Choose Next, and then choose Create Lambda.

Table creation in Amazon Athena

In this step, we will build the Athena table. Use the Athena console in the same region and create the table using the query editor.

CREATE EXTERNAL TABLE IF NOT EXISTS cf_logs (
  logdate date,
  logtime string,
  location string,
  bytes bigint,
  requestip string,
  method string,
  host string,
  uri string,
  status bigint,
  referrer string,
  useragent string,
  uriquery string,
  cookie string,
  resulttype string,
  requestid string,
  header string,
  csprotocol string,
  csbytes string,
  timetaken bigint,
  forwardedfor string,
  sslprotocol string,
  sslcipher string,
  responseresulttype string,
  protocolversion string,
  browserfamily string,
  osfamily string,
  isbot string,
  filename string,
  distribution string
)
PARTITIONED BY(year string, month string, day string, hour string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://<pre-processing-log-bucket>/prefix/';

Creation of the Athena partition

A popular website with millions of requests each day routed using Amazon CloudFront can generate a large volume of logs, on the order of a few terabytes a day. We strongly recommend that you partition your data to effectively restrict the amount of data scanned by each query. Partitioning significantly improves query performance and substantially reduces cost. The Lambda partitioning function adds the partition information to the Athena table for the data delivered to the preprocessed logs bucket.

Before delivering the preprocessed Amazon CloudFront logs file into the preprocessed logs bucket, Amazon Kinesis Firehose adds a UTC time prefix in the format YYYY/MM/DD/HH. This approach supports multilevel partitioning of the data by year, month, date, and hour. You can invoke the Lambda partitioning function every time a new processed Amazon CloudFront log is delivered to the preprocessed logs bucket. To do so, configure the Lambda partitioning function to be triggered by an S3 Put event.

For a website with millions of requests, a large number of preprocessed logs can be delivered multiple times in an hour—for example, at the interval of one each second. To avoid querying the Athena table for partition information every time a preprocessed log file is delivered, you can create an Amazon DynamoDB table for fast lookup.

Based on the year, month, data and hour in the prefix of the delivered log, the Lambda partitioning function checks if the partition specification exists in the Amazon DynamoDB table. If it doesn’t, it’s added to the table using an atomic operation, and then the Athena table is updated.

Type the following command to create the Amazon DynamoDB table.

aws dynamodb create-table --table-name athenapartitiondetails \
--attribute-definitions AttributeName=PartitionSpec,AttributeType=S \
--key-schema AttributeName=PartitionSpec,KeyType=HASH \
--provisioned-throughput ReadCapacityUnits=100,WriteCapacityUnits=100

Here the following is true:

  • PartitionSpec is the hash key and is a representation of the partition signature—for example, year=”2017”; month=”05”; day=”15”; hour=”10”.
  • Depending on the rate at which the processed log files are delivered to the processed log bucket, you might have to increase the ReadCapacityUnits and WriteCapacityUnits values, if these are throttled.

The other attributes besides PartitionSpec are the following:

  • PartitionPath – The S3 path associated with the partition.
  • PartitionType – The type of partition used (Hour, Month, Date, Year, or ALL). In this case, ALL is used.

Next step is to create the IAM role to provide permissions for the Lambda partitioning function. You require permissions to do the following:

  1. Look up and write partition information to DynamoDB.
  2. Alter the Athena table with new partition information.
  3. Perform Amazon CloudWatch logs operations.
  4. Perform Amazon S3 operations.

To download the policy document to your local machine, type following command.

aws s3 cp s3://aws-bigdata-blog/artifacts/Serverless-CF-Analysis/partitioning-lambda/lambda-partition-function-execution-policy.json  <path_on_your_local_machine>

To download the assume policy document to your local machine, type the following command.

aws s3 cp s3://aws-bigdata-blog/artifacts/Serverless-CF-Analysis/partitioning-lambda/assume-lambda-policy.json <path_on_your_local_machine>

To create the Lambda execution role and assign the service to use this role, type the following command.

aws iam create-role --role-name lambda-cflogs-exec-role --assume-role-policy-document file://<path>/assume-lambda-policy.json

Let’s use the AWS CLI to create the IAM role using the following three steps:

  1. Create the IAM policy(lambda-partition-exec-policy) used by the Lambda execution role.
  2. Create the Lambda execution role (lambda-partition-execution-role)and assign the service to use this role.
  3. Attach the policy created in step 1 to the Lambda execution role.

To create the IAM policy used by Lambda execution role, type the following command.

aws iam create-policy --policy-name lambda-partition-exec-policy --policy-document file://<path>/lambda-partition-function-execution-policy.json

To create the Lambda execution role and assign the service to use this role, type the following command.

aws iam create-role --role-name lambda-partition-execution-role --assume-role-policy-document file://<path>/assume-lambda-policy.json

To attach the policy (lambda-partition-exec-policy) created to the AWS Lambda execution role (lambda-partition-execution-role), type the following command.

aws iam attach-role-policy --role-name lambda-partition-execution-role --policy-arn arn:aws:iam::<your-account-id>:policy/lambda-partition-exec-policy

Following is the lambda-partition-function-execution-policy.json file, which is the IAM policy used by the Lambda execution role.

{
    "Version": "2012-10-17",
    "Statement": [
      	{
            	"Sid": "DDBTableAccess",
            	"Effect": "Allow",
            	"Action": "dynamodb:PutItem"
            	"Resource": "arn:aws:dynamodb*:*:table/athenapartitiondetails"
        	},
        	{
            	"Sid": "S3Access",
            	"Effect": "Allow",
            	"Action": [
                		"s3:GetBucketLocation",
                		"s3:GetObject",
                		"s3:ListBucket",
                		"s3:ListBucketMultipartUploads",
                		"s3:ListMultipartUploadParts",
                		"s3:AbortMultipartUpload",
                		"s3:PutObject"
            	],
          		"Resource":"arn:aws:s3:::*"
		},
	              {
		      "Sid": "AthenaAccess",
      		"Effect": "Allow",
      		"Action": [ "athena:*" ],
      		"Resource": [ "*" ]
	      },
        	{
            	"Sid": "CloudWatchLogsAccess",
            	"Effect": "Allow",
            	"Action": [
                		"logs:CreateLogGroup",
                		"logs:CreateLogStream",
             	   	"logs:PutLogEvents"
            	],
            	"Resource": "arn:aws:logs:*:*:*"
        	}
    ]
}

Download the .jar file containing the Java deployment package to your local machine.

aws s3 cp s3://aws-bigdata-blog/artifacts/Serverless-CF-Analysis/partitioning-lambda/aws-lambda-athena-1.0.0.jar <path_on_your_local_machine>

From the AWS Management Console, create a new Lambda function with Java8 as the runtime. Select the Blank Function blueprint.

On the Configure triggers page, choose the name of the S3 bucket where the preprocessed logs are delivered. Choose Put for the Event Type. For Prefix, type the name of the prefix folder, if any, where preprocessed logs are delivered by Firehose—for example, out/. For Suffix, type the name of the compression format that the Firehose stream (CloudFrontLogToS3) delivers the preprocessed logs —for example, gz. To invoke Lambda to retrieve the logs from the S3 bucket as they are delivered, select Enable Trigger.

Choose Next and provide a function name to identify this Lambda partitioning function.

Choose Java8 for Runtime for the AWS Lambda function. Choose Upload a .ZIP or .JAR file for the Code entry type, and choose Upload to upload the downloaded aws-lambda-athena-1.0.0.jar file.

Next, create the following environment variables for the Lambda function:

  • TABLE_NAME – The name of the Athena table (for example, cf_logs).
  • PARTITION_TYPE – The partition to be created based on the Athena table for the logs delivered to the sub folders in S3 bucket based on Year, Month, Date, Hour, or Set this to ALL to use Year, Month, Date, and Hour.
  • DDB_TABLE_NAME – The name of the DynamoDB table holding partition information (for example, athenapartitiondetails).
  • ATHENA_REGION – The current AWS Region for the Athena table to construct the JDBC connection string.
  • S3_STAGING_DIR – The Amazon S3 location where your query output is written. The JDBC driver asks Athena to read the results and provide rows of data back to the user (for example, s3://<bucketname>/<folder>/).

To configure the function handler and IAM, for Handler copy and paste the name of the handler: com.amazonaws.services.lambda.CreateAthenaPartitionsBasedOnS3EventWithDDB::handleRequest. Choose the existing IAM role, lambda-partition-execution-role.

Choose Next and then Create Lambda.

Interactive analysis using Amazon Athena

In this section, we analyze the historical data that’s been collected since we added the partitions to the Amazon Athena table for data delivered to the preprocessing logs bucket.

Scenario 1 is robot traffic by edge location.

SELECT COUNT(*) AS ct, requestip, location FROM cf_logs
WHERE isbot='True'
GROUP BY requestip, location
ORDER BY ct DESC;

Scenario 2 is total bytes transferred per distribution for each edge location for your website.

SELECT distribution, location, SUM(bytes) as totalBytes
FROM cf_logs
GROUP BY location, distribution;

Scenario 3 is the top 50 viewers of your website.

SELECT requestip, COUNT(*) AS ct  FROM cf_logs
GROUP BY requestip
ORDER BY ct DESC;

Streaming analysis using Amazon Kinesis Analytics

In this section, you deploy a stream processing application using Amazon Kinesis Analytics to analyze the preprocessed Amazon CloudFront log streams. This application analyzes directly from the Amazon Kinesis Stream as it is delivered to the preprocessing logs bucket. The stream queries in section are focused on gaining the following insights:

  • The IP address of the bot, identified by its Amazon CloudFront edge location, that is currently sending requests to your website. The query also includes the total bytes transferred as part of the response.
  • The total bytes served per distribution per population for your website.
  • The top 10 viewers of your website.

To download the firehose-access-policy.json file, type the following.

aws s3 cp s3://aws-bigdata-blog/artifacts/Serverless-CF-Analysis/kinesisanalytics/firehose-access-policy.json  <path_on_your_local_machine>

To download the kinesisanalytics-policy.json file, type the following.

aws s3 cp s3://aws-bigdata-blog/artifacts/Serverless-CF-Analysis/kinesisanalytics/assume-kinesisanalytics-policy.json <path_on_your_local_machine>

Before we create the Amazon Kinesis Analytics application, we need to create the IAM role to provide permission for the analytics application to access Amazon Kinesis Firehose stream.

Let’s use the AWS CLI to create the IAM role using the following three steps:

  1. Create the IAM policy(firehose-access-policy) for the Lambda execution role to use.
  2. Create the Lambda execution role (ka-execution-role) and assign the service to use this role.
  3. Attach the policy created in step 1 to the Lambda execution role.

Following is the firehose-access-policy.json file, which is the IAM policy used by Kinesis Analytics to read Firehose delivery stream.

{
    "Version": "2012-10-17",
    "Statement": [
      	{
    	"Sid": "AmazonFirehoseAccess",
    	"Effect": "Allow",
    	"Action": [
       	"firehose:DescribeDeliveryStream",
        	"firehose:Get*"
    	],
    	"Resource": [
              "arn:aws:firehose:*:*:deliverystream/CloudFrontLogsToS3”
       ]
     }
}

Following is the assume-kinesisanalytics-policy.json file, to grant Amazon Kinesis Analytics permissions to assume a role.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "kinesisanalytics.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

To create the IAM policy used by Analytics access role, type the following command.

aws iam create-policy --policy-name firehose-access-policy --policy-document file://<path>/firehose-access-policy.json

To create the Analytics execution role and assign the service to use this role, type the following command.

aws iam attach-role-policy --role-name ka-execution-role --policy-arn arn:aws:iam::<your-account-id>:policy/firehose-access-policy

To attach the policy (irehose-access-policy) created to the Analytics execution role (ka-execution-role), type the following command.

aws iam attach-role-policy --role-name ka-execution-role --policy-arn arn:aws:iam::<your-account-id>:policy/firehose-access-policy

To deploy the Analytics application, first download the configuration file and then modify ResourceARN and RoleARN for the Amazon Kinesis Firehose input configuration.

"KinesisFirehoseInput": { 
    "ResourceARN": "arn:aws:firehose:<region>:<account-id>:deliverystream/CloudFrontLogsToS3", 
    "RoleARN": "arn:aws:iam:<account-id>:role/ka-execution-role"
}

To download the Analytics application configuration file, type the following command.

aws s3 cp s3://aws-bigdata-blog/artifacts/Serverless-CF-Analysis//kinesisanalytics/kinesis-analytics-app-configuration.json <path_on_your_local_machine>

To deploy the application, type the following command.

aws kinesisanalytics create-application --application-name "cf-log-analysis" --cli-input-json file://<path>/kinesis-analytics-app-configuration.json

To start the application, type the following command.

aws kinesisanalytics start-application --application-name "cf-log-analysis" --input-configuration Id="1.1",InputStartingPositionConfiguration={InputStartingPosition="NOW"}

SQL queries using Amazon Kinesis Analytics

Scenario 1 is a query for detecting bots for sending request to your website detection for your website.

-- Create output stream, which can be used to send to a destination
CREATE OR REPLACE STREAM "BOT_DETECTION" (requesttime TIME, destribution VARCHAR(16), requestip VARCHAR(64), edgelocation VARCHAR(64), totalBytes BIGINT);
-- Create pump to insert into output 
CREATE OR REPLACE PUMP "BOT_DETECTION_PUMP" AS INSERT INTO "BOT_DETECTION"
--
SELECT STREAM 
    STEP("CF_LOG_STREAM_001"."request_time" BY INTERVAL '1' SECOND) as requesttime,
    "distribution_name" as distribution,
    "request_ip" as requestip, 
    "edge_location" as edgelocation, 
    SUM("bytes") as totalBytes
FROM "CF_LOG_STREAM_001"
WHERE "is_bot" = true
GROUP BY "request_ip", "edge_location", "distribution_name",
STEP("CF_LOG_STREAM_001"."request_time" BY INTERVAL '1' SECOND),
STEP("CF_LOG_STREAM_001".ROWTIME BY INTERVAL '1' SECOND);

Scenario 2 is a query for total bytes transferred per distribution for each edge location for your website.

-- Create output stream, which can be used to send to a destination
CREATE OR REPLACE STREAM "BYTES_TRANSFFERED" (requesttime TIME, destribution VARCHAR(16), edgelocation VARCHAR(64), totalBytes BIGINT);
-- Create pump to insert into output 
CREATE OR REPLACE PUMP "BYTES_TRANSFFERED_PUMP" AS INSERT INTO "BYTES_TRANSFFERED"
-- Bytes Transffered per second per web destribution by edge location
SELECT STREAM 
    STEP("CF_LOG_STREAM_001"."request_time" BY INTERVAL '1' SECOND) as requesttime,
    "distribution_name" as distribution,
    "edge_location" as edgelocation, 
    SUM("bytes") as totalBytes
FROM "CF_LOG_STREAM_001"
GROUP BY "distribution_name", "edge_location", "request_date",
STEP("CF_LOG_STREAM_001"."request_time" BY INTERVAL '1' SECOND),
STEP("CF_LOG_STREAM_001".ROWTIME BY INTERVAL '1' SECOND);

Scenario 3 is a query for the top 50 viewers for your website.

-- Create output stream, which can be used to send to a destination
CREATE OR REPLACE STREAM "TOP_TALKERS" (requestip VARCHAR(64), requestcount DOUBLE);
-- Create pump to insert into output 
CREATE OR REPLACE PUMP "TOP_TALKERS_PUMP" AS INSERT INTO "TOP_TALKERS"
-- Top Ten Talker
SELECT STREAM ITEM as requestip, ITEM_COUNT as requestcount FROM TABLE(TOP_K_ITEMS_TUMBLING(
  CURSOR(SELECT STREAM * FROM "CF_LOG_STREAM_001"),
  'request_ip', -- name of column in single quotes
  50, -- number of top items
  60 -- tumbling window size in seconds
  )
);

Conclusion

Following the steps in this blog post, you just built an end-to-end serverless architecture to analyze Amazon CloudFront access logs. You analyzed these both in interactive and streaming mode, using Amazon Athena and Amazon Kinesis Analytics respectively.

By creating a partition in Athena for the logs delivered to a centralized bucket, this architecture is optimized for performance and cost when analyzing large volumes of logs for popular websites that receive millions of requests. Here, we have focused on just three common use cases for analysis, sharing the analytic queries as part of the post. However, you can extend this architecture to gain deeper insights and generate usage reports to reduce latency and increase availability. This way, you can provide a better experience on your websites fronted with Amazon CloudFront.

In this blog post, we focused on building serverless architecture to analyze Amazon CloudFront access logs. Our plan is to extend the solution to provide rich visualization as part of our next blog post.


About the Authors

Rajeev Srinivasan is a Senior Solution Architect for AWS. He works very close with our customers to provide big data and NoSQL solution leveraging the AWS platform and enjoys coding . In his spare time he enjoys riding his motorcycle and reading books.

 

Sai Sriparasa is a consultant with AWS Professional Services. He works with our customers to provide strategic and tactical big data solutions with an emphasis on automation, operations & security on AWS. In his spare time, he follows sports and current affairs.

 

 


Related

Analyzing VPC Flow Logs with Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight

ServerlessConf and More!

Post Syndicated from Bryan Liston original https://aws.amazon.com/blogs/compute/serverless-conference-and-more/

ServerlessConf Austin

ServerlessConf Austin is just around the corner! April 26-28th come join us in Austin at the Zach Topfer Theater. Our very own Tim Wagner, Chris Munns and Randall Hunt will be giving some great talks.

Serverlessconf is a community led conference focused on sharing experiences building applications using serverless architectures. Serverless architectures enable developers to express their creativity and focus on user needs instead of spending time managing infrastructure and servers.

Tim Wagner, GM Serverless Applications, will be giving a keynote on Friday the 28th, do not miss this!!!
Chris Munns, Sr. Developer Advocate, will be giving an excellent talk on CI/CD for Serverless Applications.

Check out the full agenda here!

AWS Serverless Updates and More!

Incase you’ve missed out lately on some of our new content such as our new YouTube series “Coding with Sam”, or our new Serverless Special AWS Podcast Series, check them out!

Meet SAM!

We’ve recently come out with a new branding for AWS SAM (Serverless Application Model), so please join me in welcoming SAM the Squirrel!

The goal of AWS SAM is to define a standard application model for serverless applications.

Once again, don’t hesitate to reach out if you have questions, comments, or general feedback.

Thanks,
@listonb

AWS Big Data Blog Month in Review: March 2017

Post Syndicated from Derek Young original https://aws.amazon.com/blogs/big-data/aws-big-data-blog-month-in-review-march-2017/

Another month of big data solutions on the Big Data Blog. Please take a look at our summaries below and learn, comment, and share. Thank you for reading!

Analyze Security, Compliance, and Operational Activity Using AWS CloudTrail and Amazon Athena
In this blog post, walk through how to set up and use the recently released Amazon Athena CloudTrail SerDe to query CloudTrail log files for EC2 security group modifications, console sign-in activity, and operational account activity.  

Big Updates to the Big Data on AWS Training Course!
AWS offers a range of training resources to help you advance your knowledge with practical skills so you can get more out of the cloud. We’ve updated Big Data on AWS, a three-day, instructor-led training course to keep pace with the latest AWS big data innovations. This course allows you to hear big data best practices from an expert, get answers to your questions in person, and get hands-on practice using AWS big data services. 

Analyzing VPC Flow Logs with Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight
In this blog post, build a serverless architecture using Amazon Kinesis Firehose, AWS Lambda, Amazon S3, Amazon Athena, and Amazon QuickSight to collect, store, query, and visualize flow logs. In building this solution, you also learn how to implement Athena best practices with regard to compressing and partitioning data so as to reduce query latencies and drive down query costs. 

Amazon Redshift Monitoring Now Supports End User Queries and Canaries
The serverless Amazon Redshift Monitoring utility lets you gather important performance metrics from your Redshift cluster’s system tables and persists the results in Amazon CloudWatch. You can now create your own diagnostic queries and plug-in “canaries” that monitor the runtime of your most vital end user queries. These user-defined metrics can be used to create dashboards and trigger Alarms and should improve visibility into workloads running on a Cluster.  

Running R on Amazon Athena
In this blog post, connect R/RStudio running on an Amazon EC2 instance with Athena. You’ll learn to build a simple interactive application with Athena and R. Athena can be used to store and query the underlying data for your big data applications using standard SQL, while R can be used to interactively query Athena and generate analytical insights using the powerful set of libraries that R provides. This post has been translated into Japanese. 

Top 10 Performance Tuning Tips for Amazon Athena
In this blog post, we review the top 10 tips that can improve query performance. We focus on aspects related to storing data in Amazon S3 and tuning specific to queries. Amazon Athena uses Presto to run SQL queries and hence some of the advice will work if you are running Presto on Amazon EMR. This post has been translated into Japanese. 

Big Data Resources on the AWS Knowledge Center
The AWS Knowledge Center answers the questions we receive most frequently from AWS customers. It is a resource for you that is distinct from AWS Documentation, the AWS Discussion Forums, and the AWS Support Center. It covers questions from across every AWS service. This post is an introduction to Big Data resources on the AWS Knowledge Center. 

Encrypt and Decrypt Amazon Kinesis Records Using AWS KMS
In this bog post, learn to build encryption and decryption into sample Kinesis producer and consumer applications using the Amazon Kinesis Producer Library (KPL), the Amazon Kinesis Consumer Library (KCL), AWS KMS, and the aws-encryption-sdk. The methods and the techniques used in this post to encrypt and decrypt Kinesis records can be easily replicated into your architecture.

Want to learn more about Big Data or Streaming Data? Check out our Big Data and Streaming data educational pages.

Leave a comment below to let us know what big data topics you’d like to see next on the AWS Big Data Blog.

A Serverless Authentication System by Jumia

Post Syndicated from Bryan Liston original https://aws.amazon.com/blogs/compute/a-serverless-authentication-system-by-jumia/

Jumia Group is an ecosystem of nine different companies operating in 22 different countries in Africa. Jumia employs 3000 people and serves 15 million users/month.

Want to secure and centralize millions of user accounts across Africa? Shut down your servers! Jumia Group unified and centralized customer authentication on nine digital services platforms, operating in 22 (and counting) countries in Africa, totaling over 120 customer and merchant facing applications. All were unified into a custom Jumia Central Authentication System (JCAS), built in a timely fashion and designed using a serverless architecture.

In this post, we give you our solution overview. For the full technical implementation, see the Jumia Central Authentication System post on the Jumia Tech blog.

The challenge

A group-level initiative was started to centralize authentication for all Jumia users in all countries for all companies. But it was impossible to unify the operational databases for the different companies. Each company had its own user database with its own technological implementation. Each company alone had yet to unify the logins for their own countries. The effects of deduplicating all user accounts were yet to be determined but were considered to be large. Finally, there was no team responsible for manage this new project, given that a new dedicated infrastructure would be needed.

With these factors in mind, we decided to design this project as a serverless architecture to eliminate the need for infrastructure and a dedicated team. AWS was immediately considered as the best option for its level of
service, intelligent pricing model, and excellent serverless services.

The goal was simple. For all user accounts on all Jumia Group websites:

  • Merge duplicates that might exist in different companies and countries
  • Create a unique login (username and password)
  • Enrich the user profile in this centralized system
  • Share the profile with all Jumia companies

Requirements

We had the following initial requirements while designing this solution on the AWS platform:

  • Secure by design
  • Highly available via multimaster replication
  • Single login for all platforms/countries
  • Minimal performance impact
  • No admin overhead

Chosen technologies

We chose the following AWS services and technologies to implement our solution.

Amazon API Gateway

Amazon API Gateway is a fully managed service, making it really simple to set up an API. It integrates directly with AWS Lambda, which was chosen as our endpoint. It can be easily replicated to other regions, using Swagger import/export.

AWS Lambda

AWS Lambda is the base of serverless computing, allowing you to run code without worrying about infrastructure. All our code runs on Lambda functions using Python; some functions are
called from the API Gateway, others are scheduled like cron jobs.

Amazon DynamoDB

Amazon DynamoDB
is a highly scalable, NoSQL database with a good API and a clean pricing model. It has great scalability as well as high availability, and fits the serverless model we aimed for with JCAS.

AWS KMS

AWS KMS was chosen as a key manager to perform envelope encryption. It’s a simple and secure key manager with multiple encryption functionalities.

Envelope encryption

Oversimplifying envelope encryption is when you encrypt a key rather than the data itself. That encrypted key, which was used to encrypt the data, may now be stored with the data itself on your persistence layer since it doesn’t decrypt the data if compromised. For more information, see How Envelope Encryption Works with Supported AWS Services.

Envelope encryption was chosen given that master keys have a 4 KB limit for data to be encrypted or decrypted.

Amazon SQS

Amazon SQS is an inexpensive queuing service with dynamic scaling, 14-day retention availability, and easy management. It’s the perfect choice for our needs, as we use queuing systems only as a
fallback when saving data to remote regions fails. All the features needed for those fallback cases were covered.

JWT

We also use JSON web tokens for encoding and signing communications between JCAS and company servers. It’s another layer of security for data in transit.

Data structure

Our database design was pretty straightforward. DynamoDB records can be accessed by the primary key and allow access using secondary indexes as well. We created a UUID for each user on JCAS, which is used as a primary key
on DynamoDB.

Most data must be encrypted so we use a single field for that data. On the other hand, there’s data that needs to be stored in separate fields as they need to be accessed from the code without decryption for lookups or basic checks. This indexed data was also stored, as a hash or plain, outside the main ciphered blob store.

We used a field for each searchable data piece:

  • Id (primary key)
  • main phone (indexed)
  • main email (indexed)
  • account status
  • document timestamp

To hold the encrypted data we use a data dictionary with two main dictionaries, info with the user’s encrypted data and keys with the key for each AWS KMS region to decrypt the info blob.
Passwords are stored in two dictionaries, old_hashes contains the legacy hashes from the origin systems and secret holds the user’s JCAS password.

Here’s an example:
data_at_rest

Security

Security is like an onion, it needs to be layered. That’s what we did when designing this solution. Our design makes all of our data unreadable at each layer of this solution, while easing our compliance needs.

A field called data stores all personal information from customers. It’s encrypted using AES256-CBC with a key generated and managed by AWS KMS. A new data key is used for each transaction. For communication between companies and the API, we use API Keys, TLS and JWT, in the body to ensure that the post is signed and verified.

Data flow

Our second requirement on JCAS was system availability, so we designed some data pipelines. These pipelines allow multi-region replication, which evades collision using idempotent operations on all endpoints. The only technology added to the stack was Amazon SQS. On SQS queues, we place all the items we aren’t able to replicate at the time of the client’s request.

JCAS may have inconsistencies between regions, caused by network or region availability; by using SQS, we have workers that synchronize all regions as soon as possible.

Example with two regions

We have three Lambda functions and two SQS queues, where:

1) A trigger Lambda function is called for the DynamoDB stream. Upon changes to the user’s table, it tries to write directly to another DynamoDB table in the second region, falling back to writing to two SQS queues.

fig1

2) A scheduled Lambda function (cron-style) checks a SQS queue for items and tries writing them to the DynamoDB table that potentially failed.

fig2

3) A cron-style Lambda function checks the SQS queue, calling KMS for any items, and fixes the issue.

fig3

The following diagram shows the full infrastrucure (for clarity, this diagram leaves out KMS recovery).

CAS Full Diagram

Results

Upon going live, we noticed a minor impact in our response times. Note the brown legend in the images below.

lat_1

This was a cold start, as the infrastructure started to get hot, response time started to converge. On 27 April, we were almost at the initial 500ms.
lat_2

It kept steady on values before JCAS went live (≈500ms).
lat_3

As of the writing of this post, our response time kept improving (dev changed the method name and we changed the subdomain name).
lat_4

Customer login used to take ≈500ms and it still takes ≈500ms with JCAS. Those times have improved as other components changed inside our code.

Lessons learned

  • Instead of the standard cross-region replication, creating your own DynamoDB cross-region replication might be a better fit for you, if your data and applications allow it.
  • Take some time to tweak the Lambda runtime memory. Remember that it’s billed per 100ms, so it saves you money if you have it run near a round number.
  • KMS takes away the problem of key management with great security by design. It really simplifies your life.
  • Always check the timestamp before operating on data. If it’s invalid, save the money by skipping further KMS, DynamoDB, and Lambda calls. You’ll love your systems even more.
  • Embrace the serverless paradigm, even if it looks complicated at first. It will save your life further down the road when your traffic bursts or you want to find an engineer who knows your whole system.

Next steps

We are going to leverage everything we’ve done so far to implement SSO in Jumia. For a future project, we are already testing OpenID connect with DynamoDB as a backend.

Conclusion

We did a mindset revolution in many ways.
Not only we went completely serverless as we started storing critical info on the cloud.
On top of this we also architectured a system where all user data is decoupled between local systems and our central auth silo.
Managing all of these critical systems became far more predictable and less cumbersome than we thought possible.
For us this is the proof that good and simple designs are the best features to look out for when sketching new systems.

If we were to do this again, we would do it in exactly the same way.

Daniel Loureiro (SecOps) & Tiago Caxias (DBA) – Jumia Group

AWS Week in Review – February 27, 2016

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-week-in-review-february-27-2016/

This edition includes all of our announcements, content from all of our blogs, and as much community-generated AWS content as I had time for. Going forward I hope to bring back the other sections, as soon as I get my tooling and automation into better shape.

Monday

February 27

Tuesday

February 28

Wednesday

March 1

Thursday

March 2

Friday

March 3

Saturday

March 4

Sunday

March 5

Jeff;

 

Resize Images on the Fly with Amazon S3, AWS Lambda, and Amazon API Gateway

Post Syndicated from Bryan Liston original https://aws.amazon.com/blogs/compute/resize-images-on-the-fly-with-amazon-s3-aws-lambda-and-amazon-api-gateway/


John Pignata, Solutions Architect

With the explosion of device types used to access the Internet with different capabilities, screen sizes, and resolutions, developers must often provide images in an array of sizes to ensure a great user experience. This can become complex to manage and drive up costs.

Images stored using Amazon S3 are often processed into multiple sizes to fit within the design constraints of a website or mobile application. It’s a common approach to use S3 event notifications and AWS Lambda for eager processing of images when a new object is created in a bucket.

In this post, I explore a different approach and outline a method of lazily generating images, in which a resized asset is only created if a user requests that specific size.

Resizing on the fly

Instead of processing and resizing images into all necessary sizes upon upload, the approach of processing images on the fly has several upsides:

  • Increased agility
  • Reduced storage costs
  • Resilience to failure

Increased agility

When you redesign your website or application, you can add new dimensions on the fly, rather than working to reprocess the entire archive of images that you have stored.

Running a batch process to resize all original images into new, resized dimensions can be time-consuming, costly, and error-prone. With the on-the-fly approach, a developer can instead specify a new set of dimensions and lazily generate new assets as customers use the new website or application.

Reduced storage costs

With eager image processing, the resized images must be stored indefinitely as the operation only happens one time. The approach of resizing on-demand means that developers do not need to store images that are not accessed by users.

As a user request initiates resizing, this also unlocks options for optimizing storage costs for resized image assets, such as S3 lifecycle rules to expire older images that can be tuned to an application’s specific access patterns. If a user attempts to access a resized image that has been removed by a lifecycle rule, the API resizes it on demand to fulfill the request.

Resilience to failure

A key best practice outlined in the Architecting for the Cloud: Best Practices whitepaper is

“Design for failure and nothing will fail.” When building distributed services, developers should be pessimistic and assume that failures will occur.

If image processing is designed to occur only one time upon object creation, an intermittent failure in that process―or any data loss to the processed images―could cause continual failures to future users. When resizing images on-demand, each request initiates processing if a resized image is not found, meaning that future requests could recover from a previous failure automatically.

Architecture overview

diagram

Here’s the process:

  1. A user requests a resized asset from an S3 bucket through its static website hosting endpoint. The bucket has a routing rule configured to redirect to the resize API any request for an object that cannot be found.
  2. Because the resized asset does not exist in the bucket, the request is temporarily redirected to the resize API method.
  3. The user’s browser follows the redirect and requests the resize operation via API Gateway.
  4. The API Gateway method is configured to trigger a Lambda function to serve the request.
  5. The Lambda function downloads the original image from the S3 bucket, resizes it, and uploads the resized image back into the bucket as the originally requested key.
  6. When the Lambda function completes, API Gateway permanently redirects the user to the file stored in S3.
  7. The user’s browser requests the now-available resized image from the S3 bucket. Subsequent requests from this and other users will be served directly from S3 and bypass the resize operation. If the resized image is deleted in the future, the above process repeats and the resized image is re-created and replaced into the S3 bucket.

Set up resources

A working example with code is open source and available in the serverless-image-resizing GitHub repo. You can create the required resources by following the README directions, which use an AWS Serverless Application Model (AWS SAM) template, or manually following the directions below.

To create and configure the S3 bucket

  1. In the S3 console, create a new S3 bucket.
  2. Choose Permissions, Add Bucket Policy. Add a bucket policy to allow anonymous access.
  3. Choose Static Website Hosting, Enable website hosting and, for Index Document, enter index.html.
  4. Choose Save.
  5. Note the name of the bucket that you’ve created and the hostname in the Endpoint field.

To create the Lambda function

  1. In the Lambda console, choose Create a Lambda function, Blank Function.
  2. To select an integration, choose the dotted square and choose API Gateway.
  3. To allow all users to invoke the API method, for Security, choose Open and then Next.
  4. For Name, enter resize. For Code entry type, choose Upload a .ZIP file.
  5. Choose Function package and upload the .ZIP file of the contents of the Lambda function.
  6. To configure your function, for Environment variables, add two variables:
    • For Key, enter BUCKET; for Value,enter the bucket name that you created above.
    • For Key, enter URL; for Value, enter the endpoint field that you noted above, prefixed with http://.
  7. To define the execution role permissions for the function, for Role, choose Create a custom role. Choose View Policy Document, Edit, Ok.
  8. Replace YOUR_BUCKET_NAME_HERE with the name of the bucket that you’ve created and copy the following code into the policy document. Note that any leading spaces in your policy may cause a validation error.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::__YOUR_BUCKET_NAME_HERE__/*"    
    }
  ]
}
  1. For Memory, choose 1536. For Timeout, enter 10 sec. Choose Next, Create function.
  2. Choose Triggers, and note the hostname in the URL of your function.

resizefly_2.png

To set up the S3 redirection rule

  1. In the S3 console, open the bucket that you created above.
  2. Expand Static Website Hosting, Edit Redirection Rules.
  3. Replace YOUR_API_HOSTNAME_HERE with the hostname that you noted above and copy the following into the redirection rules configuration:
<routingrules>
 <routingrule>
  <condition>
   <keyprefixequals>
    <httperrorcodereturnedequals>
     404
    </httperrorcodereturnedequals>
   </keyprefixequals>
  </condition>
  <redirect>
   <protocol>
    https
   </protocol>
   <hostname>
    __YOUR_API_HOSTNAME_HERE__
   </hostname>
   <replacekeyprefixwith>
    prod/resize?key=
   </replacekeyprefixwith>
   <httpredirectcode>
    307
   </httpredirectcode>
  </redirect>
 </routingrule>
</routingrules>

Test image resizing

Upload a test image into your bucket to for testing. The blue marble is a great sample image for testing because it is large and square. Once uploaded, try to retrieve resized versions of the image using your bucket’s static website hosting endpoint:

http://YOUR_BUCKET_WEBSITE_HOSTNAME_HERE/300×300/blue_marble.jpg

http://YOUR_BUCKET_WEBSITE_HOSTNAME_HERE/25×25/blue_marble.jpg

http://YOUR_BUCKET_WEBSITE_HOSTNAME_HERE/500×500/blue_marble.jpg

You should see a smaller version of the test photo. If not, choose Monitoring in your Lambda function and check CloudWatch Logs for troubleshooting. You can also refer to the serverless-image-resizing GitHub repo for a working example that you can deploy to your account.

Conclusion

The solution I’ve outlined is a simplified example of how to implement this functionality. For example, in a real-world implementation, there would likely be a list of permitted sizes to prevent a requestor from filling your bucket with randomly sized images. Further cost optimizations could be employed, such as using S3 lifecycle rules on a bucket dedicated to resized images to expire resized images after a given amount of time.

This approach allows you to lazily generate resized images while taking advantage of serverless architecture. This means you have no operating systems to manage, secure, or patch; no servers to right-size, monitor, or scale; no risk of over-spending by over-provisioning; and no risk of delivering a poor user experience due to poor performance by under-provisioning.

If you have any questions, feedback, or suggestions about this approach, please let us know in the comments!

AWS Lambda – A Look Back at 2016

Post Syndicated from Tara Walker original https://aws.amazon.com/blogs/aws/aws-lambda-a-look-back-at-2016/

2016 was an exciting year for AWS Lambda, Amazon API Gateway and serverless compute technology, to say the least. But just in case you have been hiding away and haven’t heard of serverless computing with AWS Lambda and Amazon API Gateway, let me introduce these great services to you.  AWS Lambda lets you run code without provisioning or managing servers, making it a serverless compute service that is event-driven and allows developers to bring their functions to the cloud easily for virtually any type of application or backend.  Amazon API Gateway helps you quickly build highly scalable, secure, and robust APIs at scale and provides the ability to maintain and monitor created APIs.

With the momentum of serverless in 2016, of course, the year had to end with a bang as the AWS team launched some powerful service features at re:Invent to make it even easier to build serverless solutions.  These features include:

Since Jeff has already introduced most of the aforementioned new service features for building distributed applications and microservices like Step Functions, let’s walk-through the last four new features not yet discussed using a common serverless use case example: Real-time Stream Processing.  In our walk-through of the stream processing use case, we will implement a Dead Letter Queue for notifications of errors that may come from the Lambda function processing a stream of data, we will take an existing Lambda function written in Node.js to process the stream and rewrite it using the C# language.  We then will build an example of the monetization of a Lambda backed API using API Gateway’s integration with AWS Marketplace.  This will be exciting, so let’s get started.

During the AWS Developer Days in San Francisco and Austin, I presented an example of leveraging AWS Lambda for real-time stream processing by building a demo showcasing a streaming solution with Twitter Streaming APIs. I will build upon this example to demonstrate the power of Dead Letter Queues (DLQ), C# Support, API Gateway Monetization features, and the open source template for API Gateway Developer Portal.  In the demo, a console or web application streams tweets gathered from the Twitter Streaming API that has the keywords ‘awscloud’ and/or ‘serverless’.  Those tweets are sent real-time to Amazon Kinesis Streams where Lambda detects the new records and processes the stream batch by writing the tweets to the NoSQL database, Amazon DynamoDB.

Now that we understand the real-time streaming process demo’s workflow, let’s take a deeper look at the Lambda function that processes the batch records from Kinesis.  First, you will notice below that the Lambda function, DevDayStreamProcessor, has an event source or trigger that is a Kinesis stream named DevDay2016Stream with a Batch size of 100.  Our Lambda function will poll the stream periodically for new records and automatically read and process batches of records, in this case, the tweets detected on the stream.

Now we will examine our Lambda function code which is written in Node.js 4.3. The section of the Lambda function shown below loops through the batch of tweet records from our Kinesis stream, parses each record, and writes desired tweet information into an array of JSON data. The array of the JSON tweet items is passed to the function, ddbItemsWrite which is outside of our Lambda handler.

'use strict';
console.log('Loading function');

var timestamp;
var twitterID;
var tweetData;
var ddbParams;
var itemNum = 0;
var dataItemsBatch = [];
var dbBatch = [];
var AWS = require('aws-sdk');
var ddbTable = 'TwitterStream';
var dynamoDBClient = new AWS.DynamoDB.DocumentClient();

exports.handler = (event, context, callback) => {
    var counter = 0; 
    
    event.Records.forEach((record) => {
        // Kinesis data is base64 encoded so decode here
        console.log("Base 64 record: " + JSON.stringify(record, null, 2));
        const payload = new Buffer(record.kinesis.data, 'base64').toString('ascii');
        console.log('Decoded payload:', payload);
        
        var data = payload.replace(/[\u0000-\u0019]+/g," "); 
        try
        {  tweetData = JSON.parse(data);   }
        catch(err)
        {  callback(err, err.stack);   }
        
        timestamp = "" + new Date().getTime();
        twitterID = tweetData.id.toString();
        itemNum = itemNum+1;
               
         var ddbItem = {
                PutRequest: { 
                    Item: { 
                        TwitterID: twitterID,
                        TwitterUser: tweetData.username.toString(),
                        TwitterUserPic: tweetData.pic,
                        TwitterTime: new Date(tweetData.time.replace(/( \+)/, ' UTC$1')).toLocaleString(), 
                        Tweet: tweetData.text,
                        TweetTopic: tweetData.topic,
                        Tags: (tweetData.hashtags) ? tweetData.hashtags : " ",
                        Location: (tweetData.loc) ? tweetData.loc : " ",
                        Country: (tweetData.country) ? tweetData.country : " ",
                        TimeStamp: timestamp,
                        RecordNum: itemNum
                    }
                }
            };
            
            dataItemsBatch.push(ddbItem);
            counter++;
});
    
    var twitterItems = {}; 
    twitterItems[ddbTable] = dataItemsBatch; 
    ddbItemsWrite(twitterItems, 0, context, callback); 

};

The ddbItemsWrite function shown below will take the array of JSON tweet records processed from the Kinesis stream, and write the records multiple items at a time to our DynamoDB table using batch operations. This function leverages the DynamoDB best practice of retrying unprocessed items by implementing an exponential backoff algorithm to prevent write request failures due to throttling on the individual tables.

 function ddbItemsWrite(items, retries, ddbContext, ddbCallback) 
    { 
        dynamoDBClient.batchWrite({ RequestItems: items }, function(err, data) 
            { 
                if (err) 
                { 
                    console.log('DDB call failed: ' + err, err.stack); 
                    ddbCallback(err, err.stack); 
                } 
                else 
                { 
                    if(Object.keys(data.UnprocessedItems).length) 
                    { 
                        console.log('Unprocessed items remain, retrying.'); 
                        var delay = Math.min(Math.pow(2, retries) * 100, ddbContext.getRemainingTimeInMillis() - 200); 
                        setTimeout(function() {ddbItemsWrite(data.UnprocessedItems, retries + 1, ddbContext, ddbCallback)}, delay); 
                    } 
                    else 
                    { 
                         ddbCallback(null, "Success");
                         console.log("Completed Successfully");
                    } 
                } 
            } 
        );
    }

Currently, this Lambda function works as expected and will successfully process tweets captured in Kinesis from the Twitter Streaming API, however, this function has a flaw that will cause an error to occur when processing batch write requests to our DynamoDB table.  In the Lambda function, the current code does not take into account that the DynamoDB batchWrite function should be comprised of no more than 25 write (put) requests per single call to this function up to 16 MB of data. Therefore, without changing the code appropriately to have the ddbItemsWrite function to handle batches of 25 or have the handler function put items in the array in groups of 25 requests before sending to the ddbItemsWrite function; there will be a validation exception thrown when the batch of tweets items sent is greater than 25.  This is a great example of a bug that is not easily detected in small-scale testing scenarios yet will cause failures under production load.

 

Dead Letter Queues

Now that we are aware of an event that will cause the ddbItemsWrite Lambda function to throw an exception and/or an event that will fail while processing records, we have a first-rate scenario for leveraging Dead Letter Queues (DLQ).

Since AWS Lambda DLQ functionality is only available for asynchronous event sources like Amazon S3, Amazon SNS, AWS IoT or direct asynchronous invocations, and not for streaming event sources such as Amazon Kinesis or Amazon DynamoDB streams; our first step is to break this Lambda function into two functions.  The first Lambda function will handle the processing of the Kinesis stream, and the second Lambda function will take the data processed by the first function and write the tweet information to DynamoDB.  We will then setup our DLQ on the second Lambda function for the error that will occur on writing the batch of tweets to DynamoDB as noted above.

We have two options when setting up a target for our DLQ; Amazon SNS topic or an Amazon SQS queue.  In this walk-through, we will opt for using an Amazon SQS queue.  Therefore, my first step in using DLQ is to create a SQS Standard queue.  A Standard queue type is a queue which has high transactions throughput, a message will be delivered at least once, but another copy of the message may also be delivered, and it is possible that messages might be delivered in an order different from which they were sent.  You can learn more about creating SQS queues and queue type in the Amazon SQS documentation.

Once my queue, StreamDemoDLQ, is created, I will grab the ARN from the Details tab of this selected queue. If I am not using the console to designate the DLQ resource for this function, I will need the ARN for the queue for my Lambda function to identify this SQS queue as the DLQ target for error and event failure notifications. Additionally, I will use the ARN to add permissions to my Lambda execution role policy in order to access this SQS queue.

I will now return to my Lambda function and select the Configuration tab and expand the Advanced settings section. I will select SQS in the DLQ Resource field and select my StreamDemoDLQ queue in the SQS Queue field dropdown.

Remember, the execution role for the Lambda function must explicitly provide sqs:SendMessage access permissions to in order to successfully send messages to your SQS DLQ.  Therefore, I ensured that my Lambda role, lambda_kinesis_role, has the following IAM policy for SQS permissions.

 

We have now successfully configured a Dead Letter Queue for our Lambda function using Amazon SQS. To learn more about Dead Letter Queues in Lambda, read the Troubleshooting and Monitoring section of the AWS Lambda Developer Guide and check out the AWS Compute Blog post on Dead Letter Queues.

C# Support

As I mentioned earlier, another very exciting feature added to Lambda during AWS re:Invent was the support for the C# language via the open source .NET Core 1.0 platform.  Since the Lambda console does not offer editing for compiled languages yet, in order to author a C# Lambda function you can use tooling in Visual Studio with the AWS Toolkit, Yeoman, and/or the .NET CLI.  To deploy Lambda functions written in C#, you can use the Lambda plugin in the AWS ToolKit for Visual Studio or create a deployment package with the .NET Core command line.

A C# Lambda function handler should be defined as an instance or static method in a class. There are two handler function parameters; the first is the input type which is the event data and second is the Lambda context object of type ILambdaContext. The event data input object types for AWS Services include the following:

  • Amazon.Lambda.APIGatewayEvents
  • Amazon.Lambda.CognitoEvents
  • Amazon.Lambda.ConfigEvents
  • Amazon.Lambda.DynamoDBEvents
  • Amazon.Lambda.KinesisEvents
  • Amazon.Lambda.S3Events
  • Amazon.Lambda.SNSEvents

Now that we have discussed more detail around C# Support in Lambda, let’s rewrite our DevDayStreamProcessor lambda function with the C# language. For this example, I will use Visual Studio IDE to write the Lambda function, and additionally take advantage of the AWS Lambda Visual Studio plugin to deploy the function. Remember in order to use the AWS Toolkit for Visual Studio with Lambda, you will need to have Visual Studio 2015 Update 3 version and NET Core tools. You can read more about installing Visual Studio 2015 Update 3 and .NET Core here.

To create the C# function using Visual Studio, I start a New Project, select AWS Lambda Project (.NET Core) and name it ServerlessStreamProcessor.

What’s really cool about taking advantage of the AWS Toolkit for Visual Studio to author this function, is that inside of Visual Studio I can use Lambda blueprints to get started in a similar way that I would in using the Lambda console.  Therefore in order to replicate the DevDayStreamProcessor in C#, I will select the Simple Kinesis Function blueprint.

It should be noted that when writing Lambda functions in C#, there is no need to mark the class declaration nor the target handler function as a Lambda function. Additionally, when writing CloudWatch logs you can use the standard C# Console class WriteLine function or use the ILambdaContext LogLine function found as a part of the ILambdaContext interface. With the template for accessing the Kinesis stream in place, I finish writing the C# Lambda function, ServerlessStreamProcessor, utilizing the same variable names as in the Node.js code in DevDayStreamProcessor. Please note the C# Lambda handler function below.

using System.Collections.Generic;
using Amazon.Lambda.Core;
using Amazon.Lambda.KinesisEvents;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.DataModel;
using Newtonsoft.Json.Linq;

// Assembly attribute to enable the Lambda function's JSON input to be converted into a .NET class.
[assembly: LambdaSerializerAttribute(typeof(Amazon.Lambda.Serialization.Json.JsonSerializer))]

namespace ServerlessStreamProcessor
{
    public class LambdaTwitterStream
    {
        string twitterID, timeStamp;
        int itemNum = 0;
        
        private static AmazonDynamoDBClient dynamoDBClient = new AmazonDynamoDBClient();
        List<TwitterItem> dataItemsBatch = new List<TwitterItem>();
        
        public void FunctionHandler(KinesisEvent kinesisEvent, ILambdaContext context)
        {
            DynamoDBContext dbContext = new DynamoDBContext(dynamoDBClient);
            context.Logger.LogLine($"Beginning to process {kinesisEvent.Records.Count} records...");
            
            foreach (var record in kinesisEvent.Records)
            {
                context.Logger.LogLine($"Event ID: {record.EventId}");
                context.Logger.LogLine($"Event Name: {record.EventName}");

                // Kinesis data is base64 encoded so decode here
                string tweetData = GetRecordContents(record.Kinesis);
                context.Logger.LogLine($"Decoded Payload: {tweetData}");
                tweetData = @"" + tweetData;
                JObject twitterObj = JObject.Parse(tweetData);
                
                twitterID = twitterObj["id"].ToString();
                timeStamp = DateTime.Now.Millisecond.ToString();
                itemNum++;
                context.Logger.LogLine(timeStamp);
                context.Logger.LogLine($"Twitter ID is: {twitterID}");
                context.Logger.LogLine(itemNum.ToString());

                TwitterItem ddbItem = new TwitterItem()
                { 
                    TwitterID = twitterID,
                    TwitterUser = twitterObj["username"].ToString(),
                    TwitterUserPic = twitterObj["pic"].ToString(),
                    TwitterTime = DateTime.Parse(twitterObj["time"].ToString()).ToUniversalTime().ToString(),
                    Tweet = twitterObj["text"].ToString(),
                    TweetTopic = twitterObj["topic"].ToString(),
                    Tags = twitterObj["hashtags"] != null ? twitterObj["hashtags"].ToString() : String.Empty,
                    Location = twitterObj["loc"] != null ? twitterObj["loc"].ToString() : String.Empty,
                    Country = twitterObj["country"] != null ? twitterObj["country"].ToString() : String.Empty,
                    TimeStamp =  timeStamp,
                    RecordNum = itemNum
                };
                
                dataItemsBatch.Add(ddbItem);
            }

            context.Logger.LogLine(JObject.FromObject(dataItemsBatch).ToString());
            ddbItemsWrite(dataItemsBatch, 0, dbContext, context);
            context.Logger.LogLine("Success - Completed Successfully");
            context.Logger.LogLine("Stream processing complete.");
        }

There are only a few differences that should be noted between our Kinesis stream processor written in C# and our original Node.js code.  Since the input parameter type supported by default in C# Lambda functions is the System.IO.Stream type, the Kinesis base64 string is decoded by using a StreamReader with ASCII encoding in a blueprint provided function, GetRecordContents.

 

private string GetRecordContents(KinesisEvent.Record streamRecord)
{
    using (var reader = new StreamReader(streamRecord.Data, Encoding.ASCII))
    {
        return reader.ReadToEnd();
    }
}

The other thing to note is that in order to write the tweet data to the DynamoDB Table, I added the AWS .NET SDK NuGet package for DynamoDB; AWSSDK.DynamoDBv2 to the Lambda function project via the NuGet package manager within Visual Studio.  I also created a .NET data object, TwitterItem, to map to the data being stored in the DynamoDB table. Using the AWS .NET SDK higher level programming interface, object persistence model for DynamoDB, I created a collection of TwitterItem objects to be written via the BatchWrite object class in our ddbItemsWrite C# function.

private async void ddbItemsWrite(List<TwitterItem> items, int retries, DynamoDBContext ddbContext, ILambdaContext context)
{
BatchWrite<TwitterItem> twitterStreamBatchWrite = ddbContext.CreateBatchWrite<TwitterItem>();
        
        try
        {
            twitterStreamBatchWrite.AddPutItems(items);   
            await twitterStreamBatchWrite.ExecuteAsync();
        }
        catch (Exception ex)
        {
            context.Logger.LogLine($"DDB call failed: {ex.Source} ");
            context.Logger.LogLine($"Exception: {ex.Message}");
            context.Logger.LogLine($"Exception Stacktrace: {ex.StackTrace}");
        }      
}

Another benefit of using AWS Toolkit for Visual Studio to author my C# Lambda function is that I can deploy my Lambda function directly to AWS with a single click.  Selecting my project name in the Solution Explorer and performing a right-click, I get a menu option, Publish to AWS Lambda, which brings up a menu for information to include about my Lambda function for deployment to AWS.

It is important to note that the handler function signature follows the nomenclature of Assembly :: Namespace :: ClassName :: Method, therefore, the signature of our C# Lambda function shown here is: ServerlessStreamProcessor :: ServerlessStreamProcessor.LambdaTwitterStream :: FunctionHandler.  We provide this information to the Upload to AWS Lambda dialog box and select Next to assign a role for the function.

Upon completion, you can test in the Lambda console or in Visual Studio with AWS toolkit provided plugin (shown below) using the sample data of the triggering event source for an iterative approach to developing the Lambda function.

You can learn more about authoring AWS Lambda functions using the C# Language in the AWS Lambda developer guide or by reading the post announcing C# Support on the Compute Blog.

API Gateway Monetization and Developer Portal

If you have been following the microservices momentum, you may be aware of an architectural pattern that calls for using smart endpoints and/or using an API gateway via REST APIs to manage access and exposure of individual services that make up a microservices solution.  Amazon API Gateway enables creation and management of RESTful APIs to expose AWS Lambda functions, external HTTP endpoints, as well as, other AWS services.  In addition, Amazon API Gateway allows clients and external developers to have access to a deployed APIs by via HTTP protocol or a platform/language targeted SDK.

With the introduction of SaaS Subscriptions on AWS Marketplace and the API Gateway integration with the AWS Marketplace, you can now monetize your APIs by allowing customers to directly consume the APIs you create with API Gateway in the AWS Marketplace.  AWS customers can subscribe and be billed for the APIs published on the marketplace with their existing AWS account.  With the integration of API Gateway with the AWS Marketplace, the process to get started is easy on the AWS Marketplace.

To get started, you must ensure that you have enabled the Usage Plan feature in Amazon API Gateway.

Once enabled the next step is to create a Usage Plan, enable throttling (if desired) with targeted rate and burst request thresholds, and finally enable quotas (if you choose) by providing targeted request quota per a set timeframe.

Next, we would choose our APIs and related stage(s) that we wish to be associated with the usage plan. Please note that this is an optional step as you can opt not associate a specific API with your usage plan.

All that is left to do is add or create an API key for the usage plan.  Again, it should be noted that this is also an optional step in creating your usage plan.

Now that we have our usage plan, StreamingPlan, we are ready for the next step in preparation for selling our API on the marketplace. You have the option to create multiple usage plans with varying APIs and limits, and sell these plans as differentiated API products on AWS Marketplace.

In order to enable customers to buy our new API product, however, the AWS Marketplace requires that each API product has an external developer portal to handle subscription requests, provide API information details and ability for the management of usage.

This customer need for an external developer portal for the marketplace birthed the new open source API Gateway developer portal serverless web application implementation.  The goal of the API Gateway developer portal project was to allow customers to follow a few easy steps to create a serverless web application that lists a catalog of your APIs built with API Gateway while allowing for developer signups.

The API Gateway developer portal was built upon AWS Serverless Express; an open source library published by AWS which aids you in utilizing AWS Lambda and Amazon API Gateway in building web applications/services with the Node.js Express framework.  Additionally, the API Gateway developer portal application uses an AWS SAM (Serverless Application Model) template to deploy its serverless resources.  AWS SAM is a simplified CloudFormation template and specification that allows easier management and deployment of serverless applications on AWS.

To build your developer portal using the API Gateway portal, you would start by cloning the aws-api-gateway-developer-portal project from GitHub.

Assuming you have the latest version of the AWS CLI and Node.js installed, you would setup the developer portal by running “npm run setup” on the command line for Mac and Linux OS users. For Windows users, you would run “npm run win-setup” on the command line setup the developer portal.

The result is a functional sample developer portal website running on S3 that you can customize in order to create your own developer portal for your APIs.

The frontend of the sample developer portal website is built with the React JavaScript library, and the backend is an AWS Lambda function running using the aws-serverless-express library. Additionally, a Lambda function with a SNS event source was created as a listener for notification when customers subscribe or unsubscribe to your API via the AWS Marketplace console.  You can learn more about the steps to build, customize, and deploy your API Gateway developer portal web application with this reference project by visiting the AWS Compute blog post which discusses the architecture and implementation in more detail.

 

The next key step in monetizing our API is establishing an account on the AWS Marketplace.  If an account is not already established, registering is simply verifying that you meet the requirement prerequisites provided in the AWS Marketplace Seller Guide and completing a seller registration form on the AWS Marketplace Management Portal.  You can see a snapshot of the start of the seller registration form below.

To list the API, you would fill a product load form describing the API, establish the pricing for the API, and provide t\he IDs of AWS Accounts that will test the API subscription process.  Completing this form would also require you to submit the URL for your API developer portal.

When your seller registration is complete, you will be supplied an AWS Marketplace product code.  You will need to associate your marketplace product code with your API usage plan.  In order to complete this step, you would simply log into the API Gateway console and go to your API usage plan. Go to the Marketplace tab and enter your product code. This tells API Gateway to send measurement data to AWS Marketplace when your API is used.

With your Amazon API Gateway managed API packaged into a usage plan, the accompanying API developer portal created, seller account registration completed, and product code associated with API usage plan; we are now ready to monetize our API on the AWS Marketplace.

Learn more about monetizing your APIs created with API Gateway by checking out the related blog post and reviewing the API Gateway developer guide documentation.

Summary

As you can see, the AWS teams were busy in 2016 working to make the customer experience easier for creating and deploying serverless architectures, as well as, providing mechanisms for customers to generate and monetize their API Gateway managed APIs.

Visit the product documentation for AWS Lambda and Amazon API Gateway to learn more about these services and all the newly released features.

Tara

Derive Insights from IoT in Minutes using AWS IoT, Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight

Post Syndicated from Ben Snively original https://aws.amazon.com/blogs/big-data/derive-insights-from-iot-in-minutes-using-aws-iot-amazon-kinesis-firehose-amazon-athena-and-amazon-quicksight/

Ben Snively is a Solutions Architect with AWS

Speed and agility are essential with today’s analytics tools. The quicker you can get from idea to first results, the more you can experiment and innovate with your data, perform ad-hoc analysis, and drive answers to new business questions.

Serverless architectures help in this respect by taking care of the non-differentiated heavy lifting in your environment―tasks such as managing servers, clusters, and device endpoints – allowing you to focus on assembling your IoT system, analyzing data, and building meaningful reports quickly and efficiently.

In this post, I show how you can build a business intelligence capability for streaming IoT device data using AWS serverless and managed services. You can be up and running in minutes―starting small, but able to easily grow to millions of devices and billions of messages.

AWS serverless services

AWS services offer a quicker time to intelligence. The following is a high-level diagram showing how the services in this post are configured:

o_IoT_Minutes_1

AWS IoT easily and securely connects devices through the MQQT and HTTPS  protocols. The IoT Rules Engine continuously processes incoming messages, enabling your devices to interact with other AWS services.

Amazon S3 is object storage that provides you a highly reliable, secure, and scalable storage for all your data, big or small.  It is designed to deliver 99.999999999% durability, and scale past trillions of objects.

Amazon Kinesis Firehose allows you to capture and automatically load streaming data into Amazon Kinesis Analytics, Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service, enabling near-real-time business intelligence and the use of dashboards.

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using SQL.  Simply point to your data in Amazon S3, define the schema, and start querying using standard SQL, with most results delivered in seconds.

Amazon QuickSight is a fast, cloud-powered business analytics service that makes it easy to build visualizations, perform ad-hoc analysis, and quickly get business insights from your data. You can easily run SQL queries using Athena on data stored in S3, and build business dashboards within QuickSight.

Walkthrough: Heartrate sensors

In this walkthrough, you run a script to mimic multiple sensors publishing messages on an IoT MQTT topic, with one message published every second. The events get sent to AWS IoT, where an IoT rule is configured.  The IoT rule captures all messages and sends them to Firehose.  From there, Firehose writes the messages in batches to objects stored in S3.  In S3, you set up a table in Athena and use QuickSight to analyze the IoT data.

Configuring Firehose to write to S3

Create a Firehose delivery stream using the following field values.

Name IoT-to-BI-Example
S3 Bucket <Create one>
S3 Prefix iot_to_bi_example/

Configuring the AWS IoT rule

Create a new AWS IoT rule with the following field values.

Name Iot_to_bi_demoRule
Attribute *
Topic Filter /health/#
Add Action Send messages to an Amazon Kinesis Firehose stream (from previous section).
Select Separator “\n (newline)”

Generating sample data

Running the heartrate.py python script generates fictitious IoT messages from multiple userid values.  The IoT rule sends the message to Firehose, which writes the data out to S3.

The script assumes that it has access to AWS CLI credentials and that boto3 is installed on the machine running the script.

Download and run the following Python script:

https://github.com/awslabs/aws-big-data-blog/blob/master/aws-blog-iot-athena-quicksight-bi/heartrate.py

The script generates random data that looks like the following:

{"heartRate": 67, "userId": "Brady", "rateType": "NORMAL", "dateTime": "2016-12-18 14:01:39"}
{"heartRate": 78, "userId": "Bailey", "rateType": "NORMAL", "dateTime": "2016-12-18 14:01:41"}
{"heartRate": 94, "userId": "Beatrice", "rateType": "NORMAL", "dateTime": "2016-12-18 14:01:42"}
…
{"heartRate": 73, "userId": "Ben", "rateType": "NORMAL", "dateTime": "2016-12-18 14:01:54"}

Run the script for a few hours and then end the task or process.

Configuring Athena

Log into the Athena console. In the Query Editor, create a table using the following query:

CREATE EXTERNAL TABLE heartrate_iot_data (
    heartRate int,
    userId string,
    rateType string,
    dateTime timestamp)
ROW FORMAT  serde 'org.apache.hive.hcatalog.data.JsonSerDe'
with serdeproperties( 'ignore.malformed.json' = 'true' )
LOCATION 's3://<CREATED-BUCKET>/iot_to_bi_example/'

After the query completes, you see the new table in the left-hand pane:

o_IoT_Minutes_2

Run the following command to see the different userid values that were generated by the test script:

SELECT userid, COUNT(userid) FROM heartrate_iot_data GROUP BY userid

Your counts depend on how long you left the script to run, but you should have the following names listed:

o_IoT_Minutes_3

Analyzing the data

Now, analyze this data using Athena via QuickSight.

Set up a data source

Log into QuickSight and choose Manage data, New data set. Choose Athena as a new data source.

o_IoT_Minutes_4

Name your data source “AthenaDataSource”. Select the default schema and the heartrate_iot_data table.

o_IoT_Minutes_5

On the left, choose Visualize.

NOTE: The Edit/Preview data button allows you to prepare the dataset differently prior to visualizing it.  Examples include being able to format and transform your data, create alias data fields, change data types, and filter or join your data.

Build an analysis

On the left, choose userid; you’ll see the same names and counts that you saw when you ran the Athena SQL query earlier.

o_IoT_Minutes_6

Choose ratetype to query over the IoT data for the number of userids that have had HIGH versus NORMAL heartrates.

o_IoT_Minutes_7

Focus on “HIGH” heart rates by interacting with the graph. Open one of the bars under the HIGH section and choose Focus only on HIGH.

o_IoT_Minutes_8

This configures a filter to include only “HIGH” records.

o_IoT_Minutes_9

On the left, choose Visualize. Clear ratetype (as you are now filtering by it), choose heartrate, and switch to AVERAGE instead of SUM.

IoT_Minutes_10

Selecting the datetime X axis title allows you to change the time units easily. Change it to an hour.

You can easily change the parameters and see the users who haven’t had any HIGH heartrate data (in this dataset: Bonny, Brady, and Branden). 

o_IoT_Minutes_11

Conclusion

With these few easy steps, you’ve assembled a fully managed set of services for collecting sensor data from devices, batching the data into S3, and building intelligence from that data using QuickSight and Athena.

This was all done without launching a single server or cluster; throughout, you’ve been able to focus on the data and analytics, rather than the management of the infrastructure.

In this example, you kept the data in JSON.  Converting it to Parquet would have been much more performant. Setting up partitions for the data would also restrict the amount of data scanned.

If you have questions or suggestions, please comment below.


About the Author

 

ben_snively_90Ben Snively is a Public Sector Specialist Solutions Architect. He works with government, non-profit and education customers on big data and analytical projects, helping them build solutions using AWS. In his spare time he adds IoT sensors throughout his house and runs analytics on it.

 

 


Related

Integrating IoT Events into Your Analytic Platform

iot_1_smlal

Expanding the AWS Blog Team – Welcome Ana, Tina, Tara, and the Localization Team

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/expanding-the-aws-blog-team-welcome-ana-tina-tara-and-the-localization-team/

I wrote my first post for this blog back in 2004, and have published over 2,700 more since then, including 52 last month! Given the ever-increasing pace of AWS innovation, and the amount of cool stuff that we have to share with you, we are expanding our blogging team. Please give a warm welcome to Ana, Tina, and Tara:

Ana Visneski (@acvisneski) was the first official blogger for the United States Coast Guard. While there she focused on search & rescue coordination and also led the team that established a social media presence. Ana is a graduate of the University of Washington Communications Leadership Program, and was the first to complete both the Master of Communication in Digital Media (MCDM) and Master of Communication in Communities and Networks (MCCN) degree programs. Ana works with our guest posters, tracks our metrics, and manages the ticketing system that we use to coordinate our activities.

Tina Barr (@tinathebarr) is a Recruiting Coordinator for the AWS Commercial Sales organization. In order to provide a great first impression for her candidates, she began to read the AWS Customer Success Stories and developed a special interest in startups. In addition to her recruiting duties, Tina writes the AWS Hot Startups (September, October, November) posts each month. Tina earned a Bachelor’s degree in Community Health from Western Washington University and has always enjoyed writing.

Tara Walker (@taraw) is an AWS Technical Evangelist. Her background includes time as a developer and software engineer at multiple high-tech and media companies. With a focus on IoT, mobile, gaming, serverless architectures, and cross-platform development, Tara loves to dive deep into the latest and greatest technical topics and build compelling demos. Tara has a Bachelor’s degree from Georgia State University and is currently working on a Master’s degree in Computer Science from Georgia Institute of Technology. Like me, Tara will focus on writing posts for upcoming AWS launches.

I am thrilled to be working with these three talented and creative new members of our blogging team, and am looking forward to seeing what they come up with.

AWS Blog Localization Team
Many of the posts on this blog have been translated into Japanese and Korean for AWS customers who are most comfortable in those languages. A big thank-you is due to those who are doing this work:

  • JapaneseTatsuji Ishibashi (Product Marketing Manager for AWS Japan) manages and reviews the translation work that is done by a team of Solution Architects in Japan and publishes the content on the AWS Japan Blog.
  • KoreanChanny Yun (Technical Evangelist for AWS Korea) translates posts for the AWS Korea Blog.
  • Mandarin Chinese – Our colleagues in China are translating important announcements for the AWS China Blog.

Jeff;