Tag Archives: AWS Step Functions

Orchestrate Amazon Redshift-Based ETL workflows with AWS Step Functions and AWS Glue

Post Syndicated from Ben Romano original https://aws.amazon.com/blogs/big-data/orchestrate-amazon-redshift-based-etl-workflows-with-aws-step-functions-and-aws-glue/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud that offers fast query performance using the same SQL-based tools and business intelligence applications that you use today. Many customers also like to use Amazon Redshift as an extract, transform, and load (ETL) engine to use existing SQL developer skillsets, to quickly migrate pre-existing SQL-based ETL scripts, and—because Amazon Redshift is fully ACID-compliant—as an efficient mechanism to merge change data from source data systems.

In this post, I show how to use AWS Step Functions and AWS Glue Python Shell to orchestrate tasks for those Amazon Redshift-based ETL workflows in a completely serverless fashion. AWS Glue Python Shell is a Python runtime environment for running small to medium-sized ETL tasks, such as submitting SQL queries and waiting for a response. Step Functions lets you coordinate multiple AWS services into workflows so you can easily run and monitor a series of ETL tasks. Both AWS Glue Python Shell and Step Functions are serverless, allowing you to automatically run and scale them in response to events you define, rather than requiring you to provision, scale, and manage servers.

While many traditional SQL-based workflows use internal database constructs like triggers and stored procedures, separating workflow orchestration, task, and compute engine components into standalone services allows you to develop, optimize, and even reuse each component independently. So, while this post uses Amazon Redshift as an example, my aim is to more generally show you how to orchestrate any SQL-based ETL.

Prerequisites

If you want to follow along with the examples in this post using your own AWS account, you need a Virtual Private Cloud (VPC) with at least two private subnets that have routes to an S3 VPC endpoint.

If you don’t have a VPC, or are unsure if yours meets these requirements, I provide an AWS CloudFormation template stack you can launch by selecting the following button. Provide a stack name on the first page and leave the default settings for everything else. Wait for the stack to display Create Complete (this should only take a few minutes) before moving on to the other sections.

Scenario

For the examples in this post, I use the Amazon Customer Reviews Dataset to build an ETL workflow that completes the following two tasks which represent a simple ETL process.

  • Task 1: Move a copy of the dataset containing reviews from the year 2015 and later from S3 to an Amazon Redshift table.
  • Task 2: Generate a set of output files to another Amazon S3 location which identifies the “most helpful” reviews by market and product category, allowing an analytics team to glean information about high quality reviews.

This dataset is publicly available via an Amazon Simple Storage Service (Amazon S3) bucket. Complete the following tasks to get set up.

Solution overview

The following diagram highlights the solution architecture from end to end:

The steps in this process are as follows:

  1. The state machine launches a series of runs of an AWS Glue Python Shell job (more on how and why I use a single job later in this post!) with parameters for retrieving database connection information from AWS Secrets Manager and an .sql file from S3.
  2. Each run of the AWS Glue Python Shell job uses the database connection information to connect to the Amazon Redshift cluster and submit the queries contained in the .sql file.
    1. For Task 1: The cluster utilizes Amazon Redshift Spectrum to read data from S3 and load it into an Amazon Redshift table. Amazon Redshift Spectrum is commonly used as an means for loading data to Amazon Redshift. (See Step 7 of Twelve Best Practices for Amazon Redshift Spectrum for more information.)
    2. For Task 2: The cluster executes an aggregation query and exports the results to another Amazon S3 location via UNLOAD.
  3. The state machine may send a notification to an Amazon Simple Notification Service (SNS) topic in the case of pipeline failure.
  4. Users can query the data from the cluster and/or retrieve report output files directly from S3.

I include an AWS CloudFormation template to jumpstart the ETL environment so that I can focus this post on the steps dedicated to building the task and orchestration components. The template launches the following resources:

  • Amazon Redshift Cluster
  • Secrets Manager secret for storing Amazon Redshift cluster information and credentials
  • S3 Bucket preloaded with Python scripts and .sql files
  • Identity and Access Management (IAM) Role for AWS Glue Python Shell jobs

See the following resources for how to complete these steps manually:

Be sure to select at least two private subnets and the corresponding VPC, as shown in the following screenshot. If you are using the VPC template from above, the VPC appears as 10.71.0.0/16 and the subnet names are A private and B private.

The stack should take 10-15 minutes to launch. Once it displays Create Complete, you can move on to the next section. Be sure to take note of the Resources tab in the AWS CloudFormation console, shown in the following screenshot, as I refer to these resources throughout the post.

Building with AWS Glue Python Shell

Begin by navigating to AWS Glue in the AWS Management Console.

Making a connection

Amazon Redshift cluster resides in a VPC, so you first need to create a connection using AWS Glue. Connections contain properties, including VPC networking information, needed to access your data stores. You eventually attach this connection to your Glue Python Shell Job so that it can reach your Amazon Redshift cluster.

Select Connections from the menu bar, and then select Add connection. Give your connection a name like blog_rs_connection,  select Amazon Redshift as the Connection type, and then select Next, as shown in the following screenshot.

Under Cluster, enter the name of the cluster that the AWS CloudFormation template launched, i.e blogstack-redshiftcluster-####. Because the Python code I provide for this blog already handles credential retrieval, the rest of the values around database information you enter here are largely placeholders. The key information you are associating with the connection is networking-related.

Please note that you are not able to test the connection without the correct cluster information.  If you are interested in doing so, note that Database name and Username are auto-populated after selecting the correct cluster, as shown in the following screenshot. Follow the instructions here to retrieve the password information from Secrets Manager to copy into the Password field.

ETL code review

Take a look at the two main Python scripts used in this example:

Pygresql_redshift_common.py is a set of functions that can retrieve cluster connection information and credentials from Secrets Manger, make a connection to the cluster, and submit queries respectively. By retrieving cluster information at runtime via a passed parameter, these functions allow the job to connect to any cluster to which it has access. You can package these functions into a library by following the instructions to create a python .egg file (already completed as a part of the AWS CloudFormation template launch). Note that AWS Glue Python Shell supports several python libraries natively.

import pg
import boto3
import base64
from botocore.exceptions import ClientError
import json

#uses session manager name to return connection and credential information
def connection_info(db):

	session = boto3.session.Session()
	client = session.client(
		service_name='secretsmanager'
	)

	get_secret_value_response = client.get_secret_value(SecretId=db)

	if 'SecretString' in get_secret_value_response:
		secret = json.loads(get_secret_value_response['SecretString'])
	else:
		secret = json.loads(base64.b64decode(get_secret_value_response['SecretBinary']))
		
	return secret


#creates a connection to the cluster
def get_connection(db,db_creds):

	con_params = connection_info(db_creds)
	
	rs_conn_string = "host=%s port=%s dbname=%s user=%s password=%s" % (con_params['host'], con_params['port'], db, con_params['username'], con_params['password'])
	rs_conn = pg.connect(dbname=rs_conn_string)
	rs_conn.query("set statement_timeout = 1200000")
	
	return rs_conn


#submits a query to the cluster
def query(con,statement):
    res = con.query(statement)
    return res

The AWS Glue Python Shell job runs rs_query.py when called. It starts by parsing job arguments that are passed at invocation. It uses some of those arguments to retrieve a .sql file from S3, then connects and submits the statements within the file to the cluster using the functions from pygresql_redshift_common.py. So, in addition to connecting to any cluster using the Python library you just packaged, it can also retrieve and run any SQL statement. This means you can manage a single AWS Glue Python Shell job for all of your Amazon Redshift-based ETL by simply passing in parameters on where it should connect and what it should submit to complete each task in your pipeline.

from redshift_module import pygresql_redshift_common as rs_common
import sys
from awsglue.utils import getResolvedOptions
import boto3

#get job args
args = getResolvedOptions(sys.argv,['db','db_creds','bucket','file'])
db = args['db']
db_creds = args['db_creds']
bucket = args['bucket']
file = args['file']

#get sql statements
s3 = boto3.client('s3') 
sqls = s3.get_object(Bucket=bucket, Key=file)['Body'].read().decode('utf-8')
sqls = sqls.split(';')

#get database connection
print('connecting...')
con = rs_common.get_connection(db,db_creds)

#run each sql statement
print("connected...running query...")
results = []
for sql in sqls[:-1]:
    sql = sql + ';'
    result = rs_common.query(con, sql)
    print(result)
    results.append(result)

print(results)

Creating the Glue Python Shell Job

Next, put that code into action:

  1. Navigate to Jobs on the left menu of the AWS Glue console page and from there, select Add job.
  2. Give the job a name like blog_rs_query.
  3. For the IAM role, select the same GlueExecutionRole you previously noted from the Resources section of the AWS CloudFormation console.
  4. For Type, select Python shell, leave Python version as the default of Python 3, and for This job runs select An existing script that you provide.
  5. For S3 path where the script is stored, navigate to the script bucket created by the AWS CloudFormation template (look for ScriptBucket in the Resources), then select the python/py file.
  6. Expand the Security configuration, script libraries, and job parameters section to add the Python .egg file with the Amazon Redshift connection library to the Python library path. It is also located in the script bucket under python /redshift_module-0.1-py3.6.egg.

When all is said and done everything should look as it does in the following screenshot:

Choose Next. Add the connection you created by choosing Select to move it under Required connections. (Recall from the Making a connection section that this gives the job the ability to interact with your VPC.) Choose Save job and edit script to finish, as shown in the following screenshot.

Test driving the Python Shell job

After creating the job, you are taken to the AWS Glue Python Shell IDE. If everything went well, you should see the rs_query.py code. Right now, the Amazon Redshift cluster is sitting there empty, so use the Python code to run the following SQL statements and populate it with tables.

  1. Create an external database (amzreviews).
  2. Create an external table (reviews) from which Amazon Redshift Spectrum can read from the source data in S3 (the public reviews dataset). The table is partitioned by product_category because the source files are organized by category, but in general you should partition on frequently filtered columns (see #4).
  3. Add partitions to the external table.
  4. Create an internal table (reviews) local to the Amazon Redshift cluster. product_id works well as a DISTKEY because it has high cardinality, even distribution, and most likely (although not explicitly part of this blog’s scenario) a column that will be used to join with other tables. I choose review_date as a SORTKEY to efficiently filter out review data that is not part of my target query (after 2015). Learn more about how to best choose DISTKEY/SORTKEY as well as additional table design parameters for optimizing performance by reading the Designing Tables documentation.
    CREATE EXTERNAL SCHEMA amzreviews 
    from data catalog
    database 'amzreviews'
    iam_role 'rolearn'
    CREATE EXTERNAL database IF NOT EXISTS;
    
    
    
    CREATE EXTERNAL TABLE amzreviews.reviews(
      marketplace varchar(10), 
      customer_id varchar(15), 
      review_id varchar(15), 
      product_id varchar(25), 
      product_parent varchar(15), 
      product_title varchar(50), 
      star_rating int, 
      helpful_votes int, 
      total_votes int, 
      vine varchar(5), 
      verified_purchase varchar(5), 
      review_headline varchar(25), 
      review_body varchar(1024), 
      review_date date, 
      year int)
    PARTITIONED BY ( 
      product_category varchar(25))
    ROW FORMAT SERDE 
      'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION
      's3://amazon-reviews-pds/parquet/';
      
      
      
    ALTER TABLE amzreviews.reviews ADD
    partition(product_category='Apparel') 
    location 's3://amazon-reviews-pds/parquet/product_category=Apparel/'
    partition(product_category='Automotive') 
    location 's3://amazon-reviews-pds/parquet/product_category=Automotive'
    partition(product_category='Baby') 
    location 's3://amazon-reviews-pds/parquet/product_category=Baby'
    partition(product_category='Beauty') 
    location 's3://amazon-reviews-pds/parquet/product_category=Beauty'
    partition(product_category='Books') 
    location 's3://amazon-reviews-pds/parquet/product_category=Books'
    partition(product_category='Camera') 
    location 's3://amazon-reviews-pds/parquet/product_category=Camera'
    partition(product_category='Grocery') 
    location 's3://amazon-reviews-pds/parquet/product_category=Grocery'
    partition(product_category='Furniture') 
    location 's3://amazon-reviews-pds/parquet/product_category=Furniture'
    partition(product_category='Watches') 
    location 's3://amazon-reviews-pds/parquet/product_category=Watches'
    partition(product_category='Lawn_and_Garden') 
    location 's3://amazon-reviews-pds/parquet/product_category=Lawn_and_Garden';
    
    
    CREATE TABLE reviews(
      marketplace varchar(10),
      customer_id varchar(15), 
      review_id varchar(15), 
      product_id varchar(25) DISTKEY, 
      product_parent varchar(15), 
      product_title varchar(50), 
      star_rating int, 
      helpful_votes int, 
      total_votes int, 
      vine varchar(5), 
      verified_purchase varchar(5), 
      review_date date, 
      year int,
      product_category varchar(25))
      
      SORTKEY (
         review_date
        );

Do this first job run manually so you can see where all of the elements I’ve discussed come into play. Select Run Job at the top of the IDE screen. Expand the Security configuration, script libraries, and job parameters section. This is where you add in the parameters as key-value pairs, as shown in the following screenshot.

KeyValue
–dbreviews
–db_credsreviewssecret
–bucket<name of s3 script bucket>
–filesql/reviewsschema.sql

Select Run job to start it. The job should take a few seconds to complete. You can look for log outputs below the code in the IDE to watch job progress.

Once the job completes, navigate to Databases in the AWS Glue console and look for the amzreviews database and reviews table, as shown in the following screenshot. If they are there, then everything worked as planned! You can also connect to your Amazon Redshift cluster using the Redshift Query Editor or with your own SQL client tool and look for the local reviews table.

Step Functions Orchestration

Now that you’ve had a chance to run a job manually, it’s time to move onto something more programmatic that is orchestrated by Step Functions.

Launch Template

I provide a third AWS CloudFormation template for kickstarting this process as well. It creates a Step Functions state machine that calls two instances of the AWS Glue Python Shell job you just created to complete the two tasks I outlined at the beginning of this post.

For BucketName, paste the name of the script bucket created in the second AWS CloudFormation stack. For GlueJobName, type in the name of the job you just created. Leave the other information as default, as shown in the following screenshot. Launch the stack and wait for it to display Create Complete—this should take only a couple of minutes—before moving on to the next section.

Working with the Step Functions State Machine

State Machines are made up of a series of steps, allowing you to stitch together services into robust ETL workflows. You can monitor each step of execution as it happens, which means you can identify and fix problems in your ETL workflow quickly, and even automatically.

Take a look at the state machine you just launched to get a better idea. Navigate to Step Functions in the AWS Console and look for a state machine with a name like GlueJobStateMachine-######. Choose Edit to view the state machine configuration, as shown in the following screenshot.

It should look as it does in the following screenshot:

As you can see, state machines are created using JSON templates made up of task definitions and workflow logic. You can run parallel tasks, catch errors, and even pause workflows and wait for manual callback to continue. The example I provide contains two tasks for running the SQL statements that complete the goals I outlined at the beginning of the post:

  1. Load data from S3 using Redshift Spectrum
  2. Transform and writing data back to S3

Each task contains basic error handling which, if caught, routes the workflow to an error notification task. This example is a simple one to show you how to build a basic workflow, but you can refer to the Step Functions documentation for an example of more complex workflows to help build a robust ETL pipeline. Step Functions also supports reusing modular components with Nested Workflows.

SQL Review

The state machine will retrieve and run the following SQL statements:

INSERT INTO reviews
SELECT marketplace, customer_id, review_id, product_id, product_parent, product_title, star_rating, helpful_votes, total_votes, vine, verified_purchase, review_date, year, product_category
FROM amzreviews.reviews
WHERE year > 2015;

As I mentioned previously, Amazon Redshift Spectrum is a great way to run ETL using an INSERT INTO statement. This example is a simple load of the data as it is in S3, but keep in mind you can add more complex SQL statements to transform your data prior to loading.

UNLOAD ('SELECT marketplace, product_category, product_title, review_id, helpful_votes, AVG(star_rating) as average_stars FROM reviews GROUP BY marketplace, product_category, product_title, review_id, helpful_votes ORDER BY helpful_votes DESC, average_stars DESC')
TO 's3://bucket/testunload/'
iam_role 'rolearn';

This statement groups reviews by product, ordered by number of helpful votes, and writes to Amazon S3 using UNLOAD.

State Machine execution

Now that everything is in order, start an execution. From the state machine main page select Start an Execution.

Leave the defaults as they are and select Start to begin execution. Once execution begins you are taken to a visual workflow interface where you can follow the execution progress, as shown in the following screenshot.

Each of the queries takes a few minutes to run. In the meantime, you can watch the Amazon Redshift query logs to track the query progress in real time. These can be found by navigating to Amazon Redshift in the AWS Console, selecting your Amazon Redshift cluster, and then selecting the Queries tab, as shown in the following screenshot.

Once you see COMPLETED for both queries, navigate back to the state machine execution. You should see success for each of the states, as shown in the following screenshot.

Next, navigate to the data bucket in the S3 AWS Console page (refer to the DataBucket in the CloudFormation Resources tab). If all went as planned, you should see a folder named testunload in the bucket with the unloaded data, as shown in the following screenshot.

Inject Failure into Step Functions State Machine

Next, test the error handling component of the state machine by intentionally causing an error. An easy way to do this is to edit the state machine and misspell the name of the Secrets Manager secret in the ReadFilterJob task, as shown in the following screenshot.

If you want the error output sent to you, optionally subscribe to the error notification SNS Topic. Start another state machine execution as you did previously. This time the workflow should take the path toward the NotifyFailure task, as shown in the following screenshot. If you subscribed to the SNS Topic associated with it, you should receive a message shortly thereafter.

The state machine logs will show the error in more detail, as shown in the following screenshot.

Conclusion

In this post I demonstrated how you can orchestrate Amazon Redshift-based ETL using serverless AWS Step Functions and AWS Glue Python Shells jobs. As I mentioned in the introduction, the concepts can also be more generally applied to other SQL-based ETL, so use them to start building your own SQL-based ETL pipelines today!

 


About the Author

Ben Romano is a Data Lab solution architect at AWS. Ben helps our customers architect and build data and analytics prototypes in just four days in the AWS Data Lab.

 

 

 

 

New – Step Functions Support for Dynamic Parallelism

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-step-functions-support-for-dynamic-parallelism/

Microservices make applications easier to scale and faster to develop, but coordinating the components of a distributed application can be a daunting task. AWS Step Functions is a fully managed service that makes coordinating tasks easier by letting you design and run workflows that are made of steps, each step receiving as input the output of the previous step. For example, Novartis Institutes for Biomedical Research is using Step Functions to empower scientists to run image analysis without depending on cluster experts.

Step Functions added some very interesting capabilities recently, such as callback patterns, to simplify the integration of human activities and third-party services, and nested workflows, to assemble together modular, reusable workflows. Today, we are adding support for dynamic parallelism within a workflow!

How Dynamic Parallelism Works
States machines are defined using the Amazon States Language, a JSON-based, structured language. The Parallel state can be used to execute in parallel a fixed number of branches defined in the state machine. Now, Step Functions supports a new Map state type for dynamic parallelism.

To configure a Map state, you define an Iterator, which is a complete sub-workflow. When a Step Functions execution enters a Map state, it will iterate over a JSON array in the state input. For each item, the Map state will execute one sub-workflow, potentially in parallel. When all sub-workflow executions complete, the Map state will return an array containing the output for each item processed by the Iterator.

You can configure an upper bound on how many concurrent sub-workflows Map executes by adding the MaxConcurrency field. The default value is 0, which places no limit on parallelism and iterations are invoked as concurrently as possible. A MaxConcurrency value of 1 has the effect to invoke the Iterator one element at a time, in the order of their appearance in the input state, and will not start an iteration until the previous iteration has completed execution.

One way to use the new Map state is to leverage fan-out or scatter-gather messaging patterns in your workflows:

  • Fan-out is a applied when delivering a message to multiple destinations, and can be useful in workflows such as order processing or batch data processing. For example, you can retrieve arrays of messages from Amazon SQS and Map will send each message to a separate AWS Lambda function.
  • Scatter-gather broadcasts a single message to multiple destinations (scatter) and then aggregates the responses back for the next steps (gather). This can be useful in file processing and test automation. For example, you can transcode ten 500 MB media files in parallel and then join to create a 5 GB file.

Like Parallel and Task states, Map supports Retry and Catch fields to handle service and custom exceptions. You can also apply Retry and Catch to states inside your Iterator to handle exceptions. If any Iterator execution fails, because of an unhandled error or by transitioning to a Fail state, the entire Map state is considered to have failed and all its iterations are stopped. If the error is not handled by the Map state itself, Step Functions stops the workflow execution with an error.

Using the Map State
Let’s build a workflow to process an order and, by using the Map state, to work on the items in the order in parallel. The tasks executed as part of this workflow are all Lambda functions, but with Step Functions you can use other AWS service integrations and have code running on EC2 instances, containers, or on-premises infrastructure.

Here’s our sample order, expressed as a JSON document, for a few books, plus some coffee to drink while reading them. The order has a detail section, where there is a list of items that are part of the order.

{
  "orderId": "12345678",
  "orderDate": "20190820101213",
  "detail": {
    "customerId": "1234",
    "deliveryAddress": "123, Seattle, WA",
    "deliverySpeed": "1-day",
    "paymentMethod": "aCreditCard",
    "items": [
      {
        "productName": "Agile Software Development",
        "category": "book",
        "price": 60.0,
        "quantity": 1
      },
      {
        "productName": "Domain-Driven Design",
        "category": "book",
        "price": 32.0,
        "quantity": 1
      },
      {
        "productName": "The Mythical Man Month",
        "category": "book",
        "price": 18.0,
        "quantity": 1
      },
      {
        "productName": "The Art of Computer Programming",
        "category": "book",
        "price": 180.0,
        "quantity": 1
      },
      {
        "productName": "Ground Coffee, Dark Roast",
        "category": "grocery",
        "price": 8.0,
        "quantity": 6
      }
    ]
  }
}

To process this order, I am using a state machine defining how the different tasks should be executed. The Step Functions console creates a visual representation of the workflow I am building:

  • First, I validate and check the payment.
  • Then, I process the items in the order, potentially in parallel, to check their availability, prepare for delivery, and start the delivery process.
  • At the end, a summary of the order is sent to the customer.
  • In case the payment check fails, I intercept that, for example to send a notification to the customer.

 

Here is the same state machine definition expressed as a JSON document. The ProcessAllItems state is using Map to process items in the order in parallel. In this case, I limit concurrency to 3 using the MaxConcurrency field. Inside the Iterator, I can put a sub-workflow of arbitrary complexity. In this case, I have three steps, to CheckAvailability, PrepareForDelivery, and StartDelivery of the item. Each of this step can Retry and Catch errors to make the sub-workflow execution more reliable, for example in case of integrations with external services.

{
  "StartAt": "ValidatePayment",
  "States": {
    "ValidatePayment": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-west-2:123456789012:function:validatePayment",
      "Next": "CheckPayment"
    },
    "CheckPayment": {
      "Type": "Choice",
      "Choices": [
        {
          "Not": {
            "Variable": "$.payment",
            "StringEquals": "Ok"
          },
          "Next": "PaymentFailed"
        }
      ],
      "Default": "ProcessAllItems"
    },
    "PaymentFailed": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-west-2:123456789012:function:paymentFailed",
      "End": true
    },
    "ProcessAllItems": {
      "Type": "Map",
      "InputPath": "$.detail",
      "ItemsPath": "$.items",
      "MaxConcurrency": 3,
      "Iterator": {
        "StartAt": "CheckAvailability",
        "States": {
          "CheckAvailability": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-west-2:123456789012:function:checkAvailability",
            "Retry": [
              {
                "ErrorEquals": [
                  "TimeOut"
                ],
                "IntervalSeconds": 1,
                "BackoffRate": 2,
                "MaxAttempts": 3
              }
            ],
            "Next": "PrepareForDelivery"
          },
          "PrepareForDelivery": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-west-2:123456789012:function:prepareForDelivery",
            "Next": "StartDelivery"
          },
          "StartDelivery": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-west-2:123456789012:function:startDelivery",
            "End": true
          }
        }
      },
      "ResultPath": "$.detail.processedItems",
      "Next": "SendOrderSummary"
    },
    "SendOrderSummary": {
      "Type": "Task",
      "InputPath": "$.detail.processedItems",
      "Resource": "arn:aws:lambda:us-west-2:123456789012:function:sendOrderSummary",
      "ResultPath": "$.detail.summary",
      "End": true
    }
  }
}

The Lambda functions used by this workflow are not aware of the overall structure of the order JSON document. They just need to know the part of the input state they are going to process. This is a best practice to make those functions easily reusable in multiple workflows. The state machine definition is manipulating the path used for the input and the output of the functions using JsonPath syntax via the InputPathItemsPathResultPath, and OutputPath fields:

  • InputPath is used to filter the data in the input state, for example to only pass the detail of the order to the Iterator.
  • ItemsPath is specific to the Map state and is used to identify where, in the input, the array field to process is found, for example to process the items inside the detail of the order.
  • ResultPath makes it possible to add the output of a task to the input state, and not overwrite it completely, for example to add a summary to the detail of the order.
  • I am not using OutputPath this time, but it could be useful to filter out unwanted information and pass only the portion of JSON that you care about to the next state. For example, to send as output only the detail of the order.

Optionally, the Parameters field may be used to customize the raw input used for each iteration. For example, the deliveryAddress is in the detail of the order, but not in each item. To have the Iterator have an index of the items, and access the deliveryAddress, I can add this to a Map state:

"Parameters": {
  "index.$": "$$.Map.Item.Index",
  "item.$": "$$.Map.Item.Value",
  "deliveryAddress.$": "$.deliveryAddress"
}

Available Now
This new feature is available today in all regions where Step Functions is offered. Dynamic parallelism was probably the most requested features for Step Functions. It unblocks the implementation of new use cases and can help optimize existing ones. Let us know what are you going to use it for!

ICYMI: Serverless Q2 2019

Post Syndicated from Eric Johnson original https://aws.amazon.com/blogs/compute/icymi-serverless-q2-2019/

This post is courtesy of Moheeb Zara, Senior Developer Advocate – AWS Serverless

Welcome to the sixth edition of the AWS Serverless ICYMI (in case you missed it) quarterly recap. Every quarter, we share all of the most recent product launches, feature enhancements, blog posts, webinars, Twitch live streams, and other interesting things that you might have missed!

In case you missed our last ICYMI, checkout what happened last quarter here.

April - June 2019

Amazon EventBridge

Before we dive in to all that happened in Q2, we’re excited about this quarter’s launch of Amazon EventBridge, the serverless event bus that connects application data from your own apps, SaaS, and AWS-as-a-service. This allows you to create powerful event-driven serverless applications using a variety of event sources.

Our very own AWS Solutions Architect, Mike Deck, sat down with AWS Serverless Hero Jeremy Daly and recorded a podcast on Amazon EventBridge. It’s a worthy listen if you’re interested in exploring all the features offered by this launch.

Now, back to Q2, here’s what’s new.

AWS Lambda

Lambda Monitoring

Amazon CloudWatch Logs Insights now allows you to see statistics from recent invocations of your Lambda functions in the Lambda monitoring tab.

Additionally, as of June, you can monitor the [email protected] functions associated with your Amazon CloudFront distributions directly from your Amazon CloudFront console. This includes a revamped monitoring dashboard for CloudFront distributions and [email protected] functions.

AWS Step Functions

Step Functions

AWS Step Functions now supports workflow execution events, which help in the building and monitoring of even-driven serverless workflows. Automatic Execution event notifications can be delivered upon start/completion of CloudWatch Events/Amazon EventBridge. This allows services such as AWS Lambda, Amazon SNS, Amazon Kinesis, or AWS Step Functions to respond to these events.

Additionally you can use callback patterns to automate workflows for applications with human activities and custom integrations with third-party services. You create callback patterns in minutes with less code to write and maintain, run without servers and infrastructure to manage, and scale reliably.

Amazon API Gateway

API Gateway Tag Based Control

Amazon API Gateway now offers tag-based access control for WebSocket APIs using AWS Identity and Access Management (IAM) policies, allowing you to categorize API Gateway resources for WebSocket APIs by purpose, owner, or other criteria.  With the addition of tag-based access control to WebSocket resources, you can now give permissions to WebSocket resources at various levels by creating policies based on tags. For example, you can grant full access to admins to while limiting access to developers.

You can now enforce a minimum Transport Layer Security (TLS) version and cipher suites through a security policy for connecting to your Amazon API Gateway custom domain.

In addition, Amazon API Gateway now allows you to define VPC Endpoint policies, enabling you to specify which Private APIs a VPC Endpoint can connect to. This enables granular security control using VPC Endpoint policies.

AWS Amplify

Amplify CLI (part of the open source Amplify Framework) now includes support for adding and configuring AWS Lambda triggers for events when using Amazon Cognito, Amazon Simple Storage Service, and Amazon DynamoDB as event sources. This means you can setup custom authentication flows for mobile and web applications via the Amplify CLI and Amazon Cognito User Pool as an authentication provider.

Amplify Console

Amplify Console,  a Git-based workflow for continuous deployment and hosting for fullstack serverless web apps, launched several updates to the build service including SAM CLI and custom container support.

Amazon Kinesis

Amazon Kinesis Data Firehose can now utilize AWS PrivateLink to securely ingest data. AWS PrivateLink provides private connectivity between VPCs, AWS services, and on-premises applications, securely over the Amazon network. When AWS PrivateLink is used with Amazon Kinesis Data Firehose, all traffic to a Kinesis Data Firehose from a VPC flows over a private connection.

You can now assign AWS resource tags to applications in Amazon Kinesis Data Analytics. These key/value tags can be used to organize and identify resources, create cost allocation reports, and control access to resources within Amazon Kinesis Data Analytics.

Amazon Kinesis Data Firehose is now available in the AWS GovCloud (US-East), Europe (Stockholm), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), and EU (London) regions.

For a complete list of where Amazon Kinesis Data Analytics is available, please see the AWS Region Table.

AWS Cloud9

Cloud9 Quick Starts

Amazon Web Services (AWS) Cloud9 integrated development environment (IDE) now has a Quick Start which deploys in the AWS cloud in about 30 minutes. This enables organizations to provide developers a powerful cloud-based IDE that can edit, run, and debug code in the browser and allow easy sharing and collaboration.

AWS Cloud9 is also now available in the EU (Frankfurt) and Asia Pacific (Tokyo) regions. For a current list of supported regions, see AWS Regions and Endpoints in the AWS documentation.

Amazon DynamoDB

You can now tag Amazon DynamoDB tables when you create them. Tags are labels you can attach to AWS resources to make them easier to manage, search, and filter.  Tagging support has also been extended to the AWS GovCloud (US) Regions.

DynamoDBMapper now supports Amazon DynamoDB transactional API calls. This support is included within the AWS SDK for Java. These transactional APIs provide developers atomic, consistent, isolated, and durable (ACID) operations to help ensure data correctness.

Amazon DynamoDB now applies adaptive capacity in real time in response to changing application traffic patterns, which helps you maintain uninterrupted performance indefinitely, even for imbalanced workloads.

AWS Training and Certification has launched Amazon DynamoDB: Building NoSQL Database–Driven Applications, a new self-paced, digital course available exclusively on edX.

Amazon Aurora

Amazon Aurora Serverless MySQL 5.6 can now be accessed using the built-in Data API enabling you to access Aurora Serverless with web services-based applications, including AWS LambdaAWS AppSync, and AWS Cloud9. For more check out this post.

Sharing snapshots of Aurora Serverless DB clusters with other AWS accounts or publicly is now possible. We are also giving you the ability to copy Aurora Serverless DB cluster snapshots across AWS regions.

You can now set the minimum capacity of your Aurora Serverless DB clusters to 1 Aurora Capacity Unit (ACU). With Aurora Serverless, you specify the minimum and maximum ACUs for your Aurora Serverless DB cluster instead of provisioning and managing database instances. Each ACU is a combination of processing and memory capacity. By setting the minimum capacity to 1 ACU, you can keep your Aurora Serverless DB cluster running at a lower cost.

AWS Serverless Application Repository

The AWS Serverless Application Repository is now available in 17 regions with the addition of the AWS GovCloud (US-West) region.

Region support includes Asia Pacific (Mumbai, Singapore, Sydney, Tokyo), Canada (Central), EU (Frankfurt, Ireland, London, Paris, Stockholm), South America (São Paulo), US West (N. California, Oregon), and US East (N. Virginia, Ohio).

Amazon Cognito

Amazon Cognito has launched a new API – AdminSetUserPassword – for the Cognito User Pool service that provides a way for administrators to set temporary or permanent passwords for their end users. This functionality is available for end users even when their verified phone or email are unavailable.

Serverless Posts

April

May

June

Events

Events this quarter

Senior Developer Advocates for AWS Serverless spoke at several conferences this quarter. Here are some recordings worth watching!

Tech Talks

We hold several AWS Online Tech Talks covering serverless tech talks throughout the year, so look out for them in the Serverless section of the AWS Online Tech Talks page. Here are the ones from Q2.

Twitch

Twitch Series

In April, we started a 13-week deep dive into building APIs on AWS as part of our Twitch Build On series. The Building Happy Little APIs series covers the common and not-so-common use cases for APIs on AWS and the features available to customers as they look to build secure, scalable, efficient, and flexible APIs.

There are also a number of other helpful video series covering Serverless available on the AWS Twitch Channel.

Build with Serverless on Twitch

Serverless expert and AWS Specialist Solutions architect, Heitor Lessa, has been hosting a weekly Twitch series since April. Join him and others as they build an end-to-end airline booking solution using serverless. The final episode airs on August 7th at Wednesday 8:00am PT.

Here’s a recap of the last quarter:

AWS re:Invent

AWS re:Invent 2019

AWS re:Invent 2019 is around the corner! From December 2 – 6 in Las Vegas, Nevada, join tens of thousands of AWS customers to learn, share ideas, and see exciting keynote announcements. Be sure to take a look at the growing catalog of serverless sessions this year.

Register for AWS re:Invent now!

What did we do at AWS re:Invent 2018? Check out our recap here: AWS re:Invent 2018 Recap at the San Francisco Loft

AWS Serverless Heroes

We urge you to explore the efforts of our AWS Serverless Heroes Community. This is a worldwide network of AWS Serverless experts with a diverse background of experience. For example, check out this post from last month where Marcia Villalba demonstrates how to set up unit tests for serverless applications.

Still looking for more?

The Serverless landing page has lots of information. The Lambda resources page contains case studies, webinars, whitepapers, customer stories, reference architectures, and even more Getting Started tutorials.

Ten Things Serverless Architects Should Know

Post Syndicated from Justin Pirtle original https://aws.amazon.com/blogs/architecture/ten-things-serverless-architects-should-know/

Building on the first three parts of the AWS Lambda scaling and best practices series where you learned how to design serverless apps for massive scale, AWS Lambda’s different invocation models, and best practices for developing with AWS Lambda, we now invite you to take your serverless knowledge to the next level by reviewing the following 10 topics to deepen your serverless skills.

1: API and Microservices Design

With the move to microservices-based architectures, decomposing monothlic applications and de-coupling dependencies is more important than ever. Learn more about how to design and deploy your microservices with Amazon API Gateway:

Get hands-on experience building out a serverless API with API Gateway, AWS Lambda, and Amazon DynamoDB powering a serverless web application by completing the self-paced Wild Rydes web application workshop.

Figure 1: WildRydes serverless web application workshop

2: Event-driven Architectures and Asynchronous Messaging Patterns

When building event-driven architectures, whether you’re looking for simple queueing and message buffering or a more intricate event-based choreography pattern, it’s valuable to learn about the mechanisms to enable asynchronous messaging and integration. These are enabled primarily through the use of queues or streams as a message buffer and topics for pub/sub messaging. Understand when to use each and the unique advantages and features of all three:

Gets hands-on experience building a real-time data processing application using Amazon Kinesis Data Streams and AWS Lambda by completing the self-paced Wild Rydes data processing workshop.

3: Workflow Orchestration in a Distributed, Microservices Environment

In distributed microservices architectures, you must design coordinated transactions in different ways than traditional database-based ACID transactions, which are typically implemented using a monolithic relational database. Instead, you must implement coordinated sequenced invocations across services along with rollback and retry mechanisms. For workloads where there a significant orchestration logic is required and you want to use more of an orchestrator pattern than the event choreography pattern mentioned above, AWS Step Functions enables the building complex workflows and distributed transactions through integration with a variety of AWS services, including AWS Lambda. Learn about the options you have to build your business workflows and keep orchestration logic out of your AWS Lambda code:

Get hands-on experience building an image processing workflow using computer vision AI services with AWS Rekognition and AWS Step Functions to orchestrate all logic and steps with the self-paced Serverless image processing workflow workshop.

Figure 2: Several AWS Lambda functions managed by an AWS Step Functions state machine

4: Lambda Computing Environment and Programming Model

Though AWS Lambda is a service that is quick to get started, there is value in learning more about the AWS Lambda computing environment and how to take advantage of deeper performance and cost optimization strategies with the AWS Lambda runtime. Take your understanding and skills of AWS Lambda to the next level:

5: Serverless Deployment Automation and CI/CD Patterns

When dealing with a large number of microservices or smaller components—such as AWS Lambda functions all working together as part of a broader application—it’s critical to integrate automation and code management into your application early on to efficiently create, deploy, and version your serverless architectures. AWS offers several first-party deployment tools and frameworks for Serverless architectures, including the AWS Serverless Application Model (SAM), the AWS Cloud Development Kit (CDK), AWS Amplify, and AWS Chalice. Additionally, there are several third party deployment tools and frameworks available, such as the Serverless Framework, Claudia.js, Sparta, or Zappa. You can also build your own custom-built homegrown framework. The important thing is to ensure your automation strategy works for your use case and team, and supports your planned data source integrations and development workflow. Learn more about the available options:

Learn how to build a full CI/CD pipeline and other DevOps deployment automation with the following workshops:

6: Serverless Identity Management, Authentication, and Authorization

Modern application developers need to plan for and integrate identity management into their applications while implementing robust authentication and authorization functionality. With Amazon Cognito, you can deploy serverless identity management and secure sign-up and sign-in directly into your applications. Beyond authentication, Amazon API Gateway also allows developers to granularly manage authorization logic at the gateway layer and authorize requests directly, without exposing their using several types of native authorization.

Learn more about the options and benefits of each:

Get hands-on experience working with Amazon Cognito, AWS Identity and Access Management (IAM), and Amazon API Gateway with the Serverless Identity Management, Authentication, and Authorization Workshop.

Figure 3: Serverless Identity Management, Authentication, and Authorization Workshop

7: End-to-End Security Techniques

Beyond identity and authentication/authorization, there are many other areas to secure in a serverless application. These include:

  • Input and request validation
  • Dependency and vulnerability management
  • Secure secrets storage and retrieval
  • IAM execution roles and invocation policies
  • Data encryption at-rest/in-transit
  • Metering and throttling access
  • Regulatory compliance concerns

Thankfully, there are AWS offerings and integrations for each of these areas. Learn more about the options and benefits of each:

Get hands-on experience adding end-to-end security with the techniques mentioned above into a serverless application with the Serverless Security Workshop.

8: Application Observability with Comprehensive Logging, Metrics, and Tracing

Before taking your application to production, it’s critical that you ensure your application is fully observable, both at a microservice or component level, as well as overall through comprehensive logging, metrics at various granularity, and tracing to understand distributed system performance and end user experiences end-to-end. With many different components making up modern architectures, having centralized visibility into all of your key logs, metrics, and end-to-end traces will make it much easier to monitor and understand your end users’ experiences. Learn more about the options for observability of your AWS serverless application:

9. Ensuring Your Application is Well-Architected

Adding onto the considerations mentioned above, we suggest architecting your applications more holistically to the AWS Well-Architected framework. This framework includes the five key pillars: security, reliability, performance efficiency, cost optimization, and operational excellence. Additionally, there is a serverless-specific lens to the Well-Architected framework, which more specifically looks at key serverless scenarios/use cases such as RESTful microservices, Alexa skills, mobile backends, stream processing, and web applications, and how they can implement best practices to be Well-Architected. More information:

10. Continuing your Learning as Serverless Computing Continues to Evolve

As we’ve discussed, there are many opportunities to dive deeper into serverless architectures in a variety of areas. Though the resources shared above should be helpful in familiarizing yourself with key concepts and techniques, there’s nothing better than continued learning from others over time as new advancements come out and patterns evolve.

Finally, we encourage you to check back often as we’ll be continuing further blog post series on serverless architectures, with the next series focusing on API design patterns and best practices.

About the author

Justin PritleJustin Pirtle is a specialist Solutions Architect at Amazon Web Services, focused on the Serverless platform. He’s responsible for helping customers design, deploy, and scale serverless applications using services such as AWS Lambda, Amazon API Gateway, Amazon Cognito, and Amazon DynamoDB. He is a regular speaker at AWS conferences, including re:Invent, as well as other AWS events. Justin holds a bachelor’s degree in Management Information Systems from the University of Texas at Austin and a master’s degree in Software Engineering from Seattle University.

Orchestrating an ETL process using AWS Step Functions for Amazon Redshift

Post Syndicated from Thiyagarajan Arumugam original https://aws.amazon.com/blogs/big-data/orchestrating-an-etl-process-using-aws-step-functions-for-amazon-redshift/

Modern data lakes depend on extract, transform, and load (ETL) operations to convert bulk information into usable data. This post walks through implementing an ETL orchestration process that is loosely coupled using AWS Step Functions, AWS Lambda, and AWS Batch to target an Amazon Redshift cluster.

Because Amazon Redshift uses columnar storage, it is well suited for fast analytical insights using the convenient ANSI SQL queries. You can rapidly scale your Amazon Redshift clusters up and down in minutes to meet the demanding workloads for both your end-user reports and timely data refresh into the data warehouse.

AWS Step Functions makes it easy to develop and use repeatable workflows that scale well. Step Functions lets you build automation workflows from individual Lambda functions. Each function performs a discrete task and lets you develop, test, and modify the components of your workflow quickly and seamlessly.

An ETL process refreshes your data warehouse from source systems, organizing the raw data into a format you can more readily use. Most organizations run ETL as a batch or as part of a real-time ingest process to keep the data warehouse current and provide timely analytics. A fully automated and highly scalable ETL process helps minimize the operational effort that you must invest in managing the regular ETL pipelines. It also ensures the timely and accurate refresh of your data warehouse. You can tailor this process to refresh data into any data warehouse or the data lake.

This post also provides an AWS CloudFormation template that launches the entire sample ETL process in one click to refresh the TPC-DS dataset. Find the template link in the Set up the entire workflow using AWS CloudFormation section.

Architectural overview

The following diagram illustrates the architectural overview of the different components involved in the orchestration of the ETL workflow. This workflow uses Step Functions to fetch source data from Amazon S3 to refresh the Amazon Redshift data warehouse.

Here are the core components of the workflow:

  • Amazon CloudWatch triggers the ETL process based on a schedule, through the AWS CLI, or using the various AWS SDKs in a Lambda function.
  • The ETL workflow uses Step Functions for a multi-step ETL process and manages AWS services into serverless workflows. You can build and easily iterate these using JSON-based templates. For example, a typical ETL process may involve refreshing dimensions first and later refreshing the fact tables. You can declare your order of operations using a Step Functions state machine.
  • A Lambda function lets you build microservices to coordinate job submission and monitoring without needing to write code for workflow logic, parallel processes, error handling, timeouts, or retries.
  • AWS Batch runs several ETL jobs such as transforms and loads into Amazon Redshift. AWS Batch manages all the infrastructure for you, avoiding the complexities of provisioning, managing, monitoring, and scaling your batch computing jobs. It also lets you wait for the jobs to complete.
  • The source data in Amazon S3 refreshes an Amazon Redshift data warehouse through a PL/SQL container. To specify the ETL logic, I use.sql files that contain the SQL code for a particular step. For example, a .sql file for a typical dimension table refresh contains steps to load the data from Amazon S3 to a temporary staging table and INSERT/UPDATE the target table. Before beginning, review a sample dimensional table .sql file.

You can execute the workflow and monitor it using the state machine. You can trigger the ETL according to a schedule or an event (for example, as soon as all the data files arrive in S3).

Prerequisites

Before you get started, create a Docker image that can execute .sql files. AWS Batch creates resources for executing the ETL steps using this Docker image. To create the Docker image, you need:

If this is your first time using AWS Batch, see Getting Started with AWS Batch. Create an environment to build and register the Docker image. For this post, register this image in an Amazon ECR repository. This is a private repository by default, making it useful for AWS Batch jobs.

Building the fetch and running psql Docker image

To build the Docker image, follow the steps outlined in the post Creating a Simple “Fetch & Run” AWS Batch Job.

Use the following Docker configuration and fetch and run psql scripts to build the images.

  1. DockerFetchRunPsqlUbundu
  2. fetch_and_run_psql.sh

Follow the steps in the post to import the Docker image into the ECR container registry. After you complete the previous steps, your Docker image is ready to trigger a .sql execution for an Amazon Redshift cluster.

Example: ETL process using TPC-DS dataset

This example uses a subset of the TPC-DS dataset to demonstrate a typical dimensional model refresh. Here is the Entity Relationship diagram of the TPC-DS data model that I use for this ETL application:

The ETL process refreshes table data for the Store_Sales fact table along with the Customer_Address and Item dimensions for a particular dataset date.

Setting up the ETL workflow using Step Functions

Step Functions make complicated workflows more straightforward. You can set up dependency management and failure handling using a JSON-based template. Workflows are just a series of steps, with the output of one step acting as input into the next.

This example completes various dimensional table transforms and loads before triggering the Fact table load. Also, a workflow can branch out into multiple parallel steps whenever needed. You can monitor each step of execution as it happens, which means you can identify and fix problems quickly.

This illustration outlines the example ETL process set up through Step Functions:

For more information, see the detailed workflow diagram.

In the above workflow, the ETL process checks the DB connection in Step 1 and triggers the Customer_Address (Step 2.1) and Item_dimension (Step 2.2) steps, which execute in parallel. The Store_Sales (Step 3) FACT table waits for the process to complete the dimensional tables. Each ETL step is autonomous, allowing you to monitor and respond to failures at any stage.

I now examine the Store_Sales step (Step 3) in detail. Other steps follow a similar pattern of implementation.

Here is the state implementation for Store_Sales step (Step 3):

{
   "Comment":"A simple ETL example that submits a Job to AWS Batch",
   "StartAt":"DBConnectionInit",
	...
      "Parallel":{
         "Type":"Parallel",
         "Next":"SalesFACTInit",
         "ResultPath":"$.status",
         "Branches":[
	...
      },
      "SalesFACTInit":{
         "Type":"Pass",
         "Next":"SubmitStoreSalesFACTJob",
         "Result":"SalesFACT",
         "ResultPath":"$.stepId"
      },
      "SubmitStoreSalesFACTJob":{
         "Type":"Task",
         "Resource":"arn:aws:lambda:us-west-2:1234567890:function:StepFunctionsSample-JobStatusPol-SubmitJobFunction-5M2HCJIG81R1",
         "Next":"GetStoreSalesFACTJobStatus"
      },
      "GetStoreSalesFACTJobStatus":{
         "Type":"Task",
         "Resource":"arn:aws:lambda:us-west-2:1234567890:function:StepFunctionsSample-JobStatusPoll-CheckJobFunction-1SKER18I6FU24",
         "Next":"CheckStoreSalesFACTJobStatus",
         "InputPath":"$",
         "ResultPath":"$.status"
      },
      "CheckStoreSalesFACTJobStatus":{
         "Type":"Choice",
         "Choices":[
            {
               "Variable":"$.status",
               "StringEquals":"FAILED",
               "Next":"FailState"
            },
            {
               "Variable":"$.status",
               "StringEquals":"SUCCEEDED",
               "Next":"GetFinalStoreSalesFACTJobStatus"
            }
         ],
         "Default":"StoreSalesFACTWait30Seconds"
      },
	...
   }
}

The Parallel process that loads all the dimension tables sets up a dependency on later Store Sales Fact transformation/load SalesFACTInit through the Next attribute. The SalesFACTInit step triggers the transformation using the SubmitStoreSalesFACTJob to AWS Batch triggered through AWS Lambda job JobStatusPol-SubmitJobFunction. GetStoreSalesFACTJobStatus polls through the AWS Lambda JobStatusPoll-CheckJobFunction every 30 seconds to check for completion. CheckStoreSalesFACTJobStatus validates the status and decides to succeed or fail the process depending on the returned status.

Here is snippet of input for executing the state machine job for Step 3:

{  
"DBConnection":{  
..
   "SalesFACT":{  
      "jobName":"my-job",
      "jobDefinition":"arn:aws:batch:us-west-2:1234567890:job-definition/JobDefinition-cd6aa175c07fb2a:1",
      "jobQueue":"arn:aws:batch:us-west-2:1234567890:job-queue/JobQueue-217beecdb0caa3f",
      "wait_time":60,
      "containerOverrides":{  
         "environment":[  
            {  
               "name":"BATCH_FILE_TYPE",
               "value":"script_psql"
            },
            {  
               "name":"BATCH_FILE_S3_URL",
               "value":"s3://salamander-us-east-1/reinvent2018/ant353/etlscript/psql_rs.sh"
            },
            {  
               "name":"BATCH_FILE_SQL_S3_URL",
               "value":"s3://salamander-us-east-1/reinvent2018/ant353/etlscript/store_sales.sql"
            },
            {  
               "name":"DATASET_DATE",
               "value":"2003-01-02"
            }
         ]     
}}}

The input defines which .sql file each step invokes, along with the refresh date. You can represent any complex ETL workflow as a JSON workflow, making it easy to manage. This also decouples the inputs to invoke for each step.

Executing the ETL workflow

AWS Batch executes each .sql script (store_sales.sql) that the state machine invokes by using an incremental data refresh for the sales data on a particular date.

Here is the load and transformation implementation for the store_sales.sql:

\set s3datareadrolevar 'aws_iam_role=' :s3datareadrole
-- This transform ETL will refresh data for the store_sales table
-- Start a new transaction

begin transaction;

-- Create a stg_store_sales staging table and COPY data from input S3 location it with updated rows from SALES_UPDATE

DROP TABLE if exists public.stg_store_sales;
CREATE TABLE public.stg_store_sales
(
	sold_date DATE   ENCODE lzo
	,sold_time INTEGER   ENCODE lzo
	,i_item_id CHAR(16)   ENCODE lzo
	,c_customer_id CHAR(16)   ENCODE lzo
	,cd_demo_sk INTEGER   ENCODE lzo
	,hd_income_band_sk INTEGER   ENCODE lzo
	,hd_buy_potential CHAR(15)   ENCODE lzo
	,hd_dep_count INTEGER   ENCODE lzo
	,hd_vehicle_count INTEGER   ENCODE lzo
	,ca_address_id CHAR(16)   ENCODE lzo
	,s_store_id CHAR(16)   ENCODE lzo
	,p_promo_id CHAR(16)   ENCODE lzo
	,ss_ticket_number INTEGER   ENCODE lzo
	,ss_quantity INTEGER   ENCODE lzo
	,ss_wholesale_cost NUMERIC(7,2)   ENCODE lzo
	,ss_list_price NUMERIC(7,2)   ENCODE lzo
	,ss_sales_price NUMERIC(7,2)   ENCODE lzo
	,ss_ext_discount_amt NUMERIC(7,2)   ENCODE lzo
	,ss_ext_sales_price NUMERIC(7,2)   ENCODE lzo
	,ss_ext_wholesale_cost NUMERIC(7,2)   ENCODE lzo
	,ss_ext_list_price NUMERIC(7,2)   ENCODE lzo
	,ss_ext_tax NUMERIC(7,2)   ENCODE lzo
	,ss_coupon_amt NUMERIC(7,2)   ENCODE lzo
	,ss_net_paid NUMERIC(7,2)   ENCODE lzo
	,ss_net_paid_inc_tax NUMERIC(7,2)   ENCODE lzo
	,ss_net_profit NUMERIC(7,2)   ENCODE lzo
)
BACKUP NO
DISTSTYLE EVEN
;


\set s3loc 's3://salamander-us-east-1/reinvent2018/ant353/store_sales/saledate=' :dt '/'
-- COPY input data to the staging table
copy public.stg_store_sales
from
:'s3loc'
CREDENTIALS :'s3datareadrolevar'
DELIMITER '~' gzip region 'us-east-1';

-- Delete any rows from target store_sales for the input date for idempotency
delete from store_sales where ss_sold_date_sk in (select d_date_sk from date_dim where d_date=:'dt');
--Insert data from staging table to the target TABLE
INSERT INTO store_sales
(
  ss_sold_date_sk,
  ss_sold_time_sk,
  ss_item_sk,
  ss_customer_sk,
  ss_cdemo_sk,
  ss_hdemo_sk,
  ss_addr_sk,
  ss_store_sk,
  ss_promo_sk,
  ss_ticket_number,
  ss_quantity,
  ss_wholesale_cost,
  ss_list_price,
  ss_sales_price,
  ss_ext_discount_amt,
  ss_ext_sales_price,
  ss_ext_wholesale_cost,
  ss_ext_list_price,
  ss_ext_tax,
  ss_coupon_amt,
  ss_net_paid,
  ss_net_paid_inc_tax,
  ss_net_profit
)
SELECT date_dim.d_date_sk ss_sold_date_sk,
       time_dim.t_time_sk ss_sold_time_sk,
       item.i_item_sk ss_item_sk,
       customer.c_customer_sk ss_customer_sk,
       customer_demographics.cd_demo_sk ss_cdemo_sk,
       household_demographics.hd_demo_sk ss_hdemo_sk,
       customer_address.ca_address_sk ss_addr_sk,
       store.s_store_sk ss_store_sk,
       promotion.p_promo_sk ss_promo_sk,
       ss_ticket_number,
       ss_quantity,
       ss_wholesale_cost,
       ss_list_price,
       ss_sales_price,
       ss_ext_discount_amt,
       ss_ext_sales_price,
       ss_ext_wholesale_cost,
       ss_ext_list_price,
       ss_ext_tax,
       ss_coupon_amt,
       ss_net_paid,
       ss_net_paid_inc_tax,
       ss_net_profit
FROM stg_store_sales AS store_sales
  JOIN date_dim ON store_sales.sold_date = date_dim.d_date
  LEFT JOIN time_dim ON store_sales.sold_time = time_dim.t_time
  LEFT JOIN item
         ON store_sales.i_item_id = item.i_item_id
        AND i_rec_end_date IS NULL
  LEFT JOIN customer ON store_sales.c_customer_id = customer.c_customer_id
  LEFT JOIN customer_demographics ON store_sales.cd_demo_sk = customer_demographics.cd_demo_sk
  LEFT JOIN household_demographics
         ON store_sales.hd_income_band_sk = household_demographics.hd_income_band_sk
        AND store_sales.hd_buy_potential = household_demographics.hd_buy_potential
        AND store_sales.hd_dep_count = household_demographics.hd_dep_count
        AND store_sales.hd_vehicle_count = household_demographics.hd_vehicle_count
  LEFT JOIN customer_address ON store_sales.ca_address_id = customer_address.ca_address_id
  LEFT JOIN store
         ON store_sales.s_store_id = store.s_store_id
        AND s_rec_end_date IS NULL
  LEFT JOIN promotion ON store_sales.p_promo_id = promotion.p_promo_id;

  --drop staging table

  DROP TABLE if exists public.stg_store_sales;

  -- End transaction and commit
  end transaction;

This ETL implementation runs through the following steps:

  1. A COPY command fast loads the data from S3 in bulk into the staging table stg_store_sales.
  2. Begin…end transactions encapsulate multiple steps in the transformation and load process. This leads to fewer commit operations in the end, making the process less expensive.
  3. ETL implementation is idempotent. If it fails, you can retry the job without any cleanup. For example, it recreates the stg_store_sales is each time, then deletes target table store_sales with the data for the particular refresh date each time.

For best practices used in the preceding implementation, see the Top 8 Best Practices for High-Performance ETL Processing Using Amazon Redshift post.

Furthermore, Customer_Address demonstrates a Type 1 implementation and Item follows a Type 2 implementation in a typical dimensional model.

Set up the entire workflow using AWS CloudFormation

The AWS CloudFormation template includes all the steps of this solution. This template creates all the required AWS resources and invokes initial data setup and the refresh of this data for a particular day. Here is a list of all the resources it creates inside the CloudFormation stack:

  • A VPC and associated subnets, security groups, and routes
  • IAM roles
  • An Amazon Redshift cluster
  • An AWS Batch job definition and compute environment
  • A Lambda function to submit and poll AWS Batch jobs
  • A Step Functions state machine to orchestrate the ETL workflow and refresh the data in the Amazon Redshift cluster

Here is the architecture of this setup that shows the Amazon Redshift setup in the VPC and the ETL process orchestrated using Step Functions:

Step 1: Create the stack with AWS CloudFormation

To deploy this application in your AWS account, start by launching this CloudFormation stack:

  • This stack uses the password Password#123. Change it as soon as possible. Use a minimum of eight characters, at least one uppercase letter, one lowercase letter, one number, and one special character.
  1. Use the default values for all other parameters.

The stack takes about ten minutes to launch. Wait for it to complete when the status changes to CREATE_COMPLETE.

Make a note of the value of ExecutionInput in the Output section of the stack. The JSON looks like the following code example:

“
{ "DBConnection":{ "jobName":"
…
alue":"s3://salamander-us-east-1/reinvent2018/ant353/etlscript/store_sales.sql" }, { "name":"DATASET_DATE", "value":"2003-01-02" } ] } } }
”

Note the Physical ID of JobDefinition and JobQueue in the Resources section of the stack.

Step 2: Set up TPC-DS 1-GB initial data in Amazon Redshift

The following steps load an initial 1 GB of TPCDS data into the Amazon Redshift cluster:

  • In the AWS Batch console, choose Job, select the job queue noted earlier, and choose Submit Job.
  • Set a new job name, for example, TPCDSdataload, and select the JobDefinition value that you noted earlier. Choose Submit Job. Wait for the job to completely load the initial 1 GB of TPCDS data into the Amazon Redshift cluster.
  • In the AWS Batch dashboard and monitor for the completion of TPCDS data load. This takes about ten minutes to complete.

Step 3: Execute the ETL workflow in the setup function

The ETL process is a multi-step workflow to refresh the TPCDS dimensional model with data from 2010-10-10.

  1. In the Step Functions console, choose JobStatusPollerStateMachine-*.
  2. Choose Start execution and provide an optional execution name, for example, ETLWorkflowDataRefreshfor2003-01-02. In the execution input, enter the ExecutionInput value that you noted earlier. This kickstarts the ETL process. The state machine uses the Lambda poller to submit and monitor each step of the ETL job. Each input invokes the ETL workflow. You can monitor the process of the ETL by refreshing your browser.

Step 4: Verify the ETL data refresh in the Amazon Redshift cluster

In the Amazon Redshift console, choose Query Editor. Enter the following credentials:

  • Database: dev.
  • Database Use: awsuser.
  • Password: This requires the password that you created in Step 1 (default Password#123).

After you are logged in to the public schema, execute the following query to check the data load for 2010-10-10:

SELECT c_last_name, 
               c_first_name, 
               ca_city, 
               bought_city, 
               ss_ticket_number, 
               amt, 
               profit 
FROM   (SELECT ss_ticket_number, 
               ss_customer_sk, 
               ca_city            bought_city, 
               Sum(ss_coupon_amt) amt, 
               Sum(ss_net_profit) profit 
        FROM   store_sales, 
               date_dim, 
               store, 
               household_demographics, 
               customer_address 
        WHERE  store_sales.ss_sold_date_sk = date_dim.d_date_sk 
               AND store_sales.ss_store_sk = store.s_store_sk 
               AND store_sales.ss_hdemo_sk = household_demographics.hd_demo_sk 
               AND store_sales.ss_addr_sk = customer_address.ca_address_sk 
               AND ( household_demographics.hd_dep_count = 6 
                      OR household_demographics.hd_vehicle_count = 0 ) 
               AND d_date =  '2003-01-02'
        GROUP  BY ss_ticket_number, 
                  ss_customer_sk, 
                  ss_addr_sk, 
                  ca_city) dn, 
       customer, 
       customer_address current_addr 
WHERE  ss_customer_sk = c_customer_sk 
       AND customer.c_current_addr_sk = current_addr.ca_address_sk 
       AND current_addr.ca_city <> bought_city 
ORDER  BY c_last_name, 
          c_first_name, 
          ca_city, 
          bought_city, 
          ss_ticket_number
LIMIT 100;

The query should display the TPC-DS dataset for 2010-10-10 that the ETL process loaded.

Step 5: Cleaning up

When you finish testing this solution, remember to clean up all the AWS resources that you created using AWS CloudFormation. Use the AWS CloudFormation console or AWS CLI to delete the stack that you specified previously.

Conclusion

In this post, I described how to implement an ETL workflow using decoupled services in AWS and set up a highly scalable orchestration that lets you refresh data into an Amazon Redshift cluster.

You can easily expand on what you learned here. Here are some options that you let you extend this solution to accommodate other analytical services or make it robust enough to be production ready:

  • This example invokes the state machine manually using Step Functions. You can instead trigger the state machine automatically using a CloudWatch event or S3 event, such as whenever new files arrive in the source bucket. You also drive the ETL invocation using a schedule. For useful information for automating your ETL workflow, see Schedule a Serverless Workflow.
  • You can add an alert mechanism in case of failures. To do this, create a Lambda function that sends you an email based on the status of each step in the Step Functions workflow.
  • Each step of the state machine is autonomous and can invoke any service with a Lambda function. You can integrate any analytical service into your workflow. For example, you can create a separate Lambda function to invoke AWS Glue and clean some of your data before transforming the data using Amazon Redshift. In this case, you add the AWS Glue job as a dependency in the step before the dimension load.

With this Step Functions-based workflow, you can decouple the different steps of ETL orchestration using any analytical service. Because of this, the solution is adaptable and interchangeable to a wide variety of applications.

If you have questions or suggestions, please leave a comment below.

 


About the Author

Thiyagarajan Arumugam is a Big Data Solutions Architect at Amazon Web Services and designs customer architectures to process data at scale. Prior to AWS, he built data warehouse solutions at Amazon.com. In his free time, he enjoys all outdoor sports and practices the Indian classical drum mridangam.

 

 

 

Configuring user creation workflows with AWS Step Functions and AWS Managed Microsoft AD logs

Post Syndicated from Rachel Richardson original https://aws.amazon.com/blogs/compute/configuring-user-creation-workflows-with-aws-step-functions-and-aws-managed-microsoft-ad-logs/

This post is contributed by Taka Matsumoto, Cloud Support Engineer

AWS Directory Service lets you run Microsoft Active Directory as a managed service. Directory Service for Microsoft Active Directory, also referred to as AWS Managed Microsoft AD, is powered by Microsoft Windows Server 2012 R2. It manages users and makes it easy to integrate with compatible AWS services and other applications. Using the log forwarding feature, you can stay aware of all security events in Amazon CloudWatch Logs. This helps monitor events like the addition of a new user.

When new users are created in your AWS Managed Microsoft AD, you might go through the initial setup workflow manually. However, AWS Step Functions can coordinate new user creation activities into serverless workflows that automate the process. With Step Functions, AWS Lambda can be also used to run code for the automation workflows without provisioning or managing servers.

In this post, I show how to create and trigger a new user creation workflow in Step Functions. This workflow creates a WorkSpace in Amazon WorkSpaces and a user in Amazon Connect using AWS Managed Microsoft AD, Step Functions, Lambda, and Amazon CloudWatch Logs.

Overview

The following diagram shows the solution graphically.

Configuring user creation workflows with AWS Step Functions and AWS Managed Microsoft AD logs

Walkthrough

Using the following procedures, create an automated user creation workflow with AWS Managed Microsoft AD. The solution requires the creation of new resources in CloudWatch, Lambda, and Step Functions, and a new user in Amazon WorkSpaces and Amazon Connect. Here’s the list of steps:

  1. Enable log forwarding.
  2. Create the Lambda functions.
  3. Set up log streaming.
  4. Create a state machine in Step Functions.
  5. Test the solution.

Requirements

To follow along, you need the following resources:

  • AWS Managed Microsoft AD
    • Must be registered with Amazon WorkSpaces
    • Must be registered with Amazon Connect

In this example, you use an Amazon Connect instance with SAML 2.0-based authentication as identity management. For more information, see Configure SAML for Identity Management in Amazon Connect.

Enable log forwarding

Enable log forwarding for your AWS Managed Microsoft AD.  Use /aws/directoryservice/<directory id> for the CloudWatch log group name. You will use this log group name when creating a Log Streaming in Step 3.

Create Lambda functions

Create two Lambda functions. The first starts a Step Functions execution with CloudWatch Logs. The second performs a user registration process with Amazon WorkSpaces and Amazon Connect within a Step Functions execution.

Create the first function with the following settings:

  • Name: DS-Log-Stream-Function
  • Runtime: Python 3.7
  • Memory: 128 MB
  • Timeout: 3 seconds
  • Environment variables:
    • Key: stateMachineArn
    • Value: arn:aws:states:<Region>:<AccountId>:stateMachine:NewUserWorkFlow
  • IAM role with the following permissions:
    • AWSLambdaBasicExecutionRole
    • The following permissions policy
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "states:StartExecution",
            "Resource": "*"
        }
    ]
}
import base64
import boto3
import gzip
import json
import re
import os
def lambda_handler(event, context):
    logEvents = DecodeCWPayload(event)
    print('Event payload:', logEvents)
    returnResultDict = []
    
    # Because there can be more than one message pushed in a single payload, use a for loop to start a workflow for every user
    for logevent in logEvents:
        logMessage = logevent['message']
        upnMessage =  re.search("(<Data Name='UserPrincipalName'>)(.*?)(<\/Data>)",logMessage)
        if upnMessage != None:
            upn = upnMessage.group(2).lower()
            userNameAndDomain = upn.split('@')
            userName = userNameAndDomain[0].lower()
            userNameAndDomain = upn.split('@')
            domainName = userNameAndDomain[1].lower()
            sfnInputDict = {'Username': userName, 'UPN': upn, 'DomainName': domainName}
            sfnResponse = StartSFNExecution(json.dumps(sfnInputDict))
            print('Username:',upn)
            print('Execution ARN:', sfnResponse['executionArn'])
            print('Execution start time:', sfnResponse['startDate'])
            returnResultDict.append({'Username': upn, 'ExectionArn': sfnResponse['executionArn'], 'Time': str(sfnResponse['startDate'])})

    returnObject = {'Result':returnResultDict}
    return {
        'statusCode': 200,
        'body': json.dumps(returnObject)
    }

# Helper function decode the payload
def DecodeCWPayload(payload):
    # CloudWatch Log Stream event 
    cloudWatchLog = payload['awslogs']['data']
    # Base 64 decode the log 
    base64DecodedValue = base64.b64decode(cloudWatchLog)
    # Uncompress the gzipped decoded value
    gunzipValue = gzip.decompress(base64DecodedValue)
    dictPayload = json.loads(gunzipValue)
    decodedLogEvents = dictPayload['logEvents']
    return decodedLogEvents

# Step Functions state machine execution function
def StartSFNExecution(sfnInput):
    sfnClient = boto3.client('stepfunctions')
    try:
        response = sfnClient.start_execution(
            stateMachineArn=os.environ['stateMachineArn'],
            input=sfnInput
        )
        return response
    except Exception as e:
        return e

For the other function used to perform a user creation task, use the following settings:

  • Name: SFN-New-User-Flow
  • Runtime: Python 3.7
  • Memory: 128 MB
  • Timeout: 3 seconds
  • Environment variables:
    • Key: nameDelimiter
    • Value: . [period]

This delimiter is used to split the username into a first name and last name, as Amazon Connect instances with SAML-based authentication require both a first name and last name for users. For more information, see CreateUser API and UserIdentity Info.

  • Key: bundleId
  • Value: <WorkSpaces bundle ID>

Run the following AWS CLI command to return Amazon-owned WorkSpaces bundles. Use one of the bundle IDs for the key-value pair.

aws workspaces describe-workspace-bundles –owner AMAZON

  • Key: directoryId
  • Value: <WorkSpaces directory ID>

Run the following AWS CLI command to return Amazon WorkSpaces directories. Use your directory ID for the key-value pair.

aws workspaces describe-workspace-directories

  • Key: instanceId
  • Value: <Amazon Connect instance ID>

Find the Amazon Connect instance ID the Amazon Connect instance ID.

  • Key: routingProfile
  • Value: <Amazon Connect routing profile>

Run the following AWS CLI command to list routing profiles with their IDs. For this walkthrough, use the ID for the basic routing profile.

aws connect list-routing-profiles –instance-id <instance id>

  • Key: securityProfile
  • Value: <Amazon Connect security profile>

Run the following AWS CLI command to list security profiles with their IDs. For this walkthrough, use the ID for an agent security profile.

aws connect list-security-profiles –instance-id  <instance id>

  • IAM role permissions:
    • AWSLambdaBasicExecutionRole

The following permissions policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "connect:CreateUser",
                "workspaces:CreateWorkspaces"
            ],
            "Resource": "*"
        }
    ]
}
import json
import os
import boto3

def lambda_handler(event, context):
    userName = event['input']['User']
    nameDelimiter = os.environ['nameDelimiter']
    if nameDelimiter in userName:
        firstName = userName.split(nameDelimiter)[0]
        lastName = userName.split(nameDelimiter)[1]
    else:
        firstName = userName
        lastName = userName
    domainName = event['input']['Domain']
    upn = event['input']['UPN']
    serviceName = event['input']['Service']
    if serviceName == 'WorkSpaces':
        # Setting WorkSpaces variables
        workspacesDirectoryId = os.environ['directoryId']
        workspacesUsername = upn
        workspacesBundleId = os.environ['bundleId']
        createNewWorkSpace = create_new_workspace(
            directoryId=workspacesDirectoryId,
            username=workspacesUsername,
            bundleId=workspacesBundleId
        )
        return createNewWorkSpace
    elif serviceName == 'Connect':
        createConnectUser = create_connect_user(
            connectUsername=upn,
            connectFirstName=firstName,
            connectLastName=lastName,
            securityProfile=os.environ['securityProfile'], 
            routingProfile=os.environ['routingProfile'], 
            instanceId=os.environ['instanceId']
        )
        return createConnectUser
    else:
        print(serviceName, 'is not recognized...')
        print('Available service names are WorkSpaces and Connect')
        unknownServiceException = {
            'statusCode': 500,
            'body': json.dumps(f'Service name, {serviceName}, is not recognized')}
        raise Exception(unknownServiceException)

class FailedWorkSpaceCreationException(Exception):
    pass

class WorkSpaceResourceExists(Exception):
    pass

def create_new_workspace(directoryId, username, bundleId):
    workspacesClient = boto3.client('workspaces')
    response = workspacesClient.create_workspaces(
        Workspaces=[{
                'DirectoryId': directoryId,
                'UserName': username,
                'BundleId': bundleId,
                'WorkspaceProperties': {
                    'RunningMode': 'AUTO_STOP',
                    'RunningModeAutoStopTimeoutInMinutes': 60,
                    'RootVolumeSizeGib': 80,
                    'UserVolumeSizeGib': 100,
                    'ComputeTypeName': 'VALUE'
                    }}]
                    )
    print('create_workspaces response:',response)
    for pendingRequest in response['PendingRequests']:
        if pendingRequest['UserName'] == username:
            workspacesResultObject = {'UserName':username, 'ServiceName':'WorkSpaces', 'Status': 'Success'}
            return {
                'statusCode': 200,
                'body': json.dumps(workspacesResultObject)
                }
    for failedRequest in response['FailedRequests']:
        if failedRequest['WorkspaceRequest']['UserName'] == username:
            errorCode = failedRequest['ErrorCode']
            errorMessage = failedRequest['ErrorMessage']
            errorResponse = {'Error Code:', errorCode, 'Error Message:', errorMessage}
            if errorCode == "ResourceExists.WorkSpace": 
                raise WorkSpaceResourceExists(str(errorResponse))
            else:
                raise FailedWorkSpaceCreationException(str(errorResponse))
                
def create_connect_user(connectUsername, connectFirstName,connectLastName,securityProfile,routingProfile,instanceId):
    connectClient = boto3.client('connect')
    response = connectClient.create_user(
                    Username=connectUsername,
                    IdentityInfo={
                        'FirstName': connectFirstName,
                        'LastName': connectLastName
                        },
                    PhoneConfig={
                        'PhoneType': 'SOFT_PHONE',
                        'AutoAccept': False,
                        },
                    SecurityProfileIds=[
                        securityProfile,
                        ],
                    RoutingProfileId=routingProfile,
                    InstanceId = instanceId
                    )
    connectSuccessResultObject = {'UserName':connectUsername,'ServiceName':'Connect','FirstName': connectFirstName, 'LastName': connectLastName,'Status': 'Success'}
    return {
        'statusCode': 200,
        'body': json.dumps(connectSuccessResultObject)
        }

Set up log streaming

Create a new CloudWatch Logs subscription filter that sends log data to the Lambda function DS-Log-Stream-Function created in Step 2.

  1. In the CloudWatch console, choose Logs, Log Groups, and select the log group, /aws/directoryservice/<directory id>, for the directory set up in Step 1.
  2. Choose Actions, Stream to AWS Lambda.
  3. Choose Destination, and select the Lambda function DS-Log-Stream-Function.
  4. For Log format, choose Other as the log format and enter “<EventID>4720</EventID>” (include the double quotes).
  5. Choose Start streaming.

If there is an existing subscription filter for the log group, run the following AWS CLI command to create a subscription filter for the Lambda function, DS-Log-Stream-Function.

aws logs put-subscription-filter \

--log-group-name /aws/directoryservice/<directoryid> \

--filter-name NewUser \

--filter-pattern "<EventID>4720</EventID>" \

--destination-arn arn:aws:lambda:<Region>:<ACCOUNT_NUMBER>:function:DS-Log-Stream-Function

For more information, see Using CloudWatch Logs Subscription Filters.

Create a state machine in Step Functions

The next step is to create a state machine in Step Functions. This state machine runs the Lambda function, SFN-New-User-Flow, to create a user in Amazon WorkSpaces and Amazon Connect.

Define the state machine, using the following settings:

  • Name: NewUserWorkFlow
  • State machine definition: Copy the following state machine definition:
{
    "Comment": "An example state machine for a new user creation workflow",
    "StartAt": "Parallel",
    "States": {
        "Parallel": {
            "Type": "Parallel",
            "End": true,
            "Branches": [
                {
                    "StartAt": "CreateWorkSpace",
                    "States": {
                        "CreateWorkSpace": {
                            "Type": "Task",
                            "Parameters": {
                                "input": {
                                    "User.$": "$.Username",
                                    "UPN.$": "$.UPN",
                                    "Domain.$": "$.DomainName",
                                    "Service": "WorkSpaces"
                                }
                            },
                            "Resource": "arn:aws:lambda:{region}:{account id}:function:SFN-New-User-Flow",
                            "Retry": [
                                {
                                    "ErrorEquals": [
                                        "WorkSpaceResourceExists"
                                    ],
                                    "IntervalSeconds": 1,
                                    "MaxAttempts": 0,
                                    "BackoffRate": 1
                                },
                                {
                                    "ErrorEquals": [
                                        "States.ALL"
                                    ],
                                    "IntervalSeconds": 10,
                                    "MaxAttempts": 2,
                                    "BackoffRate": 2
                                }
                            ],
                            "Catch": [
                                {
                                    "ErrorEquals": [
                                        "WorkSpaceResourceExists"
                                    ],
                                    "ResultPath": "$.workspacesResult",
                                    "Next": "WorkSpacesPassState"
                                },
                                {
                                    "ErrorEquals": [
                                        "States.ALL"
                                    ],
                                    "ResultPath": "$.workspacesResult",
                                    "Next": "WorkSpacesPassState"
                                }
                            ],
                            "End": true
                        },
                        "WorkSpacesPassState": {
                            "Type": "Pass",
                            "Parameters": {
                                "Result.$": "$.workspacesResult"
                            },
                            "End": true
                        }
                    }
                },
                {
                    "StartAt": "CreateConnectUser",
                    "States": {
                        "CreateConnectUser": {
                            "Type": "Task",
                            "Parameters": {
                                "input": {
                                    "User.$": "$.Username",
                                    "UPN.$": "$.UPN",
                                    "Domain.$": "$.DomainName",
                                    "Service": "Connect"
                                }
                            },
                            "Resource": "arn:aws:lambda:{region}:{account id}:function:SFN-New-User-Flow",
                            "Retry": [
                                {
                                    "ErrorEquals": [
                                        "DuplicateResourceException"
                                    ],
                                    "IntervalSeconds": 1,
                                    "MaxAttempts": 0,
                                    "BackoffRate": 1
                                },
                                {
                                    "ErrorEquals": [
                                        "States.ALL"
                                    ],
                                    "IntervalSeconds": 10,
                                    "MaxAttempts": 2,
                                    "BackoffRate": 2
                                }
                            ],
                            "Catch": [
                                {
                                    "ErrorEquals": [
                                        "DuplicateResourceException"
                                    ],
                                    "ResultPath": "$.connectResult",
                                    "Next": "ConnectPassState"
                                },
                                {
                                    "ErrorEquals": [
                                        "States.ALL"
                                    ],
                                    "ResultPath": "$.connectResult",
                                    "Next": "ConnectPassState"
                                }
                            ],
                            "End": true,
                            "ResultPath": "$.connectResult"
                        },
                        "ConnectPassState": {
                            "Type": "Pass",
                            "Parameters": {
                                "Result.$": "$.connectResult"
                            },
                            "End": true
                        }
                    }
                }
            ]
        }
    }
}

After entering the name and state machine definition, choose Next.

Configure the settings by choosing Create an IAM role for me. This creates an IAM role for the state machine to run the Lambda function SFN-New-User-Flow.

Here’s the list of states in the NewUserWorkFlow state machine definition:

  • Start—When the state machine starts, it creates a parallel state to start both the CreateWorkSpace and CreateConnectUser states.
  • CreateWorkSpace—This task state runs the SFN-New-User-Flow Lambda function to create a new WorkSpace for the user. If this is successful, it goes to the End state.
  • WorkSpacesPassState—This pass state returns the result from the CreateWorkSpace state.
  • CreateConnectUse — This task state runs the SFN-New-User-Flow Lambda function to create a user in Amazon Connect. If this is successful, it goes to the End state.
  • ConnectPassState—This pass state returns the result from the CreateWorkSpace state.
  • End

The following diagram shows how these states relate to each other.

Step Functions State Machine

Test the solution

It’s time to test the solution. Create a user in AWS Managed Microsoft AD. The new user should have the following attributes:

This starts a new state machine execution in Step Functions. Here’s the flow:

  1. When there is a user creation event (Event ID: 4720) in the AWS Managed Microsoft AD security log, CloudWatch invokes the Lambda function, DS-Log-Stream-Function, to start a new state machine execution in Step Functions.
  2. To create a new WorkSpace and create a user in the Amazon Connect instance, the state machine execution runs tasks to invoke the other Lambda function, SFN-New-User-Flow.

Conclusion

This solution automates the initial user registration workflow. Step Functions provides the flexibility to customize the workflow to meet your needs. This walkthrough included Amazon WorkSpaces and Amazon Connect; both services are used to register the new user. For organizations that create a number of new users on a regular basis, this new user automation workflow can save time when configuring resources for a new user.

The event source of the automation workflow can be any event that triggers the new user workflow, so the event source isn’t limited to CloudWatch Logs. Also, the integrated service used for new user registration can be any AWS service that offers API and works with AWS Managed Microsoft AD. Other programmatically accessible services within or outside AWS can also fill that role.

In this post, I showed you how serverless workflows can streamline and coordinate user creation activities. Step Functions provides this functionality, with the help of Lambda, Amazon WorkSpaces, AWS Managed Microsoft AD, and Amazon Connect. Together, these services offer increased power and functionality when managing users, monitoring security, and integrating with compatible AWS services.

FICO: Fraud Detection and Anti-Money Laundering with AWS Lambda and AWS Step Functions

Post Syndicated from Annik Stahl original https://aws.amazon.com/blogs/architecture/fico-fraud-detection-and-anti-money-laundering-with-aws-lambda-and-aws-step-functions/

In this episode of This is My Architecture, filmed in 2018 on the last day of re:Invent (a learning conference hosted by Amazon Web Services for the global cloud computing community), FICO lead Software Engineer Sven Ahlfeld talks to AWS Solutions Architect Tom Jones about how the company uses a combination of AWS Lambda and AWS Step Functions to architect an on-demand solution for fraud detection and anti-money laundering.

When you think of FICO, you probably thing credit score. And that’s true: founded in 1956, FICO introduced analytic solutions–such as credit scoring–that have made credit more widely available in the US and around the world. As well, the FICO score is the standard measure of consumer risk in the US.

In the video, Sven explains that FICO is making software to meet regulatory compliance goals and requirements, in this case to tackle money laundering. FICO ingests a massive amounts of customer data in the form of financial documents into S3, and then uses S3 to trigger and analyze each document for a number of different fraud and money laundering characteristics.

Key architecture components are designed to be immutable, assuring that the EC2 instances doing the analysis work themselves can’t be compromised and tampered with. But as well, an unchangeable instance can scale very fast and allow for the ability to ingest a large amount of documents. It can also scale back when there is less demand. The immutable images also support regulatory requirements for the various needs and regulations of localities around the world.

 

*Check out more This Is My Architecture videos on YouTube.

About the author

Annik StahlAnnik Stahl is a Senior Program Manager in AWS, specializing in blog and magazine content as well as customer ratings and satisfaction. Having been the face of Microsoft Office for 10 years as the Crabby Office Lady columnist, she loves getting to know her customers and wants to hear from you.

How to export an Amazon DynamoDB table to Amazon S3 using AWS Step Functions and AWS Glue

Post Syndicated from Joe Feeney original https://aws.amazon.com/blogs/big-data/how-to-export-an-amazon-dynamodb-table-to-amazon-s3-using-aws-step-functions-and-aws-glue/

In typical AWS fashion, not a week had gone by after I published How Goodreads offloads Amazon DynamoDB tables to Amazon S3 and queries them using Amazon Athena on the AWS Big Data blog when the AWS Glue team released the ability for AWS Glue crawlers and AWS Glue ETL jobs to read from DynamoDB tables natively. I was actually pretty excited about this. Less code means fewer bugs. The original architecture had been around for at least 18 months and could be simplified significantly with a little bit of work.

Refactoring the data pipeline

The AWS Data Pipeline architecture outlined in my previous blog post is just under two years old now. We had used data pipelines as a way to back up Amazon DynamoDB data to Amazon S3 in case of a catastrophic developer error. However, with DynamoDB point-in-time recovery we have a better, native mechanism for disaster recovery. Additionally, with data pipelines we still own the operations associated with the clusters themselves, even if they are transient. A common challenge is keeping our clusters up with recent releases of Amazon EMR to help mitigate any outstanding bugs. Another is the inefficiency of needing to spin up an EMR cluster for each DynamoDB table.

I decided to take a step back and list the capabilities I wanted to have in the next iteration:

  • Export tables using AWS Glue instead of EMR.
    • AWS Glue provides a serverless ETL environment where I don’t have to worry about the underlying infrastructure. This minimizes operational tasks like keeping up with the EMR release tags.
  • Use a workflow solution that works across services like AWS Glue and Amazon Athena.
    • In the first iteration, the workflow was spread across various services. Unless you had the entire pipeline in your head, it was difficult to get a bird’s-eye view of how the pipeline was progressing.
  • Ability to select different formats.
    • For data engineering, I prefer Apache Parquet. However, customers might prefer a different format.
  • Add exported data to Athena.
    • I find that the easier it is for the data to be queried, the more likely it’s used.

Architecture overview

At a high level, this is the architecture:

  • We’re using AWS Step Functions as the workflow engine.
    • Each step is either a built-in Step Functions state, a service integration, or a simple Python AWS Lambda For example, GlueStartJobRun is using the synchronous job run service integration, as discussed in the documentation.
    • We get a visual representation of the entire pipeline.
    • It’s quick to onboard new developers.
  • An event in Amazon CloudWatch Events, which is disabled to start, triggers a Step Functions state machine with a JSON payload that contains the following:
    • AWS Glue job name
    • Export destination
    • DynamoDB table name
    • Desired read percentage
    • AWS Glue crawler name
  • AWS Glue exports a DynamoDB table in your preferred format to S3 as snapshots_your_table_name. The data is partitioned by the snapshot_timestamp
  • An AWS Glue crawler adds or updates your data’s schema and partitions in the AWS Glue Data Catalog.
  • Finally, we create an Athena view that only has data from the latest export snapshot.

A simple AWS Glue ETL job

The script that I created accepts AWS Glue ETL job arguments for the table name, read throughput, output, and format. Behind the scenes, AWS Glue scans the DynamoDB table. AWS Glue makes sure that every top-level attribute makes it into the schema, no matter how sparse your attributes are (as discussed in the DynamoDB documentation).

Here’s the script:

import sys
import datetime
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext

ARG_TABLE_NAME = "table_name"
ARG_READ_PERCENT = "read_percentage"
ARG_OUTPUT = "output_prefix"
ARG_FORMAT = "output_format"

PARTITION = "snapshot_timestamp"

args = getResolvedOptions(sys.argv,
  [
    'JOB_NAME',
    ARG_TABLE_NAME,
    ARG_READ_PERCENT,
    ARG_OUTPUT,
    ARG_FORMAT
  ]
)

table_name = args[ARG_TABLE_NAME]
read = args[ARG_READ_PERCENT]
output_prefix = args[ARG_OUTPUT]
fmt = args[ARG_FORMAT]

print("Table name:", table_name)
print("Read percentage:", read)
print("Output prefix:", output_prefix)
print("Format:", fmt)

date_str = datetime.datetime.utcnow().strftime('%Y-%m-%dT%H:%M')
output = "%s/%s=%s" % (output_prefix, PARTITION, date_str)

sc = SparkContext()
glueContext = GlueContext(sc)

table = glueContext.create_dynamic_frame.from_options(
  "dynamodb",
  connection_options={
    "dynamodb.input.tableName": table_name,
    "dynamodb.throughput.read.percent": read
  }
)

glueContext.write_dynamic_frame.from_options(
  frame=table,
  connection_type="s3",
  connection_options={
    "path": output
  },
  format=fmt,
  transformation_ctx="datasink"
)

There’s not a lot here. We’re creating a DynamicFrameReader of connection type dynamodb and passing in the table name and desired maximum read throughput consumption. We pass that data frame to a DynamicFrameWriter that writes the table to S3 in the specified format.

Athena views

Most teams at Amazon own applications that have multiple DynamoDB tables, including my own team. Our current application uses five primary tables. Ideally, at the end of an export workflow you can write simple, obvious queries across a consistent view of your tables. However, each exported table is partitioned by the timestamp from when the table was exported. This makes querying across one or more tables very cumbersome, because you have to add a WHERE snapshot_timestamp = clause to every table reference in your query. Additionally, each table might have a different snapshot_timestamp value for any given day!

The final step in this export workflow creates an Athena view that adds that WHERE clause for you. This means that you can interact with your DynamoDB exports as if they were one sane view of your exported DynamoDB tables.

Setting up the infrastructure

The AWS CloudFormation stacks I create are split into two stacks. The common stack contains shared infrastructure, and you need only one of these per AWS Region. The table stacks are designed in such a way that you can create one per table-format combination in any given AWS Region. It contains the CloudWatch event logic and AWS Glue components needed to export and transform DynamoDB tables.

Creating the common stack

The common stack contains the majority of the infrastructure. That includes the Step Functions state machine and Lambda functions to trigger and check the state of asynchronous jobs. It also includes IAM roles that the export stacks use, and the S3 bucket to store the exports.

To create the common stack, do the following:

  1. Choose this Launch Stack
  2. Choose I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  3. Choose Create Stack.

Creating the table export stack

If you don’t have a DynamoDB table to export, follow the original blog post. Start with the Working with the Reviews stack section and continue until you’ve added the two Items to the table. Otherwise, feel free to point this CloudFormation stack at your favorite DynamoDB table that is using provisioned throughput. Tables that use on-demand throughput are not currently supported.

Because so much of this architecture is shareable, there’s not much in the table export stack. This stack defines the CloudWatch event used to trigger the Step Functions state machine with a JSON payload containing all the necessary metadata. Additionally, it contains the AWS Glue ETL job that exports the table and the AWS Glue Crawler that updates metadata in the AWS Glue Data Catalog.

Technically, you can define the AWS Glue ETL job in the common stack because it’s already parameterized. However, the default limit for concurrent runs for an AWS Glue job is three. This is a soft limit, but with this architecture you have headroom to export up to 25 tables before asking for a limit increase.

To create the table export stack, do the following:

  1. Choose this Launch Stack
  2. Choose an output format from the list. All the available formats are supported by Athena natively.
  3. Enter your DynamoDB table name.
  4. Enter the percentage of Read Capacity Units (RCUs) that the job should consume from your table’s currently provisioned throughput. This percentage is expressed as a float between 0.1 and 1.0 inclusive. The default is 0.25 (25 percent).

As an example: Suppose that your table’s RCUs are set to 100 and you use the default 0.25, 25 percent. Then the AWS Glue job consume 25 RCUs while running.

  1. Choose Create.

Kicking off a state machine execution

To demonstrate how this works, we run the DynamoDB export state machine manually by passing it the JSON payload that the CloudWatch event would pass to Step Functions.

Getting the JSON payload from CloudWatch Events

To get the JSON payload, do the following:

  1. Open CloudWatch in the AWS Management Console.
  2. In the left column under Events, choose Rules.
  3. Choose your rule from the list. It is prefixed by AWSBigDataBlog-.
  4. For Actions, choose Edit.
  5. Copy the JSON payload from the Configure input section of Targets.
  6. Choose Cancel to exit edit mode.

Starting a state machine execution

To start an execution of the state machine, take the following steps:

  1. Open Step Functions in the console.
  2. Choose the DynamoDBExportAndAthenaLoad state machine.
  3. Choose Start execution.
  4. Paste the JSON payload into the Input
  5. Choose Start execution.

There are a few ways to follow along with the execution. As steps are entered and exited, entries are added to the Execution event history list. This is a great way to see what state (event in Lambda speak) is passed to each step, in case you need to debug.

You can also expand the Visual workflow. It’s a great high-level view to see how the workflow is progressing.

After the workflow is finished, you see two new tables under the dynamodb_exports database in your AWS Glue Data Catalog. Your DynamoDB snapshots table name is prefixed with snapshots_. The schema is formatted for the AWS Glue Data Catalog (lowercase and hyphens transformed to underscores). You also have a view table with the same table name formatted for AWS Glue Data Catalog but without the snapshots_ prefix.

Querying your data

To showcase how having a separate view table of the most recent snapshot of a table is useful, I use the Reviews table from the previous blog post. The table has two items. I have also run the export workflow twice. As you can see when you preview the table, there are four items total. That’s because each snapshot contains two items.

From the items, the latest snapshot_timestamp is 2019-01-11T23:26. When I run the same preview query against the view table reviews, we see that there are only two items, which is what we expect. The view takes care of specifying the where snapshot_timestamp=… clause so you don’t have to.

Wrapping up

In this post, I showed you how to use AWS Glue’s DynamoDB integration and AWS Step Functions to create a workflow to export your DynamoDB tables to S3 in Parquet. I also show how to create an Athena view for each table’s latest snapshot, giving you a consistent view of your DynamoDB table exports.


About the Author

Joe Feeney is a Software Engineer at Amazon Go, where he does secret stuff and he’s quite chuffed with that. He enjoys embarrassing his family by taking Mario Kart entirely too seriously.

 

 

 

ICYMI: Serverless Q1 2019

Post Syndicated from Eric Johnson original https://aws.amazon.com/blogs/compute/icymi-serverless-q1-2019/

Welcome to the fifth edition of the AWS Serverless ICYMI (in case you missed it) quarterly recap. Every quarter, we share all of the most recent product launches, feature enhancements, blog posts, webinars, Twitch live streams, and other interesting things that you might have missed!

If you didn’t see them, check our previous posts for what happened in 2018:

So, what might you have missed this past quarter? Here’s the recap.

Amazon API Gateway

Amazon API Gateway improved the experience for publishing APIs on the API Gateway Developer Portal. In addition, we also added features like a search capability, feedback mechanism, and SDK-generation capabilities.

Last year, API Gateway announced support for WebSockets. As of early February 2019, it is now possible to build WebSocket-enabled APIs via AWS CloudFormation and AWS Serverless Application Model (AWS SAM). The following diagram shows an example application.WebSockets

API Gateway is also now supported in AWS Config. This feature enhancement allows API administrators to track changes to their API configuration automatically. With the power of AWS Config, you can automate alerts—and even remediation—with triggered Lambda functions.

In early January, API Gateway also announced a service level agreement (SLA) of 99.95% availability.

AWS Step Functions

Step Functions Local

AWS Step Functions added the ability to tag Step Function resources and provide access control with tag-based permissions. With this feature, developers can use tags to define access via AWS Identity and Access Management (IAM) policies.

In addition to tag-based permissions, Step Functions was one of 10 additional services to have support from the Resource Group Tagging API, which allows a single central point of administration for tags on resources.

In early February, Step Functions released the ability to develop and test applications locally using a local Docker container. This new feature allows you to innovate faster by iterating faster locally.

In late January, Step Functions joined the family of services offering SLAs with an SLA of 99.9% availability. They also increased their service footprint to include the AWS China (Ningxia) and AWS China (Beijing) Regions.

AWS SAM Command Line Interface

AWS SAM Command Line Interface (AWS SAM CLI) released the AWS Toolkit for Visual Studio Code and the AWS Toolkit for IntelliJ. These toolkits are open source plugins that make it easier to develop applications on AWS. The toolkits provide an integrated experience for developing serverless applications in Node.js (Visual Studio Code) as well as Java and Python (IntelliJ), with more languages and features to come.

The toolkits help you get started fast with built-in project templates that leverage AWS SAM to define and configure resources. They also include an integrated experience for step-through debugging of serverless applications and make it easy to deploy your applications from the integrated development environment (IDE).

AWS Serverless Application Repository

AWS Serverless Application Repository applications can now be published to the application repository using AWS CodePipeline. This allows you to update applications in the AWS Serverless Application Repository with a continuous integration and continuous delivery (CICD) process. The CICD process is powered by a pre-built application that publishes other applications to the AWS Serverless Application Repository.

AWS Event Fork Pipelines

Event Fork Pipelines

AWS Event Fork Pipelines is now available in AWS Serverless Application Repository. AWS Event Fork Pipelines is a suite of nested open-source applications based on AWS SAM. You can deploy Event Fork Pipelines directly from AWS Serverless Application Repository into your AWS account. These applications help you build event-driven serverless applications by providing pipelines for common event-handling requirements.

AWS Cloud9

Cloud9

AWS Cloud9 announced that, in addition to Amazon Linux, you can now select Ubuntu as the operating system for their AWS Cloud9 environment. Before this announcement, you would have to stand up an Ubuntu server and connect AWS Cloud9 to the instance by using SSH. With native support for Ubuntu, you can take advantage of AWS Cloud9 features, such as instance lifecycle management for cost efficiency and preconfigured tooling environments.

AWS Cloud9 also added support for AWS CloudTrail, which allows you to monitor and react to changes made to your AWS Cloud9 environment.

Amazon Kinesis Data Analytics

Amazon Kinesis Data Analytics now supports CloudTrail logging. CloudTrail captures changes made to Kinesis Data Analytics and delivers the logs to an Amazon S3 bucket. This makes it easy for administrators to understand changes made to the application and who made them.

Amazon DynamoDB

Amazon DynamoDB removed the associated costs of DynamoDB Streams used in replicating data globally. Because of their use of streams to replicate data between Regions, this translates to cost savings in global tables. However, DynamoDB streaming costs remain the same for your applications reading from a replica table’s stream.

DynamoDB added the ability to switch encryption keys used to encrypt data. DynamoDB, by default, encrypts all data at rest. You can use the default encryption, the AWS-owned customer master key (CMK), or the AWS managed CMK to encrypt data. It is now possible to change between the AWS-owned CMK and the AWS managed CMK without having to modify code or applications.

Amazon DynamoDB Local, a local installable version of DynamoDB, has added support for transactional APIs, on-demand capacity, and as many as 20 global secondary indexes per table.

AWS Amplify

Amplify Deploy

AWS Amplify added support for OAuth 2.0 Authorization Code Grant flows in the native (iOS and Android) and React Native libraries. Previously, you would have to use third-party libraries and handwritten logic to achieve these use cases.

Additionally, Amplify also launched the ability to perform instant cache invalidation and delta deployments on every code commit. To achieve this, Amplify creates unique references to all the build artifacts on each deploy. Amplify has also added the ability to detect and upload only modified artifacts at the time of release to help reduce deployment time.

Amplify also added features for multiple environments, custom resolvers, larger data models, and IAM roles, including multi-factor authentication (MFA).

AWS AppSync

AWS AppSync increased its availability footprint to the EU (London) Region.

Amazon Cognito

Amazon Cognito increased its service footprint to include the Canada (central) Region. It also published an SLA of 99.9% availability.

Amazon Aurora

Amazon Aurora Serverless increases performance visibility by publishing logs to Amazon CloudWatch.

AWS CodePipeline

CodePipeline

AWS CodePipeline announces support for deploying static files to Amazon S3. While this may not usually fall under the serverless blogs and announcements, if you’re a developer who builds single-page applications or host static websites, this makes your life easier. Your static site can now be part of your CICD process without custom coding.

Serverless Posts

January:

February:

March

Tech talks

We hold several AWS Online Tech Talks covering serverless tech talks throughout the year, so look out for them in the Serverless section of the AWS Online Tech Talks page. Here are the three tech talks that we delivered in Q1:

Whitepapers

Security Overview of AWS Lambda: This whitepaper presents a deep dive into the Lambda service through a security lens. It provides a well-rounded picture of the service, which can be useful for new adopters, as well as deepening understanding of Lambda for current users. Read the full whitepaper.

Twitch

AWS Launchpad Santa Clara

There is always something going on at our Twitch channel! Be sure and follow us so you don’t miss anything! For information about upcoming broadcasts and recent livestreams, keep an eye on AWS on Twitch for more Serverless videos and on the Join us on Twitch AWS page.

In other news

Building Happy Little APIs

Twitch Series: Building Happy Little APIs

In April, we started a 13-week deep dive into building APIs on AWS as part of our Twitch Build On series. The Building Happy Little APIs series covers the common and not-so-common use cases for APIs on AWS and the features available to customers as they look to build secure, scalable, efficient, and flexible APIs.

Twitch series: Build on Serverless: Season 2

Build On Serverless

Join Heitor Lessa across 14 weeks, nearly every Wednesday from April 24 – August 7 at 8AM PST/11AM EST/3PM UTC. Heitor is live-building a full-stack, serverless airline-booking application using a bunch of services: Lambda, Amplify, API Gateway, Amazon Cognito, AWS SAM, CloudWatch, AWS AppSync, and others. See the episode guide and sign up for stream reminders.

2019 AWS Summits

AWS Summit

The 2019 schedule is in full swing for 2019 AWS Global Summits held in major cities around the world. These free events bring the cloud computing community together to connect, collaborate, and learn about AWS. They attract technologists from all industries and skill levels who want to discover how AWS can help them innovate quickly and deliver flexible, reliable solutions at scale. Get notified when to register and learn more at the AWS Global Summit Program website.

Still looking for more?

The Serverless landing page has lots of information. The Lambda resources page contains case studies, webinars, whitepapers, customer stories, reference architectures, and even more Getting Started tutorials. Check it out!

From Poll to Push: Transform APIs using Amazon API Gateway REST APIs and WebSockets

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/from-poll-to-push-transform-apis-using-amazon-api-gateway-rest-apis-and-websockets/

This post is courtesy of Adam Westrich – AWS Principal Solutions Architect and Ronan Prenty – Cloud Support Engineer

Want to deploy a web application and give a large number of users controlled access to data analytics? Or maybe you have a retail site that is fulfilling purchase orders, or an app that enables users to connect data from potentially long-running third-party data sources. Similar use cases exist across every imaginable industry and entail sending long-running requests to perform a subsequent action. For example, a healthcare company might build a custom web portal for physicians to connect and retrieve key metrics from patient visits.

This post is aimed at optimizing the delivery of information without needing to poll an endpoint. First, we outline the current challenges with the consumer polling pattern and alternative approaches to solve these information delivery challenges. We then show you how to build and deploy a solution in your own AWS environment.

Here is a glimpse of the sample app that you can deploy in your own AWS environment:

What’s the problem with polling?

Many customers need implement the delivery of long-running activities (such as a query to a data warehouse or data lake, or retail order fulfillment). They may have developed a polling solution that looks similar to the following:

  1. POST sends a request.
  2. GET returns an empty response.
  3. Another… GET returns an empty response.
  4. Yet another… GET returns an empty response.
  5. Finally, GET returns the data for which you were looking.

The challenges of traditional polling methods

  • Unnecessary chattiness and cost due to polling for result sets—Every time your frontend polls an API, it’s adding costs by leveraging infrastructure to compute the result. Empty polls are just wasteful!
  • Hampered mobile battery life—One of the top contributors of apps that eat away at battery life is excessive polling. Make sure that your app isn’t on your users’ Top App Battery Usage list-of-shame that could result in deletion.
  • Delayed data arrival due to polling schedule—Some approaches to polling include an incremental backoff to limit the number of empty polls. This sometimes results in a delay between data readiness and data arrival.

But what about long polling?

  • User request deadlocks can hamper application performance—Long synchronous user responses can lead to unexpected user wait times or UI deadlocks, which can affect mobile devices especially.
  • Memory leaks and consumption could bring your app down—Keeping long-running tasks queries open may overburden your backend and create failure scenarios, which may bring down your app.
  • HTTP default timeouts across browsers may result in inconsistent client experience—These timeouts vary across browsers, and can lead to an inconsistent experience across your end users. Depending on the size and complexity of the requests, processing can last longer than many of these timeouts and take minutes to return results.

Instead, create an event-driven architecture and move your APIs from poll to push.

Asynchronous push model

To create optimal UX experiences, frontend developers often strive to create progressive and reactive user experiences. Users can interact with frontend objects (for example, push buttons) with little lag when sending requests and receiving data. But frontend developers also want users to receive timely data, without sacrificing additional user actions or performing unnecessary processing.

The birth of microservices and the cloud over the past several years has enabled frontend developers and backend service designers to think about these problems in an asynchronous manner. This enables the virtually unlimited resources of the cloud to choreograph data processing. It also enables clients to benefit from progressive and reactive user experiences.

This is a fresh alternative to the synchronous design pattern, which often relies on client consumers to act as the conductor for all user requests. The following diagram compares the flow of communication between patterns.Comparison of communication patterns

Asynchronous orchestration can be easily choreographed using the workflow definition, monitoring, and tracking console with AWS Step Functions state machines. Break up services into functions with AWS Lambda and track executions within a state diagram, like the following:

With this approach, consumers send a request, which initiates downstream distributed processing. The subsequent steps are invoked according to the state machine workflow and each execution can be monitored.

Okay, but how does the consumer get the result of what they’re looking for?

Multiple approaches to this problem

There are different approaches for consumers to retrieve their resulting data. In order to create the optimal solution, there are several questions that a service owner may need to ask.

Is there known trust with the client (such as another microservice)?

If the answer is Yes, one approach is to use Amazon SNS. That way, consumers can subscribe to topics and have data delivered using email, SMS, or even HTTP/HTTPS as an event subscriber (that is, webhooks).

With webhooks, the consumer creates an endpoint where the service provider can issue a callback using a small amount of consumer-side resources. The consumer is waiting for an incoming request to facilitate downstream processing.

  1. Send a request.
  2. Subscribe the consumer to the topic.
  3. Open the endpoint.
  4. SNS sends the POST request to the endpoint.

If the trust answer is No, then clients may not be able to open a lightweight HTTP webhook after sending the initial request. In that case, you must open the webhook. Consider an alternative framework.

Are you restricted to only using a REST protocol?

Some applications can only run HTTP requests. These scenarios include the age of the technology or browser for end users, and potential security requirements that may block other protocols.

In these scenarios, customers may use a GET method as a best practice, but we still advise avoiding polling. In these scenarios, some design questions may be: Does data readiness happen at a predefined time or duration interval? Or can the user experience tolerate time between data readiness and data arrival?

If the answer to both of these is Yes, then consider trying to send GET calls to your RESTful API one time. For example, if a job averages 10 minutes, make your GET call at 10 minutes after the submission. Sounds simple, right? Much simpler than polling.

Should I use GraphQL, a WebSocket API, or another framework?

Each framework has tradeoffs.

If you want a more flexible query schema, you may gravitate to GraphQL, which follows a “data-driven UI” approach. If data drives the UI, then GraphQL may be the best solution for your use case.

AWS AppSync is a serverless GraphQL engine that supports the heavy lifting of these constructs. It offers functionality such as AWS service integration, offline data synchronization, and conflict resolution. With GraphQL, there’s a construct called Subscriptions where clients can subscribe to event-based subscriptions upon a data change.

Amazon API Gateway makes it easy for developers to deploy secure APIs at scale. With the recent introduction of API Gateway WebSocket APIs, web, mobile clients, and backend services can communicate over an established WebSocket connection. This also allows clients to be more reactive to data updates and only do work one time after an update has been received over the WebSocket connection.

The typical frontend design approach is to create a UI component that is updated when the results of the given procedure are complete. This is beneficial over a complete webpage refresh and gains in the customer’s user experience.

Because many companies have elected to use the REST framework for creating API-driven tightly bound service contracts, a RESTful interface can be used to validate and receive the request. It can provide a status endpoint used to deliver status. Also, it provides additional flexibility in delivering the result to the variety of clients, along with the WebSocket API.

Poll-to-push solution with API Gateway

Imagine a scenario where you want to be updated as soon as the data is created. Instead of the traditional polling methods described earlier, use an API Gateway WebSocket API. That pushes new data to the client as it’s created, so that it can be rendered on the client UI.

Alternatively, a WebSocket server can be deployed on Amazon EC2. With this approach, your server is always running to accept and maintain new connections. In addition, you manage the scaling of the instance manually at times of high demand.

By using an API Gateway WebSocket API in front of Lambda, you don’t need a machine to stay always on, eating away your project budget. API Gateway handles connections and invokes Lambda whenever there’s a new event. Scaling is handled on the service side. To update our connected clients from the backend, we can use the API Gateway callback URL. The AWS SDKs make communication from the backend easy. As an example, see the Boto3 sample of post_to_connection:

import boto3 
#Use a layer or deployment package to 
#include the latest boto3 version.
...
apiManagement = boto3.client('apigatewaymanagementapi', region_name={{api_region}},
                      endpoint_url={{api_url}})
...
response = apiManagement.post_to_connection(Data={{message}},ConnectionId={{connectionId}})
...

Solution example

To create this more optimized solution, create a simple web application that enables a user to make a request against a large dataset. It returns the results in a flat file over WebSocket. In this case, we’re querying, via Amazon Athena, a data lake on S3 populated with AWS Twitter sentiment (Twitter: @awscloud or #awsreinvent). However, this solution could apply to any data store, data mart, or data warehouse environment, or the long-running return of data for a response.

For the frontend architecture, create this web application using a JavaScript framework (such as JQuery). Use API Gateway to accept a REST API for the data and then open a WebSocket connection on the client to facilitate the return of results:poll to push solution architecture

  1. The client sends a REST request to API Gateway. This invokes a Lambda function that starts the Step Functions state machine execution. The function also returns a task token for the open connection activity worker. The function then returns the execution ARN and task token to the client.
  2. Using the data returned by the REST request, the client connects to the WebSocket API and sends the task token to the WebSocket connection.
  3. The WebSocket notifies the Step Functions state machine that the client is connected. The Lambda function completes the OpenConnection task through validating the task token and sending a success message.
  4. After RunAthenaQuery and OpenConnection are successful, the state machine updates the connected client over the WebSocket API that their long-running job is complete. It uses the REST API call post_to_connection.
  5. The client receives the update over their WebSocket connection using the IssueCallback Lambda function, with the callback URL from the API Gateway WebSocket API.

In this example application, the data response is an S3 presigned URL composed of results from Athena. You can then configure your frontend to use that S3 link to download to the client.

Why not just open the WebSocket API for the request?

While this approach can work, we advise against it for this use case. Break the required interfaces for this use case into three processes:

  1. Submit a request.
  2. Get the status of the request (to detect failure).
  3. Deliver the results.

For the submit request interface, RESTful APIs have strong controls to ensure that user requests for data are validated and given guardrails for the intended query purpose. This helps prevent rogue requests and unforeseen impacts to the analytics environment, especially when exposed to a large number of users.

With this example solution, you’re restricting the data requests to specific countries in the frontend JavaScript. Using the RESTful POST method for API requests enables you to validate data as query string parameters, such as the following:

https://<apidomain>.amazonaws.com/Demo/CreateStateMachineAndToken?Country=France

API Gateway models and mapping templates can also be used to validate or transform request payloads at the API layer before they hit the backend. Use models to ensure that the structure or contents of the client payload are the same as you expect. Use mapping templates to transform the payload sent by clients to another format, to be processed by the backend.

This REST validation framework can also be used to detect header information on WebSocket browser compatibility (external site). While using WebSocket has many advantages, not all browsers support it, especially older browsers. Therefore, a REST API request layer can pass this browser metadata and determine whether a WebSocket API can be opened.

Because a REST interface is already created for submitting the request, you can easily add another GET method if the client must query the status of the Step Functions state machine. That might be the case if a health check in the request is taking longer than expected. You can also add another GET method as an alternative access method for REST-only compatible clients.

If low-latency request and retrieval are an important characteristic of your API and there aren’t any browser-compatibility risks, use a WebSocket API with JSON model selection expressions to protect your backend with a schema.

In the spirit of picking the best tool for the job, use a REST API for the request layer and a WebSocket API to listen for the result.

This solution, although secure, is an example for how a non-polling solution can work on AWS. At scale, it may require refactoring due to the cross-talk at high concurrency that may result in client resubmissions.

To discover, deploy, and extend this solution into your own AWS environment, follow the PollToPush instructions in the AWS Serverless Application Repository.

Conclusion

When application consumers poll for long-running tasks, it can be a wasteful, detrimental, and costly use of resources. This post outlined multiple ways to refactor the polling method. Use API Gateway to host a RESTful interface, Step Functions to orchestrate your workflow, Lambda to perform backend processing, and an API Gateway WebSocket API to push results to your clients.

ICYMI: Serverless Q4 2018

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/icymi-serverless-q4-2018/

This post is courtesy of Eric Johnson, Senior Developer Advocate – AWS Serverless

Welcome to the fourth edition of the AWS Serverless ICYMI (in case you missed it) quarterly recap. Every quarter, we share all of the most recent product launches, feature enhancements, blog posts, webinars, Twitch live streams, and other interesting things that you might have missed!

This edition of ICYMI includes all announcements from AWS re:Invent 2018!

If you didn’t see them, check our Q1 ICYMIQ2 ICYMI, and Q3 ICYMI posts for what happened then.

So, what might you have missed this past quarter? Here’s the recap.

New features

AWS Lambda introduced the Lambda runtime API and Lambda layers, which enable developers to bring their own runtime and share common code across Lambda functions. With the release of the runtime API, we can now support runtimes from AWS partners such as the PHP runtime from Stackery and the Erlang and Elixir runtimes from Alert Logic. Using layers, partners such as Datadog and Twistlock have also simplified the process of using their Lambda code libraries.

To meet the demand of larger Lambda payloads, Lambda doubled the payload size of asynchronous calls to 256 KB.

In early October, Lambda also lengthened the runtime limit by enabling Lambda functions that can run up to 15 minutes.

Lambda also rolled out native support for Ruby 2.5 and Python 3.7.

You can now process Amazon Kinesis streams up to 68% faster with AWS Lambda support for Kinesis Data Streams enhanced fan-out and HTTP/2 for faster streaming.

Lambda also released a new Application view in the console. It’s a high-level view of all of the resources in your application. It also gives you a quick view of deployment status with the ability to view service metrics and custom dashboards.

Application Load Balancers added support for targeting Lambda functions. ALBs can provide a simple HTTP/S front end to Lambda functions. ALB features such as host- and path-based routing are supported to allow flexibility in triggering Lambda functions.

Amazon API Gateway added support for AWS WAF. You can use AWS WAF for your Amazon API Gateway APIs to protect from attacks such as SQL injection and cross-site scripting (XSS).

API Gateway has also improved parameter support by adding support for multi-value parameters. You can now pass multiple values for the same key in the header and query string when calling the API. Returning multiple headers with the same name in the API response is also supported. For example, you can send multiple Set-Cookie headers.

In October, API Gateway relaunched the Serverless Developer Portal. It provides a catalog of published APIs and associated documentation that enable self-service discovery and onboarding. You can customize it for branding through either custom domain names or logo/styling updates. In November, we made it easier to launch the developer portal from the Serverless Application Repository.

In the continuous effort to decrease customer costs, API Gateway introduced tiered pricing. The tiered pricing model allows the cost of API Gateway at scale with an API Requests price as low as $1.51 per million requests at the highest tier.

Last but definitely not least, API Gateway released support for WebSocket APIs in mid-December as a final holiday gift. With this new feature, developers can build bidirectional communication applications without having to provision and manage any servers. This has been a long-awaited and highly anticipated announcement for the serverless community.

AWS Step Functions added eight new service integrations. With this release, the steps of your workflow can exist on Amazon ECS, AWS Fargate, Amazon DynamoDB, Amazon SNS, Amazon SQS, AWS Batch, AWS Glue, and Amazon SageMaker. This is in addition to the services that Step Functions already supports: AWS Lambda and Amazon EC2.

Step Functions expanded by announcing availability in the EU (Paris) and South America (São Paulo) Regions.

The AWS Serverless Application Repository increased its functionality by supporting more resources in the repository. The Serverless Application Repository now supports Application Auto Scaling, Amazon Athena, AWS AppSync, AWS Certificate Manager, Amazon CloudFront, AWS CodeBuild, AWS CodePipeline, AWS Glue, AWS dentity and Access Management, Amazon SNS, Amazon SQS, AWS Systems Manager, and AWS Step Functions.

The Serverless Application Repository also released support for nested applications. Nested applications enable you to build highly sophisticated serverless architectures by reusing services that are independently authored and maintained but easily composed using AWS SAM and the Serverless Application Repository.

AWS SAM made authorization simpler by introducing SAM support for authorizers. Enabling authorization for your APIs is as simple as defining an Amazon Cognito user pool or an API Gateway Lambda authorizer as a property of your API in your SAM template.

AWS SAM CLI introduced two new commands. First, you can now build locally with the sam build command. This functionality allows you to compile deployment artifacts for Lambda functions written in Python. Second, the sam publish command allows you to publish your SAM application to the Serverless Application Repository.

Our SAM tooling team also released the AWS Toolkit for PyCharm, which provides an integrated experience for developing serverless applications in Python.

The AWS Toolkits for Visual Studio Code (Developer Preview) and IntelliJ (Developer Preview) are still in active development and will include similar features when they become generally available.

AWS SAM and the AWS SAM CLI implemented support for Lambda layers. Using a SAM template, you can manage your layers, and using the AWS SAM CLI, you can develop and debug Lambda functions that are dependent on layers.

Amazon DynamoDB added support for transactions, allowing developer to enforce all-or-nothing operations. In addition to transaction support, Amazon DynamoDB Accelerator also added support for DynamoDB transactions.

Amazon DynamoDB also announced Amazon DynamoDB on-demand, a flexible new billing option for DynamoDB capable of serving thousands of requests per second without capacity planning. DynamoDB on-demand offers simple pay-per-request pricing for read and write requests so that you only pay for what you use, making it easy to balance costs and performance.

AWS Amplify released the Amplify Console, which is a continuous deployment and hosting service for modern web applications with serverless backends. Modern web applications include single-page app frameworks such as React, Angular, and Vue and static-site generators such as Jekyll, Hugo, and Gatsby.

Amazon SQS announced support for Amazon VPC Endpoints using PrivateLink.

Serverless blogs

October

November

December

Tech talks

We hold several Serverless tech talks throughout the year, so look out for them in the Serverless section of the AWS Online Tech Talks page. Here are the three tech talks that we delivered in Q4:

Twitch

We’ve been so busy livestreaming on Twitch that you’re most certainly missing out if you aren’t following along!

For information about upcoming broadcasts and recent livestreams, keep an eye on AWS on Twitch for more Serverless videos and on the Join us on Twitch AWS page.

New Home for SAM Docs

This quarter, we moved all SAM docs to https://docs.aws.amazon.com/serverless-application-model. Everything you need to know about SAM is there. If you don’t find what you’re looking for, let us know!

In other news

The schedule is out for 2019 AWS Global Summits in cities around the world. AWS Global Summits are free events that bring the cloud computing community together to connect, collaborate, and learn about AWS. Summits are held in major cities around the world. They attract technologists from all industries and skill levels who want to discover how AWS can help them innovate quickly and deliver flexible, reliable solutions at scale. Get notified when to register and learn more at the AWS Global Summits website.

Still looking for more?

The Serverless landing page has lots of information. The resources page contains case studies, webinars, whitepapers, customer stories, reference architectures, and even more Getting Started tutorials. Check it out!

Building Simpler Genomics Workflows on AWS Step Functions

Post Syndicated from Christie Gifrin original https://aws.amazon.com/blogs/compute/building-simpler-genomics-workflows-on-aws-step-functions/

This post is courtesy of Ryan Ulaszek, AWS Genomics Partner Solutions Architect and Aaron Friedman, AWS Healthcare and Life Sciences Partner Solutions Architect

In 2017, we published a four part blog series on how to build a genomics workflow on AWS. In part 1, we introduced a general architecture highlighting three common layers: job, batch and workflow.  In part 2, we described building the job layer with Docker and Amazon Elastic Container Registry (Amazon ECR).  In part 3, we tackled the batch layer and built a batch engine using AWS Batch.  In part 4, we built out the workflow layer using AWS Step Functions and AWS Lambda.

Since then, we’ve worked with many AWS customers and APN partners to implement this solution in genomics as well as in other workloads-of-interest. Today, we wanted to highlight a new feature in Step Functions that simplifies how customers and partners can build high-throughput genomics workflows on AWS.

Step Functions now supports native integration with AWS Batch, which simplifies how you can create an AWS Batch state that submits an asynchronous job and waits for that job to finish.

Before, you needed to build a state machine building block that submitted a job to AWS Batch, and then polled and checked its execution. Now, you can just submit the job to AWS Batch using the new AWS Batch task type.  Step Functions waits to proceed until the job is completed. This reduces the complexity of your state machine and makes it easier to build a genomics workflow with asynchronous AWS Batch steps.

The new integrations include support for the following API actions:

  • AWS Batch SubmitJob
  • Amazon SNS Publish
  • Amazon SQS SendMessage
  • Amazon ECS RunTask
  • AWS Fargate RunTask
  • Amazon DynamoDB
    • PutItem
    • GetItem
    • UpdateItem
    • DeleteItem
  • Amazon SageMaker
    • CreateTrainingJob
    • CreateTransformJob
  • AWS Glue
    • StartJobRun

You can also pass parameters to the service API.  To use the new integrations, the role that you assume when running a state machine needs to have the appropriate permissions.  For more information, see the AWS Step Functions Developer Guide.

Using a job status poller

In our 2017 post series, we created a job poller “pattern” with two separate Lambda functions. When the job finishes, the state machine proceeds to the next step and operates according to the necessary business logic.  This is a useful pattern to manage asynchronous jobs when a direct integration is unavailable.

The steps in this building block state machine are as follows:

  1. A job is submitted through a Lambda function.
  2. The state machine queries the AWS Batch API for the job status in another Lambda function.
  3. The job status is checked to see if the job has completed.  If the job status equals SUCCESS, the final job status is logged. If the job status equals FAILED, the execution of the state machine ends. In all other cases, wait 30 seconds and go back to Step 2.

Both of the Submit Job and Get Job Lambda functions are available as example Lambda functions in the console.  The job status poller is available in the Step Functions console as a sample project.

Here is the JSON representing this state machine.

{
  "Comment": "A simple example that submits a job to AWS Batch",
  "StartAt": "SubmitJob",
  "States": {
    "SubmitJob": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:<account-id>::function:batchSubmitJob",
      "Next": "GetJobStatus"
    },
    "GetJobStatus": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:<account-id>:function:batchGetJobStatus",
      "Next": "CheckJobStatus",
      "InputPath": "$",
      "ResultPath": "$.status"
    },
    "CheckJobStatus": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.status",
          "StringEquals": "FAILED",
          "End": true
        },
        {
          "Variable": "$.status",
          "StringEquals": "SUCCEEDED",
          "Next": "GetFinalJobStatus"
        }
      ],
      "Default": "Wait30Seconds"
    },
    "Wait30Seconds": {
      "Type": "Wait",
      "Seconds": 30,
      "Next": "GetJobStatus"
    },
    "GetFinalJobStatus": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:<account-id>:function:batchGetJobStatus",
      "End": true
    }
  }
}

With Step Functions Service Integrations

With Step Functions service integrations, it is now simpler to submit and wait for an AWS Batch job, or any other supported service.

The following code block is the JSON representing the new state machine for an asynchronous batch job. If you are familiar with the AWS Batch SubmitJob API action, you may notice that the parameters are consistent with what you would see in that API call. You can also use the optional AWS Batch parameters in addition to JobDefinition, JobName, and JobQueue.

{
 "StartAt": "RunBatchJob",
 "States": {
     "RunIsaacJob":{
     "Type":"Task",
     "Resource":"arn:aws:states:::batch:submitJob.sync",
     "Parameters":{
        "JobDefinition":"Isaac",
        "JobName.$":"$.isaac.JobName",
        "JobQueue":"HighPriority",
        "Parameters.$": "$.isaac"
     },
     "TimeoutSeconds": 900,
     "HeartbeatSeconds": 60,
     "Next":"Parallel",
     "InputPath":"$",
     "ResultPath":"$.status",
     "Retry" : [
        {
          "ErrorEquals": [ "States.Timeout" ],
          "IntervalSeconds": 3,
          "MaxAttempts": 2,
          "BackoffRate": 1.5
        }
     ]
  }
}

Here is an example of the workflow input JSON.  Pass all of the container parameters that were being constructed in the submit job Lambda function.

{
  "isaac": {
    "WorkingDir": "/scratch",
    "JobName": "isaac-1",
    "FastQ1S3Path": "s3://aws-batch-genomics-resources/fastq/SRR1919605_1.fastq.gz",
    "BAMS3FolderPath": "s3://aws-batch-genomics-resources/fastq/SRR1919605_2.fastq.gz",
    "FastQ2S3Path": "s3://bccn-genome-data/fastq/NIST7035_R2_trimmed.fastq.gz",
    "ReferenceS3Path": "s3://aws-batch-genomics-resources/reference/isaac/"
  }
}

When you deploy the job definition, add the command attribute that was previously being constructed in the Lambda function launching the AWS Batch job.

IsaacJobDefinition:
    Type: AWS::Batch::JobDefinition
    Properties:
      JobDefinitionName: "Isaac"
      Type: container
      RetryStrategy:
        Attempts: 1
      Parameters:
        BAMS3FolderPath: !Sub "s3://${JobResultsBucket}/NA12878_states_1/bam"
        FastQ1S3Path: "s3://aws-batch-genomics-resources/fastq/SRR1919605_1.fastq.gz"
        FastQ2S3Path: "s3://aws-batch-genomics-resources/fastq/SRR1919605_2.fastq.gz"
        ReferenceS3Path: "s3://aws-batch-genomics-resources/reference/isaac/"
        WorkingDir: "/scratch"
      ContainerProperties:
        Image: "rulaszek/isaac"
        Vcpus: 32
        Memory: 80000
        JobRoleArn:
          Fn::ImportValue: !Sub "${RoleStackName}:ECSTaskRole"
        Command:
          - "--bam_s3_folder_path"
          - "Ref::BAMS3FolderPath"
          - "--fastq1_s3_path"
          - "Ref::FastQ1S3Path"
          - "--fastq2_s3_path"
          - "Ref::FastQ2S3Path"
          - "--reference_s3_path"
          - "Ref::ReferenceS3Path"
          - "--working_dir"
          - "Ref::WorkingDir"
        MountPoints:
          - ContainerPath: "/scratch"
            ReadOnly: false
            SourceVolume: docker_scratch
        Volumes:
          - Name: docker_scratch
            Host:
              SourcePath: "/docker_scratch"

The key-value parameters passed into the workflow are mapped using Parameters.$ to the values in the job definition using the keys.  Value substitutions do take place. The Docker run looks like the following:

docker run <isaac_container_uri> --bam_s3_folder_path s3://batch-genomics-pipeline-jobresultsbucket-1kzdu216m2b0k/NA12878_states_3/bam
                                 --fastq1_s3_path s3://aws-batch-genomics-resources/fastq/SRR1919605_1.fastq.gz
                                 --fastq2_s3_path s3://aws-batch-genomics-resources/fastq/SRR1919605_2.fastq.gz 
                                 --reference_s3_path s3://aws-batch-genomics-resources/reference/isaac/ 
                                 --working_dir /scratch

Genomics workflow: Before and after

Overall, connectors dramatically simplify your genomics workflow.  The following workflow is a simple genomics secondary analysis pipeline, which we highlighted in our original post series.

The first step aligns the sample against a reference genome.  When alignment is complete, variant calling and QA metrics are calculated in two parallel steps.  When variant calling is complete, variant annotation is performed.  Before, our genomics workflow looked like this:

Now it looks like this:

Here is the new workflow JSON:

{
   "Comment":"A simple genomics secondary-analysis workflow",
   "StartAt":"RunIsaacJob",
   "States":{
      "RunIsaacJob":{
         "Type":"Task",
         "Resource":"arn:aws:states:::batch:submitJob.sync",
         "Parameters":{
            "JobDefinition":"Isaac",
            "JobName.$":"$.isaac.JobName",
            "JobQueue":"HighPriority",
            "Parameters.$": "$.isaac"
         },
         "TimeoutSeconds": 900,
         "HeartbeatSeconds": 60,
         "Next":"Parallel",
         "InputPath":"$",
         "ResultPath":"$.status",
         "Retry" : [
            {
              "ErrorEquals": [ "States.Timeout" ],
              "IntervalSeconds": 3,
              "MaxAttempts": 2,
              "BackoffRate": 1.5
            }
         ]
      },
      "Parallel":{
         "Type":"Parallel",
         "Next":"FinalState",
         "Branches":[
            {
               "StartAt":"RunStrelkaJob",
               "States":{
                  "RunStrelkaJob":{
                     "Type":"Task",
                     "Resource":"arn:aws:states:::batch:submitJob.sync",
                     "Parameters":{
                        "JobDefinition":"Strelka",
                        "JobName.$":"$.strelka.JobName",
                        "JobQueue":"HighPriority",
                        "Parameters.$": "$.strelka"
                     },
                     "TimeoutSeconds": 900,
                     "HeartbeatSeconds": 60,
                     "Next":"RunSnpEffJob",
                     "InputPath":"$",
                     "ResultPath":"$.status",
                     "Retry" : [
                        {
                          "ErrorEquals": [ "States.Timeout" ],
                          "IntervalSeconds": 3,
                          "MaxAttempts": 2,
                          "BackoffRate": 1.5
                        }
                     ]
                  },
                  "RunSnpEffJob":{
                     "Type":"Task",
                     "Resource":"arn:aws:states:::batch:submitJob.sync",
                     "Parameters":{
                        "JobDefinition":"SNPEff",
                        "JobName.$":"$.snpeff.JobName",
                        "JobQueue":"HighPriority",
                        "Parameters.$": "$.snpeff"
                     },
                     "TimeoutSeconds": 900,
                     "HeartbeatSeconds": 60,
                     "Retry" : [
                        {
                          "ErrorEquals": [ "States.Timeout" ],
                          "IntervalSeconds": 3,
                          "MaxAttempts": 2,
                          "BackoffRate": 1.5
                        }
                     ],
                     "End":true
                  }
               }
            },
            {
               "StartAt":"RunSamtoolsStatsJob",
               "States":{
                  "RunSamtoolsStatsJob":{
                     "Type":"Task",
                     "Resource":"arn:aws:states:::batch:submitJob.sync",
                     "Parameters":{
                        "JobDefinition":"SamtoolsStats",
                        "JobName.$":"$.samtools.JobName",
                        "JobQueue":"HighPriority",
                        "Parameters.$": "$.samtools"
                     },
                     "TimeoutSeconds": 900,
                     "HeartbeatSeconds": 60,
                     "End":true,
                     "Retry" : [
                        {
                          "ErrorEquals": [ "States.Timeout" ],
                          "IntervalSeconds": 3,
                          "MaxAttempts": 2,
                          "BackoffRate": 1.5
                        }
                     ]
                  }
               }
            }
         ]
      },
      "FinalState":{
         "Type":"Pass",
         "End":true
      }
   }
}

Here is the new Amazon CloudFormation template for deploying the AWS Batch job definitions for each tool:

AWSTemplateFormatVersion: 2010-09-09

Description: Batch job definitions for batch genomics

Parameters:
  RoleStackName:
    Description: "Stack that deploys roles for genomic workflow"
    Type: String
  VPCStackName:
    Description: "Stack that deploys vps for genomic workflow"
    Type: String
  JobResultsBucket:
    Description: "Bucket that holds workflow job results"
    Type: String

Resources:
  IsaacJobDefinition:
    Type: AWS::Batch::JobDefinition
    Properties:
      JobDefinitionName: "Isaac"
      Type: container
      RetryStrategy:
        Attempts: 1
      Parameters:
        BAMS3FolderPath: !Sub "s3://${JobResultsBucket}/NA12878_states_1/bam"
        FastQ1S3Path: "s3://aws-batch-genomics-resources/fastq/SRR1919605_1.fastq.gz"
        FastQ2S3Path: "s3://aws-batch-genomics-resources/fastq/SRR1919605_2.fastq.gz"
        ReferenceS3Path: "s3://aws-batch-genomics-resources/reference/isaac/"
        WorkingDir: "/scratch"
      ContainerProperties:
        Image: "rulaszek/isaac"
        Vcpus: 32
        Memory: 80000
        JobRoleArn:
          Fn::ImportValue: !Sub "${RoleStackName}:ECSTaskRole"
        Command:
          - "--bam_s3_folder_path"
          - "Ref::BAMS3FolderPath"
          - "--fastq1_s3_path"
          - "Ref::FastQ1S3Path"
          - "--fastq2_s3_path"
          - "Ref::FastQ2S3Path"
          - "--reference_s3_path"
          - "Ref::ReferenceS3Path"
          - "--working_dir"
          - "Ref::WorkingDir"
        MountPoints:
          - ContainerPath: "/scratch"
            ReadOnly: false
            SourceVolume: docker_scratch
        Volumes:
          - Name: docker_scratch
            Host:
              SourcePath: "/docker_scratch"

  StrelkaJobDefinition:
    Type: AWS::Batch::JobDefinition
    Properties:
      JobDefinitionName: "Strelka"
      Type: container
      RetryStrategy:
        Attempts: 1
      Parameters:
        BAMS3Path: !Sub "s3://${JobResultsBucket}/NA12878_states_1/bam/sorted.bam"
        BAIS3Path: !Sub "s3://${JobResultsBucket}/NA12878_states_1/bam/sorted.bam.bai"
        ReferenceS3Path: "s3://aws-batch-genomics-resources/reference/hg38.fa"
        ReferenceIndexS3Path: "s3://aws-batch-genomics-resources/reference/hg38.fa.fai"
        VCFS3Path: !Sub "s3://${JobResultsBucket}/NA12878_states_1/vcf"
        WorkingDir: "/scratch"
      ContainerProperties:
        Image: "rulaszek/strelka"
        Vcpus: 32
        Memory: 32000
        JobRoleArn:
          Fn::ImportValue: !Sub "${RoleStackName}:ECSTaskRole"
        Command:
          - "--bam_s3_path"
          - "Ref::BAMS3Path"
          - "--bai_s3_path"
          - "Ref::BAIS3Path"
          - "--reference_s3_path"
          - "Ref::ReferenceS3Path"
          - "--reference_index_s3_path"
          - "Ref::ReferenceIndexS3Path"
          - "--vcf_s3_path"
          - "Ref::VCFS3Path"
          - "--working_dir"
          - "Ref::WorkingDir"
        MountPoints:
          - ContainerPath: "/scratch"
            ReadOnly: false
            SourceVolume: docker_scratch
        Volumes:
          - Name: docker_scratch
            Host:
              SourcePath: "/docker_scratch"

  SnpEffJobDefinition:
    Type: AWS::Batch::JobDefinition
    Properties:
      JobDefinitionName: "SNPEff"
      Type: container
      RetryStrategy:
        Attempts: 1
      Parameters:
        VCFS3Path: !Sub "s3://${JobResultsBucket}/NA12878_states_1/vcf/variants/genome.vcf.gz"
        AnnotatedVCFS3Path: !Sub "s3://${JobResultsBucket}/NA12878_states_1/vcf/genome.anno.vcf"
        CommandArgs: " -t hg38 "
        WorkingDir: "/scratch"
      ContainerProperties:
        Image: "rulaszek/snpeff"
        Vcpus: 4
        Memory: 10000
        JobRoleArn:
          Fn::ImportValue: !Sub "${RoleStackName}:ECSTaskRole"
        Command:
          - "--annotated_vcf_s3_path"
          - "Ref::AnnotatedVCFS3Path"
          - "--vcf_s3_path"
          - "Ref::VCFS3Path"
          - "--cmd_args"
          - "Ref::CommandArgs"
          - "--working_dir"
          - "Ref::WorkingDir"
        MountPoints:
          - ContainerPath: "/scratch"
            ReadOnly: false
            SourceVolume: docker_scratch
        Volumes:
          - Name: docker_scratch
            Host:
              SourcePath: "/docker_scratch"

  SamtoolsStatsJobDefinition:
    Type: AWS::Batch::JobDefinition
    Properties:
      JobDefinitionName: "SamtoolsStats"
      Type: container
      RetryStrategy:
        Attempts: 1
      Parameters:
        ReferenceS3Path: "s3://aws-batch-genomics-resources/reference/hg38.fa"
        BAMS3Path: !Sub "s3://${JobResultsBucket}/NA12878_states_1/bam/sorted.bam"
        BAMStatsS3Path: !Sub "s3://${JobResultsBucket}/NA12878_states_1/bam/sorted.bam.stats"
        WorkingDir: "/scratch"
      ContainerProperties:
        Image: "rulaszek/samtools-stats"
        Vcpus: 4
        Memory: 10000
        JobRoleArn:
          Fn::ImportValue: !Sub "${RoleStackName}:ECSTaskRole"
        Command:
          - "--bam_s3_path"
          - "Ref::BAMS3Path"
          - "--bam_stats_s3_path"
          - "Ref::BAMStatsS3Path"
          - "--reference_s3_path"
          - "Ref::ReferenceS3Path"
          - "--working_dir"
          - "Ref::WorkingDir"
        MountPoints:
          - ContainerPath: "/scratch"
            ReadOnly: false
            SourceVolume: docker_scratch
        Volumes:
          - Name: docker_scratch
            Host:
              SourcePath: "/docker_scratch"

Here is the new CloudFormation script that deploys the new workflow:

AWSTemplateFormatVersion: 2010-09-09

Description: State Machine for batch benomics

Parameters:
  RoleStackName:
    Description: "Stack that deploys roles for genomic workflow"
    Type: String
  VPCStackName:
    Description: "Stack that deploys vps for genomic workflow"
    Type: String

Resources:
  # S3
  GenomicWorkflow:
    Type: AWS::StepFunctions::StateMachine
    Properties:
      RoleArn:
        Fn::ImportValue: !Sub "${RoleStackName}:StatesExecutionRole"
      DefinitionString: !Sub |-
        {
           "Comment":"A simple example that submits a job to AWS Batch",
           "StartAt":"RunIsaacJob",
           "States":{
              "RunIsaacJob":{
                 "Type":"Task",
                 "Resource":"arn:aws:states:::batch:submitJob.sync",
                 "Parameters":{
                    "JobDefinition":"Isaac",
                    "JobName.$":"$.isaac.JobName",
                    "JobQueue":"HighPriority",
                    "Parameters.$": "$.isaac"
                 },
                 "TimeoutSeconds": 900,
                 "HeartbeatSeconds": 60,
                 "Next":"Parallel",
                 "InputPath":"$",
                 "ResultPath":"$.status",
                 "Retry" : [
                    {
                      "ErrorEquals": [ "States.Timeout" ],
                      "IntervalSeconds": 3,
                      "MaxAttempts": 2,
                      "BackoffRate": 1.5
                    }
                 ]
              },
              "Parallel":{
                 "Type":"Parallel",
                 "Next":"FinalState",
                 "Branches":[
                    {
                       "StartAt":"RunStrelkaJob",
                       "States":{
                          "RunStrelkaJob":{
                             "Type":"Task",
                             "Resource":"arn:aws:states:::batch:submitJob.sync",
                             "Parameters":{
                                "JobDefinition":"Strelka",
                                "JobName.$":"$.strelka.JobName",
                                "JobQueue":"HighPriority",
                                "Parameters.$": "$.strelka"
                             },
                             "TimeoutSeconds": 900,
                             "HeartbeatSeconds": 60,
                             "Next":"RunSnpEffJob",
                             "InputPath":"$",
                             "ResultPath":"$.status",
                             "Retry" : [
                                {
                                  "ErrorEquals": [ "States.Timeout" ],
                                  "IntervalSeconds": 3,
                                  "MaxAttempts": 2,
                                  "BackoffRate": 1.5
                                }
                             ]
                          },
                          "RunSnpEffJob":{
                             "Type":"Task",
                             "Resource":"arn:aws:states:::batch:submitJob.sync",
                             "Parameters":{
                                "JobDefinition":"SNPEff",
                                "JobName.$":"$.snpeff.JobName",
                                "JobQueue":"HighPriority",
                                "Parameters.$": "$.snpeff"
                             },
                             "TimeoutSeconds": 900,
                             "HeartbeatSeconds": 60,
                             "Retry" : [
                                {
                                  "ErrorEquals": [ "States.Timeout" ],
                                  "IntervalSeconds": 3,
                                  "MaxAttempts": 2,
                                  "BackoffRate": 1.5
                                }
                             ],
                             "End":true
                          }
                       }
                    },
                    {
                       "StartAt":"RunSamtoolsStatsJob",
                       "States":{
                          "RunSamtoolsStatsJob":{
                             "Type":"Task",
                             "Resource":"arn:aws:states:::batch:submitJob.sync",
                             "Parameters":{
                                "JobDefinition":"SamtoolsStats",
                                "JobName.$":"$.samtools.JobName",
                                "JobQueue":"HighPriority",
                                "Parameters.$": "$.samtools"
                             },
                             "TimeoutSeconds": 900,
                             "HeartbeatSeconds": 60,
                             "End":true,
                             "Retry" : [
                                {
                                  "ErrorEquals": [ "States.Timeout" ],
                                  "IntervalSeconds": 3,
                                  "MaxAttempts": 2,
                                  "BackoffRate": 1.5
                                }
                             ]
                          }
                       }
                    }
                 ]
              },
              "FinalState":{
                 "Type":"Pass",
                 "End":true
              }
           }
        }

Outputs:
  GenomicsWorkflowArn:
    Description: GenomicWorkflow ARN
    Value: !Ref GenomicWorkflow
  StackName:
    Description: StackName
    Value: !Sub ${AWS::StackName}

Conclusion

AWS Step Functions service integrations are a great way to simplify creating complex workflows with asynchronous steps. While we highlighted the use case with AWS Batch today, there are many other ways that healthcare and life sciences customers can use this new feature, such as with message processing.

For more information about how AWS can enable your genomics workloads, be sure to check out the AWS Genomics page.

We’ve updated the open-source project to take advantage of the new AWS Batch integration in Step Functions.  You can find the changes aws-batch-genomics/tree/v2.0.0 folder.

Original posts in this four-part series:

Happy coding!

New – Compute, Database, Messaging, Analytics, and Machine Learning Integration for AWS Step Functions

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-compute-database-messaging-analytics-and-machine-learning-integration-for-aws-step-functions/

AWS Step Functions is a fully managed workflow service for application developers. You can think & work at a high level, connecting and coordinating activities in a reliable and repeatable way, while keeping your business logic separate from your workflow logic. After you design and test your workflows (which we call state machines), you can deploy them at scale, with tens or even hundreds of thousands running independently and concurrently. Step Functions tracks the status of each workflow, takes care of retrying activities on transient failures, and also simplifies monitoring and logging. To learn more, step through the Create a Serverless Workflow with AWS Step Functions and AWS Lambda tutorial.

Since our launch at AWS re:Invent 2016, our customers have made great use of Step Functions (my post, Things go Better with Step Functions describes a real-world use case). Our customers love the fact that they can easily call AWS Lambda functions to implement their business logic, and have asked us for even more options.

More Integration, More Power
Today we are giving you the power to use eight more AWS services from your Step Function state machines. Here are the new actions:

DynamoDB – Get an existing item from an Amazon DynamoDB table; put a new item into a DynamoDB table.

AWS Batch – Submit a AWS Batch job and wait for it to complete.

Amazon ECS – Run an Amazon ECS or AWS Fargate task using a task definition.

Amazon SNS – Publish a message to an Amazon Simple Notification Service (SNS) topic.

Amazon SQS – Send a message to an Amazon Simple Queue Service (SQS) queue.

AWS Glue – Start a AWS Glue job run.

Amazon SageMaker – Create an Amazon SageMaker training job; create a SageMaker transform job (learn more by reading New Features for Amazon SageMaker: Workflows, Algorithms, and Accreditation).

You can use these actions individually or in combination with each other. To help you get started, we’ve built some cool samples that will show you how to manage a batch job, manage a container task, copy data from DynamoDB, retrieve the status of a Batch job, and more. For example, here’s a visual representation of the sample that copies data from DynamoDB to SQS:

The sample (available to you as an AWS CloudFormation template) creates all of the necessary moving parts including a Lambda function that will populate (seed) the table with some test data. After I create the stack I can locate the state machine in the Step Functions Console and execute it:

I can inspect each step in the console; the first one (Seed the DynamoDB Table) calls a Lambda function that creates some table entries and returns a list of keys (message ids):

The third step (Send Message to SQS) starts with the following input:

And delivers this output, including the SQS MessageId:

As you can see, the state machine took care of all of the heavy lifting — calling the Lambda function, iterating over the list of message IDs, and calling DynamoDB and SQS for each one. I can run many copies at the same time:

I’m sure you can take this example as a starting point and build something awesome with it; be sure to check out the other samples and templates for some ideas!

If you are already building and running your own state machines, you should know about Magic ARNs and Parameters:

Magic ARNs – Each of these new operations is represented by a special “magic” (that’s the technical term Tim used) ARN. There’s one for sending to SQS, another one for running a batch job, and so forth.

Parameters – You can use the Parameters field in a Task state to control the parameters that are passed to the service APIs that implement the new functions. Your state machine definitions can include static JSON or references (in JsonPath form) to specific elements in the state input.

Here’s how the Magic ARNs and Parameters are used to define a state:

   "Read Next Message from DynamoDB": {
      "Type": "Task",
      "Resource": "arn:aws:states:::dynamodb:getItem",
      "Parameters": {
        "TableName": "StepDemoStack-DDBTable-1DKVAVTZ1QTSH",
        "Key": {
          "MessageId": {"S.$": "$.List[0]"}
        }
      },
      "ResultPath": "$.DynamoDB",
      "Next": "Send Message to SQS"
    },

Available Now
The new integrations are available now and you can start using them today in all AWS Regions where Step Functions are available. You pay the usual charge for each state transition and for the AWS services that you consume.

Jeff;

ICYMI: Serverless Q3 2018

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/icymi-serverless-q3-2018/

Welcome to the third edition of the AWS Serverless ICYMI (in case you missed it) quarterly recap. Every quarter, we share all of the most recent product launches, feature enhancements, blog posts, webinars, Twitch live streams, and other interesting things that you might have missed!

If you didn’t see them, catch our Q1 ICYMI and Q2 ICYMI posts for what happened then.

So, what might you have missed this past quarter? Here’s the recap.

AWS Amplify CLI

In August, AWS Amplify launched the AWS Amplify Command Line Interface (CLI) toolchain for developers.

The AWS Amplify CLI enables developers to build, test, and deploy full web and mobile applications based on AWS Amplify directly from their CLI. It has built-in helpers for configuring AWS services such as Amazon Cognito for Auth , Amazon S3 and Amazon DynamoDB for storage, and Amazon API Gateway for APIs. With these helpers, developers can configure AWS services to interact with applications built in popular web frameworks such as React.

Get started with the AWS Amplify CLI toolchain.

New features

Rejoice Microsoft application developers: AWS Lambda now supports .NET Core 2.1 and PowerShell Core!

AWS SAM had a few major enhancements to help in both testing and debugging functions. The team launched support to locally emulate an endpoint for Lambda so that you can run automated tests against your functions. This differs from the existing functionality that emulated a proxy similar to API Gateway in front of your function. Combined with the new improved support for ‘sam local generate-event’ to generate over 50 different payloads, you can now test Lambda function code that would be invoked by almost all of the various services that interface with Lambda today. On the operational front, AWS SAM can now fetch, tail, and filter logs generated by your functions running live on AWS. Finally, with integration with Delve, a debugger for the Go programming language, you can more easily debug your applications locally.

If you’re part of an organization that uses AWS Service Catalog, you can now launch applications based on AWS SAM, too.

The AWS Serverless Application Repository launched new search improvements to make it even faster to find serverless applications that you can deploy.

In July, AWS AppSync added HTTP resolvers so that now you can query your REST APIs via GraphQL! API Inception! AWS AppSync also added new built-in scalar types to help with data validation at the GraphQL layer instead of having to do this in code that you write yourself. For building your GraphQL-based applications on AWS AppSync, an enhanced no-code GraphQL API builder enables you to model your data, and the service generates your GraphQL schema, Amazon DynamoDB tables, and resolvers for your backend. The team also published a Quick Start for using Amazon Aurora as a data source via a Lambda function. Finally, the service is now available in the Asia Pacific (Seoul) Region.

Amazon API Gateway announced support for AWS X-Ray!

With X-Ray integrated in API Gateway, you can trace and profile application workflows starting at the API layer and going through the backend. You can control the sample rates at a granular level.

API Gateway also announced improvements to usage plans that allow for method level throttling, request/response parameter and status overrides, and higher limits for the number of APIs per account for regional, private, and edge APIs. Finally, the team added support for the OpenAPI 3.0 API specification, the next generation of OpenAPI 2, formerly known as Swagger.

AWS Step Functions is now available in the Asia Pacific (Mumbai) Region. You can also build workflows visually with Step Functions and trigger them directly with AWS IoT Rules.

AWS [email protected] now makes the HTTP Request Body for POST and PUT requests available.

AWS CloudFormation announced Macros, a feature that enables customers to extend the functionality of AWS CloudFormation templates by calling out to transformations that Lambda powers. Macros are the same technology that enables SAM to exist.

Serverless posts

July:

August:

September:

Tech Talks

We hold several Serverless tech talks throughout the year, so look out for them in the Serverless section of the AWS Online Tech Talks page. Here are the three tech talks that we delivered in Q3:

Twitch

We’ve been busy streaming deeply technical content to you the past few months! Check out awesome sessions like this one by AWS’s Heitor Lessa and Jason Barto diving deep into Continuous Learning for ML and the entire “Build on Serverless” playlist.

For information about upcoming broadcasts and recent live streams, keep an eye on AWS on Twitch for more Serverless videos and on the Join us on Twitch AWS page.

For AWS Partners

In September, we announced the AWS Serverless Navigate program for AWS APN Partners. Via this program, APN Partners can gain a deeper understanding of the AWS Serverless Platform, including many of the services mentioned in this post. The program’s phases help partners learn best practices such as the Well-Architected Framework, business and technical concepts, and growing their business’s ability to better support AWS customers in their serverless projects.

Check out more at AWS Serverless Navigate.

In other news

AWS re:Invent 2018 is coming in just a few weeks! For November 26–30 in Las Vegas, Nevada, join tens of thousands of AWS customers to learn, share ideas, and see exciting keynote announcements. The agenda for Serverless talks contains over 100 sessions where you can hear about serverless applications and technologies from fellow AWS customers, AWS product teams, solutions architects, evangelists, and more.

Register for AWS re:Invent now!

Want to get a sneak peek into what you can expect at re:Invent this year? Check out the awesome re:Invent Guides put out by AWS Community Heroes. AWS Community Hero Eric Hammond (@esh on Twitter) published one for advanced serverless attendees that you will want to read before the big event.

What did we do at AWS re:Invent 2017? Check out our recap: Serverless @ re:Invent 2017.

Still looking for more?

The Serverless landing page has lots of information. The resources page contains case studies, webinars, whitepapers, customer stories, reference architectures, and even more Getting Started tutorials. Check it out!

Implementing Serverless Video Subtitles

Post Syndicated from Christie Gifrin original https://aws.amazon.com/blogs/compute/implementing-serverless-video-subtitles/

This post is courtesy of Maxime Thomas, DevOps Partner Solutions Architect – AWS

This story begins when I joined AWS at the beginning of the year. I had a hard time during my ramp-up period trying to handle the amount of information coming from all directions. Technical training, meetings, new colleagues, in a worldwide company—the volume of information was overwhelming. However, my first priority was to get my AWS Certified Solutions Architect — Professional certification. This gave me plenty of opportunities to learn and focus on all of the new domains I had never heard about.

This intensive self-paced training quickly gave me a way to get experience. I was opening the AWS Management Console, diving deep into the service documentation, and comparing to my own experience and understanding of production constraints. I wasn’t disappointed by the scope of the platform and its various capabilities.

However, as a native French speaker, I struggled a bit because all of the training videos were in English. Okay, it’s not a problem when you speak another language for 20 minutes a day, but 6 hours every day was exhausting. (It did help me to learn the language faster.) I looked at all of those training videos, and I thought: It would be so much easier if they had French subtitles!

But they didn’t. I continued my deep dive into the serverless world, which led me to another consideration: It would be cool to have a service that could generate subtitles from a video in any language.

Wait–the AWS platform has everything we need to do that!

Video: Playing a video after subtitle generation

I mean, what is the process of translation when you watch a video? It’s basically the following:

  • Listen
  • Extract the information
  • Translate

Proof of concept

I decided to focus on this subject to understand how I could build that kind of system. My pitch was this: The system can receive a video input, extract the audio track, transcribe it, and generate different subtitle files for your video. Since AWS re:Invent 2017, AWS has announced several services that helped me with my proof of concept:

Finally, the way to define subtitles has been specified by the World Wide Web Consortium under the WebVTT format, providing a simple way to produce subtitles for online videos.

I proved the concept in barely 20 minutes with a video file, an Amazon S3 bucket, some AWS IAM roles, and access to the beta versions of the different services. It was going to work, so I decided to transform it into a demo project.

Solution

The fun part of this project was doing it in a serverless way using AWS Lambda and AWS Step Functions. I could have developed it in other ways, but I eliminated them quickly: a custom code base on Amazon EC2 would take too long to code and was excessive computation for what I needed; a container with the code base on Amazon Elastic Container Service would be better, but still was overkill from a compute perspective.

So, Lambda was the solution of choice for compute. Step Functions would take care of coordinating the workflow of the application and the different Lambda functions, so I didn’t need to build that logic into the functions themselves. I split the solution into two parts:

  • The backend processes an MP4 file and outputs the same file plus WebVTT files for each language
  • The frontend provides a web interface to submit the video and render the result in a fancy way

The following image shows the solution’s architecture.

Backend

The solution consists of a Step Functions state machine that executes the following sequence triggered by an Amazon S3 event notification:

  1. Transcode the file with Elastic Transcoder using its API.
  2. Wait two minutes, which is enough time for transcoding.
  3. Submit the file to Amazon Transcribe and enter the following loop:
    1. Wait for 30 seconds.
    2. Check the API to know if transcription is over. If it is, go to step 4; otherwise, go back to step 3.1.
  4. Process the transcript to become a VTT file, which goes to Amazon Translate several times to get a version of the file in another language.
  5. Clean and wrap up.

The following image shows this sequence as a Step Functions state machine.

The power of Step Functions appears in the integration of such a sequence. You can set up different Lambda functions at each stage of the sequence, put them in parallel if you need to, and handle errors with a retry and fallback. Everything is declarative in the JSON that defines the state machine. The input object that the state machine evaluates between each transition is the one that you provide at the first call. You can enrich it as the state machine executes and gathers more information at later steps.

For instance, if you pass a JSON object as input, it goes through all the way through, and each step can add information that wasn’t there at the beginning of the workflow. This is useful when your decision tree is creating elements and you need to refer to it in other steps.

I also set up an Amazon DynamoDB table to store the state of each file for further processing on the front end.

Frontend

The front end’s setup is easy: an Amazon S3 bucket with the static website feature on and a combination of HTML, AWS SDK for JavaScript in the Browser, and a JavaScript framework to handle calls to the AWS Platform. The sequence has the following steps:

  1. Load HTML, CSS, and JavaScript from a bucket in Amazon S3.
  2. Specific JavaScript for this project does the following:
    • Sets up the AWS SDK
    • Connects to Amazon Cognito against a predefined identity pool set up for anonymous users
    • Loads a custom IAM role that gives access to an Amazon S3 bucket
  3. The user uploads an MP4 file to the bucket, and the backend process starts.
  4. A JavaScript loop checks the DynamoDB table where the state of the process is stored and do the following:
    • Add a description of the video process and show the state of the process.
    • Update the progress bar in the description block to inform the user what the process is doing
    • Update the video links when the process is over.
  5. When the process completes, the user can choose the list item to get an HTML5 video player with the VTT files loaded.

Considerations

Keep the following points in mind:

  • This isn’t a production solution. Don’t use it as is.
  • The solution is designed for videos where a person speaks clearly. I tried with non- native English-speaking people, and results are poor at the moment.
  • The solution is adapted for videos without background noise or music. I checked with different types of videos (movie scenes, music videos, and ads), and results are poor.
  • Processing time depends on the length of the original video.
  • The frontend check is basic. Improve it by implementing WebSockets to avoid polling from the browser, which it doesn’t scale.

What’s next?

Feel free to try out the code yourself and customize it for your own needs! This project is open source. To download the project files, see Serverless Subtitles on the AWSLabs GitHub website. Feel free to contribute (Pull Requests only).

ICYMI: Serverless Q2 2018

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/icymi-serverless-q2-2018/

The better-late-than-never edition!

Welcome to the second edition of the AWS Serverless ICYMI (in case you missed it) quarterly recap. Every quarter, we share all of the most recent product launches, feature enhancements, blog posts, webinars, Twitch live streams, and other interesting things that you might have missed!

The second quarter of 2018 flew by so fast that we didn’t get a chance to get out this post! We’re playing catch up, and making sure that the Q3 post launches a bit sooner.

Missed our Q1 ICYMI? Catch up on everything you missed.

So, what might you have missed this past quarter? Here’s the recap….

AWS AppSync

In April, AWS AppSync went generally available (GA)!

AWS AppSync provides capabilities to build real-time, collaborative mobile and web applications. It uses GraphQL, an open standard query language that makes it easy to request data from the cloud. When AWS AppSync went GA, several features also launched. These included better in-console testing with mock data, Amazon CloudWatch support, AWS CloudFormation support, and console log access.

AWS Amplify then also launched support for AWS AppSync to make it even easier for developers to build JavaScript-based applications that can integrate with several AWS services via its simplified GraphQL interface. Click here for the documentation.

AppSync expanded to more Regions and added OIDC support in May.

New features

AWS Lambda made Node.js v8.10 available. 8.10 brings some significant improvements in supporting async/await calls that simplify the traditional callback style common in Node.js applications. Developers can also see performance improvements and lower memory consumption.

In June, the long-awaited support for Amazon SQS as a trigger for Lambda launched! With this launch, customers can easily create Lambda functions that directly consume from SQS queues without needing to manage scheduling for the invocations to poll a queue. Today, SQS is one of the most popular AWS services. It’s used by hundreds of thousands of customers at massive scales as one of the fundamental building blocks of many applications.

AWS Lambda gained support for AWS Config. With AWS Config, you can track changes to the Lambda function, runtime environments, tags, handler name, code size, memory allocation, timeout settings, and concurrency settings. You can also record changes to Lambda IAM execution roles, subnets, and security group associations. Even more fun, you can use AWS Lambda functions in AWS Config Rules to check if your Lambda functions conform to certain standards as decided by you. Inception!

Amazon API Gateway announced the availability of private API endpoints! With private API endpoints, you can now create APIs that are completely inside your own virtual private clouds (VPCs). You can use awesome API Gateway features such as Lambda custom authorizers and Amazon Cognito integration. Back your APIs with Lambda, containers running in Amazon ECS, ECS supporting AWS Fargate, and Amazon EKS, as well as on Amazon EC2.

Amazon API Gateway also launched two really useful features; support for Resource Polices for APIs and Cross-Account AWS Lambda Authorizers and Integrations. Both features offer capabilities to help developers secure their APIs whether they are public or private.

AWS SAM went open source and the AWS SAM Local tool has now been relaunched as AWS SAM CLI! As part of the relaunch, AWS SAM CLI has gained numerous capabilities, such as helping you start a brand new serverless project and better template validation. With version 0.4.0, released in June, we added Python 3.6 support. You can now perform new project creation, local development and testing, and then packaging and deployment of serverless applications for all actively supported Lambda languages.

AWS Step Functions expanded into more Regions, increased default limits, became HIPPA eligible, and is also now available in AWS GovCloud (US).

AWS [email protected] added support for Node.js v8.10.

Serverless posts

April:

May:

June:

Webinars

Here are the three webinars we delivered in Q2. We hold several Serverless webinars throughout the year, so look out for them in the Serverless section of the AWS Online Tech Talks page:

Twitch

We’ve been so busy livestreaming on Twitch that you are most certainly missing out if you aren’t following along!

Here are links to all of the Serverless Twitch sessions that we’ve done.

Keep an eye on AWS on Twitch for more Serverless videos and on the Join us on Twitch AWS page for information about upcoming broadcasts and recent live streams.

Worthwhile reading

Serverless: Changing the Face of Business Economics, A Venture Capital and Startup Perspective
In partnership with three prominent venture capitalists—Greylock Partners, Madrona Venture Group, and Accel—AWS released a whitepaper on the business benefits to serverless. Check it out to hear about opportunities for companies in the space and how several have seen significant benefits from a serverless approach.

Serverless Streaming Architectures and Best Practices
Streaming workloads are some of the biggest workloads for AWS Lambda. Customers of all shapes and sizes are using streaming workloads for near real-time processing of data from services such as Amazon Kinesis Streams. In this whitepaper, we explore three stream-processing patterns using a serverless approach. For each pattern, we describe how it applies to a real-world use case, the best practices and considerations for implementation, and cost estimates. Each pattern also includes a template that enables you to quickly and easily deploy these patterns in your AWS accounts.

In other news

AWS re:Invent 2018 is coming! From November 26—30 in Las Vegas, Nevada, join tens of thousands of AWS customers to learn, share ideas, and see exciting keynote announcements. The agenda for Serverless talks is just starting to show up now and there are always lots of opportunities to hear about serverless applications and technologies from fellow AWS customers, AWS product teams, solutions architects, evangelists, and more.

Register for AWS re:Invent now!

What did we do at AWS re:Invent 2017? Check out our recap here: Serverless @ re:Invent 2017

Attend a Serverless event!

“ServerlessDays are a family of events around the world focused on fostering a community around serverless technologies.” —https://serverlessdays.io/

The events are run by local volunteers as vendor-agnostic events with a focus on community, accessibility, and local representation. Dozens of cities around the world have folks interested in these events, with more popping up regularly.

Find a ServerlessDays event happening near you. Come ready to learn and connect with other developers, architects, hobbyists, and practitioners. AWS has members from our team at every event to connect with and share ideas and content. Maybe, just maybe, we’ll even hand out cool swag!

AWS Serverless Apps for Social Good Hackathon

Our AWS Serverless Apps for Social Good hackathon invites you to publish serverless applications for popular use cases. Your app can use Alexa skills, machine learning, media processing, monitoring, data transformation, notification services, location services, IoT, and more.

We’re looking for apps that can be used as standalone assets or as inputs that can be combined with other applications to add to the open-source serverless ecosystem. This supports the work being done by developers and nonprofit organizations around the world.

Winners will be awarded cash prizes and the opportunity to direct donations to the nonprofit partner of their choice.

Still looking for more?

The AWS Serverless landing page has lots of information. The resources page contains case studies, webinars, whitepapers, customer stories, reference architectures, and even more Getting Started tutorials. Check it out!

Powering HIPAA-compliant workloads using AWS Serverless technologies

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/powering-hipaa-compliant-workloads-using-aws-serverless-technologies/

This post courtesy of Mayank Thakkar, AWS Senior Solutions Architect

Serverless computing refers to an architecture discipline that allows you to build and run applications or services without thinking about servers. You can focus on your applications, without worrying about provisioning, scaling, or managing any servers. You can use serverless architectures for nearly any type of application or backend service. AWS handles the heavy lifting around scaling, high availability, and running those workloads.

The AWS HIPAA program enables covered entities—and those business associates subject to the U.S. Health Insurance Portability and Accountability Act of 1996 (HIPAA)—to use the secure AWS environment to process, maintain, and store protected health information (PHI). Based on customer feedback, AWS is trying to add more services to the HIPAA program, including serverless technologies.

AWS recently announced that AWS Step Functions has achieved HIPAA-eligibility status and has been added to the AWS Business Associate Addendum (BAA), adding to a growing list of HIPAA-eligible services. The BAA is an AWS contract that is required under HIPAA rules to ensure that AWS appropriately safeguards PHI. The BAA also serves to clarify and limit, as appropriate, the permissible uses and disclosures of PHI by AWS, based on the relationship between AWS and customers and the activities or services being performed by AWS.

Along with HIPAA eligibility for most of the rest of the serverless platform at AWS, Step Functions inclusion is a major win for organizations looking to process PHI using serverless technologies, opening up numerous new use cases and patterns. You can still use non-eligible services to orchestrate the storage, transmission, and processing of the metadata around PHI, but not the PHI itself.

In this post, I examine some common serverless use cases that I see in the healthcare and life sciences industry and show how AWS Serverless can be used to build powerful, cost-efficient, HIPAA-eligible architectures.

Provider directory web application

Running HIPAA-compliant web applications (like provider directories) on AWS is a common use case in the healthcare industry. Healthcare providers are often looking for ways to build and run applications and services without thinking about servers. They are also looking for ways to provide the most cost-effective and scalable delivery of secure health-related information to members, providers, and partners worldwide.

Unpredictable access patterns and spiky workloads often force organizations to provision for peak in these cases, and they end up paying for idle capacity. AWS Auto Scaling solves this challenge to a great extent but you still have to manage and maintain the underlying servers from a patching, high availability, and scaling perspective. AWS Lambda (along with other serverless technologies from AWS) removes this constraint.

The above architecture shows a serverless way to host a customer-facing website, with Amazon S3 being used for hosting static files (.js, .css, images, and so on). If your website is based on client-side technologies, you can eliminate the need to run a web server farm. In addition, you can use S3 features like server-side encryption and bucket access policies to lock down access to the content.

Using Amazon CloudFront, a global content delivery network, with S3 origins can bring your content closer to the end user and cut down S3 access costs, by caching the content at the edge. In addition, using AWS [email protected] gives you an ability to bring and execute your own code to customize the content that CloudFront delivers. That significantly reduces latency and improves the end user experience while maintaining the same Lambda development model. Some common examples include checking cookies, inspecting headers or authorization tokens, rewriting URLs, and making calls to external resources to confirm user credentials and generate HTTP responses.

You can power the APIs needed for your client application by using Amazon API Gateway, which takes care of creating, publishing, maintaining, monitoring, and securing APIs at any scale. API Gateway also provides robust ways to provide traffic management, authorization and access control, monitoring, API version management, and the other tasks involved in accepting and processing up to hundreds of thousands of concurrent API calls. This allows you to focus on your business logic. Direct, secure, and authenticated integration with Lambda functions allows this serverless architecture to scale up and down seamlessly with incoming traffic.

The CloudFront integration with AWS WAF provides a reliable way to protect your application against common web exploits that could affect application availability, compromise security, or consume excessive resources.

API Gateway can integrate directly with Lambda, which by default can access the public resources. Lambda functions can be configured to access your Amazon VPC resources as well. If you have extended your data center to AWS using AWS Direct Connect or a VPN connection, Lambda can access your on-premises resources, with the traffic flowing over your VPN connection (or Direct Connect) instead of the public internet.

All the services mentioned above (except Amazon EC2) are fully managed by AWS in terms of high availability, scaling, provisioning, and maintenance, giving you a cost-effective way to host your web applications. It’s pay-as-you-go vs. pay-as-you-provision. Spikes in demand, typically encountered during the enrollment season, are handled gracefully, with these services scaling automatically to meet demand and then scale down. You get to keep your costs in control.

All AWS services referenced in the above architecture are HIPAA-eligible, thus enabling you to store, process, and transmit PHI, as long as it complies with the BAA.

Medical device telemetry (ingesting data @ scale)

The ever-increasing presence of IoT devices in the healthcare industry has created the challenges of ingesting this data at scale and making it available for processing as soon as it is produced. Processing this data in real time (or near-real time) is key to delivering urgent care to patients.

The infinite scalability (theoretical) along with low startup times offered by Lambda makes it a great candidate for these kinds of use cases. Balancing ballooning healthcare costs and timely delivery of care is a never-ending challenge. With subsecond billing and no charge for non-execution, Lambda becomes the best choice for AWS customers.

These end-user medical devices emit a lot of telemetry data, which requires constant analysis and real-time tracking and updating. For example, devices like infusion pumps, personal use dialysis machines, and so on require tracking and alerting of device consumables and calibration status. They also require updates for these settings. Consider the following architecture:

Typically, these devices are connected to an edge node or collector, which provides sufficient computing resources to authenticate itself to AWS and start streaming data to Amazon Kinesis Streams. The collector uses the Kinesis Producer Library to simplify high throughput to a Kinesis data stream. You can also use the server-side encryption feature, supported by Kinesis Streams, to achieve encryption-at-rest. Kinesis provides a scalable, highly available way to achieve loose coupling between data-producing (medical devices) and data-consuming (Lambda) layers.

After the data is transported via Kinesis, Lambda can then be used to process this data in real time, storing derived insights in Amazon DynamoDB, which can then power a near-real time health dashboard. Caregivers can access this real-time data to provide timely care and manage device settings.

End-user medical devices, via the edge node, can also connect to and poll an API hosted on API Gateway to check for calibration settings, firmware updates, and so on. The modifications can be easily updated by admins, providing a scalable way to manage these devices.

For historical analysis and pattern prediction, the staged data (stored in S3), can be processed in batches. Use AWS Batch, Amazon EMR, or any custom logic running on a fleet of Amazon EC2 instances to gain actionable insights. Lambda can also be used to process data in a MapReduce fashion, as detailed in the Ad Hoc Big Data Processing Made Simple with Serverless MapReduce post.

You can also build high-throughput batch workflows or orchestrate Apache Spark applications using Step Functions, as detailed in the Orchestrate Apache Spark applications using AWS Step Functions and Apache Livy post. These insights can then be used to calibrate the medical devices to achieve effective outcomes.

Use Lambda to load data into Amazon Redshift, a cost-effective, petabyte-scale data warehouse offering. One of my colleagues, Ian Meyers, pointed this out in his Zero-Administration Amazon Redshift Database Loader post.

Mobile diagnostics

Another use case that I see is using mobile devices to provide diagnostic care in out-patient settings. These environments typically lack the robust IT infrastructure that clinics and hospitals can provide, and often are subjected to intermittent internet connectivity as well. Various biosensors (otoscopes, thermometers, heart rate monitors, and so on) can easily talk to smartphones, which can then act as aggregators and analyzers before forwarding the data to a central processing system. After the data is in the system, caregivers and practitioners can then view and act on the data.

In the above diagram, an application running on a mobile device (iOS or Android) talks to various biosensors and collects diagnostic data. Using AWS mobile SDKs along with Amazon Cognito, these smart devices can authenticate themselves to AWS and access the APIs hosted on API Gateway. Amazon Cognito also offers data synchronization across various mobile devices, which helps you to build “offline” features in your mobile application. Amazon Cognito Sync resolves conflicts and intermittent network connectivity, enabling you to focus on delivering great app experiences instead of creating and managing a user data sync solution.

You can also use CloudFront and [email protected], as detailed in the first use case of this post, to cache content at edge locations and provide some light processing closer to your end users.

Lambda acts as a middle tier, processing the CRUD operations on the incoming data and storing it in DynamoDB, which is again exposed to caregivers through another set of Lambda functions and API Gateway. Caregivers can access the information through a browser-based interface, with Lambda processing the middle-tier application logic. They can view the historical data, compare it with fresh data coming in, and make corrections. Caregivers can also react to incoming data and issue alerts, which are delivered securely to the smart device through Amazon SNS.

Also, by using DynamoDB Streams and its integration with Lambda, you can implement Lambda functions that react to data modifications in DynamoDB tables (and hence, incoming device data). This gives you a way to codify common reactions to incoming data, in near-real time.

Lambda ecosystem

As I discussed in the above use cases, Lambda is a powerful, event-driven, stateless, on-demand compute platform offering scalability, agility, security, and reliability, along with a fine-grained cost structure.

For some organizations, migrating from a traditional programing model to a microservices-driven model can be a steep curve. Also, to build and maintain complex applications using Lambda, you need a vast array of tools, all the way from local debugging support to complex application performance monitoring tools. The following list of tools and services can assist you in building world-class applications with minimal effort:

  • AWS X-Ray is a distributed tracing system that allows developers to analyze and debug production for distributed applications, such as those built using a microservices (Lambda) architecture. AWS X-Ray was recently added to the AWS BAA, opening the doors for processing PHI workloads.
  • AWS Step Functions helps build HIPAA-compliant complex workflows using Lambda. It provides a way to coordinate the components of distributed applications and Lambda functions using visual workflows.
  • AWS SAM provides a fast and easy way of deploying serverless applications. You can write simple templates to describe your functions and their event sources (API Gateway, S3, Kinesis, and so on). AWS recently relaunched the AWS SAM CLI, which allows you to create a local testing environment that simulates the AWS runtime environment for Lambda. It allows faster, iterative development of your Lambda functions by eliminating the need to redeploy your application package to the Lambda runtime.

For more details, see the Serverless Application Developer Tooling webpage.

Conclusion

There are numerous other health care and life science use cases that customers are implementing, using Lambda with other AWS services. AWS is committed to easing the effort of implementing health care solutions in the cloud. Making Lambda HIPAA-eligible is just another milestone in the journey. For more examples of use cases, see Serverless. For the latest list of HIPAA-eligible services, see HIPAA Eligible Services Reference.

Amazon Comprehend Launches Asynchronous Batch Operations

Post Syndicated from Randall Hunt original https://aws.amazon.com/blogs/aws/amazon-comprehend-launches-asynchronous-batch-operations/

My colleague Jeff Barr last wrote about Amazon Comprehend, a service for discovering insights and relationships in text, when it launched at AWS re:Invent in 2017. Today, after iterating on customer feedback, we’re releasing a new asynchronous batch inferencing feature for Comprehend. Asynchronous batch operations work on documents stored in Amazon Simple Storage Service (S3) buckets and can perform all of the normal Comprehend operations like entity recognition, key phrase extraction, sentiment analysis, and language detection. These new asynchronous batch APIs support significantly larger documents than the single document and batch APIs, reducing the need for customers to truncate their documents for the service. Of course, all of the single document and batch synchronous API operations remain available for real-time results. The addition of asynchronous operations allows developers to choose the tools most suited for their applications. Let’s take a deeper look at the new API.

Asynchronous API Operations

The new batch APIs follow the same asynchronous call structure as Amazon Comprehend’s TopicDetection API. To analyze a collection of documents we first call one of the Start* APIs like StartDominantLanguageDetectionJob, StartEntitiesDetectionJob, StartKeyPhrasesDetectionJob, or StartSentimentDetectionJob.

Each of these APIs take an InputDataConfig and an OutputDataConfig that specify the incoming data format and location as well as where in S3 the results should be stored. The InputDataConfig specifies whether the input data should be treated as one document per file or one document per line.

Additionally we can name the job and include a unique request identifier for synchronization purposes. If we don’t supply these the Comprehend service will automatically generate them.

At the time of this writing the asynchronous operations support individual documents of up to 100KB for entity and key phrase detection, 1MB for language detection, and 5KB for sentiment detection. The total size of all files in the batch must be under 5GB and we cannot submit more than 1 million individual files per batch.

Now that we see what the API is doing let’s take a look at the updated console and start a job!

Amazon Comprehend Analysis Console

First I’ll navigate to the AWS Management Console and open Amazon Comprehend. Next I’ll select the new Analysis console.

From here I can create a new analysis job by clicking the create button in the top right of the console. I’ll create an entities detection job and select English as my document language. Then I’ll point the console at some sampe data.

Now I’ll configure my output data location and make sure the service role has access to that S3 bucket. Then I’ll start the job!

Here I can see the operation started in the console and I can wait until it’s complete to view the detailed results.

Here on the job page I can see the status of the job and the output location. If I download the results from the S3 location I can take a look at the detected entities in the sample text.

I’ve truncated the results here but mostly they look something like this:

{
  "Entities": [
    {
      "BeginOffset": 875,
      "EndOffset": 899,
      "Score": 0.9936646223068237,
      "Text": "University of California",
      "Type": "ORGANIZATION"
    },
    {
      "BeginOffset": 903,
      "EndOffset": 911,
      "Score": 0.9519965648651123,
      "Text": "Berkeley",
      "Type": "LOCATION"
    },
    {
      "BeginOffset": 974,
      "EndOffset": 992,
      "Score": 0.9981470108032227,
      "Text": "Christopher Monroe",
      "Type": "PERSON"
    },
    {
      "BeginOffset": 997,
      "EndOffset": 1010,
      "Score": 0.9992995262145996,
      "Text": "Mikhail Lukin",
      "Type": "PERSON"
    },
    {
      "BeginOffset": 1095,
      "EndOffset": 1099,
      "Score": 0.9990954399108887,
      "Text": "2017",
      "Type": "DATE"
    }
  ],
  "File": "Sample.txt",
  "Line": 8
}

Pretty cool! We could go through similar steps for setniment detection of key phrase detection. The fact that we can submit up to 5GB of data in a single batch means customers will spend less time transforming and truncating their documents.

I personally recommend a tool like AWS Step Functions for checking on the status of jobs programatically. It’s very easy to setup and build programatic analysis pipelines.

You could also use AWS Glue to call comprehend as part of your regular ETL operations as we mentioned in this blog post by Roy Hasson.

Additional Information

You can find detailed information on these new APIs in the documentation and learn more about the limits and best practices.

As previously mentioned the synchronous batch APIs are still available and work best for smaller sets of documents and smaller sizes.

As always don’t hesitate to share your feedback here or on twitter.

Randall

Orchestrate Multiple ETL Jobs Using AWS Step Functions and AWS Lambda

Post Syndicated from Moataz Anany original https://aws.amazon.com/blogs/big-data/orchestrate-multiple-etl-jobs-using-aws-step-functions-and-aws-lambda/

Extract, transform, and load (ETL) operations collectively form the backbone of any modern enterprise data lake. It transforms raw data into useful datasets and, ultimately, into actionable insight. An ETL job typically reads data from one or more data sources, applies various transformations to the data, and then writes the results to a target where data is ready for consumption. The sources and targets of an ETL job could be relational databases in Amazon Relational Database Service (Amazon RDS) or on-premises, a data warehouse such as Amazon Redshift, or object storage such as Amazon Simple Storage Service (Amazon S3) buckets. Amazon S3 as a target is especially commonplace in the context of building a data lake in AWS.

AWS offers AWS Glue, which is a service that helps author and deploy ETL jobs. AWS Glue is a fully managed extract, transform, and load service that makes it easy for customers to prepare and load their data for analytics. Other AWS Services also can be used to implement and manage ETL jobs. They include: AWS Database Migration Service (AWS DMS), Amazon EMR (using the Steps API), and even Amazon Athena.

The challenge of orchestrating an ETL workflow

How can we orchestrate an ETL workflow that involves a diverse set of ETL technologies? AWS Glue, AWS DMS, Amazon EMR, and other services support Amazon CloudWatch Events, which we could use to chain ETL jobs together. Amazon S3, the central data lake store, also supports CloudWatch Events. But relying on CloudWatch Events alone means that there’s no single visual representation of the ETL workflow. Also, tracing the overall ETL workflow’s execution status and handling error scenarios can become a challenge.

In this post, I show you how to use AWS Step Functions and AWS Lambda for orchestrating multiple ETL jobs involving a diverse set of technologies in an arbitrarily-complex ETL workflow. AWS Step Functions is a web service that enables you to coordinate the components of distributed applications and microservices using visual workflows. You build applications from individual components. Each component performs a discrete function, or task, allowing you to scale and change applications quickly.

Let’s look at an example ETL workflow.

Example datasets for the ETL workflow

For our example, we’ll use two publicly available Amazon QuickSight datasets.

The first dataset is a sales pipeline dataset, which contains a list of slightly more than 20,000 sales opportunity records for a fictitious business. Each record has fields that specify:

  • A date, potentially when an opportunity was identified.
  • The salesperson’s name.
  • A market segment to which the opportunity belongs.
  • Forecasted monthly revenue.

The second dataset is an online marketing metrics dataset. This dataset contains records of marketing metrics, aggregated by day. The metrics describe user engagement across various channels, such as websites, mobile, and social media, plus other marketing metrics. The two datasets are unrelated, but for the purpose of this example we’ll assume that they are related.

The example ETL workflow requirements

Imagine there’s a business user who needs to answer questions based on both datasets. Perhaps the user wants to explore the correlations between online user engagement metrics on the one hand, and forecasted sales revenue and opportunities generated on the other hand. The user engagement metrics include website visits, mobile users, and desktop users.

The steps in the ETL workflow are:

Process the Sales dataset (PSD). Read the Sales dataset. Group records by day, aggregating the Forecasted Monthly Revenue field. Rename fields to replace white space with underscores. Output the intermediary results to Amazon S3 in compressed Parquet format. Overwrite any previous output.

Process the Marketing dataset (PMD). Read the Marketing dataset. Rename fields to replace white space with underscores. Send the intermediary results to Amazon S3 in compressed Parquet format. Overwrite any previous output.

Join Marketing and Sales datasets (JMSD). Read the output of the processed Sales and Marketing datasets. Perform an inner join of both datasets on the date field. Sort in ascending order by date. Send the final joined dataset to Amazon S3, and overwrite any previous output.

So far, this ETL workflow can be implemented with AWS Glue, with the ETL jobs being chained by using job triggers. But you might have other requirements outside of AWS Glue that are part of your end-to-end data processing workflow, such as the following:

  • Both Sales and Marketing datasets are uploaded to an S3 bucket at random times in an interval of up to a week. The PSD job should start as soon as the Sales dataset file is uploaded. The PMD job should start as soon as the Marketing dataset file is uploaded. Parallel ETL jobs can start and finish anytime, but the final JMSD job can start only after all parallel ETL jobs are complete.
  • In addition to PSD and PMD jobs, the orchestration must support more parallel ETL jobs in the future that contribute to the final dataset aggregated by the JMSD job. The additional ETL jobs could be managed by AWS services, such as AWS Database Migration Service, Amazon EMR, Amazon Athena or other non-AWS services.

The data engineer takes these requirements and builds the following ETL workflow chart.

To fulfill the requirements, we need a generic ETL orchestration solution. A serverless solution is even better.

The ETL orchestration architecture and events

Let’s see how we can orchestrate an ETL workflow to fulfill the requirements using AWS Step Functions and AWS Lambda. The following diagram shows the ETL orchestration architecture and the main flow of events.

The main flow of events starts with an AWS Step Functions state machine. This state machine defines the steps in the orchestrated ETL workflow. A state machine can be triggered through Amazon CloudWatch based on a schedule, through the AWS Command Line Interface (AWS CLI), or using the various AWS SDKs in an AWS Lambda function or some other execution environment.

As the state machine execution progresses, it invokes the ETL jobs. As shown in the diagram, the invocation happens indirectly through intermediary AWS Lambda functions that you author and set up in your account. We’ll call this type of function an ETL Runner.

While the architecture in the diagram shows Amazon Athena, Amazon EMR, and AWS Glue, the accompanying code sample (aws-etl-orchestrator) includes a single ETL Runner, labeled AWS Glue Runner Function in the diagram. You can use this ETL Runner to orchestrate AWS Glue jobs. You can also follow the pattern and implement more ETL Runners to orchestrate other AWS services or non-AWS tools.

ETL Runners are invoked by activity tasks in Step Functions. Because of the way AWS Step Functions’ activity tasks work, ETL Runners need to periodically poll the AWS Step Functions state machine for tasks. The state machine responds by providing a Task object. The Task object contains inputs which enable an ETL Runner to run an ETL job.

As soon as an ETL Runner receives a task, it starts the respective ETL job. An ETL Runner maintains a state of active jobs in an Amazon DynamoDB table. Periodically, the ETL Runner checks the status of active jobs. When an active ETL job completes, the ETL Runners notifies the AWS Step Functions state machine. This allows the ETL workflow in AWS Step Functions to proceed to the next step.

An important question may come up. Why does an ETL Runner run independently from your Step Functions state machine and poll for tasks? Can’t we instead directly invoke an AWS Lambda function from the Step Functions state machine? Then can’t we have that function start and monitor an ETL job until it completes?

The answer is that AWS Lambda functions have a maximum execution duration per request of 300 seconds, or 5 minutes. For more information, see AWS Lambda Limits. ETL jobs typically take more than 5 minutes to complete. If an ETL Runner function is invoked directly, it will likely time out before the ETL job completes. Thus, we follow the long-running worker approach with activity tasks. The worker in this code sample – the ETL Runner – is an AWS Lambda function that gets triggered on a schedule using CloudWatch Events. If you want to avoid managing the polling schedule through CloudWatch Events, you can implement a polling loop in your ETL workflow’s state machine. Check the AWS Big Data blog post Orchestrate Apache Spark applications using AWS Step Functions and Apache Livy for an example.

Finally, let’s discuss how we fulfill the requirement of waiting for Sales and Marketing datasets to arrive in an S3 bucket at random times. We implement these waits as two separate activity tasks: Wait for Sales Data and Wait for Marketing Data. A state machine halts execution when it encounters either of these activity tasks. A CloudWatch Events event handler is then configured on an Amazon S3 bucket, so that when Sales or Marketing dataset files are uploaded to the bucket, Amazon S3 invokes an AWS Lambda function. The Lambda function then signals the waiting state machine to exit the activity task corresponding to the uploaded dataset. The subsequent ETL job is then invoked by the state machine.

Set up your own ETL orchestration

The aws-etl-orchestrator GitHub repository provides source code you can customize to set up the ETL orchestration architecture in your AWS account. The following steps show what you need to do to start orchestrating your ETL jobs using the architecture shown in this post:

  1. Model the ETL orchestration workflow in AWS Step Functions
  2. Build your ETL Runners (or use an existing AWS Glue ETL Runner)
  3. Customize AWS CloudFormation templates and create stacks
  4. Invoke the ETL orchestration state machine
  5. Upload sample Sales and Marketing datasets to Amazon S3

Model the ETL orchestration workflow in AWS Step Functions.  Use AWS Step Functions to model the ETL workflow described in this post as a state machine. A state machine in Step Functions consists of a set of states and the transitions between these states. A state machine is defined in Amazon States Language, which is a JSON-based notation. For a few examples of state machine definitions, see Sample Projects.

The following snapshot from the AWS Step Functions console shows our example ETL workflow modeled as a state machine. This workflow is what we provide you in the code sample.

When you start an execution of this state machine, it will branch to run two ETL jobs in parallel: Process Sales Data (PSD) and Process Marketing Data (PMD). But, according to the requirements, both ETL jobs should not start until their respective datasets are uploaded to Amazon S3. Hence, we implement Wait activity tasks before both PSD and PMD. When a dataset file is uploaded to Amazon S3, this triggers an AWS Lambda function that notifies the state machine to exit the Wait states. When both PMD and PSD jobs are successful, the JMSD job runs to produce the final dataset.

Finally, to have this ETL workflow execute once per week, you will need to configure a state machine execution to start once per week using a CloudWatch Event.

Build your ETL Runners (or use an existing AWS Glue ETL Runner)The code sample includes an AWS Glue ETL Runner. For simplicity, we implemented the ETL workflow using only AWS Glue jobs. However, nothing prevents you from using a different ETL technology to implement PMD or PSD jobs. You’ll need to build an ETL Runner for the technology that follows the AWS Glue ETL Runner example.

Customize AWS CloudFormation templates and create stacks. The sample published in the aws-etl-orchestrator repository includes three separate AWS CloudFormation templates. We organized resources into three templates following AWS CloudFormation best practices. The three resource groups are logically distinct and likely to have separate lifecycles and ownerships. Each template has an associated AWS CloudFormation parameters file (“*-params.json” files). Parameters in those files must be customized. The details about the three AWS CloudFormation templates are as follows:

  1. A template responsible for setting up AWS Glue resources.For our example ETL workflow, the sample template creates three AWS Glue jobs: PSD, PMD, and JMSD. The scripts for these jobs are pulled by AWS CloudFormation from an Amazon S3 bucket that you own.
  2. A template where the AWS Step Functions state machine is defined.The state machine definition in Amazon States Language is embedded in a StateMachine resource within the Step Functions template.
  3. A template that sets up resources required by the ETL Runner for AWS Glue.The AWS Glue ETL Runner is a Python script that is written to be run as an AWS Lambda function.

Invoke the ETL orchestration state machine. Finally, it is time to start a new state machine execution in AWS Step Functions. For our ETL example, the AWS CloudFormation template creates a state machine named MarketingAndSalesETLOrchestrator. You can start an execution from the AWS Step Functions console, or through an AWS CLI command. When you start an execution, the state machine will immediately enter Wait for Data states, waiting for datasets to be uploaded to Amazon S3.

Upload sample Sales and Marketing datasets to Amazon S3

Upload datasets provided to the S3 bucket that you specified in the code sample configuration. This uploaded datasets signal the state machine to continue execution.

The state machine may take a while to complete execution. You can monitor progress in the AWS Step Functions console. If the execution is successful, the output shown in the following diagram appears.

Congratulations! You’ve orchestrated the example ETL workflow to a successful completion.

Handling failed ETL jobs

What if a job in the ETL workflow fails? In such a case, there are error-handling strategies available to the ETL workflow developer, from simply notifying an administrator, to fully undoing the effects of the previous jobs through compensating ETL jobs. Detecting and responding to a failed ETL job can be implemented using the AWS Step Functions’ Catch mechanism. For more information, see Handling Error Conditions Using a State Machine. In the sample state machine, errors are handled by a do-nothing Pass state.

Try it out. Stop any of the example ETL workflow’s jobs while executing through the AWS Glue console or the AWS CLI. You’ll notice the state machine transitioning to the ETL Job Failed Fallback state.

Conclusion

In this post, I showed you how to implement your ETL logic as an orchestrated workflow. I presented a serverless solution for ETL orchestration that allows you to control ETL job execution using AWS Step Functions and AWS Lambda.  You can use the concepts and the code described in this post to build arbitrarily complex ETL state machines.

For more information and to download the source code, see the aws-etl-orchestrator GitHub repository. If you have questions about this post, send them our way in the Comments section below.


Additional Reading

If you found this post useful, be sure to check out Build a Data Lake Foundation with AWS Glue and Amazon S3 and Orchestrate Apache Spark applications using AWS Step Functions and Apache Livy.

 


About the Author

Moataz Anany is a senior solutions architect with AWS. He enjoys partnering with customers to help them leverage AWS and the cloud in creative ways. He dedicates most of his spare time to his wife and little ones. The rest is spent building and breaking things in side projects.