Tag Archives: Amazon Simple Storage Services (S3)

Amazon S3 Update – SigV2 Deprecation Period Extended & Modified

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/amazon-s3-update-sigv2-deprecation-period-extended-modified/

Every request that you make to the Amazon S3 API must be signed to ensure that it is authentic. In the early days of AWS we used a signing model that is known as Signature Version 2, or SigV2 for short. Back in 2012, we announced SigV4, a more flexible signing method, and made it the sole signing method for all regions launched after 2013. At that time, we recommended that you use it for all new S3 applications.

Last year we announced that we would be ending support for SigV2 later this month. While many customers have updated their applications (often with nothing more than a simple SDK update), to use SigV4, we have also received many requests for us to extend support.

New Date, New Plan
In response to the feedback on our original plan, we are making an important change. Here’s the summary:

Original Plan – Support for SigV2 ends on June 24, 2019.

Revised Plan – Any new buckets created after June 24, 2020 will not support SigV2 signed requests, although existing buckets will continue to support SigV2 while we work with customers to move off this older request signing method.

Even though you can continue to use SigV2 on existing buckets, and in the subset of AWS regions that support SigV2, I encourage you to migrate to SigV4, gaining some important security and efficiency benefits in the process. The newer signing method uses a separate, specialized signing key that is derived from the long-term AWS access key. The key is specific to the service, region, and date. This provides additional isolation between services and regions, and provides better protection against key reuse. Internally, our SigV4 implementation is able to securely cache the results of authentication checks; this reduces latency and adds to the overall resiliency of your application. To learn more, read Changes in Signature Version 4.

Identifying Use of SigV2
S3 has been around since 2006 and some of the code that you or your predecessors wrote way back then might still be around, dutifully making requests that are signed with SigV2. You can use CloudTrail Data Events or S3 Server Access Logs to find the old-school requests and target the applications for updates:

CloudTrail Data Events – Look for the SignatureVersion element within the additionalDataElement of each CloudTrail event entry (read Using AWS CloudTrail to Identify Amazon S3 Signature Version 2 Requests to learn more).

S3 Server Access Logs – Look for the SignatureVersion element in the logs (read Using Amazon S3 Access Logs to Identify Signature Version 2 Requests to learn more).

Updating to SigV4

“Do we need to change our code?”

The Europe (Frankfurt), US East (Ohio), Canada (Central), Europe (London), Asia Pacific (Seoul), Asia Pacific (Mumbai), Europe (Paris), China (Ningxia), Europe (Stockholm), Asia Pacific (Osaka Local), AWS GovCloud (US-East), and Asia Pacific (Hong Kong) Regions were launched after 2013, and support SigV4 but not SigV2. If you have code that accesses S3 buckets in that region, it is already making exclusive use of SigV4.

If you are using the latest version of the AWS SDKs, you are either ready or just about ready for the SigV4 requirement on new buckets beginning June 24, 2020. If you are using an older SDK, please check out the detailed version list at Moving from Signature Version 2 to Signature Version 4 for more information.

There are a few situations where you will need to make some changes to your code. For example, if you are using pre-signed URLs with the AWS Java, JavaScript (node.js), or Python SDK, you need to set the correct region and signature version in the client configuration. Also, be aware that SigV4 pre-signed URLs are valid for a maximum of 7 days, while SigV2 pre-signed URLs can be created with a maximum expiry time that could be many weeks or years in the future (in almost all cases, using time-limited URLs is a much better practice). Using SigV4 will improve your security profile, but might also mandate a change in the way that you create, store, and use the pre-signed URLs. While using long-lived pre-signed URLs was easy and convenient for developers, using SigV4 with URLs that have a finite expiration is a much better security practice.

If you are using Amazon EMR, you should upgrade your clusters to version 5.22.0 or later so that all requests to S3 are made using SigV4 (see Amazon EMR 5.x Release Versions for more info).

If your S3 objects are fronted by Amazon CloudFront and you are signing your own requests, be sure to update your code to use SigV4. If you are using Origin Access Identities to restrict access to S3, be sure to include the x-amz-content-sha256 header and the proper regional S3 domain endpoint.

We’re Here to Help
The AWS team wants to help make your transition to SigV4 as smooth and painless as possible. If you run in to problems, I strongly encourage you to make use of AWS Support, as described in Getting Started with AWS Support.

You can also Discuss this Post on Reddit!



How to export an Amazon DynamoDB table to Amazon S3 using AWS Step Functions and AWS Glue

Post Syndicated from Joe Feeney original https://aws.amazon.com/blogs/big-data/how-to-export-an-amazon-dynamodb-table-to-amazon-s3-using-aws-step-functions-and-aws-glue/

In typical AWS fashion, not a week had gone by after I published How Goodreads offloads Amazon DynamoDB tables to Amazon S3 and queries them using Amazon Athena on the AWS Big Data blog when the AWS Glue team released the ability for AWS Glue crawlers and AWS Glue ETL jobs to read from DynamoDB tables natively. I was actually pretty excited about this. Less code means fewer bugs. The original architecture had been around for at least 18 months and could be simplified significantly with a little bit of work.

Refactoring the data pipeline

The AWS Data Pipeline architecture outlined in my previous blog post is just under two years old now. We had used data pipelines as a way to back up Amazon DynamoDB data to Amazon S3 in case of a catastrophic developer error. However, with DynamoDB point-in-time recovery we have a better, native mechanism for disaster recovery. Additionally, with data pipelines we still own the operations associated with the clusters themselves, even if they are transient. A common challenge is keeping our clusters up with recent releases of Amazon EMR to help mitigate any outstanding bugs. Another is the inefficiency of needing to spin up an EMR cluster for each DynamoDB table.

I decided to take a step back and list the capabilities I wanted to have in the next iteration:

  • Export tables using AWS Glue instead of EMR.
    • AWS Glue provides a serverless ETL environment where I don’t have to worry about the underlying infrastructure. This minimizes operational tasks like keeping up with the EMR release tags.
  • Use a workflow solution that works across services like AWS Glue and Amazon Athena.
    • In the first iteration, the workflow was spread across various services. Unless you had the entire pipeline in your head, it was difficult to get a bird’s-eye view of how the pipeline was progressing.
  • Ability to select different formats.
    • For data engineering, I prefer Apache Parquet. However, customers might prefer a different format.
  • Add exported data to Athena.
    • I find that the easier it is for the data to be queried, the more likely it’s used.

Architecture overview

At a high level, this is the architecture:

  • We’re using AWS Step Functions as the workflow engine.
    • Each step is either a built-in Step Functions state, a service integration, or a simple Python AWS Lambda For example, GlueStartJobRun is using the synchronous job run service integration, as discussed in the documentation.
    • We get a visual representation of the entire pipeline.
    • It’s quick to onboard new developers.
  • An event in Amazon CloudWatch Events, which is disabled to start, triggers a Step Functions state machine with a JSON payload that contains the following:
    • AWS Glue job name
    • Export destination
    • DynamoDB table name
    • Desired read percentage
    • AWS Glue crawler name
  • AWS Glue exports a DynamoDB table in your preferred format to S3 as snapshots_your_table_name. The data is partitioned by the snapshot_timestamp
  • An AWS Glue crawler adds or updates your data’s schema and partitions in the AWS Glue Data Catalog.
  • Finally, we create an Athena view that only has data from the latest export snapshot.

A simple AWS Glue ETL job

The script that I created accepts AWS Glue ETL job arguments for the table name, read throughput, output, and format. Behind the scenes, AWS Glue scans the DynamoDB table. AWS Glue makes sure that every top-level attribute makes it into the schema, no matter how sparse your attributes are (as discussed in the DynamoDB documentation).

Here’s the script:

import sys
import datetime
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext

ARG_TABLE_NAME = "table_name"
ARG_READ_PERCENT = "read_percentage"
ARG_OUTPUT = "output_prefix"
ARG_FORMAT = "output_format"

PARTITION = "snapshot_timestamp"

args = getResolvedOptions(sys.argv,

table_name = args[ARG_TABLE_NAME]
read = args[ARG_READ_PERCENT]
output_prefix = args[ARG_OUTPUT]
fmt = args[ARG_FORMAT]

print("Table name:", table_name)
print("Read percentage:", read)
print("Output prefix:", output_prefix)
print("Format:", fmt)

date_str = datetime.datetime.utcnow().strftime('%Y-%m-%dT%H:%M')
output = "%s/%s=%s" % (output_prefix, PARTITION, date_str)

sc = SparkContext()
glueContext = GlueContext(sc)

table = glueContext.create_dynamic_frame.from_options(
    "dynamodb.input.tableName": table_name,
    "dynamodb.throughput.read.percent": read

    "path": output

There’s not a lot here. We’re creating a DynamicFrameReader of connection type dynamodb and passing in the table name and desired maximum read throughput consumption. We pass that data frame to a DynamicFrameWriter that writes the table to S3 in the specified format.

Athena views

Most teams at Amazon own applications that have multiple DynamoDB tables, including my own team. Our current application uses five primary tables. Ideally, at the end of an export workflow you can write simple, obvious queries across a consistent view of your tables. However, each exported table is partitioned by the timestamp from when the table was exported. This makes querying across one or more tables very cumbersome, because you have to add a WHERE snapshot_timestamp = clause to every table reference in your query. Additionally, each table might have a different snapshot_timestamp value for any given day!

The final step in this export workflow creates an Athena view that adds that WHERE clause for you. This means that you can interact with your DynamoDB exports as if they were one sane view of your exported DynamoDB tables.

Setting up the infrastructure

The AWS CloudFormation stacks I create are split into two stacks. The common stack contains shared infrastructure, and you need only one of these per AWS Region. The table stacks are designed in such a way that you can create one per table-format combination in any given AWS Region. It contains the CloudWatch event logic and AWS Glue components needed to export and transform DynamoDB tables.

Creating the common stack

The common stack contains the majority of the infrastructure. That includes the Step Functions state machine and Lambda functions to trigger and check the state of asynchronous jobs. It also includes IAM roles that the export stacks use, and the S3 bucket to store the exports.

To create the common stack, do the following:

  1. Choose this Launch Stack
  2. Choose I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  3. Choose Create Stack.

Creating the table export stack

If you don’t have a DynamoDB table to export, follow the original blog post. Start with the Working with the Reviews stack section and continue until you’ve added the two Items to the table. Otherwise, feel free to point this CloudFormation stack at your favorite DynamoDB table that is using provisioned throughput. Tables that use on-demand throughput are not currently supported.

Because so much of this architecture is shareable, there’s not much in the table export stack. This stack defines the CloudWatch event used to trigger the Step Functions state machine with a JSON payload containing all the necessary metadata. Additionally, it contains the AWS Glue ETL job that exports the table and the AWS Glue Crawler that updates metadata in the AWS Glue Data Catalog.

Technically, you can define the AWS Glue ETL job in the common stack because it’s already parameterized. However, the default limit for concurrent runs for an AWS Glue job is three. This is a soft limit, but with this architecture you have headroom to export up to 25 tables before asking for a limit increase.

To create the table export stack, do the following:

  1. Choose this Launch Stack
  2. Choose an output format from the list. All the available formats are supported by Athena natively.
  3. Enter your DynamoDB table name.
  4. Enter the percentage of Read Capacity Units (RCUs) that the job should consume from your table’s currently provisioned throughput. This percentage is expressed as a float between 0.1 and 1.0 inclusive. The default is 0.25 (25 percent).

As an example: Suppose that your table’s RCUs are set to 100 and you use the default 0.25, 25 percent. Then the AWS Glue job consume 25 RCUs while running.

  1. Choose Create.

Kicking off a state machine execution

To demonstrate how this works, we run the DynamoDB export state machine manually by passing it the JSON payload that the CloudWatch event would pass to Step Functions.

Getting the JSON payload from CloudWatch Events

To get the JSON payload, do the following:

  1. Open CloudWatch in the AWS Management Console.
  2. In the left column under Events, choose Rules.
  3. Choose your rule from the list. It is prefixed by AWSBigDataBlog-.
  4. For Actions, choose Edit.
  5. Copy the JSON payload from the Configure input section of Targets.
  6. Choose Cancel to exit edit mode.

Starting a state machine execution

To start an execution of the state machine, take the following steps:

  1. Open Step Functions in the console.
  2. Choose the DynamoDBExportAndAthenaLoad state machine.
  3. Choose Start execution.
  4. Paste the JSON payload into the Input
  5. Choose Start execution.

There are a few ways to follow along with the execution. As steps are entered and exited, entries are added to the Execution event history list. This is a great way to see what state (event in Lambda speak) is passed to each step, in case you need to debug.

You can also expand the Visual workflow. It’s a great high-level view to see how the workflow is progressing.

After the workflow is finished, you see two new tables under the dynamodb_exports database in your AWS Glue Data Catalog. Your DynamoDB snapshots table name is prefixed with snapshots_. The schema is formatted for the AWS Glue Data Catalog (lowercase and hyphens transformed to underscores). You also have a view table with the same table name formatted for AWS Glue Data Catalog but without the snapshots_ prefix.

Querying your data

To showcase how having a separate view table of the most recent snapshot of a table is useful, I use the Reviews table from the previous blog post. The table has two items. I have also run the export workflow twice. As you can see when you preview the table, there are four items total. That’s because each snapshot contains two items.

From the items, the latest snapshot_timestamp is 2019-01-11T23:26. When I run the same preview query against the view table reviews, we see that there are only two items, which is what we expect. The view takes care of specifying the where snapshot_timestamp=… clause so you don’t have to.

Wrapping up

In this post, I showed you how to use AWS Glue’s DynamoDB integration and AWS Step Functions to create a workflow to export your DynamoDB tables to S3 in Parquet. I also show how to create an Athena view for each table’s latest snapshot, giving you a consistent view of your DynamoDB table exports.

About the Author

Joe Feeney is a Software Engineer at Amazon Go, where he does secret stuff and he’s quite chuffed with that. He enjoys embarrassing his family by taking Mario Kart entirely too seriously.




Trigger cross-region replication of pre-existing objects using Amazon S3 inventory, Amazon EMR, and Amazon Athena

Post Syndicated from Michael Sambol original https://aws.amazon.com/blogs/big-data/trigger-cross-region-replication-of-pre-existing-objects-using-amazon-s3-inventory-amazon-emr-and-amazon-athena/

In Amazon Simple Storage Service (Amazon S3), you can use cross-region replication (CRR) to copy objects automatically and asynchronously across buckets in different AWS Regions. CRR is a bucket-level configuration, and it can help you meet compliance requirements and minimize latency by keeping copies of your data in different Regions. CRR replicates all objects in the source bucket, or optionally a subset, controlled by prefix and tags.

Objects that exist before you enable CRR (pre-existing objects) are not replicated. Similarly, objects might fail to replicate (failed objects) if permissions aren’t in place, either on the IAM role used for replication or the bucket policy (if the buckets are in different AWS accounts).

In our work with customers, we have seen situations where large numbers of objects aren’t replicated for the previously mentioned reasons. In this post, we show you how to trigger cross-region replication for pre-existing and failed objects.


At a high level, our strategy is to perform a copy-in-place operation on pre-existing and failed objects. This operation uses the Amazon S3 API to copy the objects over the top of themselves, preserving tags, access control lists (ACLs), metadata, and encryption keys. The operation also resets the Replication_Status flag on the objects. This triggers cross-region replication, which then copies the objects to the destination bucket.

To accomplish this, we use the following:

  • Amazon S3 inventory to identify objects to copy in place. These objects don’t have a replication status, or they have a status of FAILED.
  • Amazon Athena and AWS Glue to expose the S3 inventory files as a table.
  • Amazon EMR to execute an Apache Spark job that queries the AWS Glue table and performs the copy-in-place operation.

Object filtering

To reduce the size of the problem (we’ve seen buckets with billions of objects!) and eliminate S3 List operations, we use Amazon S3 inventory. S3 inventory is enabled at the bucket level, and it provides a report of S3 objects. The inventory files contain the objects’ replication status: PENDING, COMPLETED, FAILED, or REPLICA. Pre-existing objects do not have a replication status in the inventory.

Interactive analysis

To simplify working with the files that are created by S3 inventory, we create a table in the AWS Glue Data Catalog. You can query this table using Amazon Athena and analyze the objects.  You can also use this table in the Spark job running on Amazon EMR to identify the objects to copy in place.

Copy-in-place execution

We use a Spark job running on Amazon EMR to perform concurrent copy-in-place operations of the S3 objects. This step allows the number of simultaneous copy operations to be scaled up. This improves performance on a large number of objects compared to doing the copy operations consecutively with a single-threaded application.

Account setup

For the purpose of this example, we created three S3 buckets. The buckets are specific to our demonstration. If you’re following along, you need to create your own buckets (with different names).

We’re using a source bucket named crr-preexisting-demo-source and a destination bucket named crr-preexisting-demo-destination. The source bucket contains the pre-existing objects and the objects with the replication status of FAILED. We store the S3 inventory files in a third bucket named crr-preexisting-demo-inventory.

The following diagram illustrates the basic setup.

You can use any bucket to store the inventory, but the bucket policy must include the following statement (change Resource and aws:SourceAccount to match yours).

    "Version": "2012-10-17",
    "Id": "S3InventoryPolicy",
    "Statement": [
            "Sid": "S3InventoryStatement",
            "Effect": "Allow",
            "Principal": {
                "Service": "s3.amazonaws.com"
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::crr-preexisting-demo-inventory/*",
            "Condition": {
                "StringEquals": {
                    "s3:x-amz-acl": "bucket-owner-full-control",
                    "aws:SourceAccount": "111111111111"

In our example, we uploaded six objects to crr-preexisting-demo-source. We added three objects (preexisting-*.txt) before CRR was enabled. We also added three objects (failed-*.txt) after permissions were removed from the CRR IAM role, causing CRR to fail.

Enable S3 inventory

You need to enable S3 inventory on the source bucket. You can do this on the Amazon S3 console as follows:

On the Management tab for the source bucket, choose Inventory.

Choose Add new, and complete the settings as shown, choosing the CSV format and selecting the Replication status check box. For detailed instructions for creating an inventory, see How Do I Configure Amazon S3 Inventory? in the Amazon S3 Console User Guide.

After enabling S3 inventory, you need to wait for the inventory files to be delivered. It can take up to 48 hours to deliver the first report. If you’re following the demo, ensure that the inventory report is delivered before proceeding.

Here’s what our example inventory file looks like:

You can also look on the S3 console on the objects’ Overview tab. The pre-existing objects do not have a replication status, but the failed objects show the following:

Register the table in the AWS Glue Data Catalog using Amazon Athena

To be able to query the inventory files using SQL, first you need to create an external table in the AWS Glue Data Catalog. Open the Amazon Athena console at https://console.aws.amazon.com/athena/home.

On the Query Editor tab, run the following SQL statement. This statement registers the external table in the AWS Glue Data Catalog.

crr_preexisting_demo (
    `bucket` string,
    key string,
    replication_status string
PARTITIONED BY (dt string)
    ESCAPED BY '\\'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://crr-preexisting-demo-inventory/crr-preexisting-demo-source/crr-preexisting-demo/hive';

After creating the table, you need to make the AWS Glue Data Catalog aware of any existing data and partitions by adding partition metadata to the table. To do this, you use the Metastore Consistency Check utility to scan for and add partition metadata to the AWS Glue Data Catalog.

MSCK REPAIR TABLE crr_preexisting_demo;

To learn more about why this is required, see the documentation on MSCK REPAIR TABLE and data partitioning in the Amazon Athena User Guide.

Now that the table and partitions are registered in the Data Catalog, you can query the inventory files with Amazon Athena.

SELECT * FROM crr_preexisting_demo where dt='2019-02-24-04-00';

The results of the query are as follows.

The query returns all rows in the S3 inventory for a specific delivery date. You’re now ready to launch an EMR cluster to copy in place the pre-existing and failed objects.

Note: If your goal is to fix FAILED objects, make sure that you correct what caused the failure (IAM permissions or S3 bucket policies) before proceeding to the next step.

Create an EMR cluster to copy objects

To parallelize the copy-in-place operations, run a Spark job on Amazon EMR. To facilitate EMR cluster creation and EMR step submission, we wrote a bash script (available in this GitHub repository).

To run the script, clone the GitHub repo. Then launch the EMR cluster as follows:

$ git clone https://github.com/aws-samples/amazon-s3-crr-preexisting-objects
$ ./launch emr.sh

Note: Running the bash script results in AWS charges. By default, it creates two Amazon EC2 instances, one m4.xlarge and one m4.2xlarge. Auto-termination is enabled so when the cluster is finished with the in-place copies, it terminates.

The script performs the following tasks:

  1. Creates the default EMR roles (EMR_EC2_DefaultRole and EMR_DefaultRole).
  2. Uploads the files used for bootstrap actions and steps to Amazon S3 (we use crr-preexisting-demo-inventory to store these files).
  3. Creates an EMR cluster with Apache Spark installed using the create-cluster

After the cluster is provisioned:

  1. A bootstrap action installs boto3 and awscli.
  2. Two steps execute, copying the Spark application to the master node and then running the application.

The following are highlights from the Spark application. You can find the complete code for this example in the amazon-s3-crr-preexisting-objects repo on GitHub.

Here we select records from the table registered with the AWS Glue Data Catalog, filtering for objects with a replication_status of "FAILED" or “”.

query = """
        SELECT bucket, key
        FROM {}
        WHERE dt = '{}'
        AND (replication_status = '""'
        OR replication_status = '"FAILED"')
        """.format(inventory_table, inventory_date)

print('Query: {}'.format(query))

crr_failed = spark.sql(query)

We call the copy_object function for each key returned by the previous query.

def copy_object(self, bucket, key, copy_acls):
        dest_bucket = self._s3.Bucket(bucket)
        dest_obj = dest_bucket.Object(key)

        src_bucket = self._s3.Bucket(bucket)
        src_obj = src_bucket.Object(key)

        # Get the S3 Object's Storage Class, Metadata, 
        # and Server Side Encryption
        storage_class, metadata, sse_type, last_modified = \

        # Update the Metadata so the copy will work
        metadata['forcedreplication'] = runtime

        # Get and copy the current ACL
        if copy_acls:
            src_acl = src_obj.Acl()
            dest_acl = {
                'Grants': src_acl.grants,
                'Owner': src_acl.owner

        params = {
            'CopySource': {
                'Bucket': bucket,
                'Key': key
            'MetadataDirective': 'REPLACE',
            'TaggingDirective': 'COPY',
            'Metadata': metadata,
            'StorageClass': storage_class

        # Set Server Side Encryption
        if sse_type == 'AES256':
            params['ServerSideEncryption'] = 'AES256'
        elif sse_type == 'aws:kms':
            kms_key = src_obj.ssekms_key_id
            params['ServerSideEncryption'] = 'aws:kms'
            params['SSEKMSKeyId'] = kms_key

        # Copy the S3 Object over the top of itself, 
        # with the Storage Class, updated Metadata, 
        # and Server Side Encryption
        result = dest_obj.copy_from(**params)

        # Put the ACL back on the Object
        if copy_acls:

        return {
            'CopyInPlace': 'TRUE',
            'LastModified': str(result['CopyObjectResult']['LastModified'])

Note: The Spark application adds a forcedreplication key to the objects’ metadata. It does this because Amazon S3 doesn’t allow you to copy in place without changing the object or its metadata.

Verify the success of the EMR job by running a query in Amazon Athena

The Spark application outputs its results to S3. You can create another external table with Amazon Athena and register it with the AWS Glue Data Catalog. You can then query the table with Athena to ensure that the copy-in-place operation was successful.

crr_preexisting_demo_results (
  `bucket` string,
  key string,
  replication_status string,
  last_modified string
LOCATION 's3://crr-preexisting-demo-inventory/results';

SELECT * FROM crr_preexisting_demo_results;

The results appear as follows on the console.

Although this shows that the copy-in-place operation was successful, CRR still needs to replicate the objects. Subsequent inventory files show the objects’ replication status as COMPLETED. You can also verify on the console that preexisting-*.txt and failed-*.txt are COMPLETED.

It is worth noting that because CRR requires versioned buckets, the copy-in-place operation produces another version of the objects. You can use S3 lifecycle policies to manage noncurrent versions.


In this post, we showed how to use Amazon S3 inventory, Amazon Athena, the AWS Glue Data Catalog, and Amazon EMR to perform copy-in-place operations on pre-existing and failed objects at scale.

Note: Amazon S3 batch operations is an alternative for copying objects. The difference is that S3 batch operations will not check each object’s existing properties and set object ACLs, storage class, and encryption on an object-by-object basis. For more information, see Introduction to Amazon S3 Batch Operations in the Amazon S3 Console User Guide.


About the Authors

Michael Sambol is a senior consultant at AWS. He holds an MS in computer science from Georgia Tech. Michael enjoys working out, playing tennis, traveling, and watching Western movies.





Chauncy McCaughey is a senior data architect at AWS. His current side project is using statistical analysis of driving habits and traffic patterns to understand how he always ends up in the slow lane.




Amazon S3 Path Deprecation Plan – The Rest of the Story

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/amazon-s3-path-deprecation-plan-the-rest-of-the-story/

Last week we made a fairly quiet (too quiet, in fact) announcement of our plan to slowly and carefully deprecate the path-based access model that is used to specify the address of an object in an S3 bucket. I spent some time talking to the S3 team in order to get a better understanding of the situation in order to write this blog post. Here’s what I learned…

We launched S3 in early 2006. Jeff Bezos’ original spec for S3 was very succinct – he wanted malloc (a key memory allocation function for C programs) for the Internet. From that starting point, S3 has grown to the point where it now stores many trillions of objects and processes millions of requests per second for them. Over the intervening 13 years, we have added many new storage options, features, and security controls to S3.

Old vs. New
S3 currently supports two different addressing models: path-style and virtual-hosted style. Let’s take a quick look at each one. The path-style model looks like either this (the global S3 endpoint):


Or this (one of the regional S3 endpoints):


In this example, jbarr-public and jeffbarr-public are bucket names; /images/ritchie_and_thompson_pdp11.jpeg and /classic_amazon_door_desk.png are object keys.

Even though the objects are owned by distinct AWS accounts and are in different S3 buckets (and possibly in distinct AWS regions), both of them are in the DNS subdomain s3.amazonaws.com. Hold that thought while we look at the equivalent virtual-hosted style references (although you might think of these as “new,” they have been around since at least 2010):


These URLs reference the same objects, but the objects are now in distinct DNS subdomains (jbarr-public.s3.amazonaws.com and jeffbarr-public.s3.amazonaws.com, respectively). The difference is subtle, but very important. When you use a URL to reference an object, DNS resolution is used to map the subdomain name to an IP address. With the path-style model, the subdomain is always s3.amazonaws.com or one of the regional endpoints; with the virtual-hosted style, the subdomain is specific to the bucket. This additional degree of endpoint specificity is the key that opens the door to many important improvements to S3.

Out with the Old
In response to feedback on the original deprecation plan that we announced last week, we are making an important change. Here’s the executive summary:

Original Plan – Support for the path-style model ends on September 30, 2020.

Revised Plan – Support for the path-style model continues for buckets created on or before September 30, 2020. Buckets created after that date must be referenced using the virtual-hosted model.

We are moving to virtual-hosted references for two reasons:

First, anticipating a world with billions of buckets homed in many dozens of regions, routing all incoming requests directly to a small set of endpoints makes less and less sense over time. DNS resolution, scaling, security, and traffic management (including DDoS protection) are more challenging with this centralized model. The virtual-hosted model reduces the area of impact (which we call the “blast radius” internally) when problems arise; this helps us to increase availability and performance.

Second, the team has a lot of powerful features in the works, many of which depend on the use of unique, virtual-hosted style subdomains. Moving to this model will allow you to benefit from these new features as soon as they are announced. For example, we are planning to deprecate some of the oldest security ciphers and versions (details to come later). The deprecation process is easier and smoother (for you and for us) if you are using virtual-hosted references.

In With the New
As just one example of what becomes possible when using virtual-hosted references, we are thinking about providing you with increased control over the security configuration (including ciphers and cipher versions) for each bucket. If you have ideas of your own, feel free to get in touch.

Moving Ahead
Here are some things to know about our plans:

Identifying Path-Style References – You can use S3 Access Logs (look for the Host Header field) and AWS CloudTrail Data Events (look for the host element of the requestParameters entry) to identify the applications that are making path-style requests.

Programmatic Access – If your application accesses S3 using one of the AWS SDKs, you don’t need to do anything, other than ensuring that your SDK is current. The SDKs already use virtual-hosted references to S3, except if the bucket name contains one or more “.” characters.

Bucket Names with Dots – It is important to note that bucket names with “.” characters are perfectly valid for website hosting and other use cases. However, there are some known issues with TLS and with SSL certificates. We are hard at work on a plan to support virtual-host requests to these buckets, and will share the details well ahead of September 30, 2020.

Non-Routable Names – Some characters that are valid in the path component of a URL are not valid as part of a domain name. Also, paths are case-sensitive, but domain and subdomain names are not. We’ve been enforcing more stringent rules for new bucket names since last year. If you have data in a bucket with a non-routable name and you want to switch to virtual-host requests, you can use the new S3 Batch Operations feature to move the data. However, if this is not a viable option, please reach out to AWS Developer Support.

Documentation – We are planning to update the S3 Documentation to encourage all developers to build applications that use virtual-host requests. The Virtual Hosting documentation is a good starting point.

We’re Here to Help
The S3 team has been working with some of our customers to help them to migrate, and they are ready to work with many more.

Our goal is to make this deprecation smooth and uneventful, and we want to help minimize any costs you may incur! Please do not hesitate to reach out to us if you have questions, challenges, or concerns.


PS – Stay tuned for more information on tools and other resources.

New – Amazon S3 Batch Operations

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-amazon-s3-batch-operations/

AWS customers routinely store millions or billions of objects in individual Amazon Simple Storage Service (S3) buckets, taking advantage of S3’s scale, durability, low cost, security, and storage options. These customers store images, videos, log files, backups, and other mission-critical data, and use S3 as a crucial part of their data storage strategy.

Batch Operations
Today, I would like to tell you about Amazon S3 Batch Operations. You can use this new feature to easily process hundreds, millions, or billions of S3 objects in a simple and straightforward fashion. You can copy objects to another bucket, set tags or access control lists (ACLs), initiate a restore from Glacier, or invoke an AWS Lambda function on each one.

This feature builds on S3’s existing support for inventory reports (read my S3 Storage Management Update post to learn more), and can use the reports or CSV files to drive your batch operations. You don’t have to write code, set up any server fleets, or figure out how to partition the work and distribute it to the fleet. Instead, you create a job in minutes with a couple of clicks, turn it loose, and sit back while S3 uses massive, behind-the-scenes parallelism to take care of the work. You can create, monitor, and manage your batch jobs using the S3 Console, the S3 CLI, or the S3 APIs.

A Quick Vocabulary Lesson
Before we get started and create a batch job, let’s review and introduce a couple of important terms:

Bucket – An S3 bucket holds a collection of any number of S3 objects, with optional per-object versioning.

Inventory Report – An S3 inventory report is generated each time a daily or weekly bucket inventory is run. A report can be configured to include all of the objects in a bucket, or to focus on a prefix-delimited subset.

Manifest – A list (either an Inventory Report, or a file in CSV format) that identifies the objects to be processed in the batch job.

Batch Action – The desired action on the objects described by a Manifest. Applying an action to an object constitutes an S3 Batch Task.

IAM Role – An IAM role that provides S3 with permission to read the objects in the inventory report, perform the desired actions, and to write the optional completion report. If you choose Invoke AWS Lambda function as your action, the function’s execution role must grant permission to access the desired AWS services and resources.

Batch Job – References all of the items above. Each job has a status and a priority; higher priority (numerically) jobs take precedence over those with lower priority.

Running a Batch Job
Ok, let’s use the S3 Console to create and run a batch job! In preparation for this blog post I enabled inventory reports for one of my S3 buckets (jbarr-batch-camera) earlier this week, with the reports routed to jbarr-batch-inventory:

I select the desired inventory item, and click Create job from manifest to get started (I can also click Batch operations while browsing my list of buckets). All of the relevant information is already filled in, but I can choose an earlier version of the manifest if I want (this option is only applicable if the manifest is stored in a bucket that has versioning enabled). I click Next to proceed:

I choose my operation (Replace all tags), enter the options that are specific to it (I’ll review the other operations later), and click Next:

I enter a name for my job, set its priority, and request a completion report that encompasses all tasks. Then I choose a bucket for the report and select an IAM Role that grants the necessary permissions (the console also displays a role policy and a trust policy that I can copy and use), and click Next:

Finally, I review my job, and click Create job:

The job enters the Preparing state. S3 Batch Operations checks the manifest and does some other verification, and the job enters the Awaiting your confirmation state (this only happens when I use the console). I select it and click Confirm and run:

I review the confirmation (not shown) to make sure that I understand the action to be performed, and click Run job. The job enters the Ready state, and starts to run shortly thereafter. When it is done it enters the Complete state:

If I was running a job that processed a substantially larger number of objects, I could refresh this page to monitor status. One important thing to know: After the first 1000 objects have been processed, S3 Batch Operations examines and monitors the overall failure rate, and will stop the job if the rate exceeds 50%.

The completion report contains one line for each of my objects, and looks like this:

Other Built-In Batch Operations
I don’t have enough space to give you a full run-through of the other built-in batch operations. Here’s an overview:

The PUT copy operation copies my objects, with control of the storage class, encryption, access control list, tags, and metadata:

I can copy objects to the same bucket to change their encryption status. I can also copy them to another region, or to a bucket owned by another AWS account.

The Replace Access Control List (ACL) operation does exactly that, with control over the permissions that are granted:

And the Restore operation initiates an object-level restore from the Glacier or Glacier Deep Archive storage class:

Invoking AWS Lambda Functions
I have saved the most general option for last. I can invoke a Lambda function for each object, and that Lambda function can programmatically analyze and manipulate each object. The Execution Role for the function must trust S3 Batch Operations:

Also, the Role for the Batch job must allow Lambda functions to be invoked.

With the necessary roles in place, I can create a simple function that calls Amazon Rekognition for each image:

import boto3
def lambda_handler(event, context):
    s3Client = boto3.client('s3')
    rekClient = boto3.client('rekognition')
    # Parse job parameters
    jobId = event['job']['id']
    invocationId = event['invocationId']
    invocationSchemaVersion = event['invocationSchemaVersion']

    # Process the task
    task = event['tasks'][0]
    taskId = task['taskId']
    s3Key = task['s3Key']
    s3VersionId = task['s3VersionId']
    s3BucketArn = task['s3BucketArn']
    s3Bucket = s3BucketArn.split(':')[-1]
    print('BatchProcessObject(' + s3Bucket + "/" + s3Key + ')')
    resp = rekClient.detect_labels(Image={'S3Object':{'Bucket' : s3Bucket, 'Name' : s3Key}}, MaxLabels=10, MinConfidence=85)
    l = [lb['Name'] for lb in resp['Labels']]
    print(s3Key + ' - Detected:' + str(sorted(l)))

    results = [{
        'taskId': taskId,
        'resultCode': 'Succeeded',
        'resultString': 'Succeeded'
    return {
        'invocationSchemaVersion': invocationSchemaVersion,
        'treatMissingKeysAs': 'PermanentFailure',
        'invocationId': invocationId,
        'results': results

With my function in place, I select Invoke AWS lambda function as my operation when I create my job, and choose my BatchProcessObject function:

Then I create and confirm my job as usual. The function will be invoked for each object, taking advantage of Lambda’s ability to scale and allowing this moderately-sized job to run to completion in less than a minute:

I can find the “Detected” messages in the CloudWatch Logs Console:

As you can see from my very simple example, the ability to easily run Lambda functions on large numbers of S3 objects opens the door to all sorts of interesting applications.

Things to Know
I am looking forward to seeing and hearing about the use cases that you discover for S3 Batch Operations! Before I wrap up, here are some final thoughts:

Job Cloning – You can clone an existing job, fine-tune the parameters, and resubmit it as a fresh job. You can use this to re-run a failed job or to make any necessary adjustments.

Programmatic Job Creation – You could attach a Lambda function to the bucket where you generate your inventory reports and create a fresh batch job each time a report arrives. Jobs that are created programmatically do not need to be confirmed, and are immediately ready to execute.

CSV Object Lists – If you need to process a subset of the objects in a bucket and cannot use a common prefix to identify them, you can create a CSV file and use it to drive your job. You could start from an inventory report and filter the objects based on name or by checking them against a database or other reference. For example, perhaps you use Amazon Comprehend to perform sentiment analysis on all of your stored documents. You can process inventory reports to find documents that have not yet been analyzed and add them to a CSV file.

Job Priorities – You can have multiple jobs active at once in each AWS region. Your jobs with a higher priority take precedence, and can cause existing jobs to be paused momentarily. You can select an active job and click Update priority in order to make changes on the fly:

Learn More
Here are some resources to help you learn more about S3 Batch Operations:

Documentation – Read about Creating a Job, Batch Operations, and Managing Batch Operations Jobs.

Tutorial Videos – Check out the S3 Batch Operations Video Tutorials to learn how to Create a Job, Manage and Track a Job, and to Grant Permissions.

Now Available
You can start using S3 Batch Operations in all commercial AWS regions except Asia Pacific (Osaka) today. S3 Batch Operations is also available in both of the AWS GovCloud (US) regions.


New Amazon S3 Storage Class – Glacier Deep Archive

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-amazon-s3-storage-class-glacier-deep-archive/

Many AWS customers collect and store large volumes (often a petabyte or more) of important data but seldom access it. In some cases raw data is collected and immediately processed, then stored for years or decades just in case there’s a need for further processing or analysis. In other cases, the data is retained for compliance or auditing purposes. Here are some of the industries and use cases that fit this description:

Financial – Transaction archives, activity & audit logs, and communication logs.

Health Care / Life Sciences – Electronic medical records, images (X-Ray, MRI, or CT), genome sequences, records of pharmaceutical development.

Media & Entertainment – Media archives and raw production footage.

Physical Security – Raw camera footage.

Online Advertising – Clickstreams and ad delivery logs.

Transportation – Vehicle telemetry, video, RADAR, and LIDAR data.

Science / Research / Education – Research input and results, including data relevant to seismic tests for oil & gas exploration.

Today we are introducing a new and even more cost-effective way to store important, infrequently accessed data in Amazon S3.

Amazon S3 Glacier Deep Archive Storage Class
The new Glacier Deep Archive storage class is designed to provide durable and secure long-term storage for large amounts of data at a price that is competitive with off-premises tape archival services. Data is stored across 3 or more AWS Availability Zones and can be retrieved in 12 hours or less. You no longer need to deal with expensive and finicky tape drives, arrange for off-premises storage, or worry about migrating data to newer generations of media.

Your existing S3-compatible applications, tools, code, scripts, and lifecycle rules can all take advantage of Glacier Deep Archive storage. You can specify the new storage class when you upload objects, alter the storage class of existing objects manually or programmatically, or use lifecycle rules to arrange for migration based on object age. You can also make use of other S3 features such as Storage Class Analysis, Object Tagging, Object Lock, and Cross-Region Replication.

The existing S3 Glacier storage class allows you to access your data in minutes (using expedited retrieval) and is a good fit for data that requires faster access. To learn more about the entire range of options, read Storage Classes in the S3 Developer Guide. If you are already making use of the Glacier storage class and rarely access your data, you can switch to Deep Archive and begin to see cost savings right away.

Using Glacier Deep Archive Storage – Console
I can switch the storage class of an existing S3 object to Glacier Deep Archive using the S3 Console. I locate the file and click Properties:

Then I click Storage class:

Next, I select Glacier Deep Archive and click Save:

I cannot download the object or edit any of its properties or permissions after I make this change:

In the unlikely event that I need to access this 2013-era video, I select it and choose Restore from the Actions menu:

Then I specify the number of days to keep the restored copy available, and choose either bulk or standard retrieval:

Using Glacier Deep Archive Storage – Lifecycle Rules
I can also use S3 lifecycle rules. I select the bucket and click Management, then select Lifecycle:

Then I click Add lifecycle rule and create my rule. I enter a name (ArchiveOldMovies), and can optionally use a path or tag filter to limit the scope of the rule:

Next, I indicate that I want the rule to apply to the Current version of my objects, and specify that I want my objects to transition to Glacier Deep Archive 30 days after they are created:

Using Glacier Deep Archive – CLI / Programmatic Access
I can use the CLI to upload a new object and set the storage class:

$ aws s3 cp new.mov s3://awsroadtrip-videos-raw/ --storage-class DEEP_ARCHIVE

I can also change the storage class of an existing object by copying it over itself:

$ aws s3 cp s3://awsroadtrip-videos-raw/new.mov s3://awsroadtrip-videos-raw/new.mov --storage-class DEEP_ARCHIVE

If I am building a system that manages archiving and restoration, I can opt to receive notifications on an SNS topic, an SQS queue, or a Lambda function when a restore is initiated and/or completed:

Other Access Methods
You can also use Tape Gateway configuration of AWS Storage Gateway to create a Virtual Tape Library (VTL) and configure it to use Glacier Deep Archive for storage of archived virtual tapes. This will allow you to move your existing tape-based backups to the AWS Cloud without making any changes to your existing backup workflows. You can retrieve virtual tapes archived in Glacier Deep Archive to S3 within twelve hours. With Tape Gateway and S3 Glacier Deep Archive, you no longer need on-premises physical tape libraries, and you don’t need to manage hardware refreshes and rewrite data to new physical tapes as technologies evolve. For more information, visit the Test Your Gateway Setup with Backup Software page of Storage Gateway User Guide.

Now Available
The S3 Glacier Deep Archive storage class is available today in all commercial regions and in both AWS GovCloud regions. Pricing varies by region, and the storage cost is up to 75% less than for the existing S3 Glacier storage class; visit the S3 Pricing page for more information.


Improve Apache Spark write performance on Apache Parquet formats with the EMRFS S3-optimized committer

Post Syndicated from Peter Slawski original https://aws.amazon.com/blogs/big-data/improve-apache-spark-write-performance-on-apache-parquet-formats-with-the-emrfs-s3-optimized-committer/

The EMRFS S3-optimized committer is a new output committer available for use with Apache Spark jobs as of Amazon EMR 5.19.0. This committer improves performance when writing Apache Parquet files to Amazon S3 using the EMR File System (EMRFS). In this post, we run a performance benchmark to compare this new optimized committer with existing committer algorithms, namely FileOutputCommitter algorithm versions 1 and 2. We close with a discussion on current limitations for the new committer, providing workarounds where possible.

Comparison with FileOutputCommitter

In Amazon EMR version 5.19.0 and earlier, Spark jobs that write Parquet to Amazon S3 use a Hadoop commit algorithm called FileOutputCommitter by default. There are two versions of this algorithm, version 1 and 2. Both versions rely on writing intermediate task output to temporary locations. They subsequently perform rename operations to make the data visible at task or job completion time.

Algorithm version 1 has two phases of rename: one to commit the individual task output, and the other to commit the overall job output from completed/successful tasks. Algorithm version 2 is more efficient because task commits rename files directly to the final output location. This eliminates the second rename phase, but it makes partial data visible before the job completes, which not all workloads can tolerate.

The renames that are performed are fast, metadata-only operations on the Hadoop Distributed File System (HDFS). However, when output is written to object stores such as Amazon S3, renames are implemented by copying data to the target and then deleting the source. This rename “penalty” is exacerbated with directory renames, which can happen in both phases of FileOutputCommitter v1. Whereas these are single metadata-only operations on HDFS, committers must execute N copy-and-delete operations on S3.

To partially mitigate this, Amazon EMR 5.14.0+ defaults to FileOutputCommitter v2 when writing Parquet data to S3 with EMRFS in Spark. The new EMRFS S3-optimized committer improves on that work to avoid rename operations altogether by using the transactional properties of Amazon S3 multipart uploads. Tasks may then write their data directly to the final output location, but defer completion of each output file until task commit time.

Performance test

We evaluated the write performance of the different committers by executing the following INSERT OVERWRITE Spark SQL query. The SELECT * FROM range(…) clause generated data at execution time. This produced ~15 GB of data across exactly 100 Parquet files in Amazon S3.

SET rows=4e9; -- 4 Billion
SET partitions=100;

INSERT OVERWRITE DIRECTORY ‘s3://${bucket}/perf-test/${trial_id}’
USING PARQUET SELECT * FROM range(0, ${rows}, 1, ${partitions});

Note: The EMR cluster ran in the same AWS Region as the S3 bucket. The trial_id property used a UUID generator to ensure that there was no conflict between test runs.

We executed our test on an EMR cluster created with the emr-5.19.0 release label, with a single m5d.2xlarge instance in the master group, and eight m5d.2xlarge instances in the core group. We used the default Spark configuration properties set by Amazon EMR for this cluster configuration, which include the following:

spark.dynamicAllocation.enabled true
spark.executor.memory 11168M
spark.executor.cores 4

After running 10 trials for each committer, we captured and summarized query execution times in the following chart. Whereas FileOutputCommitter v2 averaged 49 seconds, the EMRFS S3-optimized committer averaged only 31 seconds—a 1.6x speedup.

As mentioned earlier, FileOutputCommitter v2 eliminates some, but not all, rename operations that FileOutputCommitter v1 uses. To illustrate the full performance impact of renames against S3, we reran the test using FileOutputCommitter v1. In this scenario, we observed an average runtime of 450 seconds, which is 14.5x slower than the EMRFS S3-optimized committer.

The last scenario we evaluated is the case when EMRFS consistent view is enabled, which addresses issues that can arise due to the Amazon S3 data consistency model. In this mode, the EMRFS S3-optimized committer time was unaffected by this change and still averaged 30 seconds. On the other hand, FileOutputCommitter v2 averaged 53 seconds, which was slower than when the consistent view feature was turned off, widening the overall performance difference to 1.8x.

Job correctness

The EMRFS S3-optimized committer has the same limitations that FileOutputCommitter v2 has because both improve performance by fully delegating commit responsibilities to the individual tasks. The following is a discussion of the notable consequences of this design choice.

Partial results from incomplete or failed jobs

Because both committers have their tasks write to the final output location, concurrent readers of that output location can view partial results when using either of them. If a job fails, partial results are left behind from any tasks that have committed before the overall job failed. This situation can lead to duplicate output if the job is run again without first cleaning up the output location.

One way to mitigate this issue is to ensure that a job uses a different output location each time it runs, publishing the location to downstream readers only if the job succeeds. The following code block is an example of this strategy for workloads that use Hive tables. Notice how output_location is set to a unique value each time the job is run, and that the table partition is registered only if the rest of the query succeeds. As long as readers exclusively access data via the table abstraction, they cannot see results before the job finishes.

SET attempt_id=<a random UUID>;
SET output_location=s3://bucket/${attempt_id};


ALTER TABLE output ADD PARTITION (dt = ‘2018-11-26’)
LOCATION ‘${output_location}’;

This approach requires treating the locations that partitions point to as immutable. Updates to partition contents require restating all results into a new location in S3, and then updating the partition metadata to point to that new location.

Duplicate results from non-idempotent tasks

Another scenario that can cause both committers to produce incorrect results is when jobs composed of non-idempotent tasks produce outputs into non-deterministic locations for each task attempt.

The following is an example of a query that illustrates the issue. It uses a timestamp-based table partitioning scheme to ensure that it writes to a different location for each task attempt.

SET hive.exec.dynamic.partition=true
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO data PARTITION (time) SELECT 42, current_timestamp();

You can avoid the issue of duplicate results in this scenario by ensuring that tasks write to a consistent location across task attempts. For example, instead of calling functions that return the current timestamp within tasks, consider providing the current timestamp as an input to the job. Similarly, if a random number generator is used within jobs, consider using a fixed seed or one that is based on the task’s partition number to ensure that task reattempts uses the same value.

Note: Spark’s built-in random functions rand(), randn(), and uuid() are already designed with this in mind.

Enabling the EMRFS S3-optimized committer

Starting with Amazon EMR version 5.20.0, the EMRFS S3-optimized committer is enabled by default. In Amazon EMR version 5.19.0, you can enable the committer by setting the spark.sql.parquet.fs.optimized.committer.optimization-enabled property to true from within Spark or when creating clusters. The committer takes effect when you use Spark’s built-in Parquet support to write Parquet files into Amazon S3 with EMRFS. This includes using the Parquet data source with Spark SQL, DataFrames, or Datasets. However, there are some use cases when the EMRFS S3-optimized committer does not take effect, and some use cases where Spark performs its own renames entirely outside of the committer. For more information about the committer and about these special cases, see Using the EMRFS S3-optimized Committer in the Amazon EMR Release Guide.

Related Work – S3A Committers

The EMRFS S3-optimized committer was inspired by concepts used by committers that support the S3A file system. The key take-away is that these committers use the transactional nature of S3 multipart uploads to eliminate some or all of the rename costs. This is also the core concept used by the EMRFS S3-optimized committer.

For more information about the various committers available within the ecosystem, including those that support the S3A file system, see the official Apache Hadoop documentation.


The EMRFS S3-optimized committer improves write performance compared to FileOutputCommitter. Starting with Amazon EMR version 5.19.0, you can use it with Spark’s built-in Parquet support. For more information, see Using the EMRFS S3-optimized Committer in the Amazon EMR Release Guide.


About the authors

Peter Slawski is a software development engineer with Amazon Web Services.





Jonathan Kelly is a senior software development engineer with Amazon Web Services.





Deploying a personalized API Gateway serverless developer portal

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/deploying-a-personalized-api-gateway-serverless-developer-portal/

This post is courtesy of Drew Dresser, Application Architect – AWS Professional Services

Amazon API Gateway is a fully managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale. Customers of these APIs often want a website to learn and discover APIs that are available to them. These customers might include front-end developers, third-party customers, or internal system engineers. To produce such a website, we have created the API Gateway serverless developer portal.

The API Gateway serverless developer portal (developer portal or portal, for short) is an application that you use to make your API Gateway APIs available to your customers by enabling self-service discovery of those APIs. Your customers can use the developer portal to browse API documentation, register for, and immediately receive their own API key that they can use to build applications, test published APIs, and monitor their own API usage.

Over the past few months, the team has been hard at work contributing to the open source project, available on Github. The developer portal was relaunched on October 29, 2018, and the team will continue to push features and take customer feedback from the open source community. Today, we’re happy to highlight some key functionality of the new portal.

Benefits for API publishers

API publishers use the developer portal to expose the APIs that they manage. As an API publisher, you need to set up, maintain, and enable the developer portal. The new portal has the following benefits:

Benefits for API consumers

API Consumers use the developer portal as a traditional application user. An API consumer needs to understand the APIs being published. API consumers might be front-end developers, distributed system engineers, or third-party customers. The new developer portal comes with the following benefits for API consumers:

  • Explore – API consumers can quickly page through lists of APIs. When they find one they’re interested in, they can immediately see documentation on that API.
  • Learn – API consumers might need to drill down deeper into an API to learn its details. They want to learn how to form requests and what they can expect as a response.
  • Test – Through the developer portal, API consumers can get an API key and invoke the APIs directly. This enables developers to develop faster and with more confidence.


The developer portal is a completely serverless application. It leverages Amazon API Gateway, Amazon Cognito User Pools, AWS Lambda, Amazon DynamoDB, and Amazon S3. Serverless architectures enable you to build and run applications without needing to provision, scale, and manage any servers. The developer portal is broken down in to multiple microservices, each with a distinct responsibility, as shown in the following image.

Identity management for the developer portal is performed by Amazon Cognito and a Lambda function in the Login & Registration microservice. An Amazon Cognito User Pool is configured out of the box to enable users to register and login. Additionally, you can deploy the developer portal to use a UI hosted by Amazon Cognito, which you can customize to match your style and branding.

Requests are routed to static content served from Amazon S3 and built using React. The React app communicates to the Lambda backend via API Gateway. The Lambda function is built using the aws-serverless-express library and contains the business logic behind the APIs. The business logic of the web application queries and add data to the API Key Creation and Catalog Update microservices.

To maintain the API catalog, the Catalog Update microservice uses an S3 bucket and a Lambda function. When an API’s Swagger file is added or removed from the bucket, the Lambda function triggers and maintains the API catalog by updating the catalog.json file in the root of the S3 bucket.

To manage the mapping between API keys and customers, the application uses the API Key Creation microservice. The service updates API Gateway with API key creations or deletions and then stores the results in a DynamoDB table that maps customers to API keys.

Deploying the developer portal

You can deploy the developer portal using AWS SAM, the AWS SAM CLI, or the AWS Serverless Application Repository. To deploy with AWS SAM, you can simply clone the repository and then deploy the application using two commands from your CLI. For detailed instructions for getting started with the portal, see Use the Serverless Developer Portal to Catalog Your API Gateway APIs in the Amazon API Gateway Developer Guide.

Alternatively, you can deploy using the AWS Serverless Application Repository as follows:

  1. Navigate to the api-gateway-dev-portal application and choose Deploy in the top right.
  2. On the Review page, for ArtifactsS3BucketName and DevPortalSiteS3BucketName, enter globally unique names. Both buckets are created for you.
  3. To deploy the application with these settings, choose Deploy.
  4. After the stack is complete, get the developer portal URL by choosing View CloudFormation Stack. Under Outputs, choose the URL in Value.

The URL opens in your browser.

You now have your own serverless developer portal application that is deployed and ready to use.

Publishing a new API

With the developer portal application deployed, you can publish your own API to the portal.

To get started:

  1. Create the PetStore API, which is available as a sample API in Amazon API Gateway. The API must be created and deployed and include a stage.
  2. Create a Usage Plan, which is required so that API consumers can create API keys in the developer portal. The API key is used to test actual API calls.
  3. On the API Gateway console, navigate to the Stages section of your API.
  4. Choose Export.
  5. For Export as Swagger + API Gateway Extensions, choose JSON. Save the file with the following format: apiId_stageName.json.
  6. Upload the file to the S3 bucket dedicated for artifacts in the catalog path. In this post, the bucket is named apigw-dev-portal-artifacts. To perform the upload, run the following command.
    aws s3 cp apiId_stageName.json s3://yourBucketName/catalog/apiId_stageName.json

Uploading the file to the artifacts bucket with a catalog/ key prefix automatically makes it appear in the developer portal.

This might be familiar. It’s your PetStore API documentation displayed in the OpenAPI format.

With an API deployed, you’re ready to customize the portal’s look and feel.

Customizing the developer portal

Adding a customer’s own look and feel to the developer portal is easy, and it creates a great user experience. You can customize the domain name, text, logo, and styling. For a more thorough walkthrough of customizable components, see Customization in the GitHub project.

Let’s walk through a few customizations to make your developer portal more familiar to your API consumers.

Customizing the logo and images

To customize logos, images, or content, you need to modify the contents of the your-prefix-portal-static-assets S3 bucket. You can edit files using the CLI or the AWS Management Console.

Start customizing the portal by using the console to upload a new logo in the navigation bar.

  1. Upload the new logo to your bucket with a key named custom-content/nav-logo.png.
    aws s3 cp {myLogo}.png s3://yourPrefix-portal-static-assets/custom-content/nav-logo.png
  2. Modify object permissions so that the file is readable by everyone because it’s a publicly available image. The new navigation bar looks something like this:

Another neat customization that you can make is to a particular API and stage image. Maybe you want your PetStore API to have a dog picture to represent the friendliness of the API. To add an image:

  1. Use the command line to copy the image directly to the S3 bucket location.
    aws s3 cp my-image.png s3://yourPrefix-portal-static-assets/custom-content/api-logos/apiId-stageName.png
  2. Modify object permissions so that the file is readable by everyone.

Customizing the text

Next, make sure that the text of the developer portal welcomes your pet-friendly customer base. The YAML files in the static assets bucket under /custom-content/content-fragments/ determine the portal’s text content.

To edit the text:

  1. On the AWS Management Console, navigate to the website content S3 bucket and then navigate to /custom-content/content-fragments/.
  2. Home.md is the content displayed on the home page, APIs.md controls the tab text on the navigation bar, and GettingStarted.md contains the content of the Getting Started tab. All three files are written in markdown. Download one of them to your local machine so that you can edit the contents. The following image shows Home.md edited to contain custom text:
  3. After editing and saving the file, upload it back to S3, which results in a customized home page. The following image reflects the configuration changes in Home.md from the previous step:

Customizing the domain name

Finally, many customers want to give the portal a domain name that they own and control.

To customize the domain name:

  1. Use AWS Certificate Manager to request and verify a managed certificate for your custom domain name. For more information, see Request a Public Certificate in the AWS Certificate Manager User Guide.
  2. Copy the Amazon Resource Name (ARN) so that you can pass it to the developer portal deployment process. That process is now includes the certificate ARN and a property named UseRoute53Nameservers. If the property is set to true, the template creates a hosted zone and record set in Amazon Route 53 for you. If the property is set to false, the template expects you to use your own name server hosting.
  3. If you deployed using the AWS Serverless Application Repository, navigate to the Application page and deploy the application along with the certificate ARN.

After the developer portal is deployed and your CNAME record has been added, the website is accessible from the custom domain name as well as the new Amazon CloudFront URL.

Customizing the logo, text content, and domain name are great tools to make the developer portal feel like an internally developed application. In this walkthrough, you completely changed the portal’s appearance to enable developers and API consumers to discover and browse APIs.


The developer portal is available to use right away. Support and feature enhancements are tracked in the public GitHub. You can contribute to the project by following the Code of Conduct and Contributing guides. The project is open-sourced under the Amazon Open Source Code of Conduct. We plan to continue to add functionality and listen to customer feedback. We can’t wait to see what customers build with API Gateway and the API Gateway serverless developer portal.

Handling AWS Chargebacks for Enterprise Customers

Post Syndicated from Varad Ram original https://aws.amazon.com/blogs/architecture/handling-aws-chargebacks-for-enterprise-customers/

As AWS product portfolios and feature sets grow, as an enterprise customer, you are likely to migrate your existing workloads and innovate your new products on AWS. To help you keep your cloud charges simple, you can use consolidated billing. This can, however, create complexity for your internal chargebacks, especially if some of your resources and services are not tagged correctly. To help your individual teams and business units normalize and reduce their costs as your AWS implementation grows, you can implement chargebacks transparently and automate billing.

This blog post includes a walkthrough of an end-to-end mechanism that you can use to automate your consolidated billing charges for either your existing AWS accounts, or for newly created accounts.


Prerequisites for implementation:

  • One account that is the payer account, which consolidates billing and links all other accounts (including admin accounts)
  • An understanding of billing, Detailed Billing Report (DBR), Cost and Usage Report (CUR), and blended and unblended costs
  • Activate propagation of necessary cost allocation tags to consolidated billing
  • Access to reservations across the linked accounts
  • Read permission on the source bucket and write permission to the transformed bucket
  • An automated method (such as database access or an API) to verify the cost centers tagged to AWS resources
  • Permissions to get access to the services described in this solution on the account targeted for this automation

Before you begin, it is important to understand the blended costs and unblended costs in consolidated billing. Blended costs are calculated based on the blended rate (the average rates for the reserved and on-demand instances that are used by your member accounts) for each service your accounts used, multiplied by the account usage of those services. Unblended costs are the charges for those services broken out for each linked account.

Based on your organization’s strategy for savings (centralized or not), you could consider either the blended or unblended costs. The consolidated billing files that include the information for the chargeback are the Detailed Billing Report (DBR) and Cost and Usage Report (CUR). Both of these reports provide both the blended and unblended rates as separate columns.

To help you create and maintain your AWS accounts, you can use AWS Account Vending Machine (AVM). You can launch AVM from either the AWS Landing Zone or with a custom solution. AVM keeps all your account information in a DynamoDB table (such as the account number, root mail ID, default cost center, name of the owner, etc.) and maintains reservation-related data (such as invoice ID, instance type, region, amount, cost center, etc.) in another table. To enable your account administrator to add invoice details for all your reservations, you can use a web page hosted on AWS Lambda, Amazon Simple Storage Service (Amazon S3), or a web server.

To begin the process of billing transformation, you must add a trigger on an S3 bucket (which contains raw AWS billing files) that pushes messages (PutObject) into Amazon Simple Queue Service (SQS) and your billing transformation program (written in Python, Nodejs, Java, .net, etc. using AWS SDK) that runs on an Amazon Elastic Compute Cloud (Amazon EC2) instance, containers, or Lambda (if the bill can be processed within 15 minutes with file size restrictions).

The billing transformation program must do the following:

  • Cache the Account details and reservation DynamoDB tables
  • Verify if there are any messages in SQS
  • Ignore if the file is not a DBR or CUR file (process either of them, not both)
  • Download the file, unzip, and read row-by-row; for a DBR file, consider only the “LineItem” RecordType
  • Add two new columns: Bill_CostCenter and Bill_Notes
    • If there is a valid value in the CostCenter tag (verified with internal automation processes), add the same value to the Bill_CostCenter column and any notes to the Bill_Notes column
    • If the CostCenter is invalid, get the default Cost Center from the cached account details and add the information to the Bill_CostCenter and Bill_Notes columns
    • If the row is a reservation invoice, the cost center information comes from the reservation table and is added to the correct column
  • Cache consolidation of cost centers with the blended or unblended cost of each row
  • Write each of these processed line items into a new file
  • Handle exceptions by the normal organization practices (for example, email the owner of the cost center or the finance team)
  • Push the new file into the transformed Amazon S3 bucket
  • Write the consolidated lines into a different file and upload to Transformed Amazon S3 bucket
Figure 1 – Architecture of processing a billing chargeback

Figure 1 – Architecture of processing a billing chargeback


Figure 2 – Validating the Cost Center process

After you have the consolidated billing file aggregated by cost center, you can easily see and handle your internal chargebacks. To further simplify your chargeback model, you can get help from AWS Technical Account Managers and Billing Concierge, if your organization would like AWS to provide custom invoices from the consolidated billing file.

Because the cost centers in your organization can expire over time, it’s important validate them frequently with automation, such as a Lambda program.


If your organization has a more complex chargeback structure, you can extend the logic described above to support deeper and broader chargeback codes, or implement hierarchical chargeback structure.

You can also extend the transformation logic to support several chargeback codes (such as comma separated or with additional tags) if you have multiple teams or project that want to share a resource.


As enterprise organizations grow and consume more cloud services, the cost optimization process grows and evolves with them. Sophisticated chargeback models enable the teams and business units in the organization to be accountable and contribute to take the steps necessary to normalize the usage and costs of AWS services.

About the Author

Varad RamVarad Ram likes to help customers adopt to cloud technologies and he is particularly interested in Artificial Intelligence. He believes Deep Learning will power future technology growth. In his spare time, his daughter and toddler son keep him busy biking and hiking.

Our data lake story: How Woot.com built a serverless data lake on AWS

Post Syndicated from Karthik Kumar Odapally original https://aws.amazon.com/blogs/big-data/our-data-lake-story-how-woot-com-built-a-serverless-data-lake-on-aws/

In this post, we talk about designing a cloud-native data warehouse as a replacement for our legacy data warehouse built on a relational database.

At the beginning of the design process, the simplest solution appeared to be a straightforward lift-and-shift migration from one relational database to another. However, we decided to step back and focus first on what we really needed out of a data warehouse. We started looking at how we could decouple our legacy Oracle database into smaller microservices, using the right tool for the right job. Our process wasn’t just about using the AWS tools. More, it was about having a mind shift to use cloud-native technologies to get us to our final state.

This migration required developing new extract, transform, load (ETL) pipelines to get new data flowing in while also migrating existing data. Because of this migration, we were able to deprecate multiple servers and move to a fully serverless data warehouse orchestrated by AWS Glue.

In this blog post, we are going to show you:

  • Why we chose a serverless data lake for our data warehouse.
  • An architectural diagram of Woot’s systems.
  • An overview of the migration project.
  • Our migration results.

Architectural and design concerns

Here are some of the design points that we considered:

  • Customer experience. We always start with what our customer needs, and then work backwards from there. Our data warehouse is used across the business by people with varying level of technical expertise. We focused on the ability for different types of users to gain insights into their operations and to provide better feedback mechanisms to improve the overall customer experience.
  • Minimal infrastructure maintenance. The “Woot data warehouse team” is really just one person—Chaya! Because of this, it’s important for us to focus on AWS services that enable us to use cloud-native technologies. These remove the undifferentiated heavy lifting of managing infrastructure as demand changes and technologies evolve.
  • Responsiveness to data source changes. Our data warehouse gets data from a range of internal services. In our existing data warehouse, any updates to those services required manual updates to ETL jobs and tables. The response times for these data sources are critical to our key stakeholders. This requires us to take a data-driven approach to selecting a high-performance architecture.
  • Separation from production systems. Access to our production systems is tightly coupled. To allow multiple users, we needed to decouple it from our production systems and minimize the complexities of navigating resources in multiple VPCs.

Based on these requirements, we decided to change the data warehouse both operationally and architecturally. From an operational standpoint, we designed a new shared responsibility model for data ingestion. Architecturally, we chose a serverless model over a traditional relational database. These two decisions ended up driving every design and implementation decision that we made in our migration.

As we moved to a shared responsibility model, several important points came up. First, our new way of data ingestion was a major cultural shift for Woot’s technical organization. In the past, data ingestion had been exclusively the responsibility of the data warehouse team and required customized pipelines to pull data from services. We decided to shift to “push, not pull”: Services should send data to the data warehouse.

This is where shared responsibility came in. For the first time, our development teams had ownership over their services’ data in the data warehouse. However, we didn’t want our developers to have to become mini data engineers. Instead, we had to give them an easy way to push data that fit with the existing skill set of a developer. The data also needed to be accessible by the range of technologies used by our website.

These considerations led us to select the following AWS services for our serverless data warehouse:

The following diagram shows at a high level how we use these services.


These components together met all of our requirements and enabled our shared responsibility model. However, we made few tradeoffs compared to a lift-and-shift migration to another relational database:

  • The biggest tradeoff was upfront effort vs. ongoing maintenance. We effectively had to start from scratch with all of our data pipelines and introduce a new technology into all of our website services, which required a concerted effort across multiple teams. Minimal ongoing maintenance was a core requirement. We were willing to make this tradeoff to take advantage of the managed infrastructure of the serverless components that we use.
  • Another tradeoff was balancing usability for nontechnical users vs. taking advantage of big data technologies. Making customer experience a core requirement helped us navigate the decision-making when considering these tradeoffs. Ultimately, only switching to another relational database would mean that our customers would have the same experience, not a better one.

Building data pipelines with Kinesis Data Firehose and Lambda

Because our site already runs on AWS, using an AWS SDK to send data to Kinesis Data Firehose was an easy sell to developers. Things like the following were considerations:

  • Direct PUT ingestion for Kinesis Data Firehose is natural for developers to implement, works in all languages used across our services, and delivers data to Amazon S3.
  • Using S3 for data storage means that we automatically get high availability, scalability, and durability. And because S3 is a global resource, it enables us to manage the data warehouse in a separate AWS account and avoid the complexity of navigating multiple VPCs.

We also consume data stored in Amazon DynamoDB tables. Kinesis Data Firehose again provided the core of the solution, this time combined with DynamoDB Streams and Lambda. For each DynamoDB table, we enabled DynamoDB Streams and then used the stream to trigger a Lambda function.

The Lambda function cleans the DynamoDB stream output and writes the cleaned JSON to Kinesis Data Firehose using boto3. After doing this, it converges with the other process and outputs the data to S3. For more information, see How to Stream Data from Amazon DynamoDB to Amazon Aurora using AWS Lambda and Amazon Kinesis Firehose on the AWS Database Blog.

Lambda gave us more fine-grained control and enabled us to move files between accounts:

  • We enabled S3 event notifications on the S3 bucket and created an Amazon SNS topic to receive notifications whenever Kinesis Data Firehose put an object in the bucket.
  • The SNS topic triggered a Lambda function, which took the Kinesis output and moved it to the data warehouse account in our chosen partition structure.

S3 event notifications can trigger Lambda functions, but we chose SNS as an intermediary because the S3 bucket and Lambda function were in separate accounts.

Migrating existing data with AWS DMS and AWS Glue

We needed to migrate data from our existing RDS database to S3, which we accomplished with AWS DMS. DMS natively supports S3 as a target, as described in the DMS documentation.

Setting this up was relatively straightforward. We exported data directly from our production VPC to the separate data warehouse account by tweaking the connection attributes in DMS. The string that we used was this:


This code gives ownership to the bucket owner (the destination data warehouse account), compresses the files to save on storage costs, and includes all column names. After the data was in S3, we used an AWS Glue crawler to infer the schemas of all exported tables and then compared against the source data.

With AWS Glue, some of the challenges we overcame were these:

  • Unstructured text data, such as forum and blog posts. DMS exports these to CSV. This approach conflicted with the commas present in the text data. We opted to use AWS Glue to export data from RDS to S3 in Parquet format, which is unaffected by commas because it encodes columns directly.
  • Cross-account exports. We resolved this by including the code

"glueContext._jsc.hadoopConfiguration().set("fs.s3.canned.acl", "BucketOwnerFullControl”)”

at the top of each AWS Glue job to grant bucket owner access to all S3 files produced by AWS Glue.

Overall, AWS DMS was quicker to set up and great for exporting large amounts of data with rule-based transformations. AWS Glue required more upfront effort to set up jobs, but provided better results for cases where we needed more control over the output.

If you’re looking to convert existing raw data (CSV or JSON) into Parquet, you can set up an AWS Glue job to do that. The process is described in the AWS Big Data Blog post Build a data lake foundation with AWS Glue and Amazon S3.

Bringing it all together with AWS Glue, Amazon Athena, and Amazon QuickSight

After data landed in S3, it was time for the real fun to start: actually working with the data! Can you tell I’m a data engineer? For me, a big part of the fun was exploring AWS Glue:

  • AWS Glue handles our ETL job scheduling.
  • AWS Glue crawlers manage the metadata in the AWS Glue Data Catalog.

Crawlers are the “secret sauce” that enables us to be responsive to schema changes. Throughout the pipeline, we chose to make each step as schema-agnostic as possible, which allows any schema changes to flow through until they reach AWS Glue.

However, raw data is not ideal for most of our business users, because it often has duplicates or incorrect data types. Most importantly, the data out of Firehose is in JSON format, but we quickly observed significant query performance gains from using Parquet format. Here, we used one of the performance tips in the Big Data Blog post Top 10 performance tuning tips for Amazon Athena.

With our shared responsibility model, the data warehouse and BI teams are responsible for the final processing of data into curated datasets ready for reporting. Using Lambda and AWS Glue enables these teams to work in Python and SQL (the core languages for Amazon data engineering and BI roles). It also enables them to deploy code with minimal infrastructure setup or maintenance.

Our ETL process is as follows:

  • Scheduled triggers.
  • Series of conditional triggers that control the flow of subsequent jobs that depend on previous jobs.
  • A similar pattern across many jobs of reading in the raw data, deduplicating the data, and then writing to Parquet. We centralized this logic by creating a Python library of functions and uploading it to S3. We then included that library in the AWS Glue job as an additional Python library. For more information on how to do this, see Using Python Libraries with AWS Glue in the AWS Glue documentation.

We also migrated complex jobs used to create reporting tables with business metrics:

  • The AWS Glue use of PySpark simplified the migration of these queries, because you can embed SparkSQL queries directly in the job.
  • Converting to SparkSQL took some trial and error, but ultimately required less work than translating SQL queries into Spark methods. However, for people on our BI team who had previously worked with Pandas or Spark, working with Spark dataframes was a natural transition. As someone who used SQL for several years before learning Python, I appreciate that PySpark lets me quickly switch back and forth between SQL and an object-oriented framework.

Another hidden benefit of using AWS Glue jobs is that the AWS Glue version of Python (like Lambda) already has boto3 installed. Thus, ETL jobs can directly use AWS API operations without additional configuration.

For example, some of our longer-running jobs created read inconsistency if a user happened to query that table while AWS Glue was writing data to S3. We modified the AWS Glue jobs to write to a temporary directory with Spark and then used boto3 to move the files into place. Doing this reduced read inconsistency by up to 90 percent. It was great to have this functionality readily available, which may not have been the case if we managed our own Spark cluster.

Comparing previous state and current state

After we had all the datasets in place, it was time for our customers to come on board and start querying. This is where we really leveled up the customer experience.

Previously, users had to download a SQL client, request a user name and password, set it up, and learn SQL to get data out. Now, users just sign in to the AWS Management Console through automatically provisioned IAM roles and run queries in their browser with Athena. Or if they want to skip SQL altogether, they can use our Amazon QuickSight account with accounts managed through our pre-existing Active Directory server.

Integration with Active Directory was a big win for us. We wanted to enable users to get up and running without having to wait for an account to be created or managing separate credentials. We already use Active Directory across the company for access to multiple resources. Upgrading to Amazon QuickSight Enterprise Edition enabled us to manage access with our existing AD groups and credentials.

Migration results

Our legacy data warehouse was developed over the course of five years. We recreated it as a serverless data lake using AWS Glue in about three months.

In the end, it took more upfront effort than simply migrating to another relational database. We also dealt with more uncertainty because we used many products that were relatively new to us (especially AWS Glue).

However, in the months since the migration was completed, we’ve gotten great feedback from data warehouse users about the new tools. Our users have been amazed by these things:

  • How fast Athena is.
  • How intuitive and beautiful Amazon QuickSight is. They love that no setup is required—it’s easy enough that even our CEO has started using it!
  • That Athena plus the AWS Glue Data Catalog have given us the performance gains of a true big data platform, but for end users it retains the look and feel of a relational database.


From an operational perspective, the investment has already started to pay off. Literally: Our operating costs have fallen by almost 90 percent.

Personally, I was thrilled that recently I was able to take a three-week vacation and didn’t get paged once, thanks to the serverless infrastructure. And for our BI engineers in addition to myself, the S3-centric architecture is enabling us to experiment with new technologies by integrating seamlessly with other services, such as Amazon EMR, Amazon SageMaker, Amazon Redshift Spectrum, and Lambda. It’s been exciting to see how these services have grown in the time since we’ve adopted them (for example, the recent AWS Glue launch of Amazon CloudWatch metrics and Athena’s launch of views).

We are thrilled that we’ve invested in technologies that continue to grow as we do. We are incredibly proud of our team for accomplishing this ambitious migration. We hope our experience can inspire other engineers to dive in to building a data lake of their own.

For additional information, see these similar AWS Big Data blog posts:

About the authors

Chaya Carey is a data engineer at Woot.com. At Woot, she’s responsible for managing the data warehouse and other scalable data solutions. Outside of work, she’s passionate about Seattle’s bar and restaurant scene, books, and video games.




Karthik Odapally is a senior solutions architect at AWS. His passion is to build cost-effective and highly scalable solutions on the cloud. In his spare time, he bakes cookies and cupcakes for family and friends here in the PNW. He loves vintage racing cars.





Optimizing a Lift-and-Shift for Security

Post Syndicated from Jonathan Shapiro-Ward original https://aws.amazon.com/blogs/architecture/optimizing-a-lift-and-shift-for-security/

This is the third and final blog within a three-part series that examines how to optimize lift-and-shift workloads. A lift-and-shift is a common approach for migrating to AWS, whereby you move a workload from on-prem with little or no modification. This third blog examines how lift-and-shift workloads can benefit from an improved security posture with no modification to the application codebase. (Read about optimizing a lift-and-shift for performance and for cost effectiveness.)

Moving to AWS can help to strengthen your security posture by eliminating many of the risks present in on-premise deployments. It is still essential to consider how to best use AWS security controls and mechanisms to ensure the security of your workload. Security can often be a significant concern in lift-and-shift workloads, especially for legacy workloads where modern encryption and security features may not present. By making use of AWS security features you can significantly improve the security posture of a lift-and-shift workload, even if it lacks native support for modern security best practices.

Adding TLS with Application Load Balancers

Legacy applications are often the subject of a lift-and-shift. Such migrations can help reduce risks by moving away from out of date hardware but security risks are often harder to manage. Many legacy applications leverage HTTP or other plaintext protocols that are vulnerable to all manner of attacks. Often, modifying a legacy application’s codebase to implement TLS is untenable, necessitating other options.

One comparatively simple approach is to leverage an Application Load Balancer or a Classic Load Balancer to provide SSL offloading. In this scenario, the load balancer would be exposed to users, while the application servers that only support plaintext protocols will reside within a subnet which is can only be accessed by the load balancer. The load balancer would perform the decryption of all traffic destined for the application instance, forwarding the plaintext traffic to the instances. This allows  you to use encryption on traffic between the client and the load balancer, leaving only internal communication between the load balancer and the application in plaintext. Often this approach is sufficient to meet security requirements, however, in more stringent scenarios it is never acceptable for traffic to be transmitted in plaintext, even if within a secured subnet. In this scenario, a sidecar can be used to eliminate plaintext traffic ever traversing the network.

Improving Security and Configuration Management with Sidecars

One approach to providing encryption to legacy applications is to leverage what’s often termed the “sidecar pattern.” The sidecar pattern entails a second process acting as a proxy to the legacy application. The legacy application only exposes its services via the local loopback adapter and is thus accessible only to the sidecar. In turn the sidecar acts as an encrypted proxy, exposing the legacy application’s API to external consumers via TLS. As unencrypted traffic between the sidecar and the legacy application traverses the loopback adapter, it never traverses the network. This approach can help add encryption (or stronger encryption) to legacy applications when it’s not feasible to modify the original codebase. A common approach to implanting sidecars is through container groups such as pod in EKS or a task in ECS.

Implementing the Sidecar Pattern With Containers

Figure 1: Implementing the Sidecar Pattern With Containers

Another use of the sidecar pattern is to help legacy applications leverage modern cloud services. A common example of this is using a sidecar to manage files pertaining to the legacy application. This could entail a number of options including:

  • Having the sidecar dynamically modify the configuration for a legacy application based upon some external factor, such as the output of Lambda function, SNS event or DynamoDB write.
  • Having the sidecar write application state to a cache or database. Often applications will write state to the local disk. This can be problematic for autoscaling or disaster recovery, whereby having the state easily accessible to other instances is advantages. To facilitate this, the sidecar can write state to Amazon S3, Amazon DynamoDB, Amazon Elasticache or Amazon RDS.

A sidecar requires customer development, but it doesn’t require any modification of the lift-and-shifted application. A sidecar treats the application as a blackbox and interacts with it via its API, configuration file, or other standard mechanism.

Automating Security

A lift-and-shift can achieve a significantly stronger security posture by incorporating elements of DevSecOps. DevSecOps is a philosophy that argues that everyone is responsible for security and advocates for automation all parts of the security process. AWS has a number of services which can help implement a DevSecOps strategy. These services include:

  • Amazon GuardDuty: a continuous monitoring system which analyzes AWS CloudTrail Events, Amazon VPC Flow Log and DNS Logs. GuardDuty can detect threats and trigger an automated response.
  • AWS Shield: a managed DDOS protection services
  • AWS WAF: a managed Web Application Firewall
  • AWS Config: a service for assessing, tracking, and auditing changes to AWS configuration

These services can help detect security problems and implement a response in real time, achieving a significantly strong posture than traditional security strategies. You can build a DevSecOps strategy around a lift-and-shift workload using these services, without having to modify the lift-and-shift application.


There are many opportunities for taking advantage of AWS services and features to improve a lift-and-shift workload. Without any alteration to the application you can strengthen your security posture by utilizing AWS security services and by making small environmental and architectural changes that can help alleviate the challenges of legacy workloads.

About the author

Dr. Jonathan Shapiro-Ward is an AWS Solutions Architect based in Toronto. He helps customers across Canada to transform their businesses and build industry leading cloud solutions. He has a background in distributed systems and big data and holds a PhD from the University of St Andrews.

Optimizing a Lift-and-Shift for Cost Effectiveness and Ease of Management

Post Syndicated from Jonathan Shapiro-Ward original https://aws.amazon.com/blogs/architecture/optimizing-a-lift-and-shift-for-cost/

Lift-and-shift is the process of migrating a workload from on premise to AWS with little or no modification. A lift-and-shift is a common route for enterprises to move to the cloud, and can be a transitionary state to a more cloud native approach. This is the second blog post in a three-part series which investigates how to optimize a lift-and-shift workload. The first post is about performance.

A key concern that many customers have with a lift-and-shift is cost. If you move an application as is  from on-prem to AWS, is there any possibility for meaningful cost savings? By employing AWS services, in lieu of self-managed EC2 instances, and by leveraging cloud capability such as auto scaling, there is potential for significant cost savings. In this blog post, we will discuss a number of AWS services and solutions that you can leverage with minimal or no change to your application codebase in order to significantly reduce management costs and overall Total Cost of Ownership (TCO).


Even if you can’t modify your application, you can change the way you deploy your application. The adopting-an-infrastructure-as-code approach can vastly improve the ease of management of your application, thereby reducing cost. By templating your application through Amazon CloudFormation, Amazon OpsWorks, or Open Source tools you can make deploying and managing your workloads a simple and repeatable process.

As part of the lift-and-shift process, rationalizing the workload into a set of templates enables less time to spent in the future deploying and modifying the workload. It enables the easy creation of dev/test environments, facilitates blue-green testing, opens up options for DR, and gives the option to roll back in the event of error. Automation is the single step which is most conductive to improving ease of management.

Reserved Instances and Spot Instances

A first initial consideration around cost should be the purchasing model for any EC2 instances. Reserved Instances (RIs) represent a 1-year or 3-year commitment to EC2 instances and can enable up to 75% cost reduction (over on demand) for steady state EC2 workloads. They are ideal for 24/7 workloads that must be continually in operation. An application requires no modification to make use of RIs.

An alternative purchasing model is EC2 spot. Spot instances offer unused capacity available at a significant discount – up to 90%. Spot instances receive a two-minute warning when the capacity is required back by EC2 and can be suspended and resumed. Workloads which are architected for batch runs – such as analytics and big data workloads – often require little or no modification to make use of spot instances. Other burstable workloads such as web apps may require some modification around how they are deployed.

A final alternative is on-demand. For workloads that are not running in perpetuity, on-demand is ideal. Workloads can be deployed, used for as long as required, and then terminated. By leveraging some simple automation (such as AWS Lambda and CloudWatch alarms), you can schedule workloads to start and stop at the open and close of business (or at other meaningful intervals). This typically requires no modification to the application itself. For workloads that are not 24/7 steady state, this can provide greater cost effectiveness compared to RIs and more certainty and ease of use when compared to spot.

Amazon FSx for Windows File Server

Amazon FSx for Windows File Server provides a fully managed Windows filesystem that has full compatibility with SMB and DFS and full AD integration. Amazon FSx is an ideal choice for lift-and-shift architectures as it requires no modification to the application codebase in order to enable compatibility. Windows based applications can continue to leverage standard, Windows-native protocols to access storage with Amazon FSx. It enables users to avoid having to deploy and manage their own fileservers – eliminating the need for patching, automating, and managing EC2 instances. Moreover, it’s easy to scale and minimize costs, since Amazon FSx offers a pay-as-you-go pricing model.

Amazon EFS

Amazon Elastic File System (EFS) provides high performance, highly available multi-attach storage via NFS. EFS offers a drop-in replacement for existing NFS deployments. This is ideal for a range of Linux and Unix usecases as well as cross-platform solutions such as Enterprise Java applications. EFS eliminates the need to manage NFS infrastructure and simplifies storage concerns. Moreover, EFS provides high availability out of the box, which helps to reduce single points of failure and avoids the need to manually configure storage replication. Much like Amazon FSx, EFS enables customers to realize cost improvements by moving to a pay-as-you-go pricing model and requires a modification of the application.

Amazon MQ

Amazon MQ is a managed message broker service that provides compatibility with JMS, AMQP, MQTT, OpenWire, and STOMP. These are amongst the most extensively used middleware and messaging protocols and are a key foundation of enterprise applications. Rather than having to manually maintain a message broker, Amazon MQ provides a performant, highly available managed message broker service that is compatible with existing applications.

To use Amazon MQ without any modification, you can adapt applications that leverage a standard messaging protocol. In most cases, all you need to do is update the application’s MQ endpoint in its configuration. Subsequently, the Amazon MQ service handles the heavy lifting of operating a message broker, configuring HA, fault detection, failure recovery, software updates, and so forth. This offers a simple option for reducing management overhead and improving the reliability of a lift-and-shift architecture. What’s more is that applications can migrate to Amazon MQ without the need for any downtime, making this an easy and effective way to improve a lift-and-shift.

You can also use Amazon MQ to integrate legacy applications with modern serverless applications. Lambda functions can subscribe to MQ topics and trigger serverless workflows, enabling compatibility between legacy and new workloads.

Integrating Lift-and-Shift Workloads with Lambda via Amazon MQ

Figure 1: Integrating Lift-and-Shift Workloads with Lambda via Amazon MQ

Amazon Managed Streaming Kafka

Lift-and-shift workloads which include a streaming data component are often built around Apache Kafka. There is a certain amount of complexity involved in operating a Kafka cluster which incurs management and operational expense. Amazon Kinesis is a managed alternative to Apache Kafka, but it is not a drop-in replacement. At re:Invent 2018, we announced the launch of Amazon Managed Streaming Kafka (MSK) in public preview. MSK provides a managed Kafka deployment with pay-as-you-go pricing and an acts as a drop-in replacement in existing Kafka workloads. MSK can help reduce management costs and improve cost efficiency and is ideal for lift-and-shift workloads.

Leveraging S3 for Static Web Hosting

A significant portion of any web application is static content. This includes videos, image, text, and other content that changes seldom, if ever. In many lift-and-shifted applications, web servers are migrated to EC2 instances and host all content – static and dynamic. Hosting static content from an EC2 instance incurs a number of costs including the instance, EBS volumes, and likely, a load balancer. By moving static content to S3, you can significantly reduce the amount of compute required to host your web applications. In many cases, this change is non-disruptive and can be done at the DNS or CDN layer, requiring no change to your application.

Reducing Web Hosting Costs with S3 Static Web Hosting

Figure 2: Reducing Web Hosting Costs with S3 Static Web Hosting


There are numerous opportunities for reducing the cost of a lift-and-shift. Without any modification to the application, lift-and-shift workloads can benefit from cloud-native features. By using AWS services and features, you can significantly reduce the undifferentiated heavy lifting inherent in on-prem workloads and reduce resources and management overheads.

About the author

Dr. Jonathan Shapiro-Ward is an AWS Solutions Architect based in Toronto. He helps customers across Canada to transform their businesses and build industry leading cloud solutions. He has a background in distributed systems and big data and holds a PhD from the University of St Andrews.

AWS Storage Update: Amazon S3 & Amazon S3 Glacier Launch Announcements for Archival Workloads

Post Syndicated from AWS Admin original https://aws.amazon.com/blogs/architecture/amazon-s3-amazon-s3-glacier-launch-announcements-for-archival-workloads/

By Matt Sidley, Senior Product Manager for S3

Customers have built archival workloads for several years using a combination of S3 storage classes, including S3 Standard, S3 Standard-Infrequent Access, and S3 Glacier. For example, many media companies are using the S3 Glacier storage class to store their core media archives. Most of this data is rarely accessed, but when they need data back (for example, because of breaking news), they need it within minutes. These customers have found S3 Glacier to be a great fit because they can retrieve data in 1-5 minutes and save up to 82% on their storage costs. Other customers in the financial services industry use S3 Standard to store recently generated data, and lifecycle older data to S3 Glacier.

We launched Glacier in 2012 as a secure, durable, and low-cost service to archive data. Customers can use Glacier either as an S3 storage class or through its direct API. Using the S3 Glacier storage class is popular because many applications are built to use the S3 API and with a simple lifecycle policy, older data can be easily shifted to S3 Glacier. S3 Glacier continues to be the lowest-cost storage from any major cloud provider that durably stores data across three Availability Zones or more and allows customers to retrieve their data in minutes.

We’re constantly listening to customer feedback and looking for ways to make it easier to build applications in the cloud. Today we’re announcing six new features across Amazon S3 and S3 Glacier.

Amazon S3 Object Lock

S3 Object Lock is a new feature that prevents data from being deleted during a customer-defined retention period. You can use Object Lock with any S3 storage class, including S3 Glacier. There are many use cases for S3 Object Lock, including customers who want additional safeguards for data that must be retained, and for customers migrating from existing write-once-read-many (WORM) systems to AWS. You can also use S3 Lifecycle policies to transition data and S3 Object Lock will maintain WORM protection as your data is tiered.

S3 Object Lock can be configured in one of two modes: Governance or Compliance. When deployed in Governance mode, only AWS accounts with specific IAM permissions are able to remove the lock. If you require stronger immutability to comply with regulations, you can use Compliance mode. In Compliance mode, the lock cannot be removed by any user, including the root account. Take a look here:

S3 Object Lock is helpful in industries where long-term records retention is mandated by regulations or compliance rules. S3 Object Lock has been assessed for SEC Rule 17a-4(f), FINRA Rule 4511, and CFTC Regulation 1.31 by Cohasset Associates. Cohasset Associates is a management consulting firm specializing in records management and information governance. Read more and find a copy of the Cohasset Associates Assessment report in our documentation here.

New S3 Glacier Features

One of the things we hear from customers about using S3 Glacier is that they prefer to use the most common S3 APIs to operate directly on S3 Glacier objects. Today we’re announcing the availability of S3 PUT to Glacier, which enables you to use the standard S3 “PUT” API and select any storage class, including S3 Glacier, to store the data. Data can be stored directly in S3 Glacier, eliminating the need to upload to S3 Standard and immediately transition to S3 Glacier with a zero-day lifecycle policy. You can “PUT” to S3 Glacier like any other S3 storage class:

Many customers also want to keep a low-cost durable copy of their data in a second region for disaster recovery. We’re also announcing the launch of S3 Cross-Region Replication to S3 Glacier. You can now directly replicate data into the S3 Glacier storage class in a different AWS region.

Restoring Data from S3 Glacier

S3 Glacier provides three restore speeds for you to access your data: expedited (to retrieve data in 1-5 minutes), standard (3-5 hours), or bulk (5-12 hours). With S3 Restore Speed Upgrade, you can now issue a second restore request at a faster restore speed and get your data back sooner. This is useful if you originally requested standard or bulk speed, but later determine that you need a faster restore speed.

After a restore from S3 Glacier has been requested, you likely want to know when the restore completes. Now, with S3 Restore Notifications, you’ll receive a notification when the restoration has completed and the data is available. Many applications today are being built using AWS Lambda and event-driven actions, and you can now use the restore notification to automatically trigger the next step in your application as soon as S3 Glacier data is restored. For example, you can use notifications and Lambda functions to package and fulfill digital orders using archives restored from S3 Glacier.

Here, I’ve set up notifications to fire when my restores complete so I can use Lambda to kick off a piece of analysis I need to run:

You might need to restore many objects from S3 Glacier; for example, to pull all of your log files within a given time range. Using the new feature in Preview, you can provide a manifest of those log files to restore and, with one request, initiate a restore on millions or even trillions of objects just as easily as you can on just a few. S3 Batch Operations automatically manages retries, tracks progress, sends notifications, generates completion reports, and delivers events to AWS CloudTrail for all changes made and tasks executed.

To get started with the new features on Amazon S3, visit https://aws.amazon.com/s3/. We’re excited about these improvements and think they’ll make it even easier to build archival applications using Amazon S3 and S3 Glacier. And we’re not yet done. Stay tuned, as we have even more coming!

New – Automatic Cost Optimization for Amazon S3 via Intelligent Tiering

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-automatic-cost-optimization-for-amazon-s3-via-intelligent-tiering/

Amazon Simple Storage Service (S3) has been around for over 12.5 years, stores trillions of objects, and processes millions of requests for them every second. Our customers count on S3 to support their backup & recovery, data archiving, data lake, big data analytics, hybrid cloud storage, cloud-native storage, and disaster recovery needs. Starting from the initial one-size-fits-all Standard storage class, we have added additional classes in order to better serve our customers. Today, you can choose from four such classes, each designed for a particular use case. Here are the current options:

Standard – Designed for frequently accessed data.

Standard-IA – Designed for long-lived, infrequently accessed data.

One Zone-IA – Designed for long-lived, infrequently accessed, non-critical data.

Glacier – Designed for long-lived, infrequent accessed, archived critical data.

You can choose the applicable storage class when you upload your data to S3, and you can also use S3’s Lifecycle Policies to tell S3 to transition objects from Standard to Standard-IA, One Zone-IA, or Glacier based on their creation date. Note that the Reduced Redundancy storage class is still supported, but we recommend the use of One Zone-IA for new applications.

If you want to tier between different S3 storage classes today, Lifecycle Policies automates moving objects based on the creation date of the object in storage. If your data is stored in Standard storage today and you want to find out if some of that storage is suited to the S-IA storage class, you can use Storage Class Analytics in the S3 Console to identify what groups of objects to tier using Lifecycle. However, there are many situations where the access pattern of data is irregular or you simply don’t know because your data set is accessed by many applications across an organization. Or maybe you are spending so much focusing on your app, you don’t have time to use tools like Storage Class Analysis.

New Intelligent Tiering
In order to make it easier for you to take advantage of S3 without having to develop a deep understanding of your access patterns, we are launching a new storage class, S3 Intelligent-Tiering. This storage class incorporates two access tiers: frequent access and infrequent access. Both access tiers offer the same low latency as the Standard storage class. For a small monitoring and automation fee, S3 Intelligent-Tiering monitors access patterns and moves objects that have not been accessed for 30 consecutive days to the infrequent access tier. If the data is accessed later, it is automatically moved back to the frequent access tier. The bottom line: You save money even under changing access patterns, with no performance impact, no operational overhead, and no retrieval fees.

You can specify the use of the Intelligent-Tiering storage class when you upload new objects to S3. You can also use a Lifecycle Policy to effect the transition after a specified time period. There are no retrieval fees and you can use this new storage class in conjunction with all other features of S3 including cross-region replication, encryption, object tagging, and inventory.

If you are highly confident that your data is accessed infrequently, the Standard-IA storage class is still a better choice with respect to cost savings. However, if you don’t your access patterns or if they are subject to change, Intelligent-Tiering is for you!

Intelligent Tiering in Action
I simply choose the new storage class when I uploads objects to S3:

I can see the storage class in the S3 Console, as usual:

And I can create Lifecycle Rules that make use of Intelligent-Tiering:

And that’s just about it. Here are a few things that you need to know:

Object Size – You can use Intelligent-Tiering for objects of any size, but objects smaller than 128 KB will never be transitioned to the infrequent access tier and will be billed at the usual rate for the frequent access tier.

Object Life – This is not a good fit for objects that live for less than 30 days; all objects will be billed for a minimum of 30 days.

Durability & Availability – The Intelligent-Tiering storage class is designed for 99.9% availability and 99.999999999% durability, with an SLA that provides for 99.0% availability.

Pricing – Just like the other storage classes, you pay for monthly storage, requests, and data transfer. Storage for objects in the frequent access tier is billed at the same rate as S3 Standard; storage for objects in the infrequent access tier is billed at the same rate as S3 Standard-Infrequent Access. When you use Intelligent-Tiering, you pay a small monthly per-object fee for monitoring and automation; this means that the storage class becomes even more economical as object sizes grow. As I noted earlier, S3 Intelligent-Tiering will automatically move data back to the frequent access tier based on access patterns but there is no retrieval charge.

Query in Place – Queries made using S3 Select do not alter the storage tier. Amazon Athena and Amazon Redshift Spectrum access the data using the regular GET operation and will trigger a transition.

API and CLI Access – You can use the storage class INTELLIGENT_TIERING from the S3 CLI and S3 APIs.

Available Now
This new storage class is available now and you can start using it today in all AWS Regions.


PS – Remember the trillions of objects and millions of requests that I just told you about? We fed them into an Amazon Machine Learning model and used them to predict future access patterns for each object. The results were then used to inform storage of your S3 objects in the most cost-effective way possible. This is a really interesting benefit that is made possible by the incredible scale of S3 and the diversity of use cases that it supports. There’s nothing else like it, as far as I know!

Amazon S3 Block Public Access – Another Layer of Protection for Your Accounts and Buckets

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/amazon-s3-block-public-access-another-layer-of-protection-for-your-accounts-and-buckets/

Newly created Amazon S3 buckets and objects are (and always have been) private and protected by default, with the option to use Access Control Lists (ACLs) and bucket policies to grant access to other AWS accounts or to public (anonymous) requests. The ACLs and policies give you lots of flexibility. You can grant permissions to multiple accounts, restrict access to specific IP addresses, require the use of Multi-Factor Authentication (MFA), allow other accounts to upload new objects to a bucket, and much more.

We want to make sure that you use public buckets and objects as needed, while giving you tools to make sure that you don’t make them publicly accessible due to a simple mistake or misunderstanding. For example, last year we provided you with a Public indicator to let you know at a glance which buckets are publicly accessible:

The bucket view is sorted so that public buckets appear at the top of the page by default.

We also made Trusted Advisor‘s bucket permission check free:

New Amazon S3 Block Public Access
Today we are making it easier for you to protect your buckets and objects with the introduction of Amazon S3 Block Public Access. This is a new level of protection that works at the account level and also on individual buckets, including those that you create in the future. You have the ability to block existing public access (whether it was specified by an ACL or a policy) and to ensure that public access is not granted to newly created items. If an AWS account is used to host a data lake or another business application, blocking public access will serve as an account-level guard against accidental public exposure. Our goal is to make clear that public access is to be used for web hosting!

This feature is designed to be easy to use, and can be accessed from the S3 Console, the CLI, the S3 APIs, and from within CloudFormation templates. Let’s start with the S3 Console and a bucket that is public:

I can exercise control at the account level by clicking Public access settings for this account:

I have two options for managing public ACLs and two for managing public bucket policies. Let’s take a closer look at each one:

Block new public ACLs and uploading public objects – This option disallows the use of new public bucket or object ACLs, and is used to ensure that future PUT requests that include them will fail. It does not affect existing buckets or objects. Use this setting to protect against future attempts to use ACLs to make buckets or objects public. If an application tries to upload an object with a public ACL or if an administrator tries to apply a public access setting to the bucket, this setting will block the public access setting for the bucket or the object.

Remove public access granted through public ACLs – This option tells S3 not to evaluate any public ACL when authorizing a request, ensuring that no bucket or object can be made public by using ACLs. This setting overrides any current or future public access settings for current and future objects in the bucket. If an existing application is currently uploading objects with public ACLs to the bucket, this setting will override the setting on the object.

Block new public bucket policies – This option disallows the use of new public bucket policies, and is used to ensure that future PUT requests that include them will fail. Again, this does not affect existing buckets or objects. This setting ensures that a bucket policy cannot be updated to grant public access.

Block public and cross-account access to buckets that have public policies – If this option is set, access to buckets that are publicly accessible will be limited to the bucket owner and to AWS services. This option can be used to protect buckets that have public policies while you work to remove the policies; it serves to protect information that is logged to a bucket by an AWS service from becoming publicly accessible.

To make changes, I click Edit, check the desired public access settings, and click Save:

I recommend that you use these settings for any account that is used for internal AWS applications!

Then I confirm my intent:

After I do this, I need to test my applications and scripts to ensure that everything still works as expected!

When I make these settings at the account level, they apply to my current buckets, and also to those that I create in the future. However, I can also set these options on individual buckets if I want to take a more fine-grained approach to access control. If I set some options at the account level and others on a bucket, the protections are additive. I select a bucket and click Edit public access settings:

Then I select the desired options:

Since I have already denied all public access at the account level, this is actually redundant, but I want you to know that you have control at the bucket level. One thing to note: I cannot override an account-level setting by changing the options that I set at the bucket level.

I can see the public access status of all of my buckets at a glance:

Programmatic Access
I can also access this feature by making calls to the S3 API. Here are the functions:

GetPublicAccessBlock – Retrieve the public access block options for an account or a bucket.

PutPublicAccessBlock – Set the public access block options for an account or a bucket.

DeletePublicAccessBlock – Remove the public access block options from an account or a bucket.

GetBucketPolicyStatus – See if the bucket access policy is public or not.

I can also set the options for a bucket when I create it via a CloudFormation template:


Things to Know
Here are a couple of things to keep in mind when you are making use of S3 Block Public Access:

New Buckets – Going forward, buckets that you create using the S3 Console will have all four of the settings enabled, as recommended for any application other than web hosting. You will need to disable one or more of the settings in order to make the bucket public.

Automated Reasoning – The determination of whether a given policy or ACL is considered public is made using our Zelkova Automated Reasoning system (you can read How AWS Uses Automated Reasoning to Help You Achieve Security at Scale to learn more).

Organizations – If you are using AWS Organizations, you can use a Service Control Policy (SCP) to restrict the settings that are available to the AWS account within the organization. For example, you can set the desired public access settings for any desired accounts and then use an SCP to ensure that the settings cannot be changed by the account owners.

Charges – There is no charge for the use of this feature; you pay the usual prices for all requests that you make to the S3 API.

Available Now
Amazon S3 Block Public Access is available now in all commercial AWS regions and you can (and should) start using it today!


Migrate to Apache HBase on Amazon S3 on Amazon EMR: Guidelines and Best Practices

Post Syndicated from Francisco Oliveira original https://aws.amazon.com/blogs/big-data/migrate-to-apache-hbase-on-amazon-s3-on-amazon-emr-guidelines-and-best-practices/

This blog post provides guidance and best practices about how to migrate from Apache HBase on HDFS to Apache HBase on Amazon S3 on Amazon EMR.

Apache HBase on Amazon S3 on Amazon EMR

Amazon EMR version 5.2.0 or later, lets you run Apache HBase on Amazon S3. By using Amazon S3 as a data store for Apache HBase, you can separate your cluster’s storage and compute nodes. This saves costs because you’re sizing your cluster for your compute requirements. You’re not paying to store your entire dataset with 3x replication in the on-cluster HDFS.

Many customers have taken advantage of the benefits of running Apache HBase on Amazon S3 for data storage. These benefits include lower costs, data durability, and more efficient scalability. Customers, such as the Financial Industry Regulatory Agency (FINRA), have lowered their costs by 60% by moving to an Apache HBase on Amazon S3 architecture. They have also experienced operational benefits that come with decoupling storage from compute and using Amazon S3 as the storage layer.

Whitepaper on Migrating to Apache HBase on Amazon S3 on Amazon EMR

This whitepaper walks you through the stages of a migration. It also helps you determine when to choose Apache HBase on Amazon S3 on Amazon EMR, plan for platform security, tune Apache HBase and EMRFS to support your application SLA, identify options to migrate and restore your data, and manage your cluster in production.

For more information, see Migrating to Apache HBase on Amazon S3 on Amazon EMR

Additional Reading

If you found this post useful, be sure to check out Setting up Read Replica Clusters with HBase on Amazon S3, and Tips for Migrating to Apache HBase on Amazon S3 from HDFS.


About the Author

Francisco Oliveira is a Senior Big Data Engineer with AWS Professional Services. He focuses on building big data solutions with open source technology and AWS. In his free time, he likes to try new sports, travel and explore national parks.


Connect to Amazon Athena with federated identities using temporary credentials

Post Syndicated from Nitin Wagh original https://aws.amazon.com/blogs/big-data/connect-to-amazon-athena-with-federated-identities-using-temporary-credentials/

Many organizations have standardized on centralized user management, most commonly Microsoft Active Directory or LDAP.  Access to AWS resources is no exception.  Amazon Athena is a serverless query engine for data on Amazon S3 that is popular for quick and cost-effective queries of data in a data lake.  To allow users or applications to access Athena, organizations are required to use an AWS access key and an access secret key from which appropriate policies are enforced. To maintain a consistent authorization model across, organizations must enable authentication and authorization for Athena by using federated users.

This blog post shows the process of enabling federated user access with the AWS Security Token Service (AWS STS). This approach lets you create temporary security credentials and provides them to trusted users for running queries in Athena.

Temporary security credentials in AWS STS

Temporary security credentials ensures that access keys to protected AWS resources are properly rotated. Therefore, potential security leaks can be caught and remedied. AWS STS generates these per-use temporary access keys.

Temporary security credentials work similar to the long-term access key credentials that your IAM users can use. However, temporary security credentials have the following differences:

  • They are intended for short-term use only. You can configure these credentials to last from a few minutes to several hours, with a maximum of 12 hours. After they expire, AWS no longer recognizes them, or allows any kind of access from API requests made with them.
  • They are not stored with the user. They are generated dynamically and provided to the user when requested. When or before they expire, the user can request new credentials, if they still have permissions to do so.

Common scenarios for federated access

The following common scenarios describe when your organization may require federated access to Athena:

  1. Running queries in Athena while using federation. Your group is required to run queries in Athena while federating into AWS using SAML with permissions stored in Active Directory.
  2. Enabling access across accounts to Athena for users in your organization. Users in your organization with access to one AWS account needs to run Athena queries in a different account.
  3. Enabling access to Athena for a data application. A data application deployed on an Amazon EC2 instance needs to run Athena queries via JDBC.

Athena as the query engine for your data lake on S3

Athena is an interactive query service that lets you analyze data directly in Amazon S3 by using standard SQL. You can access Athena by using JDBC and ODBC drivers, AWS SDK, or the Athena console.

Athena enables schema-on-read analytics to gain insights from structured or semi-structured datasets found in the data lake. A data lake is ubiquitous, scalable, and reliable storage that lets you consume all of your structured and unstructured data. Customers increasingly prefer a serverless approach to querying data in their data lake.  Some benefits of using an Amazon S3 for a data lake include:

  • The ability to efficiently consume and store any data, at any scale, at low cost.
  • A single destination for searching and finding the relevant data.
  • Analysis of the data in S3 through a unified set of tools.

Solution overview

The following sections describe how to enable the common scenarios introduced previously in this post. They show how to download, install, and configure SQL Workbench to run queries in Athena. Next, they show how to use AWS STS with a custom JDBC credentials provider to obtain temporary credentials for an authorized user.  These credentials are passed to Athena’s JDBC driver, which enables SQL Workbench to run authorized queries.

As a reminder, the scenarios are:

  1. Running queries in Athena while using federation.
  2. Enabling access across accounts to Athena for users in your organization.
  3. Enabling access to Athena for a data application.


This walkthrough uses a sample table created to query Amazon CloudFront logs.  It demonstrates that proper access has been granted to Athena after each scenario. This walkthrough also assumes that you have a table for testing.  For information about creating a table, see Getting Started with Amazon Athena.

Prerequisites for scenarios 1 and 2

  1. Download and install SQL Workbench.
  2. Download the Athena custom credentials provider .jar file to the same computer where SQL Workbench is installed.

Note: You can also compile the .jar file by using this Athena JDBC source code.

  1. Download the Amazon Athena JDBC driver to the same computer where SQL Workbench is installed.
  2. In SQL Workbench, open File, Connect window, Manage Drivers. Choose the Athena driver and add two libraries, Athena JDBC driver and the custom credentials provider by specifying the location where you downloaded them.

Scenario 1: SAML Federation with Active Directory

In this scenario, we use a SAML command line tool. It performs a SAML handshake with an identity provider, and then retrieves temporary security credentials from AWS STS. We then pass the obtained security credentials through a custom credentials provider to run queries in Athena.

Before You Begin

  1. Make sure you have followed the prerequisites for scenarios 1 and 2.
  2. Set up a SAML integration with ADFS. For information, see Enabling Federation to AWS Using Windows Active Directory, ADFS, and SAML 2.0.
  3. Add the following IAM policies to the ADFS-Production role:

  1. Set up federated AWS CLI For more information, see How to Implement a General Solution for Federated API/CLI Access Using SAML 2.0

Executing queries in Athena with a SAML federated user identity

  1. Run the federated AWS CLI script configured as part of the prerequisites. Log in using a valid user name and password, then choose the ADFS-Production role. As a result, your temporary access_key, secret_access_key and session_token are generated. They are stored in a “credentials” file under the [saml] profile that looks similar to the following:
output = json
region = us-east-1
aws_access_key_id = XXXXXXXXXXXXXXXXX
aws_secret_access_key = XXXXXXXXXXXXXXXXXXXXXX
aws_session_token = XXXXXXXXXXXXXXXXX
  1. To enable running queries in Athena through SQL Workbench, configure an Athena connection as follows:

  1. Choose Extended Properties, and enter the properties as follows:
"AWSCredentialsProviderArguments"="<access_key_id>, <secret_access_key>, <session token>"
"S3OutputLocation"="s3://<bucket where query results are stored>"
"LogPath"= "<local path on laptop, or pc where logs are stored>"
"LogLevel"= "<Log Level from 0 to 6>"

  1. Choose Test to verify that you can successfully connect to Athena.
  2. Run a query in SQL Workbench to verify that credentials are correctly applied.

Scenario 2: Cross-account access

In this scenario, you allow users in one AWS account, referred to as Account A to run Athena queries in a different account, called Account B.

To enable this configuration, use the CustomIAMRoleAssumptionCredentialsProvider custom credentials provider to retrieve the necessary credentials. By doing so, you can run Athena queries by using credentials from Account A in Account B. This is made possible by the cross-account roles, as shown in the following diagram:

Before You Begin

1. If you haven’t already done so, follow these prerequisites for scenarios 1 and 2.

  1. Use AWS CLI to define a role named AthenaCrossAccountRole in Account B, as follows:
aws iam create-role --role-name AthenaCrossAccountRole --assume-role-policy-document file://cross-account-trust.json

Content of cross-account-trust.json file

  "Statement": [
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<ACCOUNT-A-ID>:root"
      "Action": "sts:AssumeRole"
  1. Attach the IAM policies, AmazonAthenaFullAccess and AmazonS3FullAccess, to the AthenaCrossAccountRole IAM role, as follows:
aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/AmazonAthenaFullAccess --role-name AthenaCrossAccountRole

aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess --role-name --role-name AthenaCrossAccountRole

Executing Queries in Athena with a SAML Federated User Identity

  1. Set up a new Athena database connection in SQLWorkbench, as shown in the following example:

  1. Choose Extended Properties, and enter the properties as shown in the following example. This example configures the AWS credentials of an IAM user from Account A (AccessKey, SecretKey, and Cross Account Role ARN).
"AwsCredentialsProviderArguments"="<aws_key_id>,<secret_key_id>,<cross Account Role ARN>"
"S3OutputLocation"="s3://<bucket where Athena results are stored>"
"LogPath"= "<local directory path on laptop, or pc where logs are stored>"
“LogLevel” = “Log Level from 1 to 6”

  1. Choose Test to verify that you can successfully connect to Athena.
  2. Run a query in SQL Workbench to verify that credentials are correctly applied.

Scenario 3: Using EC2 Instance Profile Role

This scenario uses the Amazon EC2 instance profile role to retrieve temporary credentials.  First, create the EC2AthenaInstanceProfileRole IAM role via AWS CLI, as shown in the following example:

aws iam create-role --role-name EC2AthenaInstanceProfileRole --assume-role-policy-document file://policy.json

Content of policy.json file

  "Statement": [
        "Action": "sts:AssumeRole",
        "Effect": "Allow",
        "Principal": {
           "Service": "ec2.amazonaws.com"

Attach the IAM policies, AmazonAthenaFullAccess and AmazonS3FullAccess, to the EC2AthenaInstanceProfileRole IAM role, as follows:

aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/AmazonAthenaFullAccess --role-name EC2AthenaInstanceProfileRole

aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess --role-name --role-name EC2AthenaInstanceProfileRole

Before You Begin

  1. Launch the Amazon EC2 instance for Windows, then attach the InstanceProfile role created in the previous step:
aws ec2 run-instances --image-id <use Amazon EC2 on Windows 2012 AMI Id for your region> --iam-instance-profile 'Arn=arn:aws:iam::<your_aws_account_id>:instance-profile/EC2AthenaInstanceProfileRole' --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=AthenaTest}]' --count 1 --instance-type t2.micro --key-name <your key> --security-group-ids <your_windows_security_group_id>

2.    Log in to your Windows instance using RDP.

3.    Install Java 8 and download and install SQL Workbench.

4.    Download the Athena JDBC driver.

Executing queries in Athena using an EC2 instance profile role

1.      In SQL Workbench, open File, Connect window, Manage Drivers. Specify a path to the Athena JDBC driver that was previously downloaded, as shown in the following example.

  1. The instance credentials provider, InstanceProfileCredentialsProvider, is included with the Amazon Athena JDBC driver. In SQL Workbench, set up an Athena connection as follows:

  1. Choose Extended Properties, and enter the properties as follows:
"S3OutputLocation"="s3://<bucket where Athena results are stored>"
"LogPath"= "<local directory path on laptop, or pc where logs are stored>"
“LogLevel” = “Log Level from 0 to 6”

  1. Choose Test to verify that you can successfully connect to Athena.
  2. Run a query in SQL Workbench to verify that credentials are correctly applied.


This post walked through three scenarios to enable trusted users to access Athena using temporary security credentials. First, we used SAML federation where user credentials were stored in Active Directory. Second, we used a custom credentials provider library to enable cross-account access. And third, we used an EC2 Instance Profile role to provide temporary credentials for users in our organization to access Athena.

These approaches ensure that access keys protecting AWS resources are not directly hardcoded in applications and can be easily revoked as needed. To achieve this, we used AWS STS to generate temporary per-use credentials.

The scenarios demonstrated in this post used SQL Workbench.  However, they apply for all other uses of the JDBC driver with Amazon Athena. Additionally, when using AWS SDK with Athena, similar approaches also apply.

Happy querying!


Additional Reading

If you found this post useful, be sure to check out Top 10 Performance Tuning Tips for Amazon Athena, and Analyze and visualize your VPC network traffic using Amazon Kinesis and Amazon Athena.


About the Author

Nitin Wagh is a Solutions Architect with Amazon Web Services specializing in Big Data Analytics. He helps AWS customers develop distributed big data solutions using AWS technologies.





How to build a front-line concussion monitoring system using AWS IoT and serverless data lakes – Part 2

Post Syndicated from Saurabh Shrivastava original https://aws.amazon.com/blogs/big-data/how-to-build-a-front-line-concussion-monitoring-system-using-aws-iot-and-serverless-data-lakes-part-2/

In part 1 of this series, we demonstrated how to build a data pipeline in support of a data lake. We used key AWS services such as Amazon Kinesis Data Streams, Kinesis Data Analytics, Kinesis Data Firehose, and AWS Lambda. In part 2, we discuss how to process and visualize the data by creating a serverless data lake that uses key analytics to create actionable data.

Create a serverless data lake and explore data using AWS Glue, Amazon Athena, and Amazon QuickSight

As we discussed in part 1, you can store heart rate data in an Amazon S3 bucket using Kinesis Data Streams. However, storing data in a repository is not enough. You also need to be able to catalog and store the associated metadata related to your repository so that you can extract the meaningful pieces for analytics.

For a serverless data lake, you can use AWS Glue, which is a fully managed data catalog and ETL (extract, transform, and load) service. AWS Glue simplifies and automates the difficult and time-consuming tasks of data discovery, conversion, and job scheduling. As you get your AWS Glue Data Catalog data partitioned and compressed for optimal performance, you can use Amazon Athena for the direct query to S3 data. You can then visualize the data using Amazon QuickSight.

The following diagram depicts the data lake that is created in this demonstration:

Amazon S3 now has the raw data stored from the Kinesis process. The first task is to prepare the Data Catalog and identify what data attributes are available to query and analyze. To do this task, you need to create a database in AWS Glue that will hold the table created by the AWS Glue crawler.

An AWS Glue crawler scans through the raw data available in an S3 bucket and creates a data table with a Data Catalog. You can add a scheduler to the crawler to run periodically and scan new data as required. For specific steps to create a database and crawler in AWS Glue, see the blog post Build a Data Lake Foundation with AWS Glue and Amazon S3.

The following figure shows the summary screen for a crawler configuration in AWS Glue:

After configuring the crawler, choose Finish, and then choose Crawler in the navigation bar. Select the crawler that you created, and choose Run crawler.

The crawler process can take 20–60 seconds to initiate. It depends on the Data Catalog, and it creates a table in your database as defined during the crawler configuration.

You can choose the table name and explore the Data Catalog and table:

In the demonstration table details, our data has three attribute time stamps as value_time, the person’s ID as id, and the heart rate as colvalue. These attributes are identified and listed by the AWS Glue crawler. You can see other information such as the data format (text) and the record count (approx. 15,000 with each record size of 61 bytes).

You can use Athena to query the raw data. To access Athena directly from the AWS Glue console, choose the table, and then choose View data on the Actions menu, as shown following:

As noted, the data is currently in a JSON format and we haven’t partitioned it. This means that Athena continues to scan more data, which increases the query cost. The best practice is to always partition data and to convert the data into a columnar format like Apache Parquet or Apache ORC. This reduces the amount of data scans while running a query. Having fewer data scans means better query performance at a lower cost.

To accomplish this, AWS Glue generates an ETL script for you. You can schedule it to run periodically for your data processing, which removes the necessity for complex code writing. AWS Glue is a managed service that runs on top of a warm Apache Spark cluster that is managed by AWS. You can run your own script in AWS Glue or modify a script provided by AWS Glue that meets your requirements. For examples of how to build a custom script for your solution, see Providing Your Own Custom Scripts in the AWS Glue Developer Guide.

For detailed steps to create a job, see the blog post Build a Data Lake Foundation with AWS Glue and Amazon S3. The following figure shows the final AWS Glue job configuration summary for this demonstration:

In this example configuration, we enabled the job bookmark, which helps AWS Glue maintain state information and prevents the reprocessing of old data. You only want to process new data when rerunning on a scheduled interval.

When you choose Finish, AWS Glue generates a Python script. This script processes your data and stores it in a columnar format in the destination S3 bucket specified in the job configuration.

If you choose Run Job, it takes time to complete depending on the amount of data and data processing units (DPUs) configured. By default, a job is configured with 10 DPUs, which can be increased. A single DPU provides processing capacity that consists of 4 vCPUs of compute and 16 GB of memory.

After the job is complete, inspect your destination S3 bucket, and you will find that your data is now in columnar Parquet format.

Partitioning has emerged as an important technique for organizing datasets so that they can be queried efficiently by a variety of big data systems. Data is organized in a hierarchical directory structure based on the distinct values of one or more columns. For information about efficiently processing partitioned datasets using AWS Glue, see the blog post Work with partitioned data in AWS Glue.

You can create triggers for your job that run the job periodically to process new data as it is transmitted to your S3 bucket. For detailed steps on how to configure a job trigger, see Triggering Jobs in AWS Glue.

The next step is to create a crawler for the Parquet data so that a table can be created. The following image shows the configuration for our Parquet crawler:

Choose Finish, and execute the crawler.

Explore your database, and you will notice that one more table was created in the Parquet format.

You can use this new table for direct queries to reduce costs and to increase the query performance of this demonstration.

Because AWS Glue is integrated with Athena, you will find in the Athena console an AWS Glue catalog already available with the table catalog. Fetch 10 rows from Athena in a new Parquet table like you did for the JSON data table in the previous steps.

As the following image shows, we fetched the first 10 rows of heartbeat data from a Parquet format table. This same Athena query scanned only 4.99 KB of data compared to 205 KB of data that was scanned in a raw format. Also, there was a significant improvement in query performance in terms of run time.

Visualize data in Amazon QuickSight

Amazon QuickSight is a data visualization service that you can use to analyze data that has been combined. For more detailed instructions, see the Amazon QuickSight User Guide.

The first step in Amazon QuickSight is to create a new Amazon Athena data source. Choose the heartbeat database created in AWS Glue, and then choose the table that was created by the AWS Glue crawler.

Choose Import to SPICE for quicker analytics. This option creates a data cache and improves graph loading. All non-database datasets must use SPICE. To learn more about SPICE, see Managing SPICE Capacity.

Choose Visualize, and wait for SPICE to import the data to the cache. You can also schedule a periodic refresh so that new data is loaded to SPICE as the data is pipelined to the S3 bucket.

When the SPICE import is complete, you can create a visual dashboard easily. The following figure shows graphs displaying the occurrence of heart rate records per device.  The first graph is a horizontally stacked bar chart, which shows the percentage of heart rate occurrence per device. In the second graph, you can visualize the heart rate count group to the heart rate device.


Processing streaming data at scale is relevant in every industry. Whether you process data from wearables to tackle human health issues or address predictive maintenance in manufacturing centers, AWS can help you simplify your data ingestion and analysis while keeping your overall IT expenditure manageable.

In this two-part series, you learned how to ingest streaming data from a heart rate sensor and visualize it in such a way to create actionable insights. The current state of the art available in the big data and machine learning space makes it possible to ingest terabytes and petabytes of data and extract useful and actionable information from that process.

Additional Reading

If you found this post useful, be sure to check out Work with partitioned data in AWS Glue, and 10 visualizations to try in Amazon QuickSight with sample data.


About the Authors

Saurabh Shrivastava is a partner solutions architect and big data specialist working with global systems integrators. He works with AWS partners and customers to provide them architectural guidance for building scalable architecture in hybrid and AWS environments.




Abhinav Krishna Vadlapatla is a Solutions Architect with Amazon Web Services. He supports startups and small businesses with their cloud adoption to build scalable and secure solutions using AWS. During his free time, he likes to cook and travel.




John Cupit is a partner solutions architect for AWS’ Global Telecom Alliance Team. His passion is leveraging the cloud to transform the carrier industry. He has a son and daughter who have both graduated from college. His daughter is gainfully employed, while his son is in his first year of law school at Tulane University. As such, he has no spare money and no spare time to work a second job.



David Cowden is partner solutions architect and IoT specialist working with AWS emerging partners. He works with customers to provide them architectural guidance for building scalable architecture in IoT space.




Josh Ragsdale is an enterprise solutions architect at AWS. His focus is on adapting to a cloud operating model at very large scale. He enjoys cycling and spending time with his family outdoors.




Pierre-Yves Aquilanti, Ph.D., is a senior specialized HPC solutions architect at AWS. He spent several years in the oil & gas industry to optimize R&D applications for large scale HPC systems and enable the potential of machine learning for the upstream. He and his family crave to live in Singapore again for the human, cultural experience and eat fresh durians.



Manuel Puron is an enterprise solutions architect at AWS. He has been working in cloud security and IT service management for over 10 years. He is focused on the telecommunications industry. He enjoys video games and traveling to new destinations to discover new cultures.


How to build a front-line concussion monitoring system using AWS IoT and serverless data lakes – Part 1

Post Syndicated from Saurabh Shrivastava original https://aws.amazon.com/blogs/big-data/how-to-build-a-front-line-concussion-monitoring-system-using-aws-iot-and-serverless-data-lakes-part-1/

Sports-related minor traumatic brain injuries (mTBI) continue to incite concern among different groups in the medical, sports, and parenting community. At the recreational level, approximately 1.6–3.8 million related mTBI incidents occur in the United States every year, and in most cases, are not treated at the hospital. (See “The epidemiology and impact of traumatic brain injury: a brief overview” in Additional resources.) The estimated medical and indirect costs of minor traumatic brain injury are reaching $60 billion annually.

Although emergency facilities in North America collect data on admitted traumatic brain injuries (TBI) cases, there isn’t meaningful data on the number of unreported mTBIs among athletes. Recent studies indicate a significant rate of under-reporting of sports-related mTBI due to many factors. These factors include the simple inability of team staff to either recognize the signs and symptoms or to actually witness the impact. (See “A prospective study of physician-observed concussions during junior ice hockey: implications for incidence rates” in Additional resources.)

The majority of players involved in hockey and football are not college or professional athletes. There are over 3 million youth hockey players and approximately 5 million registered participants in football. (See “Head Impact Exposure in Youth Football” in Additional resources.) These recreational athletes don’t have basic access to medical staff trained in concussion recognition and sideline injury assessment. A user-friendly measurement and a smartphone-based assessment tool would facilitate the process between identifying potential head injuries, assessment, and return to play (RTP) criteria.

Recently, the use of instrumented sports helmets, including the Head Impact Telemetry System (HITS), has allowed for detailed recording of impacts to the head in many research trials. This practice has led to recommendations to alter contact in practices and certain helmet design parameters. (See “Head impact severity measures for evaluating mild traumatic brain injury risk exposure” in Additional resources.) However, due to the higher costs of the HITS system and complexity of the equipment, it is not a practical impact alert device for the general recreational population.

A simple, practical, and affordable system for measuring head trauma within the sports environment, subject to the absence of trained medical personnel, is required.

Given the proliferation of smartphones, we felt that this was a practical device to investigate to provide this type of monitoring.  All smartphone devices have an embedded Bluetooth communication system to receive and transmit data at various ranges.  For the purposes of this demonstration, we chose a class 1 Bluetooth device as the hardware communication method. We chose it because of its simplicity, widely accepted standard, and compatibility to interface with existing smartphones and IoT devices.

Remote monitoring typically involves collecting information from devices (for example, wearables) at the edge, integrating that information into a data lake, and generating inferences that can then be served back to the relevant stakeholders. Additionally, in some cases, compute and inference must also be done at the edge to shorten the feedback loop between data collection and response.

This use case can be extended to many other use cases in myriad verticals. In this two-part series, we show you how to build a data pipeline in support of a data lake. We use key AWS services such as Amazon Kinesis Data Streams, Kinesis Data Analytics, Kinesis Data Firehose, and AWS Lambda. In part 2, we focus on generating simple inferences from that data that can support RTP parameters.

Architectural overview

Here is the AWS architecture that we cover in this two-part series:

Note: For the purposes of our demonstration, we chose to use heart rate monitoring sensors rather than helmet sensors because they are significantly easier to acquire. Both types of sensors are very similar in how they transmit data. They are also very similar in terms of how they are integrated into a data lake solution.

The resulting demonstration transfers the heartbeat data using the following components:

  • AWS Greengrass set up with a Raspberry Pi 3 to stream heart rate data into the cloud.
  • Data is ingested via Amazon Kinesis Data Streams, and raw data is stored in an Amazon S3 bucket using Kinesis Data Firehose. Find more details about writing to Kinesis Data Firehose using Kinesis Data Streams.
  • Kinesis Data Analytics averages out the heartbeat-per-minute data during stream data ingestion and passes the average to an AWS Lambda
  • AWS Lambda enriches the heartbeat data by comparing the real-time data with baseline information stored in Amazon DynamoDB.
  • AWS Lambda sends SMS/email alerts via an Amazon SNS topic if the heartbeat rate is greater than 120 BPM, for example.
  • AWS Glue runs an extract, transform, and load (ETL) job. This job transforms the data store in a JSON format to a compressed Apache Parquet columnar format and applies that transformed partition for faster query processing. AWS Glue is a fully managed ETL service for crawling data stored in an Amazon S3 bucket and building a metadata catalog.
  • Amazon Athena is used for ad hoc query analysis on the data that is processed by AWS Glue. This data is also available for machine learning processing using predictive analysis to reduce heart disease risk.
  • Amazon QuickSight is a fully managed visualization tool. It uses Amazon Athena as a data source and depicts visual line and pie charts to show the heart rate data in a visual dashboard.

All data pipelines are serverless and are refreshed periodically to provide up-to-date data.

You can use Kinesis Data Firehose to transform the data in the pipeline to a compressed Parquet format without needing to use AWS Glue. For the purposes of this post, we are using AWS Glue to highlight its capabilities, including a centralized AWS Glue Data Catalog. This Data Catalog can be used by Athena for ad hoc queries and by Apache Spark EMR to run complex machine learning processes. AWS Glue also lets you edit generated ETL scripts and supports “bring your own ETL” to process data for more complex use cases.

Configuring key processes to support the pipeline

The following sections describe how to set up and configure the devices and services used in the demonstration to build a data pipeline in support of a data lake.

Remote sensors and IoT devices

You can use commercially available heart rate monitors to collect electrocardiography (ECG) information such as heart rate. The monitor is strapped around the chest area with the sensor placed over the sternum for better accuracy. The monitor measures the heart rate and sends the data over Bluetooth Low Energy (BLE) to a Raspberry Pi 3. The following figure depicts the device-side architecture for our demonstration.

The Raspberry Pi 3 is host to both the IoT device and the AWS Greengrass core. The IoT device is responsible for connecting to the heart rate monitor over BLE and collecting the heart rate data. The collected data is then sent locally to the AWS Greengrass core, where it can be processed and routed to the cloud through a secure connection. The AWS Greengrass core serves as the “edge” gateway for the heart rate monitor.

Set up AWS Greengrass core software on Raspberry Pi 3

To prepare your Raspberry Pi for running AWS Greengrass software, follow the instructions in Environment Setup for Greengrass in the AWS Greengrass Developer Guide.

After setting up your Raspberry Pi, you are ready to install AWS Greengrass and create your first Greengrass group. Create a Greengrass group by following the steps in Configure AWS Greengrass on AWS IoT. Then install the appropriate certificates to the Raspberry Pi by following the steps to start AWS Greengrass on a core device.

The preceding steps deploy a Greengrass group that consists of three discrete configurable items: a device, a subscription list, and the connectivity information.

The core device is a set of code that is responsible for collecting the heart rate information from the sensor and sending it to the AWS Greengrass core. This device is using the AWS IoT Device SDK for Python including the Greengrass Discovery API.

Use the following AWS CLI command to create a Greengrass group:

aws greengrass create-group --name heartRateGroup

To complete the setup, follow the steps in Create AWS IoT Devices in an AWS Greengrass Group.

After you complete the setup, the heart rate data is routed from the device to the AWS IoT Core service using AWS Greengrass. As such, you need to add a single subscription in the Greengrass group to facilitate this message route:

Here, your device is named Heartrate_Sensor, and the target is the IoT Cloud on the topic iot/heartrate. That means that when your device publishes to the iot/heartrate topic, AWS Greengrass also sends this message to the AWS IoT Core service on the same topic. Then you can use the breadth of AWS services to process the data.

The connectivity information is configured to use the local host because the IoT device resides on the Raspberry Pi 3 along with the AWS Greengrass core software. The IoT device uses the Discovery API, which is responsible for retrieving the connectivity information of the AWS Greengrass core that the IoT device is associated with.

The IoT device then uses the endpoint and port information to open a secure TLS connection to AWS Greengrass core, where the heart rate data is sent. The AWS Greengrass core connectivity information should be depicted as follows:

The power of AWS Greengrass core is that you can deploy AWS Lambda functions and new subscriptions to process the heart rate information locally on the Raspberry Pi 3. For example, you can deploy an AWS Lambda function that can trigger a reaction if the detected heart rate is reaching a set threshold. In this scenario, different individuals might require different thresholds and responses, so you could theoretically deploy unique Lambda functions on a per-individual basis if needed.

Configure AWS Greengrass and AWS IoT Core

To enable further processing and storage of the heart rate data messages published from AWS Greengrass core to AWS IoT Core, create an AWS IoT rule. The AWS IoT rule retrieves messages published to the IoT/heartrate topic and sends them to the Kinesis data stream through an AWS IoT rule action for Kinesis action.  

Simulate heart rate data

You might not have access to an IoT device, but you still want to run a proof of concept (PoC) around heart rate use cases. You can simulate data by creating a shell script and deploying that data simulation script on an Amazon EC2 instance. Refer to the EC2 user guide to get started with Amazon EC2 Linux instances.

On the Amazon EC2 instance, create a shell script kinesis_client_HeartRate.sh, and copy the provided code to start writing some records into the Kinesis data stream. Be sure to create your Kinesis data stream and replace the variable <your_stream_name> in the following script.

while true
  deviceID=$(( ( RANDOM % 10 )  + 1 ))
  heartRate=$(jot -r 1 60 140)
  echo "$deviceID,$heartRate"
  aws kinesis put-record --stream-name <your_stream_name> --data "$deviceID,$heartRate"$'\n' --partition-key $deviceID --region us-east-1

You can also use the Kinesis Data Generator to create data and then stream it to your solution or demonstration. For details on its use, see the blog post Test Your Streaming Data Solution with the New Amazon Kinesis Data Generator.

Ingest data using Kinesis and manage alerts with Lambda, DynamoDB, and Amazon SNS

Now you need to ingest data from the IoT device, which can be processed for real-time notifications when abnormal heart rates are detected.

Streaming data from the heart rate monitoring device is ingested to Kinesis Data Streams. Amazon Kinesis makes it easy to collect, process, and analyze real-time, streaming data. For this project, the data stream was configured with one open shard and a data retention period of 24 hours. This lets you send 1 MB of data or 1,000 events per second and read 2 MB of data per second. If you need to support more devices, you can scale up and add more shards using the UpdateShardCount API or the Amazon Kinesis scaling utility.

You can configure your data stream by using the following AWS CLI command (and then using the appropriate flag to turn on encryption).

aws kinesis create-stream --stream-name hearrate_stream --shard-count 1

You can use an AWS CloudFormation template to create the entire stack depicted in the following architecture diagram.

When launching an AWS CloudFormation template, be sure to enter your email address or mobile phone number with the appropriate endpoint protocol (“Email” or “SMS”) as parameters:

Alternatively, you can follow the manual steps in the documentation links that are provided in this post.

Streaming data in Kinesis can be processed and analyzed in real time by Kinesis clients. Refer to the Kinesis Data Streams Developer Guide to learn how to create a Kinesis data stream.

To identify abnormal heart rate information, you must use real-time analytics to detect abnormal behavior. You can use Kinesis Data Analytics to perform analytics on streaming data in real time. Kinesis Data Analytics consists of three configurable components: source, real-time analytics, and destination. Refer to the AWS documentation to learn the detailed steps to configure Kinesis Data Analytics.

Kinesis Data Analytics uses Kinesis Data Streams as the source stream for the data. In the source configuration process, if there are scenarios where in-filtering or masking records is required, you can preprocess records using AWS Lambda. The data in this particular case is relatively simple, so you don’t need preprocessing of records on the data.

The Kinesis Data Analytics schema editor lets you edit and transform the schema if required. In the following example, we transformed the second column to Value instead of COL_Value.

The SQL code to perform the real-time analysis of the data has to be copied to the SQL Editor for real-time analytics. The following is the sample code that was used for this demonstration.

                                   VALUEROWTIME TIMESTAMP,
                                   ID INTEGER, 
                                   COLVALUE INTEGER);
              AVG("Value") AS HEARTRATE
         STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '60' SECOND) HAVING AVG("Value") > 120 OR AVG("Value") < 40;”

This code generates DESTINATION_SQL_STREAM. It inserts values into the stream only when the average value of the heart beat that is received from SOURCE_SQL_STREAM_001 is greater than 120 or less than 40 in the 60-second time window.

For more information about the tumbling window concept, see Tumbling Windows (Aggregations Using GROUP BY).

Next, add an AWS Lambda function as one of your destinations, and configure it as follows:

In the destination editor, make sure that the stream name selected is the DESTINATION_SQL_STREAM. You only want to trigger the Lambda function when anomalies in the heart rate are detected. The output format can be JSON or CSV. In this example, our Lambda function expects the data in JSON format, so we chose JSON.

Athlete and athletic trainer registration information is stored in the heartrate Registrations DynamoDB table. Amazon DynamoDB offers fully managed encryption at rest using an AWS Key Management Service (AWS KMS) managed encryption key for DynamoDB. You need to create a table with encryption at rest enabled. Follow the detailed steps in Amazon DynamoDB Encryption at Rest.

Each record in the table should include deviceid, customerid, firstname, lastname, and mobile. The following is an example table record for reference.

  "customerid": {
    "S": "3"
  "deviceid": {
    "S": "7"
  "email": {
    "S": "[email protected]"
  "firstname": {
    "S": "John"
  "lastname": {
    "S": "Smith"
  "mobile": {
    "S": "19999999999"

Refer to the DynamoDB Developer Guide for complete instructions for creating and populating a DynamoDB table.

The Lambda function is created to process the record passed from the Kinesis Data Analytics application.  The node.js Lambda function retrieves the athlete and athletic trainer information from the DynamoDB registrations table. It then alerts the athletic trainer to the event by sending a cellular text message via the Amazon Simple Notification Service (Amazon SNS).

Note: The default AWS account limit for Amazon SNS for mobile messages is $1.00 per month. You can increase this limit through an SNS Limit Increase case as described in AWS Service Limits.

You now create a new Lambda function with a runtime of Node.js 6.10 and choose the Create a custom role option for IAM permissions.  If you are new to deploying Lambda functions, see Create a Simple Lambda Function.

You must configure the new Lambda function with a specific IAM role, providing privileges to Amazon CloudWatch Logs, Amazon DynamoDB, and Amazon SNS as provided in the supplied AWS CloudFormation template.

The provided AWS Lambda function retrieves the HR Monitor Device ID and HR Average from the base64-encoded JSON message that is passed from Kinesis Data Analytics.  After retrieving the HR Monitor Device ID, the function then queries the DynamoDB Athlete registration table to retrieve the athlete and athletic trainer information.

Finally, the AWS Lambda function sends a mobile text notification (which does not contain any sensitive information) to the athletic trainer’s mobile number retrieved from the athlete data by using the Amazon SNS service.

To store the streaming data to an S3 bucket for further analysis and visualization using other tools, you can use Kinesis Data Firehose to connect the pipeline to Amazon S3 storage.  To learn more, see Create a Kinesis Data Firehose Delivery Stream.

Kinesis Data Firehose delivers the streaming data in intervals to the destination S3 bucket. The intervals can be defined using either an S3 buffer size or an S3 buffer interval (or both, whichever exceeds the first metric). The data in the Data Firehose delivery stream can be transformed. It also lets you back up the source record before applying any transformation. The data can be encrypted and compressed to GZip, Zip, or Snappy format to store the data in a columnar format like Apache Parquet and Apache ORC. This improves the query performance and reduces the storage footprint. You should enable error logging for operational and production troubleshooting.


In part 1 of this blog series, we demonstrated how to build a data pipeline in support of a data lake. We used key AWS services such as Kinesis Data Streams, Kinesis Data Analytics, Kinesis Data Firehose, and Lambda. In part 2, we’ll discuss how to deploy a serverless data lake and use key analytics to create actionable insights from the data lake.

Additional resources

Langlois, J.A., Rutland-Brown, W. & Wald, M., “The epidemiology and impact of traumatic brain injury: a brief overview,” Journal of Head Trauma Rehabilitation, Vol. 21, No. 5, 2006, pp. 375-378.

Echlin, S. E., Tator, C. H., Cusimano, M. D., Cantu, R. C., Taunton, J. E., Upshur E. G., Hall, C. R., Johnson, A. M., Forwell, L. A., Skopelja, E. N., “A prospective study of physician-observed concussions during junior ice hockey: implications for incidence rates,” Neurosurg Focus, 29 (5):E4, 2010

Daniel, R. W., Rowson, S., Duma, S. M., “Head Impact Exposure in Youth Football,” Annals of Biomedical Engineering., Vol. 10, 2012, 1007.

Greenwald, R. M., Gwin, J. T., Chu, J. J., Crisco, J. J., “Head impact severity measures for evaluating mild traumatic brain injury risk exposure,” Neurosurgery Vol. 62, 2008, pp. 789–79

Additional Reading

If you found this post useful, be sure to check out Setting Up Just-in-Time Provisioning with AWS IoT Core, and Real-time Clickstream Anomaly Detection with Amazon Kinesis Analytics.


About the Authors

Saurabh Shrivastava is a partner solutions architect and big data specialist working with global systems integrators. He works with AWS partners and customers to provide them architectural guidance for building scalable architecture in hybrid and AWS environments.




Abhinav Krishna Vadlapatla is a Solutions Architect with Amazon Web Services. He supports startups and small businesses with their cloud adoption to build scalable and secure solutions using AWS. During his free time, he likes to cook and travel.




John Cupit is a partner solutions architect for AWS’ Global Telecom Alliance Team.  His passion is leveraging the cloud to transform the carrier industry.  He has a son and daughter who have both graduated from college. His daughter is gainfully employed, while his son is in his first year of law school at Tulane University.  As such, he has no spare money and no spare time to work a second job.



David Cowden is partner solutions architect and IoT specialist working with AWS emerging partners. He works with customers to provide them architectural guidance for building scalable architecture in IoT space.




Josh Ragsdale is an enterprise solutions architect at AWS.  His focus is on adapting to a cloud operating model at very large scale. He enjoys cycling and spending time with his family outdoors.




Pierre-Yves Aquilanti, Ph.D., is a senior specialized HPC solutions architect at AWS. He spent several years in the oil & gas industry to optimize R&D applications for large scale HPC systems and enable the potential of machine learning for the upstream. He and his family crave to live in Singapore again for the human, cultural experience and eat fresh durians.



Manuel Puron is an enterprise solutions architect at AWS. He has been working in cloud security and IT service management for over 10 years. He is focused on the telecommunications industry. He enjoys video games and traveling to new destinations to discover new cultures.


Build a Concurrent Data Orchestration Pipeline Using Amazon EMR and Apache Livy

Post Syndicated from Binal Jhaveri original https://aws.amazon.com/blogs/big-data/build-a-concurrent-data-orchestration-pipeline-using-amazon-emr-and-apache-livy/

Many customers use Amazon EMR and Apache Spark to build scalable big data pipelines. For large-scale production pipelines, a common use case is to read complex data originating from a variety of sources. This data must be transformed to make it useful to downstream applications, such as machine learning pipelines, analytics dashboards, and business reports. Such pipelines often require Spark jobs to be run in parallel on Amazon EMR. This post focuses on how to submit multiple Spark jobs in parallel on an EMR cluster using Apache Livy, which is available in EMR version 5.9.0 and later.

Apache Livy is a service that enables easy interaction with a Spark cluster over a REST interface. Apache Livy lets you send simple Scala or Python code over REST API calls instead of having to manage and deploy large jar files. This helps because it scales data pipelines easily with multiple spark jobs running in parallel, rather than running them serially using EMR Step API. Customers can continue to take advantage of transient clusters as part of the workflow resulting in cost savings.

For the purpose of this blog post, we use Apache Airflow to orchestrate the data pipeline. Airflow is an open-sourced task scheduler that helps manage ETL tasks. Customers love Apache Airflow because workflows can be scheduled and managed from one central location. With Airflow’s Configuration as Code approach, automating the generation of workflows, ETL tasks, and dependencies is easy. It helps customers shift their focus from building and debugging data pipelines to focusing on the business problems.

High-level Architecture

Following is a detailed technical diagram showing the configuration of the architecture to be deployed.

We use an AWS CloudFormation script to launch the AWS services required to create this workflow. CloudFormation is a powerful service that allows you to describe and provision all the infrastructure and resources required for your cloud environment, in simple JSON or YAML templates. In this case, the template includes the following:

The Airflow server uses a LocalExecutor (tasks are executed as a subprocess), which helps to parallelize tasks locally. For production workloads, you should consider scaling out with the CeleryExecutor on a cluster with multiple worker nodes.

For demonstration purposes, we use the movielens dataset to concurrently convert the csv files to parquet format and save it to Amazon S3. This dataset is a popular open-source dataset, which is used in exploring data science algorithms. Each dataset file is a comma-separated file with a single header row. The following table describes each file in the dataset.

movies.tsvHas the title and list of genres for movies being reviewed.
ratings.csvShows how users rated movies, using a scale from 1-5. The file also contains the time stamp for the movie review.
tags.csvShows a user-generated tag for each movie. A tag is user-generated metadata about a movie. A tag can be a word or a short phrase. The file also contains the time stamp for the tag.
links.csvContains identifiers to link to movies used by IMDB and MovieDB.
genome-scores.csvShows the relevance of each tag for each movie.
genome-tags.csvProvides the tag descriptions for each tag in the genome-scores.csv file.

Building the Pipeline

Step 0: Prerequisites

Make sure that you have a bash-enabled machine with AWS CLI installed.

Step 1: Create an Amazon EC2 key pair

To build this ETL pipeline, connect to an EC2 instance using SSH. This requires access to an Amazon EC2 key pair in the AWS Region you’re launching your CloudFormation stack. If you have an existing Key Pair in your Region, go ahead and use that Key Pair for this exercise. If not, to create a key pair open the AWS Management Console and navigate to the EC2 console. In the EC2 console left navigation pane, choose Key Pairs.

Choose Create Key Pair, type airflow_key_pair (make sure to type it exactly as shown), then choose Create. This downloads a file called airflow_key_pair.pem. Be sure to keep this file in a safe and private place. Without access to this file, you lose the ability to use SSH to connect with your EC2 instance.

Step 2: Execute the CloudFormation Script

Now, we’re ready to run the CloudFormation script!

Note: The CloudFormation script uses a DBSecurityGroup, which is NOT supported in all Regions.

On the next page, choose the key pair that you created in the previous step (airflow_key_pair) along with a S3 bucket name. The S3 bucket should NOT exist as the cloudformation creates a new S3 bucket. Default values for other parameters have been chosen for simplicity.

After filling out these parameters to fit your environment, choose Next. Finally, review all the settings on the next page. Select the box marked I acknowledge that AWS CloudFormation might create IAM resources (this is required since the script creates IAM resources), then choose Create. This creates all the resources required for this pipeline and takes some time to run. To view the stack’s progress, select the stack you created and choose the Events section or panel.

It takes a couple of minutes for the CloudFormation template to complete.

Step 3: Start the Airflow scheduler in the Airflow EC2 instance

To make these changes, we use SSH to connect to the EC2 instance created by the CloudFormation script. Assuming your local machine has an SSH client, this can be accomplished from the command line. Navigate to the directory that contains the airflow_key_pair.pem file you downloaded earlier and insert the following commands, replacing your-public-ip and your-region with the relevant values from your EC2 instance. The public DNS Name of the EC2 instance can be found on the Outputs tab

Type yes when prompted after the SSH command.

chmod 400 airflow_key_pair.pem
ssh -i "airflow_key_pair.pem" [email protected]amazonaws.com

For more information or help with issues, see Connecting to Your Linux Instance Using SSH.

Now we need to run some commands as the root user.

# sudo as the root user
sudo su
# Navigate to the airflow directory which was created by the cloudformation template – Look at the user-data section.
cd ~/airflow
source ~/.bash_profile

Below is an image of how the ‘/root/airflow/’ directory should look like.

Now we need to start the airflow scheduler. The Airflow scheduler monitors all tasks and all directed acyclic graphs (DAGs), and triggers the task instances whose dependencies have been met. In the background, it monitors and stays in sync with a folder for all DAG objects that it may contain. It periodically (every minute or so) inspects active tasks to see whether they can be triggered.

The Airflow scheduler is designed to run as a persistent service in an Airflow production environment. To get it started, run the airflow scheduler. It will use the configuration specified in airflow.cfg.

To start a scheduler, run the below command in your terminal.

airflow scheduler

Your screen should look like the following with scheduler running.

Step 4: View the transform_movielens DAG on the Airflow Webserver

The Airflow webserver should be running on port 8080. To see the Airflow webserver, open any browser and type in the <EC2-public-dns-name>:8080. The public EC2 DNS name is the same one found in Step 3.

You should see a list of DAGs on the Airflow dashboard. The example DAGs are left there in case you want you experiment with them. But we focus on the transform_movielens DAG for the purposes of this blog. Toggle the ON button next to the name of the DAG.

The following example shows how the dashboard should look.

Choose the transform_movielens DAG, then choose Graph View to view the following image.

This image shows the overall data pipeline. In the current setup, there are six transform tasks that convert each .csv file to parquet format from the movielens dataset. Parquet is a popular columnar storage data format used in big data applications. The DAG also takes care of spinning up and terminating the EMR cluster once the workflow is completed.

The DAG code can also be viewed by choosing the Code button.

Step 5: Run the Airflow DAG

To run the DAG, go back to the Airflow dashboard, and choose the Trigger DAG button for the transform_movielens DAG.

When the Airflow DAG is run, the first task calls the run_job_flow boto3 API to create an EMR cluster. The second task waits until the EMR cluster is ready to take on new tasks. As soon as the cluster is ready, the transform tasks are kicked off in parallel using Apache Livy, which runs on port 8998. Concurrency in the current Airflow DAG is set to 3, which runs three tasks in parallel. To run more tasks in parallel (multiple spark sessions) in Airflow without overwhelming the EMR cluster, you can throttle the concurrency.

How does Apache Livy run the Scala code on the EMR cluster in parallel?

Once the EMR cluster is ready, the transform tasks are triggered by the Airflow scheduler. Each transform task triggers Livy to create a new interactive spark session. Each POST request brings up a new Spark context with a Spark interpreter. This remote Spark interpreter is used to receive and run code snippets, and return back the result.

Let’s use one of the transform tasks as an example to understand the steps in detail.

# Converts each of the movielens datafile to parquet
def transform_movies_to_parquet(**kwargs):
    # ti is the Task Instance
    ti = kwargs['ti']
    cluster_id = ti.xcom_pull(task_ids='create_cluster')
    cluster_dns = emr.get_cluster_dns(cluster_id)
    headers = emr.create_spark_session(cluster_dns, 'spark')
    session_url = emr.wait_for_idle_session(cluster_dns, headers)
    statement_response = emr.submit_statement(session_url,   '/root/airflow/dags/transform/movies.scala')
    emr.track_statement_progress(cluster_dns, statement_response.headers)

The first three lines of this code helps to look up the EMR cluster details. This is used to create an interactive spark session on the EMR cluster using Apache Livy.


Apache Livy creates an interactive spark session for each transform task. The code for which is shown below. SparkSession provides a single point of entry to interact with underlying Spark functionality and allows programming Spark with DataFrame and Dataset APIs. The Spark session is created by calling the POST /sessions API.

Note: You can also change different parameters like driverMemory, executor Memory, number of driver and executor cores as part of the API call.

# Creates an interactive scala spark session. 
# Python(kind=pyspark), R(kind=sparkr) and SQL(kind=sql) spark sessions can also be created by changing the value of kind.
def create_spark_session(master_dns, kind='spark'):
    # 8998 is the port on which the Livy server runs
    host = 'http://' + master_dns + ':8998'
    data = {'kind': kind}
    headers = {'Content-Type': 'application/json'}
    response = requests.post(host + '/sessions', data=json.dumps(data), headers=headers)
    return response.headers


Once the session has completed starting up, it transitions to the idle state. The transform task is then submitted to the session. The scala code is submitted as a REST API call to the Livy Server instead of the EMR cluster, to have good fault tolerance and concurrency.

# Submits the scala code as a simple JSON command to the Livy server
def submit_statement(session_url, statement_path):
    statements_url = session_url + '/statements'
    with open(statement_path, 'r') as f:
        code = f.read()
    data = {'code': code}
    response = requests.post(statements_url, data=json.dumps(data), headers={'Content-Type': 'application/json'})
    return response


The progress of the statement can also be easily tracked and the logs are centralized on the Airflow webserver.

# Function to help track the progress of the scala code submitted to Apache Livy
def track_statement_progress(master_dns, response_headers):
    statement_status = ''
    host = 'http://' + master_dns + ':8998'
    session_url = host + response_headers['location'].split('/statements', 1)[0]
    # Poll the status of the submitted scala code
    while statement_status != 'available':
        # If a statement takes longer than a few milliseconds to execute, Livy returns early and provides a statement URL that can be polled until it is complete:
        statement_url = host + response_headers['location']
        statement_response = requests.get(statement_url, headers={'Content-Type': 'application/json'})
        statement_status = statement_response.json()['state']
        logging.info('Statement status: ' + statement_status)

        #logging the logs
        lines = requests.get(session_url + '/log', headers={'Content-Type': 'application/json'}).json()['log']
        for line in lines:

        if 'progress' in statement_response.json():
            logging.info('Progress: ' + str(statement_response.json()['progress']))
    final_statement_status = statement_response.json()['output']['status']
    if final_statement_status == 'error':
        logging.info('Statement exception: ' + statement_response.json()['output']['evalue'])
        for trace in statement_response.json()['output']['traceback']:
        raise ValueError('Final Statement Status: ' + final_statement_status)
    logging.info('Final Statement Status: ' + final_statement_status)

The below is a snapshot of the centralized logs from the Airflow webserver.

Once the job is run successfully, the Spark session is ended and the EMR cluster is terminated.

Analyze the data in Amazon Athena

The output data in S3 can be analyzed in Amazon Athena by creating a crawler on AWS Glue. For information about automatically creating the tables in Athena, see the steps in Build a Data Lake Foundation with AWS Glue and Amazon S3.


In this post, we explored orchestrating a Spark data pipeline on Amazon EMR using Apache Livy and Apache Airflow. We created a simple Airflow DAG to demonstrate how to run spark jobs concurrently. You can modify this to scale your ETL data pipelines and improve latency. Additionally, we saw how Livy helps to hide the complexity to submit spark jobs via REST by using optimal EMR resources.

For more information about the code shown in this post, see AWS Concurrent Data Orchestration Pipeline EMR Livy. Feel free to leave questions and other feedback in the comments.


Additional Reading

If you found this post useful, be sure to check out Orchestrate Apache Spark applications using AWS Step Functions and Apache Livy.


About the Author

Binal Jhaveri is a Big Data Engineer at Amazon Web Services. Her passion is to build big data products in the cloud. During her spare time, she likes to travel, read detective fiction and loves exploring the abundant nature that the Pacific Northwest has to offer.