Tag Archives: Amazon S3 Glacier

Deploying an EMR cluster on AWS Outposts to process data from an on-premises database

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/deploying-an-emr-cluster-on-aws-outposts-to-process-data-from-an-on-premises-database/

seThis post is written by Eder de Mattos, Sr. Cloud Security Consultant, AWS and Fernando Galves, Outpost Solutions Architect, AWS.

In this post, you will learn how to deploy an Amazon EMR cluster on AWS Outposts and use it to process data from an on-premises database. Many organizations have regulatory, contractual, or corporate policy requirements to process and store data in a specific geographical location. These strict requirements become a challenge for organizations to find flexible solutions that balance regulatory compliance with the agility of cloud services. Amazon EMR is the industry-leading cloud big data platform for data processing, interactive analysis, and machine learning (ML) that uses open-source frameworks. With Amazon EMR on Outposts, you can seamlessly use data analytics solutions to process data locally in your on-premises environment without moving data to the cloud. This post focuses on creating and configuring an Amazon EMR cluster on AWS Outposts rack using Amazon Virtual Private Cloud (Amazon VPC) endpoints and keeping the networking traffic in the on-premises environment.

Architecture overview

In this architecture, there is an Amazon EMR cluster created in an AWS Outposts subnet. The cluster retrieves data from an on-premises PostgreSQL database, employs a PySpark Step for data processing, and then stores the result in a new table within the same database. The following diagram shows this architecture.

Architecture overview

Figure 1 Architecture overview

Networking traffic on premises: The communication between the EMR cluster and the on-premises PostgreSQL database is through the Local Gateway. The core Amazon Elastic Compute Cloud (Amazon EC2) instances of the EMR cluster are associated with Customer-owned IP addresses (CoIP), and each instance has two IP addresses: an internal IP and a CoIP IP. The internal IP is used to communicate locally in the subnet, and the CoIP IP is used to communicate with the on-premises network.

Amazon VPC endpoints: Amazon EMR establishes communication with the VPC through an interface VPC endpoint. This communication is private and conducted entirely within the AWS network instead of connecting over the internet. In this architecture, VPC endpoints are created on a subnet in the AWS Region.

The support files used to create the EMR cluster are stored in an Amazon Simple Storage Service (Amazon S3) bucket. The communication between the VPC and Amazon S3 stays within the AWS network. The following files are stored in this S3 bucket:

  • get-postgresql-driver.sh: This is a bootstrap script to download the PostgreSQL driver to allow the Spark step to communicate to the PostgreSQL database through JDBC. You can download it through the GitHub repository for this Amazon EMR on Outposts blog post.
  • postgresql-42.6.0.jar: PostgreSQL binary JAR file for the JDBC driver.
  • spark-step-example.py: Example of a Step application in PySpark to simulate the connection to the PostgreSQL database.

AWS Systems Manager is configured to manage the EC2 instances that belong to the EMR cluster. It uses an interface VPC endpoint to allow the VPC to communicate privately with the Systems Manager.

The database credentials to connect to the PostgreSQL database are stored in AWS Secrets Manager. Amazon EMR integrates with Secrets Manager. This allows the secret to be stored in the Secrets Manager and be used through its ARN in the cluster configuration. During the creation of the EMR cluster, the secret is accessed privately through an interface VPC endpoint and stored in the variable DBCONNECTION in the EMR cluster.

In this solution, we are creating a small EMR cluster with one primary and one core node. For the correct sizing of your cluster, see Estimating Amazon EMR cluster capacity.

There is additional information to improve the security posture for organizations that use AWS Control Tower landing zone and AWS Organizations. The post Architecting for data residency with AWS Outposts rack and landing zone guardrails is a great place to start.

Prerequisites

Before deploying the EMR cluster on Outposts, you must make sure the following resources are created and configured in your AWS account:

  1. Outposts rack are installed, up and running.
  2. Amazon EC2 key pair is created. To create it, you can follow the instructions in Create a key pair using Amazon EC2 in the Amazon EC2 user guide.

Deploying the EMR cluster on Outposts

1.      Deploy the CloudFormation template to create the infrastructure for the EMR cluster

You can use this AWS CloudFormation template to create the infrastructure for the EMR cluster. To create a stack, you can follow the instructions in Creating a stack on the AWS CloudFormation console in the AWS CloudFormation user guide.

2.      Create an EMR cluster

To launch a cluster with Spark installed using the console:

Step 1: Configure Name and Applications

  1. Sign in to the AWS Management Console, and open the Amazon EMR console.
  2. Under EMR on EC2, in the left navigation pane, select Clusters, and then choose Create Cluster.
  3. On the Create cluster page, enter a unique cluster name for the Name
  4. For Amazon EMR release, choose emr-6.13.0.
  5. In the Application bundle field, select Spark 3.4.1 and Zeppelin 0.10.1, and unselect all the other options.
  6. For the Operating system options, select Amazon Linux release.

Create Cluster Figure 2: Create Cluster

Step 2: Choose Cluster configuration method

  1. Under the Cluster configuration, select Uniform instance groups.
  2. For the Primary and the Core, select the EC2 instance type available in the Outposts rack that is supported by the EMR cluster.
  3. Remove the instance group Task 1 of 1.

Remove the instance group Task 1 of 1

Figure 3: Remove the instance group Task 1 of 1

Step 3: Set up Cluster scaling and provisioning, Networking and Cluster termination

  1. In the Cluster scaling and provisioning option, choose Set cluster size manually and type the value 1 for the Core
  2. On the Networking, select the VPC and the Outposts subnet.
  3. For Cluster termination, choose Manually terminate cluster.

Step 4: Configure the Bootstrap actions

A. In the Bootstrap actions, add an action with the following information:

    1. Name: copy-postgresql-driver.sh
    2. Script location: s3://<bucket-name>/copy-postgresql-driver.sh. Modify the <bucket-name> variable to the bucket name you specified as a parameter in Step 1.

Add bootstrap action

Figure 4: Add bootstrap action

Step 5: Configure Cluster logs and Tags

a. Under Cluster logs, choose Publish cluster-specific logs to Amazon S3 and enter s3://<bucket-name>/logs for the field Amazon S3 location. Modify the <bucket-name> variable to the bucket name you specified as a parameter in Step 1.

Amazon S3 location for cluster logs

Figure 5: Amazon S3 location for cluster logs

b. In Tags, add new tag. You must enter for-use-with-amazon-emr-managed-policies for the Key field and true for Value.

Add tags

Figure 6: Add tags

Step 6: Set up Software settings and Security configuration and EC2 key pair

a. In the Software settings, enter the following configuration replacing the Secret ARN created in Step 1:

[
          {
                    "Classification": "spark-defaults",
                    "Properties": {
                              "spark.driver.extraClassPath": "/opt/spark/postgresql/driver/postgresql-42.6.0.jar",
                              "spark.executor.extraClassPath": "/opt/spark/postgresql/driver/postgresql-42.6.0.jar",
                              "[email protected]":
                                         "arn:aws:secretsmanager:<region>:<account-id>:secret:<secret-name>"
                    }
          }
]

This is an example of the Secret ARN replaced:

Example of the Secret ARN replaced

Figure 7: Example of the Secret ARN replaced

b. For the Security configuration and EC2 key pair, choose the SSH key pair.

Step 7: Choose Identity and Access Management (IAM) roles

a. Under Identity and Access Management (IAM) roles:

    1. In the Amazon EMR service role:
      • Choose AmazonEMR-outposts-cluster-role for the Service role.
    2. In EC2 instance profile for Amazon EMR
      • Choose AmazonEMR-outposts-EC2-role.

Choose the service role and instance profile

Figure 8: Choose the service role and instance profile

Step 8: Create cluster

  1. Choose Create cluster to launch the cluster and open the cluster details page.

Now, the EMR cluster is starting. When your cluster is ready to process tasks, its status changes to Waiting. This means the cluster is up, running, and ready to accept work.

Result of the cluster creation

Figure 9: Result of the cluster creation

3.      Add CoIPs to EMR core nodes

You need to allocate an Elastic IP from the CoIP pool and associate it with the EC2 instance of the EMR core nodes. This is necessary to allow the core nodes to access the on-premises environment. To allocate an Elastic IP, follow the instructions in Allocate an Elastic IP address in Amazon EC2 User Guide for Linux Instances. In Step 5, choose the Customer-owned pool of IPV4 addresses.

Once the CoIP IP is allocated, associate it with each EC2 instance of the EMR core node. Follow the instructions in Associate an Elastic IP address with an instance or network interface in Amazon EC2 User Guide for Linux Instances.

Checking the configuration

  1. Make sure the EC2 instance of the core nodes can ping the IP of the PostgreSQL database.

Connect to the Core node EC2 instance using Systems Manager and ping the IP address of the PostgreSQL database.

Connectivity test

Figure 10: Connectivity test

  1. Make sure the Status of the EMR cluster is Waiting.

: Cluster is ready and waiting

Figure 11: Cluster is ready and waiting

Adding a step to the Amazon EMR cluster

You can use the following Spark application to simulate the data processing from the PostgreSQL database.

spark-step-example.py:

import os
from pyspark.sql import SparkSession

if __name__ == "__main__":

    # ---------------------------------------------------------------------
    # Step 1: Get the database connection information from the EMR cluster 
    #         configuration
    dbconnection = os.environ.get('DBCONNECTION')
    #    Remove brackets
    dbconnection_info = (dbconnection[1:-1]).split(",")
    #    Initialize variables
    dbusername = ''
    dbpassword = ''
    dbhost = ''
    dbport = ''
    dbname = ''
    dburl = ''
    #    Parse the database connection information
    for dbconnection_attribute in dbconnection_info:
        (key_data, key_value) = dbconnection_attribute.split(":", 1)

        if key_data == "username":
            dbusername = key_value
        elif key_data == "password":
            dbpassword = key_value
        elif key_data == 'host':
            dbhost = key_value
        elif key_data == 'port':
            dbport = key_value
        elif key_data == 'dbname':
            dbname = key_value

    dburl = "jdbc:postgresql://" + dbhost + ":" + dbport + "/" + dbname

    # ---------------------------------------------------------------------
    # Step 2: Connect to the PostgreSQL database and select data from the 
    #         pg_catalog.pg_tables table
    spark_db = SparkSession.builder.config("spark.driver.extraClassPath",                                          
               "/opt/spark/postgresql/driver/postgresql-42.6.0.jar") \
               .appName("Connecting to PostgreSQL") \
               .getOrCreate()

    #    Connect to the database
    data_db = spark_db.read.format("jdbc") \
        .option("url", dburl) \
        .option("driver", "org.postgresql.Driver") \
        .option("query", "select count(*) from pg_catalog.pg_tables") \
        .option("user", dbusername) \
        .option("password", dbpassword) \
        .load()

    # ---------------------------------------------------------------------
    # Step 3: To do the data processing
    #
    #    TO-DO

    # ---------------------------------------------------------------------
    # Step 4: Save the data into the new table in the PostgreSQL database
    #
    data_db.write \
        .format("jdbc") \
        .option("url", dburl) \
        .option("dbtable", "results_proc") \
        .option("user", dbusername) \
        .option("password", dbpassword) \
        .save()

    # ---------------------------------------------------------------------
    # Step 5: Close the Spark session
    #
    spark_db.stop()
    # ---------------------------------------------------------------------

You must upload the file spark-step-example.py to the bucket created in Step 1 of this post before submitting the Spark application to the EMR cluster. You can get the file at this GitHub repository for a Spark step example.

Submitting the Spark application step using the Console

To submit the Spark application to the EMR cluster, follow the instructions in To submit a Spark step using the console in the Amazon EMR Release Guide. In Step 4 of this Amazon EMR guide, provide the following parameters to add a step:

  1. choose Cluster mode for the Deploy mode
  2. type a name for your step (such as Step 1)
  3. for the Application location, choose s3://<bucket-name>/spark-step-example.py and replace the <bucket-name> variable to the bucket name you specified as a parameter in Step 1
  4. leave the Spark-submit options field blank

Add a step to the EMR cluster

Figure 12: Add a step to the EMR cluster

The Step is created with the Status Pending. When it is done, the Status changes to Completed.

Step executed successfully

Figure 13: Step executed successfully

Cleaning up

When the EMR cluster is no longer needed, you can delete the resources created to avoid incurring future costs by following these steps:

  1. Follow the instructions in Terminate a cluster with the console in the Amazon EMR Documentation Management Guide. Remember to turn off the Termination protection.
  2. Dissociate and release the CoIP IPs allocated to the EC2 instances of the EMR core nodes.
  3. Delete the stack in the AWS CloudFormation using the instructions in Deleting a Stack on the AWS CloudFormation console in the AWS CloudFormation User Guide

Conclusion

Amazon EMR on Outposts allows you to use the managed services offered by AWS to perform big data processing close to your data that needs to remain on-premises. This architecture eliminates the need to transfer on-premises data to the cloud, providing a robust solution for organizations with regulatory, contractual, or corporate policy requirements to store and process data in a specific location. With the EMR cluster accessing the on-premises database directly through local networking, you can expect faster and more efficient data processing without compromising on compliance or agility. To learn more, visit the Amazon EMR on AWS Outposts product overview page.

Welcome to AWS Storage Day 2023

Post Syndicated from Veliswa Boya original https://aws.amazon.com/blogs/aws/welcome-to-aws-storage-day-2023/

Welcome to the fifth annual AWS Storage Day! This virtual event is happening today starting at 9:00 AM Pacific Time (12:00 PM Eastern Time) and is available for you to watch on the AWS On Air Twitch channel. The first AWS Storage Day was hosted in 2019, and this event has grown into an innovation day that we look forward to delivering to you every year. In last year’s Storage Day post, I wrote about the constant innovations in AWS Storage aimed at helping you put your data to work while keeping it secure and protected. This year, Storage Day is focused on storage for AI/ML, data protection and resiliency, and the benefits of moving to the cloud.

AWS Storage Day Key Themes
When it comes to storage for AI/ML, data volumes are increasing at an unprecedented rate, exploding from terabytes to petabytes and even to exabytes. With a modern data architecture on AWS, you can rapidly build scalable data lakes, use a broad and deep collection of purpose-built data services, scale your systems at a low cost without compromising performance, share data across organizational boundaries, and manage compliance, security, and governance, allowing you to make decisions with speed and agility at scale.
To train machine learning models and build Generative AI applications, you must have the right data strategy in place. So, I’m happy to see that, among the list of sessions to look forward to at the live event, the Optimize generative AI and ML with AWS Infrastructure session will discuss how you can transform your data into meaningful insights.

Whether you’re just getting started with the cloud, planning to migrate applications to AWS, or already building applications on AWS, we have resources to help you protect your data and meet your business continuity objectives. Our data protection and resiliency features and solutions can help you meet your business continuity goals and deliver disaster recovery during data loss events, across recovery point and time objectives (RPO and RTO). With the unprecedented data growth happening in the world today, determining where your data is stored, how it’s secured, and who has access to it is a higher priority than ever. Be sure to join the Protect data in AWS amid a rapidly evolving cyber landscape session to learn more.

When moving data to the cloud, you need to understand where you’re moving it for different use cases, the types of data you’re moving, and the network resources available, among other considerations. There are many reasons to move to the cloud, recently, Enterprise Strategy Group (ESG) validated that organizations reduced compute, networking, and storage costs by up to 66 percent by migrating on-premises workloads to AWS Cloud infrastructure. ESG confirmed that migrating on-premises workloads to AWS provides organizations with reduced costs, increased performance, improved operational efficiency, faster time to value, and improved business agility.
We have a number of sessions that discuss how to move to the cloud, based on your use case. I’m most looking forward to the Hybrid cloud storage and edge compute: AWS, where you need it session, which will discuss considerations for workloads that can’t fully move to the cloud.

Tune in to learn from experts about new announcements, leadership insights, and educational content related to the broad portfolio of AWS Storage services and features that address all these themes and more. Today, we have announcements related to Amazon Simple Storage Service (Amazon S3), Amazon FSx for Windows File Server, Amazon Elastic File System (Amazon EFS), Amazon FSx for OpenZFS, and more.

Let’s get into it.

15 Years of Amazon EBS
Not long ago, I was reading Jeff Barr’s post titled 15 Years of AWS Blogging! In this post, Jeff mentioned a few posts he wrote for the earliest AWS services and features. Amazon Elastic Block Store (Amazon EBS) is on this list as a service that simplifies the use of Amazon EC2.

Well, it’s been 15 years since the launch of Amazon EBS was announced, and today we celebrate 15 years of this service. If you were one of the original users who put Amazon EBS to good use and provided us with the very helpful feedback that helped us invent and simplify, iterate and improve, I’m sure you can’t believe how time flies. Today, Amazon EBS handles more than 100 trillion I/O operations daily, and over 390 million EBS volumes are created every day.

If you’re new to Amazon EBS, join us for a fireside chat with Matt Garman, Senior Vice President, Sales, Marketing, and Global Services at AWS, and learn the strategy and customer challenges behind the launch of the service in 2008. You’ll also hear from long-term EBS customer, Stripe, about its growth with EBS since Stripe was launched 12 years ago.

Amazon EBS has continuously improved its scalability and performance to support more customer workloads as the direct storage attachment for Amazon EC2 instances. With the launch of Amazon EC2 M7i instances, powered by custom 4th Generation Intel Xeon Scalable processors, on August 2, you can attach up to 128 Amazon EBS volumes, an increase from 28 on a previous generation M6i instance. The higher number of volume attachments means you can increase storage density per instance and improve resource utilization, reducing total compute cost.

You can host up to 127 containers per instance for larger database applications and scale them more cost effectively before needing to provision more instances and only pay for resources you need. With a higher number of volume attachments, you can fully utilize the memory and vCPU available on these powerful M7i instances as your database storage footprint grows. EBS is also increasing the number of multi-volume snapshots you can create, for up to 128 EBS volumes attached to an instance, enabling you to create crash-consistent backups of all volumes attached to an instance.

Join the 15 years of innovations with Amazon EBS session for a discussion about how the original vision for Amazon EBS has evolved to meet your growing demands for cloud infrastructure.

Mountpoint for Amazon S3
Now generally available, Mountpoint for Amazon S3 is a new open source file client that delivers high throughput access, lowering compute costs for data lakes on Amazon S3. Mountpoint for Amazon S3 is a file client that translates local file system API calls to S3 object API calls. Using Mountpoint for Amazon S3, you can mount an Amazon S3 bucket as a local file system on your compute instance, to access your objects through a file interface with the elastic storage and throughput of Amazon S3. Mountpoint for Amazon S3 supports sequential and random read operations on existing files, and sequential write operations for creating new files.

The Deep dive and demo of Mountpoint for Amazon S3 session demonstrates how to use the file client to access objects in Amazon S3 using file APIs, making it easier to store data at scale and maximize the value of your data with analytics and machine learning workloads. Read this blog post to learn more about Mountpoint for Amazon S3 and how to get started, including a demo.

Put Cold Storage to Work Faster with Amazon S3 Glacier Flexible Retrieval
Amazon S3 Glacier Flexible Retrieval improves data restore time by up to 85 percent, at no additional cost. Faster data restores automatically apply to the Standard retrieval tier when using Amazon S3 Batch Operations. These restores begin to return objects within minutes, so you can process restored data faster. Processing restored data in parallel with ongoing restores helps you accelerate data workflows and quickly respond to business needs. Now, whether you’re transcoding media, restoring operational backups, training machine learning models, or analyzing historical data, you can speed up your data restores from archive.

Coupled with the S3 Glacier improvements to restore throughput by up to 10 times for millions of objects announced in 2022, S3 Glacier data restores of all sizes now benefit from both faster starts and shorter completion times.

Join the Maximize the value of cold data with Amazon S3 Glacier session to learn how Amazon S3 Glacier is helping organizations of all sizes and from all industries transform their data archiving to unlock business value, increase agility, and save on storage costs. Read this blog post to learn more about the Amazon S3 Glacier Flexible Retrieval performance improvements and follow step-by-step guidance on how to get started with faster standard retrievals from S3 Glacier Flexible Retrieval.

Supporting a Broad Spectrum of File Workloads
To serve a broad spectrum of use cases that rely on file systems, we offer a portfolio of file system services, each targeting a different set of needs. Amazon EFS is a serverless file system built to deliver an elastic experience for sharing data across compute resources. Amazon FSx makes it easier and cost-effective for you to launch, run, and scale feature-rich, high-performance file systems in the cloud, enabling you to move to the cloud with no changes to your code, processes, or how you manage your data.

Power ML research and big data analytics with Amazon EFS
Amazon EFS offers serverless and fully scalable file storage, designed for high scalability in both storage capacity and throughput performance. Just last week, we announced enhanced support for faster read and write IOPS, making it easier to power more demanding workloads. We’ve improved the performance capabilities of Amazon EFS by adding support for up to 55,000 read IOPS and up to 25,000 write IOPS per file system. These performance enhancements help you to run more demanding workflows, such as machine learning (ML) research with KubeFlow, financial simulations with IBM Symphony, and big data processing with Domino Data Lab, Hadoop, and Spark.

Join the Build and run analytics and SaaS applications at scale session to hear how recent Amazon EFS performance improvements can help power more workloads.

Multi-AZ file systems on Amazon FSx for OpenZFS
You can now use a multi-AZ deployment option when creating file systems on Amazon FSx for OpenZFS, making it easier to deploy file storage that spans multiple AWS Availability Zones to provide multi-AZ resilience for business-critical workloads. With this launch, you can take advantage of the power, agility, and simplicity of Amazon FSx for OpenZFS for a broader set of workloads, including business-critical workloads like database, line-of-business, and web-serving applications that require highly available shared storage that spans multiple AZs.

The new multi-AZ file systems are designed to deliver high levels of performance to serve a broad variety of workloads, including performance-intensive workloads such as financial services analytics, media and entertainment workflows, semiconductor chip design, and game development and streaming, up to 21 GB per second of throughput and over 1 million IOPS for frequently accessed cached data, and up to 10 GB per second and 350,000 IOPS for data accessed from persistent disk storage.

Join the Migrate NAS to AWS to reduce TCO and gain agility session to learn more about multi-AZs with Amazon FSx for OpenZFS.

New, Higher Throughput Capacity Levels on Amazon FSx for Windows File Server
Performance improvements for Amazon FSx for Windows File Server help you accelerate time-to-results for performance-intensive workloads such as SQL Server databases, media processing, cloud video editing, and virtual desktop infrastructure (VDI).

We’re adding four new, higher throughput capacity levels to increase the maximum I/O available up to 12 GB per second from the previous I/O of 2 GB per second. These throughput improvements come with correspondingly higher levels of disk IOPS, designed to deliver an increase up to 350,000 IOPS.

In addition, by using FSx for Windows File Server, you can provision IOPS higher than the default 3 IOPS per GiB for your SSD file system. This allows you to scale SSD IOPS independently from storage capacity, allowing you to optimize costs for performance-sensitive workloads.

Join the Migrate NAS to AWS to reduce TCO and gain agility session to learn more about the performance improvements for Amazon FSx for Windows File Server.

Logically Air-Gapped Vault for AWS Backup
AWS Backup is a fully managed, policy-based data protection solution that enables customers to centralize and automate backup restores across 19 AWS services (spanning compute, storage, and databases) and third-party applications such as VMware Cloud on AWS and on-premises, as well as SAP HANA on Amazon EC2.

Today, we’re announcing the preview of logically air-gapped vault as a new type of AWS Backup Vault that acts as an additional layer of protection to mitigate against malware events. With logically air-gapped vault, customers can recover their application data through a different trusted account.

Join the Deep dive on data recovery for ransomware events session to learn more about logically air-gapped vault for AWS Backup.

Copy Data to and from Other Clouds with AWS DataSync
AWS DataSync is an online data movement and discovery service that simplifies data migration and helps you quickly, easily, and securely transfer your file or object data to, from, and between AWS storage services. In addition to support of data migration to and from AWS storage services, DataSync supports copying to and from other clouds such as Google Cloud Storage, Azure Files, and Azure Blob Storage. Using DataSync, you can move your object data at scale between Amazon S3 compatible storage on other clouds and AWS storage services such as Amazon S3. We’re now expanding the support of DataSync for copying data to and from other clouds to include DigitalOcean Spaces, Wasabi Cloud Storage, Backblaze B2 Cloud Storage, Cloudflare R2 Storage, and Oracle Cloud Storage.

Join the Identify and accelerate data migrations at scale session to learn more about this expanded support for DataSync.

Join Us Online
Join us today for the AWS Storage Day virtual event on the AWS On Air channel on Twitch. The event will be live starting at 9:00 AM Pacific Time (12:00 PM Eastern Time) on August 9. All sessions will be available on demand approximately two days after Storage Day.

We look forward to seeing you on Twitch!

– Veliswa 

New – Improve Amazon S3 Glacier Flexible Restore Time By Up To 85% Using Standard Retrieval Tier and S3 Batch Operations

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/new-improve-amazon-s3-glacier-flexible-restore-time-by-up-to-85-using-standard-retrieval-tier-and-s3-batch-operations/

Last year, Amazon S3 Glacier celebrated its tenth anniversary. Amazon S3 Glacier is the leader in cloud cold storage, and I wrote about its innovations over the last decade.

The Amazon S3 Glacier storage classes provide you with long-term, secure, and durable storage options to optimally archive your data at the lowest cost. The Amazon S3 Glacier storage classes (Amazon S3 Glacier Instant Retrieval, Amazon S3 Glacier Flexible Retrieval, and Amazon S3 Glacier Deep Archive) are purpose-built for colder data, providing you with retrieval flexibility from milliseconds to days, in addition to the ability to store archive data for as low as $1 per terabyte per month.

Many customers tell us that they are keeping their data for longer periods of time because they recognize its future value potential, and that they are already monetizing subsets of their archival data, or plan to use large sets of their archive data in the future. Modern data archiving is not only about optimizing storage costs for cold data; it’s also about setting up mechanisms so that when you need to put that data to work for your business, you can access it as quickly as your business requirements demand.

In 2022, AWS customers restored over 32 billion objects from Amazon S3 Glacier. Customers need to retrieve archived objects quickly when transcoding media, restoring operational backups, training machine learning (ML) models, or analyzing historical data. While customers using S3 Glacier Instant Retrieval can access their data in just milliseconds, S3 Glacier Flexible Retrieval is lower cost and provides three retrieval options: expedited retrievals in 1–5 minutes, standard retrievals in 3–5 hours, and free bulk retrievals in 5–12 hours. S3 Glacier Deep Archive is our lowest cost storage class and provides data retrieval within 12 hours using the standard retrieval option or 48 hours using the bulk retrieval option.

In November 2022, Amazon S3 Glacier improved restore throughput by up to 10 times at no additional cost when retrieving large volumes of archived data in S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive. With Amazon S3 Batch Operations, you can automatically initiate requests at a faster rate, allowing you to restore billions of objects containing petabytes of data.

To continue the decade-long trend of cold storage innovation, we are announcing today the general availability of faster Standard retrievals from S3 Glacier Flexible Retrieval by up to 85 percent, at no additional cost. Faster data restores automatically apply to the Standard retrieval tier when using S3 Batch Operations.

Using S3 Batch Operations, you can restore archived data at scale by providing a manifest of objects to be retrieved and specifying a retrieval tier. With S3 Batch Operations, restores in the Standard retrieval tier now typically begin to return objects to you within minutes, down from 3–5 hours, so you can easily speed up your data restores from archive.

Additionally, S3 Batch Operations improves overall restore throughput by applying new performance optimizations to your jobs. As a result, you can restore your data faster and process restored objects sooner. Processing restored data in parallel with ongoing restores helps you accelerate data workflows and quickly respond to business needs.

Getting Started with Faster Standard Retrievals from S3 Glacier Flexible Retrieval
To restore archived data with this performance improvement, you can use S3 Batch Operations to perform both large- and small-scale batch operations on S3 objects. S3 Batch Operations can perform a single operation on lists of S3 objects that you specify. You can use S3 Batch Operations through the AWS Management Console, AWS Command Line Interface (AWS CLI), SDKs, or REST API.

To create a batch job, choose Batch Operations on the left navigation pane of the Amazon S3 console and choose Create job. You can select one of the manifest formats, a list of S3 objects that contains object keys that you want to retrieve. If your manifest format is a CSV file, each row in the file must include the bucket name, object key, and, optionally, the object version.

In the next step, choose the operation that you want to perform on all objects listed in the manifest. The Restore operation initiates restore requests for archived objects on a list of S3 objects that you specify. Using a restore operation results in a restore request for every object that is specified in the manifest.

When you restore with the Standard retrieval tier from the S3 Glacier Flexible Retrieval storage class, you automatically get faster retrievals.

You can also create a restore job with S3InitiateRestoreObject job using the AWS CLI:

$aws s3control create-job \
     --region us-east-1 \
     --account-id 123456789012 \
     --operation '{"S3InitiateRestoreObject": { "ExpirationInDays": 1, "GlacierJobTier":"STANDARD"} }' \
     --report '{"Bucket":"arn:aws:s3:::reports-bucket ","Prefix":"batch-op-restore-job", "Format":" S3BatchOperations_CSV_20180820","Enabled":true,"ReportScope":"FailedTasksOnly"}' \
     --manifest '{"Spec":{"Format":"S3BatchOperations_CSV_20180820", "Fields":["Bucket","Key"]},"Location":{"ObjectArn":"arn:aws:s3:::inventory-bucket/inventory_for_restore.csv", "ETag":"<ETag>"}}' \
     --role-arn arn:aws:iam::123456789012:role/s3batch-role

You can then check the status of the job submission of the requests by running the following CLI command:

$ aws s3control describe-job \
     --region us-east-1 \
     --account-id 123456789012 \
     --job-id <JobID> \
     --query 'Job'.'ProgressSummary'

You can view and update the job status, add notifications and logging, track job failures, and generate completion reports. S3 Batch Operations job activity is recorded as events in AWS CloudTrail. For tracking job events, you can create a custom rule in Amazon EventBridge and send these events to the target notification resource of your choice, such as Amazon Simple Notification Service (Amazon SNS).

When you create an S3 Batch Operations job, you can also request a completion report for all tasks or just for failed tasks. The completion report contains additional information for each task, including the object key name and version, status, error codes, and descriptions of any errors.

For more information, see Tracking job status and completion reports in the Amazon S3 User Guide.

Here is the result of a sample retrieval job with 250 objects, each sized 100 MB. As you can see from the Previous restore performance line (blue line at the right), these restores would typically finish in 3–5 hours using Standard retrievals. Now, when you use Standard retrievals with S3 Batch Operations, your job typically starts within minutes, as shown in the Improved restore performance line (orange line at the left), improving data restore time by up to 85 percent.

To learn more, see Restoring archived objects at scale from the Amazon S3 Glacier storage classes on the AWS Storage Blog and Restoring an archived object in the Amazon S3 User Guide.

Now Available
Faster standard retrievals for Amazon S3 Glacier Flexible Retrieval are now available in all AWS Regions, including the AWS GovCloud (US) Regions and China Regions. This performance improvement is available to you at no additional cost. You are charged for S3 Batch Operations and data retrievals. For more information, see the S3 pricing page.

Lastly, we published a new ebook titled “Maximize the value of cold storage with Amazon S3 Glacier“. Read this ebook to learn how Amazon S3 Glacier is helping organizations of all sizes and from all industries transform their data archiving to unlock business value, increase agility, and save on storage costs.

To learn more, visit the S3 Glacier storage classes page and getting started guide, and send feedback to AWS re:Post for S3 Glacier or through your usual AWS Support contacts.

I’m really excited for you to start using this new feature, and I look forward to hearing about even more ways you are reinventing your business with archive data.

Channy

Celebrate Amazon S3’s 17th birthday at AWS Pi Day 2023

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/celebrate-amazon-s3s-17th-birthday-at-aws-pi-day-2023/

AWS Pi Day 2023 is live today starting at 13:00 PDT; join us on the AWS on Air channel on Twitch.

On this day 17 years ago, we launched a very simple object storage service. It allowed developers to create, list, and delete private storage spaces (known as buckets), upload and download files, and manage their access permissions. The service was available only through a REST and SOAP API. It was designed to provide highly durable data storage with 99.999999999 percent data durability (that’s 11 nines!).

Fast forward to 2023, Amazon Simple Storage Service (Amazon S3) holds more than 280 trillion objects and averages over 100 million requests per second. To protect data integrity, Amazon S3 performs over four billion checksum computations per second. Over the years, we added many capabilities, such as a range of storage classes, to store your colder data cost effectively. Every day, you restore on average more than 1 petabyte from the S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive storage classes. Since launch, you have saved $1 billion from using Amazon S3 Intelligent-Tiering. In 2015, we added the possibility of replicating your data across Regions. Every week, Amazon S3 Replication moves more than 100 petabytes of data. Amazon S3 is also at the core of hundreds of thousands of data lakes. It also has become a critical component of a growing ecosystem of serverless applications. Every day, Amazon S3 sends over 125 billion event notifications to serverless applications. Altogether, Amazon S3 is helping people around the world securely store and extract value from their data.

AWS Pi Day 2023 Small

To celebrate Amazon S3‘s birthday AWS is hosting the AWS Pi Day event for the third consecutive year. This live online event starts at 13:00 PDT today (March 14, 2023) on the AWS On Air channel on Twitch and will feature four hours of fresh educational content from AWS experts. We will discuss not only Amazon S3 best practices, we will also dive into the latest innovations across AWS data services, from storage to analytics and AI/ML. Tune in to learn how to get the most out of your data by making it more secure, available, accessible, and connected, and to help you respond to rapid growth and changing demand. You will also learn how to optimize your data costs, automate your cost savings, eliminate operational complexity, and get new insights from your data. Have a look at the full agenda on the registration page.

At AWS, we innovate on your behalf. During the last few weeks, we announced a 99.99 percent SLA for Amazon MemoryDB for Redis, enhanced I/O multiplexing for Amazon ElastiCache for Redis, and encryption by default for new objects on Amazon S3.

But we are not stopping there, and today we take the occasion of this celebration to announce seven new capabilities across our data services.

Mountpoint for Amazon S3 (alpha release): an open-source file client for Amazon S3
Mountpoint for Amazon S3 is an open-source file client for Amazon S3 that you can install on your compute instance. It translates local file storage API calls to REST API calls on objects in Amazon S3. When using Mountpoint for Amazon S3, data lake applications that access objects using file APIs can achieve high single-instance transfer rates, saving on compute costs.

You can get started with Mountpoint for Amazon S3 by mounting an Amazon S3 bucket at a local mount point on your compute instance. Once mounted, applications read objects as files available locally. Mountpoint for Amazon S3 supports sequential and random read operations on existing S3 objects. It is available to download for Linux operating systems as an alpha release and is not yet intended for production workloads. Instead, we want to collect your feedback early and incorporate your input into the design and implementation. To get started, visit the Mountpoint for Amazon S3 GitHub repo, read the technical launch blog, and share your feedback.

Now Generally Available: AWS Data Exchange for Amazon S3
AWS Data Exchange for Amazon S3 enables you to easily find, subscribe to, and use third-party data files for faster time to insight, storage cost optimization, simplified data licensing management, and more. Data Exchange subscribers can directly use files from data providers’ Amazon S3 buckets for their analysis with AWS services without needing to create or manage copies to their account. Data providers can license in-place access to data hosted in their Amazon S3 buckets.

To learn more about how data providers can simplify and scale access management to multiple data subscribers, you can read this blog.

AWS Data Exchange for S3

Amazon S3 Multi-Region Access Points now support replicated datasets that span multiple AWS accounts
We launched Amazon S3 Multi-Region Access Points in September 2021. We added failover control in November 2022. Amazon S3 Multi-Region Access Points now support datasets that are replicated across multiple AWS accounts. Cross-account Multi-Region Access Points simplify object storage access for applications that span both AWS Regions and accounts, avoiding the need for complex request routing logic in your application. They provide a single global endpoint for your multi-Region applications and dynamically route S3 requests based on policies that you define. This helps you to easily implement multi-Region resilience, latency-based routing, and active/passive failover, even when your data is stored in multiple AWS accounts.

You can learn more about S3 Multi-Region Access Points on the Amazon S3 FAQs.

Aliases for S3 Object Lambda Access Points as CloudFront origin
Amazon S3 Object Lambda, launched in March 2021, lets you add your own code to S3 GET, HEAD, and LIST API requests to modify data as it is returned to an application. With today’s launch of aliases for S3 Object Lambda Access Points any application that requires an S3 bucket name can easily present different views of data depending on the requester. You can now use an S3 Object Lambda Access Point alias as an origin for your Amazon CloudFront distribution to modify the data requested. For example, you can dynamically transform an image depending on the device that a user is visiting from, such as a desktop or a smartphone.

If you want to learn more, my colleague Danilo wrote a blog post with more details and code examples.

Simplify private connectivity from on-premises networks
Amazon Virtual Private Cloud (Amazon VPC) interface endpoints for Amazon S3 now offer private DNS options that can help you more easily route Amazon S3 requests to the lowest-cost endpoint in your VPC. With private DNS for Amazon S3, your on-premises applications can use AWS PrivateLink to access Amazon S3 over an interface endpoint, while requests from your in-VPC applications access Amazon S3 using gateway endpoints. Routing requests like this helps you take advantage of the lowest-cost private network path without having to make code or configuration changes to your clients.

S3 private connectivity

You can learn more on the AWS PrivateLink for Amazon S3 documentation.

Local Amazon S3 Replication on Outposts
Amazon S3 on Outposts now supports S3 replication on Outposts. This extends S3’s fully managed approach to replication to S3 on Outposts buckets. It helps you meet your data residency and data redundancy requirements. With local S3 Replication on Outposts, you can create and configure replication rules to automatically replicate your S3 objects to another Outpost or to another bucket on the same Outpost. During replication, your S3 on Outposts objects are always sent over your local gateway, and objects do not travel back to the AWS Region. S3 Replication on Outposts provides an easy and flexible way to automatically replicate data within a specific data perimeter to address your data redundancy and compliance requirements.

Amazon OpenSearch Security Analytics
The new Amazon OpenSearch Service’s security analytics capability enables your Security Operations (SecOps) teams to detect potential threats quickly while having the tools to help with security investigations on historical data—all with lower data storage costs. Like many other advanced capabilities of Amazon OpenSearch Service, there is no additional charge for security analytics.

You can learn more about Amazon OpenSearch security analytics by reading this blog post.

Join Us Online Today
You will learn more about these launches and about AWS data services in general. We have also prepared some live demos. We designed the AWS Pi Day event for system administrators, engineers, developers, and architects. Our sessions will bring you the latest and greatest information on storage, security, backup, archiving, training and certification, and more.

And to dive deeper, get Pi Day started early by attending AWS Innovate: Data and AI/ML Edition to learn about cutting-edge machine learning tools, strategies for building future-proof applications, and making data-driven decisions for your organization. Don’t miss Swami Sivasubramanian‘s keynote, starting at 9:00 PDT.

Join us today on the AWS Pi Day live stream. Kevin Miller, VP and GM of Amazon S3, will kick off the event with a keynote at 13:00 PDT.

See you there!

— seb

Week in Review – February 13, 2023

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/week-in-review-february-13-2023/

AWS announced 32 capabilities since we published the last Week in Review blog post a week ago. I also read a couple of other news and blog posts.

Here is my summary.

The VPC section of the AWS Management Console now allows you to visualize your VPC resources, such as the relationships between a VPC and its subnets, routing tables, and gateways. This visualization was available at VPC creation time only, and now you can go back to it using the Resource Map tab in the console. You can read the details in Channy’s blog post.

CloudTrail Lake now gives you the ability to ingest activity events from non-AWS sources. This lets you immutably store and then process activity events without regard to their origin–AWS, on-premises servers, and so forth. All of this power is available to you with a single API call: PutAuditEvents. We launched AWS CloudTrail Lake about a year ago. It is a managed organization-scale data lake that aggregates, immutably stores, and allows querying of events recorded by CloudTrail. You can use it for auditing, security investigation, and troubleshooting. Again, my colleague Channy wrote a post with the details.

There are three new Amazon CloudWatch metrics for asynchronous AWS Lambda function invocations: AsyncEventsReceived, AsyncEventAge, and AsyncEventsDropped. These metrics provide visibility for asynchronous Lambda function invocations. They help you to identify the root cause of processing issues such as throttling, concurrency limit, function errors, processing latency because of retries, or missing events. You can learn more and have access to a sample application in this blog post.

Amazon Simple Notification Service (Amazon SNS) now supports AWS X-Ray to visualize, analyze, and debug applications. Developers can now trace messages going through Amazon SNS, making it easier to understand or debug microservices or serverless applications.

Amazon EC2 Mac instances now support replacing root volumes for quick instance restoration. Stopping and starting EC2 Mac instances trigger a scrubbing workflow that can take up to one hour to complete. Now you can swap the root volume of the instance with an EBS snapshot or an AMI. It helps to reset your instance to a previous known state in 10–15 minutes only. This significantly speeds up your CI and CD pipelines.

Amazon Polly launches two new Japanese NTTS voices. Neural Text To Speech (NTTS) produces the most natural and human-like text-to-speech voices possible. You can try these voices in the Polly section of the AWS Management Console. With this addition, according to my count, you can now choose among 52 NTTS voices in 28 languages or language variants (French from France or from Quebec, for example).

The AWS SDK for Java now includes the AWS CRT HTTP Client. The HTTP client is the center-piece powering our SDKs. Every single AWS API call triggers a network call to our API endpoints. It is therefore important to use a low-footprint and low-latency HTTP client library in our SDKs. AWS created a common HTTP client for all SDKs using the C programming language. We also offer 11 wrappers for 11 programming languages, from C++ to Swift. When you develop in Java, you now have the option to use this common HTTP client. It provides up to 76 percent cold start time reduction on AWS Lambda functions and up to 14 percent less memory usage compared to the Netty-based HTTP client provided by default. My colleague Zoe has more details in her blog post.

X in Y Jeff started this section a while ago to list the expansion of new services and capabilities to additional Regions. I noticed 10 Regional expansions this week:

Other AWS News
This week, I also noticed these AWS news items:

My colleague Mai-Lan shared some impressive customer stories and metrics related to the use and scale of Amazon S3 Glacier. Check it out to learn how to put your cold data to work.

Space is the final (edge) frontier. I read this blog post published on avionweek.com. It explains how AWS helps to deploy AIML models on observation satellites to analyze image quality before sending them to earth, saving up to 40 percent satellite bandwidth. Interestingly, the main cause for unusable satellite images is…clouds.

Upcoming AWS Events
Check your calendars and sign up for these AWS events:

AWS re:Invent recaps in your area. During the re:Invent week, we had lots of new announcements, and in the next weeks, you can find in your area a recap of all these launches. All the events are posted on this site, so check it regularly to find an event nearby.

AWS re:Invent keynotes, leadership sessions, and breakout sessions are available on demand. I recommend that you check the playlists and find the talks about your favorite topics in one collection.

AWS Summits season will restart in Q2 2023. The dates and locations will be announced here. Paris and Sidney are kicking off the season on April 4th. You can register today to attend these in-person, free events (Paris, Sidney).

Stay Informed
That was my selection for this week! To better keep up with all of this news, do not forget to check out the following resources:

— seb
This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Genomics workflows, Part 4: processing archival data

Post Syndicated from Rostislav Markov original https://aws.amazon.com/blogs/architecture/genomics-workflows-part-4-processing-archival-data/

Genomics workflows analyze data at petabyte scale. After processing is complete, data is often archived in cold storage classes. In some cases, like studies on the association of DNA variants against larger datasets, archived data is needed for further processing. This means manually initiating the restoration of each archived object and monitoring the progress. Scientists require a reliable process for on-demand archival data restoration so their workflows do not fail.

In Part 4 of this series, we look into genomics workloads processing data that is archived with Amazon Simple Storage Service (Amazon S3). We design a reliable data restoration process that informs the workflow when data is available so it can proceed. We build on top of the design pattern laid out in Parts 1-3 of this series. We use event-driven and serverless principles to provide the most cost-effective solution.

Use case

Our use case focuses on data in Amazon Simple Storage Service Glacier (Amazon S3 Glacier) storage classes. The S3 Glacier Instant Retrieval storage class provides the lowest-cost storage for long-lived data that is rarely accessed but requires retrieval in milliseconds.

The S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive provide further cost savings, with retrieval times ranging from minutes to hours. We focus on the latter in order to provide the most cost-effective solution.

You must first restore the objects before accessing them. Our genomics workflow will pause until the data restore completes. The requirements for this workflow are:

  • Reliable launch of the restore so our workflow doesn’t fail (due to S3 Glacier service quotas, or because not all objects were restored)
  • Event-driven design to mirror the event-driven nature of genomics workflows and perform the retrieval upon request
  • Cost-effective and easy-to-manage by using serverless services
  • Upfront detection of archived data when formulating the genomics workflow task, avoiding idle computational tasks that incur cost
  • Scalable and elastic to meet the restore needs of large, archived datasets

Solution overview

Genomics workflows take multiple input parameters to prepare the initiation, such as launch ID, data path, workflow endpoint, and workflow steps. We store this data, including workflow configurations, in an S3 bucket. An AWS Fargate task reads from the S3 bucket and prepares the workflow. It detects if the input parameters include S3 Glacier URLs.

We use Amazon Simple Queue Service (Amazon SQS) to decouple S3 Glacier index creation from object restore actions (Figure 1). This increases the reliability of our process.

Solution architecture for S3 Glacier object restore

Figure 1. Solution architecture for S3 Glacier object restore

An AWS Lambda function creates the index of all objects in the specified S3 bucket URLs and submits them as an SQS message.

Another Lambda function polls the SQS queue and submits the request(s) to restore the S3 Glacier objects to S3 Standard storage class.

The function writes the job ID of each S3 Glacier restore request to Amazon DynamoDB. After the restore is complete, Lambda sets the status of the workflow to READY. Only then can any computing jobs start, such as with AWS Batch.

Implementation considerations

We consider the use case of Snakemake with Tibanna, which we detailed in Part 2 of this series. This allows us to dive deeper on launch details.

Snakemake is an open-source utility for whole-genome-sequence mapping in directed acyclic graph format. Snakemake uses Snakefiles to declare workflow steps and commands. Tibanna is an open-source, AWS-native software that runs bioinformatics data pipelines. It supports Snakefile syntax, plus other workflow languages, including Common Workflow Language and Workflow Description Language (WDL).

We recommend using Amazon Genomics CLI if Tibanna is not needed for your use case, or Amazon Omics if your workflow definitions are compliant with the supported WDL and Nextflow specifications.

Formulate the restore request

The Snakemake Fargate launch container detects if the S3 objects under the requested S3 bucket URLs are stored in S3 Glacier. The Fargate launch container generates and puts a JSON binary base call (BCL) configuration file into an S3 bucket and exits successfully. This file includes the launch ID of the workflow, corresponding with the DynamoDB item key, plus the S3 URLs to restore.

Query the S3 URLs

Once the JSON BCL configuration file lands in this S3 bucket, the S3 Event Notification PutObject event invokes a Lambda function. This function parses the configuration file and recursively queries for all S3 object URLs to restore.

Initiate the restore

The main Lambda function then submits messages to the SQS queue that contains the full list of S3 URLs that need to be restored. SQS messages also include the launch ID of the workflow. This is to ensure we can bind specific restoration jobs to specific workflow launches. If all S3 Glacier objects belong to Flexible Retrieval storage class, the Lambda function puts the URLs in a single SQS message, enabling restoration with Bulk Glacier Job Tier. The Lambda function also sets the status of the workflow to WAITING in the corresponding DynamoDB item. The WAITING state is used to notify the end user that the job is waiting on the data-restoration process and will continue once the data restoration is complete.

A secondary Lambda function polls for new messages landing in the SQS queue. This Lambda function submits the restoration request(s)—for example, as a free-of-charge Bulk retrieval—using the RestoreObject API. The function subsequently writes the S3 Glacier Job ID of each request in our DynamoDB table. This allows the main Lambda function to check if all Job IDs associated with a workflow launch ID are complete.

Update status

The status of our workflow launch will remain WAITING as long as the Glacier object restore is incomplete. The AWS CloudTrail logs of completed S3 Glacier Job IDs invoke our main Lambda function (via an Amazon EventBridge rule) to update the status of the restoration job in our DynamoDB table. With each invocation, the function checks if all Job IDs associated with a workflow launch ID are complete.

After all objects have been restored, the function updates the workflow launch with status READY. This launches the workflow with the same launch ID prior to the restore.

Conclusion

In this blog post, we demonstrated how life-science research teams can make use of their archival data for genomic studies. We designed an event-driven S3 Glacier restore process, which retrieves data upon request. We discussed how to reliably launch the restore so our workflow doesn’t fail. Also, we determined upfront if an S3 Glacier restore is needed and used the WAITING state to prevent our workflow from failing.

With this solution, life-science research teams can save money using Amazon S3 Glacier without worrying about their day-to-day work or manually administering S3 Glacier object restores.

Related information

Happy 10th Anniversary, Amazon S3 Glacier – A Decade of Cold Storage in the Cloud

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/happy-10th-anniversary-amazon-s3-glacier-a-decade-of-cold-storage-in-the-cloud/

Ten years ago, on August 20, 2012, AWS announced the general availability of Amazon Glacier, secure, reliable, and extremely low-cost storage designed for data archiving and backup. At the time, I was working as an AWS customer and it felt like an April Fools’ joke, offering long-term, secure, and durable cloud storage that allowed me to archive large amounts of data at a very low cost.

In Jeff’s original blog post for this launch, he noted that:

Glacier provides, at a cost as low as $0.01 (one US penny, one one-hundredth of a dollar) per Gigabyte per month, extremely low-cost archive storage. You can store a little bit, or you can store a lot (terabytes, petabytes, and beyond). There’s no upfront fee, and you pay only for the storage that you use. You don’t have to worry about capacity planning, and you will never run out of storage space.

Ten years later, Amazon S3 Glacier has evolved to be the best place in the world for you to store your archive data. The Amazon S3 Glacier storage classes are purpose-built for data archiving, providing you with the highest performance, most retrieval flexibility, and the lowest cost archive storage in the cloud.

You can now choose from three archive storage classes optimized for different access patterns and storage duration – Amazon S3 Glacier Instant Retrieval, Amazon S3 Glacier Flexible Retrieval (formerly Amazon S3 Glacier), and Amazon S3 Glacier Deep Archive. We’ll dive into each of these storage classes in a bit.

A Decade of Innovation in Amazon S3 Glacier
To understand how we got here, we’ll walk through through the last decade and revisit some of the most significant Amazon S3 Glacier launches that fundamentally changed archive storage forever:

August 2012 – Amazon Glacier: Archival Storage for One Penny per GB per Month
We launched Amazon Glacier to store any amount of data with high durability at a cost that allows you to get rid of your tape libraries and all the operational complexity and overhead that have been part of data archiving for decades. Amazon Glacier was modeled on S3’s durability and dependability but designed and built from the ground up to offer an archival storage to you at an extremely low cost. At that time, Glacier introduced the concept of a “vault” for storing archival data. You could then easily retrieve your archival data by initiating a request and then the data was made available to you for download in 3–5 hours.

November 2012 – Archiving Amazon S3 Data to Glacier
While Glacier was purpose-built from the ground up for archival data, many customers had object data that originated in S3 warmer storage that they would eventually want to move to colder storage. To make that easy for customers, Amazon S3’s Lifecycle Management (aka Lifecycle Rule) integrated S3 and Glacier and made the details visible via the storage class of each object. Lifecycle Management allows you to define time-based rules that can start Transition (changing S3 storage class to Glacier) and Expiration (deletion of objects). In 2014, we combined the flexibility of S3 versioned objects with Glacier, helping you to further reduce your overall storage costs.

November 2016 – Glacier Price Reductions and Additional Retrieval Options for Glacier
As part of AWS’s long-term focus on reducing costs and passing along those savings to customers, we reduced the price of Glacier storage to $0.004 (less than half a cent) in the case of 1 GB for 1 month in the US East (N. Virginia) Region, from $0.007 in 2015 and $0.010 in 2012. With storing data at a very low cost but having flexibility in how quickly they can retrieve the data, we introduced two more options for data retrieval that were based on the amount of data that you stored in Glacier and the rate at which you retrieved it. You could select expedited retrieval (typically taking 1–5 minutes), bulk retrieval (5–12 hours), or the existing standard retrieval method (3–5 hours).

November 2018 – Amazon S3 Glacier Storage Class to Integrate S3 Experiences
Glacier customers appreciated the way they could easily move data from S3 to Glacier via S3 lifecycle management, and wanted us to expand on that capability to use the most common S3 APIs to operate directly on S3 Glacier objects. So, we added S3 PUT API to S3 Glacier, which enables you to use the standard S3 PUT API and select any storage class, including S3 Glacier, to store the data. Data can be stored directly in S3 Glacier, eliminating the need to upload to S3 Standard and immediately transition to S3 Glacier with a zero-day lifecycle policy. So, you could PUT to S3 Glacier like any other S3 storage class.

March 2019 – Amazon S3 Glacier Deep Archive – the Lowest Cost Storage in the Cloud
While the original Glacier service offered an extremely low price for archival storage, we challenged ourselves to see if we could find a way to invent an even lower priced storage offering for very cold data. The Amazon S3 Glacier Deep Archive storage class delivers the lowest cost storage, up to 75 percent lower cost (than S3 Glacier Flexible Retrieval), for long-lived archive data that is accessed less than once per year and is retrieved asynchronously. At just $0.00099 per GB-month (or $1 per TB-month), S3 Glacier Deep Archive offers the lowest cost storage in the cloud at prices significantly lower than storing and maintaining data in on-premises tape or archiving data off-site.

November 2020 – Amazon S3 Intelligent-Tiering adds Archive Access and Deep Archive Access tiers
In November 2018, we launched Amazon S3 Intelligent-Tiering, the only cloud storage class that delivers automatic storage cost savings, up to 95 percent when data access patterns change, without performance impact or operational overhead. In order to offer customers the simplicity and flexibility of S3 Intelligent-Tiering and the low storage cost of archival data, we added the Archive Access tier providing the same performance and pricing as the S3 Glacier storage class as well as the Deep Archive Access tier which offers the same performance and pricing as the S3 Glacier Deep Archive storage class.

November 2021 – Amazon S3 Glacier Flexible Retrieval and S3 Glacier Instant Retrieval
The Amazon S3 Glacier storage class was renamed to Amazon S3 Glacier Flexible Retrieval and now includes free bulk retrievals along with an additional 10 percent price reduction across all Regions, making it optimized for use cases such as backup and disaster recovery.

Additionally, customers asked us for a storage solution that had the low costs of Glacier but allowed for fast access when data was needed very quickly. So, we introduced Amazon S3 Glacier Instant Retrieval, a new archive storage class that delivers the lowest cost storage for long-lived data that is rarely accessed and requires milliseconds retrieval. You can save up to 68 percent on storage costs compared to using the S3 Standard-Infrequent Access (S3 Standard-IA) storage class when your data is accessed once per quarter.

The Amazon S3 Intelligent-Tiering storage class also recently added a new Archive Instant Access tier, providing the same performance and pricing as the S3 Glacier Instant Retrieval storage class which delivers automatic 68% cost savings for customers using S3 Intelligent-Tiering with long-lived data.

Then and Now
Customers across all industries and verticals use the S3 Glacier storage classes for every imaginable archival workload. Accessing and using the S3 Glacier storage classes through the S3 APIs and S3 console provides enhanced functionality for data management and cost optimization.

As we discussed above, you can now choose from three archive storage classes optimized for different access patterns and storage duration:

  • S3 Glacier Instant Retrieval – For archive data that needs immediate access, such as medical images, news media assets, or genomics data, choose the S3 Glacier Instant Retrieval storage class, an archive storage class that delivers the lowest cost storage with milliseconds retrieval.
  • S3 Glacier Flexible Retrieval – For archive data that does not require immediate access but needs to have the flexibility to retrieve large sets of data at no cost, such as backup or disaster recovery use cases, choose the S3 Glacier Flexible Retrieval storage class, with retrieval in minutes or free bulk retrievals in 12 hours.
  • S3 Glacier Deep Archive – For retaining data for 7–10 years or longer to meet customer needs and regulatory compliance requirements, such as financial services, healthcare, media and entertainment, and public sector, choose the S3 Glacier Deep Archive storage class, the lowest cost storage in the cloud with data retrieval within 12–48 hours.

Watch a brief introduction video for an overview of the S3 Glacier storage classes.

All S3 Glacier storage classes are designed for 99.999999999% (11 9s) of durability for objects. Data is redundantly stored across three or more Availability Zones that are physically separated within an AWS Region. Here are some comparisons across the S3 Glacier storage classes at a glance:

Performances S3 Glacier
Instant Retrieval
S3 Glacier
Flexible Retrieval
S3 Glacier
Deep Archive
Availability 99.9% 99.99% 99.99%
Availability SLA 99% 99.9% 99.9%
Minimum capacity charge per object 128 KB 40 KB 40 KB
Minimum storage duration charge 90 days 90 days 180 days
Retrieval charge per GB per GB per GB
Retrieval time milliseconds Expedited (1–5 minutes),
Standard (3–5 hours),
Bulk (5–12 hours) free
Standard (within 12 hours),
Bulk (within 48 hours)

For data with changing access patterns that you want to automatically archive based on the last access of that data, choose the S3 Intelligent-Tiering storage class. Doing so will optimize storage costs by automatically moving data to the most cost-effective access tier when access patterns change. Its Archive Instant Access, Archive Access, and Deep Archive Access tiers have the same performance as S3 Glacier Instant Retrieval, S3 Glacier Flexible Retrieval, and S3 Glacier Deep Archive respectively. To learn more, see the blog post Automatically archive and restore data with Amazon S3 Intelligent-Tiering.

To get started with S3 Glacier, see the blog post Best practices for archiving large datasets with AWS for key considerations and actions when planning your cold data storage patterns. You can also use a hands-on lab tutorial that will help you get started with the S3 Glacier storage classes in just 20 minutes, and start archiving your data in the S3 Glacier storage classes in the S3 console.

Happy Birthday, Amazon S3 Glacier!
During the last AWS Storage Day 2022, Kevin Miller, VP & GM of Amazon S3, mentioned the 10th anniversary of S3 Glacier and its pace of innovation for many customer use cases throughout his interview with theCUBE.

In this expanding world of data growth, you have to have an archiving strategy. Everyone has archival data — every company, every vertical, and every industry. There is an archiving need not only for companies that have been around for a while but also for digital native businesses.

Lots of AWS customers such as Nasdaq, Electronic Arts, and NASCAR have used S3 Glacier storage classes for their backup and archiving workloads. The following are some additional recent customer-authored blogs focusing on AWS archiving best practices from customers in the financial, media, gaming, and software industries.

A big thank you to all of our S3 Glacier customers from around the world! Over 90 percent of S3’s roadmap has come directly from feedback from customers like you. We will never stop listening to you, as your feedback and ideas are essential to how we improve the service. Thank you for trusting us and for constantly raising the bar and pushing us to improve to lower costs, simplify your storage, increase your agility, and allow you to innovate faster.

In accordance with Customer Obsession, one of the Amazon Leadership Principles, your feedback is always welcome! If you want to see new S3 Glacier features and capabilities, please send any feedback to AWS re:Post for S3 Glacier or through your usual AWS Support contacts.

– Channy

Orchestrating Amazon S3 Glacier Deep Archive object retrieval using AWS Step Functions

Post Syndicated from Eric Johnson original https://aws.amazon.com/blogs/compute/orchestrating-amazon-s3-glacier-deep-archive-object-retrieval-using-aws-step-functions/

This blog was written by Monica Cortes Sack, Solutions Architect, Oskar Neumann, Partner Solutions Architect, and Dhiraj Mahapatro, Principal Specialist SA, Serverless.

AWS Step Functions now support over 220 services and over 10,000 AWS API actions. This enables you to use the AWS SDK integration directly instead of writing an AWS Lambda function as a proxy.

One such service integration is with Amazon S3. Currently, you write scripts using AWS CLI S3 commands to achieve automation around running S3 tasks. For example, S3 integrates with AWS Transfer Family, builds a custom security check, takes action on S3 buckets on S3 object creations, or orchestrates a workflow around S3 Glacier Deep Archive object retrieval. These script executions do not provide an execution history or an easy way to validate the behavior.

Step Functions’ AWS SDK integration with S3 declaratively creates serverless workflows around S3 tasks without relying on those scripts. You can validate the execution history and behavior of a Step Functions workflow.

This blog highlights one of the S3 use cases. It shows how to orchestrate workflows around S3 Glacier Deep Archive object retrieval, cost estimation, and interaction with the requester using Step Functions. The demo application provides additional details on the entire architecture.

S3 Glacier Deep Archive is a storage class in S3 used for data that is rarely accessed. The service provides durable and secure long-term storage, trading immediate access for cost-effectiveness. You must restore archived objects before they are downloadable. It supports two options for object retrieval:

  1. Standard – Access objects within 12 hours of the start of the restoration process.
  2. Bulk – Access objects within 48 hours of the start of the restoration process.

Business use case

Consider a research institute that stores backups on S3 Glacier Deep Archive. The backups are maintained in S3 Glacier Deep Archive for redundancy. The institute has multiple researchers with one central IT team. When a researcher requests an object from S3 Glacier Deep Archive, the central IT team retrieves it and charges the corresponding research group for retrieval and data transfer costs.

Researchers are the end users and do not operate in the AWS Cloud. They run computing clusters on-premises and depend on the central IT team to provide them with the restored archive. A member of the research team requesting an object retrieval provides the following information to the central IT team:

  1. Object key to be restored.
  2. The number of days the researcher needs the object accessible for download.
  3. Researcher’s email address.
  4. Retrieve within 12 or 48 hours SLA. This determines whether “Standard” or “Bulk” retrieval respectively.

The following overall architecture explains the setup on AWS and the interaction between a researcher and the central IT team’s architecture.

Architecture overview

Architecture diagram

Architecture diagram

  1. The researcher uses a front-end application to request object retrieval from S3 Glacier Deep Archive.
  2. Amazon API Gateway synchronously invokes AWS Step Functions Express Workflow.
  3. Step Functions initiates RestoreObject from S3 Glacier Deep Archive.
  4. Step Functions stores the metadata of this retrieval in an Amazon DynamoDB table.
  5. Step Functions uses Amazon SES to email the researcher about archive retrieval initiation.
  6. Upon completion, S3 sends the RestoreComplete event to Amazon EventBridge.
  7. EventBridge rule triggers another Step Functions for post-processing after the restore is complete.
  8. A Lambda function inside the Step Functions calculates the estimated cost (retrieval and data transfer out) and updates existing metadata in the DynamoDB table.
  9. Sync data from DynamoDB table using Amazon Athena Federated Queries to generate reports dashboard in Amazon QuickSight.
  10. Step Functions uses SES to email the researcher with cost details.
  11. Once the researcher receives an email, the researcher uses the front-end application to call the /download API endpoint.
  12. API Gateway invokes a Lambda function that generates a pre-signed S3 URL of the retrieved object and returns it in the response.

Setup

  1. To run the sample application, you must install CDK v2, Node.js, and npm.
  2. To clone the repository, run:
    git clone https://github.com/aws-samples/aws-stepfunctions-examples.git
    cd cdk/app-glacier-deep-archive-retrieval
  3. To deploy the application, run:
    cdk deploy --all

Identifying workflow components

Starting the restore object workflow

The first component is accepting the researcher’s request to start the archive retrieval process. The sample application created from the demo provides a basic front-end app that shows the files from an S3 bucket that has objects stored in S3 Glacier Deep Archive. The researcher retrieves file requests from the front-end application reached by the sample application’s Amazon CloudFront URL.

Glacier Deep Archive Retrieval menu

Glacier Deep Archive Retrieval menu

The front-end app asks the researcher for an email address, the number of days the researcher wants the object to be available for download, and their ETA on retrieval speed. Based on the retrieval speed, the researcher accepts either Standard or Bulk object retrieval. To test this, put objects in the data bucket under the S3 Glacier Deep Archive storage class and use the front-end application to retrieve them.

Item retrieval prompt

Item retrieval prompt

The researcher then chooses the Retrieve file. This action invokes an API endpoint provided by API Gateway. The API Gateway endpoint synchronously invokes a Step Functions Express Workflow. This validates the restore object request, gets the object metadata, and starts to restore the object from S3 Glacier Deep Archive.

The state machine stores the metadata of the restore object AWS SDK call in a DynamoDB table for later use. You can use this metadata to build a dashboard in Amazon QuickSight for reporting and administration purposes. Finally, the state machine uses Amazon SES to email the researcher, notifying them about the restore object initiation process:

Restore object initiated

Restore object initiated

The following state machine shows the workflow:

Workflow diagram

Workflow diagram

The ability to use S3 APIs declaratively using AWS SDK from Step Functions makes it convenient to integrate with S3. This approach avoids writing a Lambda function to wrap the SDK calls. The following portion of the state machine definition shows the usage of S3 HeadObject and RestoreObject APIs:

"Get Object Metadata": {
    "Next": "Initiate Restore Object from Deep Archive",
    "Catch": [{
        "ErrorEquals": ["States.ALL"],
        "Next": "Bad Request"
    }],
    "Type": "Task",
    "ResultPath": "$.result.metadata",
    "Resource": "arn:aws:states:::aws-sdk:s3:headObject",
    "Parameters": {
        "Bucket": "glacierretrievalapp-databucket-abc123",
        "Key.$": "$.fileKey"
    }
}, 
"Initiate Restore Object from Deep Archive": {
    "Next": "Update restore operation metadata",
    "Type": "Task",
    "ResultPath": null,
    "Resource": "arn:aws:states:::aws-sdk:s3:restoreObject",
    "Parameters": {
        "Bucket": "glacierretrievalapp-databucket-abc123",
        "Key.$": "$.fileKey",
        "RestoreRequest": {
            "Days.$": "$.requestedForDays"
        }
    }
}

You can extend the previous workflow and build your own Step Functions workflows to orchestrate other S3 related workflows.

Processing after object restoration completion

S3 RestoreObject is a long-running process for S3 Glacier Deep Archive objects. S3 emits a RestoreCompleted event notification on the object restore completion to EventBridge. You set up an EventBridge rule to trigger another Step Functions workflow as a target for this event. This workflow takes care of the object restoration post-processing.

cfnDataBucket.addPropertyOverride('NotificationConfiguration.EventBridgeConfiguration.EventBridgeEnabled', true);

An EventBridge rule triggers the following Step Functions workflow and passes the event payload as an input to the Step Functions execution:

new aws_events.Rule(this, 'invoke-post-processing-rule', {
  eventPattern: {
    source: ["aws.s3"],
    detailType: [
      "Object Restore Completed"
    ],
    detail: {
      bucket: {
        name: [props.dataBucket.bucketName]
      }
    }
  },
  targets: [new aws_events_targets.SfnStateMachine(this.stateMachine, {
    input: aws_events.RuleTargetInput.fromObject({
      's3Event': aws_events.EventField.fromPath('$')
    })
  })]
});

The Step Functions workflow gets object metadata from the DynamoDB table and then invokes a Lambda function to calculate the estimated cost. The Lambda function calculates the estimated retrieval and the data transfer costs using the contentLength of the retrieved object and the Price List API for the unit cost. The workflow then updates the calculated cost in the DynamoDB table.

The retrieval cost and the data transfer out cost are proportional to the size of the retrieved object. The Step Functions workflow also invokes a Lambda function to create the download API URL for object retrieval. Finally, it emails the researcher with the estimated cost and the download URL as a restoration completion notification.

Workflow studio diagram

Workflow studio diagram

The email notification to the researcher looks like:

Email example

Email example

Downloading the restored object

Once the object restoration is complete, the researcher can download the object from the front-end application.

Front end retrieval menu

Front end retrieval menu

The researcher chooses the Download action, which invokes another API Gateway endpoint. The endpoint integrates with a Lambda function as a backend that creates a pre-signed S3 URL sent as a response to the browser.

Administering object restoration usage

This architecture also provides a view for the central IT team to understand object restoration usage. You achieve this by creating reports and dashboards from the metadata stored in DynamoDB.

The sample application uses Amazon Athena Federated Queries and Amazon Athena DynamoDB Connector to generate a reports dashboard in Amazon QuickSight. You can also use Step Functions AWS SDK integration with Amazon Athena and visualize the workflows in the Athena console.

The following QuickSight visualization shows the count of restored S3 Glacier Deep Archive objects by their contentType:

QuickSite visualization

QuickSight visualization

Considerations

With the preceding approach, you should consider that:

  1. You must start the object retrieval in the same Region as the Region of the archived object.
  2. S3 Glacier Deep Archive only supports standard and bulk retrievals.
  3. You must enable the “Object Restore Completed” event notification on the S3 bucket with the S3 Glacier Deep Archive object.
  4. The researcher’s email must be verified in SES.
  5. Use a Lambda function for the Price List GetProducts API as the service endpoints are available in specific Regions.

Cleanup

To clean up the infrastructure used in this sample application, run:

cdk destroy --all

Conclusion

Step Functions’ AWS SDK integration opens up different opportunities to orchestrate a workflow. Step Functions provides native support for retries and error handling, which offloads the heavy lifting of handling them manually in scripts.

This blog shows one example use case with S3 Glacier Deep Archive. With AWS SDK integration in Step Functions, you can build any workflow orchestration using S3 or S3 control APIs. For example, a workflow to enforce AWS Key Management Service encryption based on an S3 event, or create static website hosting on-demand in a few steps.

With different S3 API calls available in Step Functions’ Workflow Studio, you can declaratively build a Step Functions workflow instead of imperatively calling each S3 API from a shell script or command line. Refer to the demo application for more details.

For more serverless learning resources, visit Serverless Land.

Welcome to AWS Pi Day 2022

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/welcome-to-aws-pi-day-2022/

We launched Amazon Simple Storage Service (Amazon S3) sixteen years ago today!

As I often told my audiences in the early days, I wanted them to think big thoughts and dream big dreams! Looking back, I think it is safe to say that the launch of S3 empowered them to do just that, and initiated a wave of innovation that continues to this day.

Bigger, Busier, and more Cost-Effective
Our customers count on Amazon S3 to provide them with reliable and highly durable object storage that scales to meet their needs, while growing more and more cost-effective over time. We’ve met those needs and many others; here are some new metrics that prove my point:

Object Storage – Amazon S3 now holds more than 200 trillion (2 x 1014) objects. That’s almost 29,000 objects for each resident of planet Earth. Counting at one object per second, it would take 6.342 million years to reach this number! According to Ethan Siegel, there are about 2 trillion galaxies in the visible Universe, so that’s 100 objects per galaxy! Shortly after the 2006 launch of S3, I was happy to announce the then-impressive metric of 800 million stored objects, so the object count has grown by a factor of 250,000 in less than 16 years.

Request Rate – Amazon S3 now averages over 100 million requests per second.

Cost Effective – Over time we have added multiple storage classes to S3 in order to optimize cost and performance for many different workloads. For example, AWS customers are making great use of Amazon S3 Intelligent Tiering (the only cloud storage class that delivers automatic storage cost savings when data access patterns change), and have saved more than $250 million in storage costs as compared to Amazon S3 Standard. When I first wrote about this storage class in 2018, I said:

In order to make it easier for you to take advantage of S3 without having to develop a deep understanding of your access patterns, we are launching a new storage class, S3 Intelligent-Tiering.

With the improved cost optimizations for small and short-lived objects and the archiving capabilities that we launched late last year, you can now use S3 Intelligent-Tiering as the default storage class for just about every workload, especially data lakes, analytics use cases, and new applications.

Customer Innovation
As you can see from the metrics above, our customers use S3 to store and protect vast amounts of data in support of an equally vast number of use cases and applications. Here are just a few of the ways that our customers are innovating:

NASCARAfter spending 15 years collecting video, image, and audio assets representing over 70 years of motor sports history, NASCAR built a media library that encompassed over 8,600 LTO 6 tapes and a few thousand LTO 4 tapes, with a growth rate of between 1.5 PB and 2 PB per year. Over the course of 18 months they migrated all of this content (a total of 15 PB) to AWS, making use of the Amazon S3 Standard, Amazon S3 Glacier Flexible Retrieval, and Amazon S3 Glacier Deep Archive storage classes. To learn more about how they migrated this massive and invaluable archive, read Modernizing NASCAR’s multi-PB media archive at speed with AWS Storage.

Electronic Arts
This game maker’s core telemetry systems handle tens of petabytes of data, tens of thousands of tables, and over 2 billion objects. As their games became more popular and the volume of data grew, they were facing challenges around data growth, cost management, retention, and data usage. In a series of updates, they moved archival data to Amazon S3 Glacier Deep Archive, implemented tag-driven retention management, and implemented Amazon S3 Intelligent-Tiering. They have reduced their costs and made their data assets more accessible; read
Electronic Arts optimizes storage costs and operations using Amazon S3 Intelligent-Tiering and S3 Glacier to learn more.

NRGene / CRISPR-IL
This team came together to build a best-in-class gene-editing prediction platform. CRISPR (
A Crack In Creation is a great introduction) is a very new and very precise way to edit genes and effect changes to an organism’s genetic makeup. The CRISPR-IL consortium is built around an iterative learning process that allows researchers to send results to a predictive engine that helps to shape the next round of experiments. As described in
A gene-editing prediction engine with iterative learning cycles built on AWS, the team identified five key challenges and then used AWS to build GoGenome, a web service that performs predictions and delivers the results to users. GoGenome stores over 20 terabytes of raw sequencing data, and hundreds of millions of feature vectors, making use of Amazon S3 and other
AWS storage services as the foundation of their data lake.

Some other cool recent S3 success stories include Liberty Mutual (How Liberty Mutual built a highly scalable and cost-effective document management solution), Discovery (Discovery Accelerates Innovation, Cuts Linear Playout Infrastructure Costs by 61% on AWS), and Pinterest (How Pinterest worked with AWS to create a new way to manage data access).

Join Us Online Today
In celebration of AWS Pi Day 2022 we have put together an entire day of educational sessions, live demos, and even a launch or two. We will also take a look at some of the newest S3 launches including Amazon S3 Glacier Instant Retrieval, Amazon S3 Batch Replication and AWS Backup Support for Amazon S3.

Designed for system administrators, engineers, developers, and architects, our sessions will bring you the latest and greatest information on security, backup, archiving, certification, and more. Join us at 9:30 AM PT on Twitch for Kevin Miller’s kickoff keynote, and stick around for the entire day to learn a lot more about how you can put Amazon S3 to use in your applications. See you there!

Jeff;

New – Additional Checksum Algorithms for Amazon S3

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-additional-checksum-algorithms-for-amazon-s3/

Amazon Simple Storage Service (Amazon S3) is designed to provide 99.999999999% (11 9s) of durability for your objects and for the metadata associated with your objects. You can rest assured that S3 stores exactly what you PUT, and returns exactly what is stored when you GET. In order to make sure that the object is transmitted back-and-forth properly, S3 uses checksums, basically a kind of digital fingerprint.

S3’s PutObject function already allows you to pass the MD5 checksum of the object, and only accepts the operation if the value that you supply matches the one computed by S3. While this allows S3 to detect data transmission errors, it does mean that you need to compute the checksum before you call PutObject or after you call GetObject. Further, computing checksums for large (multi-GB or even multi-TB) objects can be computationally intensive, and can lead to bottlenecks. In fact, some large S3 users have built special-purpose EC2 fleets solely to compute and validate checksums.

New Checksum Support
Today I am happy to tell you about S3’s new support for four checksum algorithms. It is now very easy for you to calculate and store checksums for data stored in Amazon S3 and to use the checksums to check the integrity of your upload and download requests. You can use this new feature to implement the digital preservation best practices and controls that are specific to your industry. In particular, you can specify the use of any one of four widely used checksum algorithms (SHA-1, SHA-256, CRC-32, and CRC-32C) when you upload each of your objects to S3.

Here are the principal aspects of this new feature:

Object Upload – The newest versions of the AWS SDKs compute the specified checksum as part of the upload, and include it in an HTTP trailer at the conclusion of the upload. You also have the option to supply a precomputed checksum. Either way, S3 will verify the checksum and accept the operation if the value in the request matches the one computed by S3. In combination with the use of HTTP trailers, this feature can greatly accelerate client-side integrity checking.

Multipart Object Upload – The AWS SDKs now take advantage of client-side parallelism and compute checksums for each part of a multipart upload. The checksums for all of the parts are themselves checksummed and this checksum-of-checksums is transmitted to S3 when the upload is finalized.

Checksum Storage & Persistence – The verified checksum, along with the specified algorithm, are stored as part of the object’s metadata. If Server-Side Encryption with KMS Keys is requested for the object, then the checksum is stored in encrypted form. The algorithm and the checksum stick to the object throughout its lifetime, even if it changes storage classes or is superseded by a newer version. They are also transferred as part of S3 Replication.

Checksum Retrieval – The new GetObjectAttributes function returns the checksum for the object and (if applicable) for each part.

Checksums in Action
You can access this feature from the AWS Command Line Interface (CLI), AWS SDKs, or the S3 Console. In the console, I enable the Additional Checksums option when I prepare to upload an object:

Then I choose a Checksum function:

If I have already computed the checksum I can enter it, otherwise the console will compute it.

After the upload is complete I can view the object’s properties to see the checksum:

The checksum function for each object is also listed in the S3 Inventory Report.

From my own code, the SDK can compute the checksum for me:

with open(file_path, 'rb') as file:
    r = s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=file,
        ChecksumAlgorithm='sha1'
    )

Or I can compute the checksum myself and pass it to put_object:

with open(file_path, 'rb') as file:
    r = s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=file,
        ChecksumSHA1='fUM9R+mPkIokxBJK7zU5QfeAHSy='
    )

When I retrieve the object, I specify checksum mode to indicate that I want the returned object validated:

r = s3.get_object(Bucket=bucket, Key=key, ChecksumMode='ENABLED')

The actual validation happens when I read the object from r['Body'], and an exception will be raised if there’s a mismatch.

Watch the Demo
Here’s a demo (first shown at re:Invent 2021) of this new feature in action:

Available Now
The four additional checksums are now available in all commercial AWS Regions and you can start using them today at no extra charge.

Jeff;

New – Offline Tape Migration Using AWS Snowball Edge

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-offline-tape-migration-using-aws-snowball-edge/

Over the years, we have given you a succession of increasingly powerful tools to help you migrate your data to the AWS Cloud. Starting with AWS Import/Export back in 2009, followed by Snowball in 2015, Snowmobile and Snowball Edge in 2016, and Snowcone in 2020, each new device has given you additional features to simplify and expedite the migration process. All of the devices are designed to operate in environments that suffer from network constraints such as limited bandwidth, high connections costs, or high latency.

Offline Tape Migration
Today, we are taking another step forward by making it easier for you to migrate data stored offline on physical tapes. You can get rid of your large and expensive storage facility, send your tape robots out to pasture, and eliminate all of the time & effort involved in moving archived data to new formats and mediums every few years, all while retaining your existing tape-centric backup & recovery utilities and workflows.

This launch brings a tape migration capability to AWS Snowball Edge devices, and allows you to migrate up to 80 TB of data per device, making it suitable for your petabyte-scale migration efforts. Tapes can be stored in the Amazon S3 Glacier Flexible Retrieval or Amazon S3 Glacier Deep Archive storage classes, and then accessed from on-premises and cloud-based backup and recovery utilities.

Back in 2013 I showed you how to Create a Virtual Tape Library Using the AWS Storage Gateway. Today’s launch builds on that capability in two different ways. First, you create a Virtual Tape Library (VTL) on a Snowball Edge and copy your physical tapes to it. Second, after your tapes are in the cloud, you create a VTL on a Storage Gateway and use it to access your virtual tapes.

Getting Started
To get started, I open the Snow Family Console and create a new job. Then I select Import virtual tapes into AWS Storage Gateway and click Next:

Then I go through the remainder of the ordering sequence (enter my shipping address, name my job, choose a KMS key, and set up notification preferences), and place my order. I can track the status of the job in the console:

When my device arrives I tell the somewhat perplexed delivery person about data transfer, carry it down to my basement office, and ask Luna to check it out:

Back in the Snow Family console, I download the manifest file and copy the unlock code:

I connect the Snowball Edge to my “corporate” network:

Then I install AWS OpsHub for Snow Family on my laptop, power on the Snowball Edge, and wait for it to obtain & display an IP address:

I launch OpsHub, sign in, and accept the default name for my device:

I confirm that OpsHub has access to my device, and that the device is unlocked:

I view the list of services running on the device, and note that Tape Gateway is not running:

Before I start Tape Gateway, I create a Virtual Network Interface (VNI):

And then I start the Tape Gateway service on the Snow device:

Now that the service is running on the device, I am ready to create the Storage Gateway. I click Open Storage Gateway console from within OpsHub:

I select Snowball Edge as my host platform:

Then I give my gateway a name (MyTapeGateway), select my backup application (Veeam Backup & Replication in this case), and click Activate Gateway:

Then I configure CloudWatch logging:

And finally, I review the settings and click Finish to activate my new gateway:

The activation process takes a few minutes, just enough time to take Luna for a quick walk. When I return, the console shows that the gateway is activated and running, and I am all set:

Creating Tapes
The next step is to create some virtual tapes. I click Create tapes and enter the requested information, including the pool (Deep Archive or Glacier), and click Create tapes:

The next step is to copy data from my physical tapes to the Snowball Edge. I don’t have a data center and I don’t have any tapes, so I can’t show you how to do this part. The data is stored on the device, and my Internet connection is used only for management traffic between the Snowball Edge device and AWS. To learn more about this part of the process, check out our new animated explainer.

After I have copied the desired tapes to the device, I prepare it for shipment to AWS. I make sure that all of the virtual tapes in the Storage Gateway Console have the status In Transit to VTS (Virtual Tape Shelf), and then I power down the device.

The display on the device updates to show the shipping address, and I wait for the shipping company to pick up the device.

When the device arrives at AWS, the virtual tapes are imported, stored in the S3 storage class associated with the pool that I chose earlier, and can be accessed by retrieving them using an online tape gateway. The gateway can be deployed as a virtual machine or a hardware appliance.

Now Available
You can use AWS Snowball Edge for offline tape migration in the US East (N. Virginia), US East (Ohio), US West (Oregon), US West (N. California), Europe (Ireland), Europe (Frankfurt), Europe (London), Asia Pacific (Sydney) Regions. Start migrating petabytes of your physical tape data to AWS, today!

Jeff;

Amazon S3 Glacier is the Best Place to Archive Your Data – Introducing the S3 Glacier Instant Retrieval Storage Class

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/aws/amazon-s3-glacier-is-the-best-place-to-archive-your-data-introducing-the-s3-glacier-instant-retrieval-storage-class/

Today we are announcing the Amazon S3 Glacier Instant Retrieval storage class. This new archive storage class delivers the lowest cost storage for long-lived data that is rarely accessed and requires millisecond retrieval.

We are also excited to announce that S3 Intelligent-Tiering now automatically optimizes storage costs for rarely accessed data that needs immediate retrieval with the new Archive Instant Access tier, which is ideal for data with unknown or changing access patterns. For existing customers, this will provide an immediate savings of 68 percent for data that hasn’t been accessed for more than 90 days, with no action needed. The Frequent, Infrequent, and now Archive Instant Access tiers are designed for the same milliseconds access time and high-throughput performance.

In addition, we are announcing the new name for the existing Amazon S3 Glacier storage class and several price reductions.

Amazon S3 Glacier Instant Retrieval
The Amazon S3 Glacier storage classes are extremely low-cost and built for data archiving. They are secure and durable, and they are designed to provide the lowest cost for data that does not require immediate access, with retrieval options from minutes to hours.

Many customers need to store rarely accessed data for several years. However the data must be highly available and immediately accessible. Today, these customers use the S3 Standard-Infrequent Access (S3 Standard-IA) storage class. This storage class offers low cost for storage and allows customers to retrieve their data instantly.

S3 Glacier Instant Retrieval is a new storage class that delivers the fastest access to archive storage, with the same low latency and high-throughput performance as the S3 Standard and S3 Standard-IA storage classes. You can save up to 68 percent on storage costs as compared with using the S3 Standard-IA storage class when you use the S3 Glacier Instant Retrieval storage class and pay a low price to retrieve data. For example, in the US East (N. Virginia) Region, S3 Glacier Instant Retrieval storage pricing is $0.004 per GB-month and data retrieval is $0.03 per GB. Learn more about pricing for your Region.

Media archives, medical images, or user-generated content are just a few examples of ideal use cases for S3 Glacier Instant Retrieval. Once created, this content is rarely accessed, but when it is needed it must be available in milliseconds.

To get started using the new storage class from the Amazon S3 console, upload an object as you would normally, and select the S3 Glacier Instant Retrieval storage class.

Upload object with the new storage class

This feature is available programmatically from AWS SDKs, AWS Command Line Interface (CLI), and AWS CloudFormation.

In my opinion, the easiest way to store data in S3 Glacier Instant Retrieval is to use the S3 PUT API using the CLI. When using this API, set the storage class to GLACIER_IR.

aws s3api put-object --bucket <bucket-name> --key <object-key> --body <name-file> --storage-class GLACIER_IR

When the object is uploaded to Amazon S3, verify the storage class in the list of objects or on the object details page.

Storage classes

For data that already exists in Amazon S3, you can use S3 Lifecycle to transition data from the S3 Standard and S3 Standard-IA storage classes into S3 Glacier Instant Retrieval.

New Archive Instant Access Tier in S3 Intelligent-Tiering
S3 Intelligent-Tiering is a storage class that automatically moves objects between access tiers to optimize costs. This is the recommended storage class for data with unpredictable or changing access patterns, such as in data lakes, analytics, or user-generated content.

Until today, there were two low latency access tiers optimized for frequent and infrequent access, and two optional archive access tiers designed for asynchronous access optimized for rare access at a low cost.

Beginning today, the Archive Instant Access tier is added as a new access tier in the S3 Intelligent-Tiering storage class. You will start seeing automatic costs savings for your storage in S3 Intelligent-Tiering for rarely accessed objects.

The Archive Instant Access tier joins the group of low latency access tiers. This new tier is optimized for data that is not accessed for months at a time but, when it is needed, is available within milliseconds.

S3 Intelligent-Tiering automatically stores objects in three access tiers that deliver the same performance as the S3 Standard storage class:

  • Frequent Access tier
  • Infrequent Access tier
  • Archive Instant Access (new)

For a small monitoring and automation charge, S3 Intelligent-Tiering monitors access patterns and moves objects between the different access tiers. Objects that have not been accessed for 30 consecutive days are moved from the Frequent Access tier to the Infrequent Access tier for savings of 40 percent. When an object hasn’t been accessed for 90 consecutive days, S3 Intelligent-Tiering will move the object from the Infrequent Access tier to the Archive Instant Access tier, with a savings of 68 percent. If the data is accessed later, it is automatically moved back to the Frequent Access tier. No tiering charges apply when objects are moved between access tiers within the S3 Intelligent-Tiering storage class.

S3 Intelligent-Tiering access tiers

To get started with this new access tier, select Intelligent-Tiering as the storage class for an object when uploading an object using the S3 console. After 90 days of inactivity (30 days in Frequent Access tier and 60 days in Infrequent Access tier), S3 Intelligent-Tiering will automatically move the object to the Archive Instant Access tier. The introduction of the new Archive Instant Access tier has no impact on performance when you retrieve objects.

New name for the Amazon S3 Glacier storage class – S3 Glacier Flexible Retrieval
The existing Amazon S3 Glacier storage class is now named S3 Glacier Flexible Retrieval. This storage class now has free bulk retrievals in 5 to 12 hours, and the storage price has been reduced by 10 percent in all Regions, effective December 1, 2021. S3 Glacier Flexible Retrieval is now even more cost-effective, and the free bulk retrievals make it ideal for retrieving large data volumes.

These are the Amazon S3 archive storage classes:

  • S3 Glacier Instant Retrieval: The newest storage class is optimized for long-lived data that is rarely accessed (typically once per quarter). However when data is needed, it is available within milliseconds. For example, medical images and news media assets are perfect for this storage class.
  • S3 Glacier Flexible Retrieval: This newly renamed storage class is optimized for archiving data that can be retrieved in minutes or with free bulk retrievals in 5 to 12 hours. This storage class is ideal for backups and disaster recovery use cases, where you have large amounts of long-term, rarely accessed data, and you don’t want to worry about retrieval costs when you need the data.
  • S3 Glacier Deep Archive: This storage class is the lowest-cost storage in the cloud and is optimized for archiving data that can be restored in at least 12 hours. It’s great for storing your compliance archives or for digital media preservation.

Amazon S3 has reduced storage prices!
We are excited to announce that Amazon S3 has reduced storage prices of up to 31 percent in the S3 Standard-IA and S3 One Zone-IA storage classes across 9 AWS Regions: US West (N. California), Asia Pacific (Hong Kong), Asia Pacific (Mumbai), Asia Pacific (Osaka), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), and South America (São Paulo). These price reductions are effective December 1, 2021.

Learn more about price reduction details.

Available Now
The new storage class, S3 Glacier Instant Retrieval, and the new Archive Instant Access tier in S3 Intelligent-Tiering are available today (November 30, 2021) in all AWS Regions.

The price cut for S3 Glacier and free bulk retrievals in all AWS Regions, and the S3 Standard-Infrequent Access/One Zone-Infrequent storage class in nine Regions will be effective on December 1, 2021.

Learn more about the storage classes changes and all the storage classes.

Marcia

Building well-architected serverless applications: Optimizing application costs

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/building-well-architected-serverless-applications-optimizing-application-costs/

This series of blog posts uses the AWS Well-Architected Tool with the Serverless Lens to help customers build and operate applications using best practices. In each post, I address the serverless-specific questions identified by the Serverless Lens along with the recommended best practices. See the introduction post for a table of contents and explanation of the example application.

COST 1. How do you optimize your serverless application costs?

Design, implement, and optimize your application to maximize value. Asynchronous design patterns and performance practices ensure efficient resource use and directly impact the value per business transaction. By optimizing your serverless application performance and its code patterns, you can directly impact the value it provides, while making more efficient use of resources.

Serverless architectures are easier to manage in terms of correct resource allocation compared to traditional architectures. Due to its pay-per-value pricing model and scale based on demand, a serverless approach effectively reduces the capacity planning effort. As covered in the operational excellence and performance pillars, optimizing your serverless application has a direct impact on the value it produces and its cost. For general serverless optimization guidance, see the AWS re:Invent talks, “Optimizing your Serverless applications” Part 1 and Part 2, and “Serverless architectural patterns and best practices”.

Required practice: Minimize external calls and function code initialization

AWS Lambda functions may call other managed services and third-party APIs. Functions may also use application dependencies that may not be suitable for ephemeral environments. Understanding and controlling what your function accesses while it runs can have a direct impact on value provided per invocation.

Review code initialization

I explain the Lambda initialization process with cold and warm starts in “Optimizing application performance – part 1”. Lambda reports the time it takes to initialize application code in Amazon CloudWatch Logs. As Lambda functions are billed by request and duration, you can use this to track costs and performance. Consider reviewing your application code and its dependencies to improve the overall execution time to maximize value.

You can take advantage of Lambda execution environment reuse to make external calls to resources and use the results for subsequent invocations. Use TTL mechanisms inside your function handler code. This ensures that you can prevent additional external calls that incur additional execution time, while preemptively fetching data that isn’t stale.

Review third-party application deployments and permissions

When using Lambda layers or applications provisioned by AWS Serverless Application Repository, be sure to understand any associated charges that these may incur. When deploying functions packaged as container images, understand the charges for storing images in Amazon Elastic Container Registry (ECR).

Ensure that your Lambda function only has access to what its application code needs. Regularly review that your function has a predicted usage pattern so you can factor in the cost of other services, such as Amazon S3 and Amazon DynamoDB.

Required practice: Optimize logging output and its retention

Considering reviewing your application logging level. Ensure that logging output and log retention are appropriately set to your operational needs to prevent unnecessary logging and data retention. This helps you have the minimum of log retention to investigate operational and performance inquiries when necessary.

Emit and capture only what is necessary to understand and operate your component as intended.

With Lambda, any standard output statements are sent to CloudWatch Logs. Capture and emit business and operational events that are necessary to help you understand your function, its integration, and its interactions. Use a logging framework and environment variables to dynamically set a logging level. When applicable, sample debugging logs for a percentage of invocations.

In the serverless airline example used in this series, the booking service Lambda functions use Lambda Powertools as a logging framework with output structured as JSON.

Lambda Powertools is added to the Lambda functions as a shared Lambda layer in the AWS Serverless Application Model (AWS SAM) template. The layer ARN is stored in Systems Manager Parameter Store.

Parameters:
  SharedLibsLayer:
    Type: AWS::SSM::Parameter::Value<String>
    Description: Project shared libraries Lambda Layer ARN
Resources:
    ConfirmBooking:
        Type: AWS::Serverless::Function
        Properties:
            FunctionName: !Sub ServerlessAirline-ConfirmBooking-${Stage}
            Handler: confirm.lambda_handler
            CodeUri: src/confirm-booking
            Layers:
                - !Ref SharedLibsLayer
            Runtime: python3.7
…

The LOG_LEVEL and other Powertools settings are configured in the Globals section as Lambda environment variable for all functions.

Globals:
    Function:
        Environment:
            Variables:
                POWERTOOLS_SERVICE_NAME: booking
                POWERTOOLS_METRICS_NAMESPACE: ServerlessAirline
                LOG_LEVEL: INFO 

For Amazon API Gateway, there are two types of logging in CloudWatch: execution logging and access logging. Execution logs contain information that you can use to identify and troubleshoot API errors. API Gateway manages the CloudWatch Logs, creating the log groups and log streams. Access logs contain details about who accessed your API and how they accessed it. You can create your own log group or choose an existing log group that could be managed by API Gateway.

Enable access logs, and selectively review the output format and request fields that might be necessary. For more information, see “Setting up CloudWatch logging for a REST API in API Gateway”.

API Gateway logging

API Gateway logging

Enable AWS AppSync logging which uses CloudWatch to monitor and debug requests. You can configure two types of logging: request-level and field-level. For more information, see “Monitoring and Logging”.

AWS AppSync logging

AWS AppSync logging

Define and set a log retention strategy

Define a log retention strategy to satisfy your operational and business needs. Set log expiration for each CloudWatch log group as they are kept indefinitely by default.

For example, in the booking service AWS SAM template, log groups are explicitly created for each Lambda function with a parameter specifying the retention period.

Parameters:
    LogRetentionInDays:
        Type: Number
        Default: 14
        Description: CloudWatch Logs retention period
Resources:
    ConfirmBookingLogGroup:
        Type: AWS::Logs::LogGroup
        Properties:
            LogGroupName: !Sub "/aws/lambda/${ConfirmBooking}"
            RetentionInDays: !Ref LogRetentionInDays

The Serverless Application Repository application, auto-set-log-group-retention can update the retention policy for new and existing CloudWatch log groups to the specified number of days.

For log archival, you can export CloudWatch Logs to S3 and store them in Amazon S3 Glacier for more cost-effective retention. You can use CloudWatch Log subscriptions for custom processing, analysis, or loading to other systems. Lambda extensions allows you to process, filter, and route logs directly from Lambda to a destination of your choice.

Good practice: Optimize function configuration to reduce cost

Benchmark your function using a different set of memory size

For Lambda functions, memory is the capacity unit for controlling the performance and cost of a function. You can configure the amount of memory allocated to a Lambda function, between 128 MB and 10,240 MB. The amount of memory also determines the amount of virtual CPU available to a function. Benchmark your AWS Lambda functions with differing amounts of memory allocated. Adding more memory and proportional CPU may lower the duration and reduce the cost of each invocation.

In “Optimizing application performance – part 2”, I cover using AWS Lambda Power Tuning to automate the memory testing process to balances performance and cost.

Best practice: Use cost-aware usage patterns in code

Reduce the time your function runs by reducing job-polling or task coordination. This avoids overpaying for unnecessary compute time.

Decide whether your application can fit an asynchronous pattern

Avoid scenarios where your Lambda functions wait for external activities to complete. I explain the difference between synchronous and asynchronous processing in “Optimizing application performance – part 1”. You can use asynchronous processing to aggregate queues, streams, or events for more efficient processing time per invocation. This reduces wait times and latency from requesting apps and functions.

Long polling or waiting increases the costs of Lambda functions and also reduces overall account concurrency. This can impact the ability of other functions to run.

Consider using other services such as AWS Step Functions to help reduce code and coordinate asynchronous workloads. You can build workflows using state machines with long-polling, and failure handling. Step Functions also supports direct service integrations, such as DynamoDB, without having to use Lambda functions.

In the serverless airline example used in this series, Step Functions is used to orchestrate the Booking microservice. The ProcessBooking state machine handles all the necessary steps to create bookings, including payment.

Booking service state machine

Booking service state machine

To reduce costs and improves performance with CloudWatch, create custom metrics asynchronously. You can use the Embedded Metrics Format to write logs, rather than the PutMetricsData API call. I cover using the embedded metrics format in “Understanding application health” – part 1 and part 2.

For example, once a booking is made, the logs are visible in the CloudWatch console. You can select a log stream and find the custom metric as part of the structured log entry.

Custom metric structured log entry

Custom metric structured log entry

CloudWatch automatically creates metrics from these structured logs. You can create graphs and alarms based on them. For example, here is a graph based on a BookingSuccessful custom metric.

CloudWatch metrics custom graph

CloudWatch metrics custom graph

Consider asynchronous invocations and review run away functions where applicable

Take advantage of Lambda’s event-based model. Lambda functions can be triggered based on events ingested into Amazon Simple Queue Service (SQS) queues, S3 buckets, and Amazon Kinesis Data Streams. AWS manages the polling infrastructure on your behalf with no additional cost. Avoid code that polls for third-party software as a service (SaaS) providers. Rather use Amazon EventBridge to integrate with SaaS instead when possible.

Carefully consider and review recursion, and establish timeouts to prevent run away functions.

Conclusion

Design, implement, and optimize your application to maximize value. Asynchronous design patterns and performance practices ensure efficient resource use and directly impact the value per business transaction. By optimizing your serverless application performance and its code patterns, you can reduce costs while making more efficient use of resources.

In this post, I cover minimizing external calls and function code initialization. I show how to optimize logging output with the embedded metrics format, and log retention. I recap optimizing function configuration to reduce cost and highlight the benefits of asynchronous event-driven patterns.

This post wraps up the series, building well-architected serverless applications, where I cover the AWS Well-Architected Tool with the Serverless Lens . See the introduction post for links to all the blog posts.

For more serverless learning resources, visit Serverless Land.

 

Working with timestamp with time zone in your Amazon S3-based data lake

Post Syndicated from Thiyagarajan Arumugam original https://aws.amazon.com/blogs/big-data/working-with-timestamp-with-time-zone-in-your-amazon-s3-based-data-lake/

With a data lake built on Amazon Simple Storage Service (Amazon S3), you can use the purpose-built analytics services for a range of use cases, from analyzing petabyte-scale datasets to querying the metadata of a single object. AWS analytics services support open file formats such as Parquet, ORC, JSON, Avro, CSV, and more, so it’s convenient to analyze with the tool that is most appropriate for your use case. For more information, see Amazon S3 as the Data Lake Storage Platform.

The TIMESTAMP and TIMESTAMPTZ (TIMESTAMP with time zone) data types are key data elements associated with many time-based datasets (for example clickstream, historical sales, and forecasting) in your data lake. But when you access the data across different analytical services, such as Amazon EMR-based ETL outputs being read by Amazon Redshift Spectrum, you may not know how the data will behave. Furthermore, lack of proper handling may cause accuracy issues in timestamp with time zone data types. This post delves into handling the TIMESTAMP and TIMESTAMPTZ data types in the context of a data lake by using a centralized data architecture. Because AWS analytical services cover a broad spectrum, we primarily focus on handing timestamps using Apache Hive, Apache Spark, Apache Parquet (using Amazon EMR and Amazon Athena), and Amazon Redshift to cover both the data lake and data warehouse.

Overview of TIMESTAMP and TIMESTAMPTZ data types in your data lake

Let’s start with some common definitions of the TIMESTAMP and TIMESTAMPTZ data types.

  • The TIMESTAMP data type stores values that include the date and time of day. For example, 12/17/1997 17:37:16. Timestamps are presented without time zone information.
  • The TIMESTAMPTZ data type to stores values with the date, time of day, and time zone. For example, 12/17/1997 17:37:16 (PST).

Internally, the timestamp is as an integer, representing seconds in UTC since the epoch (1970-01-01 00:00:00 UTC) and TIMESTAMPTZ values also stored as integers with respect to Coordinated Universal Time (UTC).

When working with the TIMESTAMPTZ data type, reads and writes use the time zone of the client user machine. When no time zone is set up or if left at the default values (such as the JVM/SQL client), it defaults to UTC.

Timestamp behavior when accessed across the analytical services

For this post, we discuss handling the timestamp with time zone data when accessed individually within the services and as well as between the services. The following diagram shows the architecture for this setup.

In this architecture, Parquet objects are stored in a centralized Amazon S3-based data lake, and Amazon EMR, Athena, and Amazon Redshift are used to access this centralized data. Data is also processed by these individual engines and accessed across these services through the Amazon S3 storage.

In this post, we illustrate the behavior of the different data types when data moves across different services from the Amazon S3 Parquet files.

Processing data in Amazon EMR (ETL) and accessing it with Amazon Redshift

In this use case, the Spark or Hive data pipeline generates Parquet files in the data lake and stores it in Amazon S3. Parquet files that are stored in Amazon S3 are loaded to Amazon Redshift using the COPY command. The following diagram illustrates this workflow.

To test this setup, complete the following steps:

  1. Create a Hive table and insert a sample row (for this post, we use an EMR cluster spun up in us-west-2, PST):
    CREATE EXTERNAL TABLE clickstream_dwh.clickstream_hive
    (
      sessionid             BIGINT,
      click_region       STRING,
      click_datetime_utc       TIMESTAMP,
      pageid                INT,
      productid          INT
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
    STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' 
    LOCATION s3://clickstream-dwh-us-west-2/warehouse/clickstream_dwh.db/clickstream_hive/';
    
    
    insert into clickstream_dwh.clickstream_hive values (9074420482 ,'SEATTLE, US' ,'2014-04-06 02:40:13' ,3365,183876);

  1. Verify the Parquet file content using the Parquet tool in Amazon S3:
    $ parq 000000_0 --head 10
        sessionid click_region  click_datetime_utc  pageid  productid
    0  9074420482        SEATTLE, US 2014-04-06 09:40:13    3365     183876

In the preceding output, the Hive client running Amazon EMR interprets the time zone with respect to the end-user client (in PST), and converts it to UTC when writing to the Parquet file.

  1. Read through Hive and Spark (in Pacific time):
    Read through Hive(Pacific):
    
    PST:
    select * from clickstream_dwh.clickstream_hive; 
    
    +-------------+---------------+------------------------+---------+------------+
    |  sessionid  | click_region  |   click_datetime_utc   | pageid  | productid  |
    +-------------+---------------+------------------------+---------+------------+
    | 9074420482  | SEATTLE, US         | 2014-04-06 02:40:13.0  | 3365    | 183876     |

Amazon EMR Hive and Spark convert the underlying UTC stored timestamp values in Parquet to the client user machine’s relative time (PST) when displaying the results.

  1. Copy the Parquet file to an Amazon Redshift table with the TIMESTAMP column data type (in UTC). We use the SQL command line client tool psql to query the results in Amazon Redshift.
    COPY the parquet file to Redshift table with timestamp column data type(UTC):
    
    
    CREATE TABLE clickstream_dwh.clickstream_ts
    (
      sessionid             BIGINT,
      click_region       VARCHAR(100),
      click_datetime_utc       TIMESTAMP,
      pageid                INT,
      productid          INT
    );
    
    dev=# SHOW TIMEZONE;
     TimeZone 
    ----------
     UTC
    (1 row)
    
    dev=# select * from clickstream_dwh.clickstream_ts;
     sessionid  | click_region | click_datetime_utc  | pageid | productid 
    ------------+--------------+---------------------+--------+-----------
     9074420482 | SEATTLE, US        | 2014-04-06 09:40:13 |   3365 |    183876
    (1 row)
    
    
    dev=# SET timezone = 'America/Los_Angeles';
    SET
    
    dev=# SHOW TIMEZONE;
          TimeZone       
    ---------------------
     America/Los_Angeles
    (1 row)
    
    
    dev=# select * from clickstream_dwh.clickstream_ts;
     sessionid  | click_region | click_datetime_utc  | pageid | productid 
    ------------+--------------+---------------------+--------+-----------
     9074420482 | SEATTLE, US        | 2014-04-06 09:40:13 |   3365 |    183876
    (1 row)
    
    Note: SET timezone = 'America/Los_Angeles' , does not affect the TIMESTAMP column.

In the preceding output, the timestamp doesn’t have a time zone. All data is interpreted in UTC or whatever raw format it was when loaded into Amazon Redshift.

  1. Copy the Parquet file to the Amazon Redshift table using TIMESTAMPTZ (UTC & Pacific):
    COPY the parquet file to Redshift table with TIMESTAMPTZ(UTC & Pacific):
    
    
    CREATE TABLE clickstream_dwh.clickstream_tz
    (
      sessionid             BIGINT,
      click_region       VARCHAR(100),
      click_datetime_utc       TIMESTAMPTZ,
      pageid                INT,
      productid          INT
    );
    
    
    COPY clickstream_dwh.clickstream_tz
    FROM 's3://clickstream-dwh-us-west-2/warehouse/clickstream_dwh.db/clickstream_hive/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftDemoRole'
    parquet;
    
    
    
    dev=# SHOW TIMEZONE;
     TimeZone 
    ----------
     UTC
    (1 row)
    
    dev=# select * from clickstream_dwh.clickstream_tz;
     sessionid  | click_region |   click_datetime_utc    | pageid | productid 
    ------------+--------------+------------------------+--------+-----------
     9074420482 | SEATTLE, US        | 2014-04-06 09:40:13+00 |   3365 |    183876
    (1 row)
    
    dev=# SET timezone = 'America/Los_Angeles';
    SET
    dev=# 
    dev=# 
    dev=# SHOW TIMEZONE;
          TimeZone       
    ---------------------
     America/Los_Angeles
    (1 row)
    
    dev=# select * from clickstream_dwh.clickstream_tz;
     sessionid  | click_region |   click_datetime_utc    | pageid | productid 
    ------------+--------------+------------------------+--------+-----------
     9074420482 | SEATTLE, US        | 2014-04-06 02:40:13-07 |   3365 |    183876
    (1 row)

The output shows that TIMESTAMPTZ can interpret the client time zone and convert the value with respect to the end-user client (PST), though the actual values are stored in UTC.

Processing data from Amazon Redshift and moving it to an Amazon S3 data lake

In the following use case, we copy data from Amazon Redshift to a data lake. Amazon Redshift stores the TIMESTAMP and TIMESTAMPTZ columns data types in a table. The table data is exported to Amazon S3 as Parquet files with the UNLOAD command. The following diagram illustrates this architecture.

To experiment with this setup, complete the following steps:

  1. Unload the Amazon Redshift table data to Amazon S3 (in UTC):
    UNLOAD ('select * from clickstream_dwh.clickstream_tz;')
    TO 's3://clickstream-dwh-us-west-2/warehouse/clickstream_dwh.db/clickstream_rs/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftDemoRole'
    parquet;

  1. Verify the Parquet file content:
Check the parquet file content:

$ parq 0016_part_00.parquet --head 10
    sessionid click_region   click_datetime_utc  pageid  productid
0  9074420482        SEATTLE, US 2014-04-06 09:40:13    3365     183876
  1. Create a table in Hive and query it (in UTC):
    CREATE EXTERNAL TABLE clickstream_dwh.clickstream_rs
    (
      sessionid             BIGINT,
      click_region       STRING,
      click_datetime_utc       TIMESTAMP,
      pageid                INT,
      productid          INT
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
    STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' 
    LOCATION 's3://clickstream-dwh-us-west-2/warehouse/clickstream_dwh.db/clickstream_rs/';
    
    
    
    > set hive.parquet.timestamp.skip.conversion=false;
    
    
    > select * from clickstream_dwh.clickstream_rs;
    
    
    +-------------+---------------+------------------------+---------+------------+
    |  sessionid  | click_region  |   click_datetime_utc   | pageid  | productid  |
    +-------------+---------------+------------------------+---------+------------+
    | 9074420482  | SEATTLE, US         | 2014-04-06 02:40:13.0  | 3365    | 183876     |
    +-------------+---------------+------------------------+---------+------------+
    1 row selected (0.888 seconds)

In the preceding output, the actual data in the Parquet file is stored in UTC, but Hive can read and display the local time zone using client settings.

When using Hive, set hive.parquet.timestamp.skip.conversion=false. Pre-3.1.2 Hive implementation of Parquet stores timestamps in UTC on-file; this flag allows you to skip the conversion when reading Parquet files created from other tools that may not have done so. Setting it to false treats legacy timestamps as UTC-normalized. For more information, see hive.parquet.timestamp.skip.conversion.

  1. Query using Spark-SQL (in Pacific time):
    spark-sql> select *  from clickstream_dwh.clickstream_rs;
    
    
    9074420482	SEATTLE, US	2014-04-06 02:40:13	3365	183876

In the preceding output, Spark converts the values with respect to the end-user client (PST), though the actual values are stored in UTC.

Accessing data through Athena

To access the data through Athena, you need to create the external table either in the AWS Glue Data Catalog or Hive metastore. In this example, we populate the Data Catalog.

To create a table using the Data Catalog, sign in to the Athena console and run the following DDL:

CREATE EXTERNAL TABLE IF NOT EXISTS default.blog_clickstream(
  `sessionid` bigint,
  `click_region` string,
  `click_datetime_utc` timestamp,
  `pageid` int,
  `productid` int 
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
) LOCATION 's3://clickstream-dwh-us-west-2/warehouse/clickstream_dwh.db/clickstream_rs/'
TBLPROPERTIES ('has_encrypted_data'='false');

The following screenshot shows the query results.

The results show that Athena converts click_datetime_utc to the user’s local time zone (in this case, PST).

Accessing data through Amazon Redshift Spectrum

To access the data through Amazon Redshift Spectrum, you need to create the following:

  • The external database and table in the Data Catalog or a Hive metastore. We can use the same table we already created in the preceding use case (default.blog_clickstream).
  • An Amazon Redshift external schema for the external database in the Data Catalog.

See the following code:

dev=# CREATE EXTERNAL SCHEMA ext_clickstream_blog
      from data catalog
      database 'default' 
      region 'us-west-2' 
      iam_role 'arn:aws:iam::123456789012:role/RedshiftDemoRole';


dev=#  select * from ext_clickstream_blog.blog_clickstream;

sessionid  | click_region | click_datetime_utc  | pageid | productid
-----------+--------------+---------------------+--------+----------
9074420482 | SEATTLE, US        | 2014-04-06 09:40:13 |   3365 |    183876

The output shows that Amazon Redshift Spectrum can convert click_datetime_utc to the local time zone of the user (PST).

Use cases for handling TIMESTAMP AND TIMESTAMPTZ data types

When implementing the data model for your data lake, the choice between selecting the TIMESTAMP or TIMESTAMPTZ data type depends on how your end-users consume the data. In this section, we discuss two different use cases.

Using TIMESTAMP for a uniform display of one normalized time

When you want a uniform display of a standard time (in a particular time zone), use the TIMESTAMP data type and baseline the values that are stored into a particular time zone. For example, the following table shows collected clickstream data from a global website.

sessionid click_region click_datetime_utc pageid productid
3682484416 Chennai 2014-04-06T 18:44:58 7156 309743
6367587374 London 2014-04-06T 18:44:58 5298 749625
9074420482 Los Angles 2014-04-06T 09:40:13 3365 183876
1746153004 Perth 2014-04-04T 06:13:28 3761 725195
1449344779 Singapore 2014-04-04T 06:13:28 3527 140229
4115543521 New York 2014-04-07T 09:22:28 3712 831655
2748081381 Paris 2014-04-07T 09:22:28 8474 347742
1120684200 New York 2014-04-07T 09:22:28 2731 568755

The clicks for the website come from users across the globe and are normalized for the UTC time zone using the click_datetime_utc column and a TIMESTAMP data type. You can accomplish this step during the data transformation process. Normalizing the data avoids confusion when data is analyzed across the different Regions without needing explicit conversion.

Using TIMESTAMPTZ for a contextual display of data depending on the user’s local time zone

When you need a contextual display of date and time for users accessing the data, choose the TIMESTAMPTZ data type. For example, let’s consider a customer service application that is accessing data from a centralized data warehouse. Individual users of the application are interested in analyzing data with respect which location the issue happened, rather than a normalized time zone. See the following code:

create table public.customer_issue_log
(
customer_id	BIGINT NOT NULL,
customer_location	VARCHAR(50) NOT NULL,
customer_timezone	VARCHAR(10) NOT NULL,
issue_create_time timestamptz NOT NULL,
issue_create_time_utc timestamp  NULL,
issue_id	INTEGER NOT NULL,
issue_severity INTEGER NOT NULL
);

The following table summarizes the output.

customer_id customer_location customer_timezone issue_create_time issue_create_time_utc issue_id issue_severity
9589430063 Chennai, India IST 4/5/2014 11:44:58 PM 4/6/2014 04:44:58 AM 2343 3
2796599493 New York, US EST 4/6/2014 06:44:58 AM 4/6/2014 11:44:58 AM 2780 4
1836626118 Toronto, CA EDT 4/4/2014 05:13:28 AM 4/4/2014 10:13:28 AM 7821 1
6790206978 Sydney, Australia AEDT 4/5/2014 05:40:13 PM 4/5/2014 10:40:13 PM 3135 5

The issue_create_time column stores the date and time values and the time zone. When you query this table, you can view issue_create_time in your local time zone automatically (without any explicit conversion) by configuring set timezone or using a SQL client (such as SQL Workbench) that automatically adjusts this with respect your local computer settings.

In addition, you can also introduce optional redundant columns such as issue_create_time_utc for ease of use when users try to analyze the data across different Regions.

Independent of the approach taken for the implementation, there is no loss of timestamp or time zone values when using the preceding approach, and you can perform data aggregation on both columns without needing explicit conversion because all data is stored in UTC in the underlying storage (Parquet in Amazon Redshift). For example, you can roll up data into weekly or monthly aggregates across the Regions without any explicit conversion. The following example code calculates the weekly number of issues by priority across all locations:

select date_trunc('week', issue_create_time) wk, issue_severity, count(issue_id) from public.customer_issue_log
where
group by date_trunc('week', issue_create_time), issue_severity;

Best practices for handling timestamps and time zones with data types

You should handle dates as either DATE, TIMESTAMP, or TIMESTAMPTZ data types and not convert them to strings. When dates are interpreted from strings, you lose all the features and flexiblity of working with date fields and date calculations, and also lose efficiency of processing. Moreover, casting or converting at runtime can be expensive.

When using TIMESTAMP or TIMESTAMPTZ data types, be aware of the client tools that access them. Client tool behavior largely depends on the local setting of the drivers and JVM. But it’s possible to override the behavior and always check for client tool-specific default behavior.

Use TIMESTAMPTZ only when absolutely necessary in the data model. In most use cases, TIMESTAMP simplifies data handling and avoids ambiguity when users access them.

Summary

In this post, we talked about handling and using TIMESTAMP and TIMESTAMPTZ data types with an Amazon S3-backed data lake. Most importantly, we covered how different AWS services like Amazon Redshift, Amazon EMR, Hive, and many other client tools interpret and interact with these data types. Choosing between using TIMESTAMP or TIMESTAMPTZ depends on the use case and how the end-user wants to visualize the data (a uniform display with one normalized time or a contextual display depending on time zone, respectively). Happy timestamping!


About the Authors

Thiyagarajan Arumugam is a Principal Solutions Architect at Amazon Web Services and designs customer architectures to process data at scale. Prior to AWS, he built data warehouse solutions at Amazon.com. In his free time, he enjoys all outdoor sports and practices the Indian classical drum mridangam.

 

 

Srinivasan Krishnasamy is a ‘Senior Big Data Consultant’ at Amazon Web Services. He joined AWS in 2015 and specializes in building and supporting Big Data solutions that help customers to ingest, process and analyze data at scale.

 

 

Satish Sathiya is a Product Engineer at Amazon Redshift. He is an avid big data enthusiast who collaborates with customers around the globe to achieve success and meet their data warehousing and data lake architecture needs.

Keeping your data lake clean and compliant with Amazon Athena

Post Syndicated from David Roberts original https://aws.amazon.com/blogs/big-data/keeping-your-data-lake-clean-and-compliant-with-amazon-athena/

With the introduction of CTAS support for Amazon Athena (see Use CTAS statements with Amazon Athena to reduce cost and improve performance), you can not only query but also create tables using Athena with the associated data objects stored in Amazon Simple Storage Service (Amazon S3). These tables are often temporary in nature and used to filter or aggregate data that already exists in another Athena table. Although this offers great flexibility to perform exploratory analytics, when tables are dropped, the underlying Amazon S3 data remains indefinitely. Over time, the accumulation of these objects can increase Amazon S3 costs, become administratively challenging to manage, and may inadvertently preserve data that should have been deleted for privacy or compliance reasons. Furthermore, the AWS Glue table entry is purged so there is no convenient way to trace back which Amazon S3 path was mapped to a deleted table.

This post shows how you can automate  deleting Amazon S3 objects associated with a table  after dropping it using Athena. AWS Glue is required to be the metadata store for Athena.

Overview of solution

The solution requires that the AWS Glue table record (database, table, Amazon S3 path) history is preserved outside of AWS Glue, because it’s removed immediately  after a table is dropped. Without this record, you can’t delete the associated Amazon S3 object entries after the fact.

When Athena CTAS statements  are issued, AWS Glue generates Amazon CloudWatch events that specify the database and table names. These events are available from Amazon EventBridge and can be used to trigger an AWS Lambda function (autoCleanS3) to fetch the new or updated Amazon S3 path from AWS Glue and write the database, table, and Amazon S3 path into an AWS Glue history table stored in Amazon DynamoDB (GlueHistoryDDB). When Athena drop table queries are detected, CloudWatch events are generated that trigger autoCleanS3 to look up the Amazon S3 path from GlueHistoryDDB and delete all related objects from Amazon S3.

Not all dropped tables should trigger Amazon S3 object deletion. For example, when you create a table using existing Amazon S3 data (not CTAS), it’s not advisable to automatically delete the associated Amazon S3 tables, because other analysts may have other tables referring to the same source data. For this reason, you must include a user-defined comment (--dropstore ) in the Athena drop table query to cause autoCleanS3 to purge the Amazon S3 objects.

Lastly, after objects are successfully deleted, the corresponding entry in GlueHistoryDDB  is updated for historical and audit purposes. The overall workflow is described in the following diagram.

The workflow contains the following steps:

  1. A user creates a table either via Athena or the AWS Glue console or API.
  2. AWS Glue generates a CloudWatch event, and an EventBridge rule triggers the Lambda function.
  3. The function creates an entry in DynamoDB containing a copy of the AWS Glue record and Amazon S3 path.
  4. The user drops the table from Athena, including the special comment --dropstore.
  5. The Lambda function fetches the dropped table entry from DynamoDB, including the Amazon S3 path.
  6. The function deletes data from the Amazon S3 path, including manifest files, and marks the DynamoDB entry as purged.

Walkthrough overview

To implement this solution, we complete the following steps:

  1. Create the required AWS Identity and Access Management (IAM) policy and role.
  2. Create the AWS Glue history DynamoDB table.
  3. Create the Lambda autoCleanS3 function.
  4. Create the EventBridge rules.
  5. Test the solution.

If you prefer to use a preconfigured CloudFormation template, launch one of the following stacks depending on your Region.

Region Launch Button
us-east-1 (N. Virginia)
us-west-2 (Oregon)
eu-west-1 (Ireland)

Prerequisites

Before implementing this solution, create an AWS Glue database and table with the data residing in  Amazon S3. Be sure your user has the necessary permissions to access Athena and  perform CTAS operations writing  in a sample Amazon S3 location.

For more information about building a data lake, see Build a Data Lake Foundation with AWS Glue and Amazon S3.

Creating an IAM policy and role

You need to first create the required IAM policy for the Lambda function role to use to query AWS Glue and write to DynamoDB.

  1. On the IAM console, choose Policies.
  2. Choose Create policy.
  3. On the JSON tab, enter the following code (update the Region, account ID, and S3 bucket accordingly, and the table name GlueHistoryDDB if you choose to change it):
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": [
                    "s3:GetObject",
                    "s3:ListBucket",
                    "s3:ListObjectsV2",
                    "s3:DeleteObjectVersion",
                    "s3:DeleteObject"
                ],
                "Resource": [
                    "arn:aws:s3:::<athena-query-results-s3-bucket>",
                    "arn:aws:s3:::<athena-query-results-s3-bucket>/*"
                ],
                "Effect": "Allow"
            },
            {
                "Action": [
                    "dynamodb:PutItem",
                    "dynamodb:Scan",
                    "dynamodb:Query",
                    "dynamodb:UpdateItem"
                ],
                "Resource": [
                    "arn:aws:dynamodb:<region>:<accountId>:table/GlueHistoryDDB"
                ],
                "Effect": "Allow"
            },
            {
                "Action": [
                    "glue:GetTable"
                ],
                "Resource": [
                    "arn:aws:glue:<region>:<accountId>:catalog",
                    "arn:aws:glue:<region>:<accountId>:database/*",
                    "arn:aws:glue:<region>:<accountId>:table/*"
                ],
                "Effect": "Allow"
            },
            {
                "Action": [
                    "logs:CreateLogGroup"
                ],
                "Resource": [
                    "arn:aws:logs:<region>:<accountId>:*"
                ],
                "Effect": "Allow"
            },
            {
                "Action": [
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                    "logs:DescribeLogStreams"
                ],
                "Resource": [
                    "arn:aws:logs:<region>:<accountId>:log-group:/aws/lambda/autoCleanS3:*"
                ],
                "Effect": "Allow"
            }
        ]
    }

  1. Choose Review policy.
  2. For Name, enter autoCleanS3-LambdaPolicy.
  3. For Description, enter Policy used by Lambda role to purge S3 objects when an Amazon Athena table is dropped.
  4. Choose Create policy.

Next, you need to create an IAM role and attach this policy.

  1. On the IAM console, choose Roles.
  2. Choose Create role.
  3. Choose AWS service.
  4. Choose Lambda.
  5. Choose Next: Permissions.

  1. For Filter policies, enter autoCleanS3-LambdaPolicy.
  2. Choose Next: Tags.
  3. Choose Next: Review.
  4. For Role name, enter autoCleanS3-LambdaRole.
  5. For Description, enter Role used by Lambda to purge S3 objects when an Amazon Athena table is dropped.
  6. Choose Create role.

Creating the AWS Glue history DynamoDB table

You use this DynamoDB table to hold the current and historical list of AWS Glue tables and their corresponding Amazon S3 path. Create the table as follows:

  1. On the DynamoDB console, choose Dashboard.
  2. Choose Create table.
  3. For Table name, enter GlueHistoryDDB.
  4. For Partition key, enter database (leave type as String).
  5. Select Add sort key.
  6. Enter table_date (leave type as String).
  7. For Table settings, select Use default settings.
  8. Choose Create.

The following table summarizes the GlueHistoryDDB table attributes that the Lambda function creates.

Column Type Description
database partition key The name of the AWS Glue database.
table_date sort key A composite attribute of AWS Glue table name plus date created. Because the same database and table name can be created again, the date must be used to ensure uniqueness.
created_by attribute The user or Amazon EC2 instance ARN from which the table was created.
owner attribute The owner of the table or account number.
purged attribute A boolean indicating whether the Amazon S3 objects have been deleted (True/False).
s3_path attribute The Amazon S3 path containing objects associated with the table.
table attribute The AWS Glue table name.
update_time attribute The last time the table was updated (the Amazon S3 path changed or objects purged).
view_sql attribute The view DDL if a view was created.

Creating the Lambda function autoCleanS3

A CloudWatch event triggers the Lambda function autoCleanS3 when a new table is created, updated, or dropped. If the --dropstore keyword is included in the Athena query comments, the associated Amazon S3 objects are also removed.

  1. On the Lambda console, choose Create function.
  2. Select Author from scratch.
  3. For Function name¸ enter autoCleanS3.
  4. For Runtime, choose Python 3.8.
  5. Under Permissions, for Execution role, select Use an existing role.
  6. Choose the role you created (service-role/autoCleanS3-LambdaRole).
  7. Choose Create function.
  8. Scroll down to the Function code section.
  9. If using Region us-west-2, on the Actions menu, choose Upload a file to Amazon S3.

  1. Enter the following:
    https://aws-bigdata-blog.s3.amazonaws.com/artifacts/aws-blog-keep-your-data-lake-clean-and-compliant-with-amazon-athena/autoCleanS3.zip

  2. Choose Save.

If using a Region other than us-west-2, download the Lambda .zip file locally. Then choose Upload a .zip file and choose the file from your computer to upload the Lambda function.

  1. In the Environment variables section, choose Edit.
  2. Choose Add environment variable.
  3. Enter the following key-values in the following table (customize as desired):
Key Value Purpose
Athena_SQL_Drop_Phrase --dropstore String to embed in Athena drop table queries to cause associated Amazon S3 objects to be removed
db_list

Comma-separated regex filter

<.*>

Allows you to limit which databases may contain tables that autoCleanS3 is allowed to purge
ddb_history_table GlueHistoryDDB The name of the AWS Glue history DynamoDB table
disable_s3_cleanup False If set to True, it disables the Amazon S3 purge, still recording attempts in the history table
log_level INFO Set to DEBUG to troubleshoot if needed

You must use a standard regex expression, which can be a simple comma-separated list of the AWS Glue databases that you want autoCleanS3 to evaluate.

 The following table shows example patterns for db_list.

Example Regex Pattern Result
.* Default, includes all databases
clickstream_web, orders_web, default Includes only clickstream_web, orders_web, default
.*_web Includes all databases having names ending in _web
.*stream.* Includes all databases containing stream in their name

For a complete list or supported patterns, see https://docs.python.org/3/library/re.html#re.Pattern.match

  1. Choose Save.

Creating EventBridge rules

You need to create EventBridge rules that invoke your Lambda function whenever Athena query events and AWS Glue CreateTable and UpdateTable events are generated.

Creating the Athena event rule

To create the Athena query event rule, complete the following steps:

  1. On the EventBridge console, choose Create rule.
  2. For Name, enter autoCleanS3-AthenaQueryEvent.
  3. For Description, enter Amazon Athena event for any query to trigger autoCleanS3.
  4. For Define pattern, choose Event pattern.
  5. For Event matching pattern, choose Custom pattern.
  6. For Event pattern, enter the following:
    {
    	"detail-type": [
    		"AWS API Call via CloudTrail"
    	],
    	"source": [
    		"aws.athena"
    	],
    	"detail": {
    		"eventName": [
    			"StartQueryExecution"
    		]
    	}
    }

  1. Choose Save.
  2. For Select targets, choose Lambda function.
  3. For Function¸ choose autoClean3.
  4. Choose Create.

Creating the AWS Glue event rule

To create the AWS Glue table event  rule, complete the following steps:

  1. On the EventBridge console, choose Create rule.
  2. For Name, enter autoCleanS3-GlueTableEvent.
  3. For Description, enter AWS Glue event for any table creation or update to trigger autoCleanS3.
  4. For Define pattern, choose Event pattern.
  5. For Event matching pattern, choose Custom pattern.
  6. For Event pattern, enter the following:
    {
    	"detail-type": [
    		"Glue Data Catalog Database State Change"
    	],
    	"source": [
    		"aws.glue"
    	],
    	"detail": {
    		"typeOfChange": [
    			"CreateTable",
    			"UpdateTable"
    		]
    	}
    }

  1. Choose Save.
  2. For Select targets, choose Lambda function.
  3. For Function¸ choose autoClean3.
  4. Choose Create.

You’re finished!

Testing the solution

Make sure you already have a data lake with tables defined in your AWS Glue Data Catalog and permission to access Athena. For this post, we use NYC taxi ride data. For more information, see Build a Data Lake Foundation with AWS Glue and Amazon S3.

  1. Create a new table using Athena CTAS.

Next, verify that the entry appears in the new GlueHistoryDDB table.

  1. On the DynamoDB console, open the GlueHistoryDDB table.
  2. Choose Items.
  3. Confirm the s3_path value for the table.

You can also view  the Amazon S3 table path and objects associated with the table.

  1. On the Amazon S3 console, navigate to the s3_path found in GlueHistoryDDB.
  2. Confirm the table and path containing the data folder and associated manifest and metadata objects.

  1. Drop the table using the keyword --dropstore.

  1. Check the Amazon S3 path to verify both the table folder and associated manifest and metadata files have been removed.

You can also see the purged attribute for the entry in GlueHistoryDDB is now set to True, and update_time has been updated, which you can use if you ever need to look back and understand when a purge event occurred.

Considerations

The Lambda timeout may need to be increased for very large tables, because the object deletion operations may not complete in time.

To prevent accidental data deletion, it’s recommended to carefully limit which databases may participate (Lambda environment variable db_list) and to enable versioning on the Athena bucket path and set up Amazon S3 lifecycle policies to eventually remove older versions. For more information, see Deleting object versions. 

Conclusion

In this post, we demonstrated how to automate the process of  deleting Amazon S3 objects associated with dropped AWS Glue tables. Deleting Amazon S3 objects that are no longer associated with an AWS Glue table reduces ongoing storage expense, management overhead, and unnecessary exposure of potentially private data no longer needed within the organization, allowing you to meet regulatory requirements.

This serverless solution monitors Athena and AWS Glue table creation and drop events via CloudWatch, and triggers Lambda to perform Amazon S3 object deletion. We use DynamoDB to store the audit history of all AWS Glue tables that have been dropped over time. It’s strongly recommended to enable Amazon S3 bucket versioning to prevent accidental data deletion.

To restore the Amazon S3 objects for the deleted table, you first identify the s3_path value for the relevant table entry in GlueHistoryDDB and either copy or remove the delete marker from objects in that path. For more information, see How do I undelete a deleted S3 object?


About the Author

David Roberts is a Senior Solutions Architect at AWS. His passion is building efficient and effective solutions on the cloud, especially involving analytics and data lake governance. Besides spending time with his wife and two daughters, he likes drumming and watching movies, and is an avid video gamer.