Tag Archives: Amazon S3 Glacier Deep Archive

Celebrate Amazon S3’s 17th birthday at AWS Pi Day 2023

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/celebrate-amazon-s3s-17th-birthday-at-aws-pi-day-2023/

AWS Pi Day 2023 is live today starting at 13:00 PDT; join us on the AWS On Air channel on Twitch.

On this day 17 years ago, we launched a very simple object storage service. It allowed developers to create, list, and delete private storage spaces (known as buckets), upload and download files, and manage their access permissions. The service was available only through a REST and SOAP API. It was designed to provide highly durable data storage with 99.999999999 percent data durability (that’s 11 nines!).

Fast forward to 2023, Amazon Simple Storage Service (Amazon S3) holds more than 280 trillion objects and averages over 100 million requests per second. To protect data integrity, Amazon S3 performs over four billion checksum computations per second. Over the years, we added many capabilities, such as a range of storage classes, to store your colder data cost effectively. Every day, you restore on average more than 1 petabyte from the S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive storage classes. Since launch, you have saved $1 billion from using Amazon S3 Intelligent-Tiering. In 2015, we added the possibility of replicating your data across Regions. Every week, Amazon S3 Replication moves more than 100 petabytes of data. Amazon S3 is also at the core of hundreds of thousands of data lakes. It also has become a critical component of a growing ecosystem of serverless applications. Every day, Amazon S3 sends over 125 billion event notifications to serverless applications. Altogether, Amazon S3 is helping people around the world securely store and extract value from their data.

AWS Pi Day 2023

To celebrate Amazon S3‘s birthday, AWS is hosting the AWS Pi Day event for the third consecutive year. This live online event starts at 13:00 PDT today (March 14, 2023) on the AWS On Air channel on Twitch and will feature four hours of fresh educational content from AWS experts. We will not only discuss Amazon S3 best practices but also dive into the latest innovations across AWS data services, from storage to analytics and AI/ML. Tune in to learn how to get the most out of your data by making it more secure, available, accessible, and connected, and to help you respond to rapid growth and changing demand. You will also learn how to optimize your data costs, automate your cost savings, eliminate operational complexity, and get new insights from your data. Have a look at the full agenda on the registration page.

At AWS, we innovate on your behalf. During the last few weeks, we announced a 99.99 percent SLA for Amazon MemoryDB for Redis, enhanced I/O multiplexing for Amazon ElastiCache for Redis, and encryption by default for new objects on Amazon S3.

But we are not stopping there, and today we take the occasion of this celebration to announce seven new capabilities across our data services.

Mountpoint for Amazon S3 (alpha release): an open-source file client for Amazon S3
Mountpoint for Amazon S3 is an open-source file client for Amazon S3 that you can install on your compute instance. It translates local file storage API calls to REST API calls on objects in Amazon S3. When using Mountpoint for Amazon S3, data lake applications that access objects using file APIs can achieve high single-instance transfer rates, saving on compute costs.

You can get started with Mountpoint for Amazon S3 by mounting an Amazon S3 bucket at a local mount point on your compute instance. Once mounted, applications read objects as files available locally. Mountpoint for Amazon S3 supports sequential and random read operations on existing S3 objects. It is available to download for Linux operating systems as an alpha release and is not yet intended for production workloads. Instead, we want to collect your feedback early and incorporate your input into the design and implementation. To get started, visit the Mountpoint for Amazon S3 GitHub repo, read the technical launch blog, and share your feedback.

Now Generally Available: AWS Data Exchange for Amazon S3
AWS Data Exchange for Amazon S3 enables you to easily find, subscribe to, and use third-party data files for faster time to insight, storage cost optimization, simplified data licensing management, and more. Data Exchange subscribers can directly use files from data providers’ Amazon S3 buckets for their analysis with AWS services without needing to create or manage copies to their account. Data providers can license in-place access to data hosted in their Amazon S3 buckets.

To learn more about how data providers can simplify and scale access management to multiple data subscribers, you can read this blog.

AWS Data Exchange for S3

Amazon S3 Multi-Region Access Points now support replicated datasets that span multiple AWS accounts
We launched Amazon S3 Multi-Region Access Points in September 2021. We added failover control in November 2022. Amazon S3 Multi-Region Access Points now support datasets that are replicated across multiple AWS accounts. Cross-account Multi-Region Access Points simplify object storage access for applications that span both AWS Regions and accounts, avoiding the need for complex request routing logic in your application. They provide a single global endpoint for your multi-Region applications and dynamically route S3 requests based on policies that you define. This helps you to easily implement multi-Region resilience, latency-based routing, and active/passive failover, even when your data is stored in multiple AWS accounts.

You can learn more about S3 Multi-Region Access Points on the Amazon S3 FAQs.

Aliases for S3 Object Lambda Access Points as CloudFront origin
Amazon S3 Object Lambda, launched in March 2021, lets you add your own code to S3 GET, HEAD, and LIST API requests to modify data as it is returned to an application. With today’s launch of aliases for S3 Object Lambda Access Points, any application that requires an S3 bucket name can easily present different views of data depending on the requester. You can now use an S3 Object Lambda Access Point alias as an origin for your Amazon CloudFront distribution to modify the data requested. For example, you can dynamically transform an image depending on the device that a user is visiting from, such as a desktop or a smartphone.

If you want to learn more, my colleague Danilo wrote a blog post with more details and code examples.

Simplify private connectivity from on-premises networks
Amazon Virtual Private Cloud (Amazon VPC) interface endpoints for Amazon S3 now offer private DNS options that can help you more easily route Amazon S3 requests to the lowest-cost endpoint in your VPC. With private DNS for Amazon S3, your on-premises applications can use AWS PrivateLink to access Amazon S3 over an interface endpoint, while requests from your in-VPC applications access Amazon S3 using gateway endpoints. Routing requests like this helps you take advantage of the lowest-cost private network path without having to make code or configuration changes to your clients.

S3 private connectivity

You can learn more on the AWS PrivateLink for Amazon S3 documentation.

Local Amazon S3 Replication on Outposts
Amazon S3 on Outposts now supports S3 replication on Outposts. This extends S3’s fully managed approach to replication to S3 on Outposts buckets. It helps you meet your data residency and data redundancy requirements. With local S3 Replication on Outposts, you can create and configure replication rules to automatically replicate your S3 objects to another Outpost or to another bucket on the same Outpost. During replication, your S3 on Outposts objects are always sent over your local gateway, and objects do not travel back to the AWS Region. S3 Replication on Outposts provides an easy and flexible way to automatically replicate data within a specific data perimeter to address your data redundancy and compliance requirements.

Amazon OpenSearch Security Analytics
The new security analytics capability in Amazon OpenSearch Service enables your Security Operations (SecOps) teams to detect potential threats quickly, while giving them the tools to support security investigations on historical data, all with lower data storage costs. Like many other advanced capabilities of Amazon OpenSearch Service, there is no additional charge for security analytics.

You can learn more about Amazon OpenSearch security analytics by reading this blog post.

Join Us Online Today
You will learn more about these launches and about AWS data services in general. We have also prepared some live demos. We designed the AWS Pi Day event for system administrators, engineers, developers, and architects. Our sessions will bring you the latest and greatest information on storage, security, backup, archiving, training and certification, and more.

And to dive deeper, get Pi Day started early by attending AWS Innovate: Data and AI/ML Edition to learn about cutting-edge machine learning tools, strategies for building future-proof applications, and making data-driven decisions for your organization. Don’t miss Swami Sivasubramanian‘s keynote, starting at 9:00 PDT.

Join us today on the AWS Pi Day live stream. Kevin Miller, VP and GM of Amazon S3, will kick off the event with a keynote at 13:00 PDT.

See you there!

— seb

Genomics workflows, Part 4: processing archival data

Post Syndicated from Rostislav Markov original https://aws.amazon.com/blogs/architecture/genomics-workflows-part-4-processing-archival-data/

Genomics workflows analyze data at petabyte scale. After processing is complete, data is often archived in cold storage classes. In some cases, like studies on the association of DNA variants against larger datasets, archived data is needed for further processing. This means manually initiating the restoration of each archived object and monitoring the progress. Scientists require a reliable process for on-demand archival data restoration so their workflows do not fail.

In Part 4 of this series, we look into genomics workloads processing data that is archived with Amazon Simple Storage Service (Amazon S3). We design a reliable data restoration process that informs the workflow when data is available so it can proceed. We build on top of the design pattern laid out in Parts 1-3 of this series. We use event-driven and serverless principles to provide the most cost-effective solution.

Use case

Our use case focuses on data in Amazon Simple Storage Service Glacier (Amazon S3 Glacier) storage classes. The S3 Glacier Instant Retrieval storage class provides the lowest-cost storage for long-lived data that is rarely accessed but requires retrieval in milliseconds.

The S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive storage classes provide further cost savings, with retrieval times ranging from minutes to hours. We focus on these two classes in order to provide the most cost-effective solution.

You must first restore the objects before accessing them. Our genomics workflow will pause until the data restore completes. The requirements for this workflow are:

  • Reliable launch of the restore so our workflow doesn’t fail (due to S3 Glacier service quotas, or because not all objects were restored)
  • Event-driven design to mirror the event-driven nature of genomics workflows and perform the retrieval upon request
  • Cost-effective and easy-to-manage by using serverless services
  • Upfront detection of archived data when formulating the genomics workflow task, avoiding idle computational tasks that incur cost
  • Scalable and elastic to meet the restore needs of large, archived datasets

Solution overview

Genomics workflows take multiple input parameters to prepare the initiation, such as launch ID, data path, workflow endpoint, and workflow steps. We store this data, including workflow configurations, in an S3 bucket. An AWS Fargate task reads from the S3 bucket and prepares the workflow. It detects if the input parameters include S3 Glacier URLs.

We use Amazon Simple Queue Service (Amazon SQS) to decouple S3 Glacier index creation from object restore actions (Figure 1). This increases the reliability of our process.

Figure 1. Solution architecture for S3 Glacier object restore

An AWS Lambda function creates the index of all objects in the specified S3 bucket URLs and submits them as an SQS message.

Another Lambda function polls the SQS queue and submits the request(s) to restore the S3 Glacier objects to S3 Standard storage class.

The function writes the job ID of each S3 Glacier restore request to Amazon DynamoDB. After the restore is complete, Lambda sets the status of the workflow to READY. Only then can any computing jobs start, such as with AWS Batch.

Implementation considerations

We consider the use case of Snakemake with Tibanna, which we detailed in Part 2 of this series. This allows us to dive deeper into launch details.

Snakemake is an open-source utility for whole-genome-sequence mapping in directed acyclic graph format. Snakemake uses Snakefiles to declare workflow steps and commands. Tibanna is an open-source, AWS-native software that runs bioinformatics data pipelines. It supports Snakefile syntax, plus other workflow languages, including Common Workflow Language and Workflow Description Language (WDL).

We recommend using Amazon Genomics CLI if Tibanna is not needed for your use case, or Amazon Omics if your workflow definitions are compliant with the supported WDL and Nextflow specifications.

Formulate the restore request

The Snakemake Fargate launch container detects if the S3 objects under the requested S3 bucket URLs are stored in S3 Glacier. The Fargate launch container generates and puts a JSON binary base call (BCL) configuration file into an S3 bucket and exits successfully. This file includes the launch ID of the workflow, corresponding with the DynamoDB item key, plus the S3 URLs to restore.

Query the S3 URLs

Once the JSON BCL configuration file lands in this S3 bucket, the S3 Event Notification PutObject event invokes a Lambda function. This function parses the configuration file and recursively queries for all S3 object URLs to restore.
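
A minimal sketch of this query step with the AWS SDK for Python, assuming the configuration file has already been parsed into bucket and prefix pairs (the names below are placeholders, not details from the original implementation):

import boto3

s3 = boto3.client("s3")

# Storage classes that require a restore before the data can be read
ARCHIVE_CLASSES = {"GLACIER", "DEEP_ARCHIVE"}

def list_archived_objects(bucket, prefix):
    """Return the keys under the given prefix that are stored in an archive storage class."""
    archived = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            # list_objects_v2 reports the storage class of each object
            if obj.get("StorageClass") in ARCHIVE_CLASSES:
                archived.append(obj["Key"])
    return archived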

Initiate the restore

The main Lambda function then submits messages to the SQS queue that contains the full list of S3 URLs that need to be restored. SQS messages also include the launch ID of the workflow. This is to ensure we can bind specific restoration jobs to specific workflow launches. If all S3 Glacier objects belong to the Flexible Retrieval storage class, the Lambda function puts the URLs in a single SQS message, enabling restoration with the Bulk Glacier job tier. The Lambda function also sets the status of the workflow to WAITING in the corresponding DynamoDB item. The WAITING state is used to notify the end user that the job is waiting on the data-restoration process and will continue once the data restoration is complete.

A secondary Lambda function polls for new messages landing in the SQS queue. This Lambda function submits the restoration request(s)—for example, as a free-of-charge Bulk retrieval—using the RestoreObject API. The function subsequently writes the S3 Glacier Job ID of each request in our DynamoDB table. This allows the main Lambda function to check if all Job IDs associated with a workflow launch ID are complete.
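
A simplified sketch of such a function, assuming each SQS message body is a JSON document carrying the workflow launch ID and the list of S3 URLs, and that the DynamoDB table name is supplied through an environment variable (these names are assumptions for illustration, not details from the original architecture):

import json
import os
from urllib.parse import urlparse

import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table(os.environ["RESTORE_TABLE"])  # hypothetical table name

def handler(event, context):
    for record in event["Records"]:               # batch of SQS messages
        message = json.loads(record["body"])
        launch_id = message["launch_id"]          # assumed message shape
        for url in message["s3_urls"]:
            parsed = urlparse(url)                # s3://bucket/key
            bucket, key = parsed.netloc, parsed.path.lstrip("/")
            # Free-of-charge Bulk retrieval; keep the restored copy available for 7 days
            s3.restore_object(
                Bucket=bucket,
                Key=key,
                RestoreRequest={
                    "Days": 7,
                    "GlacierJobParameters": {"Tier": "Bulk"},
                },
            )
            # Track the request so the main function can later check completion
            # (the original design also records a per-request job identifier here)
            table.put_item(Item={
                "launch_id": launch_id,
                "object_key": f"{bucket}/{key}",
                "status": "IN_PROGRESS",
            })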

Update status

The status of our workflow launch will remain WAITING as long as the Glacier object restore is incomplete. The AWS CloudTrail logs of completed S3 Glacier Job IDs invoke our main Lambda function (via an Amazon EventBridge rule) to update the status of the restoration job in our DynamoDB table. With each invocation, the function checks if all Job IDs associated with a workflow launch ID are complete.

After all objects have been restored, the function updates the workflow launch with status READY. This relaunches the workflow with the same launch ID it had prior to the restore.
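
A sketch of that completion check, assuming the DynamoDB table is keyed by launch ID and object key and tracks a per-object status attribute (the table, key, and attribute names are hypothetical):

import os

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table(os.environ["RESTORE_TABLE"])  # hypothetical table name

def all_restores_complete(launch_id):
    """True when every tracked object for this launch has finished restoring (pagination omitted)."""
    items = table.query(KeyConditionExpression=Key("launch_id").eq(launch_id))["Items"]
    return bool(items) and all(item["status"] == "COMPLETE" for item in items)

def mark_workflow_ready(launch_id):
    # Flip the workflow-level status so downstream compute (for example, AWS Batch) can start
    table.update_item(
        Key={"launch_id": launch_id, "object_key": "WORKFLOW"},   # hypothetical sentinel item
        UpdateExpression="SET #s = :ready",
        ExpressionAttributeNames={"#s": "status"},
        ExpressionAttributeValues={":ready": "READY"},
    )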

Conclusion

In this blog post, we demonstrated how life-science research teams can make use of their archival data for genomic studies. We designed an event-driven S3 Glacier restore process, which retrieves data upon request. We discussed how to reliably launch the restore so our workflow doesn’t fail. Also, we determined upfront if an S3 Glacier restore is needed and used the WAITING state to prevent our workflow from failing.

With this solution, life-science research teams can save money using Amazon S3 Glacier without worrying about their day-to-day work or manually administering S3 Glacier object restores.

Happy 10th Anniversary, Amazon S3 Glacier – A Decade of Cold Storage in the Cloud

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/happy-10th-anniversary-amazon-s3-glacier-a-decade-of-cold-storage-in-the-cloud/

Ten years ago, on August 20, 2012, AWS announced the general availability of Amazon Glacier: secure, reliable, and extremely low-cost storage designed for data archiving and backup. At the time, I was working as an AWS customer, and it almost felt like an April Fools’ joke: long-term, secure, and durable cloud storage that let me archive large amounts of data at a very low cost.

In Jeff’s original blog post for this launch, he noted that:

Glacier provides, at a cost as low as $0.01 (one US penny, one one-hundredth of a dollar) per Gigabyte per month, extremely low-cost archive storage. You can store a little bit, or you can store a lot (terabytes, petabytes, and beyond). There’s no upfront fee, and you pay only for the storage that you use. You don’t have to worry about capacity planning, and you will never run out of storage space.

Ten years later, Amazon S3 Glacier has evolved to be the best place in the world for you to store your archive data. The Amazon S3 Glacier storage classes are purpose-built for data archiving, providing you with the highest performance, most retrieval flexibility, and the lowest cost archive storage in the cloud.

You can now choose from three archive storage classes optimized for different access patterns and storage duration – Amazon S3 Glacier Instant Retrieval, Amazon S3 Glacier Flexible Retrieval (formerly Amazon S3 Glacier), and Amazon S3 Glacier Deep Archive. We’ll dive into each of these storage classes in a bit.

A Decade of Innovation in Amazon S3 Glacier
To understand how we got here, we’ll walk through the last decade and revisit some of the most significant Amazon S3 Glacier launches that fundamentally changed archive storage forever:

August 2012 – Amazon Glacier: Archival Storage for One Penny per GB per Month
We launched Amazon Glacier to store any amount of data with high durability at a cost that allows you to get rid of your tape libraries and all the operational complexity and overhead that have been part of data archiving for decades. Amazon Glacier was modeled on S3’s durability and dependability but designed and built from the ground up to offer archival storage at an extremely low cost. At that time, Glacier introduced the concept of a “vault” for storing archival data. You could retrieve your archival data by initiating a request, and the data was made available for download in 3–5 hours.

November 2012 – Archiving Amazon S3 Data to Glacier
While Glacier was purpose-built from the ground up for archival data, many customers had object data that originated in warmer S3 storage that they would eventually want to move to colder storage. To make that easy for customers, Amazon S3 Lifecycle Management (also known as lifecycle rules) integrated S3 and Glacier and made the details visible via the storage class of each object. Lifecycle Management allows you to define time-based rules that can start Transition (changing the S3 storage class to Glacier) and Expiration (deletion of objects). In 2014, we combined the flexibility of S3 versioned objects with Glacier, helping you to further reduce your overall storage costs.
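
For illustration, here is what such a rule might look like with the AWS SDK for Python; the bucket name, prefix, and retention periods are placeholders, not values from the original post:

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-archive-bucket",        # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-then-expire",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            # Transition: change the storage class to Glacier after 90 days
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            # Expiration: delete the objects after roughly 7 years
            "Expiration": {"Days": 2555},
        }]
    },
)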

November 2016 – Glacier Price Reductions and Additional Retrieval Options for Glacier
As part of AWS’s long-term focus on reducing costs and passing along those savings to customers, we reduced the price of Glacier storage to $0.004 per GB per month (less than half a cent) in the US East (N. Virginia) Region, down from $0.007 in 2015 and $0.010 in 2012. To pair very low-cost storage with flexibility in how quickly you can retrieve your data, we also introduced two more retrieval options, priced by the amount of data you stored in Glacier and the rate at which you retrieved it. You could select expedited retrieval (typically 1–5 minutes), bulk retrieval (5–12 hours), or the existing standard retrieval method (3–5 hours).

November 2018 – Amazon S3 Glacier Storage Class to Integrate S3 Experiences
Glacier customers appreciated the way they could easily move data from S3 to Glacier via S3 lifecycle management, and wanted us to expand on that capability to use the most common S3 APIs to operate directly on S3 Glacier objects. So we added S3 Glacier support to the standard S3 PUT API, which enables you to select any storage class, including S3 Glacier, when you store the data. Data can be stored directly in S3 Glacier, eliminating the need to upload to S3 Standard and immediately transition it to S3 Glacier with a zero-day lifecycle policy. In short, you could now PUT to S3 Glacier like any other S3 storage class.
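
As a minimal sketch with the AWS SDK for Python (the bucket and key names are placeholders), storing an object directly in the Glacier storage class looks like this:

import boto3

s3 = boto3.client("s3")

# Store the object directly in the S3 Glacier storage class, skipping S3 Standard
s3.put_object(
    Bucket="example-archive-bucket",   # placeholder
    Key="backups/2018-11.tar.gz",      # placeholder
    Body=b"archive contents",
    StorageClass="GLACIER",
)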

March 2019 – Amazon S3 Glacier Deep Archive – the Lowest Cost Storage in the Cloud
While the original Glacier service offered an extremely low price for archival storage, we challenged ourselves to see if we could find a way to invent an even lower priced storage offering for very cold data. The Amazon S3 Glacier Deep Archive storage class delivers the lowest cost storage, up to 75 percent lower cost (than S3 Glacier Flexible Retrieval), for long-lived archive data that is accessed less than once per year and is retrieved asynchronously. At just $0.00099 per GB-month (or $1 per TB-month), S3 Glacier Deep Archive offers the lowest cost storage in the cloud at prices significantly lower than storing and maintaining data in on-premises tape or archiving data off-site.

November 2020 – Amazon S3 Intelligent-Tiering adds Archive Access and Deep Archive Access tiers
In November 2018, we launched Amazon S3 Intelligent-Tiering, the only cloud storage class that delivers automatic storage cost savings, up to 95 percent when data access patterns change, without performance impact or operational overhead. In order to offer customers the simplicity and flexibility of S3 Intelligent-Tiering and the low storage cost of archival data, we added the Archive Access tier, which provides the same performance and pricing as the S3 Glacier storage class, as well as the Deep Archive Access tier, which offers the same performance and pricing as the S3 Glacier Deep Archive storage class.

November 2021 – Amazon S3 Glacier Flexible Retrieval and S3 Glacier Instant Retrieval
The Amazon S3 Glacier storage class was renamed to Amazon S3 Glacier Flexible Retrieval and now includes free bulk retrievals along with an additional 10 percent price reduction across all Regions, making it optimized for use cases such as backup and disaster recovery.

Additionally, customers asked us for a storage solution that had the low costs of Glacier but allowed for fast access when data was needed very quickly. So, we introduced Amazon S3 Glacier Instant Retrieval, a new archive storage class that delivers the lowest cost storage for long-lived data that is rarely accessed and requires retrieval in milliseconds. You can save up to 68 percent on storage costs compared to using the S3 Standard-Infrequent Access (S3 Standard-IA) storage class when your data is accessed once per quarter.

The Amazon S3 Intelligent-Tiering storage class also recently added a new Archive Instant Access tier, which provides the same performance and pricing as the S3 Glacier Instant Retrieval storage class and delivers automatic cost savings of 68 percent for customers using S3 Intelligent-Tiering with long-lived data.

Then and Now
Customers across all industries and verticals use the S3 Glacier storage classes for every imaginable archival workload. Accessing and using the S3 Glacier storage classes through the S3 APIs and S3 console provides enhanced functionality for data management and cost optimization.

As we discussed above, you can now choose from three archive storage classes optimized for different access patterns and storage duration:

  • S3 Glacier Instant Retrieval – For archive data that needs immediate access, such as medical images, news media assets, or genomics data, choose the S3 Glacier Instant Retrieval storage class, an archive storage class that delivers the lowest cost storage with retrieval in milliseconds.
  • S3 Glacier Flexible Retrieval – For archive data that does not require immediate access but needs to have the flexibility to retrieve large sets of data at no cost, such as backup or disaster recovery use cases, choose the S3 Glacier Flexible Retrieval storage class, with retrieval in minutes or free bulk retrievals in 12 hours.
  • S3 Glacier Deep Archive – For retaining data for 7–10 years or longer to meet customer needs and regulatory compliance requirements, such as financial services, healthcare, media and entertainment, and public sector, choose the S3 Glacier Deep Archive storage class, the lowest cost storage in the cloud with data retrieval within 12–48 hours.

Watch a brief introduction video for an overview of the S3 Glacier storage classes.

All S3 Glacier storage classes are designed for 99.999999999% (11 9s) of durability for objects. Data is redundantly stored across three or more Availability Zones that are physically separated within an AWS Region. Here are some comparisons across the S3 Glacier storage classes at a glance:

Performance | S3 Glacier Instant Retrieval | S3 Glacier Flexible Retrieval | S3 Glacier Deep Archive
Availability | 99.9% | 99.99% | 99.99%
Availability SLA | 99% | 99.9% | 99.9%
Minimum capacity charge per object | 128 KB | 40 KB | 40 KB
Minimum storage duration charge | 90 days | 90 days | 180 days
Retrieval charge | per GB | per GB | per GB
Retrieval time | milliseconds | Expedited (1–5 minutes), Standard (3–5 hours), Bulk (5–12 hours, free) | Standard (within 12 hours), Bulk (within 48 hours)

For data with changing access patterns that you want to automatically archive based on the last access of that data, choose the S3 Intelligent-Tiering storage class. Doing so will optimize storage costs by automatically moving data to the most cost-effective access tier when access patterns change. Its Archive Instant Access, Archive Access, and Deep Archive Access tiers have the same performance as S3 Glacier Instant Retrieval, S3 Glacier Flexible Retrieval, and S3 Glacier Deep Archive respectively. To learn more, see the blog post Automatically archive and restore data with Amazon S3 Intelligent-Tiering.

To get started with S3 Glacier, see the blog post Best practices for archiving large datasets with AWS for key considerations and actions when planning your cold data storage patterns. You can also use a hands-on lab tutorial that will help you get started with the S3 Glacier storage classes in just 20 minutes, and start archiving your data in the S3 Glacier storage classes in the S3 console.

Happy Birthday, Amazon S3 Glacier!
During the last AWS Storage Day 2022, Kevin Miller, VP & GM of Amazon S3, mentioned the 10th anniversary of S3 Glacier and its pace of innovation for many customer use cases throughout his interview with theCUBE.

In this expanding world of data growth, you have to have an archiving strategy. Everyone has archival data — every company, every vertical, and every industry. There is an archiving need not only for companies that have been around for a while but also for digital native businesses.

Lots of AWS customers such as Nasdaq, Electronic Arts, and NASCAR have used S3 Glacier storage classes for their backup and archiving workloads. The following are some additional recent customer-authored blogs focusing on AWS archiving best practices from customers in the financial, media, gaming, and software industries.

A big thank you to all of our S3 Glacier customers from around the world! Over 90 percent of S3’s roadmap has come directly from feedback from customers like you. We will never stop listening to you, as your feedback and ideas are essential to how we improve the service. Thank you for trusting us and for constantly raising the bar and pushing us to improve to lower costs, simplify your storage, increase your agility, and allow you to innovate faster.

In accordance with Customer Obsession, one of the Amazon Leadership Principles, your feedback is always welcome! If you want to see new S3 Glacier features and capabilities, please send any feedback to AWS re:Post for S3 Glacier or through your usual AWS Support contacts.

– Channy

Orchestrating Amazon S3 Glacier Deep Archive object retrieval using AWS Step Functions

Post Syndicated from Eric Johnson original https://aws.amazon.com/blogs/compute/orchestrating-amazon-s3-glacier-deep-archive-object-retrieval-using-aws-step-functions/

This blog was written by Monica Cortes Sack, Solutions Architect, Oskar Neumann, Partner Solutions Architect, and Dhiraj Mahapatro, Principal Specialist SA, Serverless.

AWS Step Functions now supports over 220 services and over 10,000 AWS API actions. This enables you to use the AWS SDK integration directly instead of writing an AWS Lambda function as a proxy.

One such service integration is with Amazon S3. Today, you typically write scripts using AWS CLI S3 commands to automate S3 tasks. For example, you might use scripts to integrate S3 with AWS Transfer Family, build a custom security check, take action on S3 buckets when objects are created, or orchestrate a workflow around S3 Glacier Deep Archive object retrieval. These script executions do not provide an execution history or an easy way to validate the behavior.

Step Functions’ AWS SDK integration with S3 declaratively creates serverless workflows around S3 tasks without relying on those scripts. You can validate the execution history and behavior of a Step Functions workflow.

This blog highlights one of the S3 use cases. It shows how to orchestrate workflows around S3 Glacier Deep Archive object retrieval, cost estimation, and interaction with the requester using Step Functions. The demo application provides additional details on the entire architecture.

S3 Glacier Deep Archive is a storage class in S3 used for data that is rarely accessed. The service provides durable and secure long-term storage, trading immediate access for cost-effectiveness. You must restore archived objects before they are downloadable. It supports two options for object retrieval:

  1. Standard – Access objects within 12 hours of the start of the restoration process.
  2. Bulk – Access objects within 48 hours of the start of the restoration process.
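
The choice maps to the Tier field of the RestoreObject API. A minimal sketch with the AWS SDK for Python, using placeholder bucket and key names:

import boto3

s3 = boto3.client("s3")

# Standard retrieval: the object is typically accessible within 12 hours
s3.restore_object(
    Bucket="example-research-backups",     # placeholder
    Key="datasets/run-042.tar",            # placeholder
    RestoreRequest={
        "Days": 5,                         # how long to keep the restored copy available
        "GlacierJobParameters": {"Tier": "Standard"},  # use "Bulk" for the 48-hour option
    },
)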

Business use case

Consider a research institute that stores backups on S3 Glacier Deep Archive. The backups are maintained in S3 Glacier Deep Archive for redundancy. The institute has multiple researchers with one central IT team. When a researcher requests an object from S3 Glacier Deep Archive, the central IT team retrieves it and charges the corresponding research group for retrieval and data transfer costs.

Researchers are the end users and do not operate in the AWS Cloud. They run computing clusters on-premises and depend on the central IT team to provide them with the restored archive. A member of the research team requesting an object retrieval provides the following information to the central IT team:

  1. Object key to be restored.
  2. The number of days the researcher needs the object accessible for download.
  3. Researcher’s email address.
  4. The required retrieval SLA: within 12 hours or within 48 hours. This determines whether “Standard” or “Bulk” retrieval is used, respectively.

The following overall architecture explains the setup on AWS and the interaction between a researcher and the central IT team’s architecture.

Architecture overview

Architecture diagram

  1. The researcher uses a front-end application to request object retrieval from S3 Glacier Deep Archive.
  2. Amazon API Gateway synchronously invokes AWS Step Functions Express Workflow.
  3. Step Functions initiates RestoreObject from S3 Glacier Deep Archive.
  4. Step Functions stores the metadata of this retrieval in an Amazon DynamoDB table.
  5. Step Functions uses Amazon SES to email the researcher about archive retrieval initiation.
  6. Upon completion, S3 sends the RestoreComplete event to Amazon EventBridge.
  7. An EventBridge rule triggers another Step Functions workflow for post-processing after the restore is complete.
  8. A Lambda function inside the Step Functions calculates the estimated cost (retrieval and data transfer out) and updates existing metadata in the DynamoDB table.
  9. Data from the DynamoDB table is synced using Amazon Athena Federated Queries to generate a reports dashboard in Amazon QuickSight.
  10. Step Functions uses SES to email the researcher with cost details.
  11. Once the researcher receives an email, the researcher uses the front-end application to call the /download API endpoint.
  12. API Gateway invokes a Lambda function that generates a pre-signed S3 URL of the retrieved object and returns it in the response.

Setup

  1. To run the sample application, you must install CDK v2, Node.js, and npm.
  2. To clone the repository, run:
    git clone https://github.com/aws-samples/aws-stepfunctions-examples.git
    cd cdk/app-glacier-deep-archive-retrieval
  3. To deploy the application, run:
    cdk deploy --all

Identifying workflow components

Starting the restore object workflow

The first component is accepting the researcher’s request to start the archive retrieval process. The sample application provides a basic front-end app that shows the files from an S3 bucket that has objects stored in S3 Glacier Deep Archive. The researcher submits file retrieval requests from this front-end application, which is reached via the sample application’s Amazon CloudFront URL.

Glacier Deep Archive Retrieval menu

The front-end app asks the researcher for an email address, the number of days the researcher wants the object to be available for download, and the acceptable retrieval time. Based on the selected retrieval time, the researcher accepts either Standard or Bulk object retrieval. To test this, put objects in the data bucket under the S3 Glacier Deep Archive storage class and use the front-end application to retrieve them.

Item retrieval prompt

The researcher then chooses Retrieve file. This action invokes an API endpoint provided by API Gateway. The API Gateway endpoint synchronously invokes a Step Functions Express Workflow. This validates the restore object request, gets the object metadata, and starts to restore the object from S3 Glacier Deep Archive.

The state machine stores the metadata of the restore object AWS SDK call in a DynamoDB table for later use. You can use this metadata to build a dashboard in Amazon QuickSight for reporting and administration purposes. Finally, the state machine uses Amazon SES to email the researcher, notifying them about the restore object initiation process:

Restore object initiated

The following state machine shows the workflow:

Workflow diagram

The ability to call S3 APIs declaratively through the AWS SDK integration in Step Functions makes it convenient to integrate with S3. This approach avoids writing a Lambda function to wrap the SDK calls. The following portion of the state machine definition shows the usage of the S3 HeadObject and RestoreObject APIs:

"Get Object Metadata": {
    "Next": "Initiate Restore Object from Deep Archive",
    "Catch": [{
        "ErrorEquals": ["States.ALL"],
        "Next": "Bad Request"
    }],
    "Type": "Task",
    "ResultPath": "$.result.metadata",
    "Resource": "arn:aws:states:::aws-sdk:s3:headObject",
    "Parameters": {
        "Bucket": "glacierretrievalapp-databucket-abc123",
        "Key.$": "$.fileKey"
    }
}, 
"Initiate Restore Object from Deep Archive": {
    "Next": "Update restore operation metadata",
    "Type": "Task",
    "ResultPath": null,
    "Resource": "arn:aws:states:::aws-sdk:s3:restoreObject",
    "Parameters": {
        "Bucket": "glacierretrievalapp-databucket-abc123",
        "Key.$": "$.fileKey",
        "RestoreRequest": {
            "Days.$": "$.requestedForDays"
        }
    }
}

You can extend the previous workflow and build your own Step Functions workflows to orchestrate other S3 related workflows.

Processing after object restoration completion

S3 RestoreObject is a long-running process for S3 Glacier Deep Archive objects. S3 emits an Object Restore Completed event notification to EventBridge when the object restore completes. You set up an EventBridge rule to trigger another Step Functions workflow as a target for this event. This workflow takes care of the object restoration post-processing.

cfnDataBucket.addPropertyOverride('NotificationConfiguration.EventBridgeConfiguration.EventBridgeEnabled', true);

An EventBridge rule triggers the following Step Functions workflow and passes the event payload as an input to the Step Functions execution:

new aws_events.Rule(this, 'invoke-post-processing-rule', {
  eventPattern: {
    source: ["aws.s3"],
    detailType: [
      "Object Restore Completed"
    ],
    detail: {
      bucket: {
        name: [props.dataBucket.bucketName]
      }
    }
  },
  targets: [new aws_events_targets.SfnStateMachine(this.stateMachine, {
    input: aws_events.RuleTargetInput.fromObject({
      's3Event': aws_events.EventField.fromPath('$')
    })
  })]
});

The Step Functions workflow gets object metadata from the DynamoDB table and then invokes a Lambda function to calculate the estimated cost. The Lambda function calculates the estimated retrieval and the data transfer costs using the contentLength of the retrieved object and the Price List API for the unit cost. The workflow then updates the calculated cost in the DynamoDB table.
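
A simplified sketch of that estimate, with hypothetical per-GB rates standing in for the values the sample application resolves at runtime from the Price List API:

# Hypothetical per-GB rates; the sample application resolves these at runtime
# with the AWS Price List GetProducts API instead of hardcoding them.
RETRIEVAL_PRICE_PER_GB = 0.0025       # assumed bulk retrieval rate for S3 Glacier Deep Archive
TRANSFER_OUT_PRICE_PER_GB = 0.09      # assumed data transfer out to the internet rate

def estimate_restore_cost(content_length_bytes):
    """Estimate the retrieval plus data transfer cost for one restored object, in USD."""
    size_gb = content_length_bytes / (1024 ** 3)
    return round(size_gb * (RETRIEVAL_PRICE_PER_GB + TRANSFER_OUT_PRICE_PER_GB), 4)

# Example: a 250 GiB archive
print(estimate_restore_cost(250 * 1024 ** 3))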

The retrieval cost and the data transfer out cost are proportional to the size of the retrieved object. The Step Functions workflow also invokes a Lambda function to create the download API URL for object retrieval. Finally, it emails the researcher with the estimated cost and the download URL as a restoration completion notification.

Workflow studio diagram

The email notification to the researcher looks like:

Email example

Downloading the restored object

Once the object restoration is complete, the researcher can download the object from the front-end application.

Front end retrieval menu

The researcher chooses the Download action, which invokes another API Gateway endpoint. The endpoint integrates with a Lambda function backend that creates a pre-signed S3 URL and returns it in the response to the browser.
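
A minimal sketch of such a backend with the AWS SDK for Python; the event shape and bucket name are assumptions for illustration:

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Create a time-limited download link for the restored object
    url = s3.generate_presigned_url(
        "get_object",
        Params={
            "Bucket": "glacierretrievalapp-databucket-abc123",    # placeholder, as in the state machine above
            "Key": event["queryStringParameters"]["fileKey"],     # assumed request shape
        },
        ExpiresIn=3600,   # the link stays valid for one hour
    )
    return {"statusCode": 200, "body": url}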

Administering object restoration usage

This architecture also provides a view for the central IT team to understand object restoration usage. You achieve this by creating reports and dashboards from the metadata stored in DynamoDB.

The sample application uses Amazon Athena Federated Queries and Amazon Athena DynamoDB Connector to generate a reports dashboard in Amazon QuickSight. You can also use Step Functions AWS SDK integration with Amazon Athena and visualize the workflows in the Athena console.

The following QuickSight visualization shows the count of restored S3 Glacier Deep Archive objects by their contentType:

QuickSight visualization

Considerations

With the preceding approach, you should consider that:

  1. You must start the object retrieval in the same Region as the Region of the archived object.
  2. S3 Glacier Deep Archive only supports standard and bulk retrievals.
  3. You must enable the “Object Restore Completed” event notification on the S3 bucket with the S3 Glacier Deep Archive object.
  4. The researcher’s email must be verified in SES.
  5. Use a Lambda function to call the Price List GetProducts API, because its service endpoints are available only in specific Regions.

Cleanup

To clean up the infrastructure used in this sample application, run:

cdk destroy --all

Conclusion

Step Functions’ AWS SDK integration opens up different opportunities to orchestrate a workflow. Step Functions provides native support for retries and error handling, which offloads the heavy lifting of handling them manually in scripts.

This blog shows one example use case with S3 Glacier Deep Archive. With AWS SDK integration in Step Functions, you can build any workflow orchestration using S3 or S3 Control APIs: for example, a workflow that enforces AWS Key Management Service encryption based on an S3 event, or one that creates static website hosting on demand in a few steps.

With different S3 API calls available in Step Functions’ Workflow Studio, you can declaratively build a Step Functions workflow instead of imperatively calling each S3 API from a shell script or command line. Refer to the demo application for more details.

For more serverless learning resources, visit Serverless Land.

New – Additional Checksum Algorithms for Amazon S3

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-additional-checksum-algorithms-for-amazon-s3/

Amazon Simple Storage Service (Amazon S3) is designed to provide 99.999999999% (11 9s) of durability for your objects and for the metadata associated with your objects. You can rest assured that S3 stores exactly what you PUT, and returns exactly what is stored when you GET. In order to make sure that the object is transmitted back and forth properly, S3 uses checksums: basically, a kind of digital fingerprint.

S3’s PutObject function already allows you to pass the MD5 checksum of the object, and only accepts the operation if the value that you supply matches the one computed by S3. While this allows S3 to detect data transmission errors, it does mean that you need to compute the checksum before you call PutObject or after you call GetObject. Further, computing checksums for large (multi-GB or even multi-TB) objects can be computationally intensive, and can lead to bottlenecks. In fact, some large S3 users have built special-purpose EC2 fleets solely to compute and validate checksums.

New Checksum Support
Today I am happy to tell you about S3’s new support for four checksum algorithms. It is now very easy for you to calculate and store checksums for data stored in Amazon S3 and to use the checksums to check the integrity of your upload and download requests. You can use this new feature to implement the digital preservation best practices and controls that are specific to your industry. In particular, you can specify the use of any one of four widely used checksum algorithms (SHA-1, SHA-256, CRC-32, and CRC-32C) when you upload each of your objects to S3.

Here are the principal aspects of this new feature:

Object Upload – The newest versions of the AWS SDKs compute the specified checksum as part of the upload, and include it in an HTTP trailer at the conclusion of the upload. You also have the option to supply a precomputed checksum. Either way, S3 will verify the checksum and accept the operation if the value in the request matches the one computed by S3. In combination with the use of HTTP trailers, this feature can greatly accelerate client-side integrity checking.

Multipart Object Upload – The AWS SDKs now take advantage of client-side parallelism and compute checksums for each part of a multipart upload. The checksums for all of the parts are themselves checksummed and this checksum-of-checksums is transmitted to S3 when the upload is finalized.

Checksum Storage & Persistence – The verified checksum, along with the specified algorithm, are stored as part of the object’s metadata. If Server-Side Encryption with KMS Keys is requested for the object, then the checksum is stored in encrypted form. The algorithm and the checksum stick to the object throughout its lifetime, even if it changes storage classes or is superseded by a newer version. They are also transferred as part of S3 Replication.

Checksum Retrieval – The new GetObjectAttributes function returns the checksum for the object and (if applicable) for each part.
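
For example, with the AWS SDK for Python the call looks roughly like this (bucket and key are placeholders):

import boto3

s3 = boto3.client("s3")

# Retrieve the stored checksum (and per-part checksums for multipart uploads)
response = s3.get_object_attributes(
    Bucket="example-bucket",    # placeholder
    Key="data/report.csv",      # placeholder
    ObjectAttributes=["Checksum", "ObjectParts", "ObjectSize"],
)
print(response.get("Checksum"))   # for example, {"ChecksumSHA1": "..."}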

Checksums in Action
You can access this feature from the AWS Command Line Interface (CLI), AWS SDKs, or the S3 Console. In the console, I enable the Additional Checksums option when I prepare to upload an object:

Then I choose a Checksum function:

If I have already computed the checksum I can enter it, otherwise the console will compute it.

After the upload is complete I can view the object’s properties to see the checksum:

The checksum function for each object is also listed in the S3 Inventory Report.

From my own code, the SDK can compute the checksum for me:

import boto3

s3 = boto3.client("s3")

# bucket, key, and file_path are assumed to be defined earlier
with open(file_path, 'rb') as file:
    r = s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=file,
        ChecksumAlgorithm='SHA1'
    )

Or I can compute the checksum myself and pass it to put_object:

with open(file_path, 'rb') as file:
    r = s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=file,
        ChecksumSHA1='fUM9R+mPkIokxBJK7zU5QfeAHSy='
    )

When I retrieve the object, I specify checksum mode to indicate that I want the returned object validated:

r = s3.get_object(Bucket=bucket, Key=key, ChecksumMode='ENABLED')

The actual validation happens when I read the object from r['Body'], and an exception will be raised if there’s a mismatch.

Watch the Demo
Here’s a demo (first shown at re:Invent 2021) of this new feature in action:

Available Now
The four additional checksum algorithms are now available in all commercial AWS Regions, and you can start using them today at no extra charge.

Jeff;

Amazon S3 Glacier is the Best Place to Archive Your Data – Introducing the S3 Glacier Instant Retrieval Storage Class

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/aws/amazon-s3-glacier-is-the-best-place-to-archive-your-data-introducing-the-s3-glacier-instant-retrieval-storage-class/

Today we are announcing the Amazon S3 Glacier Instant Retrieval storage class. This new archive storage class delivers the lowest cost storage for long-lived data that is rarely accessed and requires millisecond retrieval.

We are also excited to announce that S3 Intelligent-Tiering now automatically optimizes storage costs for rarely accessed data that needs immediate retrieval with the new Archive Instant Access tier, which is ideal for data with unknown or changing access patterns. For existing customers, this will provide an immediate savings of 68 percent for data that hasn’t been accessed for more than 90 days, with no action needed. The Frequent, Infrequent, and now Archive Instant Access tiers are designed for the same milliseconds access time and high-throughput performance.

In addition, we are announcing the new name for the existing Amazon S3 Glacier storage class and several price reductions.

Amazon S3 Glacier Instant Retrieval
The Amazon S3 Glacier storage classes are extremely low-cost and built for data archiving. They are secure and durable, and they are designed to provide the lowest cost for data that does not require immediate access, with retrieval options from minutes to hours.

Many customers need to store rarely accessed data for several years. However, the data must be highly available and immediately accessible. Today, these customers use the S3 Standard-Infrequent Access (S3 Standard-IA) storage class. This storage class offers low cost for storage and allows customers to retrieve their data instantly.

S3 Glacier Instant Retrieval is a new storage class that delivers the fastest access to archive storage, with the same low latency and high-throughput performance as the S3 Standard and S3 Standard-IA storage classes. You can save up to 68 percent on storage costs as compared with using the S3 Standard-IA storage class when you use the S3 Glacier Instant Retrieval storage class and pay a low price to retrieve data. For example, in the US East (N. Virginia) Region, S3 Glacier Instant Retrieval storage pricing is $0.004 per GB-month and data retrieval is $0.03 per GB. Learn more about pricing for your Region.
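
As a rough worked example using those US East (N. Virginia) prices (request charges and data transfer are not included):

# Rough monthly cost for 1 TB (1,024 GB) in S3 Glacier Instant Retrieval,
# using the US East (N. Virginia) prices quoted above.
size_gb = 1024
storage_per_month = size_gb * 0.004    # about $4.10 per month
one_full_retrieval = size_gb * 0.03    # about $30.72 per full retrieval
print(storage_per_month, one_full_retrieval)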

Media archives, medical images, or user-generated content are just a few examples of ideal use cases for S3 Glacier Instant Retrieval. Once created, this content is rarely accessed, but when it is needed it must be available in milliseconds.

To get started using the new storage class from the Amazon S3 console, upload an object as you would normally, and select the S3 Glacier Instant Retrieval storage class.

Upload object with the new storage class

This feature is available programmatically from AWS SDKs, AWS Command Line Interface (CLI), and AWS CloudFormation.

In my opinion, the easiest way to store data in S3 Glacier Instant Retrieval is to use the S3 PUT API via the CLI. When using this API, set the storage class to GLACIER_IR.

aws s3api put-object --bucket <bucket-name> --key <object-key> --body <name-file> --storage-class GLACIER_IR

When the object is uploaded to Amazon S3, verify the storage class in the list of objects or on the object details page.

Storage classes

For data that already exists in Amazon S3, you can use S3 Lifecycle to transition data from the S3 Standard and S3 Standard-IA storage classes into S3 Glacier Instant Retrieval.

New Archive Instant Access Tier in S3 Intelligent-Tiering
S3 Intelligent-Tiering is a storage class that automatically moves objects between access tiers to optimize costs. This is the recommended storage class for data with unpredictable or changing access patterns, such as in data lakes, analytics, or user-generated content.

Until today, there were two low latency access tiers optimized for frequent and infrequent access, and two optional archive access tiers designed for asynchronous access, optimized for rarely accessed data at a low cost.

Beginning today, the Archive Instant Access tier is added as a new access tier in the S3 Intelligent-Tiering storage class. You will start seeing automatic cost savings for your storage in S3 Intelligent-Tiering for rarely accessed objects.

The Archive Instant Access tier joins the group of low latency access tiers. This new tier is optimized for data that is not accessed for months at a time but, when it is needed, is available within milliseconds.

S3 Intelligent-Tiering automatically stores objects in three access tiers that deliver the same performance as the S3 Standard storage class:

  • Frequent Access tier
  • Infrequent Access tier
  • Archive Instant Access (new)

For a small monitoring and automation charge, S3 Intelligent-Tiering monitors access patterns and moves objects between the different access tiers. Objects that have not been accessed for 30 consecutive days are moved from the Frequent Access tier to the Infrequent Access tier for savings of 40 percent. When an object hasn’t been accessed for 90 consecutive days, S3 Intelligent-Tiering will move the object from the Infrequent Access tier to the Archive Instant Access tier, with a savings of 68 percent. If the data is accessed later, it is automatically moved back to the Frequent Access tier. No tiering charges apply when objects are moved between access tiers within the S3 Intelligent-Tiering storage class.

S3 Intelligent-Tiering access tiers

To get started with this new access tier, select Intelligent-Tiering as the storage class for an object when uploading an object using the S3 console. After 90 days of inactivity (30 days in Frequent Access tier and 60 days in Infrequent Access tier), S3 Intelligent-Tiering will automatically move the object to the Archive Instant Access tier. The introduction of the new Archive Instant Access tier has no impact on performance when you retrieve objects.

New name for the Amazon S3 Glacier storage class – S3 Glacier Flexible Retrieval
The existing Amazon S3 Glacier storage class is now named S3 Glacier Flexible Retrieval. This storage class now has free bulk retrievals in 5 to 12 hours, and the storage price has been reduced by 10 percent in all Regions, effective December 1, 2021. S3 Glacier Flexible Retrieval is now even more cost-effective, and the free bulk retrievals make it ideal for retrieving large data volumes.

These are the Amazon S3 archive storage classes:

  • S3 Glacier Instant Retrieval: The newest storage class is optimized for long-lived data that is rarely accessed (typically once per quarter). However when data is needed, it is available within milliseconds. For example, medical images and news media assets are perfect for this storage class.
  • S3 Glacier Flexible Retrieval: This newly renamed storage class is optimized for archiving data that can be retrieved in minutes or with free bulk retrievals in 5 to 12 hours. This storage class is ideal for backups and disaster recovery use cases, where you have large amounts of long-term, rarely accessed data, and you don’t want to worry about retrieval costs when you need the data.
  • S3 Glacier Deep Archive: This storage class is the lowest-cost storage in the cloud and is optimized for archiving data that can be restored in at least 12 hours. It’s great for storing your compliance archives or for digital media preservation.

Amazon S3 has reduced storage prices!
We are excited to announce that Amazon S3 has reduced storage prices of up to 31 percent in the S3 Standard-IA and S3 One Zone-IA storage classes across 9 AWS Regions: US West (N. California), Asia Pacific (Hong Kong), Asia Pacific (Mumbai), Asia Pacific (Osaka), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), and South America (São Paulo). These price reductions are effective December 1, 2021.

Learn more about price reduction details.

Available Now
The new storage class, S3 Glacier Instant Retrieval, and the new Archive Instant Access tier in S3 Intelligent-Tiering are available today (November 30, 2021) in all AWS Regions.

The price cut for S3 Glacier Flexible Retrieval and the free bulk retrievals in all AWS Regions, and the price reductions for the S3 Standard-Infrequent Access and S3 One Zone-Infrequent Access storage classes in nine Regions, will be effective on December 1, 2021.

Learn more about the storage classes changes and all the storage classes.

Marcia