Tag Archives: Amazon Simple Storage Service (S3)

AWS Week in Review – March 20, 2023

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/aws-week-in-review-march-20-2023/

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

A new week starts, and Spring is almost here! If you’re curious about AWS news from the previous seven days, I got you covered.

Last Week’s Launches
Here are the launches that got my attention last week:

Picture of an S3 bucket and AWS CEO Adam Selipsky.Amazon S3 – Last week there was AWS Pi Day 2023 celebrating 17 years of innovation since Amazon S3 was introduced on March 14, 2006. For the occasion, the team released many new capabilities:

Amazon Linux 2023 – Our new Linux-based operating system is now generally available. Sébastien’s post is full of tips and info.

Application Auto Scaling – Now can use arithmetic operations and mathematical functions to customize the metrics used with Target Tracking policies. You can use it to scale based on your own application-specific metrics. Read how it works with Amazon ECS services.

AWS Data Exchange for Amazon S3 is now generally available – You can now share and find data files directly from S3 buckets, without the need to create or manage copies of the data.

Amazon Neptune – Now offers a graph summary API to help understand important metadata about property graphs (PG) and resource description framework (RDF) graphs. Neptune added support for Slow Query Logs to help identify queries that need performance tuning.

Amazon OpenSearch Service – The team introduced security analytics that provides new threat monitoring, detection, and alerting features. The service now supports OpenSearch version 2.5 that adds several new features such as support for Point in Time Search and improvements to observability and geospatial functionality.

AWS Lake Formation and Apache Hive on Amazon EMR – Introduced fine-grained access controls that allow data administrators to define and enforce fine-grained table and column level security for customers accessing data via Apache Hive running on Amazon EMR.

Amazon EC2 M1 Mac Instances – You can now update guest environments to a specific or the latest macOS version without having to tear down and recreate the existing macOS environments.

AWS Chatbot – Now Integrates With Microsoft Teams to simplify the way you troubleshoot and operate your AWS resources.

Amazon GuardDuty RDS Protection for Amazon Aurora – Now generally available to help profile and monitor access activity to Aurora databases in your AWS account without impacting database performance

AWS Database Migration Service – Now supports validation to ensure that data is migrated accurately to S3 and can now generate an AWS Glue Data Catalog when migrating to S3.

AWS Backup – You can now back up and restore virtual machines running on VMware vSphere 8 and with multiple vNICs.

Amazon Kendra – There are new connectors to index documents and search for information across these new content: Confluence Server, Confluence Cloud, Microsoft SharePoint OnPrem, Microsoft SharePoint Cloud. This post shows how to use the Amazon Kendra connector for Microsoft Teams.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
A few more blog posts you might have missed:

Example of a geospatial query.Women founders Q&A – We’re talking to six women founders and leaders about how they’re making impacts in their communities, industries, and beyond.

What you missed at that 2023 IMAGINE: Nonprofit conference – Where hundreds of nonprofit leaders, technologists, and innovators gathered to learn and share how AWS can drive a positive impact for people and the planet.

Monitoring load balancers using Amazon CloudWatch anomaly detection alarms – The metrics emitted by load balancers provide crucial and unique insight into service health, service performance, and end-to-end network performance.

Extend geospatial queries in Amazon Athena with user-defined functions (UDFs) and AWS Lambda – Using a solution based on Uber’s Hexagonal Hierarchical Spatial Index (H3) to divide the globe into equally-sized hexagons.

How cities can use transport data to reduce pollution and increase safety – A guest post by Rikesh Shah, outgoing head of open innovation at Transport for London.

For AWS open-source news and updates, here’s the latest newsletter curated by Ricardo to bring you the most recent updates on open-source projects, posts, events, and more.

Upcoming AWS Events
Here are some opportunities to meet:

AWS Public Sector Day 2023 (March 21, London, UK) – An event dedicated to helping public sector organizations use technology to achieve more with less through the current challenging conditions.

Women in Tech at Skills Center Arlington (March 23, VA, USA) – Let’s celebrate the history and legacy of women in tech.

The AWS Summits season is warming up! You can sign up here to know when registration opens in your area.

That’s all from me for this week. Come back next Monday for another Week in Review!


Realtime monitoring of microservices and cloud-native applications with IBM Instana SaaS on AWS

Post Syndicated from Eduardo Monich Fronza original https://aws.amazon.com/blogs/architecture/realtime-monitoring-of-microservices-and-cloud-native-applications-with-ibm-instana-saas-on-aws/

Customers are adopting microservices architecture to build innovative and scalable applications on Amazon Web Services (AWS). These microservices applications are deployed across multiple AWS services, and customers are looking for comprehensive observability solutions that can help them effectively monitor and manage the performance of their applications in real-time.

IBM Instana is a fully automated application performance management (APM) solution, available to customers as a fully managed software as a service (SaaS) solution on AWS. It is specifically designed to help customers address the challenges of monitoring microservices and cloud-native applications in real-time. It uses artificial intelligence and machine learning to provide detailed insights into the health and behavior of applications, allowing developers and IT teams to gain real-time insights into their microservices applications, optimize performance, and quickly identify and troubleshoot issues.

This post explains the capabilities of IBM Instana to automatically collect observability metrics, traces, and events from microservices deployed on AWS cloud, as well as on-premises, to provide full visibility into the performance of individual components and applications as a whole.

IBM Instana solution overview

IBM Instana is designed to be highly scalable and adaptable to changing microservices applications environments. Its architecture (Figure 1) consists of several components that work together to provide comprehensive monitoring for microservices and cloud-native applications.

Instana’s main building blocks are host agents and agent sensors that are deployed in a customer’s AWS account and responsible for collecting, aggregating, and sending detailed monitoring information of applications and AWS services to the Instana SaaS backend.

The Instana SaaS backend services provide several key components, including data collectors, storage services, analytics engines, and user interfaces. It allows customers to process and analyze data in real-time, generate actionable insights, have a comprehensive view of their applications and infrastructure performance, enabling them to quickly identify and resolve issues and improve their overall operations.

IBM Instana architecture on AWS

Figure 1. IBM Instana architecture on AWS

Monitoring data

Instana monitors and observes microservices and cloud-native applications by collecting beacons, traces, and one-second metrics:

  • Beacons are small monitoring payloads that are transmitted by a JavaScript agent to the Instana servers, modeling specific events occurring within the lifecycle of a page view of a website; for example, page loading, resource retrieval, and HTTP requests.
  • Traces are detailed records of the requests and transactions that flow through a microservice architecture. They record the sequence of events that occur when a request is processed, including the services that are involved, the duration of each service, and any errors or exceptions that occur. Instana automatically correlates traces across services to provide a complete view of an entire transaction. This allows for easy identification and diagnosis of performance issues.
  • Metrics are numerical values that represent the performance and resource utilization of a microservice or infrastructure component. Metrics are collected by Instana Agents and sent to the Instana backend at regular intervals. Instana Agents collect hundreds of different metrics, including (but not limited to) CPU usage, memory usage, network traffic, and disk I/O.

This information is captured by Instana agents and sensors, which also collect application configurations and events, plus discover application building blocks, including clusters, containers, and services.

IBM Instana agents and sensors

The Instana host agent is a lightweight software component that collects and aggregates data from various sensors before sending the data to the Instana backend. It can be deployed to AWS services, including Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Kubernetes Service (Amazon EKS), AWS Fargate, AWS Lambda, or Red Hat OpenShift Service on AWS (ROSA). A single host agent, one per host, is used to collect data from monitored systems.

Once Instana agents are running, they automatically detect applications and services, such as containers running on Amazon EKS, and processes like Nginx, NodeJS, Spring Boot, Postgres, Elasticsearch, or Cassandra. For each component detected, different Instana sensors are automatically downloaded, installed, and configured to monitor the environment.

Instana sensors are small programs that are designed to attach and monitor one specific technology and pass their data to the agent. They are automatically managed, updated, loaded, and unloaded by the host agent.

These sensors can monitor several different AWS services like Lambda, Amazon DynamoDB, Amazon Simple Storage Service (Amazon S3), Amazon Aurora, Amazon Simple Queue Service, and Amazon Managed Streaming for Apache Kafka. They collect data—like request and error rates, latency, CPU utilization—via AWS APIs and Amazon CloudWatch.

Instana also provides sensors to collect data from applications running on AWS, like IBM MQ, IBM Db2, or Red Hat OpenShift Container Platform. Review IBM’s full list of supported technologies and AWS services.

Instana also provides tracers, which are used with runtimes like Java, .NET, NodeJS, plus others. They modify code execution to capture logs, traces at request level, and send those back to the Instana agent.

With the use of sensors, the host agent collects configuration data and monitors the applications it has detected. The host agent also handles communications with the Instana SaaS backend services. It collects, aggregates and sends logs, traces and records metrics (such as response times, error rates, and resource utilization) every second to the Instana SaaS backend in real-time, using secure and efficient communication protocols.

IBM Instana SaaS

The Instana SaaS backend is the heart of the Instana APM solution and responsible for processing, storing, and analyzing the monitoring data collected from the Instana agents and sensors installed in the customer’s infrastructure.

It consists of several components and services that work together to provide real-time monitoring and analysis of microservices applications, including:

  • Data collectors: Receive and process data from the Instana agents and sensors, and store it in the Instana backend for further analysis.
  • Analytics engine: Analyzes the data collected by the agents and sensors to provide insights into the performance and health of the microservices applications.
  • User interface: Web-based interface that customers use to view and analyze their monitoring data.
  • Alerting engine: Generates alerts when thresholds or anomalies are detected in the monitoring data.
  • Data storage: Time-series database that stores the monitoring data collected by the agents and sensors. Allows customers to query and analyze the data in real-time.
  • Integrations: Integrates with various third-party tools, such as Slack, PagerDuty, and ServiceNow, providing seamless alerting and incident management.

IBM Instana backend: making sense of the situation in real time

The Instana SaaS platform automatically ingests data from agents and continuously updates a dependency map (Figure 2). This map presents every dependency in context, giving users an easy way to understand the interrelationships between application components and services.

This understanding enables users to identify the upstream and downstream impacts of any issue, ensuring that they stay informed about any potential impacts.

An example of an IBM Instana dependency map

Figure 2. An example of an IBM Instana dependency map

Instana traces every request end-to-end without sampling. The traces are analyzed in real-time, providing metrics that make any performance problems immediately visible. In the event of an incident, Instana can illustrate how a single issue can generate a ripple effect and impact a number of directly and indirectly connected services. Using the relationship information from the Dynamic Graph, Instana’s automatic root-cause analysis can precisely aggregate the individual issues into a single incident.

Applications monitoring with IBM Instana

Figure 3. Applications monitoring with IBM Instana

Developers, IT operations, or site reliability engineers (SREs) can access the Instana backend end-user monitoring interface (Figure 3) or end-user monitoring (EUM) interface (Figure 4) to view monitoring data of their workloads. These can be websites, mobile applications, AWS services, and infrastructure levels. From this UI, these personas can access service dashboards that show key performance indicators (KPIs), like response time and error rate.

End-user monitoring with IBM Instana

Figure 4. End-user monitoring with IBM Instana

The following actions demonstrate how an EUM for a JavaScript application, deployed to Amazon S3 can be completed:

  • Developers inject Instana JavaScript code (Figure 5) into the static website (HTML).
  • When a user visits the website, the JavaScript agent sends beacons to the Instana backend.
  • Dashboards show specific events of the website lifecycle, including page loading, JS errors, and HTTP requests.
  • Teams access Instana UI to check performance matrices. They can configure Smart Alerts with custom alerting policies based on specific metrics and KPIs.
  • Smart Alerts can send alerts via various channels, such as email, Slack, or IBM Watson AIOps Webhook.
  • In case of an incident, teams can use Instana to retrieve various performance metrics for root-cause analysis.
  • Developers can resolve the issues and apply the patch.
IBM Instana EUM JavaScript agent

Figure 5. IBM Instana EUM JavaScript agent

Instana also offers Smart Alerts (Figure 6) to provide a more intuitive process of managing alerts. With Smart Alerts, customers can automatically generate alerting configurations using relevant KPIs and automatic threshold detection for use cases like website slowness or website errors.

IBM Instana Smart Alerts

Figure 6. IBM Instana Smart Alerts


In this post, we discussed how IBM Instana provides a comprehensive monitoring solution with the right tools to help you implement a real-time observability and monitoring solution. It allows you to gain insight into your microservices and cloud-native applications, including visibility into AWS services, containers, on-premises infrastructure, and other technologies. Instana can quickly identify and resolve issues before they impact end-users, ensuring that your applications are performing optimally.

As an IT administrator, developer, or business owner, IBM Instana on AWS give a deeper understanding of your applications and help you make data-driven decisions to improve overall performance.

Additional resources

Celebrate Amazon S3’s 17th birthday at AWS Pi Day 2023

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/celebrate-amazon-s3s-17th-birthday-at-aws-pi-day-2023/

AWS Pi Day 2023 is live today starting at 13:00 PDT; join us on the AWS on Air channel on Twitch.

On this day 17 years ago, we launched a very simple object storage service. It allowed developers to create, list, and delete private storage spaces (known as buckets), upload and download files, and manage their access permissions. The service was available only through a REST and SOAP API. It was designed to provide highly durable data storage with 99.999999999 percent data durability (that’s 11 nines!).

Fast forward to 2023, Amazon Simple Storage Service (Amazon S3) holds more than 280 trillion objects and averages over 100 million requests per second. To protect data integrity, Amazon S3 performs over four billion checksum computations per second. Over the years, we added many capabilities, such as a range of storage classes, to store your colder data cost effectively. Every day, you restore on average more than 1 petabyte from the S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive storage classes. Since launch, you have saved $1 billion from using Amazon S3 Intelligent-Tiering. In 2015, we added the possibility of replicating your data across Regions. Every week, Amazon S3 Replication moves more than 100 petabytes of data. Amazon S3 is also at the core of hundreds of thousands of data lakes. It also has become a critical component of a growing ecosystem of serverless applications. Every day, Amazon S3 sends over 125 billion event notifications to serverless applications. Altogether, Amazon S3 is helping people around the world securely store and extract value from their data.

AWS Pi Day 2023 Small

To celebrate Amazon S3‘s birthday AWS is hosting the AWS Pi Day event for the third consecutive year. This live online event starts at 13:00 PDT today (March 14, 2023) on the AWS On Air channel on Twitch and will feature four hours of fresh educational content from AWS experts. We will discuss not only Amazon S3 best practices, we will also dive into the latest innovations across AWS data services, from storage to analytics and AI/ML. Tune in to learn how to get the most out of your data by making it more secure, available, accessible, and connected, and to help you respond to rapid growth and changing demand. You will also learn how to optimize your data costs, automate your cost savings, eliminate operational complexity, and get new insights from your data. Have a look at the full agenda on the registration page.

At AWS, we innovate on your behalf. During the last few weeks, we announced a 99.99 percent SLA for Amazon MemoryDB for Redis, enhanced I/O multiplexing for Amazon ElastiCache for Redis, and encryption by default for new objects on Amazon S3.

But we are not stopping there, and today we take the occasion of this celebration to announce seven new capabilities across our data services.

Mountpoint for Amazon S3 (alpha release): an open-source file client for Amazon S3
Mountpoint for Amazon S3 is an open-source file client for Amazon S3 that you can install on your compute instance. It translates local file storage API calls to REST API calls on objects in Amazon S3. When using Mountpoint for Amazon S3, data lake applications that access objects using file APIs can achieve high single-instance transfer rates, saving on compute costs.

You can get started with Mountpoint for Amazon S3 by mounting an Amazon S3 bucket at a local mount point on your compute instance. Once mounted, applications read objects as files available locally. Mountpoint for Amazon S3 supports sequential and random read operations on existing S3 objects. It is available to download for Linux operating systems as an alpha release and is not yet intended for production workloads. Instead, we want to collect your feedback early and incorporate your input into the design and implementation. To get started, visit the Mountpoint for Amazon S3 GitHub repo, read the technical launch blog, and share your feedback.

Now Generally Available: AWS Data Exchange for Amazon S3
AWS Data Exchange for Amazon S3 enables you to easily find, subscribe to, and use third-party data files for faster time to insight, storage cost optimization, simplified data licensing management, and more. Data Exchange subscribers can directly use files from data providers’ Amazon S3 buckets for their analysis with AWS services without needing to create or manage copies to their account. Data providers can license in-place access to data hosted in their Amazon S3 buckets.

To learn more about how data providers can simplify and scale access management to multiple data subscribers, you can read this blog.

AWS Data Exchange for S3

Amazon S3 Multi-Region Access Points now support replicated datasets that span multiple AWS accounts
We launched Amazon S3 Multi-Region Access Points in September 2021. We added failover control in November 2022. Amazon S3 Multi-Region Access Points now support datasets that are replicated across multiple AWS accounts. Cross-account Multi-Region Access Points simplify object storage access for applications that span both AWS Regions and accounts, avoiding the need for complex request routing logic in your application. They provide a single global endpoint for your multi-Region applications and dynamically route S3 requests based on policies that you define. This helps you to easily implement multi-Region resilience, latency-based routing, and active/passive failover, even when your data is stored in multiple AWS accounts.

You can learn more about S3 Multi-Region Access Points on the Amazon S3 FAQs.

Aliases for S3 Object Lambda Access Points as CloudFront origin
Amazon S3 Object Lambda, launched in March 2021, lets you add your own code to S3 GET, HEAD, and LIST API requests to modify data as it is returned to an application. With today’s launch of aliases for S3 Object Lambda Access Points any application that requires an S3 bucket name can easily present different views of data depending on the requester. You can now use an S3 Object Lambda Access Point alias as an origin for your Amazon CloudFront distribution to modify the data requested. For example, you can dynamically transform an image depending on the device that a user is visiting from, such as a desktop or a smartphone.

If you want to learn more, my colleague Danilo wrote a blog post with more details and code examples.

Simplify private connectivity from on-premises networks
Amazon Virtual Private Cloud (Amazon VPC) interface endpoints for Amazon S3 now offer private DNS options that can help you more easily route Amazon S3 requests to the lowest-cost endpoint in your VPC. With private DNS for Amazon S3, your on-premises applications can use AWS PrivateLink to access Amazon S3 over an interface endpoint, while requests from your in-VPC applications access Amazon S3 using gateway endpoints. Routing requests like this helps you take advantage of the lowest-cost private network path without having to make code or configuration changes to your clients.

S3 private connectivity

You can learn more on the AWS PrivateLink for Amazon S3 documentation.

Local Amazon S3 Replication on Outposts
Amazon S3 on Outposts now supports S3 replication on Outposts. This extends S3’s fully managed approach to replication to S3 on Outposts buckets. It helps you meet your data residency and data redundancy requirements. With local S3 Replication on Outposts, you can create and configure replication rules to automatically replicate your S3 objects to another Outpost or to another bucket on the same Outpost. During replication, your S3 on Outposts objects are always sent over your local gateway, and objects do not travel back to the AWS Region. S3 Replication on Outposts provides an easy and flexible way to automatically replicate data within a specific data perimeter to address your data redundancy and compliance requirements.

Amazon OpenSearch Security Analytics
The new Amazon OpenSearch Service’s security analytics capability enables your Security Operations (SecOps) teams to detect potential threats quickly while having the tools to help with security investigations on historical data—all with lower data storage costs. Like many other advanced capabilities of Amazon OpenSearch Service, there is no additional charge for security analytics.

You can learn more about Amazon OpenSearch security analytics by reading this blog post.

Join Us Online Today
You will learn more about these launches and about AWS data services in general. We have also prepared some live demos. We designed the AWS Pi Day event for system administrators, engineers, developers, and architects. Our sessions will bring you the latest and greatest information on storage, security, backup, archiving, training and certification, and more.

And to dive deeper, get Pi Day started early by attending AWS Innovate: Data and AI/ML Edition to learn about cutting-edge machine learning tools, strategies for building future-proof applications, and making data-driven decisions for your organization. Don’t miss Swami Sivasubramanian‘s keynote, starting at 9:00 PDT.

Join us today on the AWS Pi Day live stream. Kevin Miller, VP and GM of Amazon S3, will kick off the event with a keynote at 13:00 PDT.

See you there!

— seb

New – Use Amazon S3 Object Lambda with Amazon CloudFront to Tailor Content for End Users

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-use-amazon-s3-object-lambda-with-amazon-cloudfront-to-tailor-content-for-end-users/

With S3 Object Lambda, you can use your own code to process data retrieved from Amazon S3 as it is returned to an application. Over time, we added new capabilities to S3 Object Lambda, like the ability to add your own code to S3 HEAD and LIST API requests, in addition to the support for S3 GET requests that was available at launch.

Today, we are launching aliases for S3 Object Lambda Access Points. Aliases are now automatically generated when S3 Object Lambda Access Points are created and are interchangeable with bucket names anywhere you use a bucket name to access data stored in Amazon S3. Therefore, your applications don’t need to know about S3 Object Lambda and can consider the alias to be a bucket name.

Architecture diagram.

You can now use an S3 Object Lambda Access Point alias as an origin for your Amazon CloudFront distribution to tailor or customize data for end users. You can use this to implement automatic image resizing or to tag or annotate content as it is downloaded. Many images still use older formats like JPEG or PNG, and you can use a transcoding function to deliver images in more efficient formats like WebP, BPG, or HEIC. Digital images contain metadata, and you can implement a function that strips metadata to help satisfy data privacy requirements.

Architecture diagram.

Let’s see how this works in practice. First, I’ll show a simple example using text that you can follow along by just using the AWS Management Console. After that, I’ll implement a more advanced use case processing images.

Using an S3 Object Lambda Access Point as the Origin of a CloudFront Distribution
For simplicity, I am using the same application in the launch post that changes all text in the original file to uppercase. This time, I use the S3 Object Lambda Access Point alias to set up a public distribution with CloudFront.

I follow the same steps as in the launch post to create the S3 Object Lambda Access Point and the Lambda function. Because the Lambda runtimes for Python 3.8 and later do not include the requests module, I update the function code to use urlopen from the Python Standard Library:

import boto3
from urllib.request import urlopen

s3 = boto3.client('s3')

def lambda_handler(event, context):

  object_get_context = event['getObjectContext']
  request_route = object_get_context['outputRoute']
  request_token = object_get_context['outputToken']
  s3_url = object_get_context['inputS3Url']

  # Get object from S3
  response = urlopen(s3_url)
  original_object = response.read().decode('utf-8')

  # Transform object
  transformed_object = original_object.upper()

  # Write object back to S3 Object Lambda


To test that this is working, I open the same file from the bucket and through the S3 Object Lambda Access Point. In the S3 console, I select the bucket and a sample file (called s3.txt) that I uploaded earlier and choose Open.

Console screenshot.

A new browser tab is opened (you might need to disable the pop-up blocker in your browser), and its content is the original file with mixed-case text:

Amazon Simple Storage Service (Amazon S3) is an object storage service that offers...

I choose Object Lambda Access Points from the navigation pane and select the AWS Region I used before from the dropdown. Then, I search for the S3 Object Lambda Access Point that I just created. I select the same file as before and choose Open.

Console screenshot.

In the new tab, the text has been processed by the Lambda function and is now all in uppercase:


Now that the S3 Object Lambda Access Point is correctly configured, I can create the CloudFront distribution. Before I do that, in the list of S3 Object Lambda Access Points in the S3 console, I copy the Object Lambda Access Point alias that has been automatically created:

Console screenshot.

In the CloudFront console, I choose Distributions in the navigation pane and then Create distribution. In the Origin domain, I use the S3 Object Lambda Access Point alias and the Region. The full syntax of the domain is:


Console screenshot.

S3 Object Lambda Access Points cannot be public, and I use CloudFront origin access control (OAC) to authenticate requests to the origin. For Origin access, I select Origin access control settings and choose Create control setting. I write a name for the control setting and select Sign requests and S3 in the Origin type dropdown.

Console screenshot.

Now, my Origin access control settings use the configuration I just created.

Console screenshot.

To reduce the number of requests going through S3 Object Lambda, I enable Origin Shield and choose the closest Origin Shield Region to the Region I am using. Then, I select the CachingOptimized cache policy and create the distribution. As the distribution is being deployed, I update permissions for the resources used by the distribution.

Setting Up Permissions to Use an S3 Object Lambda Access Point as the Origin of a CloudFront Distribution
First, the S3 Object Lambda Access Point needs to give access to the CloudFront distribution. In the S3 console, I select the S3 Object Lambda Access Point and, in the Permissions tab, I update the policy with the following:

    "Version": "2012-10-17",
    "Statement": [
            "Effect": "Allow",
            "Principal": {
                "Service": "cloudfront.amazonaws.com"
            "Action": "s3-object-lambda:Get*",
            "Resource": "arn:aws:s3-object-lambda:REGION:ACCOUNT:accesspoint/NAME",
            "Condition": {
                "StringEquals": {
                    "aws:SourceArn": "arn:aws:cloudfront::ACCOUNT:distribution/DISTRIBUTION-ID"

The supporting access point also needs to allow access to CloudFront when called via S3 Object Lambda. I select the access point and update the policy in the Permissions tab:

    "Version": "2012-10-17",
    "Id": "default",
    "Statement": [
            "Sid": "s3objlambda",
            "Effect": "Allow",
            "Principal": {
                "Service": "cloudfront.amazonaws.com"
            "Action": "s3:*",
            "Resource": [
            "Condition": {
                "ForAnyValue:StringEquals": {
                    "aws:CalledVia": "s3-object-lambda.amazonaws.com"

The S3 bucket needs to allow access to the supporting access point. I select the bucket and update the policy in the Permissions tab:

    "Version": "2012-10-17",
    "Statement": [
            "Effect": "Allow",
            "Principal": {
                "AWS": "*"
            "Action": "*",
            "Resource": [
            "Condition": {
                "StringEquals": {
                    "s3:DataAccessPointAccount": "ACCOUNT"

Finally, CloudFront needs to be able to invoke the Lambda function. In the Lambda console, I choose the Lambda function used by S3 Object Lambda, and then, in the Configuration tab, I choose Permissions. In the Resource-based policy statements section, I choose Add permissions and select AWS Account. I enter a unique Statement ID. Then, I enter cloudfront.amazonaws.com as Principal and select lambda:InvokeFunction from the Action dropdown and Save. We are working to simplify this step in the future. I’ll update this post when that’s available.

Testing the CloudFront Distribution
When the distribution has been deployed, I test that the setup is working with the same sample file I used before. In the CloudFront console, I select the distribution and copy the Distribution domain name. I can use the browser and enter https://DISTRIBUTION_DOMAIN_NAME/s3.txt in the navigation bar to send a request to CloudFront and get the file processed by S3 Object Lambda. To quickly get all the info, I use curl with the -i option to see the HTTP status and the headers in the response:

curl -i https://DISTRIBUTION_DOMAIN_NAME/s3.txt

HTTP/2 200 
content-type: text/plain
content-length: 427
x-amzn-requestid: a85fe537-3502-4592-b2a9-a09261c8c00c
date: Mon, 06 Mar 2023 10:23:02 GMT
x-cache: Miss from cloudfront
via: 1.1 a2df4ad642d78d6dac65038e06ad10d2.cloudfront.net (CloudFront)
x-amz-cf-pop: DUB56-P1
x-amz-cf-id: KIiljCzYJBUVVxmNkl3EP2PMh96OBVoTyFSMYDupMd4muLGNm2AmgA==


It works! As expected, the content processed by the Lambda function is all uppercase. Because this is the first invocation for the distribution, it has not been returned from the cache (x-cache: Miss from cloudfront). The request went through S3 Object Lambda to process the file using the Lambda function I provided.

Let’s try the same request again:

curl -i https://DISTRIBUTION_DOMAIN_NAME/s3.txt

HTTP/2 200 
content-type: text/plain
content-length: 427
x-amzn-requestid: a85fe537-3502-4592-b2a9-a09261c8c00c
date: Mon, 06 Mar 2023 10:23:02 GMT
x-cache: Hit from cloudfront
via: 1.1 145b7e87a6273078e52d178985ceaa5e.cloudfront.net (CloudFront)
x-amz-cf-pop: DUB56-P1
x-amz-cf-id: HEx9Fodp184mnxLQZuW62U11Fr1bA-W1aIkWjeqpC9yHbd0Rg4eM3A==
age: 3


This time the content is returned from the CloudFront cache (x-cache: Hit from cloudfront), and there was no further processing by S3 Object Lambda. By using S3 Object Lambda as the origin, the CloudFront distribution serves content that has been processed by a Lambda function and can be cached to reduce latency and optimize costs.

Resizing Images Using S3 Object Lambda and CloudFront
As I mentioned at the beginning of this post, one of the use cases that can be implemented using S3 Object Lambda and CloudFront is image transformation. Let’s create a CloudFront distribution that can dynamically resize an image by passing the desired width and height as query parameters (w and h respectively). For example:


For this setup to work, I need to make two changes to the CloudFront distribution. First, I create a new cache policy to include query parameters in the cache key. In the CloudFront console, I choose Policies in the navigation pane. In the Cache tab, I choose Create cache policy. Then, I enter a name for the cache policy.

Console screenshot.

In the Query settings of the Cache key settings, I select the option to Include the following query parameters and add w (for the width) and h (for the height).

Console screenshot.

Then, in the Behaviors tab of the distribution, I select the default behavior and choose Edit.

There, I update the Cache key and origin requests section:

  • In the Cache policy, I use the new cache policy to include the w and h query parameters in the cache key.
  • In the Origin request policy, use the AllViewerExceptHostHeader managed policy to forward query parameters to the origin.

Console screenshot.

Now I can update the Lambda function code. To resize images, this function uses the Pillow module that needs to be packaged with the function when it is uploaded to Lambda. You can deploy the function using a tool like the AWS SAM CLI or the AWS CDK. Compared to the previous example, this function also handles and returns HTTP errors, such as when content is not found in the bucket.

import io
import boto3
from urllib.request import urlopen, HTTPError
from PIL import Image

from urllib.parse import urlparse, parse_qs

s3 = boto3.client('s3')

def lambda_handler(event, context):

    object_get_context = event['getObjectContext']
    request_route = object_get_context['outputRoute']
    request_token = object_get_context['outputToken']
    s3_url = object_get_context['inputS3Url']

    # Get object from S3
        original_image = Image.open(urlopen(s3_url))
    except HTTPError as err:

    # Get width and height from query parameters
    user_request = event['userRequest']
    url = user_request['url']
    parsed_url = urlparse(url)
    query_parameters = parse_qs(parsed_url.query)

        width, height = int(query_parameters['w'][0]), int(query_parameters['h'][0])
    except (KeyError, ValueError):
        width, height = 0, 0

    # Transform object
    if width > 0 and height > 0:
        transformed_image = original_image.resize((width, height), Image.ANTIALIAS)
        transformed_image = original_image

    transformed_bytes = io.BytesIO()
    transformed_image.save(transformed_bytes, format='JPEG')

    # Write object back to S3 Object Lambda


I upload a picture I took of the Trevi Fountain in the source bucket. To start, I generate a small thumbnail (200 by 150 pixels).


Picture of the Trevi Fountain with size 200x150 pixels.

Now, I ask for a slightly larger version (400 by 300 pixels):


Picture of the Trevi Fountain with size 400x300 pixels.

It works as expected. The first invocation with a specific size is processed by the Lambda function. Further requests with the same width and height are served from the CloudFront cache.

Availability and Pricing
Aliases for S3 Object Lambda Access Points are available today in all commercial AWS Regions. There is no additional cost for aliases. With S3 Object Lambda, you pay for the Lambda compute and request charges required to process the data, and for the data S3 Object Lambda returns to your application. You also pay for the S3 requests that are invoked by your Lambda function. For more information, see Amazon S3 Pricing.

Aliases are now automatically generated when an S3 Object Lambda Access Point is created. For existing S3 Object Lambda Access Points, aliases are automatically assigned and ready for use.

It’s now easier to use S3 Object Lambda with existing applications, and aliases open many new possibilities. For example, you can use aliases with CloudFront to create a website that converts content in Markdown to HTML, resizes and watermarks images, or masks personally identifiable information (PII) from text, images, and documents.

Customize content for your end users using S3 Object Lambda with CloudFront.


Implementing an event-driven serverless story generation application with ChatGPT and DALL-E

Post Syndicated from David Boyne original https://aws.amazon.com/blogs/compute/implementing-an-event-driven-serverless-story-generation-application-with-chatgpt-and-dall-e/

This post demonstrates how to integrate AWS serverless services with artificial intelligence (AI) technologies, ChatGPT, and DALL-E. This full stack event-driven application showcases a method of generating unique bedtime stories for children by using predetermined characters and scenes as a prompt for ChatGPT.

Every night at bedtime, the serverless scheduler triggers the application, initiating an event-driven workflow to create and store new unique AI-generated stories with AI-generated images and supporting audio.

These datasets are used to showcase the story on a custom website built with Next.js hosted with AWS App Runner. After the story is created, a notification is sent to the user containing a URL to view and read the story to the children.

Example notification of a story hosted with Next.js and App Runner

Example notification of a story hosted with Next.js and App Runner

By integrating AWS services with AI technologies, you can now create new and innovative ideas that were previously unimaginable.

The application mentioned in this blog post demonstrates examples of point-to-point messaging with Amazon EventBridge pipes, publish/subscribe patterns with Amazon EventBridge and reacting to change data capture events with DynamoDB Streams.

Understanding the architecture

The following image shows the serverless architecture used to generate stories:

Architecture diagram for Serverless bed time story generation with ChatGPT and DALL-E

Architecture diagram for Serverless bed time story generation with ChatGPT and DALL-E

A new children’s story is generated every day at configured time using Amazon EventBridge Scheduler (Step 1). EventBridge Scheduler is a service capable of scaling millions of schedules with over 200 targets and over 6000 API calls. This example application uses EventBridge scheduler to trigger an AWS Lambda function every night at the same time (7:15pm). The Lambda function is triggered to start the generation of the story.

EventBridge scheduler triggers Lambda function every day at 7:15pm (bed time)

EventBridge scheduler triggers Lambda function every day at 7:15pm (bed time)

The “Scenes” and “Characters” Amazon DynamoDB tables contain the characters involved in the story and a scene that is randomly selected during its creation. As a result, ChatGPT receives a unique prompt each time. An example of the prompt may look like this:

Write a title and a rhyming story on 2 main characters called Parker and Jackson. The story needs to be set within the scene haunted woods and be at least 200 words long


After the story is created, it is then saved in the “Stories” DynamoDB table (Step 2).

Scheduler triggering Lambda function to generate the story and store story into DynamoDB

Scheduler triggering Lambda function to generate the story and store story into DynamoDB

Once the story is created this initiates a change data capture event using DynamoDB Streams (Step 3). This event flows through point-to-point messaging with EventBridge pipes and directly into EventBridge. Input transforms are then used to convert the DynamoDB Stream event into a custom EventBridge event, which downstream consumers can comprehend. Adopting this pattern is beneficial as it allows us to separate contracts from the DynamoDB event schema and not having downstream consumers conform to this schema structure, this mapping allows us to remain decoupled from implementation details.

EventBridge Pipes connecting DynamoDB streams directly into EventBridge.

EventBridge Pipes connecting DynamoDB streams directly into EventBridge.

Upon triggering the StoryCreated event in EventBridge, three targets are triggered to carry out several processes (Step 4). Firstly, AI Images are processed, followed by the creation of audio for the story. Finally, the end user is notified of the completed story through Amazon SNS and email subscriptions. This fan-out pattern enables these tasks to be run asynchronously and in parallel, allowing for faster processing times.

EventBridge pub/sub pattern used to start async processing of notifications, audio, and images.

EventBridge pub/sub pattern used to start async processing of notifications, audio, and images.

An SNS topic is triggered by the `StoryCreated` event to send an email to the end user using email subscriptions (Step 6). The email consists of a URL with the id of the story that has been created. Clicking on the URL takes the user to the frontend application that is hosted with App Runner.

Using SNS to notify the user of a new story

Using SNS to notify the user of a new story

Example email sent to the user

Example email sent to the user

Amazon Polly is used to generate the audio files for the story (Step 6). Upon triggering the `StoryCreated` event, a Lambda function is triggered, and the story description is used and given to Amazon Polly. Amazon Polly then creates an audio file of the story, which is stored in Amazon S3. A presigned URL is generated and saved in DynamoDB against the created story. This allows the frontend application and browser to retrieve the audio file when the user views the page. The presigned URL has a validity of two days, after which it can no longer be accessed or listened to.

Lambda function to generate audio using Amazon Polly, store in S3 and update story with presigned URL

Lambda function to generate audio using Amazon Polly, store in S3 and update story with presigned URL

The `StoryCreated` event also triggers another Lambda function, which uses the OpenAI API to generate an AI image using DALL-E based on the generated story (Step 7). Once the image is generated, the image is downloaded and stored in Amazon S3. Similar to the audio file, the system generates a presigned URL for the image and saves it in DynamoDB against the story. The presigned URL is only valid for two days, after which it becomes inaccessible for download or viewing.

Lambda function to generate images, store in S3 and update story with presigned URL.

Lambda function to generate images, store in S3 and update story with presigned URL.

In the event of a failure in audio or image generation, the frontend application still loads the story, but does not display the missing image or audio at that moment. This ensures that the frontend can continue working and provide value. If you wanted more control and only trigger the user’s notification event once all parallel tasks are complete the aggregator messaging pattern can be considered.

Hosting the frontend Next.js application with AWS App Runner

Next.js is used by the frontend application to render server-side rendered (SSR) pages that can access the stories from the DynamoDB table, which are then hosted with AWS App Runner after being containerized.

Next.js application hosted with App Runner, with permissions into DynamoDB table.

Next.js application hosted with App Runner, with permissions into DynamoDB table.

AWS App Runner enables you to deploy containerized web applications and APIs securely, without needing any prior knowledge of containers or infrastructure. With App Runner, developers can concentrate on their application, while the service handles container startup, running, scaling, and load balancing. After deployment, App Runner provides a secure URL for clients to begin making HTTP requests against.

With App Runner, you have two primary options for deploying your container: source code connections or source images. Using source code connections grants App Runner permission to pull the image file directly from your source code, and with Automatic deployment configured, it can redeploy the application when changes are made. Alternatively, source images provide App Runner with the image’s location in an image registry, and this image is deployed by App Runner.

In this example application, CDK deploys the application using the DockerImageAsset construct with the App Runner construct. Once deployed, App Runner builds and uploads the frontend image to Amazon Elastic Container Registry (ECR) and deploys it. Downstream consumers can access the application using the secure URL provided by App Runner. In this example, the URL is used when the SNS notification is sent to the user when the story is ready to be viewed.

Giving the frontend container permission to DynamoDB table

To grant the Next.js application permission to obtain stories from the Stories DynamoDB table, App Runner instance roles are configured. These roles are optional and can provide the necessary permissions for the container to access AWS services required by the compute service.

If you want to learn more about AWS App Runner, you can explore the free workshop.

Design choices and assumptions

The DynamoDB Time to Live (TTL) feature is ideal for the short-lived nature of daily generated stories. DynamoDB handle the deletion of stories after two days by setting the TTL attribute on each story. Once a story is deleted, it becomes inaccessible through the generated story URLs.

Using Amazon S3 presigned URLs is a method to grant temporary access to a file in S3. This application creates presigned URLs for the audio file and generated images that last for 2 days, after which the URLs for the S3 items become invalid.

Input transforms are used between DynamoDB streams and EventBridge events to decouple the schemas and events consumed by downstream targets. Consuming the events as they are is known as the “conformist” pattern, and couples us to implementation details of DynamoDB streams with downstream EventBridge consumers. This allows the application to remain decoupled from implementation details and remain flexible.


The adoption of artificial intelligence (AI) technology has significantly increased in various industries. ChatGPT, a large language model that can understand and generate human-like responses in natural language, and DALL-E, an image generation system that can create realistic images based on textual descriptions, are examples of such technology. These systems have demonstrated the potential for AI to provide innovative solutions and transform the way we interact with technology.

This blog post explores ways in which you can utilize AWS serverless services with ChatGTP and DALL-E to create a story generation application fronted by a Next.js application hosted with App Runner. EventBridge Scheduler is used to trigger the story creation process then react to change data capture events with DynamoDB streams and EventBridge Pipes, and use Amazon EventBridge to fan out compute tasks to process notifications, images, and audio files.

You can find the documentation and the source code for this application in GitHub.

For more serverless learning resources, visit Serverless Land.

Simplify data loading into Type 2 slowly changing dimensions in Amazon Redshift

Post Syndicated from Vaidy Kalpathy original https://aws.amazon.com/blogs/big-data/simplify-data-loading-into-type-2-slowly-changing-dimensions-in-amazon-redshift/

Thousands of customers rely on Amazon Redshift to build data warehouses to accelerate time to insights with fast, simple, and secure analytics at scale and analyze data from terabytes to petabytes by running complex analytical queries. Organizations create data marts, which are subsets of the data warehouse and usually oriented for gaining analytical insights specific to a business unit or team. The star schema is a popular data model for building data marts.

In this post, we show how to simplify data loading into a Type 2 slowly changing dimension in Amazon Redshift.

Star schema and slowly changing dimension overview

A star schema is the simplest type of dimensional model, in which the center of the star can have one fact table and a number of associated dimension tables. A dimension is a structure that captures reference data along with associated hierarchies, while a fact table captures different values and metrics that can be aggregated by dimensions. Dimensions provide answers to exploratory business questions by allowing end-users to slice and dice data in a variety of ways using familiar SQL commands.

Whereas operational source systems contain only the latest version of master data, the star schema enables time travel queries to reproduce dimension attribute values on past dates when the fact transaction or event actually happened. The star schema data model allows analytical users to query historical data tying metrics to corresponding dimensional attribute values over time. Time travel is possible because dimension tables contain the exact version of the associated attributes at different time ranges. Relative to the metrics data that keeps changing on a daily or even hourly basis, the dimension attributes change less frequently. Therefore, dimensions in a star schema that keeps track of changes over time are referred to as slowly changing dimensions (SCDs).

Data loading is one of the key aspects of maintaining a data warehouse. In a star schema data model, the central fact table is dependent on the surrounding dimension tables. This is captured in the form of primary key-foreign key relationships, where the dimension table primary keys are referred by foreign keys in the fact table. In the case of Amazon Redshift, uniqueness, primary key, and foreign key constraints are not enforced. However, declaring them will help the optimizer arrive at optimal query plans, provided that the data loading processes enforce their integrity. As part of data loading, the dimension tables, including SCD tables, get loaded first, followed by the fact tables.

SCD population challenge

Populating an SCD dimension table involves merging data from multiple source tables, which are usually normalized. SCD tables contain a pair of date columns (effective and expiry dates) that represent the record’s validity date range. Changes are inserted as new active records effective from the date of data loading, while simultaneously expiring the current active record on a previous day. During each data load, incoming change records are matched against existing active records, comparing each attribute value to determine whether existing records have changed or were deleted or are new records coming in.

In this post, we demonstrate how to simplify data loading into a dimension table with the following methods:

  • Using Amazon Simple Storage Service (Amazon S3) to host the initial and incremental data files from source system tables
  • Accessing S3 objects using Amazon Redshift Spectrum to carry out data processing to load native tables within Amazon Redshift
  • Creating views with window functions to replicate the source system version of each table within Amazon Redshift
  • Joining source table views to project attributes matching with dimension table schema
  • Applying incremental data to the dimension table, bringing it up to date with source-side changes

Solution overview

In a real-world scenario, records from source system tables are ingested on a periodic basis to an Amazon S3 location before being loaded into star schema tables in Amazon Redshift.

For this demonstration, data from two source tables, customer_master and customer_address, are combined to populate the target dimension table dim_customer, which is the customer dimension table.

The source tables customer_master and customer_address share the same primary key, customer_id, and will be joined on the same to fetch one record per customer_id along with attributes from both tables. row_audit_ts contains the latest timestamp at which the particular source record was inserted or last updated. This column helps identify the change records since the last data extraction.

rec_source_status is an optional column that indicates if the corresponding source record was inserted, updated, or deleted. This is applicable in cases where the source system itself provides the changes and populates rec_source_status appropriately.

The following figure provides the schema of the source and target tables.

Let’s look closer at the schema of the target table, dim_customer. It contains different categories of columns:

  • Keys – It contains two types of keys:
    • customer_sk is the primary key of this table. It is also called the surrogate key and has a unique value that is monotonically increasing.
    • customer_id is the source primary key and provides a reference back to the source system record.
  • SCD2 metadatarec_eff_dt and rec_exp_dt indicate the state of the record. These two columns together define the validity of the record. The value in rec_exp_dt will be set as ‘9999-12-31’ for presently active records.
  • Attributes – Includes first_name, last_name, employer_name, email_id, city, and country.

Data loading into a SCD table involves a first-time bulk data loading, referred to as the initial data load. This is followed by continuous or regular data loading, referred to as an incremental data load, to keep the records up to date with changes in the source tables.

To demonstrate the solution, we walk through the following steps for initial data load (1–7) and incremental data load (8–12):

  1. Land the source data files in an Amazon S3 location, using one subfolder per source table.
  2. Use an AWS Glue crawler to parse the data files and register tables in the AWS Glue Data Catalog.
  3. Create an external schema in Amazon Redshift to point to the AWS Glue database containing these tables.
  4. In Amazon Redshift, create one view per source table to fetch the latest version of the record for each primary key (customer_id) value.
  5. Create the dim_customer table in Amazon Redshift, which contains attributes from all relevant source tables.
  6. Create a view in Amazon Redshift joining the source table views from Step 4 to project the attributes modeled in the dimension table.
  7. Populate the initial data from the view created in Step 6 into the dim_customer table, generating customer_sk.
  8. Land the incremental data files for each source table in their respective Amazon S3 location.
  9. In Amazon Redshift, create a temporary table to accommodate the change-only records.
  10. Join the view from Step 6 and dim_customer and identify change records comparing the combined hash value of attributes. Populate the change records into the temporary table with an I, U, or D indicator.
  11. Update rec_exp_dt in dim_customer for all U and D records from the temporary table.
  12. Insert records into dim_customer, querying all I and U records from the temporary table.


Before you get started, make sure you meet the following prerequisites:

Land data from source tables

Create separate subfolders for each source table in an S3 bucket and place the initial data files within the respective subfolder. In the following image, the initial data files for customer_master and customer_address are made available within two different subfolders. To try out the solution, you can use customer_master_with_ts.csv and customer_address_with_ts.csv as initial data files.

It’s important to include an audit timestamp (row_audit_ts) column that indicates when each record was inserted or last updated. As part of incremental data loading, rows with the same primary key value (customer_id) can arrive more than once. The row_audit_ts column helps identify the latest version of such records for a given customer_id to be used for further processing.

Register source tables in the AWS Glue Data Catalog

We use an AWS Glue crawler to infer metadata from delimited data files like the CSV files used in this post. For instructions on getting started with an AWS Glue crawler, refer to Tutorial: Adding an AWS Glue crawler.

Create an AWS Glue crawler and point it to the Amazon S3 location that contains the source table subfolders, within which the associated data files are placed. When you’re creating the AWS Glue crawler, create a new database named rs-dimension-blog. The following screenshots show the AWS Glue crawler configuration chosen for our data files.

Note that for the Set output and scheduling section, the advanced options are left unchanged.

Running this crawler should create the following tables within the rs-dimension-blog database:

  • customer_address
  • customer_master

Create schemas in Amazon Redshift

First, create an AWS Identity and Access Management (IAM) role named rs-dim-blog-spectrum-role. For instructions, refer to Create an IAM role for Amazon Redshift.

The IAM role has Amazon Redshift as the trusted entity, and the permissions policy includes AmazonS3ReadOnlyAccess and AWSGlueConsoleFullAccess, because we’re using the AWS Glue Data Catalog. Then associate the IAM role with the Amazon Redshift cluster or endpoint.

Instead, you can also set the IAM role as the default for your Amazon Redshift cluster or endpoint. If you do so, in the following create external schema command, pass the iam_role parameter as iam_role default.

Now, open Amazon Redshift Query Editor V2 and create an external schema passing the newly created IAM role and specifying the database as rs-dimension-blog. The database name rs-dimension-blog is the one created in the Data Catalog as part of configuring the crawler in the preceding section. See the following code:

create external schema spectrum_dim_blog 
from data catalog 
database 'rs-dimension-blog' 
iam_role 'arn:aws:iam::<accountid>:role/rs-dim-blog-spectrum-role';

Check if the tables registered in the Data Catalog in the preceding section are visible from within Amazon Redshift:

select * 
from spectrum_dim_blog.customer_master 
limit 10;

select * 
from spectrum_dim_blog.customer_address 
limit 10;

Each of these queries will return 10 rows from the respective Data Catalog tables.

Create another schema in Amazon Redshift to host the table, dim_customer:

create schema rs_dim_blog;

Create views to fetch the latest records from each source table

Create a view for the customer_master table, naming it vw_cust_mstr_latest:

create view rs_dim_blog.vw_cust_mstr_latest as with rows_numbered as (
    row_number() over(
      partition by customer_id 
      order by 
        row_audit_ts desc
    ) as rnum 
  rnum = 1 with no schema binding;

The preceding query uses row_number, which is a window function provided by Amazon Redshift. Using window functions enables you to create analytic business queries more efficiently. Window functions operate on a partition of a result set, and return a value for every row in that window. The row_number window function determines the ordinal number of the current row within a group of rows, counting from 1, based on the ORDER BY expression in the OVER clause. By including the PARTITION BY clause as customer_id, groups are created for each value of customer_id and ordinal numbers are reset for each group.

Create a view for the customer_address table, naming it vw_cust_addr_latest:

create view rs_dim_blog.vw_cust_addr_latest as with rows_numbered as (
    row_number() over(
      partition by customer_id 
      order by 
        row_audit_ts desc
    ) as rnum 
  rnum = 1 with no schema binding;

Both view definitions use the row_number window function of Amazon Redshift, ordering the records by descending order of the row_audit_ts column (the audit timestamp column). The condition rnum=1 fetches the latest record for each customer_id value.

Create the dim_customer table in Amazon Redshift

Create dim_customer as an internal table in Amazon Redshift within the rs_dim_blog schema. The dimension table includes the column customer_sk, that acts as the surrogate key column and enables us to capture a time-sensitive version of each customer record. The validity period for each record is defined by the columns rec_eff_dt and rec_exp_dt, representing record effective date and record expiry date, respectively. See the following code:

create table rs_dim_blog.dim_customer (
  customer_sk bigint, 
  customer_id bigint, 
  first_name varchar(100), 
  last_name varchar(100), 
  employer_name varchar(100), 
  email_id varchar(100), 
  city varchar(100), 
  country varchar(100), 
  rec_eff_dt date, 
  rec_exp_dt date
) diststyle auto;

Create a view to consolidate the latest version of source records

Create the view vw_dim_customer_src, which consolidates the latest records from both source tables using left outer join, keeping them ready to be populated into the Amazon Redshift dimension table. This view fetches data from the latest views defined in the section “Create views to fetch the latest records from each source table”:

create view rs_dim_blog.vw_dim_customer_src as 
  rs_dim_blog.vw_cust_mstr_latest as m 
  left join rs_dim_blog.vw_cust_addr_latest as a on m.customer_id = a.customer_id 
order by 
  m.customer_id with no schema binding;

At this point, this view fetches the initial data for loading into the dim_customer table that we are about to create. In your use-case, use a similar approach to create and join the required source table views to populate your target dimension table.

Populate initial data into dim_customer

Populate the initial data into the dim_customer table by querying the view vw_dim_customer_src. Because this is the initial data load, running row numbers generated by the row_number window function will suffice to populate a unique value in the customer_sk column starting from 1:

insert into rs_dim_blog.dim_customer 
  row_number() over() as customer_sk, 
  cast('2022-07-01' as date) rec_eff_dt, 
  cast('9999-12-31' as date) rec_exp_dt 

In this query, we have specified ’2022-07-01’ as the value in rec_eff_dt for all initial data records. For your use-case, you can modify this date value as appropriate to your situation.

The preceding steps complete the initial data loading into the dim_customer table. In the next steps, we proceed with populating incremental data.

Land ongoing change data files in Amazon S3

After the initial load, the source systems provide data files on an ongoing basis, either containing only new and change records or a full extract containing all records for a particular table.

You can use the sample files customer_master_with_ts_incr.csv and customer_address_with_ts_incr.csv, which contain changed as well as new records. These incremental files need to be placed in the same location in Amazon S3 where the initial data files were placed. Please see section “Land data from source tables”. This will result in the corresponding Redshift Spectrum tables automatically reading the additional rows.

If you used the sample file for customer_master, after adding the incremental files, the following query shows the initial as well as incremental records:

order by 

In case of full extracts, we can identify deletes occurring in the source system tables by comparing the previous and current versions and looking for missing records. In case of change-only extracts where the rec_source_status column is present, its value will help us identify deleted records. In either case, land the ongoing change data files in the respective Amazon S3 locations.

For this example, we have uploaded the incremental data for the customer_master and customer_address source tables with a few customer_id records receiving updates and a few new records being added.

Create a temporary table to capture change records

Create the temporary table temp_dim_customer to store all changes that need to be applied to the target dim_customer table:

create temp table temp_dim_customer (
  customer_sk bigint, 
  customer_id bigint, 
  first_name varchar(100), 
  last_name varchar(100), 
  employer_name varchar(100), 
  email_id varchar(100), 
  city varchar(100), 
  country varchar(100), 
  rec_eff_dt date, 
  rec_exp_dt date, 
  iud_operation character(1)

Populate the temporary table with new and changed records

This is a multi-step process that can be combined into a single complex SQL. Complete the following steps:

  1. Fetch the latest version of all customer attributes by querying the view vw_dim_customer_src:
    coalesce(first_name, '') || coalesce(last_name, '') || coalesce(employer_name, '') || coalesce(email_id, '') || coalesce(city, '') || coalesce(country, ''), 512
  ) as hash_value, 
  current_date rec_eff_dt, 
  cast('9999-12-31' as date) rec_exp_dt 

Amazon Redshift offers hashing functions such as sha2, which converts a variable length string input into a fixed length character output. The output string is a text representation of the hexadecimal value of the checksum with the specified number of bits. In this case, we pass a concatenated set of customer attributes whose change we want to track, specifying the number of bits as 512. We’ll use the output of the hash function to determine if any of the attributes have undergone a change. This dataset will be called newver (new version).

Because we landed the ongoing change data in the same location as the initial data files, the records retrieved from the preceding query (in newver) include all records, even the unchanged ones. But because of the definition of the view vw_dim_customer_src, we get only one record per customerid, which is its latest version based on row_audit_ts.

  1. In a similar manner, retrieve the latest version of all customer records from dim_customer, which are identified by rec_exp_dt=‘9999-12-31’. While doing so, also retrieve the sha2 value of all customer attributes available in dim_customer:
    coalesce(first_name, '') || coalesce(last_name, '') || coalesce(employer_name, '') || coalesce(email_id, '') || coalesce(city, '') || coalesce(country, ''), 512
  ) as hash_value, 
  rec_exp_dt = '9999-12-31';

This dataset will be called oldver (old or existing version).

  1. Identify the current maximum surrogate key value from the dim_customer table:
  max(customer_sk) as maxval 

This value (maxval) will be added to the row_number before being used as the customer_sk value for the change records that need to be inserted.

  1. Perform a full outer join of the old version of records (oldver) and the new version (newver) of records on the customer_id column. Then compare the old and new hash values generated by the sha2 function to determine if the change record is an insert, update, or delete:
case when oldver.customer_id is null then 'I'
when newver.customer_id is null then 'D'
when oldver.hash_value != newver.hash_value then 'U'
else 'N' end as iud_op

We tag the records as follows:

  • If the customer_id is non-existent in the oldver dataset (oldver.customer_id is null), it’s tagged as an insert (‘I').
  • Otherwise, if the customer_id is non-existent in the newver dataset (newver.customer_id is null), it’s tagged as a delete (‘D').
  • Otherwise, if the old hash_value and new hash_value are different, these records represent an update (‘U').
  • Otherwise, it indicates that the record has not undergone any change and therefore can be ignored or marked as not-to-be-processed (‘N').

Make sure to modify the preceding logic if the source extract contains rec_source_status to identify deleted records.

Although sha2 output maps a possibly infinite set of input strings to a finite set of output strings, the chances of collision of hash values for the original row values and changed row values are very unlikely. Instead of individually comparing each column value before and after, we compare the hash values generated by sha2 to conclude if there has been a change in any of the attributes of the customer record. For your use-case, we recommend you choose a hash function that works for your data conditions after adequate testing. Instead, you can compare individual column values if none of the hash functions satisfactorily meet your expectations.

  1. Combining the outputs from the preceding steps, let’s create the INSERT statement that captures only change records to populate the temporary table:
insert into temp_dim_customer (
  customer_sk, customer_id, first_name, 
  last_name, employer_name, email_id, 
  city, country, rec_eff_dt, rec_exp_dt, 
) with newver as (
      coalesce(first_name, '') || coalesce(last_name, '') || coalesce(employer_name, '') || coalesce(email_id, '') || coalesce(city, '') || coalesce(country, ''), 512
    ) as hash_value, 
    current_date rec_eff_dt, 
    cast('9999-12-31' as date) rec_exp_dt 
oldver as (
      coalesce(first_name, '') || coalesce(last_name, '') || coalesce(employer_name, '') || coalesce(email_id, '') || coalesce(city, '') || coalesce(country, ''), 512
    ) as hash_value, 
    rec_exp_dt = '9999-12-31'
maxsk as (
    max(customer_sk) as maxval 
allrecs as (
    coalesce(oldver.customer_id, newver.customer_id) as customer_id, 
    case when oldver.customer_id is null then 'I' when newver.customer_id is null then 'D' when oldver.hash_value != newver.hash_value then 'U' else 'N' end as iud_op, 
    oldver full 
    outer join newver on oldver.customer_id = newver.customer_id
  (maxval + (row_number() over())) as customer_sk, 
  iud_op != 'N';

Expire updated customer records

With the temp_dim_customer table now containing only the change records (either ‘I’, ‘U’, or ‘D’), the same can be applied on the target dim_customer table.

Let’s first fetch all records with values ‘U’ or ‘D’ in the iud_op column. These are records that have either been deleted or updated in the source system. Because dim_customer is a slowly changing dimension, it needs to reflect the validity period of each customer record. In this case, we expire the presently active recorts that have been updated or deleted. We expire these records as of yesterday (by setting rec_exp_dt=current_date-1) matching on the customer_id column:

  rec_exp_dt = current_date - 1 
  customer_id in (
      temp_dim_customer as t 
      iud_operation in ('U', 'D')
  and rec_exp_dt = '9999-12-31';

Insert new and changed records

As the last step, we need to insert the newer version of updated records along with all first-time inserts. These are indicated by ‘U’ and ‘I’, respectively, in the iud_op column in the temp_dim_customer table:

insert into rs_dim_blog.dim_customer (
  customer_sk, customer_id, first_name, 
  last_name, employer_name, email_id, 
  city, country, rec_eff_dt, rec_exp_dt
  iud_operation in ('I', 'U');

Depending on the SQL client setting, you might want to run a commit transaction; command to verify that the preceding changes are persisted successfully in Amazon Redshift.

Check the final output

You can run the following query and see that the dim_customer table now contains both the initial data records plus the incremental data records, capturing multiple versions for those customer_id values that got changed as part of incremental data loading. The output also indicates that each record has been populated with appropriate values in rec_eff_dt and rec_exp_dt corresponding to the record validity period.

order by 

For the sample data files provided in this article, the preceding query returns the following records. If you’re using the sample data files provided in this post, note that the values in customer_sk may not match with what is shown in the following table.

In this post, we only show the important SQL statements; the complete SQL code is available in load_scd2_sample_dim_customer.sql.

Clean up

If you no longer need the resources you created, you can delete them to prevent incurring additional charges.


In this post, you learned how to simplify data loading into Type-2 SCD tables in Amazon Redshift, covering both initial data loading and incremental data loading. The approach deals with multiple source tables populating a target dimension table, capturing the latest version of source records as of each run.

Refer to Amazon Redshift data loading best practices for further materials and additional best practices, and see Updating and inserting new data for instructions to implement updates and inserts.

About the Author

Vaidy Kalpathy is a Senior Data Lab Solution Architect at AWS, where he helps customers modernize their data platform and defines end to end data strategy including data ingestion, transformation, security, visualization. He is passionate about working backwards from business use cases, creating scalable and custom fit architectures to help customers innovate using data analytics services on AWS.

Uploading large objects to Amazon S3 using multipart upload and transfer acceleration

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/uploading-large-objects-to-amazon-s3-using-multipart-upload-and-transfer-acceleration/

This post is written by Tam Baghdassarian, Cloud Application Architect, Rama Krishna Ramaseshu, Sr Cloud App Architect, and Anand Komandooru, Sr Cloud App Architect.

Web and mobile applications must often upload objects to the AWS Cloud. For example, services such as Amazon Photos allow users to upload photos and large video files from their browser or mobile application.

Amazon S3 can be ideal to store large objects due to its 5-TB object size maximum along with its support for reducing upload times via multipart uploads and transfer acceleration.


Developers who must upload large files from their web and mobile applications to the cloud can face a number of challenges. Due to underlying TCP throughput limits, a single HTTP connection cannot use the full bandwidth available, resulting in slower upload times. Furthermore, network latency and quality can result in poor or inconsistent user experience.

Using S3 features such as presigned URLs and multipart upload, developers can securely increase throughput and minimize upload retries due to network errors. Additionally, developers can use transfer acceleration to reduce network latency and provide a consistent user experience to their web and mobile app users across the globe.

This post references a sample application consisting of a web frontend and a serverless backend application. It demonstrates the benefits of using S3’s multipart upload and transfer acceleration features.

Architecture overview

Solution overview:

  1. Web or mobile application (frontend) communicates with AWS Cloud (backend) through Amazon API Gateway to initiate and complete a multipart upload.
  2. AWS Lambda functions invoke S3 API calls on behalf of the web or mobile application.
  3. Web or mobile application uploads large objects to S3 using S3 transfer acceleration and presigned URLs.
  4. File uploads are received and acknowledged by the closest edge location to reduce latency.

Using S3 multipart upload to upload large objects

A multipart upload allows an application to upload a large object as a set of smaller parts uploaded in parallel. Upon completion, S3 combines the smaller pieces into the original larger object.

Breaking a large object upload into smaller pieces has a number of advantages. It can improve throughput by uploading a number of parts in parallel. It can also recover from a network error more quickly by only restarting the upload for the failed parts.

Multipart upload consists of:

  1. Initiate the multipart upload and obtain an upload id via the CreateMultipartUpload API call.
  2. Divide the large object into multiple parts, get a presigned URL for each part, and upload the parts of a large object in parallel via the UploadPart API call.
  3. Complete the upload by calling the CompleteMultipartUpload API call.

When used with presigned URLs, multipart upload allows an application to upload the large object using a secure, time-limited method without sharing private bucket credentials.

This Lambda function can initiate a multipart upload on behalf of a web or mobile application:

const multipartUpload = await s3. createMultipartUpload(multipartParams).promise()
return {
    statusCode: 200,
    body: JSON.stringify({
        fileId: multipartUpload.UploadId,
        fileKey: multipartUpload.Key,
    headers: {
      'Access-Control-Allow-Origin': '*'

The UploadId is required for subsequent calls to upload each part and complete the upload.

Uploading objects securely using S3 presigned URLs

A web or mobile application requires write permission to upload objects to a S3 bucket. This is usually accomplished by granting access to the bucket and storing credentials within the application.

You can use presigned URLs to access S3 buckets securely without the need to share or store credentials in the calling application. In addition, presigned URLs are time-limited (the default is 15 minutes) to apply security best practices.

A web application calls an API resource that uses the S3 API calls to generate a time-limited presigned URL. The web application then uses the URL to upload an object to S3 within the allotted time, without having explicit write access to the S3 bucket. Once the presigned URL expires, it can no longer be used.

When combined with multipart upload, a presigned URL can be generated for each of the upload parts, allowing the web or mobile application to upload large objects.

This example demonstrates generating a set of presigned URLs for index number of parts:

    const multipartParams = {
        Bucket: bucket_name,
        Key: fileKey,
        UploadId: fileId,
    const promises = []

    for (let index = 0; index < parts; index++) {
            s3.getSignedUrlPromise("uploadPart", {
            PartNumber: index + 1,
            Expires: parseInt(url_expiration)
    const signedUrls = await Promise.all(promises)

Prior to calling getSignedUrlPromise, the client must obtain an UploadId via CreateMultipartUpload. Read Generating a presigned URL to share an object for more information.

Reducing latency by using transfer acceleration

By using S3 transfer acceleration, the application can take advantage of the globally distributed edge locations in Amazon CloudFront. When combined with multipart uploads, each part can be uploaded automatically to the edge location closest to the user, reducing the upload time.

Transfer acceleration must be enabled on the S3 bucket. It can be accessed using the endpoint bucketname.s3-acceleration.amazonaws.com or bucketname.s3-accelerate.dualstack.amazonaws.com to connect to the enabled bucket over IPv6.

Use the speed comparison tool to test the benefits of the transfer acceleration from your location.

You can use transfer acceleration with multipart uploads and presigned URLs to allow a web or mobile application to upload large objects securely and efficiently.

Transfer acceleration needs must be enabled on the S3 bucket. This example creates an S3 bucket with transfer acceleration using CDK and TypeScript:

const s3Bucket = new s3.Bucket(this, "document-upload-bucket", {
      bucketName: “BUCKET-NAME”,
      encryption: BucketEncryption.S3_MANAGED,
      enforceSSL: true,
      transferAcceleration: true,      
      removalPolicy: cdk.RemovalPolicy.DESTROY

After activating transfer acceleration on the S3 bucket, the backend application can generate transfer acceleration-enabled presigned URLs. by initializing the S3 SDK:

s3 = new AWS.S3({useAccelerateEndpoint: true});

The web or mobile application then use the presigned URLs to upload file parts.

See S3 transfer acceleration for more information.

Deploying the test solution

To set up and run the tests outlined in this blog, you need:

  • An AWS account.
  • Install and configure AWS CLI.
  • Install and bootstrap AWS CDK.
  • Deploy the backend and frontend solution at the following git repository.
  • A sufficiently large test upload file of at least 100 MB.

To deploy the backend:

  1. Clone the repository to your local machine.
  2. From the backendv2 folder, install all dependencies by running:
    npm install
  3. Use CDK to deploy the backend to AWS:
    cdk deploy --context env="randnumber" --context whitelistip="xx.xx.xxx.xxx"

You can use an additional context variable called “urlExpiry” to set a specific expiration time on the S3 presigned URL. The default value is set at 300 seconds. A new S3 bucket with the name “document-upload-bucket-randnumber” is created for storing the uploaded objects, and the whitelistip value allows API Gateway access from this IP address only.

Note the API Gateway endpoint URL for later.

To deploy the frontend:

  1. From the frontend folder, install the dependencies:
    npm install
  2. To launch the frontend application from the browser, run:
    npm run start

Testing the application

Testing the application

To test the application:

  1. Launch the user interface from the frontend folder:
    npm run
  2. Enter the API Gateway address in the API URL textbox.Select the maximum size of each part of the upload (the minimum is 5 MB) and the number of parallel uploads. Use your available bandwidth, TCP window size, and retry time requirements to determine the optimal part size. Web browsers have a limit on the number of concurrent connections to the same server. Specifying a larger number of concurrent connections results in blocking on the web browser side.
  3. Decide if transfer acceleration should be used to further reduce latency.
  4. Choose a test upload file.
  5. Use the Monitor section to observe the total time to upload the test file.

Experiment with different values for part size, number of parallel uploads, use of transfer acceleration and the size of the test file to see the effects on total upload time. You can also use the developer tools for your browser to gain more insights.

Test results

The following tests have the following characteristics:

  • The S3 bucket is located in the US East Region.
  • The client’s average upload speed is 79 megabits per second.
  • The Firefox browser uploaded a file of 485 MB.

Test 1 – Single part upload without transfer acceleration

To create a baseline, the test file is uploaded without transfer acceleration and using only a single part. This simulates a large file upload without the benefits of multipart upload. The baseline result is 72 seconds.

Single part upload without transfer acceleration

Test 2 – Single upload with transfer acceleration

The next test measured upload time using transfer acceleration while still maintaining a single upload part with no multipart upload benefits. The result is 43 seconds (40% faster).

Single upload with transfer acceleration

Test 3 – Multipart upload without transfer acceleration

This test uses multipart upload by splitting the test file into 5-MB parts with a maximum of six parallel uploads. Transfer acceleration is disabled. The result is 45 seconds (38% faster).

Multipart upload without transfer acceleration

Test 4 – Multipart upload with transfer acceleration

For this test, the test file is uploaded by splitting the file into 5-MB parts with a maximum of six parallel uploads. Transfer acceleration is enabled for each upload. The result is 28 seconds (61% faster).

Multipart upload with transfer acceleration

The following chart summarizes the test results.

Multipart upload Transfer acceleration Upload time
No No 72s
Yes No 43s
No Yes 45s
Yes Yes 28s


This blog shows how web and mobile applications can upload large objects to Amazon S3 in a secured and efficient manner when using presigned URLs and multipart upload.

Developers can also use transfer acceleration to reduce latency and speed up object uploads. When combined with multipart upload, you can see upload time reduced by up to 61%.

Use the reference implementation to start incorporating multipart upload and S3 transfer acceleration in your web and mobile applications.

For more serverless learning resources, visit Serverless Land.

Synchronize your Salesforce and Snowflake data to speed up your time to insight with Amazon AppFlow

Post Syndicated from Ramesh Ranganathan original https://aws.amazon.com/blogs/big-data/synchronize-your-salesforce-and-snowflake-data-to-speed-up-your-time-to-insight-with-amazon-appflow/

This post was co-written with Amit Shah, Principal Consultant at Atos.

Customers across industries seek meaningful insights from the data captured in their Customer Relationship Management (CRM) systems. To achieve this, they combine their CRM data with a wealth of information already available in their data warehouse, enterprise systems, or other software as a service (SaaS) applications. One widely used approach is getting the CRM data into your data warehouse and keeping it up to date through frequent data synchronization.

Integrating third-party SaaS applications is often complicated and requires significant effort and development. Developers need to understand the application APIs, write implementation and test code, and maintain the code for future API changes. Amazon AppFlow, which is a low-code/no-code AWS service, addresses this challenge.

Amazon AppFlow is a fully managed integration service that enables you to securely transfer data between SaaS applications, like Salesforce, SAP, Zendesk, Slack, and ServiceNow, and AWS services like Amazon Simple Storage Service (Amazon S3) and Amazon Redshift in just a few clicks. With Amazon AppFlow, you can run data flows at enterprise scale at the frequency you choose—on a schedule, in response to a business event, or on demand.

In this post, we focus on synchronizing your data from Salesforce to Snowflake (on AWS) without writing code. This post walks you through the steps to set up a data flow to address full and incremental data load using an example use case.

Solution overview

Our use case involves the synchronization of the Account object from Salesforce into Snowflake. In this architecture, you use Amazon AppFlow to filter and transfer the data to your Snowflake data warehouse.

You can configure Amazon AppFlow to run your data ingestion in three different ways:

  • On-demand – You can manually run the flow through the AWS Management Console, API, or SDK call.
  • Event-driven – Amazon AppFlow can subscribe and listen to change data capture (CDC) events from the source SaaS application.
  • Scheduled – Amazon AppFlow can run schedule-triggered flows based on a pre-defined schedule rule. With scheduled flows, you can choose either full or incremental data transfer:
    • With full transfer, Amazon AppFlow transfers a snapshot of all records at the time of the flow run from the source to the destination.
    • With incremental transfer, Amazon AppFlow transfers only the records that have been added or changed since the last successful flow run. To determine the incremental delta of your data, AppFlow requires you to specify a source timestamp field to instruct how Amazon AppFlow identifies new or updated records.

We use the on-demand trigger for the initial load of data from Salesforce to Snowflake, because it helps you pull all the records, irrespective of their creation. To then synchronize data periodically with Snowflake, after we run the on-demand trigger, we configure a scheduled trigger with incremental transfer. With this approach, Amazon AppFlow pulls the records based on a chosen timestamp field from the Salesforce Account object periodically, based on the time interval specified in the flow.

The Account_Staging table is created in Snowflake to act as a temporary storage that can be used to identify the data change events. Then the permanent table (Account) is updated from the staging table by running a SQL stored procedure that contains the incremental update logic. The following figure depicts the various components of the architecture and the data flow from the source to the target.

The data flow contains the following steps:

  1. First, the flow is run with on-demand and full transfer mode to load the full data into Snowflake.
  2. The Amazon AppFlow Salesforce connector pulls the data from Salesforce and stores it in the Account Data S3 bucket in CSV format.
  3. The Amazon AppFlow Snowflake connector loads the data into the Account_Staging table.
  4. A scheduled task, running at regular intervals in Snowflake, triggers a stored procedure.
  5. The stored procedure starts an atomic transaction that loads the data into the Account table and then deletes the data from the Account_Staging table.
  6. After the initial data is loaded, you update the flow to capture incremental updates from Salesforce. The flow trigger configuration is changed to scheduled, to capture data changes in Salesforce. This enables Snowflake to get all updates, deletes, and inserts in Salesforce at configured intervals.
  7. The flow uses the configured LastModifiedDate field to determine incremental changes.
  8. Steps 3, 4, and 5 are run again to load the incremental updates into the Snowflake Accounts table.


To get started, you need the following prerequisites:

  • A Salesforce user account with sufficient privileges to install connected apps. Amazon AppFlow uses a connected app to communicate with Salesforce APIs. If you don’t have a Salesforce account, you can sign up for a developer account.
  • A Snowflake account with sufficient permissions to create and configure the integration, external stage, table, stored procedures, and tasks.
  • An AWS account with access to AWS Identity and Access Management (IAM), Amazon AppFlow, and Amazon S3.

Set up Snowflake configuration and Amazon S3 data

Complete the following steps to configure Snowflake and set up your data in Amazon S3:

  1. Create two S3 buckets in your AWS account: one for holding the data coming from Salesforce, and another for holding error records.

A best practice when creating your S3 bucket is to make sure you block public access to the bucket to ensure your data is not accessible by unauthorized users.

  1. Create an IAM policy named snowflake-access that allows listing the bucket contents and reading S3 objects inside the bucket.

Follow the instructions for steps 1 and 2 in Configuring a Snowflake Storage Integration to Access Amazon S3 to create an IAM policy and role. Replace the placeholders with your S3 bucket names.

  1. Log in to your Snowflake account and create a new warehouse called SALESFORCE and database called SALESTEST.
  2. Specify the format in which data will be available in Amazon S3 for Snowflake to load (for this post, CSV):
CREATE or REPLACE file format my_csv_format
type = csv
field_delimiter = ','
Y skip_header = 1
null_if = ('NULL', 'null')
empty_field_as_null = true
compression = gzip;
  1. Amazon AppFlow uses the Snowflake COPY command to move data using an S3 bucket. To configure this integration, follow steps 3–6 in Configuring a Snowflake Storage Integration to Access Amazon S3.

These steps create a storage integration with your S3 bucket, update IAM roles with Snowflake account and user details, and creates an external stage.

This completes the setup in Snowflake. In the next section, you create the required objects in Snowflake.

Create schemas and procedures in Snowflake

In your Snowflake account, complete the following steps to create the tables, stored procedures, and tasks for implementing the use case:

  1. In your Snowflake account, open a worksheet and run the following DDL scripts to create the Account and Account_staging tables:
primary key (ACCOUNT_NUMBER)

primary key (ACCOUNT_NUMBER)
  1. Create a stored procedure in Snowflake to load data from staging to the Account table:
CREATE or REPLACE procedure sp_account_load( )
returns varchar not null
language sql
Begin transaction;
when matched AND ACCOUNT_STAGING.DELETED=TRUE then delete
when matched then UPDATE SET
when NOT matched then
) ;


This stored procedure determines whether the data contains new records that need to be inserted or existing records that need to be updated or deleted. After a successful run, the stored procedure clears any data from your staging table.

  1. Create a task in Snowflake to trigger the stored procedure. Make sure that the time interval for this task is more than the time interval configured in Amazon AppFlow for pulling the incremental changes from Salesforce. The time interval should be sufficient for data to be processed.
SCHEDULE = 'USING CRON 5 * * * * America/Los_Angeles'
call sp_account_load();
  1. Provide the required permissions to run the task and resume the task:
show tasks;
  • As soon as task is created it will be suspended state so needs to resume it manually first time
  • If the role which is assigned to us doesn’t have proper access to resume/execute task needs to grant execute task privilege to that role

This completes the Snowflake part of configuration and setup.

Create a Salesforce connection

First, let’s create a Salesforce connection that can be used by AppFlow to authenticate and pull records from your Salesforce instance. On the AWS console, make sure you are in the same Region where your Snowflake instance is running.

  1. On the Amazon AppFlow console, choose Connections in the navigation pane.
  2. From the list of connectors, select Salesforce.
  3. Choose Create connection.
  4. For Connection name, enter a name of your choice (for example, Salesforce-blog).
  5. Leave the rest of the fields as default and choose Continue.
  6. You’re redirected to a sign-in page, where you need to log in to your Salesforce instance.
  7. After you allow Amazon AppFlow access to your Salesforce account, your connection is successfully created.

 Create a Snowflake connection

Complete the following steps to create your Snowflake connection:

  1. On the Connections menu, choose Snowflake.
  2. Choose Create connection.
  3. Provide information for the Warehouse, Stage name, and Bucket details fields.
  4. Enter your credential details.

  1. For Region, choose the same Region where Snowflake is running.
  2. For Connection name, name your connection Snowflake-blog.
  3. Leave the rest of the fields as default and choose Connect.

Create a flow in Amazon AppFlow

Now you create a flow in Amazon AppFlow to load the data from Salesforce to Snowflake. Complete the following steps:

  1. On the Amazon AppFlow console, choose Flows in the navigation pane.
  2. Choose Create flow.
  3. On the Specify flow details page, enter a name for the flow (for example, AccountData-SalesforceToSnowflake).
  4. Optionally, provide a description for the flow and tags.
  5. Choose Next.

  1. On the Configure flow page, for Source name¸ choose Salesforce.
  2. Choose the Salesforce connection we created in the previous step (Salesforce-blog).
  3. For Choose Salesforce object, choose Account.
  4. For Destination name, choose Snowflake.
  5. Choose the newly created Snowflake connection.
  6. For Choose Snowflake object, choose the staging table you created earlier (SALESTEST.PUBLIC. ACCOUNT_STAGING).

  1. In the Error handling section, provide your error S3 bucket.
  2. For Choose how to trigger the flow¸ select Run on demand.
  3. Choose Next.

  1. Select Manually map fields to map the fields between your source and destination.
  2. Choose the fields Account Number, Account Name, Account Type, Annual Revenue, Active, Deleted, and Last Modified Date.

  1. Map each source field to its corresponding destination field.
  2. Under Additional settings, leave the Import deleted records unchecked (default setting).

  1. In the Validations section, add validations for the data you’re pulling from Salesforce.

Because the schema for the Account_Staging table in Snowflake database has a NOT NULL constraint for the fields Account_Number and Active, records containing a null value for these fields should be ignored.

  1. Choose Add Validation to configure validations for these fields.
  2. Choose Next.

  1. Leave everything else as default, proceed to the final page, and choose Create Flow.
  2. After the flow is created, choose Run flow.

When the flow run completes successfully, it will bring all records into your Snowflake staging table.

Verify data in Snowflake

The data will be loaded into the Account_staging table. To verify that data is loaded in Snowflake, complete the following steps:

  1. Validate the number of records by querying the ACCOUNT_STAGING table in Snowflake.
  2. Wait for your Snowflake task to run based on the configured schedule.
  3. Verify that all the data is transferred to the ACCOUNT table and the ACCOUNT_STAGING table is truncated.

Configure an incremental data load from Salesforce

Now let’s configure an incremental data load from Salesforce:

  1. On the Amazon AppFlow console, select your flow, and choose Edit.
  2. Go to the Edit configuration step and change to Run flow on schedule.
  3. Set the flow to run every 5 minutes, and provide a start date of Today, with a start time in the future.
  4. Choose Incremental transfer and choose the LastModifiedDate field.
  5. Choose Next.
  6. In the Additional settings section, select Import deleted records.

This ensures that deleted records from the source are also ingested.

  1. Choose Save and then choose Activate flow.

Now your flow is configured to capture all incremental changes.

Test the solution

Log in to your Salesforce account, and edit any record in the Account object.

Within 5 minutes or less, a scheduled flow will pick up your change and write the changed record into your Snowflake staging table and trigger the synchronization process.

You can see the details of the run, including number of records transferred, on the Run History tab of your flow.

Clean up

Clean up the resources in your AWS account by completing the following steps:

  1. On the Amazon AppFlow console, choose Flows in the navigation pane.
  2. From the list of flows, select the flow AccountData-SalesforceToSnowflakeand delete it.
  3. Enter delete to delete the flow.
  4. Choose Connections in the navigation pane.
  5. Choose Salesforce from the list of connectors, select Salesforce-blog, and delete it.
  6. Enter delete to delete the connector.
  7. On the Connections page, choose Snowflake from the list of connectors, select Snowflake-blog, and delete it.
  8. Enter delete to delete the connector.
  9. On the IAM console, choose Roles in the navigation page, then select the role you created for Snowflake and delete it.
  10. Choose Policies in the navigation pane, select the policy you created for Snowflake, and delete it.
  11. On the Amazon S3 console, search for the data bucket you created, choose Empty to delete the objects, then delete the bucket.
  12. Search for the error bucket you created, choose Empty to delete the objects, then delete the bucket.
  13. Clean up resources in your Snowflake account:
  • Delete the task TASK_ACCOUNT_LOAD:
  • Delete the stored procedure sp_account_load:
DROP procedure sp_account_load();
  • Delete the tables ACCOUNT_STAGING and ACCOUNT:


In this post, we walked you through how to integrate and synchronize your data from Salesforce to Snowflake using Amazon AppFlow. This demonstrates how you can set up your ETL jobs without having to learn new programming languages by using Amazon AppFlow and your familiar SQL language. This is a proof of concept, but you can try to handle edge cases like failure of Snowflake tasks or understand how incremental transfer works by making multiple changes to a Salesforce record within the scheduled time interval.

For more information on Amazon AppFlow, visit Amazon AppFlow.

About the authors

Ramesh Ranganathan is a Senior Partner Solution Architect at AWS. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, application modernization and cloud native development. He is passionate about technology and enjoys experimenting with AWS Serverless services.

Kamen Sharlandjiev is an Analytics Specialist Solutions Architect and Amazon AppFlow expert. He’s on a mission to make life easier for customers who are facing complex data integration challenges. His secret weapon? Fully managed, low-code AWS services that can get the job done with minimal effort and no coding.

Amit Shah is a cloud based modern data architecture expert and currently leading AWS Data Analytics practice in Atos. Based in Pune in India, he has 20+ years of experience in data strategy, architecture, design and development. He is on a mission to help organization become data-driven.

Quick Restoration through Replacing the Root Volumes of Amazon EC2 instances

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/quick-restoration-through-replacing-the-root-volumes-of-amazon-ec2/

This blog post is written by Katja-Maja Krödel, IoT Specialist Solutions Architect, and Benjamin Meyer, Senior Solutions Architect, Game Tech.

Customers use Amazon Elastic Compute Cloud (Amazon EC2) instances to develop, deploy, and test applications. To use those instances most effectively, customers have expressed the need to set back their instance to a previous state within minutes or even seconds. They want to find a quick and automated way to manage setting back their instances at scale.

The feature of replacing Root Volumes of Amazon EC2 instances enables customers to replace the root volumes of running EC2 instances to a specific snapshot or its launch state. Without stopping the instance, this allows customers to fix issues while retaining the instance store data, networking, and AWS Identity and Access Management (IAM) configuration. Customers can resume their operations with their instance store data intact. This works for all virtualized EC2 instances and bare metal EC2 Mac instances today.

In this post, we show you how to design your architecture for automated Root Volume Replacement using this Amazon EC2 feature. We start with the automated snapshot creation, continue with automatically replacing the root volume, and finish with how to keep your environment clean after your replacement job succeeds.

What is Root Volume Replacement?

Amazon EC2 enables customers to replace the root Amazon Elastic Block Store (Amazon EBS) volume for an instance without stopping the instance to which it’s attached. An Amazon EBS root volume is replaced to the launch state, or any snapshot taken from the EBS volume itself. This allows issues to be fixed, such as root volume corruption or guest OS networking errors. Replacing the root volume of an instance includes the following steps:

  • A new EBS volume is created from a previously taken snapshot or the launch state
  • Reboot of the instance
  • While rebooting, the current root volume is detached and the new root volume is attached

The previous EBS root volume isn’t deleted and can be attached to an instance for later investigation of the volume. If replacing to a different state of the EBS than the launch state, then a snapshot of the current root volume is used.

An example use case is a continuous integration/continuous deployment (CI/CD) System that builds on EC2 instances to build artifacts. Within this system, you could alter the installed tools on the host and may cause failing builds on the same machine. To prevent any unclean builds, the introduced architecture is used to clean up the machine by replacing the root volume to a previously known good state. This is especially interesting for EC2 Mac Instances, as their Dedicated Host won’t undergo the scrubbing process, and the instance is more quickly restored than launching a fresh EC2 Mac instance on the same host.


The feature of replacing Root Volumes was introduced in April 2021 and has just been <TBD> extended to work for Bare Metal EC2 Mac Instances. This means that EC2 Mac Instances are included. If you want to reset an EC2 instance to a previously known good state, then you can create Snapshots of your EBS volumes. To reset the root volume to its launch state, no snapshot is needed. For non-root volumes, you can use these Snapshots to create new EBS volumes, and then attach those to your instance as well as detach them. To automate the process of replacing your root volume not only once, but also in a repeatable manner, we’re introducing you to an architecture that can fully-automate this process.

In the case that you use a snapshot to create a new root volume, you must take a new snapshot of that volume to be able to get back to that state later on. You can’t use a snapshot of a different volume to restore to, which is the reason that the architecture includes the automatic snapshot creation of a fresh root volume.

The architecture is built in three steps:

  1. Automation of Snapshot Creation for new EBS volumes
  2. Automation of replacing your Root Volume
  3. Preparation of the environment for the next Root Volume Replacement

The following diagram illustrates the architecture of this solution.

 Architecture of the automated creation of Root Volumes for Amazon EC2 Instances

In the next sections, we go through these concepts to design the automatic Root Volume Replacement Task.

Automation of Snapshot Creation for new EBS volumes

Architecture of the automated creation of Snapshots of new EBS Volumes.

The figure above illustrates the architecture for automatically creating a snapshot of an existing EBS volume. In this architecture, we focus on the automation of creating a snapshot whenever a new EBS root volume is created.

Amazon EventBridge is used to invoke an AWS Lambda function on an emitted createVolume event. For automated reaction to the event, you can add a rule to the EventBridge which will forward the event to an AWS Lambda function whenever a new EBS volume is created. The rule within EventBridge looks like this:

  "source": ["aws.ec2"],
  "detail-type": ["EBS Volume Notification"],
  "detail": {
    "event": ["createVolume"]

An example event is emitted when an EBS root volume is created, which will then invoke the Lambda function to look like this:

   "version": "0",
   "id": "01234567-0123-0123-0123-012345678901",
   "detail-type": "EBS Volume Notification",
   "source": "aws.ec2",
   "account": "012345678901",
   "time": "yyyy-mm-ddThh:mm:ssZ",
   "region": "us-east-1",
   "resources": [
   "detail": {
      "result": "available",
      "cause": "",
      "event": "createVolume",
      "request-id": "01234567-0123-0123-0123-0123456789ab"

The code of the function uses the resource ARN within the received event and requests resource details about the EBS volume from the Amazon EC2 APIs. Since the event doesn’t include information if it’s a root volume, then you must verify this using the Amazon EC2 API.

The following is a summary of the tasks of the Lambda function:

  1. Extract the EBS ARN from the EventBridge Event
  2. Verify that it’s a root volume of an EC2 Instance
  3. Call the Amazon EC2 API create-snapshot to create a snapshot of the root volume and add a tag replace-snapshot=true

Then, the tag is used to clean up the environment and get rid of snapshots that aren’t needed.

As an alternative, you can emit your own event to EventBridge. This can be used to automatically create snapshots to which you can restore your volume. Instead of reacting to the createVolume event, you can use a customized approach for this architecture.

Automation of replacing your Root Volume

Architecture of the automated creation of Snapshots of new EBS Volumes.

The figure above illustrates the procedure of replacing the EBS root volume. It starts with the event, which is created through the AWS Command Line Interface (AWS CLI), console, or usage of the API. This leads to creating a new volume from a snapshot or using the initial launch state. The EC2 instance is rebooted, and during that time the old root volume is detached and a new volume gets attached as the root volume.

To invoke the create-replace-root-volume-task, you can call the Amazon EC2 API with the following AWS CLI command:

aws ec2 create-replace-root-volume-task --instance-id <value> --snapshot <value> --tag-specifications ResourceType=string,Tags=[{Key=replaced-volume,Value=true}]

If you want to restore to launch state, then omit the --snapshot parameter:

aws ec2 create-replace-root-volume-task --instance-id <value> --tag-specifications ResourceType=string,Tags=[{Key=delete-volume,Value=true}]

After running this command, AWS will create a new EBS volume, add the tag to the old EBS replaced-volume=true, restart your instance, and attach the new volume to the instance as the root volume. The tag is used later to detect old root volumes and clean up the environment.

If this is combined with the earlier explained automation, then the automation will immediately take a snapshot from the new EBS volume. A restore operation can only be done to a snapshot of the current EBS root volume. Therefore, if no snapshot is taken from the freshly restored EBS volume, then no restore operation is possible except the restore to launch state.

Preparation of the Environment for the next Root Volume Replacement

After the task is completed, the old root volume isn’t removed. Additionally, snapshots of previous root volumes can’t be used to restore current root volumes. To clean up your environment, you can schedule a Lambda function which does the following steps:

  • Delete detached EBS volumes with the tag delete-volume=true
  • Delete snapshots with the tag replace-snapshot=true, which aren’t associated with an existing EBS volume


In this post, we described an architecture to quickly restore EC2 instances through Root Volume Replacement. The feature of replacing Root Volumes of Amazon EC2 instances, now including Bare Metal EC2 Mac instances, enables customers to replace the root volumes of running EC2 instances to a specific snapshot or its launch state. Customers can resume their operations with their instance store data intact. We’ve split the process of doing this in an automated and quick manner into three steps: Create a snapshot, run the replacement task, and reset your environment to be prepared for a following replacement task. If you want to learn more about this feature, then see the Announcement of replacing Root Volumes, as well as the documentation for this feature. <TBD Announcement Bare Metal>

Analyze Amazon S3 storage costs using AWS Cost and Usage Reports, Amazon S3 Inventory, and Amazon Athena

Post Syndicated from Dagar Katyal original https://aws.amazon.com/blogs/big-data/analyze-amazon-s3-storage-costs-using-aws-cost-and-usage-reports-amazon-s3-inventory-and-amazon-athena/

Since its launch in 2006, Amazon Simple Storage Service (Amazon S3) has experienced major growth, supporting multiple use cases such as hosting websites, creating data lakes, serving as object storage for consumer applications, storing logs, and archiving data. As the application portfolio grows, customers tend to store data from multiple application and different business functions in a single S3 bucket, which can grow the storage in S3 buckets to hundreds of TBs. The AWS Billing console provides a way to look at the total storage cost of data stored in Amazon S3, but sometimes IT organizations need to understand the breakdown of costs of a particular S3 bucket by various prefixes or objects corresponding to a particular user or application. There are various reasons to analyze the costs of S3 buckets, such as to identify the spend breakdown, do internal chargebacks, understand the cost breakdown by business unit and application, and many more. As of this writing, there is no easy way to do a cost breakdown of S3 buckets by objects and prefixes.

In this post, we discuss a solution using Amazon Athena to query AWS Cost and Usage Reports and Amazon S3 Inventory reports to analyze the cost by prefixes and objects in an S3 bucket.

Overview of solution

The following figure shows the architecture for this solution. First, we enable the AWS Cost and Usage Reports (AWS CUR) and Amazon S3 Inventory features, which save the output into two separate pre-created S3 buckets. We then use Athena to query these S3 buckets for AWS CUR data and S3 object inventory data to correlate and allocate the cost breakdown at the object or prefix level.

architecture diagram

To implement the solution, we complete the following steps:

  1. Create S3 buckets for AWS CUR, S3 object inventory, and Athena results. Alternatively, you can create these respective buckets when enabling the respective individual features, but for the purpose of this post, we create all of them at the beginning.
  2. Enable the Cost and Usage Reports.
  3. Enable Amazon S3 Inventory configuration.
  4. Create AWS Glue Data Catalog tables for the CUR and S3 object inventory to query using Athena.
  5. Run queries in Athena.


For this walkthrough, you should have the following prerequisites:

Create S3 buckets

Amazon S3 is an object storage service offering industry-leading scalability, data availability, security, and performance. Customers of all sizes and industries can store and protect any amount of data for virtually any use case, such as data lakes, cloud-native applications, and mobile apps. With cost-effective storage classes and easy-to-use management features, you can optimize costs, organize data, and configure fine-tuned access controls to meet specific business, organizational, and compliance requirements.

For this post, we use the S3 bucket s3-object-cost-allocation as the primary bucket for cost allocation. This S3 bucket is conveniently modeled to contain several prefixes and objects of different sizes for which cost allocation needs to be done based on the overall cost of the bucket. In a real-world scenario, you should use a bucket that has data for multiple teams and for which you need to allocate costs by prefix or object. Going forward, we refer to this bucket as the primary object bucket.

The following screenshot shows our S3 bucket and folders.

example Folders created

Now let’s create the three additional operational S3 buckets to store the datasets generated to calculate costs for the objects. You can create the following buckets or any existing buckets as needed:

  • cur-cost-usage-reports-<account_number> – This bucket is used to save the Cost and Usage Reports for the account.
  • S3-inventory-configurations-<account_number> – This bucket is used to save the inventory configurations of our primary object bucket.
  • athena-query-bucket-<account_number> – This bucket is used to save the query results from Athena.

Complete the following steps to create your S3 buckets:

  • On the Amazon S3 console, choose Buckets in the navigation pane.
  • Choose Create bucket.
  • For Bucket name, enter the name of your bucket (cur-cost-usage-reports-<account_number>).
  • For AWS Region, choose your preferred Region.
  • Leave all other settings at default (or according to your organization’s standards).
  • Choose Create bucket.
    create s3 bucket
  • Repeat these steps to create s3-inventory-configurations-<account_number> and athena-query-bucket-<account_number>.

Enable the Cost and Usage Reports

The AWS Cost and Usage Reports (AWS CUR) contains the most comprehensive set of cost and usage data available. You can use Cost and Usage Reports to publish your AWS billing reports to an S3 bucket that you own. You can receive reports that break down your costs by the hour, day, or month; by product or product resource; or by tags that you define yourself.

Complete the following steps to enable Cost and Usage Reports for your account:

  • On the AWS Billing console, in the navigation pane, choose Cost & Usage Reports.
  • Choose Create report.
  • For Report name, enter a name for your report, such as account-cur-s3.
  • For Additional report details, select Include resource IDs to include the IDs of each individual resource in the report.Including resource IDs will create individual line items for each of your resources. This can increase the size of your Cost and Usage Reports files significantly, which can affect the S3 storage costs for your CUR, based on your AWS usage. We need this feature enabled for this post.
  • For Data refresh settings, select whether you want the Cost and Usage Reports to refresh if AWS applies refunds, credits, or support fees to your account after finalizing your bill.When a report refreshes, a new report is uploaded to Amazon S3.
  • Choose Next.
  • For S3 bucket, choose Configure.
  • For Configure S3 Bucket, select an existing bucket created in the previous section (cur-cost-usage-reports-<account_number>) and choose Next.
  • Review the bucket policy, select I have confirmed that this policy is correct, and choose Save. This default bucket policy provides Cost and Usage Reports access to write data to Amazon S3.
  • For Report path prefix, enter cur-data/account-cur-daily.
  • For Time granularity, choose Daily.
  • For Report versioning, choose Overwrite existing report.
  • For Enable report data integration for, select Amazon Athena.
  • Choose Next.
  • After you have reviewed the settings for your report, choose Review and Complete.
    create cost and usage report

The Cost and Usage reports will be delivered to the S3 buckets within 24 hours.

The following sample CUR in CSV format shows different columns of the Cost and Usage Report, including bill_invoice_id, bill_invoicing_entity, bill_payer_account_id, and line_item_product_code, to name a few.

sample cost and usage report

Enable Amazon S3 Inventory configuration

Amazon S3 Inventory is one of the tools Amazon S3 provides to help manage your storage. You can use it to audit and report on the replication and encryption status of your objects for business, compliance, and regulatory needs. Amazon S3 Inventory provides comma-separated values (CSV), Apache Optimized Row Columnar (ORC), or Apache Parquet output files that list your objects and their corresponding metadata on a daily or weekly basis for an S3 bucket or a shared prefix (objects that have names that begin with a common string).

Complete the following steps to enable Amazon S3 Inventory on the primary object bucket:

  • On the Amazon S3 console, choose Buckets in the navigation pane.
  • Choose the bucket for which you want to configure Amazon S3 Inventory.
    This will be the existing bucket in your account that has data that needs to be analyzed. This could be your data lake or application S3 bucket. We created the bucket s3-object-cost-allocation with some sample data and folder structure.
  • Choose Management.
  • Under Inventory configurations, choose Create inventory configuration.
  • For Inventory configuration name, enter s3-object-cost-allocation.
  • For Inventory scope, leave Prefix blank.
    This is to ensure that all objects are covered for the report.
  • For Object Versions, select Current version only.
  • For Report details, choose This account.
  • For Destination, choose the destination bucket we created (s3-inventory-configurations-<account_number>).
  • For Frequency, choose Daily.
  • For Output format, choose as Apache Parquet.
  • For Status, choose Enable.
  • Keep server-side encryption disabled. To use server-side encryption, choose Enable and specify the encryption key.
  • For Additional fields, select the following to add to the inventory report:
    • Size – The object size in bytes.
    • Last modified date – The object creation date or the last modified date, whichever is the latest.
    • Multipart upload – Specifies that the object was uploaded as a multipart upload. For more information, see Uploading and copying objects using multipart upload.
    • Replication status – The replication status of the object. For more information, see Using the S3 console.
    • Encryption status – The server-side encryption used to encrypt the object. For more information, see Protecting data using server-side encryption.
    • Bucket key status – Indicates whether a bucket-level key generated by AWS KMS applies to the object.
    • Storage class – The storage class used for storing the object.
    • Intelligent-Tiering: Access tier – Indicates the access tier of the object if it was stored in Intelligent-Tie
      create s3 inventory
  • Choose Create.
    s3 inventory configuration

It may take up to 48 hours to deliver the first report.

Create AWS Glue Data Catalog tables for CUR and Amazon S3 Inventory reports

Wait for up to 48 hours for the previous step to generate the reports. In this section, we use Athena to create and define AWS Glue Data Catalog tables for the data that has been created using Cost and Usage Reports and Amazon S3 Inventory reports.

Athena is a serverless, interactive analytics service built on open-source frameworks, supporting open-table and file formats. Athena provides a simplified, flexible way to analyze petabytes of data where it lives.

Complete the following steps to create the tables using Athena:

  • Navigate to the Athena console.
  • If you’re using Athena for the first time, you need to set up a query result location in Amazon S3. If you preconfigured this in Athena , you can skip this step.
    • Choose View settings.
      athena setup query bucket
    • Choose Manage.
    • In the section Query result location and encryption, choose Browse S3 and choose the bucket that we created (athena-query-bucket-<account_number>).
    • Choose Save.
      Athena Config
    • Navigate back to the Athena query editor.
  • Run the following query in Athena to create a table for Cost and Usage Reports. Verify and update the section for <<LOCATION>> at the end of the query and point it to the correct S3 bucket and location. Note that the new table name should be account_cur.
    CREATE EXTERNAL TABLE `account_cur`(
    `identity_line_item_id` string,
    `identity_time_interval` string,
    `bill_invoice_id` string,
    `bill_billing_entity` string,
    `bill_bill_type` string,
    `bill_payer_account_id` string,
    `bill_billing_period_start_date` timestamp,
    `bill_billing_period_end_date` timestamp,
    `line_item_usage_account_id` string,
    `line_item_line_item_type` string,
    `line_item_usage_start_date` timestamp,
    `line_item_usage_end_date` timestamp,
    `line_item_product_code` string,
    `line_item_usage_type` string,
    `line_item_operation` string,
    `line_item_availability_zone` string,
    `line_item_resource_id` string,
    `line_item_usage_amount` double,
    `line_item_normalization_factor` double,
    `line_item_normalized_usage_amount` double,
    `line_item_currency_code` string,
    `line_item_unblended_rate` string,
    `line_item_unblended_cost` double,
    `line_item_blended_rate` string,
    `line_item_blended_cost` double,
    `line_item_line_item_description` string,
    `line_item_tax_type` string,
    `line_item_legal_entity` string,
    `product_product_name` string,
    `product_availability` string,
    `product_description` string,
    `product_durability` string,
    `product_event_type` string,
    `product_fee_code` string,
    `product_fee_description` string,
    `product_free_query_types` string,
    `product_from_location` string,
    `product_from_location_type` string,
    `product_from_region_code` string,
    `product_group` string,
    `product_group_description` string,
    `product_location` string,
    `product_location_type` string,
    `product_message_delivery_frequency` string,
    `product_message_delivery_order` string,
    `product_operation` string,
    `product_platopricingtype` string,
    `product_product_family` string,
    `product_queue_type` string,
    `product_region` string,
    `product_region_code` string,
    `product_servicecode` string,
    `product_servicename` string,
    `product_sku` string,
    `product_storage_class` string,
    `product_storage_media` string,
    `product_to_location` string,
    `product_to_location_type` string,
    `product_to_region_code` string,
    `product_transfer_type` string,
    `product_usagetype` string,
    `product_version` string,
    `product_volume_type` string,
    `pricing_rate_code` string,
    `pricing_rate_id` string,
    `pricing_currency` string,
    `pricing_public_on_demand_cost` double,
    `pricing_public_on_demand_rate` string,
    `pricing_term` string,
    `pricing_unit` string,
    `reservation_amortized_upfront_cost_for_usage` double,
    `reservation_amortized_upfront_fee_for_billing_period` double,
    `reservation_effective_cost` double,
    `reservation_end_time` string,
    `reservation_modification_status` string,
    `reservation_normalized_units_per_reservation` string,
    `reservation_number_of_reservations` string,
    `reservation_recurring_fee_for_usage` double,
    `reservation_start_time` string,
    `reservation_subscription_id` string,
    `reservation_total_reserved_normalized_units` string,
    `reservation_total_reserved_units` string,
    `reservation_units_per_reservation` string,
    `reservation_unused_amortized_upfront_fee_for_billing_period` double,
    `reservation_unused_normalized_unit_quantity` double,
    `reservation_unused_quantity` double,
    `reservation_unused_recurring_fee` double,
    `reservation_upfront_value` double,
    `savings_plan_total_commitment_to_date` double,
    `savings_plan_savings_plan_a_r_n` string,
    `savings_plan_savings_plan_rate` double,
    `savings_plan_used_commitment` double,
    `savings_plan_savings_plan_effective_cost` double,
    `savings_plan_amortized_upfront_commitment_for_billing_period` double,
    `savings_plan_recurring_commitment_for_billing_period` double,
    `resource_tags_user_bucket_name` string,
    `resource_tags_user_cost_tracking` string)
    `year` string,
    `month` string)
  • Run the following query in Athena to create the table for Amazon S3 Inventory. Verify and update the section for <<LOCATION>> at the end of the query and point it to the correct S3 bucket and location.
    • To get the exact value of the location, navigate to the bucket where inventory configurations are stored and navigate to the folder path Hive . Use the S3 URI to replace <<LOCATION>> in the query.query path location
      CREATE EXTERNAL TABLE s3_object_inventory(
               bucket string,
               key string,
               version_id string,
               is_latest boolean,
               is_delete_marker boolean,
               size bigint,
               last_modified_date bigint,
               storage_class string,
               is_multipart_uploaded boolean,
               replication_status string,
               encryption_status string,
               intelligent_tiering_access_tier string,
               bucket_key_status string
              dt string
      ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
        STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
        OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
        LOCATION '<<LOCATION>>';
  • We need to refresh the partitions and add new inventory lists to the table. Use the following commands to add data to the CUR table and Amazon S3 Inventory table:
    MSCK REPAIR TABLE `account_cur`;
    MSCK REPAIR TABLE s3_object_inventory;

Run queries in Athena to allocate the cost of objects in an S3 bucket

Now we can query the data we have available to get a cost allocation breakdown at the prefix level.

We need to provide some information in the following queries:

  • Update <<YYYY-MM-DD>> with the date for which you want to analyze the data
  • Update <<prefix>> with the prefix values for your bucket that needs to be analyzed
  • Update <<bucket_name>> with the name of the bucket that needs to be analyzed

We use the following part of the query to calculate the size of storage being used by the target prefix that we want to calculate the cost for:

select date_parse(dt,'%Y-%m-%d-%H-%i') dt, cast (sum(size) as double) targetPrefixBytes
from s3_object_inventory
where date_parse(dt,'%Y-%m-%d-%H-%i') = cast('<<YYYY-MM-DD>>' as timestamp)
and key like '<<prefix>>/%'
group by dt

Next, we calculate the total size of the bucket on that particular date:

select date_parse(dt,'%Y-%m-%d-%H-%i') dt, cast (sum(size) as double) totalBytes
from s3_object_inventory
where date_parse(dt,'%Y-%m-%d-%H-%i') = cast('<<YYYY-MM-DD>>' as timestamp)
group by dt

We query the CUR table to get the cost of a particular bucket on a particular date:

select line_item_usage_start_date as dt, sum(line_item_blended_cost) as line_item_blended_cost
from "account_cur"
where line_item_product_code = 'AmazonS3'
and product_servicecode = 'AmazonS3'
and line_item_operation = 'StandardStorage'
and line_item_resource_id = '<<bucket_name>>'
and line_item_usage_start_date = cast('<<YYYY-MM-DD>>' as timestamp)
group by line_item_usage_start_date

Putting all of this together, we can calculate the cost of a particular prefix (folder or a file) on a specific date. The complete query is as follows:

cost as (select line_item_usage_start_date as dt, sum(line_item_blended_cost) as line_item_blended_cost
from "account_cur"
where line_item_product_code = 'AmazonS3'
and product_servicecode = 'AmazonS3'
and line_item_operation = 'StandardStorage'
and line_item_resource_id = '<<bucket_name>>'
and line_item_usage_start_date = cast('<<YYYY-MM-DD>>' as timestamp)
group by line_item_usage_start_date),
total as (select date_parse(dt,'%Y-%m-%d-%H-%i') dt, cast (sum(size) as double) totalBytes
from s3_object_inventory
where date_parse(dt,'%Y-%m-%d-%H-%i') = cast('<<YYYY-MM-DD>>' as timestamp)
group by dt),
target as (select date_parse(dt,'%Y-%m-%d-%H-%i') dt, cast (sum(size) as double) targetPrefixBytes
from s3_object_inventory
where date_parse(dt,'%Y-%m-%d-%H-%i') = cast('<<YYYY-MM-DD>>' as timestamp)
and key like '<<prefix>>/%'
group by dt)
select target.dt,
(target.targetPrefixBytes/ total.totalBytes * 100) percentUsed,
cost.line_item_blended_cost totalCost,
cost.line_item_blended_cost*(target.targetPrefixBytes/ total.totalBytes) as prefixCost
from target, total, cost
where target.dt = total.dt
and target.dt = cost.dt

The following screenshot shows the results table for the sample data we used in this post. We get the following information:

  • dt – Date
  • percentUsed – The percentage of prefix space compared to overall bucket space
  • totalCost – The total cost of the bucket
  • prefixCost – The cost of the space used by the prefix

final result percetage

Clean up

To stop incurring costs, be sure to disable Amazon S3 Inventory and Cost and Usage Reports when you’re done.

Delete the S3 buckets created for the Amazon S3 Inventory reports and Cost and Usage Reports to avoid storage charges.

Other methods for Amazon S3 storage analysis

Amazon S3 Storage Lens can provide a single view of object storage usage and activity across your entire Amazon S3 storage. With S3 Storage Lens, you can understand, analyze, and optimize storage with over 29 usage and activity metrics and interactive dashboards to aggregate data for your entire organization, specific accounts, Regions, buckets, or prefixes. All of this data is accessible on the Amazon S3 console or as raw data in an S3 bucket.

S3 Storage Lens doesn’t provide cost analysis based on an object or prefix in a single bucket. If you want visibility of storage usage and trends across the entire storage footprint along with recommendations on cost efficiency and data protection best practices, S3 Storage Lens is the right option. But if you want a cost analysis of specific S3 buckets and looking for ways to get cost allocation of S3 objects at the object or prefix level, the solution in this post would be the best fit.


In this post, we detailed how to create a cost breakdown model at the object or prefix level for S3 buckets that contains data for multiple business units and applications. We used Athena to query the reports and datasets produced by the AWS CUR and Amazon S3 Inventory features that, when correlated, give us the cost allocation at the object and prefix level. This solution gives you an easy way to calculate costs for independent objects and prefixes, which can be used for internal chargebacks or just to know the per-object or per-prefix spending in a shared S3 bucket.

About the Authors

Dagar Katyal
is a Senior Solutions Architect at AWS, based in Chicago, Illinois. He works with customers and provides guidance for key strategic initiatives important for their business. Dagar has an MBA and has spent years over 15 years working with customers on projects on analytics strategy, roadmap, and using data as a key differentiator. When not working with customers, Dagar spends time with his family and doing home improvement projects.

Saiteja Pudi
is a Solutions Architect at AWS, based in Dallas, Tx. He has been with AWS for more than 3 years now, helping customers derive the true potential of AWS by being their trusted advisor. He comes from an application development background, interested in Data Science and Machine Learning.

Streaming the AWS Wickr desktop client with Amazon AppStream 2.0

Post Syndicated from Charles H. original https://aws.amazon.com/blogs/architecture/streaming-the-aws-wickr-desktop-client-with-amazon-appstream-2-0/

Amazon Web Services (AWS) customers using AWS Wickr who want to find a way to access their AWS Wickr Windows desktop client though a web browser, can use Amazon AppStream 2.0 to stream the application through to their users.

Using this architecture, you can provide lightweight access to the AWS Wickr desktop client for users that cannot install it onto their local device. By using AppStream 2.0, you can focus on managing your AWS Wickr network while AppStream 2.0 manages the AWS resources required to host and run the application, scaling automatically and providing on-demand access to your users.

If you want to ensure that AWS Wickr user data persists between streaming sessions, you can make use of AppStream 2.0 user persistence to securely save user data (including AWS Wickr client data) to Amazon Simple Storage Service (Amazon S3).

In this post, we discuss how to build an AppStream 2.0 image for AWS Wickr on Windows, enable persistence for users, and deploy a stack.

Solution overview

These steps will illustrate deploying an AppStream 2.0 Windows Image Builder to a single availability zone. Then we deploy an AppStream 2.0 fleet with internet access and user data persistence to three availability zones for high availability.

Review the Regions and Availability Zones documentation and the AWS Regional Services List to choose the best Region for your deployment, as well as the networking and bandwidth requirements for User Connections to Amazon AppStream 2.0.

AWS Wickr with Amazon AppStream 2.0 Architecture

Figure 1. AWS Wickr with Amazon AppStream 2.0 Architecture


The costs associated with using AWS services when deploying AppStream 2.0 and AWS Wickr in your AWS account can be estimated on the pricing pages for the services used.


This walkthrough takes you from installing the AWS Wickr for Windows desktop client onto an AppStream 2.0 Image Builder, to configuring the client for persistence, and finally to deploying a fleet and stack for your users to consume:

  1. Install AWS Wickr for Windows client onto your AppStream 2.0 Windows Image Builder.
  2. Configure the client using the Image Assistant to set up user data and settings persistence.
  3. Create an on-demand instance streaming fleet.
  4. Create a user stack and enable user data and settings persistence.
  5. Create a user pool.
  6. Test your streaming application and prove data persistence.


You should have the following prerequisites:

  • Familiarity with AppStream 2.0 and an existing AWS Wickr account with associated credentials. Currently, it is not possible for an AWS Wickr user to register their account over AppStream 2.0.
  • An AWS account with access to AppStream 2.0. Instructions for setting up an AWS account can be found at Setting Up for Amazon AppStream 2.0.
  • A running AppStream 2.0 image builder, based on the WinServer2019-10-05-2022 base image with the AWS Wickr for Windows client installer downloaded to it.
  • An up-to-date Google Chrome browser (if you want to use your device’s webcam).

Install AWS Wickr

AppStream 2.0 uses Amazon Elastic Compute Cloud (Amazon EC2) instances to stream applications. You launch instances from base images, called image builders. In this step you will install the AWS Wickr client onto your image builder before moving on to configuration.

  1. Connect to your image builder as an Administrator.
  2. Run the AWS Wickr installer and carry out the following steps:
    1. On the Welcome to the AWS Wickr Setup Wizard prompt, choose Next.
    2. For Installation Type, choose Everybody (all users).
    3. Choose Next to accept the default installation folder.
    4. Choose Install.
  3. Once the AWS Wickr client has finished installing, uncheck Launch Wickr on the final screen, and then choose Finish.

Run the Image Assistant

To create your own custom image, connect to an image builder instance, install and configure your applications for streaming with guidance from the Image Assistant, and then create your image by creating a snapshot of the image builder instance:

  1. Run the Image Assistant shortcut found on the desktop.
  2. Under 1. ADD APPS, choose Add App.
  3. Choose Program Files, Amazon Web Services, Wickr, and AWS Wickr.
  4. Scroll to the bottom, choose the Wickr application, and then choose Open.
  5. In the pop-up that appears, customize the Name and Display Name (if needed), and then in the Launch Parameters box, enter -datalocation “C:\Users\%username%” (Figure 2).
    Updating Launch Parameters

    Figure 2. Updating Launch Parameters

    Note: If you want to mask the name of the application within the URL used for streaming, replace the Display Name field text with something else.

  6. Choose Save.
  7. Wickr will now appear as an app on the Image Assistant screen. Choose Next.
  8. Under the 2. CONFIGURE APPS stage, follow the instructions from 1-5, choose Save settings, and choose Next (Figure 3).

    Configuring the Application with Image Assistant

    Figure 3. Configuring the Application with Image Assistant

  9. Under the 3. TEST stage, carry out steps 1-3.
  10. Under the 4. OPTIMIZE stage, select Launch.
  11. As instructed, once the app has launched successfully, select Continue and wait for the app to be optimized.
  12. Under the 5. CONFIGURE IMAGE stage, give the image that will be created a Name, Display name, and Description.
  13. Choose the Always use latest agent version checkbox. This ensures that new image builders or fleet instances that are launched from your image always use the latest AppStream 2.0 agent version, and then select Next (Figure 4).

    Finalizing the Application Image

    Figure 4. Finalizing the Application Image

  14. Under the 6. REVIEW stage, select Disconnect and Create Image. Your session will be terminated while your image is being created.

Create an AppStream 2.0 fleet

With AppStream 2.0, you create fleet instances and stacks as part of the process of streaming applications. A fleet consists of streaming instances that run the image that you specify.

  1. Return to the AppStream 2.0 management console.
  2. Choose Fleets from the menu on the left.
  3. Choose your Fleet type (this demonstration uses On-Demand) and choose Next.
  4. Give the Fleet a Name, Display name, and a Description.
    Note: If you wish to mask the name of the fleet within the URL used for streaming, replace the Name and Display name section with something else.
  5. Choose your Fleet instance type (this walkthrough uses a general purpose stream.standard.large instance type).
  6. Adjust the Fleet capacity as required, leave the other sections as they’re displayed by default, and choose Next.
  7. On the Choose an Image screen, choose the image you created earlier.
  8. On the Configure network screen, choose Enable default internet access and choose your VPC, along with three subnets that you will deploy into. Finally, choose the Security group that you will use to restrict access (the default is to allow access from all IPs). Choose Next.
  9. Review your settings and then choose Create fleet.

Create an AppStream 2.0 stack

A stack consists of an associated fleet, user access policies, and storage configurations. In this step, you will create a stack, enable application settings persistence, and associate the stack with the fleet you provisioned previously.

  1. From the menu in the AppStream 2.0 management console, choose Stacks.
  2. Choose Create Stack.
  3. Give the stack a Name, and optionally a Display name and Description.
  4. Leave all other options as they’re displayed by default and choose Next.
  5. In the Enable storage window, ensure that Enable home folders is selected, leave all the settings as they are, and choose Next.
  6. In the Edit user settings section, modify the copy and paste functionality to your requirements, and ensure that Enable application settings persistence is selected. Choose Next.
  7. Review your configuration and choose Create stack.
  8. You will now be presented with an overview of the stack you have created. Choose the Action dropdown list and choose Associate fleet.
  9. From the dropdown list, choose the fleet that you previously provisioned and choose Associate.

Create an AppStream 2.0 user pool

Users can access application stacks through a persistent URL and login credentials by using their email address and a password that they choose. In this step, you will create a user and assign it to a stack so you can access your AppStream 2.0 streaming session.

  1. Choose User pool.
  2. Choose Create user.
  3. Enter an email address, first name, and last name, and then choose Create User.
  4. In the User pool window, choose the User and then choose Action.
  5. Choose Assign stack, choose the newly created stack, and choose Send email notification to user.
  6. Choose Assign stack.
  7. You will receive two emails. Follow the instructions on the one titled Start accessing your apps using AppStream 2.0 to access your app.

Launch your AppStream 2.0 streaming session

In this step, you will use the user created earlier to log in to an AppStream 2.0 streaming session. You will then prove AWS Wickr user data persistence by exiting and logging back into your session.

Note: You will only be able to launch your session once your fleet has been provisioned. This can take around 15-20 minutes.

  1. Choose the login page link from the email you received in the previous section and log in.
  2. You will be presented with the AWS Wickr client icon. Choose it to start your session.
  3. Log in to the AWS Wickr client with your credentials.
  4. As you have application persistence enabled, you can close the tab and the session will pick up from where you left it when you log back in (Figure 5).
Accessing the Streaming Application

Figure 5. Accessing the Streaming Application


To avoid incurring future charges, delete the stack, user, fleet, custom image, and image builder that you have created.


In this post, we demonstrated how customers can take advantage of AppStream 2.0 as a managed service to enable the provisioning of AWS Wickr clients for users, with persistence between sessions, through a web browser.

How BookMyShow saved 80% in costs by migrating to an AWS modern data architecture

Post Syndicated from Mahesh Vandi Chalil original https://aws.amazon.com/blogs/big-data/how-bookmyshow-saved-80-in-costs-by-migrating-to-an-aws-modern-data-architecture/

This is a guest post co-authored by Mahesh Vandi Chalil, Chief Technology Officer of BookMyShow.

BookMyShow (BMS), a leading entertainment company in India, provides an online ticketing platform for movies, plays, concerts, and sporting events. Selling up to 200 million tickets on an annual run rate basis (pre-COVID) to customers in India, Sri Lanka, Singapore, Indonesia, and the Middle East, BookMyShow also offers an online media streaming service and end-to-end management for virtual and on-ground entertainment experiences across all genres.

The pandemic gave BMS the opportunity to migrate and modernize our 15-year-old analytics solution to a modern data architecture on AWS. This architecture is modern, secure, governed, and cost-optimized architecture, with the ability to scale to petabytes. BMS migrated and modernized from on-premises and other cloud platforms to AWS in just four months. This project was run in parallel with our application migration project and achieved 90% cost savings in storage and 80% cost savings in analytics spend.

The BMS analytics platform caters to business needs for sales and marketing, finance, and business partners (e.g., cinemas and event owners), and provides application functionality for audience, personalization, pricing, and data science teams. The prior analytics solution had multiple copies of data, for a total of over 40 TB, with approximately 80 TB of data in other cloud storage. Data was stored on‑premises and in the cloud in various data stores. Growing organically, the teams had the freedom to choose their technology stack for individual projects, which led to the proliferation of various tools, technology, and practices. Individual teams for personalization, audience, data engineering, data science, and analytics used a variety of products for ingestion, data processing, and visualization.

This post discusses BMS’s migration and modernization journey, and how BMS, AWS, and AWS Partner Minfy Technologies team worked together to successfully complete the migration in four months and saving costs. The migration tenets using the AWS modern data architecture made the project a huge success.

Challenges in the prior analytics platform

  • Varied Technology: Multiple teams used various products, languages, and versions of software.
  • Larger Migration Project: Because the analytics modernization was a parallel project with application migration, planning was crucial in order to consider the changes in core applications and project timelines.
  • Resources: Experienced resource churn from the application migration project, and had very little documentation of current systems.
  • Data : Had multiple copies of data and no single source of truth; each data store provided a view for the business unit.
  • Ingestion Pipelines: Complex data pipelines moved data across various data stores at varied frequencies. We had multiple approaches in place to ingest data to Cloudera, via over 100 Kafka consumers from transaction systems and MQTT(Message Queue Telemetry Transport messaging protocol) for clickstreams, stored procedures, and Spark jobs. We had approximately 100 jobs for data ingestion across Spark, Alteryx, Beam, NiFi, and more.
  • Hadoop Clusters: Large dedicated hardware on which the Hadoop clusters were configured incurring fixed costs. On-premises Cloudera setup catered to most of the data engineering, audience, and personalization batch processing workloads. Teams had their implementation of HBase and Hive for our audience and personalization applications.
  • Data warehouse: The data engineering team used TiDB as their on-premises data warehouse. However, each consumer team had their own perspective of data needed for analysis. As this siloed architecture evolved, it resulted in expensive storage and operational costs to maintain these separate environments.
  • Analytics Database: The analytics team used data sourced from other transactional systems and denormalized data. The team had their own extract, transform, and load (ETL) pipeline, using Alteryx with a visualization tool.

Migration tenets followed which led to project success:

  • Prioritize by business functionality.
  • Apply best practices when building a modern data architecture from Day 1.
  • Move only required data, canonicalize the data, and store it in the most optimal format in the target. Remove data redundancy as much possible. Mark scope for optimization for the future when changes are intrusive.
  • Build the data architecture while keeping data formats, volumes, governance, and security in mind.
  • Simplify ELT and processing jobs by categorizing the jobs as rehosted, rewritten, and retired. Finalize canonical data format, transformation, enrichment, compression, and storage format as Parquet.
  • Rehost machine learning (ML) jobs that were critical for business.
  • Work backward to achieve our goals, and clear roadblocks and alter decisions to move forward.
  • Use serverless options as a first option and pay per use. Assess the cost and effort for rearchitecting to select the right approach. Execute a proof of concept to validate this for each component and service.

Strategies applied to succeed in this migration:

  • Team – We created a unified team with people from data engineering, analytics, and data science as part of the analytics migration project. Site reliability engineering (SRE) and application teams were involved when critical decisions were needed regarding data or timeline for alignment. The analytics, data engineering, and data science teams spent considerable time planning, understanding the code, and iteratively looking at the existing data sources, data pipelines, and processing jobs. AWS team with partner team from Minfy Technologies helped BMS arrive at a migration plan after a proof of concept for each of the components in data ingestion, data processing, data warehouse, ML, and analytics dashboards.
  • Workshops – The AWS team conducted a series of workshops and immersion days, and coached the BMS team on the technology and best practices to deploy the analytics services. The AWS team helped BMS explore the configuration and benefits of the migration approach for each scenario (data migration, data pipeline, data processing, visualization, and machine learning) via proof-of-concepts (POCs). The team captured the changes required in the existing code for migration. BMS team also got acquainted with the following AWS services:
  • Proof of concept – The BMS team, with help from the partner and AWS team, implemented multiple proofs of concept to validate the migration approach:
    • Performed batch processing of Spark jobs in Amazon EMR, in which we checked the runtime, required code changes, and cost.
    • Ran clickstream analysis jobs in Amazon EMR, testing the end-to-end pipeline. Team conducted proofs of concept on AWS IoT Core for MQTT protocol and streaming to Amazon S3.
    • Migrated ML models to Amazon SageMaker and orchestrated with Amazon MWAA.
    • Created sample QuickSight reports and dashboards, in which features and time to build were assessed.
    • Configured for key scenarios for Amazon Redshift, in which time for loading data, query performance, and cost were assessed.
  • Effort vs. cost analysis – Team performed the following assessments:
    • Compared the ingestion pipelines, the difference in data structure in each store, the basis of the current business need for the data source, the activity for preprocessing the data before migration, data migration to Amazon S3, and change data capture (CDC) from the migrated applications in AWS.
    • Assessed the effort to migrate approximately 200 jobs, determined which jobs were redundant or need improvement from a functional perspective, and completed a migration list for the target state. The modernization of the MQTT workflow code to serverless was time-consuming, decided to rehost on Amazon Elastic Compute Cloud (Amazon EC2) and modernization to Amazon Kinesis in to the next phase.
    • Reviewed over 400 reports and dashboards, prioritized development in phases, and reassessed business user needs.

AWS cloud services chosen for proposed architecture:

  • Data lake – We used Amazon S3 as the data lake to store the single truth of information for all raw and processed data, thereby reducing the copies of data storage and storage costs.
  • Ingestion – Because we had multiple sources of truth in the current architecture, we arrived at a common structure before migration to Amazon S3, and existing pipelines were modified to do preprocessing. These one-time preprocessing jobs were run in Cloudera, because the source data was on-premises, and on Amazon EMR for data in the cloud. We designed new data pipelines for ingestion from transactional systems on the AWS cloud using AWS Glue ETL.
  • Processing – Processing jobs were segregated based on runtime into two categories: batch and near-real time. Batch processes were further divided into transient Amazon EMR clusters with varying runtimes and Hadoop application requirements like HBase. Near-real-time jobs were provisioned in an Amazon EMR permanent cluster for clickstream analytics, and a data pipeline from transactional systems. We adopted a serverless approach using AWS Glue ETL for new data pipelines from transactional systems on the AWS cloud.
  • Data warehouse – We chose Amazon Redshift as our data warehouse, and planned on how the data would be distributed based on query patterns.
  • Visualization – We built the reports in Amazon QuickSight in phases and prioritized them based on business demand. We discussed with business users their current needs and identified the immediate reports required. We defined the phases of report and dashboard creation and built the reports in Amazon QuickSight. We plan to use embedded reports for external users in the future.
  • Machine learning – Custom ML models were deployed on Amazon SageMaker. Existing Airflow DAGs were migrated to Amazon MWAA.
  • Governance, security, and compliance – Governance with Amazon Lake Formation was adopted from Day 1. We configured the AWS Glue Data Catalog to reference data used as sources and targets. We had to comply to Payment Card Industry (PCI) guidelines because payment information was in the data lake, so we ensured the necessary security policies.

Solution overview

BMS modern data architecture

The following diagram illustrates our modern data architecture.

The architecture includes the following components:

  1. Source systems – These include the following:
    • Data from transactional systems stored in MariaDB (booking and transactions).
    • User interaction clickstream data via Kafka consumers to DataOps MariaDB.
    • Members and seat allocation information from MongoDB.
    • SQL Server for specific offers and payment information.
  2. Data pipeline – Spark jobs on an Amazon EMR permanent cluster process the clickstream data from Kafka clusters.
  3. Data lake – Data from source systems was stored in their respective Amazon S3 buckets, with prefixes for optimized data querying. For Amazon S3, we followed a hierarchy to store raw, summarized, and team or service-related data in different parent folders as per the source and type of data. Lifecycle polices were added to logs and temp folders of different services as per teams’ requirements.
  4. Data processing – Transient Amazon EMR clusters are used for processing data into a curated format for the audience, personalization, and analytics teams. Small file merger jobs merge the clickstream data to a larger file size, which saved costs for one-time queries.
  5. Governance – AWS Lake Formation enables the usage of AWS Glue crawlers to capture the schema of data stored in the data lake and version changes in the schema. The Data Catalog and security policy in AWS Lake Formation enable access to data for roles and users in Amazon Redshift, Amazon Athena, Amazon QuickSight, and data science jobs. AWS Glue ETL jobs load the processed data to Amazon Redshift at scheduled intervals.
  6. Queries – The analytics team used Amazon Athena to perform one-time queries raised from business teams on the data lake. Because report development is in phases, Amazon Athena was used for exporting data.
  7. Data warehouse – Amazon Redshift was used as the data warehouse, where the reports for the sales teams, management, and third parties (i.e., theaters and events) are processed and stored for quick retrieval. Views to analyze the total sales, movie sale trends, member behavior, and payment modes are configured here. We use materialized views for denormalized tables, different schemas for metadata, and transactional and behavior data.
  8. Reports – We used Amazon QuickSight reports for various business, marketing, and product use cases.
  9. Machine learning – Some of the models deployed on Amazon SageMaker are as follows:
    • Content popularity – Decides the recommended content for users.
    • Live event popularity – Calculates the popularity of live entertainment events in different regions.
    • Trending searches – Identifies trending searches across regions.


Migration execution steps

We standardized tools, services, and processes for data engineering, analytics, and data science:

  • Data lake
    • Identified the source data to be migrated from Archival DB, BigQuery, TiDB, and the analytics database.
    • Built a canonical data model that catered to multiple business teams and reduced the copies of data, and therefore storage and operational costs. Modified existing jobs to facilitate migration to a canonical format.
    • Identified the source systems, capacity required, anticipated growth, owners, and access requirements.
    • Ran the bulk data migration to Amazon S3 from various sources.
  • Ingestion
    • Transaction systems – Retained the existing Kafka queues and consumers.
    • Clickstream data – Successfully conducted a proof of concept to use AWS IoT Core for MQTT protocol. But because we needed to make changes in the application to publish to AWS IoT Core, we decided to implement it as part of mobile application modernization at a later time. We decided to rehost the MQTT server on Amazon EC2.
  • Processing
  • Listed the data pipelines relevant to business and migrated them with minimal modification.
  • Categorized workloads into critical jobs, redundant jobs, or jobs that can be optimized:
    • Spark jobs were migrated to Amazon EMR.
    • HBase jobs were migrated to Amazon EMR with HBase.
    • Metadata stored in Hive-based jobs were modified to use the AWS Glue Data Catalog.
    • NiFi jobs were simplified and rewritten in Spark run in Amazon EMR.
  • Amazon EMR clusters were configured one persistent cluster for streaming the clickstream and personalization workloads. We used multiple transient clusters for running all other Spark ETL or processing jobs. We used Spot Instances for task nodes to save costs. We optimized data storage with specific jobs to merge small files and compressed file format conversions.
  • AWS Glue crawlers identified new data in Amazon S3. AWS Glue ETL jobs transformed and uploaded processed data to the Amazon Redshift data warehouse.
  • Datawarehouse
    • Defined the data warehouse schema by categorizing the critical reports required by the business, keeping in mind the workload and reports required in future.
    • Defined the staging area for incremental data loaded into Amazon Redshift, materialized views, and tuning the queries based on usage. The transaction and primary metadata are stored in Amazon Redshift to cater to all data analysis and reporting requirements. We created materialized views and denormalized tables in Amazon Redshift to use as data sources for Amazon QuickSight dashboards and segmentation jobs, respectively.
    • Optimally used the Amazon Redshift cluster by loading last two years data in Amazon Redshift, and used Amazon Redshift Spectrum to query historical data through external tables. This helped balance the usage and cost of the Amazon Redshift cluster.
  • Visualization
    • Amazon QuickSight dashboards were created for the sales and marketing team in Phase 1:
      • Sales summary report – An executive summary dashboard to get an overview of sales across the country by region, city, movie, theatre, genre, and more.
      • Live entertainment – A dedicated report for live entertainment vertical events.
      • Coupons – A report for coupons purchased and redeemed.
      • BookASmile – A dashboard to analyze the data for BookASmile, a charity initiative.
  • Machine learning
    • Listed the ML workloads to be migrated based on current business needs.
    • Priority ML processing jobs were deployed on Amazon EMR. Models were modified to use Amazon S3 as source and target, and new APIs were exposed to use the functionality. ML models were deployed on Amazon SageMaker for movies, live event clickstream analysis, and personalization.
    • Existing artifacts in Airflow orchestration were migrated to Amazon MWAA.
  • Security
    • AWS Lake Formation was the foundation of the data lake, with the AWS Glue Data Catalog as the foundation for the central catalog for the data stored in Amazon S3. This provided access to the data by various functionalities, including the audience, personalization, analytics, and data science teams.
    • Personally identifiable information (PII) and payment data was stored in the data lake and data warehouse, so we had to comply to PCI guidelines. Encryption of data at rest and in transit was considered and configured in each service level (Amazon S3, AWS Glue Data Catalog, Amazon EMR, AWS Glue, Amazon Redshift, and QuickSight). Clear roles, responsibilities, and access permissions for different user groups and privileges were listed and configured in AWS Identity and Access Management (IAM) and individual services.
    • Existing single sign-on (SSO) integration with Microsoft Active Directory was used for Amazon QuickSight user access.
  • Automation
    • We used AWS CloudFormation for the creation and modification of all the core and analytics services.
    • AWS Step Functions was used to orchestrate Spark jobs on Amazon EMR.
    • Scheduled jobs were configured in AWS Glue for uploading data in Amazon Redshift based on business needs.
    • Monitoring of the analytics services was done using Amazon CloudWatch metrics, and right-sizing of instances and configuration was achieved. Spark job performance on Amazon EMR was analyzed using the native Spark logs and Spark user interface (UI).
    • Lifecycle policies were applied to the data lake to optimize the data storage costs over time.

Benefits of a modern data architecture

A modern data architecture offered us the following benefits:

  • Scalability – We moved from a fixed infrastructure to the minimal infrastructure required, with configuration to scale on demand. Services like Amazon EMR and Amazon Redshift enable us to do this with just a few clicks.
  • Agility – We use purpose-built managed services instead of reinventing the wheel. Automation and monitoring were key considerations, which enable us to make changes quickly.
  • Serverless – Adoption of serverless services like Amazon S3, AWS Glue, Amazon Athena, AWS Step Functions, and AWS Lambda support us when our business has sudden spikes with new movies or events launched.
  • Cost savings – Our storage size was reduced by 90%. Our overall spend on analytics and ML was reduced by 80%.


In this post, we showed you how a modern data architecture on AWS helped BMS to easily share data across organizational boundaries. This allowed BMS to make decisions with speed and agility at scale; ensure compliance via unified data access, security, and governance; and to scale systems at a low cost without compromising performance. Working with the AWS and Minfy Technologies teams helped BMS choose the correct technology services and complete the migration in four months. BMS achieved the scalability and cost-optimization goals with this updated architecture, which has set the stage for innovation using graph databases and enhanced our ML projects to improve customer experience.

About the Authors

Mahesh Vandi Chalil is Chief Technology Officer at BookMyShow, India’s leading entertainment destination. Mahesh has over two decades of global experience, passionate about building scalable products that delight customers while keeping innovation as the top goal motivating his team to constantly aspire for these. Mahesh invests his energies in creating and nurturing the next generation of technology leaders and entrepreneurs, both within the organization and outside of it. A proud husband and father of two daughters and plays cricket during his leisure time.

Priya Jathar is a Solutions Architect working in Digital Native Business segment at AWS. She has more two decades of IT experience, with expertise in Application Development, Database, and Analytics. She is a builder who enjoys innovating with new technologies to achieve business goals. Currently helping customers Migrate, Modernise, and Innovate in Cloud. In her free time she likes to paint, and hone her gardening and cooking skills.

Vatsal Shah is a Senior Solutions Architect at AWS based out of Mumbai, India. He has more than nine years of industry experience, including leadership roles in product engineering, SRE, and cloud architecture. He currently focuses on enabling large startups to streamline their cloud operations and help them scale on the cloud. He also specializes in AI and Machine Learning use cases.

Happy New Year! AWS Week in Review – January 9, 2023

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/happy-new-year-aws-week-in-review-january-9-2023/

Happy New Year! As we kick off 2023, I wanted to take a moment to remind you of some 2023 predictions by AWS leaders for you to help prepare for the new year.

You can also read the nine best things Amazon announced and AWS for Automotive at the Consumer Electronics Show (CES) 2023 in the last week to see the latest offerings from Amazon and AWS that are helping innovate at speed and create new customer experiences at the forefront of technology.

Last Year-End Launches
We skipped two weeks since the last week in review on December 19, 2022. I want to pick some important launches from them.

Last Week’s Launches
As usual, let’s take a look at some launches from the last week that I want to remind you of:

  • Amazon S3 Encrypts New Objects by Default – Amazon S3 encrypts all new objects by default. Now, S3 automatically applies server-side encryption (SSE-S3) for each new object, unless you specify a different encryption option. There is no additional cost for default object-level encryption.
  • Amazon Aurora MySQL Version 3 Backtrack Support – Backtrack allows you to move your MySQL 8.0 compatible Aurora database to a prior point in time without needing to restore from a backup, and it completes within seconds, even for large databases.
  • Amazon EMR Serverless Custom Images – Amazon EMR Serverless now allows you to customize images for Apache Spark and Hive. This means that you can package application dependencies or custom code in the image, simplifying running Spark and Hive workloads.
  • The Graph Explorer, Open-Source Low-Code Visual Exploration Tool – Amazon Neptune announced the graph-explorer, a React-based web application that enables users to visualize both property graph and Resource Description Framework (RDF) data and explore connections between data without having to write graph queries. To learn more about open source updates at AWS, see Ricardo’s OSS newsletter.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Here are some other news items that you may find interesting in the new year:

  • AWS Collective on Stack Overflow – Please join the AWS Collective on Stack Overflow, which provides builders a curated space to engage and learn from this large developer’s community.
  • AWS Fundamentals Book – This upcoming AWS online book is intended to focus on AWS usage in the real world, and goes deeper with amazing per-service infographics.
  • AWS Security Events Workshops – AWS Customer Incident Response Team (CIRT) release five real-world workshops that simulate security events, such as server-side request forgery, ransomware, and cryptominer-based security events, to help you learn the tools and procedures that AWS CIRT uses.

Upcoming AWS Events
Check your calendars and sign up for these AWS events in the new year:

  • AWS Builders Online Series on January 18 – This online conference is designed for you to learn core AWS concepts, and step-by-step architectural best practices, including demonstrations to help you get started and accelerate your success on AWS.
  • AWS Community Day Singapore on January 28 – Come and join AWS User Group Singapore’s first AWS Community Day, a community-led conference for AWS users. See Events for Developers to learn about developer events hosted by AWS and the AWS Community.
  • AWS Cloud Practitioner Essentials Day in January and February – This online workshop provides a detailed overview of cloud concepts, AWS services, security, architecture, pricing, and support. This course also helps you prepare for the AWS Certified Cloud Practitioner examination.

You can browse all upcoming in-person, and virtual events.

That’s all for this week. Check back next Monday for another Week in Review!

— Channy

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Amazon S3 Encrypts New Objects By Default

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/amazon-s3-encrypts-new-objects-by-default/

At AWS, security is job zero. Starting today, Amazon Simple Storage Service (Amazon S3) encrypts all new objects by default. Now, S3 automatically applies server-side encryption (SSE-S3) for each new object, unless you specify a different encryption option. SSE-S3 was first launched in 2011. As Jeff wrote at the time: “Amazon S3 server-side encryption handles all encryption, decryption, and key management in a totally transparent fashion. When you PUT an object, we generate a unique key, encrypt your data with the key, and then encrypt the key with a [root] key.”

This change puts another security best practice into effect automatically—with no impact on performance and no action required on your side. S3 buckets that do not use default encryption will now automatically apply SSE-S3 as the default setting. Existing buckets currently using S3 default encryption will not change.

As always, you can choose to encrypt your objects using one of the three encryption options we provide: S3 default encryption (SSE-S3, the new default), customer-provided encryption keys (SSE-C), or AWS Key Management Service keys (SSE-KMS). To have an additional layer of encryption, you might also encrypt objects on the client side, using client libraries such as the Amazon S3 encryption client.

While it was simple to enable, the opt-in nature of SSE-S3 meant that you had to be certain that it was always configured on new buckets and verify that it remained configured properly over time. For organizations that require all their objects to remain encrypted at rest with SSE-S3, this update helps meet their encryption compliance requirements without any additional tools or client configuration changes.

With today’s announcement, we have now made it “zero click” for you to apply this base level of encryption on every S3 bucket.

Verify Your Objects Are Encrypted
The change is visible today in AWS CloudTrail data event logs. You will see the changes in the S3 section of the AWS Management Console, Amazon S3 Inventory, Amazon S3 Storage Lens, and as an additional header in the AWS CLI and in the AWS SDKs over the next few weeks. We will update this blog post and documentation when the encryption status is available in these tools in all AWS Regions.

To verify the change is effective on your buckets today, you can configure CloudTrail to log data events. By default, trails do not log data events, and there is an extra cost to enable it. Data events show the resource operations performed on or within a resource, such as when a user uploads a file to an S3 bucket. You can log data events for Amazon S3 buckets, AWS Lambda functions, Amazon DynamoDB tables, or a combination of those.

Once enabled, search for PutObject API for file uploads or InitiateMultipartUpload for multipart uploads. When Amazon S3 automatically encrypts an object using the default encryption settings, the log includes the following field as the name-value pair: "SSEApplied":"Default_SSE_S3". Here is an example of a CloudTrail log (with data event logging enabled) when I uploaded a file to one of my buckets using the AWS CLI command aws s3 cp backup.sh s3://private-sst.

Cloudtrail log for S3 with default encryption enabled

Amazon S3 Encryption Options
As I wrote earlier, SSE-S3 is now the new base level of encryption when no other encryption-type is specified. SSE-S3 uses Advanced Encryption Standard (AES) encryption with 256-bit keys managed by AWS.

You can choose to encrypt your objects using SSE-C or SSE-KMS rather than with SSE-S3, either as “one click” default encryption settings on the bucket, or for individual objects in PUT requests.

SSE-C lets Amazon S3 perform the encryption and decryption of your objects while you retain control of the keys used to encrypt objects. With SSE-C, you don’t need to implement or use a client-side library to perform the encryption and decryption of objects you store in Amazon S3, but you do need to manage the keys that you send to Amazon S3 to encrypt and decrypt objects.

With SSE-KMS, AWS Key Management Service (AWS KMS) manages your encryption keys. Using AWS KMS to manage your keys provides several additional benefits. With AWS KMS, there are separate permissions for the use of the KMS key, providing an additional layer of control as well as protection against unauthorized access to your objects stored in Amazon S3. AWS KMS provides an audit trail so you can see who used your key to access which object and when, as well as view failed attempts to access data from users without permission to decrypt the data.

When using an encryption client library, such as the Amazon S3 encryption client, you retain control of the keys and complete the encryption and decryption of objects client-side using an encryption library of your choice. You encrypt the objects before they are sent to Amazon S3 for storage. The Java, .Net, Ruby, PHP, Go, and C++ AWS SDKs support client-side encryption.

You can follow the instructions in this blog post if you want to retroactively encrypt existing objects in your buckets.

Available Now
This change is effective now, in all AWS Regions, including on AWS GovCloud (US) and AWS China Regions. There is no additional cost for default object-level encryption.

— seb

Deploying Oracle RAC in AWS Outposts via FlashGrid Cluster

Post Syndicated from Andreas Bogner original https://aws.amazon.com/blogs/architecture/deploying-oracle-rac-in-aws-outposts-via-flashgrid-cluster/

Amazon Web Services (AWS) customers are deploying AWS Outposts as a fully managed solution that delivers AWS infrastructure and services to on-premises or edge locations for a truly consistent hybrid experience. Those hybrid cloud workloads can require highly available Oracle databases running on- or close-to premises. One way to meet this requirement is Oracle Real Application Clusters (RAC) on top of two or more AWS Outposts racks using the FlashGrid Cluster solution.

In this post, we follow up on a Marketplace blog that describes how to deploy Oracle RAC via FlashGrid Cluster in the AWS regions. Deploying the solution onto AWS Outposts requires additional steps as the Outpost racks communicate between VPCs and across the on-premises network. In the following we explain how to configure the network for a multi-Outpost setup, how to deploy FlashGrid Cluster with Oracle RAC, and how to connect to the database cluster from the on-premises network.

Solution overview

This architecture uses two logical Outposts (42U racks) that are connected to different Availability Zones (AZs) in the same region for high availability. The networking is configured such that the communication between Amazon Elastic Compute Cloud (Amazon EC2) instances on two distinct Outpost racks uses Outpost’s local gateways and the corporate network. Therefore, data will not leave the premises unless explicitly copied to the Cloud region.

The FlashGrid Cluster solution deploys one node to each Outpost rack and an additional quorum node to the Cloud region. The cluster nodes provide the Oracle ASM disk groups and the Oracle RAC nodes. FlashGrid Cluster for Oracle RAC ensures that storage is replicated between nodes (Figure 1).

Architecture for a deployment across two logical AWS Outposts

Figure 1. Architecture for a deployment across two logical AWS Outposts

We provide a complete, step-by-step guide that deploys an Oracle RAC database across two Outpost racks.

It takes three steps to get your database up and running:

  1. Networking: prepare the virtual private clouds (VPCs), subnets, and route tables
  2. FlashGrid Cluster: use the FlashGrid Launcher to create an Oracle RAC cluster
  3. Database: configure Oracle RAC and connect to the database

The deployment uses a CloudFormation template that is generated based on the workload’s specific parameters.


For this solution, you require:

Networking setup

A single VPC cannot span multiple Outposts. Therefore, each Outpost should have a separately configured VPC.

A security group allowing traffic between the nodes of the cluster must be created in each VPC. Private IP addresses of the other nodes in the cluster should be configured as allowed sources of the traffic. (A security group rule cannot reference a security group in a different VPC.)

The steps suggested for configuring the network, as in Figure 2:

  1. Create a VPC for each Outpost and a VPC between those VPCs. Ensure that the VPCs are associated with the respective local gateways on each Outpost.
  2. Within each VPC, create a subnet for the corresponding Outpost and a subnet in the Region. Add route tables to each Outpost subnet that allow cross-VPC routing (as in Table 1).Table 1. Main route table for VPC ‘Outpost 1’

    Destination Target local lgw-11111111111111111 pcx-11223344556677889
    pl-12345678 vpce-12345678901234567
  3. For each of the cluster nodes, allocate a private IP address within the corresponding subnet.
  4. Within each VPC create a separate security group for each cluster. The security group must allow all inbound and outbound traffic from all nodes of the same cluster using their private IP addresses.
  5. Optional: Create a public subnet in the region in either one of the VPCs. Deploying a bastion host EC2 instance into this subnet will allow you to connect by SSH to the Oracle RAC nodes later.
  6. Open SSH access (TCP 22) and Oracle client access (default ports: TCP 1521, 1522) either by adding the corresponding rules to the same security groups or assigning additional security groups to all cluster node instances after the cluster deployment is complete.
  7. Create an Amazon Simple Storage Service (Amazon S3) gateway endpoint in each VPC and add corresponding routes. It is critical to use Amazon S3 gateway endpoints, otherwise the Oracle RAC nodes cannot download the installation binaries from the S3 bucket.
Setting up subnets on each Outpost via the AWS console

Figure 2. Setting up subnets on each Outpost via the AWS console

FlashGrid Cluster setup

The process of deploying FlashGrid Cluster consists of five main steps:

  1. Subscribing to a FlashGrid Cluster product in AWS Marketplace
  2. Uploading Oracle installation files to an S3 bucket in the Region
  3. Modifying the cluster configuration template with parameters specific to Outposts
  4. Using the FlashGrid Launcher tool to finalize cluster configuration and generate an AWS CloudFormation template
  5. Deploying the CloudFormation template

Subscribing to a FlashGrid Cluster product in AWS Marketplace

  1. Open one of the FlashGrid Cluster product pages in AWS Marketplace corresponding with the preferred operating system.
  2. Select Continue.
  3. Select the Manual Launch tab.
  4. Choose Accept Software Terms.

Creating an S3 bucket for Oracle installation files

During cluster provisioning, Oracle installation files are downloaded from an S3 bucket. The list of files that must be placed in the S3 bucket is displayed on the Oracle Files tab of the FlashGrid Launcher tool (next step). The S3 bucket is not supposed to have any sensitive data in it and can be hosted in any AWS Region. You can use the same bucket for multiple deployments.

To create an S3 bucket with the correct IAM role access:

  1. Create an S3 bucket or folder for uploading the installation files
  2. Within the IAM console, create a policy named GetOracleFilesFromS3, which allows the s3:GetObject action on all uploaded files
  3. Create an EC2 instance role named GetOracleFilesFromS3 and attach the GetOracleFilesFromS3 policy to it
  4. Use the GetOracleFilesFromS3 role when configuring cluster parameters in the FlashGrid Launcher
  5. Once the FlashGrid Launcher has provided a list of required Oracle installation files, place those into the S3 bucket

Modifying the cluster configuration template with parameters specific to Outposts

Download the configuration file template, and open it in a text editor. The template is for a two-node RAC cluster on Outposts in Multi-AZ configuration.

For the database node instances in the configuration file, manually set the following attributes:

  • outpost_arn: ARN of the target Outpost
  • ip: private IP address assigned to the EC2 instance
  • sg: security group ID
  • ins_type: instance type (ensure that this matches the available types for the Outposts)

Creating a cluster configuration file using FlashGrid Launcher tool

FlashGrid Launcher is an online tool that allows creating your desired configuration for the cluster and then generating a CloudFormation template for it.

  1. Open FlashGrid Launcher and upload the customized configuration file created at the previous step.
  2. Follow the step-by-step instructions in the Launcher tool (see Figure 3) until you get to the last Launch step.
    a. In the Oracle Files step, there is a list of installer files. Upload these files into the S3 bucket from the previous section (Creating an S3 bucket for Oracle installation files).
    b. In the Storage step, ignore the IOPS and MBPS parameters, as these parameters will not be used because Outposts uses the GP2 type of volumes.
    c. In the Network step, provide the Security Group ID that will be used for the quorum node located in the Region. Make sure to use a security group from the quorum node’s VPC.
  3. At the Launch step, click Launch FlashGrid. This generates a CloudFormation template and takes you to the AWS CloudFormation console.
The FlashGrid Launcher generates a CloudFormation template based on various input parameters

Figure 3. The FlashGrid Launcher generates a CloudFormation template based on various input parameters

Deploying the CloudFormation template

  1. Once you are in AWS CloudFormation console, select Next.
  2. Select the SSH key; do not change network parameters if you set them correctly in the Launcher. Select Next.
  3. On the Options page, if you added tags in FlashGrid Launcher, then do not add the same tags in CloudFormation console. You are free to add further tags in this step. Select Next.
  4. Select Create, and wait until the status of the stack changes to CREATE_COMPLETE.
  5. Connect by using SSH from the bastion host to the first cluster node as ec2-user with the SSH key you were provided when the stack was deployed.
  6. The welcome message details the current initialization status of the cluster: in progress, failed, or completed.
  7. If initialization is still in progress, wait for it to complete (this includes Oracle software installation and configuration). A broadcast message is delivered when initialization completes or fails. Cluster initialization takes 1 to 2 hours, depending on configuration.

Oracle database configuration

You can create an Oracle RAC database (or multiple databases) using Oracle DBCA tool and following standard Oracle best practices.

To connect to the Oracle RAC database using the SCAN listener, configure the Domain Name System (DNS) records and the connection string on client side:

  • On the DNS server(s) used by clients, add a record resolving to the VPC Private IP address of the node instance for each database node. In a test environment without a DNS server, the entries can be added to /etc/hosts on the clients instead of the DNS server.
  • In our example deployment, this is:
  • It is important that hostnames and domain names in the DNS records exactly match the hostnames as reported by the hostname command on the database servers.
  • Finally, define a connection string with the addresses of all database nodes listed:
               (ADDRESS=(PROTOCOL=tcp) (HOST=rac2.example.com) (PORT=1521))
               (ADDRESS=(PROTOCOL=tcp) (HOST=rac2.example.com) (PORT=1521))
                  (SERVICE_NAME=<service name>)

Visit the FlashGrid Help Center for more information on creating and connecting to a database.


The Oracle RAC instance together with the FlashGrid Cluster solution is deployed as a single CloudFormation stack. Deleting this stack will remove all associated resources, including the EBS volumes. Follow the approach detailed in the Delete Your Stacks But Keep Your Data blog post at the deployment stage to retain snapshots of all volumes automatically. You can also take snapshots of all EBS volumes manually before deleting the stack.


In this blog post, we have explored how to deploy Oracle RAC across two or more Outpost racks using FlashGrid Cluster. Running Oracle RAC on top of Outposts provides a highly available database solution for use cases that require you to run the workload on-premises, as in data-residency or latency-critical scenarios. Using the growing number of features for AWS Outposts rack, and you can more efficiently run hybrid workloads using the same tools and automation both in Cloud Regions and on-premises.

Further reading

Genomics workflows, Part 3: automated workflow manager

Post Syndicated from Rostislav Markov original https://aws.amazon.com/blogs/architecture/genomics-workflows-part-3-automated-workflow-manager/

Genomics workflows are high-performance computing workloads. Life-science research teams make use of various genomics workflows. With each invocation, they specify custom sets of data and processing steps, and translate them into commands. Furthermore, team members stay to monitor progress and troubleshoot errors, which can be cumbersome, non-differentiated, administrative work.

In Part 3 of this series, we describe the architecture of a workflow manager that simplifies the administration of bioinformatics data pipelines. The workflow manager dynamically generates the launch commands based on user input and keeps track of the workflow status. This workflow manager can be adapted to many scientific workloads—effectively becoming a bring-your-own-workflow-manager for each project.

Use case

In Part 1, we demonstrated how life-science research teams can use Amazon Web Services to remove the heavy lifting of conducting genomic studies, and our design pattern was built on AWS Step Functions with AWS Batch. We mentioned that we’ve worked with life-science research teams to put failed job logs onto Amazon DynamoDB. Some teams prefer to use command-line interface tools, such as the AWS Command Line Interface; other interfaces, such as PyBDA with Apache Spark, or CWL experimental grammar in combination with the Amazon Simple Storage Service (Amazon S3) API, are also used when access to the AWS Management Console is prohibited. In our use case, scientists used the console to easily update table items, plus initiate retry via DynamoDB streams.

In this blog post, we extend this idea to a new frontend layer in our design pattern. This layer automates command generation and monitors the invocations of a variety of workflows—becoming a workflow manager. Life-science research teams use multiple workflows for different datasets and use cases, each with different syntax and commands. The workflow manager we create removes the administrative burden of formulating workflow-specific commands and tracking their launches.

Solution overview

We allow scientists to upload their requested workflow configuration as objects in Amazon S3. We use S3 Event Notifications on PUT requests to invoke an AWS Lambda function. The function parses the uploaded S3 object and registers the new launch request as a DynamoDB item using the PutItem operation. Each item corresponds with a distinct launch request, stored as key-value pair. Item values store the:

  • S3 data path containing genomic datasets
  • Workflow endpoint
  • Preferred compute service (optional)

Another Lambda function monitors for change data captures in the DynamoDB Stream (Figure 1). With each PutItem operation, the Lambda function prepares a workflow invocation, which includes translating the user input into the syntax and launch commands of the respective workflow.

In the case of Snakemake (discussed in Part 2), the function creates a Snakefile that declares processing steps and commands. The function spins up an AWS Fargate task that builds the computational tasks, distributes them with AWS Batch, and monitors for completion. An AWS Step Functions state machine orchestrates job processing, for example, initiated by Tibanna.

Amazon CloudWatch provides a consolidated overview of performance metrics, like time elapsed, failed jobs, and error types. We store log data, including status updates and errors, in Amazon CloudWatch Logs. A third Lambda function parses those logs and updates the status of each workflow launch request in the corresponding DynamoDB item (Figure 1).

Workflow manager for genomics workflows

Figure 1. Workflow manager for genomics workflows

Implementation considerations

In this section, we describe some of our past implementation considerations.

Register new workflow requests

DynamoDB items are key-value pairs. We use launch IDs as key, and the value includes the workflow type, compute engine, S3 data path, the S3 object path to the user-defined configuration file and workflow status. Our Lambda function parses the configuration file and generates all commands plus ancillary artifacts, such as Snakefiles.

Launch workflows

Launch requests are picked by a Lambda function from the DynamoDB stream. The function has the following required parameters:

  • Launch ID: unique identifier of each workflow launch request
  • Configuration file: the Amazon S3 path to the configuration sheet with launch details (in s3://bucket/object format)
  • Compute service (optional): our workflow manager allows to select a particular service on which to run computational tasks, such as Amazon Elastic Compute Cloud (Amazon EC2) or AWS ParallelCluster with Slurm Workload Manager. The default is the pre-defined compute engine.

These points assume that the configuration sheet is already uploaded into an accessible location in an S3 bucket. This will issue a new Snakemake Fargate launch task. If either of the parameters is not provided or access fails, the workflow manager returns MissingRequiredParametersError.

Log workflow launches

Logs are written to CloudWatch Logs automatically. We write the location of the CloudWatch log group and log stream into the DynamoDB table. To send logs to Amazon CloudWatch, specify the awslogs driver in the Fargate task definition settings in your provisioning template.

Our Lambda function writes Fargate task launch logs from CloudWatch Logs to our DynamoDB table. For example, OutOfMemoryError can occur if the process utilizes more memory than the container is allocated.

AWS Batch job state logs are written to the following log group in CloudWatch Logs: /aws/batch/job. Our Lambda function writes status updates to the DynamoDB table. AWS Batch jobs may encounter errors, such as being stuck in RUNNABLE state.

Manage state transitions

We manage the status of each job in DynamoDB. Whenever a Fargate task changes state, it is picked up by a CloudWatch rule that references the Fargate compute cluster. This CloudWatch rule invokes a notifier Lambda function that updates the workflow status in DynamoDB.


In this blog post, we demonstrated how life-science research teams can simplify genomic analysis across an array of workflows. These workflows usually have their own command syntax and workflow management system, such as Snakemake. The presented workflow manager removes the administrative burden of preparing and formulating workflow launches, increasing reliability.

The pattern is broadly reusable with any scientific workflow and related high-performance computing systems. The workflow manager provides persistence to enable historical analysis and comparison, which enables us to automatically benchmark workflow launches for cost and performance.

Stay tuned for Part 4 of this series, in which we explore how to enable our workflows to process archival data stored in Amazon Simple Storage Service Glacier storage classes.

Related information

Architecting your security model in AWS for legacy application migrations

Post Syndicated from Irfan Saleem original https://aws.amazon.com/blogs/architecture/architecting-your-security-model-in-aws-for-legacy-application-migrations/

Application migrations, especially from legacy/mainframe to the cloud, are done in phases that sometimes span multiple years. Each phase migrates a set of applications, data, and other resources to the cloud. During the transition phases, applications might require access to both on-premises and cloud-based resources to perform their function. While working with our customers, we observed that the most common resources that applications require access to are databases, file storage, and shared services.

This blog post includes architecture guidelines for setting up access to commonly used resources by building a security model in Amazon Web Services (AWS). As you move your legacy applications to the cloud, you can apply Zero Trust concepts and security best practices according to your security needs. With AWS, you can build strong identity and access management with centralized control and set up and manage guardrails and fine-grained access controls for your workforce and applications.

In large organizations, on-premises applications rely on mainframe-based security services, an Identity Provider (IdP) platform, or a combination of both.

  • A mainframe-based control facility enables on-premises applications to:
    • Identify and verify users.
    • Establish an authority (authorize users and backend programs to access protected resources) through privileges defined in the control facility.
    • The backend programs use a unique identifier (or surrogate key) and run under the authority defined by the privileges assigned to the unique identifier.This security mechanism needs to be transformed into a role-based security model in AWS as applications are moved to the cloud. You assign permissions to a role, which is assumed by an application to get access to resources in AWS, similar to an authority defined in the legacy environment.
  • An IdP platform (such as Octa or Ping Identify) provides capabilities such as centralized access management and identity federation using SAML 2.0 or OpenID Connect (OIDC), that builds a system of trust between on-premises IdP and AWS. Once the federation is set up, on-premises applications can access AWS resources using AWS Identity and Access Management (IAM) roles, as explained in the next section.

Setting up a scalable security model in AWS

Figure 1 shows an on-premises environment where enterprise identity management is integrated with the mainframe and provides authentication and authorization to applications running off the mainframe. Generally, mainframe-based security controls (users, resources, and profiles) are replicated to the enterprise identity platform and are kept in sync through a change data capture process.

Access to AWS resources from on-premises

Figure 1. Access to AWS resources from on-premises

To enable your on-premises applications to access AWS resources, the applications need valid AWS credentials for making AWS API requests. Avoid using long-term access keys (such as those associated with IAM users) because they remain valid until you remove them. The following two methods can be used to assume an IAM role and get temporary security credentials to gain access to the AWS resources:

  • SAML based Identity federation – AWS supports identity federation with SAML. It allows federated access to users and applications in your organization by assuming an IAM role created for SAML federation to get temporary credentials. This method is helpful:
    • If your application needs to restrict access to AWS resources based on logged in users. You can define attribute mapping and additional attributes as required.
    • If your application uses a service account to manage AWS resource access, regardless of who is logged in.
  • IAM Roles Anywhere – Your on-premises applications will exchange X.509 certificates so that they can assume a role and get temporary credentials. This method is helpful if your application needs access to an AWS resource based on a service account.

In both of these cases, authenticated requests assume an IAM role, get temporary security credentials, and perform certain actions using AWS command line interface (CLI) and AWS SDKs. The IAM role has attached permissions for AWS resources such as Amazon Simple Storage Service (Amazon S3), Amazon DynamoDB, and Amazon Relational Database Service (Amazon RDS).

The temporary credentials expire when the session expires. By default, the session duration is one hour; you can request longer duration and session refresh.

To understand better, let’s consider the use case in Figure 2, where on-premises applications need access to AWS resources.

Access to resources that are created or already migrated to AWS from on-premises

Figure 2. Access to resources that are created or already migrated to AWS from on-premises

Applications can get temporary security credentials through SAML or IAM Roles Anywhere as explained earlier. The next sections explain setting up access to the resources in Figure 2 using temporary credentials.

1. Amazon S3

On-premises applications can access Amazon S3 using the REST API or the AWS SDK to perform certain actions (such as GetObjects or ListObjects):

2. Amazon RDS and Amazon Aurora

AWS Secrets Manager helps you store credentials for Amazon RDS and Amazon Aurora. You can also set up automatic rotation of your database secrets to meet your security and compliance needs. Applications can retrieve secrets using AWS SDKs and AWS CLI.

Additional configuration values can be stored in AWS Systems Manager Parameter Store, which provides secure, hierarchical storage for configuration data management such as passwords, database strings and license codes as parameter values rather than hard coding them in the code.

To access Amazon RDS and Amazon Aurora:

    • You can launch Amazon RDS DB instances into a virtual private cloud (VPC). A client application can access DB instance through the internet or through the private network only using an established connection from on-premises to the AWS environment.
    • On-premises applications can connect to a relational database using a database driver such as Java Database Connectivity (JDBC). The application can retrieve database connection details (such as database URL, port, or credentials) from AWS Secrets Manager and AWS Systems Manager Parameter Store  through API calls and can use them for the database connection.
    • Database admins can access AWS Management Console through an assumed role and can have access to database credentials from AWS Secrets Manager in order to connect directly with the database. For certain administration tasks (such as cluster setup, backup, recovery, maintenance, and management), they will need access to the Amazon RDS management console.
    • Amazon RDS also provides IAM database authentication option for MariaDB, MySQL, and PostgreSQL. You can authenticate without a password when you connect to a DB instance. Instead, you use an authentication token. For more information, go to IAM database authentication.

3. Amazon DynamoDB

Applications can use temporary credentials to invoke certain actions using AWS SDKs for DynamoDB. You can create a VPC endpoint for DynamoDB to access DynamoDB with no exposure to the public internet, then restrict access further by using VPC endpoint and IAM policies.


This blog helps you architect an application security model in AWS to provide on-premises access to commonly used resources in AWS.

You can apply security best practices and Zero Trust concepts as you move your legacy applications to the cloud. With AWS, you can build identity and access management with centralized and fine-grained access controls for your workforce and applications.

Start building your security model on AWS:

Building a healthcare data pipeline on AWS with IBM Cloud Pak for Data

Post Syndicated from Eduardo Monich Fronza original https://aws.amazon.com/blogs/architecture/building-a-healthcare-data-pipeline-on-aws-with-ibm-cloud-pak-for-data/

Healthcare data is being generated at an increased rate with the proliferation of connected medical devices and clinical systems. Some examples of these data are time-sensitive patient information, including results of laboratory tests, pathology reports, X-rays, digital imaging, and medical devices to monitor a patient’s vital signs, such as blood pressure, heart rate, and temperature.

These different types of data can be difficult to work with, but when combined they can be used to build data pipelines and machine learning (ML) models to address various challenges in the healthcare industry, like the prediction of patient outcome, readmission rate, or disease progression.

In this post, we demonstrate how to bring data from different sources, like Snowflake and connected health devices, to form a healthcare data lake on Amazon Web Services (AWS). We also explore how to use this data with IBM Watson to build, train, and deploy ML models. You can learn how to integrate model endpoints with clinical health applications to generate predictions for patient health conditions.

Solution overview

The main parts of the architecture we discuss are (Figure 1):

  1. Using patient data to improve health outcomes
  2. Healthcare data lake formation to store patient health information
  3. Analyzing clinical data to improve medical research
  4. Gaining operational insights from healthcare provider data
  5. Providing data governance to maintain the data privacy
  6. Building, training, and deploying an ML model
  7. Integration with the healthcare system
Data pipeline for the healthcare industry using IBM CP4D on AWS

Figure 1. Data pipeline for the healthcare industry using IBM CP4D on AWS

IBM Cloud Pak for Data (CP4D) is deployed on Red Hat OpenShift Service on AWS (ROSA). It provides the components IBM DataStage, IBM Watson Knowledge Catalogue, IBM Watson Studio, IBM Watson Machine Learning, plus a wide variety of connections with data sources available in a public cloud or on-premises.

Connected health devices, on the edge, use sensors and wireless connectivity to gather patient health data, such as biometrics, and send it to the AWS Cloud through Amazon Kinesis Data Firehose. AWS Lambda transforms the data that is persisted to Amazon Simple Storage Service (Amazon S3), making that information available to healthcare providers.

Amazon Simple Notification Service (Amazon SNS) is used to send notifications whenever there is an issue with the real-time data ingestion from the connected health devices. In case of failures, messages are sent via Amazon SNS topics for rectifying and reprocessing of failure messages.

DataStage performs ETL operations and move patient historical information from Snowflake into Amazon S3. This data, combined with the data from the connected health devices, form a healthcare data lake, which is used in IBM CP4D to build and train ML models.

The pipeline described in architecture uses Watson Knowledge Catalogue, which provides data governance framework and artifacts to enrich our data assets. It protects sensitive patient information from unauthorized access, like individually identifiable information, medical history, test results, or insurance information.

Data protection rules define how to control access to data, mask sensitive values, or filter rows from data assets. The rules are automatically evaluated and enforced each time a user attempts to access a data asset in any governed catalog of the platform.

After this, the datasets are published to Watson Studio projects, where they are used to train ML models. You can develop models using Jupyter Notebook, IBM AutoAI (low-code), or IBM SPSS modeler (no-code).

For the purpose of this use case, we used logistic regression algorithm for classifying and predicting the probability of an event, such as disease risk management to assist doctors in making critical medical decisions. You can also build ML models using algorithms like Classification, Random Forest, and K-Nearest Neighbor. These are widely used to predict disease risk.

Once the models are trained, they are exposed as endpoints with Watson Machine Learning and integrated with the healthcare application to generate predictions by analyzing patient symptoms.

The healthcare applications are a type of clinical software that offer crucial physiological insights and predict the effects of illnesses and possible treatments. It provides built-in dashboards that display patient information together with the patient’s overall metrics for outcomes and treatments. This can help healthcare practitioners gain insights into patient conditions. It also can help medical institutions prioritize patients with more risk factors and curate clinical and behavioral health plans.

Finally, we are using IBM Security QRadar XDR SIEM to collect, process, and aggregate Amazon Virtual Private Cloud (Amazon VPC) flow logs, AWS CloudTrail logs, and IBM CP4D logs. QRadar XDR uses this information to manage security by providing real-time monitoring, alerts, and responses to threats.

Healthcare data lake

A healthcare data lake can help health organizations turn data into insights. It is centralized, curated, and securely stores data on Amazon S3. It also enables you to break down data silos and combine different types of analytics to gain insights. We are using the DataStage, Kinesis Data Firehose, and Amazon S3 services to build the healthcare data lake.

Data governance

Watson Knowledge Catalogue provides an ML catalogue for data discovery, cataloging, quality, and governance. We define policies in Watson Knowledge Catalogue to enable data privacy and overall access to and utilization of this data. This includes sensitive data and personal information that needs to be handled through data protection, quality, and automation rules. To learn more about IBM data governance, please refer to Running a data quality analysis (Watson Knowledge Catalogue).

Build, train, and deploy the ML model

Watson Studio empowers data scientists, developers, and analysts to build, run, and manage AI models on IBM CP4D.

In this solution, we are building models using Watson Studio by:

  1. Promoting the governed data from Watson Knowledge Catalogue to Watson Studio for insights
  2. Using ETL features, such as built-in search, automatic metadata propagation, and simultaneous highlighting, to process and transform large amounts of data
  3. Training the model, including model technique selection and application, hyperparameter setting and adjustment, validation, ensemble model development and testing; algorithm selection; and model optimization
  4. Evaluating the model based on metric evaluation, confusion matrix calculations, KPIs, model performance metrics, model quality measurements for accuracy and precision
  5. Deploying the model on Watson Machine Learning using online deployments, which create an endpoint to generate a score or prediction in real time
  6. Integrating the endpoint with applications like health applications, as demonstrated in Figure 1


In this blog, we demonstrated how to use patient data to improve health outcomes by creating a healthcare data lake and analyzing clinical data. This can help patients and healthcare practitioners make better, faster decisions and prioritize cases. We also discussed how to build an ML model using IBM Watson and integrate it with healthcare applications for health analysis.

Additional resources

Heads-Up: Amazon S3 Security Changes Are Coming in April of 2023

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/heads-up-amazon-s3-security-changes-are-coming-in-april-of-2023/

Starting in April of 2023 we will be making two changes to Amazon Simple Storage Service (Amazon S3) to put our latest best practices for bucket security into effect automatically. The changes will begin to go into effect in April and will be rolled out to all AWS Regions within weeks.

Once the changes are in effect for a target Region, all newly created buckets in the Region will by default have S3 Block Public Access enabled and access control lists (ACLs) disabled. Both of these options are already console defaults and have long been recommended as best practices. The options will become the default for buckets that are created using the S3 API, S3 CLI, the AWS SDKs, or AWS CloudFormation templates.

As a bit of history, S3 buckets and objects have always been private by default. We added Block Public Access in 2018 and the ability to disable ACLs in 2021 in order to give you more control, and have long been recommending the use of AWS Identity and Access Management (IAM) policies as a modern and more flexible alternative.

In light of this change, we recommend a deliberate and thoughtful approach to the creation of new buckets that rely on public buckets or ACLs, and believe that most applications do not need either one. If your application turns out be one that does, then you will need to make the changes that I outline below (be sure to review your code, scripts, AWS CloudFormation templates, and any other automation).

What’s Changing
Let’s take a closer look at the changes that we are making:

S3 Block Public Access – All four of the bucket-level settings described in this post will be enabled for newly created buckets:

A subsequent attempt to set a bucket policy or an access point policy that grants public access will be rejected with a 403 Access Denied error. If you need public access for a new bucket you can create it as usual and then delete the public access block by calling DeletePublicAccessBlock (you will need s3:PutBucketPublicAccessBlock permission in order to call this function; read Block Public Access to learn more about the functions and the permissions).

ACLs Disabled – The Bucket owner enforced setting will be enabled for newly created buckets, making bucket ACLs and object ACLs ineffective, and ensuring that the bucket owner is the object owner no matter who uploads the object. If you want to enable ACLs for a bucket, you can set the ObjectOwnership parameter to ObjectWriter in your CreateBucket request or you can call DeleteBucketOwnershipControls after you create the bucket. You will need s3:PutBucketOwnershipControls permission in order to use the parameter or to call the function; read Controlling Ownership of Objects and Creating a Bucket to learn more.

Stay Tuned
We will publish an initial What’s New post when we start to deploy this change and another one when the deployment has reached all AWS Regions. You can also run your own tests to detect the change in behavior.


Genomics workflows, Part 2: simplify Snakemake launches

Post Syndicated from Rostislav Markov original https://aws.amazon.com/blogs/architecture/genomics-workflows-part-2-simplify-snakemake-launches/

Genomics workflows are high-performance computing workloads. In Part 1 of this series, we demonstrated how life-science research teams can focus on scientific discovery without the associated heavy lifting. We used regenie for large genome-wide association studies. Our design pattern built on AWS Step Functions with AWS Batch and Amazon FSx for Lustre.

In Part 2, we explore genomics workloads with built-in workflow logic. Historically, running bioinformatics data pipelines was a manual and error-prone task. Over the last years, multiple workflow management systems have emerged. An example of these is the Snakemake workflow management system with Tibanna orchestration. We discuss the solution design and how you can fully automate the launch with Amazon Web Services (AWS).

Use case

We focus on the use case of Snakemake, an open-source utility for whole genome sequence mapping in directed acyclic graph (DAG) format. Snakemake uses Snakefiles to declare workflow steps and commands. A Snakefile extends Python syntax to declare workflow steps such as mapping data sets to DAG structure and identifying variants. Consult the Snakemake tutorial for further information on workflow rules.

Snakefiles provide an exception from the general design pattern and an alternative to granular modeling workflow logic in Amazon States Language. In our real-life use case, we used Tibanna to orchestrate Snakemake. Tibanna is an open-source, AWS-native software that runs bioinformatics data pipelines. It supports Snakefile syntax, plus other workflow languages, including Common Workflow Language and Workflow Description Language (WDL).

We recommend using Amazon Genomics CLI, if Tibanna is not needed for your use case, and Amazon Omics, if your workflow definitions are compliant with the supported WDL and Nextflow specifications.

Solution overview

Snakemake is available as Docker image on GitHub. We push the image to Amazon Elastic Container Registry. Tibanna is also available as Docker image on GitHub—it comes with Snakemake. Consult the Tibanna installation guide for more information.

We store Snakefiles on Amazon Simple Storage Service (Amazon S3). We configure S3 Event Notifications on PUT request operations. The event notification triggers an AWS Lambda function. The Lambda function launches an AWS Fargate task, which overrides the task definition command with the appropriate Snakemake start command and arguments.

The launched AWS Fargate task pulls the Snakefiles at launch time for each job and prepares the Snakemake initiation commands. Once the Snakefiles are downloaded on the Fargate task, the Snakemake head initiation command is invoked to begin launching jobs using Tibanna. Tibanna invokes a Step Functions state machine which orchestrates the launch of Snakemake on Amazon Elastic Compute Cloud (Amazon EC2).

Amazon CloudWatch provides a consolidated overview of performance metrics, including elapsed time, failed jobs, and error types. You can keep logs of your failed jobs in CloudWatch Logs (Figure 1). You can set up filters to match specific error types, plus create subscriptions to deliver a real-time stream of your log events to Amazon Kinesis or Lambda for further retry.

Solution architecture for Snakemake with Tibanna on AWS

Figure 1. Solution architecture for Snakemake with Tibanna on AWS

Implementation considerations

Here, we describe some of the implementation considerations.

Creating Snakefiles

The launching point for the initiation depends on a Snakefile. Each Snakefile may contain one or more samples to be launched. The sheet resides in an S3 bucket. This adds flexibility and the ability to purge any sensitive or restrictive information after the job has been processed.

Invoking Tibanna

In order to launch Snakemake DAGs using Tibanna, we will need to set up a new Tibanna Unicorn. A Tibanna Unicorn is an Step Functions state machine and a corresponding Lambda function for provisioning EC2 instances.

The state machine runs the following sequence:

  1. Create EC2 instance
  2. Check EC2 status
  3. Exit

After the Tibanna Unicorn has been created, we can start a Snakemake DAG using the following sample commands inside of the Fargate task.

$ snakemake --tibanna --tibanna-config spot_instance=true --default-remote-prefix=YOUR_S3_BUCKET/BUCKET_PREFIX --retries 3.

The Snakemake command is used with the --tibanna flag to send launch requests to the Step Functions state machine in order to provision EC2 instances and run DAG tasks.

We recommend deploying the solution with AWS Serverless Application Model or the AWS Cloud Development Kit, both of which launch AWS CloudFormation.

Logging and troubleshooting

With this solution, each launch will automatically capture and retain start logs in a centralized location in Amazon CloudWatch Logs for tracing and auditing.

If there are issues during the launch of the Tibanna Step Function state machine, such as Amazon EC2 capacity limits, logs will be available in the S3 bucket that was specified during the Tibanna Unicorn creation process. There will be a file available in the format of <EXECUTION_ID>.log inside of the S3 bucket. This information is easily accessible via the command line interface. Use the following command to display specific log results or error messages.

tibanna log -j <EXECUTION_ID> -T 

Retries and EC2 Spot Instances

We advise to use Amazon EC2 Spot Instances, if possible, for additional cost savings. This option is available in the --tibanna-config arguments with the setting spot_instance=true.

This is optional, and you need to create retry logic in the event a Spot Instance gets reclaimed. You can include --retries=3 in your Tibanna launch command. This would ensure all rules are retried three times. You can also specify the number of retries for individual rules when defining the Snakemake DAG definition; for example:

rule a:
    retries: 3
        "curl https://some.unreliable.server/test.txt > {output}"

If EC2 Spot Instance capacity is hit, you can automatically switch to using EC2 On-Demand Instances instead. Add the behavior_on_capacity_limit argument and set retry_without_spot=true.

Adding services

The presented solution can be adapted to use other compute services supported by Snakemake. These include Amazon Elastic Kubernetes Service and AWS ParallelCluster with Slurm Workload Manager plus Amazon FSx for Lustre volumes attached to the head node and cluster nodes.

To initiate jobs on ParallelCluster, install the AWS Systems Manager agent on the head node. This is the launching point into the cluster and used for submitting jobs to the initiation queue. Systems Manager is a secure way to remotely invoke commands on an EC2 instance without the need for SSH access. You can restrict access to your EC2 instance through IAM policies.


In this blog post, we demonstrated how life-science research teams can simplify the launch of Snakemake using AWS. We used Snakefiles and Tibanna to orchestrate workflow steps. Snakefiles provide an exception from the general design pattern and an alternative to Amazon States Language. File uploads to Amazon S3 served as our launching point for workflow initiations.

Stay tuned for Part 3 of this series, in which we create a job manager that administrates multiple workflows.

Related information