Tag Archives: AWS Step Functions

Serverless ICYMI Q1 2024

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/serverless-icymi-q1-2024/

Welcome to the 25th edition of the AWS Serverless ICYMI (in case you missed it) quarterly recap. Every quarter, we share all the most recent product launches, feature enhancements, blog posts, webinars, live streams, and other interesting things that you might have missed!

In case you missed our last ICYMI, check out what happened last quarter here.

2024 Q1 calendar

Adobe Summit

At the Adobe Summit, the AWS Serverless Developer Advocacy team showcased a solution developed for the NFL using AWS serverless technologies and Adobe Photoshop APIs. The system automates image processing tasks, including background removal and dynamic resizing, by integrating AWS Step Functions, AWS Lambda, Amazon EventBridge, and AI/ML capabilities via Amazon Rekognition. This solution reduced image processing time from weeks to minutes and saved the NFL significant costs. Combining cloud-based serverless architectures with advanced machine learning and API technologies can optimize digital workflows for cost-effective and agile digital asset management.

Adobe Summit ServerlessVideo

ServerlessVideo is a demo application to stream live videos and also perform advanced post-video processing. It uses several AWS services, including Step Functions, Lambda, EventBridge, Amazon ECS, and Amazon Bedrock in a serverless architecture that makes it fast, flexible, and cost-effective. The team used ServerlessVideo to interview attendees about the conference experience and Adobe and partners about how they use Adobe. Learn more about the project and watch videos from Adobe Summit 2024 at video.serverlessland.com.

AWS Lambda

AWS launched support for the latest long-term support release of .NET 8, which includes API enhancements, improved Native Ahead of Time (Native AOT) support, and improved performance.

AWS Lambda .NET 8

Learn how to compare design approaches for building serverless microservices. This post covers the trade-offs to consider with various application architectures. See how you can apply single responsibility, Lambda-lith, and read and write functions.

The AWS Serverless Java Container has been updated. This makes it easier to modernize a legacy Java application written with frameworks such as Spring, Spring Boot, or JAX-RS/Jersey in Lambda with minimal code changes.

AWS Serverless Java Container

Lambda has improved the responsiveness for configuring Event Source Mappings (ESMs) and Amazon EventBridge Pipes with event sources such as self-managed Apache Kafka, Amazon Managed Streaming for Apache Kafka (MSK), Amazon DocumentDB, and Amazon MQ.

Chaos engineering is a popular practice for building confidence in system resilience. However, many existing tools assume the ability to alter infrastructure configurations, and cannot be easily applied to the serverless application paradigm. You can use the AWS Fault Injection Service (FIS) to automate and manage chaos experiments across different Lambda functions to provide a reusable testing method.

Amazon ECS and AWS Fargate

Amazon Elastic Container Service (Amazon ECS) now provides managed instance draining as a built-in feature of Amazon ECS capacity providers. This allows Amazon ECS to safely and automatically drain tasks from Amazon Elastic Compute Cloud (Amazon EC2) instances that are part of an Amazon EC2 Auto Scaling Group associated with an Amazon ECS capacity provider. This simplification allows you to remove custom lifecycle hooks previously used to drain Amazon EC2 instances. You can now perform infrastructure updates such as rolling out a new version of the ECS agent by seamlessly using Auto Scaling Group instance refresh, with Amazon ECS ensuring workloads are not interrupted.

Credentials Fetcher makes it easier to run containers that depend on Windows authentication when using Amazon EC2. Credentials Fetcher now integrates with Amazon ECS, using either the Amazon EC2 launch type, or AWS Fargate serverless compute launch type.

Amazon ECS Service Connect is a networking capability to simplify service discovery, connectivity, and traffic observability for Amazon ECS. You can now more easily integrate certificate management to encrypt service-to-service communication using Transport Layer Security (TLS). You do not need to modify your application code, add additional network infrastructure, or operate service mesh solutions.

Amazon ECS Service Connect

Running distributed machine learning (ML) workloads on Amazon ECS allows ML teams to focus on creating, training and deploying models, rather than spending time managing the container orchestration engine. Amazon ECS provides a great environment to run ML projects as it supports workloads that use NVIDIA GPUs and provides optimized images with pre-installed NVIDIA Kernel drivers and Docker runtime.

See how to build preview environments for Amazon ECS applications with AWS Copilot. AWS Copilot is an open source command line interface that makes it easier to build, release, and operate production ready containerized applications.

Learn techniques for automatic scaling of your Amazon Elastic Container Service (Amazon ECS) container workloads to enhance the end user experience. This post explains how to use AWS Application Auto Scaling, which helps you configure automatic scaling of your Amazon ECS service. You can also use Amazon ECS Service Connect and AWS Distro for OpenTelemetry (ADOT) in Application Auto Scaling.

AWS Step Functions

AWS workloads sometimes require access to data stored in on-premises databases and storage locations. Traditional solutions to establish connectivity to the on-premises resources require inbound rules to firewalls, a VPN tunnel, or public endpoints. Discover how to use the MQTT protocol (AWS IoT Core) with AWS Step Functions to dispatch jobs to on-premises workers to access or retrieve data stored on-premises.

You can use Step Functions to orchestrate many business processes. Many industries are required to provide audit trails for decision and transactional systems. Learn how to build a serverless pipeline to create a reliable, performant, traceable, and durable pipeline for audit processing.

Amazon EventBridge

Amazon EventBridge now supports publishing events to AWS AppSync GraphQL APIs as native targets. The new integration allows you to publish events easily to a wider variety of consumers and simplifies updating clients with near real-time data.

Amazon EventBridge publishing events to AWS AppSync

Discover how to send and receive CloudEvents with EventBridge. CloudEvents is an open-source specification for describing event data in a common way. You can publish CloudEvents directly to EventBridge, filter and route them, and use input transformers and API Destinations to send CloudEvents to downstream AWS services and third-party APIs.
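As a rough illustration of the publishing step, the sketch below wraps a payload in the required CloudEvents 1.0 attributes (`specversion`, `id`, `source`, `type`) and shapes it as one `PutEvents` entry. The helper names, the `com.example.orders` source, and the bus name are placeholders that do not come from the linked post, and carrying the whole CloudEvent inside `Detail` is only one possible mapping.

```python
import json
import uuid
from datetime import datetime, timezone


def build_cloudevent(source: str, event_type: str, data: dict) -> dict:
    """Wrap a payload in a CloudEvents 1.0 structured-mode envelope."""
    return {
        "specversion": "1.0",                              # required
        "id": str(uuid.uuid4()),                           # required
        "source": source,                                  # required
        "type": event_type,                                # required
        "time": datetime.now(timezone.utc).isoformat(),    # optional
        "datacontenttype": "application/json",             # optional
        "data": data,
    }


def to_put_events_entry(cloudevent: dict, bus_name: str) -> dict:
    """Shape a CloudEvent as one entry for the EventBridge PutEvents API."""
    return {
        "EventBusName": bus_name,
        "Source": cloudevent["source"],
        "DetailType": cloudevent["type"],
        "Detail": json.dumps(cloudevent),  # Detail must be a JSON string
    }


# Placeholder values for illustration only.
event = build_cloudevent(
    "com.example.orders", "com.example.order.created", {"orderId": "123"}
)
entry = to_put_events_entry(event, "example-bus")
```

With Boto3, an entry like this could then be sent with `boto3.client("events").put_events(Entries=[entry])`.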

AWS Application Composer

AWS Application Composer lets you create infrastructure as code templates by dragging and dropping cards on a virtual canvas. These represent CloudFormation resources, which you can wire together to create permissions and references. Application Composer has now expanded to the VS Code IDE as part of the AWS Toolkit. This now includes a generative AI partner that helps you write infrastructure as code (IaC) for all 1100+ AWS CloudFormation resources that Application Composer now supports.

AWS AppComposer generate suggestions

Amazon API Gateway

Learn how to consume private Amazon API Gateway APIs using mutual TLS (mTLS). mTLS helps prevent man-in-the-middle attacks and protects against threats such as impersonation attempts, data interception, and tampering.

Serverless at AWS re:Invent

Visit the Serverless Land YouTube channel to find a list of serverless and serverless container sessions from re:Invent 2023. Hear from experts like Chris Munns and Julian Wood in their popular session, Best practices for serverless developers, or Nathan Peck and Jessica Deen in Deploying multi-tenant SaaS applications on Amazon ECS and AWS Fargate.

Serverless blog posts




Serverless container blog posts




Serverless Office Hours




Containers from the Couch




FooBar Serverless




Still looking for more?

The Serverless landing page has more information. The Lambda resources page contains case studies, webinars, whitepapers, customer stories, reference architectures, and even more Getting Started tutorials.

You can also follow the Serverless Developer Advocacy team on Twitter to see the latest news, follow conversations, and interact with the team.

And finally, visit the Serverless Land and Containers on AWS websites for all your serverless and serverless container needs.

Building a Serverless Streaming Pipeline to Deliver Reliable Messaging

Post Syndicated from Chris McPeek original https://aws.amazon.com/blogs/compute/building-a-serverless-streaming-pipeline-to-deliver-reliable-messaging/

This post is written by Jeff Harman, Senior Prototyping Architect, Vaibhav Shah, Senior Solutions Architect and Erik Olsen, Senior Technical Account Manager.

Many industries are required to provide audit trails for decision and transactional systems. AI-assisted decision making requires monitoring the full inputs to the decision system in near real time to prevent fraud and to detect model drift and discrimination. Modern systems often use a much wider array of inputs for decision making, including images, unstructured text, historical values, and other large data elements. These large data elements pose a challenge to traditional audit systems that deal with relatively small text messages in structured formats. This blog shows the use of serverless technology to create a reliable, performant, traceable, and durable streaming pipeline for audit processing.


Consider the following four requirements to develop an architecture for audit record ingestion:

  1. Audit record size: Store and manage large payloads (256 KB to 6 MB in size) that may be heterogeneous, including text, binary data, and references to other storage systems.
  2. Audit traceability: Stored data is fully traceable, and external processes can monitor its progress through subscription-based events.
  3. High performance: The time required for blocking writes to the system is limited to the time it takes to transmit the audit record over the network.
  4. High data durability: Once the system sends a payload receipt, the payload is at very low risk of loss because of system failures.

The following diagram shows an architecture that meets these requirements and models the flow of the audit record through the system.

The primary source of latency is the time it takes for an audit record to be transmitted across the network. Applications sending audit records make an API call to an Amazon API Gateway endpoint. An AWS Lambda function receives the message and an Amazon ElastiCache for Redis cluster provides a low latency initial storage mechanism for the audit record. Once the data is stored in ElastiCache, the AWS Step Functions workflow then orchestrates the communication and persistence functions.

Subscribers receive four Amazon Simple Notification Service (Amazon SNS) notifications covering the arrival of the audit record payload, storage of the payload, storage of the audit record metadata, and completion of the audit record archive. Users can subscribe an Amazon Simple Queue Service (Amazon SQS) queue to the SNS topic and use fan-out mechanisms to achieve high reliability.

  1. The Ingest Message Lambda function sends an initial receipt notification
  2. The Message Archive Handler Lambda function notifies on storage of the audit record from ElastiCache to Amazon Simple Storage Service (Amazon S3)
  3. The Message Metadata Handler Lambda function notifies on storage of the message metadata into Amazon DynamoDB
  4. The Final State Aggregation Lambda function notifies that the audit record has been archived.

A failure in any of the three fundamental processing steps (Ingestion, Data Archive, and Metadata Archive) sends a message to an SQS dead-letter queue (DLQ). The message contains the original request and an explanation of the failure. A failure in the Ingest Message function also invokes the Ingest Message Failure function, which stores the original parameters in the S3 Failed Message Storage bucket for later analysis.

The Step Functions workflow provides orchestration and parallel path execution for the system. The detailed workflow below shows the execution flow and notification actions. The transformer steps convert the internal data structures into the format required for consumers.

Data structures

There are three types of events and messages managed by this system:

  1. Incoming message: This is the message the producer sends to an API Gateway endpoint.
  2. Internal message: This event contains the message metadata allowing subsequent systems to understand the originating message producer context.
  3. Notification message: Messages that allow downstream subscribers to act based on the message.
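The three message types above could be modeled along the following lines. This is only a sketch: the post does not list the actual fields, so every field name, the `audit:` key prefix, and the event values are hypothetical.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass
class IncomingMessage:
    """What the producer posts to the API Gateway endpoint."""
    payload: Any
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class InternalMessage:
    """Metadata passed between workflow steps; the payload stays in the cache."""
    message_id: str
    cache_key: str
    producer_context: Dict[str, str] = field(default_factory=dict)


@dataclass
class NotificationMessage:
    """What downstream subscribers receive on the SNS topic."""
    message_id: str
    event: str  # hypothetical values such as "RECEIVED" or "ARCHIVED"
    detail: Dict[str, str] = field(default_factory=dict)


# Hypothetical example values.
internal = InternalMessage("msg-001", "audit:msg-001", {"producer": "loan-service"})
notice = NotificationMessage(internal.message_id, "RECEIVED")
```

Keeping the payload out of the internal and notification messages is what lets the workflow pass small events around while the large record sits in the cache or in Amazon S3.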

Solution walkthrough

The message producer calls the API Gateway endpoint, which enforces the security requirements defined by the business. In this implementation, API Gateway uses an API key to control access. API Gateway also creates a security header for consumption by the Ingest Message Lambda function. API Gateway can also be configured to enforce message format standards; see Use request validation in API Gateway for more information.

The Ingest Message Lambda function generates a message ID that tracks the message payload throughout its lifecycle. Then it stores the full message in the ElastiCache for Redis cache. The Ingest Message Lambda function generates an internal message with all the elements necessary as described above. Finally, the Lambda function handler code starts the Step Functions workflow with the internal message payload.
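The ingest steps described above can be sketched as follows. This is not the repository's handler: the function name, the `audit:` key scheme, and the internal message fields are assumptions. The cache and the workflow starter are injected so the sketch runs locally; in a deployed function they would typically wrap a Redis client and the Step Functions `StartExecution` call.

```python
import json
import uuid
from typing import Callable, MutableMapping


def ingest_message(
    body: dict,
    security_header: str,
    cache: MutableMapping[str, str],
    start_workflow: Callable[[dict], None],
) -> dict:
    """Store the full record in the cache, then hand metadata to the workflow."""
    message_id = str(uuid.uuid4())        # tracks the payload end to end
    cache_key = f"audit:{message_id}"     # hypothetical key scheme
    cache[cache_key] = json.dumps(body)   # full payload goes to the cache

    internal_message = {
        "messageID": message_id,
        "cacheKey": cache_key,
        "security": security_header,
        "metadata": body.get("metadata", {}),
    }
    start_workflow(internal_message)      # workflow receives metadata only
    return {"statusCode": 200, "body": json.dumps({"messageID": message_id})}


# Local stand-ins so the sketch runs without AWS resources.
fake_cache: dict = {}
started: list = []
response = ingest_message(
    {"metadata": {"producer": "demo"}, "payload": "example"},
    "demo-header",
    fake_cache,
    started.append,
)
```

The blocking part of the call is limited to the cache write and the workflow start, which matches the performance requirement stated earlier.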

If the Ingest Message Lambda function fails for any reason, the Lambda function invokes the Ingestion Failure Handler Lambda function. This Lambda function writes any recoverable incoming message data to an S3 bucket and sends a notification on the Ingest Message dead letter queue.

The Step Functions workflow then runs three processes in parallel.

  • The Step Functions workflow triggers the Message Archive Data Handler Lambda function to persist message data from the ElastiCache cache to an S3 bucket. Once stored, the Lambda function returns the S3 bucket reference and state information. There are two options for removing the internal message from the cache: remove it immediately, before sending the internal message and updating the ElastiCache cache flag, or wait for the ElastiCache lifecycle to remove the stale message. This solution waits for the ElastiCache lifecycle to remove the message.
  • The workflow triggers the Message Metadata Handler Lambda function to write all message metadata and security information to DynamoDB. The Lambda function replies with the DynamoDB reference information.
  • Finally, the Step Functions workflow sends a message to the SNS topic to inform subscribers that the message has arrived and the data persistence processes have started.

After each Lambda function completes its processing, it sends a notification to the SNS notification topic to alert subscribers that the action is complete. When both the Message Metadata and Message Archive Lambda functions are done, the Final Aggregation function makes a final update to the metadata in DynamoDB to include the S3 reference information and to remove the ElastiCache for Redis reference.
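The final aggregation step could look roughly like the sketch below, which merges the outputs of the archive and metadata branches into a single metadata update. The result keys (`s3Reference`, `dynamoKey`, `cacheKey`) and the returned structure are hypothetical; the post does not show the real ones.

```python
from typing import Dict, List


def aggregate_final_state(branch_results: List[Dict]) -> Dict:
    """Merge the archive and metadata branch outputs into one metadata update."""
    merged: Dict = {}
    for result in branch_results:
        merged.update(result)

    # Both branches must have reported before the record counts as archived.
    if "s3Reference" not in merged or "dynamoKey" not in merged:
        raise ValueError("both archive and metadata results are required")

    return {
        "key": merged["dynamoKey"],
        "set": {"s3Reference": merged["s3Reference"], "state": "ARCHIVED"},
        "remove": ["cacheKey"],  # drop the ElastiCache reference
    }


# Hypothetical branch outputs.
update = aggregate_final_state([
    {"s3Reference": "s3://example-bucket/2024/01/31/12/00/msg-001"},
    {"dynamoKey": {"messageID": "msg-001"}},
])
```

Refusing to aggregate unless both branch results are present keeps the metadata from pointing at an archive object that was never written.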

Deploying the solution


Prerequisites:

  1. AWS Serverless Application Model (AWS SAM) is installed (see Getting started with AWS SAM)
  2. AWS User/Credentials with appropriate permissions to run AWS CloudFormation templates in the target AWS account
  3. Python 3.8 – 3.10
  4. The AWS SDK for Python (Boto3) is installed
  5. The requests python library is installed

The source code for this implementation can be found at https://github.com/aws-samples/blog-serverless-reliable-messaging

Installing the Solution:

  1. Clone the git repository to a local directory: git clone https://github.com/aws-samples/blog-serverless-reliable-messaging.git
  2. Change into the directory that was created by the clone operation, usually blog-serverless-reliable-messaging
  3. Execute the command: sam build
  4. Execute the command: sam deploy --guided. You are asked to supply the following parameters:
    1. Stack Name: Name given to this deployment (example: serverless-streaming)
    2. AWS Region: Where to deploy (example: us-east-1)
    3. ElasticacheInstanceClass: Cache instance type to use (example: cache.t3.small)
    4. ElasticReplicaCount: How many replicas should be used with ElastiCache (recommended minimum: 2)
    5. ProjectName: Used for naming resources in account (example: serverless-streaming)
    6. MultiAZ: True/False if multiple Availability Zones should be used (recommend: True)
    7. The default parameters can be selected for the remainder of questions


Once you have deployed the stack, you can test it through the API Gateway endpoint with the API key that is referenced in the deployment output. There are two methods for retrieving the API key: through the AWS console (from the link provided in the ApiKeyConsole output) or through the AWS CLI (from the AWS CLI reference provided in the APIKeyCLI output).

You can test directly in the Lambda service console by invoking the ingest message function.

A test message, test_message.json, is available at the root of the project for testing the Ingest Message function directly:

  1. In the console navigate to the Lambda service
  2. From the list of available functions, select the “<project name> -IngestMessageFunction-xxxxx” function
  3. Under the “Function overview” select the “Test” tab
  4. Enter an event name of your choosing
  5. Copy and paste the contents of test_message.json into the “Event JSON” box
  6. Choose “Save”, then after it has saved, choose “Test”
  7. If successful, you should see something similar to the below in the details:
    {
      "isBase64Encoded": false,
      "statusCode": 200,
      "headers": {
        "Access-Control-Allow-Headers": "Content-Type",
        "Access-Control-Allow-Origin": "*",
        "Access-Control-Allow-Methods": "OPTIONS,POST"
      },
      "body": "{\"messageID\": \"XXXXXXXXXXXXXX\"}"
    }
  8. In the S3 bucket “<project name>-s3messagearchive-xxxxxx”, find the payload of the original JSON with a key based on the date and time of the script execution, e.g.: YEAR/MONTH/DAY/HOUR/MINUTE with a file name of the messageID
  9. In a DynamoDB table named metaDataTable, you should find a record with a messageID equal to the messageID from above that contains all of the metadata related to the payload
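To make the archive key layout concrete, the sketch below builds a key of the form YEAR/MONTH/DAY/HOUR/MINUTE followed by the messageID. The zero-padded fields and the use of UTC are assumptions; the deployed function may format the timestamp differently.

```python
from datetime import datetime, timezone


def build_archive_key(message_id: str, received_at: datetime) -> str:
    """Return YEAR/MONTH/DAY/HOUR/MINUTE/<messageID> for the archive bucket."""
    # Zero-padded date parts are an assumption, not confirmed by the post.
    return received_at.strftime("%Y/%m/%d/%H/%M/") + message_id


# Hypothetical message ID and timestamp.
key = build_archive_key("abc123", datetime(2024, 3, 5, 9, 7, tzinfo=timezone.utc))
print(key)  # 2024/03/05/09/07/abc123
```

A time-based prefix like this makes it straightforward to list everything archived within a given minute or hour when investigating an incident.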

A Python test script is included with the code in the test_client folder:

  1. Replace the <Your API key key here> and the <Your API Gateway URL here (IngestMessageApi)> values with the correct ones for your environment in the test_client.py file
  2. Execute the test script with Python 3.8 or higher with the requests package installed
    Example execution (from main directory of git clone):
    python3 -m pip install -r ./test_client/requirements.txt
    python3 ./test_client/test_client.py
  3. Successful output shows the messageID and the header JSON payload:
    "messageID": "XXXXXXXXXXXXXX"
  4. In the S3 bucket “<project name>-s3messagearchive-xxxxxx”, you should be able to find the payload of the original JSON with a key based on the date and time of the script execution, e.g.: YEAR/MONTH/DAY/HOUR/MINUTE with a file name of the messageID
  5. In a DynamoDB table named metaDataTable, you should find a record with a messageID equal to the messageID from above that contains all of the metadata related to the payload
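For readers who want to see the shape of such a call, the sketch below builds (but does not send) a POST request that passes the API key in the `x-api-key` header, which is the header API Gateway reads API keys from. It is not the repository's test_client.py: it uses the standard library's `urllib` instead of requests so that it is self-contained, and the URL, key, and message body are placeholders.

```python
import json
import urllib.request


def build_ingest_request(
    api_url: str, api_key: str, message: dict
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the ingest endpoint."""
    return urllib.request.Request(
        api_url,
        data=json.dumps(message).encode("utf-8"),
        headers={
            "x-api-key": api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_ingest_request(
    "https://example.execute-api.us-east-1.amazonaws.com/Prod/ingest",  # placeholder
    "EXAMPLE-KEY",                                                      # placeholder
    {"payload": "hello"},
)
# urllib.request.urlopen(request) would send it and return the response.
```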


This blog describes architectural patterns, messaging patterns, and data structures that support a highly reliable messaging system for large messages. The combination of Lambda functions, Step Functions, ElastiCache, DynamoDB, and S3 meets the requirements of modern audit systems to be scalable and reliable. The architecture shared in this blog post is suitable for a highly regulated environment that needs to store and track messages larger than typical logging systems handle: records between 256 KB and 6 MB in size. The architecture serves as a blueprint that can be extended and adapted to fit further serverless use cases.

For serverless learning resources, visit Serverless Land.

Top Architecture Blog Posts of 2023

Post Syndicated from Andrea Courtright original https://aws.amazon.com/blogs/architecture/top-architecture-blog-posts-of-2023/

2023 was a rollercoaster year in tech, and we at the AWS Architecture Blog feel so fortunate to have shared in the excitement. As we move into 2024 and all of the new technologies we could see, we want to take a moment to highlight the brightest stars from 2023.

As always, thanks to our readers and to the many talented and hardworking Solutions Architects and other contributors to our blog.

I give you our 2023 cream of the crop!

#10: Build a serverless retail solution for endless aisle on AWS

In this post, Sandeep and Shashank help retailers and their customers alike in this guided approach to finding inventory that doesn’t live on shelves.

Figure 1. Building endless aisle architecture for order processing

Check it out!

#9: Optimizing data with automated intelligent document processing solutions

Who else dreads wading through large amounts of data in multiple formats? Just me? I didn’t think so. Using Amazon AI/ML and content-reading services, Deependra, Anirudha, Bhajandeep, and Senaka have created a solution that is scalable and cost-effective to help you extract the data you need and store it in a format that works for you.

Figure 2. AI-based intelligent document processing engine

Check it out!

#8: Disaster Recovery Solutions with AWS managed services, Part 3: Multi-Site Active/Passive

Disaster recovery posts are always popular, and this post by Brent and Dhruv is no exception. Their creative approach in part 3 of this series is most helpful for customers who have business-critical workloads with higher availability requirements.

Figure 3. Warm standby with managed services

Check it out!

#7: Simulating Kubernetes-workload AZ failures with AWS Fault Injection Simulator

Continuing with the theme of “when bad things happen,” we have Siva, Elamaran, and Re’s post about preparing for workload failures. If resiliency is a concern (and it really should be), the secret is test, test, TEST.

Figure 4. Architecture flow for Microservices to simulate a realistic failure scenario

Check it out!

#6: Let’s Architect! Designing event-driven architectures

Luca, Laura, Vittorio, and Zamira weren’t content with their four top-10 spots last year – they’re back with some things you definitely need to know about event-driven architectures.

Figure 5. Let’s Architect artwork

Check it out!

#5: Use a reusable ETL framework in your AWS lake house architecture

As your lake house increases in size and complexity, you could find yourself facing maintenance challenges, and Ashutosh and Prantik have a solution: frameworks! The reusable ETL template with AWS Glue templates might just save you a headache or three.

Figure 6. Reusable ETL framework architecture

Check it out!

#4: Invoking asynchronous external APIs with AWS Step Functions

It’s possible that AWS’ menagerie of services doesn’t have everything you need to run your organization. (Possible, but not likely; we have a lot of amazing services.) If you are using third-party APIs, then Jorge, Hossam, and Shirisha’s architecture can help you maintain a secure, reliable, and cost-effective relationship among all involved.

Figure 7. Invoking Asynchronous External APIs architecture

Check it out!

#3: Announcing updates to the AWS Well-Architected Framework

The Well-Architected Framework continues to help AWS customers evaluate their architectures against its six pillars. They are constantly striving for improvement, and Haleh’s diligence in keeping us up to date has not gone unnoticed. Thank you, Haleh!

Figure 8. Well-Architected logo

Check it out!

#2: Let’s Architect! Designing architectures for multi-tenancy

The practically award-winning Let’s Architect! series strikes again! This time, Luca, Laura, Vittorio, and Zamira were joined by Federica to discuss multi-tenancy and why that concept is so crucial for SaaS providers.

Figure 9. Let’s Architect

Check it out!

And finally…

#1: Understand resiliency patterns and trade-offs to architect efficiently in the cloud

Haresh, Lewis, and Bonnie revamped this 2022 post into a masterpiece that completely stole our readers’ hearts and is among the top posts we’ve ever made!

Figure 10. Resilience patterns and trade-offs

Check it out!

Bonus! Three older special mentions

These three posts were published before 2023, but we think they deserve another round of applause because you, our readers, keep coming back to them.

Thanks again to everyone for their contributions during a wild year. We hope you’re looking forward to the rest of 2024 as much as we are!

Automate AWS Clean Rooms querying and dashboard publishing using AWS Step Functions and Amazon QuickSight – Part 2

Post Syndicated from Venkata Kampana original https://aws.amazon.com/blogs/big-data/automate-aws-clean-rooms-querying-and-dashboard-publishing-using-aws-step-functions-and-amazon-quicksight-part-2/

Public health organizations need access to data insights that they can quickly act upon, especially in times of health emergencies, when data needs to be updated multiple times daily. For example, during the COVID-19 pandemic, access to timely data insights was critically important for public health agencies worldwide as they coordinated emergency response efforts. Up-to-date information and analysis empowered organizations to monitor the rapidly changing situation and direct resources accordingly.

This is the second post in this series; we recommend that you read the first post before diving deep into this solution. In our first post, Enable data collaboration among public health agencies with AWS Clean Rooms – Part 1, we showed how public health agencies can create AWS Clean Rooms collaborations, invite other stakeholders to join the collaboration, and run queries on their collective data without either party having to share or copy underlying data with each other. As mentioned in the previous blog, AWS Clean Rooms enables multiple organizations to analyze their data and unlock insights they can act upon, without having to share sensitive, restricted, or proprietary records.

However, public health organization leaders and decision-making officials don’t directly access data collaboration outputs from their Amazon Simple Storage Service (Amazon S3) buckets. Instead, they rely on up-to-date dashboards that help them visualize data insights to make informed decisions quickly.

To ensure these dashboards showcase the most updated insights, the organization builders and data architects need to catalog and update AWS Clean Rooms collaboration outputs on an ongoing basis, which often involves repetitive and manual processes that, if not done well, could delay your organization’s access to the latest data insights.

Manually handling repetitive daily tasks at scale poses risks like delayed insights, miscataloged outputs, or broken dashboards. At a large volume, it would require around-the-clock staffing, straining budgets. This manual approach could expose decision-makers to inaccurate or outdated information.

Automating repetitive workflows, validation checks, and programmatic dashboard refreshes removes human bottlenecks and helps decrease inaccuracies. Automation helps ensure continuous, reliable processes that deliver the most current data insights to leaders without delays, all while streamlining resources.

In this post, we explain an automated workflow using AWS Step Functions and Amazon QuickSight to help organizations access the most current results and analyses, without delays from manual data handling steps. This workflow implementation will empower decision-makers with real-time visibility into the evolving collaborative analysis outputs, ensuring they have up-to-date, relevant insights that they can act upon quickly.

Solution overview

The following reference architecture illustrates some of the foundational components of clean rooms query automation and publishing dashboards using AWS services. We automate running queries using Step Functions with Amazon EventBridge schedules, build an AWS Glue Data Catalog on query outputs, and publish dashboards using QuickSight so they automatically refresh with new data. This allows public health teams to monitor the most recent insights without manual updates.

The architecture consists of the following components, as numbered in the preceding figure:

  1. A scheduled event rule on EventBridge triggers a Step Functions workflow.
  2. The Step Functions workflow initiates the run of a query using the StartProtectedQuery AWS Clean Rooms API. The submitted query runs securely within the AWS Clean Rooms environment, ensuring data privacy and compliance. The results of the query are then stored in a designated S3 bucket, with a unique protected query ID serving as the prefix for the stored data. This unique identifier is generated by AWS Clean Rooms for each query run, maintaining clear segregation of results.
  3. When the AWS Clean Rooms query is successfully complete, the Step Functions workflow calls the AWS Glue API to update the location of the table in the AWS Glue Data Catalog with the Amazon S3 location where the query results were uploaded in Step 2.
  4. Amazon Athena uses the catalog from the Data Catalog to query the information using standard SQL.
  5. QuickSight is used to query, build visualizations, and publish dashboards using the data from the query results.
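The solution implements the query step as Step Functions states. The Python sketch below only mirrors the logic of waiting for a protected query to finish, using the Boto3-style `get_protected_query` call. The polling interval, the check limit, and the set of failure statuses are assumptions to verify against the AWS Clean Rooms API reference, and `StubCleanRooms` is a local stand-in so the example runs without AWS access.

```python
import time
from typing import Callable

# Statuses treated as terminal failures (an assumption to verify).
FAILED_STATES = {"FAILED", "CANCELLED", "TIMED_OUT"}


def wait_for_query(
    client,
    membership_id: str,
    query_id: str,
    interval_seconds: int = 30,
    max_checks: int = 20,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll GetProtectedQuery until the query succeeds or fails."""
    for _ in range(max_checks):
        sleep(interval_seconds)
        query = client.get_protected_query(
            membershipIdentifier=membership_id,
            protectedQueryIdentifier=query_id,
        )["protectedQuery"]
        if query["status"] == "SUCCESS":
            return query
        if query["status"] in FAILED_STATES:
            raise RuntimeError(f"query {query_id} ended with {query['status']}")
    raise TimeoutError(f"query {query_id} not finished after {max_checks} checks")


class StubCleanRooms:
    """Local stand-in that reports SUCCESS on the third status check."""

    def __init__(self) -> None:
        self.checks = 0

    def get_protected_query(self, **_kwargs) -> dict:
        self.checks += 1
        status = "SUCCESS" if self.checks >= 3 else "STARTED"
        return {"protectedQuery": {"status": status}}


stub = StubCleanRooms()
result = wait_for_query(
    stub, "example-membership", "example-query", sleep=lambda _s: None
)
```

In the deployed workflow, a Wait state and a Choice state play the roles of `sleep` and the status checks, so no compute is billed while the query runs.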


For this walkthrough, you need the following:

Launch the CloudFormation stack

In this post, we provide a CloudFormation template to create the following resources:

  • An EventBridge rule that triggers the Step Functions state machine on a schedule
  • An AWS Glue database and a catalog table
  • An Athena workgroup
  • Three S3 buckets:
    • For AWS Clean Rooms to upload the results of query runs
    • For Athena to upload the results for the queries
    • For storing access logs of other buckets
  • A Step Functions workflow designed to run the AWS Clean Rooms query, upload the results to an S3 bucket, and update the table location with the S3 path in the AWS Glue Data Catalog
  • An AWS Key Management Service (AWS KMS) customer-managed key to encrypt the data in S3 buckets
  • AWS Identity and Access Management (IAM) roles and policies with the necessary permissions
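The table location update performed by the workflow can be sketched in Python as follows. One detail worth knowing is that a Glue `get_table` response contains read-only fields that `update_table` does not accept in `TableInput`, so the sketch copies only an allow-listed subset of keys. The table, database, and bucket names are placeholders, and the workflow in this post may build its request differently.

```python
import copy

# A subset of the keys accepted by Glue's TableInput structure.
TABLE_INPUT_KEYS = {
    "Name", "Description", "Owner", "Retention", "StorageDescriptor",
    "PartitionKeys", "TableType", "Parameters",
}


def table_input_with_location(table: dict, s3_location: str) -> dict:
    """Copy a get_table result into a TableInput pointing at a new S3 path."""
    table_input = {
        key: copy.deepcopy(value)
        for key, value in table.items()
        if key in TABLE_INPUT_KEYS
    }
    table_input["StorageDescriptor"]["Location"] = s3_location
    return table_input


# Placeholder table definition shaped like the "Table" part of a get_table response.
existing = {
    "Name": "example_table",
    "DatabaseName": "example_db",   # not part of TableInput, dropped
    "CreateTime": "2024-01-01",     # read-only, dropped
    "StorageDescriptor": {"Location": "s3://example-bucket/old-id/", "Columns": []},
}
new_input = table_input_with_location(existing, "s3://example-bucket/new-query-id/")
```

With Boto3, the result could be applied with `boto3.client("glue").update_table(DatabaseName="example_db", TableInput=new_input)`.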

To create the necessary resources, complete the following steps:

  1. Choose Launch Stack.
  2. Enter cleanrooms-query-automation-blog for Stack name.
  3. Enter the membership ID from the AWS Clean Rooms collaboration you created in Part 1 of this series.
  4. Choose Next.
  5. Choose Next again.
  6. On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources.
  7. Choose Create stack.

After you run the CloudFormation template and create the resources, you can find the following information on the stack Outputs tab on the AWS CloudFormation console:

  • AthenaWorkGroup – The Athena workgroup
  • EventBridgeRule – The EventBridge rule triggering the Step Functions state machine
  • GlueDatabase – The AWS Glue database
  • GlueTable – The AWS Glue table storing metadata for AWS Clean Rooms query results
  • S3Bucket – The S3 bucket where AWS Clean Rooms uploads query results
  • StepFunctionsStateMachine – The Step Functions state machine

Test the solution

The EventBridge rule named cleanrooms_query_execution_Stepfunctions_trigger is scheduled to trigger every 1 hour. When this rule is triggered, it initiates the run of the CleanRoomsBlogStateMachine-XXXXXXX Step Functions state machine. Complete the following steps to test the end-to-end flow of this solution:

  1. On the Step Functions console, navigate to the state machine you created.
  2. On the state machine details page, locate the latest query run.

The details page lists the completed steps:

  • The state machine submits a query to AWS Clean Rooms using the startProtectedQuery API. The output of the API includes the query run ID and its status.
  • The state machine waits for 30 seconds before checking the status of the query run.
  • After 30 seconds, the state machine checks the query status using the getProtectedQuery API. When the status changes to SUCCESS, it proceeds to the next step to retrieve the AWS Glue table metadata information. The output of this step contains the S3 location to which the query run results are uploaded.
  • The state machine retrieves the metadata of the AWS Glue table named patientimmunization, which was created via the CloudFormation stack.
  • The state machine updates the S3 location (the location to which AWS Clean Rooms uploaded the results) in the metadata of the AWS Glue table.
  • After a successful update of the AWS Glue table metadata, the state machine is complete.
  3. On the Athena console, switch the workgroup to CustomWorkgroup.
  4. Run the following query:
SELECT * FROM "cleanrooms_patientdb"."patientimmunization" limit 10;
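The submit, wait, and check loop listed in the state machine details above could be expressed in the Amazon States Language roughly as follows. It is written here as a Python dictionary so it can be serialized with json.dumps. The state names, the UpdateCatalog placeholder, and the CSV result format are assumptions for illustration; the definition shipped in the CloudFormation template may differ.

```python
import json


def build_definition(membership_id: str, query: str, bucket: str) -> dict:
    """Return a state machine definition that polls a protected query."""
    status_path = "$.ProtectedQuery.Status"
    return {
        "StartAt": "StartQuery",
        "States": {
            "StartQuery": {
                "Type": "Task",
                "Resource": "arn:aws:states:::aws-sdk:cleanrooms:startProtectedQuery",
                "Parameters": {
                    "Type": "SQL",
                    "MembershipIdentifier": membership_id,
                    "SqlParameters": {"QueryString": query},
                    "ResultConfiguration": {
                        "OutputConfiguration": {
                            "S3": {"ResultFormat": "CSV", "Bucket": bucket}
                        }
                    },
                },
                "Next": "Wait",
            },
            # Pause for 30 seconds before each status check.
            "Wait": {"Type": "Wait", "Seconds": 30, "Next": "GetQueryStatus"},
            "GetQueryStatus": {
                "Type": "Task",
                "Resource": "arn:aws:states:::aws-sdk:cleanrooms:getProtectedQuery",
                "Parameters": {
                    "MembershipIdentifier": membership_id,
                    "ProtectedQueryIdentifier.$": "$.ProtectedQuery.Id",
                },
                "Next": "CheckStatus",
            },
            "CheckStatus": {
                "Type": "Choice",
                "Choices": [
                    {"Variable": status_path, "StringEquals": "SUCCESS", "Next": "UpdateCatalog"},
                    {"Variable": status_path, "StringEquals": "FAILED", "Next": "QueryFailed"},
                    {"Variable": status_path, "StringEquals": "CANCELLED", "Next": "QueryFailed"},
                    {"Variable": status_path, "StringEquals": "TIMED_OUT", "Next": "QueryFailed"},
                ],
                # Any other status means the query is still running.
                "Default": "Wait",
            },
            # Placeholder for the AWS Glue GetTable and UpdateTable steps.
            "UpdateCatalog": {"Type": "Pass", "End": True},
            "QueryFailed": {"Type": "Fail", "Error": "ProtectedQueryFailed"},
        },
    }


definition = build_definition("example-membership-id", "SELECT 1", "example-bucket")
definition_json = json.dumps(definition, indent=2)
```

Routing every non-terminal status back to the Wait state keeps the loop simple, at the cost of relying on the state machine's overall timeout to stop a query that never finishes.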

Visualize the data with QuickSight

Now that you can query your data in Athena, you can use QuickSight to visualize the results. Let’s start by granting QuickSight access to the S3 bucket where your AWS Clean Rooms query results are stored.

Grant QuickSight access to Athena and your S3 bucket

First, grant QuickSight access to the S3 bucket:

  1. Sign in to the QuickSight console.
  2. Choose your user name, then choose Manage QuickSight.
  3. Choose Security and permissions.
  4. For QuickSight access to AWS services, choose Manage.
  5. For Amazon S3, choose Select S3 buckets, and choose the S3 bucket named cleanrooms-query-execution-results-XX-XXXX-XXXXXXXXXXXX (the X placeholders represent the AWS Region and account number where the solution is deployed).
  6. Choose Save.

Create your datasets and publish visuals

Before you can analyze and visualize the data in QuickSight, you must create datasets for your Athena tables.

  1. On the QuickSight console, choose Datasets in the navigation pane.
  2. Choose New dataset.
  3. Select Athena.
  4. Enter a name for your dataset.
  5. Choose Create data source.
  6. Choose the AWS Glue database cleanrooms_patientdb and select the table patientimmunization.
  7. Select Directly query your data.
  8. Choose Visualize.

  9. On the Analysis tab, choose the visual type of your choice and add visuals.

Clean up

Complete the following steps to clean up your resources when you no longer need this solution:

  1. Manually delete the S3 buckets and the data stored in them.
  2. Delete the CloudFormation stack.
  3. Delete the QuickSight analysis.
  4. Delete the data source.


In this post, we demonstrated how to automate running AWS Clean Rooms queries using an API call from Step Functions. We also showed how to update the query results information on the existing AWS Glue table, query the information using Athena, and create visuals using QuickSight.

The automated workflow solution delivers real-time insights from AWS Clean Rooms collaborations to decision makers through automated checks for new outputs, processing, and Amazon QuickSight dashboard refreshes. This eliminates manual handling tasks, enabling faster data-driven decisions based on the latest analyses. Additionally, automation frees up staff resources to focus on more strategic initiatives rather than repetitive updates.

Contact the public sector team directly to learn more about how to set up this solution, or reach out to your AWS account team to engage on a proof of concept of this solution for your organization.

About AWS Clean Rooms

AWS Clean Rooms helps companies and their partners more easily and securely analyze and collaborate on their collective datasets—without sharing or copying one another’s underlying data. With AWS Clean Rooms, you can create a secure data clean room in minutes, and collaborate with any other company on the AWS Cloud to generate unique insights about advertising campaigns, investment decisions, and research and development.

The AWS Clean Rooms team is continually building new features to help you collaborate. Watch this video to learn more about privacy-enhanced collaboration with AWS Clean Rooms.

Check out more AWS Partners or contact an AWS Representative to know how we can help accelerate your business.

About the Authors

Venkata Kampana is a Senior Solutions Architect in the AWS Health and Human Services team and is based in Sacramento, CA. In that role, he helps public sector customers achieve their mission objectives with well-architected solutions on AWS.

Jim Daniel is the Public Health lead at Amazon Web Services. Previously, he held positions with the United States Department of Health and Human Services for nearly a decade, including Director of Public Health Innovation and Public Health Coordinator. Before his government service, Jim served as the Chief Information Officer for the Massachusetts Department of Public Health.

Invoking on-premises resources interactively using AWS Step Functions and MQTT

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/invoking-on-premises-resources-interactively-using-aws-step-functions-and-mqtt/

This post is written by Alex Paramonov, Sr. Solutions Architect, ISV, and Pieter Prinsloo, Customer Solutions Manager.

Workloads in AWS sometimes require access to data stored in on-premises databases and storage locations. Traditional solutions to establish connectivity to the on-premises resources require inbound rules to firewalls, a VPN tunnel, or public endpoints.

This blog post demonstrates how to use the MQTT protocol (AWS IoT Core) with AWS Step Functions to dispatch jobs to on-premises workers to access or retrieve data stored on-premises. The state machine can communicate with the on-premises workers without opening inbound ports or the need for public endpoints on on-premises resources. Workers can run behind Network Address Translation (NAT) routers while keeping bidirectional connectivity with the AWS Cloud. This provides a more secure and cost-effective way to access data stored on-premises.


By using Step Functions with AWS Lambda and AWS IoT Core, you can access data stored on-premises securely without altering the existing network configuration.

AWS IoT Core lets you connect IoT devices and route messages to AWS services without managing infrastructure. By using a Docker container image running on-premises as a proxy IoT Thing, you can take advantage of AWS IoT Core’s fully managed MQTT message broker for non-IoT use cases.

MQTT subscribers receive information via MQTT topics. An MQTT topic acts as a matching mechanism between publishers and subscribers. Conceptually, an MQTT topic behaves like an ephemeral notification channel. You can create topics at scale with virtually no limit to the number of topics. In SaaS applications, for example, you can create topics per tenant. Learn more about MQTT topic design here.

The following reference architecture uses the AWS Serverless Application Model (AWS SAM) for deployment, Step Functions to orchestrate the workflow, AWS Lambda to send and receive on-premises messages, and AWS IoT Core to provide the MQTT message broker, certificate and policy management, and publish/subscribe topics.

Reference architecture

  1. Start the state machine, either “on demand” or on a schedule.
  2. The state: “Lambda: Invoke Dispatch Job to On-Premises” publishes a message to an MQTT message broker in AWS IoT Core.
  3. The message broker sends the message to the topic corresponding to the worker (tenant) in the on-premises container that runs the job.
  4. The on-premises container receives the message and starts work execution. Authentication is done using client certificates and the attached policy limits the worker access to only the tenant’s topic.
  5. The worker in the on-premises container can access local resources like DBs or storage locations.
  6. The on-premises container sends the results and job status back to another MQTT topic.
  7. The AWS IoT Core rule invokes the “TaskToken Done” Lambda function.
  8. The Lambda function submits the results to Step Functions via the SendTaskSuccess or SendTaskFailure API.

Deploying and testing the sample

Ensure you can manage AWS resources from your terminal and that:

  • Latest versions of AWS CLI and AWS SAM CLI are installed.
  • You have an AWS account. If not, visit this page.
  • Your user has sufficient permissions to manage AWS resources.
  • Git is installed.
  • Python version 3.11 or greater is installed.
  • Docker is installed.

You can access the GitHub repository here and follow these steps to deploy the sample.

The aws-resources directory contains the required AWS resources including the state machine, Lambda functions, topics, and policies. The directory on-prem-worker contains the Docker container image artifacts. Use it to run the on-premises worker locally.

In this example, the worker container adds two numbers, provided as an input in the following format:

{
  "a": 15,
  "b": 42
}

In a real-world scenario, you can substitute this operation with business logic. For example, retrieving data from on-premises databases, generating aggregates, and then submitting the results back to your state machine.
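As a minimal sketch of such a worker job, the following Python function processes one incoming MQTT message and builds the reply. It assumes the message carries the keys Input and TaskToken, mirroring the Lambda payload in the sample's task definition; the reply fields (Status, Output, Error, Cause) and the function name are hypothetical and may not match the repository's worker code.

```python
import json


def handle_job(message: bytes) -> dict:
    """Run the sample job for one MQTT message and build the reply payload."""
    job = json.loads(message)
    # The callback token must be echoed back so the state machine can resume.
    token = job["TaskToken"]
    try:
        numbers = job["Input"]
        total = numbers["a"] + numbers["b"]
    except (KeyError, TypeError) as exc:
        return {
            "TaskToken": token,
            "Status": "FAILED",
            "Error": type(exc).__name__,
            "Cause": str(exc),
        }
    return {"TaskToken": token, "Status": "SUCCESS", "Output": {"sum": total}}


# Example message in the assumed envelope format.
incoming = json.dumps({"Input": {"a": 15, "b": 42}, "TaskToken": "example-token"}).encode()
reply = handle_job(incoming)
```

Replacing the addition with a database query or an aggregation keeps the same shape: read the input, do the work locally, and return the token together with a status.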

Follow these steps to test the sample end-to-end.

Using AWS IoT Core without IoT devices

There are no IoT devices in the example use case. However, the fully managed MQTT message broker in AWS IoT Core lets you route messages to AWS services without managing infrastructure.

AWS IoT Core authenticates clients using X.509 client certificates. You can attach a policy to a client certificate allowing the client to publish and subscribe only to certain topics. This approach does not require IAM credentials inside the worker container on-premises.
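To illustrate what such a policy might look like, the sketch below builds an AWS IoT Core policy document in Python that restricts one worker to its own topics. The topic layout (workers/<tenant>/jobs and workers/<tenant>/results), the use of the tenant ID as the MQTT client ID, and the function name are assumptions made for this example, not the sample's actual configuration.

```python
import json


def tenant_policy(region: str, account_id: str, tenant_id: str) -> dict:
    """Build an IoT policy that limits a worker to its tenant's topics."""
    prefix = f"arn:aws:iot:{region}:{account_id}"
    jobs = f"workers/{tenant_id}/jobs"        # hypothetical topic for incoming jobs
    results = f"workers/{tenant_id}/results"  # hypothetical topic for job results
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Allow connecting only with the tenant's client ID.
            {"Effect": "Allow", "Action": "iot:Connect",
             "Resource": f"{prefix}:client/{tenant_id}"},
            # Subscribing is authorized against topic filters.
            {"Effect": "Allow", "Action": "iot:Subscribe",
             "Resource": f"{prefix}:topicfilter/{jobs}"},
            # Receiving and publishing are authorized against topics.
            {"Effect": "Allow", "Action": "iot:Receive",
             "Resource": f"{prefix}:topic/{jobs}"},
            {"Effect": "Allow", "Action": "iot:Publish",
             "Resource": f"{prefix}:topic/{results}"},
        ],
    }


policy = tenant_policy("eu-west-1", "123456789012", "tenant-a")
policy_json = json.dumps(policy, indent=2)
```

Because the worker can publish only to its results topic and receive only from its jobs topic, a compromised worker certificate exposes a single tenant's channel rather than the whole broker.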

AWS IoT Core’s security, cost efficiency, managed infrastructure, and scalability make it a good fit for many hybrid applications beyond typical IoT use cases.

Dispatching jobs from Step Functions and waiting for a response

When a state machine reaches the state to dispatch the job to an on-premises worker, the execution pauses and waits until the job finishes. Step Functions supports three integration patterns: Request-Response, Sync Run a Job, and Wait for a Callback with Task Token. The sample uses the “Wait for a Callback with Task Token” integration. It allows the state machine to pause and wait for a callback for up to 1 year.

When the on-premises worker completes the job, it publishes a message to the topic in AWS IoT Core. A rule in AWS IoT Core then invokes a Lambda function, which sends the result back to the state machine by calling either the SendTaskSuccess or the SendTaskFailure API in Step Functions.
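A minimal sketch of that callback function is shown below. It assumes the worker's message contains TaskToken, Status, Output, Error, and Cause fields (hypothetical names) and receives the Step Functions client as an argument so the logic can be exercised without AWS access. In a real Lambda function the client would typically be created with boto3.client("stepfunctions").

```python
import json


def forward_result(message: dict, sfn_client) -> str:
    """Relay a worker's result message to Step Functions."""
    token = message["TaskToken"]
    if message.get("Status") == "SUCCESS":
        # The output must be passed as a JSON string.
        sfn_client.send_task_success(
            taskToken=token,
            output=json.dumps(message.get("Output", {})),
        )
        return "success"
    sfn_client.send_task_failure(
        taskToken=token,
        error=str(message.get("Error", "WorkerError")),
        cause=str(message.get("Cause", "The worker reported a failure.")),
    )
    return "failure"
```

Treating every status other than SUCCESS as a failure is a deliberate simplification: it means a malformed message resumes the state machine with an error instead of leaving it waiting.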

You can keep the state machine from waiting on a stalled job by adding HeartbeatSeconds to the task in the Amazon States Language (ASL). Without it, a job that freezes before the SendTaskFailure API is called leaves the task waiting until TimeoutSeconds elapses. With HeartbeatSeconds set, the worker must report that it is still running through the SendTaskHeartbeat API call at least once per interval; otherwise, the task fails. HeartbeatSeconds should be less than the specified TimeoutSeconds.

To create a task in ASL for your state machine that waits for a callback token, use the following code:

      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
      "Parameters": {
        "FunctionName": "${LambdaNotifierToWorkerArn}",
        "Payload": {
          "Input.$": "$",
          "TaskToken.$": "$$.Task.Token"
        }
      }

The .waitForTaskToken suffix indicates that the task must wait for the callback. The state machine generates a unique callback token, accessible via the $$.Task.Token built-in variable, and passes it as an input to the Lambda function defined in FunctionName.

The Lambda function then sends the token to the on-premises worker via an AWS IoT Core topic.
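The dispatch step might look like the following sketch. It assumes the Lambda event has the Input and TaskToken keys defined in the task's Payload, and it takes the AWS IoT data-plane client and the topic name as arguments; the topic name and function name are hypothetical. In a deployed function the client would typically come from boto3.client("iot-data").

```python
import json


def dispatch_job(event: dict, iot_client, topic: str) -> dict:
    """Publish the job input and callback token to the worker's topic."""
    payload = json.dumps(
        {"Input": event["Input"], "TaskToken": event["TaskToken"]}
    ).encode("utf-8")
    # QoS 1 asks the broker to deliver the message at least once.
    iot_client.publish(topic=topic, qos=1, payload=payload)
    return {"published": topic}
```

The function returns right after publishing. The task itself stays paused until the token comes back through SendTaskSuccess or SendTaskFailure, so the Lambda invocation does not need to wait for the worker.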

Lambda is not the only service that supports Wait for Callback integration – see the full list of supported services here.

In addition to dispatching tasks and getting the result back, you can implement progress tracking and shutdown mechanisms. To track progress, the worker sends metrics via a separate topic.

Depending on your current implementation, you have the following options:

  1. Store progress data from the worker in Amazon DynamoDB and visualize it via REST API calls to a Lambda function, which reads from the DynamoDB table. Refer to this tutorial on how to store data in DynamoDB directly from the topic.
  2. For a reactive user experience, create a rule to invoke a Lambda function when new progress data arrives. Open a WebSocket connection to your backend. The Lambda function sends progress data via WebSocket directly to the frontend.

To implement a shutdown mechanism, you can run jobs in separate threads on your worker and subscribe to the topic to which your state machine publishes shutdown messages. If a shutdown message arrives, end the job thread on the worker and send back the status, including the callback token of the task.
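One way to sketch this on the worker is with a cooperative stop flag, because Python threads cannot be stopped forcibly from outside. The class name, the status fields, and the sleep that stands in for real work are all assumptions for illustration. The MQTT subscription that would call request_shutdown is omitted.

```python
import threading
import time


class CancellableJob:
    """Run a job in its own thread and stop it when a shutdown is requested."""

    def __init__(self, task_token: str, steps: int):
        self.task_token = task_token
        self.status = None
        self._steps = steps
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def request_shutdown(self) -> None:
        # Called by the handler for the shutdown topic.
        self._stop.set()

    def join(self, timeout: float = None) -> None:
        self._thread.join(timeout)

    def _run(self) -> None:
        for _ in range(self._steps):
            # Check the flag between units of work.
            if self._stop.is_set():
                self.status = {"TaskToken": self.task_token, "Status": "CANCELLED"}
                return
            time.sleep(0.01)  # stand-in for one unit of real work
        self.status = {"TaskToken": self.task_token, "Status": "SUCCESS"}
```

After the thread ends, the worker would publish the status, including the callback token, to the results topic so that the state machine can resume. The job should check the flag often enough that a shutdown takes effect within an acceptable delay.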

Using AWS IoT Core Rules and Lambda Functions

A message with job results from the worker does not reach the Step Functions API directly. Instead, an AWS IoT Core Rule and a dedicated Lambda function forward the status message to Step Functions. This allows for more granular permissions in AWS IoT Core policies, which results in improved security because the worker container can only publish and subscribe to specific topics. No IAM credentials exist on-premises.

The Lambda function’s execution role contains the permissions for SendTaskSuccess, SendTaskHeartbeat, and SendTaskFailure API calls only.

Alternatively, a worker can run API calls in Step Functions workflows directly, which replaces the need for a topic in AWS IoT Core, a rule, and a Lambda function to invoke the Step Functions API. This approach requires IAM credentials inside the worker’s container. You can use AWS Identity and Access Management Roles Anywhere to obtain temporary security credentials. As your worker’s functionality evolves over time, you can add further AWS API calls while adding permissions to the IAM execution role.

Cleaning up

The services used in this solution are eligible for AWS Free Tier. To clean up the resources, run the following command in the aws-resources/ directory of the repository:

sam delete

This removes all resources provisioned by the template.yml file.

To remove the client certificate from AWS, navigate to AWS IoT Core Certificates and delete the certificate that you added during the manual deployment steps.

Then, stop the Docker container on-premises and remove it:

docker rm --force mqtt-local-client

Finally, remove the container image:

docker rmi mqtt-client-waitfortoken


Accessing on-premises resources with workers controlled via Step Functions using MQTT and AWS IoT Core is a secure, reactive, and cost-effective way to run on-premises jobs. Consider updating your hybrid workloads from using inefficient polling or schedulers to the reactive approach described in this post. This offers an improved user experience with fast dispatching and tracking of jobs outside of the cloud.

For more serverless learning resources, visit Serverless Land.

AWS Weekly Roundup — Amazon API Gateway, AWS Step Functions, Amazon ECS, Amazon EKS, Amazon LightSail, Amazon VPC, and more — January 29, 2024

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-amazon-api-gateway-aws-step-functions-amazon-ecs-amazon-eks-amazon-lightsail-amazon-vpc-and-more-january-29-2024/

This past week, our service teams continued to innovate on your behalf, and a lot has happened in the Amazon Web Services (AWS) universe. I’ll also share the AWS Community events and initiatives that are happening around the world.

Let’s dive in!

Last week’s launches
Here are some launches that got my attention:

AWS Step Functions adds integration for 33 services including Amazon Q – AWS Step Functions is a visual workflow service capable of orchestrating over 11,000 API actions from over 220 AWS services to help customers build distributed applications at scale. This week, AWS Step Functions expands its AWS SDK integrations with support for 33 additional AWS services, including Amazon Q, AWS B2B Data Interchange, and Amazon CloudFront KeyValueStore.

Amazon Elastic Container Service (Amazon ECS) Service Connect introduces support for automatic traffic encryption with TLS Certificates – Amazon ECS launches support for automatic traffic encryption with Transport Layer Security (TLS) certificates for its networking capability called ECS Service Connect. With this support, ECS Service Connect allows your applications to establish a secure connection by encrypting your network traffic.

Amazon Elastic Kubernetes Service (Amazon EKS) and Amazon EKS Distro support Kubernetes version 1.29 – Kubernetes version 1.29 introduced several new features and bug fixes. You can create new EKS clusters using v1.29 and upgrade your existing clusters to v1.29 using the Amazon EKS console, the eksctl command line interface, or through an infrastructure-as-code (IaC) tool.

IPv6 instance bundles on Amazon Lightsail – With these new instance bundles, you can get up and running quickly on IPv6-only without the need for a public IPv4 address with the ease of use and simplicity of Amazon Lightsail. If you have existing Lightsail instances with a public IPv4 address, you can migrate your instances to IPv6-only in a few simple steps.

Amazon Virtual Private Cloud (Amazon VPC) supports idempotency for route table and network ACL creation – Idempotent creation of route tables and network ACLs is intended for customers that use network orchestration systems or automation scripts that create route tables and network ACLs as part of a workflow. It allows you to safely retry creation without additional side effects.

Amazon Interactive Video Service (Amazon IVS) announces audio-only pricing for Low-Latency Streaming – Amazon IVS is a managed live streaming solution that is designed to make low-latency or real-time video available to viewers around the world. It now offers audio-only pricing for its Low-Latency Streaming capability at 1/10th of the existing HD video rate.

Sellers can resell third-party professional services in AWS Marketplace – AWS Marketplace sellers, including independent software vendors (ISVs), consulting partners, and channel partners, can now resell third-party professional services in AWS Marketplace. Services can include implementation, assessments, managed services, training, or premium support.

Introducing the AWS Small and Medium Business (SMB) Competency – This is the first go-to-market AWS Specialization designed for partners who deliver to small and medium-sized customers. The SMB Competency provides enhanced benefits for AWS Partners to invest and focus on SMB customer business, such as becoming the go-to standard for participation in new pilots and sales initiatives and receiving unique access to scale demand generation engines.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

X in Y – We launched existing services and instance types in additional Regions:

Other AWS news
Here are some additional projects, programs, and news items that you might find interesting:

Export a Software Bill of Materials using Amazon Inspector – Generating an SBOM gives you critical security information that offers you visibility into specifics about your software supply chain, including the packages you use the most frequently and the related vulnerabilities that might affect your whole company. My colleague Varun Sharma in South Africa shows how to export a consolidated SBOM for the resources monitored by Amazon Inspector across your organization in industry standard formats, including CycloneDX and SPDX. He also shares insights and approaches for analyzing SBOM artifacts using Amazon Athena.

AWS open source news and updates – My colleague Ricardo writes this weekly open source newsletter in which he highlights new open source projects, tools, and demos from the AWS Community.

Upcoming AWS events
Check your calendars and sign up for these AWS events:

AWS Innovate: AI/ML and Data Edition – Register now for the Asia Pacific & Japan AWS Innovate online conference on February 22, 2024, to explore, discover, and learn how to innovate with artificial intelligence (AI) and machine learning (ML). Choose from over 50 sessions in three languages and get hands-on with technical demos aimed at generative AI builders.

AWS Summit Paris – The AWS Summit Paris is an annual event that is held in Paris, France. It is a great opportunity for cloud computing professionals from all over the world to learn about the latest AWS technologies, network with other professionals, and collaborate on projects. The Summit is free to attend and features keynote presentations, breakout sessions, and hands-on labs. Registrations are open!

AWS Community re:Invent re:Caps – Join a Community re:Cap event organized by volunteers from AWS User Groups and AWS Cloud Clubs around the world to learn about the latest announcements from AWS re:Invent.

You can browse all upcoming in-person and virtual events.

That’s all for this week. Check back next Monday for another Weekly Roundup!

— seb

This post is part of our Weekly Roundup series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Disaster recovery strategies for Amazon MWAA – Part 1

Post Syndicated from Parnab Basak original https://aws.amazon.com/blogs/big-data/disaster-recovery-strategies-for-amazon-mwaa-part-1/

In the dynamic world of cloud computing, ensuring the resilience and availability of critical applications is paramount. Disaster recovery (DR) is the process by which an organization anticipates and addresses technology-related disasters. For organizations implementing critical workload orchestration using Amazon Managed Workflows for Apache Airflow (Amazon MWAA), it is crucial to have a DR plan in place to ensure business continuity.

In this series, we explore the need for Amazon MWAA disaster recovery and prescribe solutions that will sustain Amazon MWAA environments against unintended disruptions. This lets you define, avoid, and handle disruption risks as part of your business continuity plan. This post focuses on designing the overall DR architecture. A future post in this series will focus on implementing the individual components using AWS services.

The need for Amazon MWAA disaster recovery

Amazon MWAA, a fully managed service for Apache Airflow, brings immense value to organizations by automating workflow orchestration for extract, transform, and load (ETL), DevOps, and machine learning (ML) workloads. Amazon MWAA has a distributed architecture with multiple components such as scheduler, worker, web server, queue, and database. This makes it difficult to implement a comprehensive DR strategy.

An active Amazon MWAA environment continuously parses Airflow Directed Acyclic Graphs (DAGs), reading them from a configured Amazon Simple Storage Service (Amazon S3) bucket. DAG source unavailability due to network unreachability, unintended corruption, or deletes leads to extended downtime and service disruption.

Within Airflow, the metadata database is a core component storing configuration variables, roles, permissions, and DAG run histories. A healthy metadata database is therefore critical for your Airflow environment. As with any core Airflow component, having a backup and disaster recovery plan in place for the metadata database is essential.

Amazon MWAA deploys Airflow components to multiple Availability Zones within your VPC in your preferred AWS Region. This provides fault tolerance and automatic recovery against a single Availability Zone failure. For mission-critical workloads, resilience to the impairment of a single Region through multi-Region deployments is additionally important to ensure high availability and business continuity.

Balancing between costs to maintain redundant infrastructures, complexity, and recovery time is essential for Amazon MWAA environments. Organizations aim for cost-effective solutions that minimize their Recovery Time Objective (RTO) and Recovery Point Objective (RPO) to meet their service level agreements, be economically viable, and meet their customers’ demands.

Detect disasters in the primary environment: Proactive monitoring through metrics and alarms

Prompt detection of disasters in the primary environment is crucial for timely disaster recovery. Monitoring the Amazon CloudWatch SchedulerHeartbeat metric provides insights into the Airflow health of an active Amazon MWAA environment. You can add other health check metrics to the evaluation criteria, such as checking the availability of upstream or downstream systems and network reachability. Combined with CloudWatch alarms, these metrics let you send notifications when thresholds are not met over a number of time periods. You can add alarms to dashboards to monitor and receive alerts about your AWS resources and applications across multiple Regions.

AWS publishes our most up-to-the-minute information on service availability on the Service Health Dashboard. You can check at any time to get current status information, or subscribe to an RSS feed to be notified of interruptions to each individual service in your operating Region. The AWS Health Dashboard provides information about AWS Health events that can affect your account.

By combining metric monitoring, available dashboards, and automatic alarming, you can promptly detect unavailability of your primary environment, enabling proactive measures to transition to your DR plan. It is critical to factor in incident detection, notification, escalation, discovery, and declaration into your DR planning and implementation to provide realistic and achievable objectives that provide business value.

In the following sections, we discuss two Amazon MWAA DR strategy solutions and their architecture.

DR strategy solution 1: Backup and restore

The backup and restore strategy involves generating Airflow component backups in the same or different Region as your primary Amazon MWAA environment. To ensure continuity, you can asynchronously replicate these to your DR Region, with minimal performance impact on your primary Amazon MWAA environment. In the event of a rare primary Regional impairment or service disruption, this strategy will create a new Amazon MWAA environment and recover historical data to it from existing backups. However, it’s important to note that during the recovery process, there will be a period where no Airflow environments are operational to process workflows until the new environment is fully provisioned and marked as available.

This strategy provides a low-cost and low-complexity solution that is also suitable for mitigating data loss or corruption within your primary Region. The amount of data being backed up and the time to create a new Amazon MWAA environment (typically 20–30 minutes) affect how quickly restoration can happen. To enable infrastructure to be redeployed quickly without errors, deploy using infrastructure as code (IaC). Without IaC, it may be complex to restore an analogous DR environment, which will lead to increased recovery times and possibly exceed your RTO.

Let’s explore the setup required when your primary Amazon MWAA environment is actively running, as shown in the following figure.

Backup and Restore - Pre

The solution comprises three key components. The first component is the primary environment, where the Airflow workflows are initially deployed and actively running. The second component is the disaster monitoring component, consisting of CloudWatch and a combination of an AWS Step Functions state machine and an AWS Lambda function. The third component is for creating and storing backups of all configurations and metadata required to restore. This can be in the same Region as your primary or replicated to your DR Region using S3 Cross-Region Replication (CRR). For CRR, you also pay for inter-Region data transfer out from Amazon S3 to each destination Region.

The first three steps in the workflow are as follows:

  1. As part of your backup creation process, Airflow metadata is replicated to an S3 bucket using an export DAG utility, run periodically based on your RPO interval.
  2. Your existing primary Amazon MWAA environment automatically emits the status of its scheduler’s health to the CloudWatch SchedulerHeartbeat metric.
  3. A multi-step Step Functions state machine is triggered from a periodic Amazon EventBridge schedule to monitor the scheduler’s health status. As the first step of the state machine, a Lambda function evaluates the status of the SchedulerHeartbeat metric. If the metric is deemed healthy, no action is taken.
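The health evaluation in the last step could be sketched as a small Python function. It assumes the Lambda function has already fetched Sum datapoints for SchedulerHeartbeat from CloudWatch (for example with get_metric_statistics) and treats a window with no heartbeats as unhealthy. The window length, the threshold, and the function name are assumptions, not values prescribed by the solution.

```python
from datetime import datetime, timedelta, timezone


def scheduler_is_healthy(datapoints, window_end, period_seconds=60,
                         periods=5, min_heartbeats=1):
    """Return True if enough heartbeats were recorded in the look-back window.

    datapoints is a list of dicts with "Timestamp" (timezone-aware datetime)
    and "Sum" keys. Missing datapoints count as zero heartbeats, so an
    environment that stopped emitting the metric is reported as unhealthy.
    """
    window_start = window_end - timedelta(seconds=period_seconds * periods)
    total = sum(
        point["Sum"]
        for point in datapoints
        if window_start <= point["Timestamp"] < window_end
    )
    return total >= min_heartbeats


# Hypothetical datapoints: one recent heartbeat sum and one stale entry.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
recent = [{"Timestamp": now - timedelta(minutes=2), "Sum": 4.0}]
stale = [{"Timestamp": now - timedelta(hours=1), "Sum": 4.0}]
```

Counting absent datapoints as zero matters here, because a failed scheduler usually shows up as missing data rather than as explicit zero values.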

The following figure illustrates the additional steps in the solution workflow.

Backup and Restore post

  4. When the heartbeat count deviates from the normal count for a period of time, a series of actions are initiated to recover to a new Amazon MWAA environment in the DR Region. These actions include starting creation of a new Amazon MWAA environment, replicating the primary environment configurations, and then waiting for the new environment to become available.
  5. When the environment is available, an import DAG utility is run to restore the metadata contents from the backups. Any DAG runs that were interrupted during the impairment of the primary environment need to be manually rerun to maintain service level agreements. Future DAG runs are queued to run as per their next configured schedule.
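The "wait for the new environment to become available" part of the recovery can be sketched as a polling loop. In the architecture described here, that loop would more naturally live in the Step Functions state machine as Wait and Choice states; the Python below only illustrates the same logic. It takes an Amazon MWAA client (for example from boto3.client("mwaa")) as an argument, and the function name, poll interval, and retry count are assumptions.

```python
import time


def wait_until_available(mwaa_client, name, poll_seconds=60, max_polls=60,
                         sleep=time.sleep):
    """Poll an Amazon MWAA environment until it reports AVAILABLE.

    Returns True when the environment is available and False if max_polls
    is exhausted. Raises RuntimeError if environment creation fails.
    """
    for _ in range(max_polls):
        response = mwaa_client.get_environment(Name=name)
        status = response["Environment"]["Status"]
        if status == "AVAILABLE":
            return True
        if status == "CREATE_FAILED":
            raise RuntimeError(f"Environment {name} failed to create")
        sleep(poll_seconds)
    return False
```

Only after this returns True would the import DAG utility be run to restore the metadata. Passing sleep as a parameter makes the loop easy to test without real delays.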

DR strategy solution 2: Active-passive environments with periodic data synchronization

The active-passive environments with periodic data synchronization strategy focuses on maintaining recurrent data synchronization between an active primary and a passive Amazon MWAA DR environment. By periodically updating and synchronizing DAG stores and metadata databases, this strategy ensures that the DR environment remains current or nearly current with the primary. The DR Region can be the same or a different Region than your primary Amazon MWAA environment. In the event of a disaster, backups are available to revert to a previous known good state to minimize data loss or corruption.

This strategy provides low RTO and RPO with frequent synchronization, allowing quick recovery with minimal data loss. The infrastructure costs and code deployments are compounded to maintain both the primary and DR Amazon MWAA environments. Your DR environment is available immediately to run DAGs on.

The following figure illustrates the setup required when your primary Amazon MWAA environment is actively running.

Active Passive pre

The solution comprises four key components. Similar to the backup and restore solution, the first component is the primary environment, where the workflow is initially deployed and is actively running. The second component is the disaster monitoring component, consisting of CloudWatch and a combination of a Step Functions state machine and Lambda function. The third component creates and stores backups for all configurations and metadata required for the database synchronization. This can be in the same Region as your primary or replicated to your DR Region using Amazon S3 Cross-Region Replication. As mentioned earlier, for CRR, you also pay for inter-Region data transfer out from Amazon S3 to each destination Region. The last component is a passive Amazon MWAA environment that has the same Airflow code and environment configurations as the primary. The DAGs are deployed in the DR environment using the same continuous integration and continuous delivery (CI/CD) pipeline as the primary. Unlike the primary, DAGs are kept in a paused state to not cause duplicate runs.

The first steps of the workflow are similar to the backup and restore strategy:

  1. As part of your backup creation process, Airflow metadata is replicated to an S3 bucket using an export DAG utility, run periodically based on your RPO interval.
  2. Your existing primary Amazon MWAA environment automatically emits the status of its scheduler’s health to the CloudWatch SchedulerHeartbeat metric.
  3. A multi-step Step Functions state machine is triggered from a periodic Amazon EventBridge schedule to monitor scheduler health status. As the primary step of the state machine, a Lambda function evaluates the status of the SchedulerHeartbeat metric. If the metric is deemed healthy, no action is taken.

The following figure illustrates the final steps of the workflow.

Active Passive post

  1. When the heartbeat count deviates from the normal count for a period of time, DR actions are initiated.
  2. As a first step, a Lambda function triggers an import DAG utility to restore the metadata contents from the backups to the passive Amazon MWAA DR environment. When the imports are complete, the same DAG can un-pause the other Airflow DAGs, making them active for future runs. Any DAG runs that were interrupted during the impairment of the primary environment need to be manually rerun to maintain service level agreements. Future DAG runs are queued to run as per their next configured schedule.

Best practices to improve resiliency of Amazon MWAA

To enhance the resiliency of your Amazon MWAA environment and ensure smooth disaster recovery, consider implementing the following best practices:

  • Robust backup and restore mechanisms – Implementing comprehensive backup and restore mechanisms for Amazon MWAA data is essential. Regularly deleting existing metadata based on your organization’s retention policies reduces backup times and makes your Amazon MWAA environment more performant.
  • Automation using IaC – Using automation and orchestration tools such as AWS CloudFormation, the AWS Cloud Development Kit (AWS CDK), or Terraform can streamline the deployment and configuration management of Amazon MWAA environments. This ensures consistency, reproducibility, and faster recovery during DR scenarios.
  • Idempotent DAGs and tasks – In Airflow, a DAG is considered idempotent if rerunning the same DAG with the same inputs multiple times has the same effect as running it only once. Designing idempotent DAGs and keeping tasks atomic decreases recovery time from failures when you have to manually rerun an interrupted DAG in your recovered environment.
  • Regular testing and validation – A robust Amazon MWAA DR strategy should include regular testing and validation exercises. By simulating disaster scenarios, you can identify any gaps in your DR plans, fine-tune processes, and ensure your Amazon MWAA environments are fully recoverable.
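To make the idempotency practice concrete, the following sketch shows one common approach: deriving the output location from the run's logical date so that a rerun overwrites its own earlier output instead of appending to it. The function names, the key layout, and the in-memory dictionary standing in for an S3 bucket are all hypothetical.

```python
from datetime import date


def output_key(dag_id: str, logical_date: date) -> str:
    """Build a deterministic key from the run's logical date, not the wall clock."""
    return f"exports/{dag_id}/{logical_date.isoformat()}/data.json"


def export_partition(store: dict, dag_id: str, logical_date: date, rows: list) -> str:
    """Overwrite the partition for this logical date; reruns replace, never append.

    `store` is a plain dict used here as a stand-in for an object store.
    """
    key = output_key(dag_id, logical_date)
    store[key] = list(rows)
    return key
```

Because the key depends only on the DAG and its logical date, manually rerunning an interrupted DAG run in the recovered environment leaves the same single object behind as a run that succeeded the first time.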


In this post, we explored the challenges for Amazon MWAA disaster recovery and discussed best practices to improve resiliency. We examined two DR strategy solutions: backup and restore and active-passive environments with periodic data synchronization. By implementing these solutions and following best practices, you can protect your Amazon MWAA environments, minimize downtime, and mitigate the impact of disasters. Regular testing, validation, and adaptation to evolving requirements are crucial for an effective Amazon MWAA DR strategy. By continuously evaluating and refining your disaster recovery plans, you can ensure the resilience and uninterrupted operation of your Amazon MWAA environments, even in the face of unforeseen events.

For additional details and code examples on Amazon MWAA, refer to the Amazon MWAA User Guide and the Amazon MWAA examples GitHub repo.

About the Authors

Parnab Basak is a Senior Solutions Architect and a Serverless Specialist at AWS. He specializes in creating new solutions that are cloud native using modern software development practices like serverless, DevOps, and analytics. Parnab works closely in the analytics and integration services space helping customers adopt AWS services for their workflow orchestration needs.

Chandan Rupakheti is a Solutions Architect and a Serverless Specialist at AWS. He is a passionate technical leader, researcher, and mentor with a knack for building innovative solutions in the cloud and bringing stakeholders together in their cloud journey. Outside his professional life, he loves spending time with his family and friends, as well as listening to and playing music.

Vinod Jayendra is an Enterprise Support Lead in ISV accounts at Amazon Web Services, where he helps customers solve their architectural, operational, and cost optimization challenges. With a particular focus on serverless technologies, he draws from his extensive background in application development to deliver top-tier solutions. Beyond work, he finds joy in quality family time, embarking on biking adventures, and coaching youth sports teams.

Rupesh Tiwari is a Senior Solutions Architect at AWS in New York City, with a focus on Financial Services. He has over 18 years of IT experience in the finance, insurance, and education domains, and specializes in architecting large-scale applications and cloud-native big data workloads. In his spare time, Rupesh enjoys singing karaoke, watching comedy TV series, and creating joyful moments with his family.

Enable metric-based and scheduled scaling for Amazon Managed Service for Apache Flink

Post Syndicated from Francisco Morillo original https://aws.amazon.com/blogs/big-data/enable-metric-based-and-scheduled-scaling-for-amazon-managed-service-for-apache-flink/

Thousands of developers use Apache Flink to build streaming applications to transform and analyze data in real time. Apache Flink is an open source framework and engine for processing data streams. It’s highly available and scalable, delivering high throughput and low latency for the most demanding stream-processing applications. Monitoring and scaling your applications is critical to keep your applications running successfully in a production environment.

Amazon Managed Service for Apache Flink is a fully managed service that reduces the complexity of building and managing Apache Flink applications. Amazon Managed Service for Apache Flink manages the underlying Apache Flink components that provide durable application state, metrics, logs, and more.

In this post, we show a simplified way to automatically scale up and down the number of KPUs (Kinesis Processing Units; 1 KPU is 1 vCPU and 4 GB of memory) of your Apache Flink applications with Amazon Managed Service for Apache Flink. We show you how to scale by using metrics such as CPU, memory, backpressure, or any custom metric of your choice. Additionally, we show how to perform scheduled scaling, allowing you to adjust your application’s capacity at specific times, particularly when dealing with predictable workloads. We also share an AWS CloudFormation utility to help you implement auto scaling quickly with your Amazon Managed Service for Apache Flink applications.

Metric-based scaling

This section describes how to implement a scaling solution for Amazon Managed Service for Apache Flink based on Amazon CloudWatch metrics. Amazon Managed Service for Apache Flink comes with an auto scaling option out of the box that scales out when container CPU utilization is above 75% for 15 minutes. This works well for many use cases; however, for some applications, you may need to scale based on a different metric, or trigger the scaling action at a certain point in time or by a different factor. By deploying this solution, you can customize your scaling policies and save costs by right-sizing your Amazon Managed Service for Apache Flink applications.

To perform metric-based scaling, we use CloudWatch alarms, Amazon EventBridge, AWS Step Functions, and AWS Lambda. You can choose from metrics coming from the source such as Amazon Kinesis Data Streams or Amazon Managed Streaming for Apache Kafka (Amazon MSK), or metrics from the Amazon Managed Service for Apache Flink application. You can find these components in the CloudFormation template in the GitHub repo.

The following diagram shows how to scale an Amazon Managed Service for Apache Flink application in response to a CloudWatch alarm.

This solution uses the metric selected and creates two CloudWatch alarms that, depending on the threshold you use, trigger a rule in EventBridge to start running a Step Functions state machine. The following diagram illustrates the state machine workflow.

Note: Amazon Kinesis Data Analytics was renamed to Amazon Managed Service for Apache Flink in August 2023.

The Step Functions workflow consists of the following steps:

  1. The state machine describes the Amazon Managed Service for Apache Flink application, which provides the current number of KPUs in the application, as well as whether the application is being updated or is running.
  2. The state machine invokes a Lambda function that, depending on which alarm was triggered, scales the application up or down, following the parameters set in the CloudFormation template. When scaling the application, it uses the increase factor (either add/subtract or multiply/divide based on that factor) defined in the CloudFormation template. You can have different factors for scaling in or out. If you want to take a more cautious approach to scaling, you can use add/subtract with an increase factor of 1 for scaling in and out.
  3. If the application has reached the maximum or minimum number of KPUs set in the parameters of the CloudFormation template, the workflow stops. Keep in mind that Amazon Managed Service for Apache Flink applications have a default maximum of 64 KPUs (you can request to increase this limit). Do not specify a maximum value above 64 KPUs if you have not requested to increase the quota, because the scaling solution will get stuck by failing to update.
  4. If the workflow continues, because the allocated KPUs haven’t reached the maximum or minimum values, the workflow will wait for a period of time you specify, and then describe the application and see if it has finished updating.
  5. The workflow will continue to wait until the application has finished updating. When the application is updated, the workflow will wait for a period of time you specify in the CloudFormation template, to allow the metric to fall within the threshold and have the CloudWatch alarm change from ALARM state to OK.
  6. If the metric is still in ALARM state, the workflow will start again and continue to scale the application either up or down. If the metric is in OK state, the workflow will stop.
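The calculation behind steps 2 and 3 can be sketched as follows. This is an illustrative reconstruction, not the Lambda code shipped with the CloudFormation template: the function name, the parameter names, and the use of integer division when scaling in are assumptions.

```python
def next_kpu(
    current: int,
    factor: int,
    operation: str,
    scale_out: bool,
    min_kpu: int,
    max_kpu: int,
) -> int:
    """Compute the KPU count for the next scaling step, clamped to [min_kpu, max_kpu].

    operation is "add" (add/subtract) or "multiply" (multiply/divide).
    """
    if factor < 1:
        raise ValueError("factor must be at least 1")
    if operation == "add":
        desired = current + factor if scale_out else current - factor
    elif operation == "multiply":
        # Integer division when scaling in is an assumption of this sketch.
        desired = current * factor if scale_out else current // factor
    else:
        raise ValueError(f"unknown operation: {operation}")
    # Never go beyond the configured bounds.
    return max(min_kpu, min(max_kpu, desired))
```

When the returned value equals the current KPU count, the application is already at a bound and the workflow has nothing further to do, which corresponds to the stop condition in step 3.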

For applications that read from a Kinesis Data Streams source, you can use the metric millisBehindLatest. If using a Kafka source, you can use records lag max for scaling events. These metrics capture how far behind your application is from the head of the stream. You can also use a custom metric that you have registered in your Apache Flink applications.

The sample CloudFormation template allows you to select one of the following metrics:

  • Amazon Managed Service for Apache Flink application metrics – Requires an application name:
    • ContainerCPUUtilization – Overall percentage of CPU utilization across task manager containers in the Flink application cluster.
    • ContainerMemoryUtilization – Overall percentage of memory utilization across task manager containers in the Flink application cluster.
    • BusyTimeMsPerSecond – Time in milliseconds the application is busy (neither idle nor back pressured) per second.
    • BackPressuredTimeMsPerSecond – Time in milliseconds the application is back pressured per second.
    • LastCheckpointDuration – Time in milliseconds it took to complete the last checkpoint.
  • Kinesis Data Streams metrics – Requires the data stream name:
    • MillisBehindLatest – The number of milliseconds the consumer is behind the head of the stream, indicating how far behind the current time the consumer is.
    • IncomingRecords – The number of records successfully put to the Kinesis data stream over the specified time period. If no records are coming, this metric will be null and you won’t be able to scale down.
  • Amazon MSK metrics – Requires the cluster name, topic name, and consumer group name:
    • MaxOffsetLag – The maximum offset lag across all partitions in a topic.
    • SumOffsetLag – The aggregated offset lag for all the partitions in a topic.
    • EstimatedMaxTimeLag – The time estimate (in seconds) to drain MaxOffsetLag.
  • Custom metrics – Metrics you can define as part of your Apache Flink applications. Most common metrics are counters (continuously increase) or gauges (can be updated with last value). For this solution, you need to add the kinesisAnalytics dimension to the metric group. You also need to provide the custom metric name as a parameter in the CloudFormation template. If you need to use more dimensions in your custom metric, you need to modify the CloudWatch alarm so it’s able to use your specific metric. For more information on custom metrics, see Using Custom Metrics with Amazon Managed Service for Apache Flink.

The CloudFormation template deploys the resources as well as the auto scaling code. You only need to specify the name of the Amazon Managed Service for Apache Flink application, the metric on which you want to scale your application in or out, and the thresholds for triggering an alarm. By default, the solution uses the average aggregation for metrics and a period duration of 60 seconds for each data point. You can configure the evaluation periods and data points to alarm when defining the CloudFormation template.
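The interaction between evaluation periods and data points to alarm can be illustrated with a simplified model. The sketch below is not how CloudWatch is implemented: it assumes a greater-than-threshold comparison, ignores missing-data handling, and uses hypothetical names throughout.

```python
from typing import Sequence


def alarm_state(
    datapoints: Sequence[float],
    threshold: float,
    evaluation_periods: int,
    datapoints_to_alarm: int,
) -> str:
    """Simplified 'M out of N' evaluation over the most recent datapoints.

    datapoints holds one aggregated value per period (for example, a
    60-second average), oldest first.
    """
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    breaching = sum(1 for value in recent if value > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"
```

With 60-second periods, three evaluation periods, and two data points to alarm, the alarm in this model fires when at least two of the last three one-minute averages exceed the threshold.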

Scheduled scaling

This section describes how to implement a scaling solution for Amazon Managed Service for Apache Flink based on a schedule. To perform scheduled scaling, we use EventBridge and Lambda, as illustrated in the following figure.

These components are available in the CloudFormation template in the GitHub repo.

The EventBridge scheduler is triggered based on the parameters set when deploying the CloudFormation template. You define the KPU of the applications when running at peak times, as well as the KPU for non-peak times. The application runs with those KPU parameters depending on the time of day.

As with the previous example for metric-based scaling, the CloudFormation template deploys the resources and scaling code required. You only need to specify the name of the Amazon Managed Service for Apache Flink application and the schedule for the scaler to modify the application to the set number of KPUs.
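The choice between peak and non-peak capacity can be sketched as a small function. In the solution itself, the EventBridge schedule decides when the scaler runs; the function below only illustrates the selection logic, and its name and parameters are hypothetical.

```python
from datetime import time


def scheduled_kpu(
    now: time,
    peak_start: time,
    peak_end: time,
    peak_kpu: int,
    off_peak_kpu: int,
) -> int:
    """Return the KPU count that applies at the given time of day."""
    if peak_start <= peak_end:
        in_peak = peak_start <= now < peak_end
    else:
        # The peak window wraps around midnight, for example 22:00 to 06:00.
        in_peak = now >= peak_start or now < peak_end
    return peak_kpu if in_peak else off_peak_kpu
```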

Considerations for scaling Flink applications using metric-based or scheduled scaling

Be aware of the following when considering these solutions:

  • When scaling Amazon Managed Service for Apache Flink applications in or out, you can choose to either increase the overall application parallelism or modify the parallelism per KPU. The latter allows you to set the number of parallel tasks that can be scheduled per KPU. This sample only updates the overall parallelism, not the parallelism per KPU.
  • If SnapshotsEnabled is set to true in ApplicationSnapshotConfiguration, Amazon Managed Service for Apache Flink will automatically pause the application, take a snapshot, and then restore the application with the updated configuration whenever it is updated or scaled. This process may result in downtime for the application, depending on the state size, but there will be no data loss. When using metric-based scaling, you have to choose a minimum and a maximum threshold of KPUs the application can have. Depending on how much you scale by, if the new desired KPU count is higher or lower than your thresholds, the solution will update the KPUs to be equal to your thresholds.
  • When using metric-based scaling, you also have to choose a cooling down period. This is the amount of time you want your application to wait after being updated, to see if the metric has gone from ALARM status to OK status. This value depends on how long you are willing to wait before another scaling event occurs.
  • With the metric-based scaling solution, you are limited to choosing the metrics that are listed in the CloudFormation template. However, you can modify the alarms to use any available metric in CloudWatch.
  • If your application is required to run without interruptions for periods of time, we recommend using scheduled scaling, to limit scaling to non-critical times.
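The relationship between overall parallelism and parallelism per KPU mentioned in the first consideration can be sketched as follows. This assumes that allocated KPUs equal the overall parallelism divided by the parallelism per KPU; rounding up is an assumption of this sketch, the function names are hypothetical, and any overhead the service adds on top is ignored.

```python
import math


def parallelism_for_kpus(target_kpus: int, parallelism_per_kpu: int) -> int:
    """Overall parallelism to request so the application runs on target_kpus."""
    return target_kpus * parallelism_per_kpu


def allocated_kpus(parallelism: int, parallelism_per_kpu: int) -> int:
    """KPUs implied by a given overall parallelism (rounded up by assumption)."""
    return math.ceil(parallelism / parallelism_per_kpu)
```

Because the sample leaves the parallelism per KPU untouched, changing the KPU count amounts to changing the overall parallelism by the corresponding multiple.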


In this post, we covered how you can enable custom scaling for Amazon Managed Service for Apache Flink applications using enhanced monitoring features from CloudWatch integrated with Step Functions and Lambda. We also showed how you can configure a schedule to scale an application using EventBridge. Both of these samples and many more can be found in the GitHub repo.

About the Authors

Deepthi Mohan is a Principal PMT on the Amazon Managed Service for Apache Flink team.

Francisco Morillo is a Streaming Solutions Architect at AWS. Francisco works with AWS customers, helping them design real-time analytics architectures using AWS services, supporting Amazon Managed Streaming for Apache Kafka (Amazon MSK) and Amazon Managed Service for Apache Flink.

Serverless ICYMI Q4 2023

Post Syndicated from Eric Johnson original https://aws.amazon.com/blogs/compute/serverless-icymi-q4-2023/

Welcome to the 24th edition of the AWS Serverless ICYMI (in case you missed it) quarterly recap. Every quarter, we share all the most recent product launches, feature enhancements, blog posts, webinars, live streams, and other interesting things that you might have missed!

In case you missed our last ICYMI, check out what happened last quarter here.

2023 Q4 Calendar


ServerlessVideo at re:Invent 2023

ServerlessVideo is a demo application built by the AWS Serverless Developer Advocacy team to stream live videos and also perform advanced post-video processing. It uses several AWS services including AWS Step Functions, Amazon EventBridge, AWS Lambda, Amazon ECS, and Amazon Bedrock in a serverless architecture that makes it fast, flexible, and cost-effective. Key features include an event-driven core with loosely coupled microservices that respond to events routed by EventBridge. Step Functions orchestrates using both Lambda and ECS for video processing to balance speed, scale, and cost. There is a flexible plugin-based architecture using Step Functions and EventBridge to integrate and manage multiple video processing workflows, which include GenAI.

ServerlessVideo allows broadcasters to stream video to thousands of viewers using Amazon IVS. When a broadcast ends, a Step Functions workflow triggers a set of configured plugins to process the video, generating transcriptions, validating content, and more. The application incorporates various microservices to support live streaming, on-demand playback, transcoding, transcription, and events. Learn more about the project and watch videos from re:Invent 2023 at video.serverlessland.com.

AWS Lambda

AWS Lambda enabled outbound IPv6 connections from VPC-connected Lambda functions, providing virtually unlimited scale by removing IPv4 address constraints.

The AWS Lambda and AWS SAM teams also added support for sharing test events across teams using AWS SAM CLI to improve collaboration when testing locally.

AWS Lambda introduced integration with AWS Application Composer, allowing users to view and export Lambda function configuration details for infrastructure as code (IaC) workflows.

AWS added advanced logging controls enabling adjustable JSON-formatted logs, custom log levels, and configurable CloudWatch log destinations for easier debugging. AWS enabled monitoring of errors and timeouts occurring during initialization and restore phases in CloudWatch Logs as well, making troubleshooting easier.

For Kafka event sources, AWS enabled failed event destinations to prevent functions stalling on failing batches by rerouting events to SQS, SNS, or S3. AWS also enhanced Lambda auto scaling for Kafka event sources in November to reach maximum throughput faster, reducing latency for workloads prone to large bursts of messages.

AWS launched support for Python 3.12 and Java 21 Lambda runtimes, providing updated libraries, smaller deployment sizes, and better AWS service integration. AWS also introduced a simplified console workflow to automate complex network configuration when connecting functions to Amazon RDS and RDS Proxy.

Additionally in December, AWS enabled faster individual Lambda function scaling allowing each function to rapidly absorb traffic spikes by scaling up to 1000 concurrent executions every 10 seconds.

Amazon ECS and AWS Fargate

In Q4 of 2023, AWS introduced several new capabilities across its serverless container services including Amazon ECS, AWS Fargate, AWS App Runner, and more. These features help improve application resilience, security, developer experience, and migration to modern containerized architectures.

In October, Amazon ECS enhanced its task scheduling to start healthy replacement tasks before terminating unhealthy ones during traffic spikes. This prevents going under capacity due to premature shutdowns. Additionally, App Runner launched support for IPv6 traffic via dual-stack endpoints to remove the need for address translation.

In November, AWS Fargate enabled ECS tasks to selectively use SOCI lazy loading for only large container images in a task instead of requiring it for all images. Amazon ECS also added idempotency support for task launches to prevent duplicate instances on retries. Amazon GuardDuty expanded threat detection to Amazon ECS and Fargate workloads which users can easily enable.

Also in November, the open source Finch container tool for macOS became generally available. Finch allows developers to build, run, and publish Linux containers locally. A new website provides tutorials and resources to help developers get started.

Finally in December, AWS Migration Hub Orchestrator added new capabilities for replatforming applications to Amazon ECS using guided workflows. App Runner also improved integration with Route 53 domains to automatically configure required records when associating custom domains.

AWS Step Functions

In Q4 2023, AWS Step Functions announced the redrive capability for Standard Workflows. This feature allows failed workflow executions to be redriven from the point of failure, skipping unnecessary steps and reducing costs. The redrive functionality provides an efficient way to handle errors that require longer investigation or external actions before resuming the workflow.

Step Functions also launched support for HTTPS endpoints, enabling easier integration with external APIs and SaaS applications without needing custom code. Developers can now connect to third-party HTTP services directly within workflows. Additionally, AWS released a new test state capability that allows testing individual workflow states before full deployment. This feature helps accelerate development by making it faster and simpler to validate data mappings and permissions configurations.

AWS announced optimized integrations between AWS Step Functions and Amazon Bedrock for orchestrating generative AI workloads. Two new API actions were added specifically for invoking Bedrock models and training jobs from workflows. These integrations simplify building prompt chaining and other techniques to create complex AI applications with foundation models.

Finally, the Step Functions Workflow Studio is now integrated in the AWS Application Composer. This unified builder allows developers to design workflows and define application resources across the full project lifecycle within a single interface.

Amazon EventBridge

Amazon EventBridge announced support for new partner integrations with Adobe and Stripe. These integrations enable routing events from the Adobe and Stripe platforms to over 20 AWS services. This makes it easier to build event-driven architectures to handle common use cases.

Amazon SNS

In Q4, Amazon SNS added native in-place message archiving for FIFO topics to improve event stream durability by allowing retention policies and selective replay of messages without provisioning separate resources. Additional message filtering operators were also introduced including suffix matching, case-insensitive equality checks, and OR logic for matching across properties to simplify routing logic implementation for publishers and subscribers. Finally, delivery status logging was enabled through AWS CloudFormation.
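As an illustration of the new filtering operators, the following sketch builds a subscription filter policy that combines suffix matching, a case-insensitive equality check, and OR logic across two properties. The attribute names file_name and priority are hypothetical; the resulting JSON string is what you would supply as the subscription's FilterPolicy attribute.

```python
import json

# Match messages whose file_name ends in ".png" OR whose priority
# equals "urgent" regardless of letter case.
# The attribute names below are made up for this example.
filter_policy = {
    "$or": [
        {"file_name": [{"suffix": ".png"}]},
        {"priority": [{"equals-ignore-case": "urgent"}]},
    ]
}

# Filter policies are passed to Amazon SNS as a JSON string.
policy_json = json.dumps(filter_policy)
```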

Amazon SQS

Amazon SQS introduced several major new capabilities and updates that improve visibility, throughput, and message handling. Amazon SQS enabled AWS CloudTrail logging of key SQS APIs, which gives customers greater visibility into SQS activity. SQS also significantly increased the throughput quota for the high throughput mode of FIFO queues in certain Regions, and boosted throughput in Asia Pacific Regions. Furthermore, Amazon SQS added dead letter queue redrive support, which allows you to redrive messages that failed and were sent to a dead letter queue (DLQ).

Serverless at AWS re:Invent

Serverless videos from re:Invent

Visit the Serverless Land YouTube channel to find a list of serverless and serverless container sessions from re:Invent 2023. Hear from experts like Chris Munns and Julian Wood in their popular session, Best practices for serverless developers, or Nathan Peck and Jessica Deen in Deploying multi-tenant SaaS applications on Amazon ECS and AWS Fargate.

EDA Day Nashville


The AWS Serverless Developer Advocacy team hosted an event-driven architecture (EDA) day conference on October 26, 2023 in Nashville, Tennessee. This inaugural GOTO EDA day convened over 200 attendees ranging from prominent EDA community members to AWS speakers and product managers. Attendees engaged in 13 sessions, two workshops, and panels covering EDA adoption best practices. The event built upon 2022 content by incorporating additional topics like messaging, containers, and machine learning. It also created opportunities for students and underrepresented groups in tech to participate. The full-day conference facilitated education, inspiration, and thoughtful discussion around event-driven architectural patterns and services on AWS.

Videos from EDA Day are now available on the Serverless Land YouTube channel.

Serverless blog posts




Serverless container blog posts




Serverless Office Hours

Serverless office hours: Q4 videos




Containers from the Couch









Still looking for more?

The Serverless landing page has more information. The Lambda resources page contains case studies, webinars, whitepapers, customer stories, reference architectures, and even more Getting Started tutorials.

You can also follow the Serverless Developer Advocacy team on Twitter to see the latest news, follow conversations, and interact with the team.

And finally, visit the Serverless Land and Containers on AWS websites for all your serverless and serverless container needs.

Build efficient ETL pipelines with AWS Step Functions distributed map and redrive feature

Post Syndicated from Sriharsh Adari original https://aws.amazon.com/blogs/big-data/build-efficient-etl-pipelines-with-aws-step-functions-distributed-map-and-redrive-feature/

AWS Step Functions is a fully managed visual workflow service that enables you to build complex data processing pipelines involving a diverse set of extract, transform, and load (ETL) technologies such as AWS Glue, Amazon EMR, and Amazon Redshift. You can visually build the workflow by wiring individual data pipeline tasks and configuring payloads, retries, and error handling with minimal code.

While Step Functions supports automatic retries and error handling when data pipeline tasks fail due to momentary or transient errors, there can be permanent failures such as incorrect permissions, invalid data, and business logic failure during the pipeline run. This requires you to identify the issue in the step, fix the issue, and restart the workflow. Previously, to rerun the failed step, you needed to restart the entire workflow from the very beginning. This leads to delays in completing the workflow, especially if it’s a complex, long-running ETL pipeline. If the pipeline has many steps using map and parallel states, this also leads to increased cost due to the additional state transitions incurred by running the pipeline from the beginning.

Step Functions now supports the ability for you to redrive your workflow from a failed, aborted, or timed-out state so you can complete workflows faster and at a lower cost, and spend more time delivering business value. Now you can recover from unhandled failures faster by redriving failed workflow runs, after downstream issues are resolved, using the same input provided to the failed state.

In this post, we show you an ETL pipeline job that exports data from Amazon Relational Database Service (Amazon RDS) tables using the Step Functions distributed map state. Then we simulate a failure and demonstrate how to use the new redrive feature to restart the failed task from the point of failure.
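Programmatically, a redrive can be sketched as follows. This is a minimal example that assumes a boto3 Step Functions client is passed in (for example, one created with boto3.client("stepfunctions")) and relies on the redriveStatus field of the describe_execution response and the redrive_execution call; the helper name is hypothetical.

```python
def redrive_if_possible(sfn_client, execution_arn: str) -> bool:
    """Redrive a failed execution when Step Functions reports it as redrivable.

    sfn_client is expected to behave like a boto3 Step Functions client.
    Returns True if a redrive was requested, False otherwise.
    """
    execution = sfn_client.describe_execution(executionArn=execution_arn)
    if execution.get("redriveStatus") != "REDRIVABLE":
        # Nothing to do: the execution is still running, succeeded,
        # or is no longer eligible for redrive.
        return False
    sfn_client.redrive_execution(executionArn=execution_arn)
    return True
```

Passing the client in as a parameter keeps the helper easy to exercise with a stub before pointing it at a real account.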

Solution overview

One of the common functionalities involved in data pipelines is extracting data from multiple data sources and exporting it to a data lake or synchronizing the data to another database. You can use the Step Functions distributed map state to run hundreds of such export or synchronization jobs in parallel. Distributed map can read millions of objects from Amazon Simple Storage Service (Amazon S3) or millions of records from a single S3 object, and distribute the records to downstream steps. Step Functions runs the steps within the distributed map as child workflows at a maximum parallelism of 10,000. A concurrency of 10,000 is well above the concurrency supported by many other AWS services such as AWS Glue, which has a soft limit of 1,000 job runs per job.

The sample data pipeline sources product catalog data from Amazon DynamoDB and customer order data from an Amazon RDS for PostgreSQL database. The data is then cleansed, transformed, and uploaded to Amazon S3 for further processing. The data pipeline starts with an AWS Glue crawler to create the Data Catalog for the RDS database. Because starting an AWS Glue crawler is asynchronous, the pipeline has a wait loop to check if the crawler is complete. After the AWS Glue crawler is complete, the pipeline extracts data from the DynamoDB table and RDS tables. Because these two steps are independent, they are run as parallel steps: one using an AWS Lambda function to export, transform, and load the data from DynamoDB to an S3 bucket, and the other using a distributed map with AWS Glue job sync integration to do the same from the RDS tables to an S3 bucket. Note that AWS Identity and Access Management (IAM) permissions are required for invoking an AWS Glue job from Step Functions. For more information, refer to IAM Policies for invoking AWS Glue job from Step Functions.

The following diagram illustrates the Step Functions workflow.

There are multiple tables related to customers and order data in the RDS database. Amazon S3 hosts the metadata of all the tables as a .csv file. The pipeline uses the Step Functions distributed map to read the table metadata from Amazon S3, iterate on every single item, and call the downstream AWS Glue job in parallel to export the data. See the following code:

"States": {
            "Map": {
              "Type": "Map",
              "ItemProcessor": {
                "ProcessorConfig": {
                  "Mode": "DISTRIBUTED",
                  "ExecutionType": "STANDARD"
                },
                "StartAt": "Export data for a table",
                "States": {
                  "Export data for a table": {
                    "Type": "Task",
                    "Resource": "arn:aws:states:::glue:startJobRun.sync",
                    "Parameters": {
                      "JobName": "ExportTableData",
                      "Arguments": {
                        "--dbtable.$": "$.tables"
                      }
                    },
                    "End": true
                  }
                }
              },
              "Label": "Map",
              "ItemReader": {
                "Resource": "arn:aws:states:::s3:getObject",
                "ReaderConfig": {
                  "InputType": "CSV",
                  "CSVHeaderLocation": "FIRST_ROW"
                },
                "Parameters": {
                  "Bucket": "123456789012-stepfunction-redrive",
                  "Key": "tables.csv"
                }
              },
              "ResultPath": null,
              "End": true
            }
          }
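
To illustrate how the ItemReader settings turn the .csv file into per-table job arguments, here is a minimal local sketch in Python. It assumes the file has a single column named tables (inferred from the "$.tables" path in the definition); the table names are hypothetical, and the code only imitates the behavior rather than calling Step Functions.

```python
import csv
import io

# Hypothetical contents of tables.csv; the header row supplies the key name.
TABLES_CSV = "tables\ncustomers\norders\norder_items\n"


def items_from_csv(text):
    """Imitate CSVHeaderLocation=FIRST_ROW: each data row becomes one item keyed by the header."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def glue_arguments(item):
    """Imitate "--dbtable.$": "$.tables" by reading the item's "tables" field."""
    return {"--dbtable": item["tables"]}


items = items_from_csv(TABLES_CSV)
job_arguments = [glue_arguments(item) for item in items]

for arguments in job_arguments:
    print(arguments)
```

Each printed dictionary corresponds to one child workflow, which is why the number of AWS Glue job runs equals the number of data rows in the file.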


To deploy the solution, you need the following prerequisites:

Launch the CloudFormation template

Complete the following steps to deploy the solution resources using AWS CloudFormation:

  1. Choose Launch Stack to launch the CloudFormation stack:
  2. Enter a stack name.
  3. Select all the check boxes under Capabilities and transforms.
  4. Choose Create stack.

The CloudFormation template creates many resources, including the following:

  • The data pipeline described earlier as a Step Functions workflow
  • An S3 bucket to store the exported data and the metadata of the tables in Amazon RDS
  • A product catalog table in DynamoDB
  • An RDS for PostgreSQL database instance with pre-loaded tables
  • An AWS Glue crawler that crawls the RDS table and creates an AWS Glue Data Catalog
  • A parameterized AWS Glue job to export data from the RDS table to an S3 bucket
  • A Lambda function to export data from DynamoDB to an S3 bucket

Simulate the failure

Complete the following steps to test the solution:

  1. On the Step Functions console, choose State machines in the navigation pane.
  2. Choose the workflow named ETL_Process.
  3. Run the workflow with default input.

Within a few seconds, the workflow fails at the distributed map state.

You can inspect the map run errors by accessing the Step Functions workflow execution events for map runs and child workflows. In this example, you can identify that the exception is Glue.ConcurrentRunsExceededException from AWS Glue. The error indicates there are more concurrent requests to run an AWS Glue job than are configured. Distributed map reads the table metadata from Amazon S3 and invokes as many AWS Glue jobs as the number of rows in the .csv file, but the AWS Glue job is set with a concurrency of 3 when it is created. This resulted in the child workflow failure, cascading the failure to the distributed map state and then the parallel state. The other step in the parallel state to fetch the DynamoDB table ran successfully. If any step in the parallel state fails, the whole state fails, as seen with the cascading failure.

Handle failures with distributed map

By default, when a state reports an error, Step Functions causes the workflow to fail. There are multiple ways you can handle this failure with distributed map state:

  • Step Functions enables you to catch errors, retry errors, and fall back to another state to handle errors gracefully. See the following code:
    "Retry": [
                          {
                            "ErrorEquals": [
                              "Glue.ConcurrentRunsExceededException"
                            ],
                            "BackoffRate": 20,
                            "IntervalSeconds": 10,
                            "MaxAttempts": 3,
                            "Comment": "Exception",
                            "JitterStrategy": "FULL"
                          }
                        ]

  • Sometimes, businesses can tolerate failures. This is especially true when you are processing millions of items and you expect data quality issues in the dataset. By default, when an iteration of map state fails, all other iterations are aborted. With distributed map, you can specify the maximum number of, or percentage of, failed items as a failure threshold. If the failure is within the tolerable level, the distributed map doesn’t fail.
  • The distributed map state allows you to control the concurrency of the child workflows. You can set this concurrency to match the AWS Glue job concurrency. Remember, this concurrency is applicable only at the workflow execution level, not across workflow executions.
  • You can redrive the failed state from the point of failure after fixing the root cause of the error.
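
To make these options more concrete, the following Python sketch works out the wait before each retry for the settings shown earlier. It assumes Step Functions multiplies the previous interval by BackoffRate after each attempt, and that JitterStrategy FULL picks a random wait between zero and that value, so the numbers are upper bounds. The sketch also shows the Map state fields that would carry the concurrency limit and failure threshold; the threshold value is an illustrative assumption and is not part of the sample solution.

```python
def retry_delay_ceilings(interval_seconds, backoff_rate, max_attempts):
    """Upper bound, in seconds, of the wait before each retry attempt."""
    return [interval_seconds * backoff_rate ** attempt for attempt in range(max_attempts)]


# Settings from the Retry block: IntervalSeconds=10, BackoffRate=20, MaxAttempts=3.
ceilings = retry_delay_ceilings(interval_seconds=10, backoff_rate=20, max_attempts=3)
print(ceilings)  # [10, 200, 4000]

# Fields that would sit on the distributed Map state for the other two options.
map_state_controls = {
    "MaxConcurrency": 3,              # match the AWS Glue job's maximum concurrency
    "ToleratedFailurePercentage": 5,  # assumed value: tolerate up to 5% failed items
}
print(map_state_controls)
```

With these settings the third retry could wait more than an hour, so a smaller BackoffRate may be a better fit when the downstream job frees capacity within minutes.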

Redrive the failed state

The root cause of the issue in the sample solution is the AWS Glue job concurrency. To address this by redriving the failed state, complete the following steps:

  1. On the AWS Glue console, navigate to the job named ExportTableData.
  2. On the Job details tab, under Advanced properties, update Maximum concurrency to 5.

With the redrive feature, you can restart executions of standard workflows that didn’t complete successfully in the last 14 days. These include failed, aborted, or timed-out runs. You can only redrive a failed workflow from the step where it failed using the same input as the last non-successful state. You can’t redrive a failed workflow using a state machine definition that is different from the initial workflow execution. After the failed state is redriven successfully, Step Functions runs all the downstream tasks automatically. To learn more about how distributed map redrive works, refer to Redriving Map Runs.

Because the distributed map runs the steps inside the map as child workflows, the workflow IAM execution role needs permission to redrive the map run to restart the distributed map state:

  "Version": "2012-10-17",
  "Statement": [
      "Effect": "Allow",
      "Action": [
      "Resource": "arn:aws:states:us-east-2:123456789012:execution:myStateMachine/myMapRunLabel:*"

You can redrive a workflow from its failed step programmatically, via the AWS Command Line Interface (AWS CLI) or AWS SDK, or using the Step Functions console, which provides a visual operator experience.

  1. On the Step Functions console, navigate to the failed workflow you want to redrive.
  2. On the Details tab, choose Redrive from failure.

The pipeline now runs successfully because there is enough concurrency to run the AWS Glue jobs.

To redrive a workflow programmatically from its point of failure, call the new Redrive Execution API action. The same workflow starts from the last non-successful state and uses the same input as the last non-successful state from the initial failed workflow. The state to redrive from the workflow definition and the previous input are immutable.
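
As a minimal sketch of the programmatic path, the Python function below checks whether an execution is redrivable and then redrives it. It assumes a boto3 Step Functions client is passed in, and it relies on the describe_execution and redrive_execution calls and the redriveStatus field of the DescribeExecution response. The policy dictionary assumes that states:RedriveExecution is the action the execution role needs for the map run; the ARN reuses the example pattern shown earlier.

```python
# Assumed IAM statement for redriving a map run (action name is an assumption
# based on the RedriveExecution API; the resource reuses the example ARN pattern).
REDRIVE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["states:RedriveExecution"],
            "Resource": "arn:aws:states:us-east-2:123456789012:execution:myStateMachine/myMapRunLabel:*",
        }
    ],
}


def redrive_if_possible(sfn_client, execution_arn):
    """Redrive an execution when Step Functions reports it as redrivable.

    Returns the RedriveExecution response, or None when redrive is not possible.
    """
    description = sfn_client.describe_execution(executionArn=execution_arn)
    if description.get("redriveStatus") != "REDRIVABLE":
        return None
    return sfn_client.redrive_execution(executionArn=execution_arn)


# Usage with boto3 (left as a comment so the sketch has no third-party dependency):
#   import boto3
#   redrive_if_possible(boto3.client("stepfunctions"), "<execution ARN>")
```

Passing the client in as an argument keeps the function easy to exercise with a stub before pointing it at a real account.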

Note the following regarding different types of child workflows:

  • Redrive for express child workflows – For failed child workflows that are express workflows within a distributed map, the redrive capability ensures a seamless restart from the beginning of the child workflow. This allows you to resolve issues that are specific to individual iterations without restarting the entire map.
  • Redrive for standard child workflows – For failed child workflows within a distributed map that are standard workflows, the redrive feature functions the same way as with standalone standard workflows. You can restart the failed state within each map iteration from its point of failure, skipping unnecessary steps that have already successfully run.

You can use Step Functions status change notifications with Amazon EventBridge for failure notifications such as sending an email on failure.
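
For example, an EventBridge rule for failure notifications could use an event pattern along the following lines. This is a sketch, assuming the standard "Step Functions Execution Status Change" event from the aws.states source; the small matcher only imitates exact-value matching locally and is not the EventBridge implementation, and the sample event is hypothetical.

```python
import json

# Event pattern for executions that did not complete successfully.
failure_pattern = {
    "source": ["aws.states"],
    "detail-type": ["Step Functions Execution Status Change"],
    "detail": {"status": ["FAILED", "TIMED_OUT", "ABORTED"]},
}


def matches(pattern, event):
    """Local imitation of exact-value matching for patterns shaped like the one above."""
    for key, expected in pattern.items():
        actual = event.get(key)
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Hypothetical event, trimmed to the fields the pattern inspects.
sample_event = {
    "source": "aws.states",
    "detail-type": "Step Functions Execution Status Change",
    "detail": {"status": "FAILED"},
}

print(json.dumps(failure_pattern, indent=2))
print(matches(failure_pattern, sample_event))
```

The rule's target could then be an Amazon SNS topic with an email subscription.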

Clean up

To clean up your resources, delete the CloudFormation stack via the AWS CloudFormation console.


In this post, we showed you how to use the Step Functions redrive feature to redrive a failed step within a distributed map by restarting the failed step from the point of failure. The distributed map state allows you to write workflows that coordinate large-scale parallel workloads within your serverless applications. Step Functions runs the steps within the distributed map as child workflows at a maximum parallelism of 10,000, which is well above the concurrency supported by many AWS services.

To learn more about distributed map, refer to Step Functions – Distributed Map. To learn more about redriving workflows, refer to Redriving executions.

About the Authors

Sriharsh Adari is a Senior Solutions Architect at Amazon Web Services (AWS), where he helps customers work backwards from business outcomes to develop innovative solutions on AWS. Over the years, he has helped multiple customers on data platform transformations across industry verticals. His core areas of expertise include Technology Strategy, Data Analytics, and Data Science. In his spare time, he enjoys playing tennis.

Joe Morotti is a Senior Solutions Architect at Amazon Web Services (AWS), working with Enterprise customers across the Midwest US to develop innovative solutions on AWS. He has held a wide range of technical roles and enjoys showing customers the art of the possible. He has attained seven AWS certifications and has a passion for AI/ML and the contact center space. In his free time, he enjoys spending quality time with his family exploring new places and overanalyzing his sports team’s performance.

Uma Ramadoss is a specialist Solutions Architect at Amazon Web Services, focused on the Serverless platform. She is responsible for helping customers design and operate event-driven cloud-native applications and modern business workflows using services like Lambda, EventBridge, Step Functions, and Amazon MWAA.

AWS Step Functions Workflow Studio is now available in AWS Application Composer

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/aws-step-functions-workflow-studio-is-now-available-in-aws-application-composer/

Today, we’re announcing that AWS Step Functions Workflow Studio is now available in AWS Application Composer. This new integration brings together the development of workflows and application resources into a unified visual infrastructure as code (IaC) builder.

Now, you can have a seamless transition between authoring workflows with AWS Step Functions Workflow Studio and defining resources with AWS Application Composer. This announcement allows you to create and manage all resources at any stage of your development journey. You can visualize the full application in AWS Application Composer, then zoom into the workflow details with AWS Step Functions Workflow Studio—all within a single interface.

Seamlessly build workflow and modern application
To help you design and build modern applications, we launched AWS Application Composer in March 2023. With AWS Application Composer, you can use a visual builder to compose and configure serverless applications from AWS services backed by deployment-ready IaC.

In various use cases of building modern applications, you may also need to orchestrate microservices, automate mission-critical business processes, create event-driven applications that respond to infrastructure changes, or build machine learning (ML) pipelines. To solve these challenges, you can use AWS Step Functions, a fully managed service that makes it easier to coordinate distributed application components using visual workflows. To simplify workflow development, in 2021 we introduced AWS Step Functions Workflow Studio, a low-code visual tool for rapid workflow prototyping and development across 12,000+ API actions from over 220 AWS services.

While AWS Step Functions Workflow Studio brings simplicity to building workflows, customers that want to deploy workflows using IaC had to manually define their state machine resource and migrate their workflow definitions to the IaC template.

Better together: AWS Step Functions Workflow Studio in AWS Application Composer
With this new integration, you can now design AWS Step Functions workflows in AWS Application Composer using a drag-and-drop interface. This accelerates the path from prototyping to production deployment and iterating on existing workflows.

You can start by composing your modern application with AWS Application Composer. Within the canvas, you can add a workflow by adding an AWS Step Functions state machine resource. This new capability provides you with the ability to visually design and build a workflow with an intuitive interface to connect workflow steps to resources.

How it works
Let me walk you through how you can use AWS Step Functions Workflow Studio in AWS Application Composer. For this demo, let’s say that I need to improve handling e-commerce transactions by building a workflow and integrating with my existing serverless APIs.

First, I navigate to AWS Application Composer. Because I already have an existing project that includes application code and IaC templates from AWS Application Composer, I don’t need to build anything from scratch.

I open the menu and select Project folder to open the files in my local development machine.

Then, I select the path of my local folder, and AWS Application Composer automatically detects the IaC template that I currently have.

Then, AWS Application Composer visualizes the diagram in the canvas. What I really like about using this approach is that AWS Application Composer activates Local sync mode, which automatically syncs and saves any changes in IaC templates into my local project.

Here, I have a simple serverless API running on Amazon API Gateway, which invokes an AWS Lambda function and integrates with Amazon DynamoDB.

Now, I’m ready to make some changes to my serverless API. I configure another route on Amazon API Gateway and add an AWS Step Functions state machine to start building my workflow.

When I configure my Step Functions state machine, I can start editing my workflow by selecting Edit in Workflow Studio.

This opens Step Functions Workflow Studio within the AWS Application Composer canvas. I have the same experience as Workflow Studio in the AWS Step Functions console. I can use the canvas to add actions, flows, and patterns into my Step Functions state machine.

I start building my workflow, and here’s the result that I exported using Export PNG image in Workflow Studio.

But here’s where this new capability really helps me as a developer. In the workflow definition, I use various AWS resources, such as AWS Lambda functions and Amazon DynamoDB. If I need to reference the AWS resources I defined in AWS Application Composer, I can use an AWS CloudFormation substitution.

With AWS CloudFormation substitutions, I can add a substitution using an AWS CloudFormation convention, which is a dynamic reference to a value that is provided in the IaC template. I am using a placeholder substitution here so I can map it with an AWS resource in the AWS Application Composer canvas in a later step.

I can also define the AWS CloudFormation substitution for my Amazon DynamoDB table.
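
To give a rough idea of what a substitution does, the Python sketch below replaces ${...} placeholders in a workflow definition with concrete values, which is approximately the job the definition substitutions perform at deploy time. The placeholder names, the Record Transaction state, and the ARN and table values are hypothetical; the code only illustrates the idea locally and is not how CloudFormation is implemented.

```python
import json
import re

# Workflow definition with placeholders instead of hard-coded resource identifiers.
definition = json.dumps({
    "StartAt": "Inventory Process",
    "States": {
        "Inventory Process": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {"FunctionName": "${InventoryFunctionArn}", "Payload.$": "$"},
            "Next": "Record Transaction",
        },
        "Record Transaction": {
            "Type": "Task",
            "Resource": "arn:aws:states:::dynamodb:putItem",
            "Parameters": {
                "TableName": "${TransactionTable}",
                "Item": {"id": {"S.$": "$.transactionId"}},
            },
            "End": True,
        },
    },
})


def apply_substitutions(definition_text, substitutions):
    """Replace each ${Name} with its mapped value; unknown names are left untouched."""
    def replace(match):
        return substitutions.get(match.group(1), match.group(0))
    return re.sub(r"\$\{(\w+)\}", replace, definition_text)


resolved = apply_substitutions(definition, {
    "InventoryFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:inventory",
    "TransactionTable": "transactions",
})
print(json.dumps(json.loads(resolved), indent=2))
```

JSONPath expressions such as "$" and "$.transactionId" are not touched because only the ${Name} form is treated as a placeholder.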

At this stage, I’m happy with my workflow. To review the Amazon States Language as my AWS Step Functions state machine definition, I can also open the Code tab. Now I don’t need to manually copy and paste this definition into IaC templates. I only need to save my work and choose Return to Application Composer.

Here, I can see that my AWS Step Functions state machine is updated both in the visual diagram and in the state machine definition section.

If I scroll down, I will find AWS CloudFormation Definition Substitutions for resources that I defined in Workflow Studio. I can manually replace the mapping here, or I can use the canvas.

To use the canvas, I simply drag and drop the respective resources in my Step Functions state machine and in the Application Composer canvas. Here, I connect the Inventory Process task state with a new AWS Lambda function. Also, my Step Functions state machine tasks can reference existing resources.

When I choose Template, I can see that the state machine definition is integrated with other AWS Application Composer resources. I can easily deploy this IaC template using the AWS Serverless Application Model Command Line Interface (AWS SAM CLI) or CloudFormation.

Things to know
Here is some additional information for you:

Pricing – The AWS Step Functions Workflow Studio in AWS Application Composer comes at no additional cost.

Availability – This feature is available in all AWS Regions where Application Composer is available.

AWS Step Functions Workflow Studio in AWS Application Composer provides you with an easy-to-use experience to integrate your workflow into modern applications. Get started and learn more about this feature on the AWS Application Composer page.

Happy building!
— Donnie

External endpoints and testing of task states now available in AWS Step Functions

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/aws/external-endpoints-and-testing-of-task-states-now-available-in-aws-step-functions/

Now AWS Step Functions HTTPS endpoints let you integrate third-party APIs and external services to your workflows. HTTPS endpoints provide a simpler way of making calls to external APIs and integrating with existing SaaS providers, like Stripe for handling payments, GitHub for code collaboration and repository management, and Salesforce for sales and marketing insights. Before this launch, customers needed to use an AWS Lambda function to call the external endpoint, handling authentication and errors directly from the code.

Also, we are announcing a new capability to test your task states individually without the need to deploy or execute the state machine.

AWS Step Functions is a visual workflow service that makes it easy for developers to build distributed applications, automate processes, orchestrate microservices, and create data and machine learning (ML) pipelines. Step Functions integrates with over 220 AWS services and provides features that help developers build, such as built-in error handling, real-time and auditable workflow execution history, and large-scale parallel processing.

HTTPS endpoints
HTTPS endpoints are a new resource for your task states that allow you to connect to third-party HTTP targets outside AWS. Step Functions invokes the HTTP endpoint, delivers a request body, headers, and parameters, and gets a response from the third-party service. You can use any preferred HTTP method, such as GET or POST.

HTTPS endpoints use Amazon EventBridge connections to manage the authentication credentials for the target. This defines the authorization type used, which can be a basic authentication with a username and password, an API key, or OAuth. EventBridge connections use AWS Secrets Manager to store the secret. This keeps the secrets out of the state machine, reducing the risks of accidentally exposing your secrets in logs or in the state machine definition.

Getting started with HTTPS endpoints
To get started with HTTPS endpoints, first you need to create an EventBridge connection. Then you need to create a new AWS Identity and Access Management (IAM) role and give permissions so your state machine can access the connection resource, get the secret from Secrets Manager, and get permissions to invoke an HTTP endpoint.

Here are the policies that you need to include in your state machine execution role:

    "Version": "2012-10-17",
    "Statement": [
            "Effect": "Allow",
            "Action": [
            "Resource": "arn:aws:secretsmanager:*:*:secret:events!connection/*"
    "Version": "2012-10-17",
    "Statement": [
            "Sid": "RetrieveConnectionCredentials",
            "Effect": "Allow",
            "Action": [
            "Resource": [
    "Version": "2012-10-17",
    "Statement": [
            "Sid": "InvokeHTTPEndpoint",
            "Effect": "Allow",
            "Action": [
            "Resource": [

After you have everything ready, you can create your state machine. In your state machine, add a new task state to call a third-party API. You can configure the API endpoint to point to the third-party URL you need, set the correct HTTP method, pick the connection Amazon Resource Name (ARN) for the connection you created previously as the authentication for that endpoint, and provide a request body if needed. In addition, all these parameters can be set dynamically at runtime from the state JSON input.
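
As a sketch of what such a task state could look like, the Python snippet below builds the state as a dictionary and prints it as Amazon States Language JSON. It assumes the HTTP task resource arn:aws:states:::http:invoke with the ApiEndpoint, Method, Authentication, and RequestBody parameters; the endpoint URL, connection ARN, and request body are hypothetical.

```python
import json


def http_task(api_endpoint, method, connection_arn, request_body=None):
    """Build a Task state that calls an external HTTPS endpoint."""
    parameters = {
        "ApiEndpoint": api_endpoint,
        "Method": method,
        # The EventBridge connection supplies the credentials for the target.
        "Authentication": {"ConnectionArn": connection_arn},
    }
    if request_body is not None:
        parameters["RequestBody"] = request_body
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::http:invoke",
        "Parameters": parameters,
        "End": True,
    }


state = http_task(
    api_endpoint="https://api.example.com/v1/invoices",  # hypothetical URL
    method="POST",
    connection_arn=(
        "arn:aws:events:us-east-2:123456789012:connection/"
        "example-connection/00000000-0000-0000-0000-000000000000"  # hypothetical ARN
    ),
    request_body={"customer": "cus_example", "amount": 1000},
)
print(json.dumps(state, indent=2))
```

To set a value at runtime, the usual ".$" key suffix with a JSONPath value (for example "ApiEndpoint.$": "$.url") would take it from the state input.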

Call a third party API

Now, making external requests with Step Functions is easy, and you can take advantage of all the configurations that Step Functions provides to handle errors, such as retries for transient errors or momentary service unavailability, and redrive for errors that require longer investigation or resolution time.

Test state
To accelerate feedback cycles, we are also announcing a new capability to test individual states. This new feature allows you to test states independently from the execution of your workflow. This is particularly useful for testing endpoints configuration. You can change the input and test the different scenarios without the need to deploy your workflow or execute the whole state machine. This new feature is available in all task, choice, and pass states.

You will see the testing capability in the Step Functions Workflow Studio when you select a task.

Test state button

When you choose Test state, you will be redirected to a different view where you can test the task state. You can verify that the state machine role has the right permissions, that the endpoint you want to call is correctly configured, and that the data manipulations work as expected.
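
The same check can be scripted. The following Python sketch prepares the arguments for the TestState API, assuming the boto3 test_state call with its definition, roleArn, input, and inspectionLevel parameters. The Pass state, its input, and the role ARN are hypothetical, and the actual call is left as a comment so the sketch runs without AWS credentials.

```python
import json


def build_test_state_request(state_definition, role_arn, state_input, inspection_level="DEBUG"):
    """Assemble keyword arguments for a TestState call on a single state."""
    return {
        "definition": json.dumps(state_definition),
        "roleArn": role_arn,
        "input": json.dumps(state_input),
        "inspectionLevel": inspection_level,
    }


# Hypothetical Pass state that adds two input fields with an intrinsic function.
pass_state = {
    "Type": "Pass",
    "Parameters": {"total.$": "States.MathAdd($.price, $.shipping)"},
    "End": True,
}

request = build_test_state_request(
    state_definition=pass_state,
    role_arn="arn:aws:iam::123456789012:role/example-state-machine-role",  # hypothetical
    state_input={"price": 40, "shipping": 2},
)
print(json.dumps(request, indent=2))

# With boto3 installed and credentials configured:
#   import boto3
#   response = boto3.client("stepfunctions").test_state(**request)
#   print(response["status"], response.get("output"))
```

Changing state_input and calling the API again is a quick way to try different scenarios against the same state.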

How to test a state

Now, with all the features that Step Functions provides, it’s never been easier to build state machines that can solve a wide variety of problems, like payment flows, workflows with manual inputs, and integration to legacy systems. Using Step Functions HTTPS endpoints, you can directly integrate with popular payment platforms while ensuring that your users’ credit cards are only charged once and errors are handled automatically. In addition, you can test this new integration even before you deploy the state machine using the new test state feature.

These new features are available in all AWS Regions except Asia Pacific (Hyderabad), Asia Pacific (Melbourne), AWS Israel (Tel Aviv), China, and GovCloud Regions.

To get started, you can try the “Generate Invoices using Stripe” sample project from Step Functions in the AWS Management Console or check out the AWS Step Functions Developer Guide to learn more.


Build generative AI apps using AWS Step Functions and Amazon Bedrock

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/aws/build-generative-ai-apps-using-aws-step-functions-and-amazon-bedrock/

Today we are announcing two new optimized integrations for AWS Step Functions with Amazon Bedrock. Step Functions is a visual workflow service that helps developers build distributed applications, automate processes, orchestrate microservices, and create data and machine learning (ML) pipelines.

In September, we made available Amazon Bedrock, the easiest way to build and scale generative artificial intelligence (AI) applications with foundation models (FMs). Bedrock offers a choice of foundation models from leading providers like AI21 Labs, Anthropic, Cohere, Stability AI, and Amazon, along with a broad set of capabilities that customers need to build generative AI applications, while maintaining privacy and security. You can use Amazon Bedrock from the AWS Management Console, AWS Command Line Interface (AWS CLI), or AWS SDKs.

The new Step Functions optimized integrations with Amazon Bedrock allow you to orchestrate tasks to build generative AI applications using Amazon Bedrock, as well as to integrate with over 220 AWS services. With Step Functions, you can visually develop, inspect, and audit your workflows. Previously, you needed to invoke an AWS Lambda function to use Amazon Bedrock from your workflows, adding more code to maintain them and increasing the costs of your applications.

Step Functions provides two new optimized API actions for Amazon Bedrock:

  • InvokeModel – This integration allows you to invoke a model and run the inferences with the input provided in the parameters. Use this API action to run inferences for text, image, and embedding models.
  • CreateModelCustomizationJob – This integration creates a fine-tuning job to customize a base model. In the parameters, you specify the foundation model and the location of the training data. When the job is completed, your custom model is ready to be used. This is an asynchronous API, and this integration allows Step Functions to run a job and wait for it to complete before proceeding to the next state. This means that the state machine execution will pause while the create model customization job is running and will resume automatically when the task is complete.

Optimized connectors

The InvokeModel API action accepts requests and responses that are up to 25 MB. However, Step Functions has a 256 kB limit on state payload input and output. In order to support larger payloads with this integration, you can define an Amazon Simple Storage Service (Amazon S3) bucket where the InvokeModel API reads data from and writes the result to. These configurations can be provided in the parameters section of the API action configuration.
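
As an illustration of the S3 option, the Python sketch below builds an InvokeModel task state that reads its request from one S3 object and writes the response to another. It assumes the optimized integration resource arn:aws:states:::bedrock:invokeModel with ModelId, Input.S3Uri, Output.S3Uri, ContentType, and Accept parameters; the model ID and the bucket and key names are hypothetical choices.

```python
import json


def invoke_model_task(model_id, input_s3_uri, output_s3_uri):
    """Build a Task state that keeps large request and response payloads in Amazon S3."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::bedrock:invokeModel",
        "Parameters": {
            "ModelId": model_id,
            "Input": {"S3Uri": input_s3_uri},    # request body is read from this object
            "Output": {"S3Uri": output_s3_uri},  # response is written here, not to state output
            "ContentType": "application/json",
            "Accept": "application/json",
        },
        "End": True,
    }


bedrock_state = invoke_model_task(
    model_id="anthropic.claude-v2",                              # assumed model choice
    input_s3_uri="s3://example-bucket/requests/prompt.json",     # hypothetical object
    output_s3_uri="s3://example-bucket/responses/result.json",   # hypothetical object
)
print(json.dumps(bedrock_state, indent=2))
```

The state machine execution role would also need permission to read and write these S3 objects.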

How to get started with Amazon Bedrock and AWS Step Functions
Before getting started, ensure that you create the state machine in a Region where Amazon Bedrock is available. For this example, use US East (N. Virginia), us-east-1.

From the AWS Management Console, create a new state machine. Search for “bedrock,” and the two available API actions will appear. Drag the InvokeModel to the state machine.

Using the invoke model connector

You can now configure that state in the menu on the right. First, you can define which foundation model you want to use. Pick a model from the list, or get the model dynamically from the input.

Then you need to configure the model parameters. You can enter the inference parameters in the text box or load the parameters from Amazon S3.

Configuration for the API Action

If you keep scrolling in the API action configuration, you can specify additional configuration options for the API, such as the S3 destination bucket. When this field is specified, the API action stores the API response in the specified bucket instead of returning it to the state output. Here, you can also specify the content type for the requests and responses.

Additional configuration for the connector

When you finish configuring your state machine, you can create and run it. When the state machine runs, you can visualize the execution details, select the Amazon Bedrock state, and check its inputs and outputs.

Executing the state machine

Using Step Functions, you can build state machines as extensively as you need, combining different services to solve many problems. For example, you can use Step Functions with Amazon Bedrock to create applications using prompt chaining. This is a technique for building complex generative AI applications by passing multiple smaller and simpler prompts to the FM instead of a very long and detailed prompt. To build a prompt chain, you can create a state machine that calls Amazon Bedrock multiple times to get an inference for each of the smaller prompts. You can use the parallel state to run all these tasks in parallel and then use an AWS Lambda function that unifies the responses of the parallel tasks into one response and generates a result.
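
The parallel variant described above could be sketched as follows. The Python code generates a Parallel state with one Amazon Bedrock branch per prompt, followed by a Lambda task that combines the branch results. The request body format depends on the chosen model, so the {"prompt": ...} body, the model ID, the prompts, and the Lambda function name are all hypothetical placeholders.

```python
import json


def bedrock_branch(name, prompt, model_id):
    """One branch of the Parallel state: a single InvokeModel task for one small prompt."""
    return {
        "StartAt": name,
        "States": {
            name: {
                "Type": "Task",
                "Resource": "arn:aws:states:::bedrock:invokeModel",
                # Placeholder body: the real shape depends on the model provider.
                "Parameters": {"ModelId": model_id, "Body": {"prompt": prompt}},
                "End": True,
            }
        },
    }


def prompt_fanout(prompts, model_id, combine_function_name):
    """Run every prompt in parallel, then hand the list of responses to a Lambda function."""
    branches = [
        bedrock_branch(f"Prompt {index + 1}", prompt, model_id)
        for index, prompt in enumerate(prompts)
    ]
    return {
        "StartAt": "Run prompts",
        "States": {
            "Run prompts": {"Type": "Parallel", "Branches": branches, "Next": "Combine responses"},
            "Combine responses": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {"FunctionName": combine_function_name, "Payload.$": "$"},
                "End": True,
            },
        },
    }


workflow = prompt_fanout(
    prompts=["Summarize the product reviews.", "List the three most common complaints."],
    model_id="anthropic.claude-v2",
    combine_function_name="combine-responses",
)
print(json.dumps(workflow, indent=2))
```

Because a Parallel state outputs an array with one entry per branch, the Lambda function receives all responses in a single payload.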

Available now
AWS Step Functions optimized integrations for Amazon Bedrock are limited to the AWS Regions where Amazon Bedrock is available.

You can get started with Step Functions and Amazon Bedrock by trying out a sample project from the Step Functions console.


Introducing the AWS Integrated Application Test Kit (IATK)

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/aws-integrated-application-test-kit/

This post is written by Dan Fox, Principal Specialist Solutions Architect, and Brian Krygsman, Senior Solutions Architect.

Today, AWS announced the public preview launch of the AWS Integrated Application Test Kit (IATK). AWS IATK is a software library that helps you write automated tests for cloud-based applications. This blog post presents several initial features of AWS IATK, and then shows working examples using an example video processing application. If you are getting started with serverless testing, learn more at serverlessland.com/testing.


When you create applications composed of serverless services like AWS Lambda, Amazon EventBridge, or AWS Step Functions, many of your architecture components cannot be deployed to your desktop, but instead only exist in the AWS Cloud. In contrast to working with applications deployed locally, these types of applications benefit from cloud-based strategies for performing automated tests. For its public preview launch, AWS IATK helps you implement some of these strategies for Python applications. AWS IATK will support other languages in future launches.

Locating resources for tests

When you write automated tests for cloud resources, you need the physical IDs of your resources. The physical ID is the name AWS assigns to a resource after creation. For example, to send requests to Amazon API Gateway you need the physical ID, which forms the API endpoint.

If you deploy cloud resources in separate infrastructure as code stacks, you might have difficulty locating physical IDs. In CloudFormation, you create the logical IDs of the resources in your template, as well as the stack name. With IATK, you can get the physical ID of a resource if you provide the logical ID and stack name. You can also get stack outputs by providing the stack name. These convenient methods simplify locating resources for the tests that you write.
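
To show the kind of lookup involved, the Python sketch below resolves a physical ID and reads stack outputs directly through the CloudFormation API. It is not the IATK interface; it assumes a boto3 CloudFormation client is passed in and uses the describe_stack_resource and describe_stacks calls. The stack and resource names in the usage comment are hypothetical.

```python
def physical_id(cfn_client, stack_name, logical_id):
    """Return the physical ID of one resource, given its stack name and logical ID."""
    response = cfn_client.describe_stack_resource(
        StackName=stack_name, LogicalResourceId=logical_id
    )
    return response["StackResourceDetail"]["PhysicalResourceId"]


def stack_outputs(cfn_client, stack_name):
    """Return the stack's outputs as a {key: value} dictionary."""
    stacks = cfn_client.describe_stacks(StackName=stack_name)["Stacks"]
    return {
        output["OutputKey"]: output["OutputValue"]
        for output in stacks[0].get("Outputs", [])
    }


# Usage with boto3 (left as a comment so the sketch has no third-party dependency):
#   import boto3
#   cfn = boto3.client("cloudformation")
#   print(physical_id(cfn, "example-plugin-stack", "TestWorkflow"))
#   print(stack_outputs(cfn, "example-plugin-stack"))
```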

Creating test harnesses for event driven architectures

To write integration tests for event driven architectures, establish logical boundaries by breaking your application into subsystems. Your subsystems should be simple enough to reason about, and contain understandable inputs and outputs. One useful technique for testing subsystems is to create test harnesses. Test harnesses are resources that you create specifically for testing subsystems.

For example, an integration test can begin a subsystem process by passing an input test event to it. IATK can create a test harness for you that listens to Amazon EventBridge for output events. (Under the hood, the harness is composed of an EventBridge Rule that forwards the output event to Amazon Simple Queue Service.) Your integration test then queries the test harness to examine the output and determine if the test passes or fails. These harnesses help you create integration tests in the cloud for event driven architectures.

Establishing service level agreements to test asynchronous features

If you write a synchronous service, your automated tests make requests and expect immediate responses. When your architecture is asynchronous, your service accepts a request and then performs a set of actions at a later time. How can you test for the success of an activity if it does not have a specified duration?

Consider creating reasonable timeouts for your asynchronous systems. Document timeouts as service level agreements (SLAs). You may decide to publish your SLAs externally or to document them as internal standards. IATK contains a polling feature that allows you to establish timeouts. This feature helps you to test that your asynchronous systems complete tasks in a timely manner.
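
The idea behind polling with a timeout can be sketched in a few lines of Python. This is a generic helper rather than the IATK polling feature; the fetch and is_done callables are hypothetical stand-ins for, say, reading a test harness and checking for the expected event.

```python
import time


def poll_until(fetch, is_done, timeout_seconds, interval_seconds=1.0):
    """Call fetch() repeatedly until is_done(result) is true or the timeout elapses.

    Returns the first result that satisfies is_done; raises TimeoutError otherwise.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        result = fetch()
        if is_done(result):
            return result
        # Stop if another wait would take us past the deadline.
        if time.monotonic() + interval_seconds > deadline:
            raise TimeoutError(f"condition not met within {timeout_seconds} seconds")
        time.sleep(interval_seconds)


# Example: the third poll returns the event the test is waiting for.
responses = iter([None, None, {"status": "COMPLETE"}])
event = poll_until(
    fetch=lambda: next(responses),
    is_done=lambda result: result is not None,
    timeout_seconds=2,
    interval_seconds=0.01,
)
print(event)
```

A test would set timeout_seconds to the documented SLA, so that a TimeoutError fails the test when the asynchronous system is too slow.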

Using AWS X-Ray for detailed testing

If you want to gain more visibility into the interior details of your application, instrument with AWS X-Ray. With AWS X-Ray, you trace the path of an event through multiple services. IATK provides conveniences that help you set the AWS X-Ray sampling rate, get trace trees, and assert for trace durations. These features help you observe and test your distributed systems in greater detail.

Learn more about testing asynchronous architectures at aws-samples/serverless-test-samples.

Overview of the example application

To demonstrate the features of IATK, this post uses a portion of a serverless video application designed with a plugin architecture. A core development team creates the primary application. Distributed development teams throughout the organization create the plugins. One AWS CloudFormation stack deploys the primary application. Separate stacks deploy each plugin.

Communications between the primary application and the plugins are managed by an EventBridge bus. Plugins pull application lifecycle events off the bus and must put completion notification events back on the bus within 20 seconds. For testing, the core team has created an AWS Step Functions workflow that mimics the production process by emitting properly formatted example lifecycle events. Developers run this test workflow in development and test environments to verify that their plugins are communicating properly with the event bus.

The following demonstration shows an integration test for the example application that validates plugin behavior. In the integration test, IATK locates the Step Functions workflow. It creates a test harness to listen for the event completion notification to be sent by the plugin. The test then runs the workflow to begin the lifecycle process and start plugin actions. Then IATK uses a polling mechanism with a timeout to verify that the plugin complies with the 20 second service level agreement. This is the sequence of processing:

  1. The integration test starts an execution of the test workflow.
  2. The workflow puts a lifecycle event onto the bus.
  3. The plugin pulls the lifecycle event from the bus.
  4. When the plugin is complete, it puts a completion event onto the bus.
  5. The integration test polls for the completion event to determine if the test passes within the SLA.

Deploying and testing the example application

Follow these steps to review this application, build it locally, deploy it in your AWS account, and test it.

Downloading the example application

  1. Open your terminal and clone the example application from GitHub with the following command or download the code. This repository also includes other example patterns for testing serverless applications.
    git clone https://github.com/aws-samples/serverless-test-samples
  2. The root of the IATK example application is in python-test-samples/integrated-application-test-kit. Change to this directory:
    cd serverless-test-samples/python-test-samples/integrated-application-test-kit

Reviewing the integration test

Before deploying the application, review how the integration test uses the IATK by opening plugins/2-postvalidate-plugins/python-minimal-plugin/tests/integration/test_by_polling.py in your text editor. The test class instantiates the IATK at the top of the file.

iatk_client = aws_iatk.AwsIatk(region=aws_region)

In the setUp() method, the test class uses IATK to fetch CloudFormation stack outputs. These outputs are references to deployed cloud components like the plugin tester AWS Step Functions workflow:

stack_outputs = self.iatk_client.get_stack_outputs(

The test class attaches a listener to the default event bus using an Event Rule provided in the stack outputs. The test uses this listener later to poll for events.

add_listener_output = self.iatk_client.add_listener(

The test class cleans up the listener in the tearDown() method.


Once the configurations are complete, the method test_minimal_plugin_event_published_polling() implements the actual test.

The test first initializes the trigger event.

trigger_event = {
    "eventHook": "postValidate",
    "pluginTitle": "PythonMinimalPlugin"
}

Next, the test starts an execution of the plugin tester Step Functions workflow. It uses the plugin_tester_arn that was fetched during setUp.


The test polls the listener, waiting for the plugin to emit events. It stops polling once it hits the SLA timeout or receives the maximum number of messages.

poll_output = self.iatk_client.poll_events(

Finally, the test asserts that it receives the right number of events, and that they are well-formed.

self.assertEqual(len(poll_output.events), 1)
self.assertEqual(received_event["source"], "video.plugin.PythonMinimalPlugin")
self.assertEqual(received_event["detail-type"], "plugin-complete")

Installing prerequisites

You need the following prerequisites to build this example:

Build and deploy the example application components

  1. Use AWS SAM to build and deploy the plugin tester to your AWS account. The plugin tester is the Step Functions workflow shown in the preceding diagram. During the build process, you can add the --use-container flag to the build command to instruct AWS SAM to create the application in a provided container. You can accept or override the default values during the deploy process. You will use “Stack Name” and “AWS Region” later to run the integration test.
    cd plugins/plugin_tester # Move to the plugin tester directory
    sam build --use-container # Build the plugin tester

  2. Deploy the tester:
    sam deploy --guided # Deploy the plugin tester

  3. Once the plugin tester is deployed, use AWS SAM to deploy the plugin.
    cd ../2-postvalidate-plugins/python-minimal-plugin # Move to the plugin directory
    sam build --use-container # Build the plugin

  4. Deploy the plugin:
    sam deploy --guided # Deploy the plugin

Running the test

You can run tests written with IATK using standard Python test runners like unittest and pytest. The example application test uses unittest.

    1. Use a virtual environment to organize your dependencies. From the root of the example application, run:
      python3 -m venv .venv # Create the virtual environment
      source .venv/bin/activate # Activate the virtual environment
    2. Install the dependencies, including the IATK:
      cd tests 
      pip3 install -r requirements.txt
    3. Run the test, providing the required environment variables from the earlier deployments. You can find the correct values in the samconfig.toml file of the plugin_tester directory.

      cd integration
      PLUGIN_TESTER_STACK_NAME=video-plugin-tester \
      AWS_REGION=us-west-2 \
      python3 -m unittest ./test_by_polling.py

You should see output as unittest runs the test.

Open the Step Functions console in your AWS account, then choose the PluginLifecycleWorkflow-<random value> workflow to validate that the plugin tester successfully ran. A recent execution shows a Succeeded status:

Recent execution status

Review other IATK features

The example application includes examples of other IATK features like generating mock events and retrieving AWS X-Ray traces.

Cleaning up

Use AWS SAM to clean up both the plugin and the plugin tester resources from your AWS account.

  1. Delete the plugin resources:
    cd ../.. # Move to the plugin directory
    sam delete # Delete the plugin

  2. Delete the plugin tester resources:
    cd ../../plugin_tester # Move to the plugin tester directory
    sam delete # Delete the plugin tester

The temporary test harness resources that IATK created during the test are cleaned up when the tearDown method runs. If there are problems during teardown, some resources may not be deleted. IATK adds tags to all resources that it creates. You can use these tags to locate the resources and then remove them manually. You can also add your own tags.


The AWS Integrated Application Test Kit is a software library that provides conveniences to help you write automated tests for your cloud applications. This blog post shows some of the features of the initial Python version of the IATK.

To learn more about automated testing for serverless applications, visit serverlessland.com/testing. You can also view code examples at serverlessland.com/testing/patterns or at the AWS serverless-test-samples repository on GitHub.

For more serverless learning resources, visit Serverless Land.

Introducing AWS Step Functions redrive to recover from failures more easily

Post Syndicated from Benjamin Smith original https://aws.amazon.com/blogs/compute/introducing-aws-step-functions-redrive-a-new-way-to-restart-workflows/

Developers use AWS Step Functions, a visual workflow service, to build distributed applications, automate IT and business processes, and orchestrate AWS services with minimal code.

Step Functions redrive for Standard Workflows allows you to redrive a failed workflow execution from its point of failure, rather than having to restart the entire workflow. This blog post explains how to use the new redrive feature to skip unnecessary workflow steps and reduce the cost of redriving failed workflows.

Handling workflow errors

Any workflow state can encounter runtime errors. Errors happen for various reasons, including state machine definition issues, task failures, incorrect permissions, and exceptions from downstream services. By default, when a state reports an error, Step Functions causes the workflow execution to fail. Step Functions allows you to handle errors by retrying, catching, and falling back to a defined state.

Now, you can also redrive the workflow from the failed state, skipping the successful prior workflow steps. This results in faster workflow completion and lower costs. You can only redrive a failed workflow execution from the step where it failed using the same input as the last non-successful state. You cannot redrive a failed workflow execution using a state machine definition that is different from the initial workflow execution.

Choosing between retry and redrive

Use the retry mechanism for transient issues such as network connectivity problems or momentary service unavailability. You can configure the number of retries, along with intervals and back-off rates, providing the workflow with multiple attempts to complete a task successfully.

In scenarios where the underlying cause of an error requires longer investigation or resolution time, redrive becomes a valuable tool. Consider a situation where a downstream service experiences extended downtime or manual intervention is needed, such as updating a database or making code changes to a Lambda function. In these cases, being able to redrive the workflow can give you time to address the root cause before resuming the workflow execution.

Combining retry and redrive

Adopt a hybrid strategy that combines retry and redrive mechanisms:

  1. Retry mechanism: Configure an initial set of retries for automatically resolvable errors. This ensures that transient issues are promptly addressed, and the workflow proceeds without unnecessary delays.
  2. Error catching and redrive: If the retries are exhausted without success, allow the state to fail and use the redrive feature to restart the workflow from the last non-successful state. This approach allows for intervention where errors persist or require external actions.
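
As a sketch of the first step, the following builds a Retry block for a Task state and computes the wait before each retry. The error names and numeric values are illustrative assumptions, not taken from the example workflow.

```python
import json

# Hypothetical retry policy for a Task state; values are illustrative.
retry_policy = {
    "ErrorEquals": ["States.Timeout", "States.TaskFailed"],
    "IntervalSeconds": 2,
    "MaxAttempts": 3,
    "BackoffRate": 2.0,
}


def retry_delays(policy):
    """Seconds waited before each retry: IntervalSeconds * BackoffRate ** n."""
    return [
        policy["IntervalSeconds"] * policy["BackoffRate"] ** attempt
        for attempt in range(policy["MaxAttempts"])
    ]


# The fragment that would sit inside the Task state definition.
print(json.dumps({"Retry": [retry_policy]}, indent=2))
print(retry_delays(retry_policy))
```

If all three attempts fail and no Catch block handles the error, the state fails and the execution becomes a candidate for redrive.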

Reducing costs

AWS charges for Standard Workflows based on the number of state transitions required to run a workload. Step Functions counts a state transition each time a step of your workflow runs. Step Functions charges for the total number of state transitions across state machines, including retries. The cost is $0.025 per 1,000 state transitions. This means that reducing the number of state transitions reduces the cost of running your Standard Workflows.

If a workflow has many steps, includes parallel or map states, or is prone to errors that require frequent re-runs, this new feature reduces the costs incurred. You pay only for the state transitions that run during the redrive, plus the cost of any downstream services invoked as part of the re-run.
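
A rough worked example of the saving, using the rate quoted above. The workload numbers are hypothetical, and the model is simplified to one transition per state run, so treat the result as an order-of-magnitude estimate rather than a billing calculation.

```python
PRICE_PER_1000_TRANSITIONS = 0.025  # USD, the rate quoted above


def transition_cost(transitions):
    """Cost in USD for a number of Standard Workflow state transitions."""
    return transitions / 1000 * PRICE_PER_1000_TRANSITIONS


# Hypothetical workload: 12 states per execution, failing at state 10,
# across 100,000 failed executions.
total_states = 12
failed_at = 10
executions = 100_000

# Simplification: one transition per state run.
restart_transitions = executions * total_states  # re-run every state
redrive_transitions = executions * (total_states - failed_at + 1)  # failed state onward

print(f"restart: ${transition_cost(restart_transitions):.2f}")
print(f"redrive: ${transition_cost(redrive_transitions):.2f}")
```

Under these assumptions, recovering by restarting costs about $30.00 in transitions, while recovering by redrive costs about $7.50.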

The following example explains the cost implications of retrying a workflow that has failed, with and without redrive. In this example, a Step Functions workflow orchestrates Amazon Transcribe to generate a text transcription from an .mp4 file.

Since the failed state occurs towards the end of this workflow, the redrive execution does not run the successful states, reducing the overall successful completion time. If this workflow were to fail regularly, the reduction in transitions and execution duration becomes increasingly valuable.

The first time this workflow runs, the final state, which uses an AWS Lambda function to make an HTTP request, fails with an IAM error. This is because the workflow does not have the required permissions to invoke the Lambda function. After granting the required permissions to the workflow’s execution role, redrive to continue the workflow from the failed state.

After the redrive, the Step Functions workflow reports a different failure. This time it is related to the configuration of the Lambda function. This is an example of a downstream failure that does not require an update to the workflow definition.

After resolving the Lambda configuration issue and redriving the workflow, the execution completes successfully. The following image shows the execution details, including the number of redrives, the total state transitions, and the last redrive time:

Getting started with redrive

Redrive works for Standard Workflows only. You can redrive a workflow from its failed step programmatically, via the AWS CLI or AWS SDK, or using the Step Functions console, which provides a visual operator experience:

  1. From the Step Functions console, select the failed workflow you want to redrive, and choose Redrive.
  2. A modal appears with the execution details. Choose Redrive execution.

The state to redrive from, the workflow definition, and the previous input are immutable.

To redrive a workflow execution programmatically from its point of failure, call the new Redrive Execution API action. The same workflow execution starts from the last non-successful state and uses the same input as the last non-successful state from the initial failed workflow execution.
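
In Python, the call can be sketched as follows. This assumes a boto3 Step Functions client is passed in as sfn_client; the wrapper function itself is hypothetical.

```python
def redrive(sfn_client, execution_arn):
    """Redrive a failed Standard Workflow execution from its point of failure."""
    response = sfn_client.redrive_execution(executionArn=execution_arn)
    # The response reports when the execution was redriven.
    return response["redriveDate"]
```

With boto3 installed, boto3.client("stepfunctions") can be passed as sfn_client, along with the ARN of the failed execution.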

Programmatically catching failed workflow executions to redrive

Step Functions can process workloads autonomously, without the need for human interaction, or can include intervention from a user by implementing the .waitForTaskToken pattern.

Redrive is for unhandled and unexpected errors only. Handling errors within a workflow using the built-in mechanisms for catch, retry, and routing to a Fail state does not permit the workflow to redrive. However, it is possible to detect in near real-time when a workflow has failed, and programmatically redrive. When a workflow fails, it emits an event onto the Amazon EventBridge default event bus. The event looks like the following JSON object:

There are four new key-value pairs in this event:

"redriveCount": 0, 
"redriveDate": null, 
"redriveStatus": "REDRIVABLE", 
"redriveStatusReason": null,

The redrive count shows how many times the workflow has previously been redriven. The redrive status shows if the failed workflow is eligible for redrive execution.

To programmatically redrive the workflow from the failed state, create a rule that pattern matches this event, and route the event to a target service that handles the error. The target service uses the new States.RedriveExecution API to redrive the workflow.
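
One way the target service could decide whether to redrive is sketched below. It is a minimal illustration, assuming the target receives the EventBridge event as a dictionary and is given a boto3 Step Functions client. The function names and the MAX_REDRIVES limit are hypothetical.

```python
MAX_REDRIVES = 3  # hypothetical safety limit, not a Step Functions setting


def should_redrive(event, max_redrives=MAX_REDRIVES):
    """Decide whether a status-change event describes a redrivable failure."""
    detail = event.get("detail", {})
    return (
        event.get("detail-type") == "Step Functions Execution Status Change"
        and detail.get("status") == "FAILED"
        and detail.get("redriveStatus") == "REDRIVABLE"
        and detail.get("redriveCount", 0) < max_redrives
    )


def handle_event(event, sfn_client):
    """Redrive the failed execution named in the event, if it is eligible."""
    if not should_redrive(event):
        return False
    sfn_client.redrive_execution(executionArn=event["detail"]["executionArn"])
    return True
```

Capping the redrive count keeps a persistent failure from being redriven indefinitely. In practice, the root cause would be fixed before handle_event is called.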

Download and deploy the previous pattern from this example on serverlessland.com.

In the following example, the first state sends a POST request to an API endpoint. If the request fails due to network connectivity or latency issues, the state retries. If the retry fails, then Step Functions emits a Step Functions Execution Status Change event onto the EventBridge default event bus. An EventBridge rule routes this event to a service where you can rectify this error and then redrive the task using the Step Functions API.

The new redrive feature also supports the distributed map state.

Redrive for express child workflow executions

For failed child workflow executions that are Express Workflows within a Distributed Map, the redrive capability ensures a seamless restart from the beginning of the child workflow. This allows you to resolve issues that are specific to individual iterations without restarting the entire map.

Redrive for standard child workflow executions

For failed child workflow executions within a Distributed Map that are Standard Workflows, the redrive feature functions in the same way as in standalone Standard Workflows. You can restart the failed iteration from its point of failure, skipping unnecessary steps that have already successfully executed.


Step Functions redrive for Standard Workflows allows you to redrive a failed workflow execution from its point of failure rather than having to restart the entire workflow. This results in faster workflow completion and lower costs for processing failed executions. This is because it minimizes the number of state transitions and downstream service invocations.

Visit the Serverless Workflows Collection to browse the many deployable workflows to help build your serverless applications.

The serverless attendee’s guide to AWS re:Invent 2023

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/compute/the-serverless-attendees-guide-to-aws-reinvent-2023/

AWS re:Invent 2023 is fast approaching, bringing together tens of thousands of Builders in Las Vegas in November. However, even if you can’t attend in person, you can catch up with sessions on-demand.

Breakout sessions are lecture-style 60-minute informative sessions presented by AWS experts, customers, or partners. These sessions cover beginner (100 level) topics to advanced and expert (300–400 level) topics. The sessions are recorded and uploaded a few days after to the AWS Events YouTube channel.

This post shares the “must watch” breakout sessions related to serverless architectures and services.

Sessions related to serverless architecture


SVS401 | Best practices for serverless developers
Provides architectural best practices, optimizations, and useful shortcuts that experts use to build secure, high-scale, and high-performance serverless applications.

Chris Munns, Startup Tech Leader, AWS
Julian Wood, Principal Developer Advocate, AWS

SVS305 | Refactoring to serverless
Shows how you can refactor your application to serverless with real-life examples.

Gregor Hohpe, Senior Principal Evangelist, AWS
Sindhu Pillai, Senior Solutions Architect, AWS

SVS308 | Building low-latency, event-driven applications
Explores building serverless web applications for low-latency and event-driven support. Marvel Snap shares how they achieve low latency in their games using serverless technology.

Marcia Villalba, Principal Developer Advocate, AWS
Brenna Moore, Second Dinner

SVS309 | Improve productivity by shifting more responsibility to developers
Learn about approaches to accelerate serverless development with faster feedback cycles, exploring best practices and tools. Watch a live demo featuring an improved developer experience for building serverless applications while complying with enterprise governance requirements.

Heeki Park, Principal Solutions Architect, AWS
Sam Dengler, Capital One

GBL203-ES | Building serverless-first applications with MAPFRE
This session is delivered in Spanish. Learn what modern, serverless-first applications are and how to implement them with services such as AWS Lambda or AWS Fargate. Find out how MAPFRE have adopted and implemented a serverless strategy.

Jesus Bernal, Senior Solutions Architect, AWS
Iñigo Lacave, MAPFRE
Mat Jovanovic, MAPFRE

Sessions related to AWS Lambda


BOA311 | Unlocking serverless web applications with AWS Lambda Web Adapter
Learn about the AWS Lambda Web Adapter and how it integrates with familiar frameworks and tools. Find out how to migrate existing web applications to serverless or create new applications using AWS Lambda.

Betty Zheng, Senior Developer Advocate, AWS
Harold Sun, Senior Solutions Architect, AWS

OPN305 | The pragmatic serverless Python developer
Covers an opinionated approach to setting up a serverless Python project, including testing, profiling, deployments, and operations. Learn about many open source tools, including Powertools for AWS Lambda—a toolkit that can help you implement serverless best practices and increase developer velocity.

Heitor Lessa, Principal Solutions Architect, AWS
Ran Isenberg, CyberArk

XNT301 | Build production-ready serverless .NET apps with AWS Lambda
Explores development and architectural best practices when building serverless applications with .NET and AWS Lambda, including when to run ASP.NET on Lambda, code structure, and using native AOT to massively increase performance.

James Eastham, Senior Cloud Architect, AWS
Craig Bossie, Solutions Architect, AWS

COM306 | “Rustifying” serverless: Boost AWS Lambda performance with Rust
Discover how to deploy Rust functions using AWS SAM and cargo-lambda, facilitating a smooth development process from your local machine. Explore how to integrate Rust into Python Lambda functions effortlessly using tools like PyO3 and maturin, along with the AWS SDK for Rust. Uncover how Rust can optimize Lambda functions, including the development of Lambda extensions, all without requiring a complete rewrite of your existing code base.

Efi Merdler-Kravitz, Cloudex

COM305 | Demystifying and mitigating AWS Lambda cold starts
Examines the Lambda initialization process at a low level, using benchmarks comparing common architectural patterns, and then benchmarking various RAM configurations and payload sizes. Next, measure and discuss common mistakes that can increase initialization latency, explore and understand proactive initialization, and learn several strategies you can use to thaw your AWS Lambda cold starts.

AJ Stuyvenberg, Datadog

Sessions related to event-driven architecture


API302 | Building next gen applications with event driven architecture
Learn about common integration patterns and discover how you can use AWS messaging services to connect microservices and coordinate data flow using minimal custom code. Learn and plan for idempotency, handling duplicate events, and building resiliency into your architectures.

Eric Johnson, Principal Developer Advocate, AWS

API303 | Navigating the journey of serverless event-driven architecture
Learn about the journey businesses undertake when adopting EDAs, from initial design and implementation to ongoing operation and maintenance. The session highlights the many benefits EDAs can offer organizations and focuses on areas of EDA that are challenging and often overlooked. Through a combination of patterns, best practices, and practical tips, this session provides a comprehensive overview of the opportunities and challenges of implementing EDAs and helps you understand how you can use them to drive business success.

David Boyne, Senior Developer Advocate, AWS

API309 | Advanced integration patterns and trade-offs for loosely coupled apps
In this session, learn about common design trade-offs for distributed systems, how to navigate them with design patterns, and how to embed those patterns in your cloud automation.

Dirk Fröhner, Principal Solutions Architect, AWS
Gregor Hohpe, Senior Principal Evangelist, AWS

SVS205 | Getting started building serverless event-driven applications
Learn about the process of prototyping a solution from concept to a fully featured application that uses Amazon API Gateway, AWS Lambda, Amazon EventBridge, AWS Step Functions, Amazon DynamoDB, AWS Application Composer, and more. Learn why serverless is a great tool set for experimenting with new ideas and how the extensibility and modularity of serverless applications allow you to start small and quickly make your idea a reality.

Emily Shea, Head of Application Integration Go-to-Market, AWS
Naren Gakka, Solutions Architect, AWS

API206 | Bringing workloads together with event-driven architecture
Attend this session to learn the steps to bring your existing container workloads closer together using event-driven architecture with minimal code changes and a high degree of reusability. Using a real-life business example, this session walks through a demo to highlight the power of this approach.

Dhiraj Mahapatro, Principal Solutions Architect, AWS
Nicholas Stumpos, JPMorgan Chase & Co

COM301 | Advanced event-driven patterns with Amazon EventBridge
Gain an understanding of the characteristics of EventBridge and how it plays a pivotal role in serverless architectures. Learn the primary elements of event-driven architecture and some of the best practices. With real-world use cases, explore how the features of EventBridge support implementing advanced architectural patterns in serverless.

Sheen Brisals, The LEGO Group

Sessions related to serverless APIs


SVS301 | Building APIs: Choosing the best API solution and strategy for your workloads
Learn about access patterns and how to evaluate the best API technology for your applications. The session considers the features and benefits of Amazon API Gateway, AWS AppSync, Amazon VPC Lattice, and other options.

Josh Kahn, Tech Leader Serverless, AWS
Arthi Jaganathan, Principal Solutions Architect, AWS

SVS323 | I didn’t know Amazon API Gateway did that
This session provides an introduction to Amazon API Gateway and the problems it solves. Learn about the moving parts of API Gateway and how it works, including common and not-so-common use cases. Discover why you should use API Gateway and what it can do.

Eric Johnson, Principal Developer Advocate, AWS

FWM201 | What’s new with AWS AppSync for enterprise API developers
Join this session to learn about all the exciting new AWS AppSync features released this year that make it even more seamless for API developers to realize the benefits of GraphQL for application development.

Michael Liendo, Senior Developer Advocate, AWS
Brice Pellé, Principal Product Manager, AWS

FWM204 | Implement real-time event patterns with WebSockets and AWS AppSync
Learn how the PGA Tour uses AWS AppSync to deliver real-time event updates to their app users; review new features, like enhanced filtering options and native integration with Amazon EventBridge; and provide a sneak peek at what’s coming next.

Ryan Yanchuleff, Senior Solutions Architect, AWS
Bill Fine, Senior Product Manager, AWS
David Provan, PGA Tour

Sessions related to AWS Step Functions


API401 | Advanced workflow patterns and business processes with AWS Step Functions
Learn about architectural best practices and repeatable patterns for building workflows and cost optimizations, and discover handy cheat codes that you can use to build secure, high-scale, high-performance serverless applications.

Ben Smith, Principal Developer Advocate, AWS

BOA304 | Using AI and serverless to automate video production
Learn how to use Step Functions to build workflows using AI services and how to use Amazon EventBridge real-time events.

Marcia Villalba, Principal Developer Advocate, AWS

SVS204 | Building Serverlesspresso: Creating event-driven architectures
This session explores the design decisions that were made when building Serverlesspresso, how new features influenced the development process, and lessons learned when creating a production-ready application using this approach. Explore useful patterns and options for extensibility that helped in the design of a robust, scalable solution that costs about one dollar per day to operate. This session includes examples you can apply to your serverless applications and complex architectural challenges for larger applications.

James Beswick, Senior Manager Developer Advocacy, AWS

API310 | Scale interactive data analysis with Step Functions Distributed Map
Learn how to build a data processing or other automation once and readily scale it to thousands of parallel processes with serverless technologies. Explore how this approach simplifies development and error handling while improving speed and lowering cost. Hear from an AWS customer that refactored an existing machine learning application to use Distributed Map and the lessons they learned along the way.

Adam Wagner, Principal Solutions Architect, AWS
Roberto Iturralde, Vertex Pharmaceuticals

Sessions related to handling data using serverless services and serverless databases


SVS307 | Scaling your serverless data processing with Amazon Kinesis and Kafka
Explore how to build scalable data processing applications using AWS Lambda. Learn practical insights into integrating Lambda with Amazon Kinesis and Apache Kafka using their event-driven models for real-time data streaming and processing.

Julian Wood, Principal Developer Advocate, AWS

DAT410 | Advanced data modeling with Amazon DynamoDB
This session shows you advanced techniques to get the most out of DynamoDB. Learn how to “think in DynamoDB” by learning the DynamoDB foundations and principles for data modeling. Learn practical strategies and DynamoDB features to handle difficult use cases in your application.

Alex DeBrie, Independent consultant

COM308 | Serverless data streaming: Amazon Kinesis Data Streams and AWS Lambda
Explore the intricacies of creating scalable, production-ready data streaming architectures using Kinesis Data Streams and Lambda. Delve into tips and best practices essential to navigating the challenges and pitfalls inherent to distributed systems that arise along the way, and observe how AWS services work and interact.

Anahit Pogosova, Solita

Additional resources

If you are attending the event, there are many chalk talks, workshops, and other sessions to visit. See ServerlessLand for a full list of all the serverless sessions, and see Serverless Hero Danielle Heberling’s Serverless re:Invent attendee guide for her top picks.

Visit us in the AWS Village in the Expo Hall where you can find the Serverless and Containers booth and enjoy a free cup of coffee at Serverlesspresso.

For more serverless learning resources, visit Serverless Land.

Orchestrating dependent file uploads with AWS Step Functions

Post Syndicated from Benjamin Smith original https://aws.amazon.com/blogs/compute/orchestrating-dependent-file-uploads-with-aws-step-functions/

This post is written by Nelson Assis, Enterprise Support Lead, Serverless and Jevon Liburd, Technical Account Manager, Serverless

Amazon S3 is an object storage service that many customers use for file storage. With Amazon S3 Event Notifications or Amazon EventBridge, customers can create workloads with event-driven architecture (EDA). This architecture responds to events produced when changes occur to objects in S3 buckets.

EDA involves asynchronous communication between system components. This serves to decouple the components, allowing each component to be autonomous.

Some scenarios may introduce coupling in the architecture due to dependency between events. This blog post presents a common example of this coupling and how it can be handled using AWS Step Functions.


In this example, an organization has two distributed autonomous teams, the Sales team and the Warehouse team. Each team is responsible for uploading a monthly data file to an S3 bucket so it can be processed.

The files generate events when they are uploaded, initiating downstream processes. The processing of the Warehouse file cleans the data and joins it with data from the Shipping team. The processing of the Sales file correlates the data with the combined Warehouse and Shipping data. This enables analysts to perform forecasting and gain other insights.

For this correlation to happen, the Warehouse file must be processed before the Sales file. As the two teams are autonomous, there is no coordination among the teams. This means that the files can be uploaded at any time with no assurance that the Warehouse file is processed before the Sales file.

For scenarios like these, the Aggregator pattern can be used. The pattern collects and stores the events, and triggers a new event based on the combined events. In the described scenario, the combined events are the processed Warehouse file and the uploaded Sales file.

The requirements of the aggregator pattern are:

  1. Correlation – A way to group the related events. This is fulfilled by a unique identifier in the file name.
  2. Event aggregator – A stateful store for the events.
  3. Completion check and trigger – A condition when the combined events have been received and a way to publish the resulting event.
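The three requirements can be illustrated with a minimal in-memory sketch. It is not the implementation from the code repository: the class name, the event type names, and the use of a Python dictionary in place of a durable store are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class EventAggregator:
    """In-memory stand-in for the stateful event store (hypothetical)."""

    required: frozenset  # event types that must all arrive
    seen: dict = field(default_factory=dict)  # correlation id -> set of event types

    def add(self, correlation_id: str, event_type: str) -> bool:
        """Record an event; return True once every required type has arrived."""
        received = self.seen.setdefault(correlation_id, set())
        received.add(event_type)
        # Completion check: are all required event types present for this id?
        return self.required <= received


aggregator = EventAggregator(
    required=frozenset({"warehouse_processed", "sales_uploaded"})
)
first = aggregator.add("2023_09", "sales_uploaded")        # still waiting
second = aggregator.add("2023_09", "warehouse_processed")  # set is complete
```

Here the correlation identifier groups related events, the dictionary plays the role of the event aggregator, and the return value of `add` is the completion check that a caller would use to publish the resulting event.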

Architecture overview

The architecture uses the following AWS services:

  1. File upload: The Sales and Warehouse teams upload their respective files to S3.
  2. EventBridge: The ObjectCreated event is sent to EventBridge where there is a rule with a target of the main workflow.
  3. Main state machine: This state machine orchestrates the aggregator operations and the processing of the files. It encapsulates the workflows for each file to separate the aggregator logic from the files’ workflow logic.
  4. File parser and correlation: The business logic to identify the file and its type is run in this Lambda function.
  5. Stateful store: A DynamoDB table stores information about the file such as the name, type, and processing status. The state machine reads from and writes to the DynamoDB table. Task tokens are also stored in this table.
  6. File processing: Depending on the file type and any pre-conditions, state machines corresponding to the file type are run. These state machines contain the logic to process the specific file.
  7. Task Token & Callback: The task token is generated when the dependent file arrives for processing before the independent file has been processed. The Step Functions “Wait for a Callback” pattern resumes the execution for the dependent file after the independent file is processed.


You need the following prerequisites:

  • AWS CLI and AWS SAM CLI installed.
  • An AWS account.
  • Sufficient permissions to manage the AWS resources.
  • Git installed.

To deploy the example, follow the instructions in the GitHub repo.

This walkthrough shows what happens if the dependent file (Sales file) is uploaded before the independent one (Warehouse file).

  1. The workflow starts with the uploading of the Sales file to the dedicated Sales S3 bucket. The example uses separate S3 buckets for the two files as it assumes that the Sales and Warehouse teams are distributed and autonomous. You can find sample files in the code repository.
  2. Uploading the file to S3 sends an event to EventBridge, which the aggregator state machine acts on. The event pattern used in the EventBridge rule is:
      {
        "detail-type": ["Object Created"],
        "source": ["aws.s3"],
        "detail": {
          "bucket": {
            "name": ["sales-mfu-eda-09092023", "warehouse-mfu-eda-09092023"]
          },
          "reason": ["PutObject"]
        }
      }
  3. The aggregator state machine starts by invoking the file parser Lambda function. This function parses the file type and uses the identifier to correlate the files. In this example, the name of the file contains the file type and the correlation identifier (the year_month). To use other ways of representing the file type and correlation identifier, you can modify this function to parse that information.
  4. The next step in the state machine inserts a record for the event in the event aggregator DynamoDB table. The table has a composite primary key with the correlation identifier as the partition key and the file type as the sort key. The processing status of the file is tracked to give feedback on the state of the workflow.
  5. Based on the file type, the state machine determines which branch to follow. In the example, the Sales branch is run. The state machine tries to get the status of the (dependent) Warehouse file from DynamoDB using the correlation identifier. Using the result of this query, the state machine determines if the corresponding Warehouse file has already been processed.
  6. Since the Warehouse file is not processed yet, the waitForTaskToken integration pattern is used. The state machine waits at this step and creates a task token, which the external services use to trigger the state machine to continue its execution. The Sales record in the DynamoDB table is updated with the Task Token.
  7. Navigate to the S3 console and upload the sample Warehouse file to the Warehouse S3 bucket. This invokes a new instance of the Step Functions workflow, which flows through the other branch after the file type choice step. In this branch, the Warehouse state machine is run and the processing status of the file is updated in DynamoDB.
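The file parsing in step 3 can be sketched as follows. The naming convention (`<type>_<YYYY>_<MM>.<extension>`), the function name, and the return type are assumptions for illustration. The function in the code repository may parse file names differently.

```python
import re
from typing import NamedTuple


class ParsedFile(NamedTuple):
    file_type: str
    correlation_id: str


# Hypothetical naming convention: <type>_<YYYY>_<MM>.<extension>
_KEY_PATTERN = re.compile(
    r"^(?P<type>sales|warehouse)_(?P<year>\d{4})_(?P<month>0[1-9]|1[0-2])\.\w+$"
)


def parse_object_key(key: str) -> ParsedFile:
    """Extract the file type and year_month correlation identifier from a key."""
    name = key.rsplit("/", 1)[-1]  # ignore any prefix (folder) in the object key
    match = _KEY_PATTERN.match(name.lower())
    if match is None:
        raise ValueError(f"Unrecognized file name: {key!r}")
    return ParsedFile(match["type"], f"{match['year']}_{match['month']}")
```

With this convention, the file type maps to the sort key and the year_month value to the partition key of the table described in step 4.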

When the status of the Warehouse file is changed to “Completed”, the Warehouse state machine checks DynamoDB for a pending Sales file. If there is one, it retrieves the task token and calls the SendTaskSuccess method. This signals the Sales execution, which is paused in a waiting state, to continue. The Sales state machine then runs and the processing status is updated.
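The callback step can be sketched in Python as a function that receives its clients as arguments. This is an illustration only, since the example itself performs these steps inside the state machine. The attribute names (`correlation_id`, `file_type`, `task_token`) and the output payload are hypothetical. The `table` argument is expected to behave like a boto3 DynamoDB Table resource and `sfn_client` like a boto3 Step Functions client.

```python
import json


def resume_pending_sales(table, sfn_client, correlation_id: str) -> bool:
    """Resume a waiting Sales execution if it stored a task token.

    Returns True when a callback was sent, False when nothing was waiting.
    """
    response = table.get_item(
        Key={"correlation_id": correlation_id, "file_type": "sales"}
    )
    item = response.get("Item")
    if not item or "task_token" not in item:
        return False  # no Sales file is waiting for this period

    # SendTaskSuccess tells Step Functions to continue the paused execution.
    sfn_client.send_task_success(
        taskToken=item["task_token"],
        output=json.dumps({"warehouse_status": "Completed"}),
    )
    return True
```

Passing the clients in keeps the function easy to exercise with simple test doubles before wiring it to real resources.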


This blog post shows how to handle file dependencies in event driven architectures. You can customize the sample provided in the code repository for your own use case.

This solution is specific to file dependencies in event driven architectures. For more information on solving event dependencies and aggregators read the blog post: Moving to event-driven architectures with serverless event aggregators.

To learn more about event driven architectures, visit the event driven architecture section on Serverless Land.

Sending and receiving webhooks on AWS: Innovate with event notifications

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/sending-and-receiving-webhooks-on-aws-innovate-with-event-notifications/

This post is written by Daniel Wirjo, Solutions Architect, and Justin Plock, Principal Solutions Architect.

Commonly known as reverse APIs or push APIs, webhooks provide a way for applications to integrate with each other and communicate in near real time. They enable integration for business and system events.

Whether you’re building a software as a service (SaaS) application integrating with your customer workflows, or transaction notifications from a vendor, webhooks play a critical role in unlocking innovation, enhancing user experience, and streamlining operations.

This post explains how to build with webhooks on AWS and covers two scenarios:

  • Webhooks Provider: A SaaS application that sends webhooks to an external API.
  • Webhooks Consumer: An API that receives webhooks with capacity to handle large payloads.

It includes high-level reference architectures with considerations, best practices, and a code sample to guide your implementation.

Sending webhooks

To send webhooks, you generate events, and deliver them to third-party APIs. These events facilitate updates, workflows, and actions in the third-party system. For example, a payments platform (provider) can send notifications for payment statuses, allowing ecommerce stores (consumers) to ship goods upon confirmation.

AWS reference architecture for a webhook provider

The architecture consists of two services:

  • Webhook delivery: An application that delivers webhooks to an external endpoint specified by the consumer.
  • Subscription management: A management API enabling the consumer to manage their configuration, including the endpoints for delivery and the events to subscribe to.

Considerations and best practices for sending webhooks

When building an application to send webhooks, consider the following factors:

Event generation: Consider how you generate events. This example uses Amazon DynamoDB as the data source. Events are generated by change data capture for DynamoDB Streams and sent to Amazon EventBridge Pipes. You then simplify the DynamoDB response format by using an input transformer.

With EventBridge, you send events in near real time. If events are not time-sensitive, you can send multiple events in a batch. This can be done by polling for new events at a specified frequency using EventBridge Scheduler. To generate events from other data sources, consider similar approaches with Amazon Simple Storage Service (S3) Event Notifications or Amazon Kinesis.

Filtering: EventBridge Pipes support filtering by matching event patterns, before the event is routed to the target destination. For example, you can filter for events in relation to status update operations in the payments DynamoDB table to the relevant subscriber API endpoint.

Delivery: EventBridge API Destinations deliver events outside of AWS using REST API calls. To protect the external endpoint from surges in traffic, you set an invocation rate limit. In addition, retries with exponential backoff are handled automatically depending on the error. An Amazon Simple Queue Service (SQS) dead-letter queue retains messages that cannot be delivered. These can provide scalable and resilient delivery.

Payload Structure: Consider how consumers process event payloads. This example uses an input transformer to create a structured payload, aligned to the CloudEvents specification. CloudEvents provides an industry standard format and common payload structure, with developer tools and SDKs for consumers.
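As an illustration of a structured payload, the following sketch wraps a payload in a CloudEvents 1.0 envelope in plain Python. It is not the input transformer used in the example. The event type, source, and payload fields are hypothetical values.

```python
import json
import uuid
from datetime import datetime, timezone


def to_cloudevent(event_type: str, source: str, data: dict) -> dict:
    """Wrap a payload in a CloudEvents 1.0 structured-mode envelope."""
    return {
        "specversion": "1.0",     # required
        "id": str(uuid.uuid4()),  # required, unique per source
        "source": source,         # required, identifies the producer
        "type": event_type,       # required
        "time": datetime.now(timezone.utc).isoformat(),  # optional timestamp
        "datacontenttype": "application/json",           # optional
        "data": data,
    }


event = to_cloudevent(
    "com.example.payment.status_changed",  # hypothetical event type
    "/payments",                           # hypothetical source
    {"paymentId": "pay_123", "status": "CONFIRMED"},
)
body = json.dumps(event)
```

Consumers can then rely on the same top-level attributes for every event type and only need to inspect `data` for event-specific details.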

Payload Size: For fast and reliable delivery, keep payload size to a minimum. Consider delivering only necessary details, such as identifiers and status. For additional information, you can provide consumers with a separate API. Consumers can then separately call this API to retrieve the additional information.

Security and Authorization: To deliver events securely, you establish a connection using an authorization method such as OAuth. Under the hood, the connection stores the credentials in AWS Secrets Manager, which securely encrypts the credentials.

Subscription Management: Consider how consumers can manage their subscription, such as specifying HTTPS endpoints and event types to subscribe. DynamoDB stores this configuration. Amazon API Gateway, Amazon Cognito, and AWS Lambda provide a management API for operations.

Costs: In practice, sending webhooks incurs cost, which may become significant as you grow and generate more events. Consider implementing usage policies, quotas, and allowing consumers to subscribe only to the event types that they need.

Monetization: Consider billing consumers based on their usage volume or tier. For example, you can offer a free tier to provide low-friction access to webhooks, but only up to a certain volume. For additional volume, you charge a usage fee that is aligned to the business value that your webhooks provide. At high volumes, you offer a premium tier where you provide dedicated infrastructure for certain consumers.

Monitoring and troubleshooting: Beyond the architecture, consider processes for day-to-day operations. As endpoints are managed by external parties, consider enabling self-service. For example, allow consumers to view statuses, replay events, and search for past webhook logs to diagnose issues.

Advanced Scenarios: This example is designed for popular use cases. For advanced scenarios, consider alternative application integration services noting their Service Quotas. For example, Amazon Simple Notification Service (SNS) for fan-out to a larger number of consumers, Lambda for flexibility to customize payloads and authentication, and AWS Step Functions for orchestrating a circuit breaker pattern to deactivate unreliable subscribers.

Receiving webhooks

To receive webhooks, you require an API to provide to the webhook provider. For example, an ecommerce store (consumer) may rely on notifications provided by their payment platform (provider) to ensure that goods are shipped in a timely manner. Webhooks present a unique scenario as the consumer must be scalable, resilient, and ensure that all requests are received.

AWS reference architecture for a webhook consumer

In this scenario, consider an advanced use case that can handle large payloads by using the claim-check pattern.

At a high-level, the architecture consists of:

  • API: An API endpoint to receive webhooks. An event-driven system then authorizes and processes the received webhooks.
  • Payload Store: S3 provides scalable storage for large payloads.
  • Webhook Processing: EventBridge Pipes provide an extensible architecture for processing. It can batch, filter, enrich, and send events to a range of processing services as targets.

Considerations and best practices for receiving webhooks

When building an application to receive webhooks, consider the following factors:

Scalability: Providers typically send events as they occur. API Gateway provides a scalable managed endpoint to receive events. If the endpoint is unavailable or throttled, providers may retry the request; however, this is not guaranteed. Therefore, it is important to configure appropriate rate and burst limits. Throttling requests at the entry point mitigates impact on downstream services, where each service has its own quotas and limits. In many cases, providers are also aware of impact on downstream systems. As such, they send events at a threshold rate limit, typically up to 500 transactions per second (TPS).

In addition, API Gateway allows you to validate requests, monitor for any errors, and protect against distributed denial of service (DDoS). This includes Layer 7 and Layer 3 attacks, which are common threats to webhook consumers given public exposure.

Authorization and Verification: Providers can support different authorization methods. Consider a common scenario with Hash-based Message Authentication Code (HMAC), where a shared secret is established and stored in Secrets Manager. A Lambda function then verifies integrity of the message, processing a signature in the request header. Typically, the signature contains a timestamped nonce with an expiry to mitigate replay attacks, where events are sent multiple times by an attacker. Alternatively, if the provider supports OAuth, consider securing the API with Amazon Cognito.
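A minimal sketch of HMAC verification is shown below. The signing scheme (HMAC-SHA256 over `<timestamp>.<body>`) and the five-minute tolerance are assumptions, because each provider documents its own header format and signing rules. The sketch rejects stale timestamps but does not track nonces.

```python
import hashlib
import hmac
import time

TOLERANCE_SECONDS = 300  # assumed replay window


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """Compute the hex HMAC-SHA256 over '<timestamp>.<body>'."""
    message = str(timestamp).encode() + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, timestamp: int, body: bytes, signature: str, now=None) -> bool:
    """Return True if the signature matches and the timestamp is recent."""
    now = time.time() if now is None else now
    if abs(now - timestamp) > TOLERANCE_SECONDS:
        return False  # too old or too far in the future: possible replay
    expected = sign(secret, timestamp, body)
    # Constant-time comparison avoids leaking information through timing.
    return hmac.compare_digest(expected.encode(), signature.encode())
```

In a Lambda function, the secret would be read from Secrets Manager and the timestamp and signature from the request headers named by the provider.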

Payload Size: Providers may send a variety of payload sizes. Events can be batched into a single larger request, or they may contain significant information. Consider payload size limits in your event-driven system. API Gateway and Lambda have limits of 10 MB and 6 MB. However, DynamoDB and SQS are limited to 400 KB and 256 KB (with extension for large messages), which can represent a bottleneck.

Instead of passing the entire payload through the event-driven system, the architecture stores the payload in S3. The payload is then referenced in DynamoDB by its bucket name and object key. This is known as the claim-check pattern. With this approach, the architecture supports payloads of up to 6 MB, as per the Lambda invocation payload quota.
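The claim-check pattern can be sketched with in-memory stand-ins, where one dictionary plays the role of S3 and another the role of DynamoDB. The class, method, and key names are hypothetical and are not taken from the code sample.

```python
import json
import uuid


class ClaimCheckStore:
    """In-memory stand-in: `objects` plays the role of S3, `records` of DynamoDB."""

    def __init__(self, bucket: str):
        self.bucket = bucket
        self.objects = {}  # object key -> raw payload bytes
        self.records = {}  # event id -> small reference record

    def check_in(self, event_id: str, payload: bytes) -> dict:
        """Store the payload and keep only a reference to it."""
        key = f"webhooks/{event_id}-{uuid.uuid4().hex}.json"
        self.objects[key] = payload
        record = {
            "event_id": event_id,
            "bucket": self.bucket,
            "key": key,
            "size": len(payload),
        }
        self.records[event_id] = record
        return record

    def redeem(self, event_id: str) -> bytes:
        """Fetch the full payload using the stored reference."""
        return self.objects[self.records[event_id]["key"]]


store = ClaimCheckStore("example-webhook-payloads")  # hypothetical bucket name
payload = json.dumps({"items": list(range(1000))}).encode()
record = store.check_in("evt-1", payload)
```

Only the small `record` travels through the rest of the system. A processor that needs the full payload calls `redeem` with the event identifier.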

Idempotency: For reliability, many providers prioritize delivering at-least-once, even if it means not guaranteeing exactly once delivery. They can transmit the same request multiple times, resulting in duplicates. To handle this, a Lambda function checks the event’s unique identifier against previous records in DynamoDB. If the event has not already been processed, you create a DynamoDB item.
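The duplicate check can be sketched as follows, using an in-memory set in place of the DynamoDB table. The class and function names are hypothetical.

```python
class ProcessedEvents:
    """In-memory stand-in for the table of processed event identifiers."""

    def __init__(self):
        self._seen = set()

    def claim(self, event_id: str) -> bool:
        """Return True the first time an identifier is seen, False for duplicates."""
        if event_id in self._seen:
            return False
        self._seen.add(event_id)
        return True


def handle(event: dict, store: ProcessedEvents, process) -> str:
    """Process an event once; acknowledge duplicates without reprocessing."""
    if not store.claim(event["id"]):
        return "duplicate"
    process(event)
    return "processed"
```

With DynamoDB, the check and the write should be a single conditional write (for example, a condition expression using `attribute_not_exists`) so that two concurrent deliveries of the same event cannot both succeed.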

Ordering: Consider processing requests in their intended order. As most providers prioritize at-least-once delivery, events can be out of order. To indicate order, events may include a timestamp or a sequence identifier in the payload. If not, ordering may be on a best-efforts basis based on when the webhook is received. To handle ordering reliably, select event-driven services that ensure ordering. This example uses DynamoDB Streams and EventBridge Pipes.

Flexible Processing: EventBridge Pipes provide integrations to a range of event-driven services as targets. You can route events to different targets based on filters. Different event types may require different processors. For example, you can use Step Functions for orchestrating complex workflows, Lambda for compute operations with less than 15-minute execution time, SQS to buffer requests, and Amazon Elastic Container Service (ECS) for long-running compute jobs. EventBridge Pipes provide transformation to ensure only necessary payloads are sent, and enrichment if additional information is required.

Costs: This example considers a use case that can handle large payloads. However, if you can ensure that providers send minimal payloads, consider a simpler architecture without the claim-check pattern to minimize cost.


Webhooks are a popular method for applications to communicate, and for businesses to collaborate and integrate with customers and partners.

This post shows how you can build applications to send and receive webhooks on AWS. It uses serverless services such as EventBridge and Lambda, which are well-suited for event-driven use cases. It covers high-level reference architectures, considerations, best practices, and a code sample to assist in building your solution.

For standards and best practices on webhooks, visit the open-source community resources Webhooks.fyi and CloudEvents.io.

For more serverless learning resources, visit Serverless Land.

Unstructured data management and governance using AWS AI/ML and analytics services

Post Syndicated from Sakti Mishra original https://aws.amazon.com/blogs/big-data/unstructured-data-management-and-governance-using-aws-ai-ml-and-analytics-services/

Unstructured data is information that doesn’t conform to a predefined schema or isn’t organized according to a preset data model. Unstructured information may have a little or a lot of structure but in ways that are unexpected or inconsistent. Text, images, audio, and videos are common examples of unstructured data. Most companies produce and consume unstructured data such as documents, emails, web pages, engagement center phone calls, and social media. By some estimates, unstructured data can make up to 80–90% of all new enterprise data and is growing many times faster than structured data. After decades of digitizing everything in your enterprise, you may have an enormous amount of data, but with dormant value. However, with the help of AI and machine learning (ML), new software tools are now available to unearth the value of unstructured data.

In this post, we discuss how AWS can help you successfully address the challenges of extracting insights from unstructured data. We discuss various design patterns and architectures for extracting and cataloging valuable insights from unstructured data using AWS. Additionally, we show how to use AWS AI/ML services for analyzing unstructured data.

Why it’s challenging to process and manage unstructured data

Unstructured data makes up a large proportion of the data in the enterprise that can’t be stored in a traditional relational database management system (RDBMS). Understanding the data, categorizing it, storing it, and extracting insights from it can be challenging. In addition, identifying incremental changes requires specialized patterns, and detecting sensitive data and meeting compliance requirements calls for sophisticated functions. It can be difficult to integrate unstructured data with structured data from existing information systems. Some view structured and unstructured data as apples and oranges, instead of being complementary. But most important of all, the assumed dormant value in the unstructured data is a question mark, which can only be answered after these sophisticated techniques have been applied. Therefore, there is a need to analyze and extract value from the data economically and flexibly.

Solution overview

Data and metadata discovery is one of the primary requirements in data analytics, where data consumers explore what data is available and in what format, and then consume or query it for analysis. If you can apply a schema on top of the dataset, then it’s straightforward to query because you can load the data into a database or impose a virtual table schema for querying. But in the case of unstructured data, metadata discovery is challenging because the raw data isn’t easily readable.

You can integrate different technologies or tools to build a solution. In this post, we explain how to integrate different AWS services to provide an end-to-end solution that includes data extraction, management, and governance.

The solution integrates data in three tiers. The first is the raw input data that gets ingested by source systems, the second is the output data that gets extracted from input data using AI, and the third is the metadata layer that maintains a relationship between them for data discovery.

The following is a high-level architecture of the solution we can build to process the unstructured data, assuming the input data is being ingested to the raw input object store.

Unstructured Data Management - Block Level Architecture Diagram

The steps of the workflow are as follows:

  1. Integrated AI services extract data from the unstructured data.
  2. These services write the output to a data lake.
  3. A metadata layer helps build the relationship between the raw data and AI extracted output. When the data and metadata are available for end-users, we can break the user access pattern into additional steps.
  4. In the metadata catalog discovery step, we can use query engines to access the metadata for discovery and apply filters as per our analytics needs. Then we move to the next stage of accessing the actual data extracted from the raw unstructured data.
  5. The end-user accesses the output of the AI services and uses the query engines to query the structured data available in the data lake. We can optionally integrate additional tools that help control access and provide governance.
  6. There might be scenarios where, after accessing the AI extracted output, the end-user wants to access the original raw object (such as media files) for further analysis. Additionally, we need to make sure we have access control policies so the end-user has access only to the respective raw data they want to access.

Now that we understand the high-level architecture, let’s discuss what AWS services we can integrate in each step of the architecture to provide an end-to-end solution.

The following diagram is the enhanced version of our solution architecture, where we have integrated AWS services.

Unstructured Data Management - AWS Native Architecture

Let’s understand how these AWS services are integrated in detail. We have divided the steps into two broad user flows: data processing and metadata enrichment (Steps 1–3) and end-users accessing the data and metadata with fine-grained access control (Steps 4–6).

  1. Various AI services (which we discuss in the next section) extract data from the unstructured datasets.
  2. The output is written to an Amazon Simple Storage Service (Amazon S3) bucket (labeled Extracted JSON in the preceding diagram). Optionally, we can restructure the input raw objects for better partitioning, which can help while implementing fine-grained access control on the raw input data (labeled as the Partitioned bucket in the diagram).
  3. After the initial data extraction phase, we can apply additional transformations to enrich the datasets using AWS Glue. We also build an additional metadata layer, which maintains a relationship between the raw S3 object path, the AI extracted output path, the optional enriched version S3 path, and any other metadata that will help the end-user discover the data.
  4. In the metadata catalog discovery step, we use the AWS Glue Data Catalog as the technical catalog, Amazon Athena and Amazon Redshift Spectrum as query engines, AWS Lake Formation for fine-grained access control, and Amazon DataZone for additional governance.
  5. The AI extracted output is expected to be available as a delimited file or in JSON format. We can create an AWS Glue Data Catalog table for querying using Athena or Redshift Spectrum. Like the previous step, we can use Lake Formation policies for fine-grained access control.
  6. Lastly, the end-user accesses the raw unstructured data available in Amazon S3 for further analysis. We have proposed integrating Amazon S3 Access Points for access control at this layer. We explain this in detail later in this post.
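To make the metadata layer from step 3 more concrete, the following sketch shows one possible shape for a metadata record. The field names, bucket names, and paths are hypothetical. The post does not prescribe a schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class MetadataRecord:
    raw_path: str                 # S3 path of the original object
    extracted_path: str           # S3 path of the AI-extracted output
    enriched_path: Optional[str]  # S3 path of the optional enriched version
    media_type: str               # for example "audio", "image", or "text"
    extracted_by: str             # service that produced the output


record = MetadataRecord(
    raw_path="s3://example-raw/calls/2024/01/call-001.wav",
    extracted_path="s3://example-extracted/calls/2024/01/call-001.json",
    enriched_path=None,
    media_type="audio",
    extracted_by="transcribe",
)
line = json.dumps(asdict(record))  # one JSON line per processed object
```

Records written as JSON lines to Amazon S3 can be cataloged as a table and queried like any other dataset in the data lake, which lets end-users move from a metadata search to the matching extracted output and raw object.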

Now let’s expand the following parts of the architecture to understand the implementation better:

  • Using AWS AI services to process unstructured data
  • Using S3 Access Points to integrate access control on raw S3 unstructured data

Process unstructured data with AWS AI services

As we discussed earlier, unstructured data can come in a variety of formats, such as text, audio, video, and images, and each type of data requires a different approach for extracting metadata. AWS AI services are designed to extract metadata from different types of unstructured data. The following are the most commonly used services for unstructured data processing:

  • Amazon Comprehend – This natural language processing (NLP) service uses ML to extract metadata from text data. It can analyze text in multiple languages, detect entities, extract key phrases, determine sentiment, and more. With Amazon Comprehend, you can easily gain insights from large volumes of text data such as extracting product entity, customer name, and sentiment from social media posts.
  • Amazon Transcribe – This speech-to-text service uses ML to convert speech to text and extract metadata from audio data. It can recognize multiple speakers, transcribe conversations, identify keywords, and more. With Amazon Transcribe, you can convert unstructured data such as customer support recordings into text and further derive insights from it.
  • Amazon Rekognition – This image and video analysis service uses ML to extract metadata from visual data. It can recognize objects, people, faces, and text, detect inappropriate content, and more. With Amazon Rekognition, you can easily analyze images and videos to gain insights such as identifying entity type (human or other) and identifying if the person is a known celebrity in an image.
  • Amazon Textract – You can use this ML service to extract metadata from scanned documents and images. It can extract text, tables, and forms from images, PDFs, and scanned documents. With Amazon Textract, you can digitize documents and extract data such as customer name, product name, product price, and date from an invoice.
  • Amazon SageMaker – This service enables you to build and deploy custom ML models for a wide range of use cases, including extracting metadata from unstructured data. With SageMaker, you can build custom models that are tailored to your specific needs, which can be particularly useful for extracting metadata from unstructured data that requires a high degree of accuracy or domain-specific knowledge.
  • Amazon Bedrock – This fully managed service offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon with a single API. It also offers a broad set of capabilities to build generative AI applications, simplifying development while maintaining privacy and security.

With these specialized AI services, you can efficiently extract metadata from unstructured data and use it for further analysis and insights. It’s important to note that each service has its own strengths and limitations, and choosing the right service for your specific use case is critical for achieving accurate and reliable results.

AWS AI services are available via various APIs, which enables you to integrate AI capabilities into your applications and workflows. AWS Step Functions is a serverless workflow service that allows you to coordinate and orchestrate multiple AWS services, including AI services, into a single workflow. This can be particularly useful when you need to process large amounts of unstructured data and perform multiple AI-related tasks, such as text analysis, image recognition, and NLP.

With Step Functions and AWS Lambda functions, you can create sophisticated workflows that include AI services and other AWS services. For instance, you can use Amazon S3 to store input data, invoke a Lambda function to trigger an Amazon Transcribe job to transcribe an audio file, and use the output to trigger an Amazon Comprehend analysis job to generate sentiment metadata for the transcribed text. This enables you to create complex, multi-step workflows that are straightforward to manage, scalable, and cost-effective.

The following is an example architecture that shows how Step Functions can help invoke AWS AI services using Lambda functions.

AWS AI Services - Lambda Event Workflow -Unstructured Data

The workflow steps are as follows:

  1. Unstructured data, such as text files, audio files, and video files, are ingested into the S3 raw bucket.
  2. A Lambda function is triggered to read the data from the S3 bucket and call Step Functions to orchestrate the workflow required to extract the metadata.
  3. The Step Functions workflow checks the type of file, calls the corresponding AWS AI service APIs, checks the job status, and performs any postprocessing required on the output.
  4. AWS AI services can be accessed via APIs and invoked as batch jobs. To extract metadata from different types of unstructured data, you can use multiple AI services in sequence, with each service processing the corresponding file type.
  5. After the Step Functions workflow completes the metadata extraction process and performs any required postprocessing, the resulting output is stored in an S3 bucket for cataloging.
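The file type check in step 3 can be sketched as a simple routing function. The mapping from file extension to service is an assumption for illustration. A real workflow might inspect the content type or object metadata instead, and would implement this branching in the state machine definition.

```python
from pathlib import PurePosixPath

# Assumed mapping from file extension to the service that handles it.
SERVICE_BY_EXTENSION = {
    ".txt": "comprehend",
    ".mp3": "transcribe",
    ".wav": "transcribe",
    ".jpg": "rekognition",
    ".png": "rekognition",
    ".mp4": "rekognition",
    ".pdf": "textract",
}


def route(object_key: str) -> str:
    """Pick the AI service for an object based on its file extension."""
    suffix = PurePosixPath(object_key).suffix.lower()
    try:
        return SERVICE_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(f"No AI service configured for {object_key!r}") from None
```

Raising an error for unknown extensions makes unsupported files visible, so that the workflow can send them to a failure branch rather than silently skipping them.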

Next, let’s understand how we can implement security or access control on both the extracted output and the raw input objects.

Implement access control on raw and processed data in Amazon S3

We consider access controls for three types of data when managing unstructured data: the AI-extracted semi-structured output, the metadata, and the raw unstructured original files. When it comes to AI extracted output, it’s in JSON format and can be restricted via Lake Formation and Amazon DataZone. We recommend keeping the metadata (information that captures which unstructured datasets are already processed by the pipeline and available for analysis) open to your organization, which will enable metadata discovery across the organization.

To control access of raw unstructured data, you can integrate S3 Access Points and explore additional support in the future as AWS services evolve. S3 Access Points simplify data access for any AWS service or customer application that stores data in Amazon S3. Access points are named network endpoints that are attached to buckets that you can use to perform S3 object operations. Each access point has distinct permissions and network controls that Amazon S3 applies for any request that is made through that access point. Each access point enforces a customized access point policy that works in conjunction with the bucket policy that is attached to the underlying bucket. With S3 Access Points, you can create unique access control policies for each access point to easily control access to specific datasets within an S3 bucket. This works well in multi-tenant or shared bucket scenarios where users or teams are assigned to unique prefixes within one S3 bucket.

An access point can support a single user or application, or groups of users or applications within and across accounts, allowing separate management of each access point. Every access point is associated with a single bucket and contains a network origin control and a Block Public Access control. For example, you can create an access point with a network origin control that only permits storage access from your virtual private cloud (VPC), a logically isolated section of the AWS Cloud. You can also create an access point with the access point policy configured to only allow access to objects with a defined prefix or to objects with specific tags. You can also configure custom Block Public Access settings for each access point.
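As an illustration of a prefix-scoped access point policy, the following sketch builds the policy document in Python. The account ID, Region, access point name, role name, and prefix are placeholder values, and the function name is hypothetical.

```python
import json


def prefix_read_policy(
    region: str, account_id: str, access_point: str, role_name: str, prefix: str
) -> dict:
    """Build an access point policy that allows reads under a single prefix."""
    access_point_arn = f"arn:aws:s3:{region}:{account_id}:accesspoint/{access_point}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{account_id}:role/{role_name}"},
                "Action": "s3:GetObject",
                # Objects are addressed through the access point, not the bucket.
                "Resource": f"{access_point_arn}/object/{prefix.strip('/')}/*",
            }
        ],
    }


policy = prefix_read_policy(
    "us-east-1", "111122223333", "analysts-ap", "AnalystRole", "media/team-a"
)
policy_json = json.dumps(policy, indent=2)
```

The access point policy works together with the bucket policy, so the underlying bucket must also permit access through the access point for the request to succeed.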

The following architecture provides an overview of how an end-user can get access to specific S3 objects by assuming a specific AWS Identity and Access Management (IAM) role. If you have a large number of S3 objects to control access, consider grouping the S3 objects, assigning them tags, and then defining access control by tags.

S3 Access Points - Unstructured Data Management - Access Control

If you are implementing a solution that integrates S3 data available in multiple AWS accounts, you can take advantage of cross-account support for S3 Access Points.


This post explained how you can use AWS AI services to extract readable data from unstructured datasets, build a metadata layer on top of them to allow data discovery, and build an access control mechanism on top of the raw S3 objects and extracted data using Lake Formation, Amazon DataZone, and S3 Access Points.

In addition to AWS AI services, you can also integrate large language models with vector databases to enable semantic or similarity search on top of unstructured datasets. To learn more about how to enable semantic search on unstructured data by integrating Amazon OpenSearch Service as a vector database, refer to Try semantic search with the Amazon OpenSearch Service vector engine.

As of writing this post, S3 Access Points is one of the best solutions to implement access control on raw S3 objects using tagging, but as AWS service features evolve in the future, you can explore alternative options as well.

About the Authors

Sakti Mishra is a Principal Solutions Architect at AWS, where he helps customers modernize their data architecture and define their end-to-end data strategy, including data security, accessibility, governance, and more. He is also the author of the book Simplify Big Data Analytics with Amazon EMR. Outside of work, Sakti enjoys learning new technologies, watching movies, and visiting places with family.

Bhavana Chirumamilla is a Senior Resident Architect at AWS with a strong passion for data and machine learning operations. She brings a wealth of experience and enthusiasm to help enterprises build effective data and ML strategies. In her spare time, Bhavana enjoys spending time with her family and engaging in various activities such as traveling, hiking, gardening, and watching documentaries.

Sheela Sonone is a Senior Resident Architect at AWS. She helps AWS customers make informed choices and trade-offs about accelerating their data, analytics, and AI/ML workloads and implementations. In her spare time, she enjoys spending time with her family—usually on tennis courts.

Daniel Bruno is a Principal Resident Architect at AWS. He has been building analytics and machine learning solutions for over 20 years and splits his time between helping customers build data science programs and designing impactful ML products.

Serverless ICYMI Q3 2023

Post Syndicated from Benjamin Smith original https://aws.amazon.com/blogs/compute/serverless-icymi-q3-2023/

Welcome to the 23rd edition of the AWS Serverless ICYMI (in case you missed it) quarterly recap. Every quarter, we share all the most recent product launches, feature enhancements, blog posts, webinars, live streams, and other interesting things that you might have missed!

In case you missed our last ICYMI, check out what happened last quarter here.

AWS announces the general availability of Amazon Bedrock

Amazon Web Services (AWS) unveils five generative artificial intelligence (AI) innovations to democratize generative AI applications. Amazon Bedrock, now generally available, enables experimentation with top foundation models (FMs) and allows customization with proprietary data.

It supports creating managed agents for complex tasks without code and ensures security and privacy. Amazon Titan Embeddings, another FM, is generally available for various language-related use cases. Meta’s Llama 2, coming soon, enhances dialogue scenarios.

The upcoming Amazon CodeWhisperer customization capability enables secure customization using private code bases. Generative BI authoring capabilities in Amazon QuickSight simplify visualization creation for business analysts.

AWS Lambda

AWS Lambda now detects and stops recursive loops in Lambda functions. This guards against unexpected costs from functions caught in recursive or infinite loops. Lambda identifies recursive behavior and discontinues requests after 16 invocations. The feature addresses pitfalls stemming from misconfiguration or coding bugs, introducing detailed error messaging, and allowing users to set maximum limits on retry intervals. Notifications about recursive occurrences are relayed through the AWS Health Dashboard, emails, and CloudWatch Alarms for streamlined troubleshooting. Lambda uses AWS X-Ray trace headers for invocation tracking, requiring supported AWS SDK versions.

AWS simplifies writing .NET 6 Lambda functions with the Lambda Annotations Framework for .NET. This new programming model makes the experience of writing Lambda functions in C# feel more natural for .NET developers by using C# source generator technology. This streamlines the development workflow for .NET developers, making it easier to create serverless applications using the latest version of the .NET framework.

AWS Lambda and Amazon EventBridge Pipes now support enhanced filtering. Additional filtering capabilities include the ability to match against characters at the end of a value (suffix filtering), ignore case sensitivity (equals-ignore-case), and have a single rule match if any conditions across multiple separate fields are true (OR matching).
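To make the three operators concrete, here is a small sketch of a filter pattern that uses `suffix`, `equals-ignore-case`, and `$or`, together with a deliberately simplified local matcher. The matcher is only an illustration of the semantics described above, not the matching engine that Lambda or EventBridge Pipes use. The field names `file_name` and `uploader` are hypothetical.

```python
def matches_condition(value, condition):
    """Check one value against one (simplified) filter condition."""
    if isinstance(condition, dict):
        if "suffix" in condition:
            return isinstance(value, str) and value.endswith(condition["suffix"])
        if "equals-ignore-case" in condition:
            expected = condition["equals-ignore-case"]
            return isinstance(value, str) and value.lower() == expected.lower()
        raise ValueError(f"unsupported operator in this sketch: {condition}")
    return value == condition  # plain exact match


def matches_pattern(event, pattern):
    """Return True when every key in the pattern is satisfied by the event."""
    for key, expected in pattern.items():
        if key == "$or":
            # OR matching: at least one alternative sub-pattern must match.
            if not any(matches_pattern(event, alt) for alt in expected):
                return False
        elif isinstance(expected, dict):
            # Nested object: recurse into the corresponding event field.
            child = event.get(key)
            if not isinstance(child, dict) or not matches_pattern(child, expected):
                return False
        else:
            # A list of conditions: any one of them may match the field.
            if key not in event:
                return False
            if not any(matches_condition(event[key], c) for c in expected):
                return False
    return True


pattern = {
    "detail": {
        "$or": [
            {"file_name": [{"suffix": ".png"}]},
            {"uploader": [{"equals-ignore-case": "alice"}]},
        ]
    }
}

png_upload = {"detail": {"file_name": "photo.png", "uploader": "bob"}}
alice_upload = {"detail": {"file_name": "photo.PNG", "uploader": "ALICE"}}
other_upload = {"detail": {"file_name": "notes.txt", "uploader": "bob"}}

print(matches_pattern(png_upload, pattern))
print(matches_pattern(alice_upload, pattern))
print(matches_pattern(other_upload, pattern))
```

In this sketch the second event matches only through the case-insensitive `uploader` condition, because `suffix` is treated as case-sensitive.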

AWS Lambda Functions powered by AWS Graviton2 are now available in 6 additional Regions. Graviton2 processors are known for their performance benefits, and this expansion provides users with more choices for running serverless workloads.

AWS Lambda adds support for Python 3.11, allowing developers to take advantage of the latest features and improvements in the Python programming language for their serverless functions.

AWS Step Functions

AWS Step Functions enhances Workflow Studio, focusing on an Advanced Starter Template and Code Mode for efficient AWS Step Functions workflow creation. Users benefit from streamlined design-to-code transitions, pasting Amazon States Language (ASL) definitions directly into Workflow Studio, speeding up adjustments. Enhanced workflow execution and configuration allow direct execution and setting adjustments within Workflow Studio, improving user experience.

AWS Step Functions launches enhanced error handling. This update helps users to identify errors with precision and refine retry strategies. Step Functions now enables detailed error messages in Fail states and precise control over retry intervals. Use the new maximum limits and jitter functionality to ensure efficient and controlled retries, preventing service overload in recovery scenarios.
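The effect of a maximum delay and jitter on a retry schedule can be sketched with a few lines of arithmetic. The example below models a retrier with the Amazon States Language fields `IntervalSeconds`, `BackoffRate`, `MaxAttempts`, `MaxDelaySeconds`, and `JitterStrategy`. The function `retry_delays` is a hypothetical helper, and drawing the jittered delay uniformly between zero and the capped delay is an assumption made for illustration, not a statement of how Step Functions computes it internally.

```python
import random

# A retrier as it might appear in a state's "Retry" field (values are examples).
retrier = {
    "ErrorEquals": ["States.TaskFailed"],
    "IntervalSeconds": 2,
    "BackoffRate": 2.0,
    "MaxAttempts": 5,
    "MaxDelaySeconds": 10,
    "JitterStrategy": "FULL",
}


def retry_delays(retrier, jitter=True, rng=None):
    """Return the wait in seconds before each retry attempt for the retrier."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(retrier["MaxAttempts"]):
        # Exponential backoff: interval * rate^attempt.
        delay = retrier["IntervalSeconds"] * retrier["BackoffRate"] ** attempt
        # Cap the wait so backoff cannot grow without bound.
        if "MaxDelaySeconds" in retrier:
            delay = min(delay, retrier["MaxDelaySeconds"])
        # Assumed jitter model: pick a random wait up to the capped delay.
        if jitter and retrier.get("JitterStrategy") == "FULL":
            delay = rng.uniform(0, delay)
        delays.append(delay)
    return delays


capped = retry_delays(retrier, jitter=False)
jittered = retry_delays(retrier, jitter=True, rng=random.Random(7))
print(capped)
print(jittered)
```

Without the cap, the fourth and fifth waits would be 16 and 32 seconds. With `MaxDelaySeconds` set to 10, they level off at 10 seconds, and jitter spreads concurrent retries out so they do not hit a recovering service at the same moment.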

AWS Step Functions distributed map is now available in the AWS GovCloud (US) Regions. This release highlights the availability of the distributed map feature in Step Functions specifically tailored for the AWS GovCloud (US) Regions. The distributed map feature is a powerful capability for orchestrating parallel and distributed processing in serverless workflows.


AWS SAM CLI announces local testing and debugging support on Terraform projects.

Developers can now use AWS SAM CLI to locally test and debug AWS Lambda functions and Amazon API Gateway defined in their Terraform projects. AWS SAM CLI reads infrastructure resource information from the Terraform application, allowing users to start Lambda functions and API Gateway endpoints locally in a Docker container.

This update enables faster development cycles for Terraform users, who can use AWS SAM CLI commands like `sam local start-api`, `sam local start-lambda`, and `sam local invoke`, along with `sam local generate-event` for generating mock test events.

Amazon EventBridge

Amazon EventBridge Scheduler adds schedule deletion after completion. This feature offers enhanced functionality by supporting the automatic deletion of schedules upon completion of their last invocation. It is applicable to various scheduling types, including one-time, cron, and rate schedules with an end date. Amazon EventBridge Scheduler, a centralized and highly scalable service, enables the creation, execution, and management of schedules.

EventBridge Scheduler can schedule millions of tasks that invoke over 270 AWS services and 6,000 API operations, and this update streamlines the process of managing completed schedules. The automatic deletion feature reduces the need for manual intervention or custom code, saving time and simplifying scalability for users leveraging EventBridge Scheduler.
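As a rough sketch of how a one-time schedule might opt in to automatic deletion, the following builds the request parameters for the EventBridge Scheduler `CreateSchedule` operation with `ActionAfterCompletion` set to `DELETE`. The helper `one_time_schedule_request`, the schedule name, the ARNs, and the payload are hypothetical. The sketch only constructs the request and does not call AWS.

```python
import json
from datetime import datetime


def one_time_schedule_request(name, run_at, target_arn, role_arn, payload):
    """Build CreateSchedule parameters for a one-time, self-deleting schedule."""
    return {
        "Name": name,
        # One-time schedules use the at(yyyy-mm-ddThh:mm:ss) expression form.
        "ScheduleExpression": f"at({run_at.strftime('%Y-%m-%dT%H:%M:%S')})",
        "ScheduleExpressionTimezone": "UTC",
        "FlexibleTimeWindow": {"Mode": "OFF"},
        "Target": {
            "Arn": target_arn,
            "RoleArn": role_arn,
            "Input": json.dumps(payload),
        },
        # Remove the schedule automatically after its last invocation.
        "ActionAfterCompletion": "DELETE",
    }


# Hypothetical values, for illustration only.
request = one_time_schedule_request(
    name="send-reminder-42",
    run_at=datetime(2023, 10, 1, 9, 30, 0),
    target_arn="arn:aws:lambda:us-east-1:111122223333:function:send-reminder",
    role_arn="arn:aws:iam::111122223333:role/scheduler-invoke-role",
    payload={"reminder_id": 42},
)
print(json.dumps(request, indent=2))
```

With credentials configured, a dictionary like this could be passed to the AWS SDK for Python as `boto3.client("scheduler").create_schedule(**request)`.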

Amazon EventBridge Pipes now available in three additional Regions. This update extends the availability of Amazon EventBridge Pipes, a powerful event-routing service, to three additional Regions.

Amazon EventBridge API Destinations is now available in additional Regions, providing users with more options for building scalable and decoupled applications.

Amazon EventBridge Schema Registry and Schema Discovery now in additional Regions. This expansion allows you to discover and store event structure – or schema – in a shared, central location. You can download code bindings for those schemas for Java, Python, TypeScript, and Golang so it’s easier to use events as objects in your code.

Amazon SNS

To enhance message privacy and security, Amazon Simple Notification Service (SNS) implemented Message Data Protection, allowing users to de-identify outbound messages via redaction or masking. Amazon SNS FIFO topics now support message delivery to Amazon SQS Standard queues. This provides users with increased flexibility in managing message delivery and ordering.

Expanding its monitoring capabilities, Amazon SNS introduced Additional Usage Metrics in Amazon CloudWatch. This enhancement allows users to gain more comprehensive insights into the performance and utilization of their SNS resources. SNS extended its global SMS sending capabilities to Israel (Tel Aviv), providing users in that Region with additional options for SMS notifications. SNS also expanded its reach by supporting Mobile Push Notifications in twelve new AWS Regions. This expansion aligns with the growing demand for mobile notification capabilities, offering a broader coverage for users across diverse Regions.

Amazon SQS

Amazon Simple Queue Service (SQS) introduced a number of updates. Attribute-Based Access Control (ABAC) was implemented for scalable access permissions, while message data protection can now de-identify outbound messages via redaction or masking. Amazon SNS FIFO topics now support message delivery to Amazon SQS Standard queues, providing enhanced flexibility. Addressing throughput demands, SQS increased the quota for FIFO High Throughput mode. JSON protocol support was previewed, offering improved message format flexibility. These updates underscore SQS’s commitment to advanced security and flexibility.

Amazon API Gateway

Amazon API Gateway undergoes a console refresh, aligning with Cloudscape Design System guidelines. Notable enhancements include improved usability, sortable tables, enhanced API key management, and direct API deployment from the Resource view. The update introduces dark mode, accessibility improvements, and visual alignment with HTTP APIs and AWS Services.

GOTO EDA Day Nashville 2023

Join GOTO EDA Day in Nashville on October 26 for insights on event-driven architectures. Learn from industry leaders at Music City Center with talks, panels, and Hands-On Labs. Limited tickets available.

Serverless blog posts

July 2023

July 5 – Implementing AWS Lambda error handling patterns

July 6 – Implementing AWS Lambda error handling patterns

July 7 – Understanding AWS Lambda’s invoke throttling limits

July 10 – Detecting and stopping recursive loops in AWS Lambda functions

July 11 – Implementing patterns that exit early out of a parallel state in AWS Step Functions

July 26 – Migrating AWS Lambda functions from the Go1.x runtime to the custom runtime on Amazon Linux 2

July 27 – Python 3.11 runtime now available in AWS Lambda

August 2023

August 2 – Automatically delete schedules upon completion with Amazon EventBridge Scheduler

August 7 – Using response streaming with AWS Lambda Web Adapter to optimize performance

August 15 – Integrating IBM MQ with Amazon SQS and Amazon SNS using Apache Camel

August 15 – Implementing the transactional outbox pattern with Amazon EventBridge Pipes

August 23 – Protecting an AWS Lambda function URL with Amazon CloudFront and Lambda@Edge

August 29 – Enhancing file sharing using Amazon S3 and AWS Step Functions

August 31 – Enhancing Workflow Studio with new features for streamlined authoring

September 2023

September 5 – AWS SAM support for HashiCorp Terraform now generally available

September 14 – Building a secure webhook forwarder using an AWS Lambda extension and Tailscale

September 18 – Building resilient serverless applications using chaos engineering

September 19 – Implementing idempotent AWS Lambda functions with Powertools for AWS Lambda (TypeScript)

September 19 – Centralizing management of AWS Lambda layers across multiple AWS Accounts

September 26 – Architecting for scale with Amazon API Gateway private integrations

September 26 – Visually design your application with AWS Application Composer


Serverless Office Hours – Tues 10AM PT

July 2023

July 4 – Benchmarking Lambda cold starts

July 11 – Lambda testing: AWS SAM remote invoke

July 18 – Using DynamoDB global tables

July 25 – Serverless observability with SLIC-watch

August 2023

August 1 – Step Functions versions and aliases

August 8 – Deploying Lambda with EKS and Crossplane / Managing Lambda with Kubernetes

August 15 – Serverless caching with Momento

September 2023

September 5 – Run any web app on Lambda

September 12 – Building an API platform on AWS

September 19 – Idempotency: exactly once processing

September 26 – AWS Amplify Studio + GraphQL

FooBar Serverless YouTube channel

July 2023

July 27 – Generative AI and Serverless to create a new story everyday

August 2023

August 3 – Getting started with Data Streaming

August 10 – Amazon Kinesis Data Streams – Shards? Provisioned? On-demand? What does all this mean?

August 17 – Put and consume events with AWS Lambda, Amazon Kinesis Data Stream and Event Source Mapping

August 24 – Create powerful data pipelines with Amazon Kinesis and EventBridge Pipes

August 31 – New Step Functions versions and alias!

September 2023

September 7 – Amazon Kinesis Data Firehose – What is this service for?

September 14 – Kinesis Data Firehose with AWS CDK – Lambda transformations

September 21 – Advanced Event Source Mapping configuration | AWS Lambda and Amazon Kinesis Data Streams

September 28 – Data Streaming Patterns

Still looking for more?

The Serverless landing page has more information. The Lambda resources page contains case studies, webinars, whitepapers, customer stories, reference architectures, and even more Getting Started tutorials.

You can also follow the Serverless Developer Advocacy team on Twitter to see the latest news, follow conversations, and interact with the team.