Tag Archives: serverless

How Wallapop improved performance of analytics workloads with Amazon Redshift Serverless and data sharing

Post Syndicated from Eduard Lopez original https://aws.amazon.com/blogs/big-data/how-wallapop-improved-performance-of-analytics-workloads-with-amazon-redshift-serverless-and-data-sharing/

Amazon Redshift is a fast, fully managed cloud data warehouse that makes it straightforward and cost-effective to analyze all your data at petabyte scale, using standard SQL and your existing business intelligence (BI) tools. Today, tens of thousands of customers run business-critical workloads on Amazon Redshift.

Amazon Redshift Serverless makes it effortless to run and scale analytics workloads without having to manage any data warehouse infrastructure.

Redshift Serverless automatically provisions and intelligently scales data warehouse capacity to deliver fast performance for even the most demanding and unpredictable workloads, and you pay only for what you use.

This is ideal when it’s difficult to predict compute needs such as variable workloads, periodic workloads with idle time, and steady-state workloads with spikes. As your demand evolves with new workloads and more concurrent users, Redshift Serverless automatically provisions the right compute resources, and your data warehouse scales seamlessly and automatically.

Amazon Redshift data sharing allows you to securely share live, transactionally consistent data in one Redshift data warehouse with another Redshift data warehouse (provisioned or serverless) across accounts and Regions without needing to copy, replicate, or move data from one data warehouse to another.

Amazon Redshift data sharing enables you to evolve your Amazon Redshift deployment architectures into a hub-and-spoke or data mesh model to better meet performance SLAs, provide workload isolation, perform cross-group analytics, and onboard new use cases, all without the complexity of data movement and data copies.

In this post, we show how Wallapop adopted Redshift Serverless and data sharing to modernize their data warehouse architecture.

Wallapop’s initial data architecture platform

Wallapop is a Spanish ecommerce marketplace company focused on second-hand items, founded in 2013. Every day, they receive around 300,000 new items from their users to be added to the catalog. The marketplace can be accessed via mobile app or website.

The average monthly traffic is around 15 million active users. Since its creation in 2013, it has reached more than 40 million downloads and more than 700 million products have been listed.

Amazon Redshift plays a central role in their data platform on AWS for ingestion, ETL (extract, transform, and load), machine learning (ML), and consumption workloads that run their insight consumption to drive decision-making.

The initial architecture is composed of one main Redshift provisioned cluster that handled all the workloads, as illustrated in the following diagram. Their cluster was deployed with eight ra3.4xlarge nodes and concurrency scaling enabled.

Wallapop had three main areas to improve in their initial data architecture platform:

  • Workload isolation challenges with growing data volumes and new workloads running in parallel
  • Administrative burden on data engineering teams to manage the concurrent workloads, especially at peak times
  • Cost-performance ratio while scaling during peak periods

The improvements mainly focused on the performance of data consumption workloads and the BI and analytics consumption tool, where high query concurrency was impacting the final analytics preparation and insight consumption.

Solution overview

To improve their data platform architecture, Wallapop designed and built a new distributed approach with Amazon Redshift with the support of AWS.

The size of their provisioned data warehouse cluster didn’t change. What changed was reducing concurrency scaling usage to 1 hour, which falls within the free concurrency scaling credits accrued for every 24 hours that the main cluster runs. The following diagram illustrates the target architecture.

Solution details

The new data platform architecture combines Redshift Serverless and provisioned data warehouses with Amazon Redshift data sharing, helping Wallapop improve their overall Amazon Redshift experience with improved ease of use, performance, and optimized costs.

Redshift Serverless measures data warehouse capacity in Redshift Processing Units (RPUs). RPUs are resources used to handle workloads. You can adjust the base capacity setting from 8 RPUs to 512 RPUs in units of 8 (8, 16, 24, and so on).
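For illustration, the following sketch (with hypothetical namespace and workgroup names, not Wallapop’s actual configuration) shows how a base capacity could be set when creating a Redshift Serverless workgroup with the AWS SDK for Python (Boto3):

import boto3

# Hypothetical names; a sketch of setting base capacity (RPUs) on a
# Redshift Serverless workgroup.
client = boto3.client("redshift-serverless")

client.create_workgroup(
    workgroupName="analytics-consumer-wg",
    namespaceName="analytics-namespace",
    baseCapacity=32,  # base RPUs, adjustable from 8 to 512 in units of 8
)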

The new architecture uses a Redshift provisioned cluster with RA3 nodes to run their constant and write workloads (data ingestion and transformation jobs). For cost-efficiency, Wallapop is also benefiting from Redshift reserved instances to optimize on costs for these known, predictable, and steady workloads. This cluster acts as the producer cluster in their distributed architecture using data sharing, meaning the data is ingested into the storage layer of Amazon Redshift—Redshift Managed Storage (RMS).

For the consumption part of the data platform architecture, the data is shared with different Redshift Serverless endpoints to meet the needs for different consumption workloads.

With this architecture, Wallapop achieves better workload isolation and ensures that only the right data is shared with the different consumption applications. Additionally, this approach avoids data duplication on the consumer side, which optimizes costs and enables better governance, because they only have to manage a single version of the data warehouse data instead of multiple copies or versions of it.
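As an illustrative sketch, not Wallapop’s actual implementation, the producer-to-consumer flow can be wired up with a few data sharing SQL statements, shown here issued through the Redshift Data API; the share, database, cluster, workgroup, and namespace identifiers are hypothetical:

import boto3

rsd = boto3.client("redshift-data")

# On the producer (provisioned cluster): create and populate the datashare,
# then grant usage to the consumer namespace.
producer_sql = [
    "CREATE DATASHARE analytics_share",
    "ALTER DATASHARE analytics_share ADD SCHEMA public",
    "ALTER DATASHARE analytics_share ADD ALL TABLES IN SCHEMA public",
    "GRANT USAGE ON DATASHARE analytics_share TO NAMESPACE '<consumer-namespace-id>'",
]
for sql in producer_sql:
    rsd.execute_statement(ClusterIdentifier="producer-cluster",
                          Database="dev", DbUser="awsuser", Sql=sql)

# On the consumer (Redshift Serverless workgroup): create a local database
# from the datashare so BI queries can read the shared, live data.
rsd.execute_statement(
    WorkgroupName="analytics-consumer-wg",
    Database="dev",
    Sql="CREATE DATABASE shared_db FROM DATASHARE analytics_share OF NAMESPACE '<producer-namespace-id>'",
)

Consumers then query the shared objects through shared_db directly, with no copies of the data to maintain.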

Redshift Serverless serves the consumer side of the data platform architecture, handling predictable and unpredictable, non-steady, and often demanding analytics workloads, such as CI/CD jobs and the BI and analytics consumption coming from their data visualization application. Redshift Serverless also helps them achieve better workload isolation thanks to its managed auto scaling, which keeps performance consistent for these unpredictable workloads, even at peak times. It also provides a better user experience for the Wallapop data platform team, thanks to the self-managing capabilities that Redshift Serverless provides.

The new solution combining Redshift Serverless and data sharing allowed Wallapop to achieve better performance, cost, and ease of use.

Eduard Lopez, Wallapop Data Engineering Manager, shared the improved experience of analytics users: “Our analyst users are telling us that now ‘Looker flies.’ Insights consumption went up as a result of it without increasing costs.”

Evaluation of outcome

Wallapop started this re-architecture effort by first testing the isolation of their BI consumption workload with Amazon Redshift data sharing and Redshift Serverless, with the support of AWS. The workload was tested using different base RPU configurations in Redshift Serverless. The base RPU for Redshift Serverless ranges from 8 to 512. Wallapop tested their BI workload with two configurations, 32 base RPUs and 64 base RPUs, after enabling data sharing from their Redshift provisioned cluster to ensure the serverless endpoints had access to the necessary datasets.

Based on baseline results measured during the week before testing, the main area for improvement was queries that took longer than 10 seconds to complete (52%), represented by the yellow, orange, and red areas of the following chart, as well as the long-running queries represented by the red area (over 600 seconds, 9%).

The first test of this workload on Redshift Serverless with a 64 base RPU configuration immediately showed performance improvements: queries running longer than 10 seconds were reduced by 38%, and long-running queries (over 120 seconds) were almost completely eliminated.

Javier Carbajo, Wallapop Data Engineer, says, “Providing a service without downtime or loading times that are too long was one of our main requirements since we couldn’t have analysts or stakeholders without being able to consult the data.”

Following the first set of results, Wallapop also tested a Redshift Serverless configuration using 32 base RPUs to compare the results and select the configuration that could offer them the best price-performance for this workload. With this configuration, the results were similar to the previous test run on Redshift Serverless with 64 base RPUs (still showing significant performance improvement over the original results). Based on the tests, this configuration was selected for the new architecture.

Gergely Kajtár, Wallapop Data Engineer, says, “We noticed a significant increase in the daily workflows’ stability after the change to the new Redshift architecture.”

Following this first workload, Wallapop has continued expanding their Amazon Redshift distributed architecture with CI/CD workloads running on a separated Redshift Serverless endpoint using data sharing with their Redshift provisioned (RA3) cluster.

“With the new Redshift architecture, we have noticed remarkable improvements both in speed and stability. That has translated into an increase of 2 times in analytical queries, not only by analysts and data scientists but from other roles as well such as marketing, engineering, C-level, etc. That proves that investing in a scalable architecture like Redshift Serverless has a direct consequence on accelerating the adoption of data as decision-making driver in the organization.”

– Nicolás Herrero, Wallapop Director of Data & Analytics.

Conclusion

In this post, we showed you how this platform can help Wallapop scale in the future by adding new consumers when new needs or applications require access to the data.

If you’re new to Amazon Redshift, you can explore demos, other customer stories, and the latest features at Amazon Redshift. If you’re already using Amazon Redshift, reach out to your AWS account team for support, and learn more about what’s new with Amazon Redshift.


About the Authors

Eduard Lopez is the Data Engineering Manager at Wallapop. He is a software engineer with over 6 years of experience in data engineering, machine learning, and data science.

Daniel Martinez is a Solutions Architect in Iberia Digital Native Businesses (DNB), part of the worldwide commercial sales organization (WWCS) at AWS.

Jordi Montoliu is a Sr. Redshift Specialist in EMEA, part of the worldwide specialist organization (WWSO) at AWS.

Ziad Wali is an Acceleration Lab Solutions Architect at Amazon Web Services. He has over 10 years of experience in databases and data warehousing, where he enjoys building reliable, scalable, and efficient solutions. Outside of work, he enjoys sports and spending time in nature.

Semir Naffati is a Sr. Redshift Specialist Solutions Architect in EMEA, part of the worldwide specialist organization (WWSO) at AWS.

The serverless attendee’s guide to AWS re:Invent 2023

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/compute/the-serverless-attendees-guide-to-aws-reinvent-2023/

AWS re:Invent 2023 is fast approaching, bringing together tens of thousands of Builders in Las Vegas in November. However, even if you can’t attend in person, you can catch up with sessions on-demand.

Breakout sessions are lecture-style, 60-minute informative sessions presented by AWS experts, customers, or partners. These sessions cover topics from beginner (100 level) to advanced and expert (300–400 level). The sessions are recorded and uploaded to the AWS Events YouTube channel a few days after the event.

This post shares the “must watch” breakout sessions related to serverless architectures and services.

Sessions related to serverless architecture

SVS401 | Best practices for serverless developers
Provides architectural best practices, optimizations, and useful shortcuts that experts use to build secure, high-scale, and high-performance serverless applications.

Chris Munns, Startup Tech Leader, AWS
Julian Wood, Principal Developer Advocate, AWS

SVS305 | Refactoring to serverless
Shows how you can refactor your application to serverless with real-life examples.

Gregor Hohpe, Senior Principal Evangelist, AWS
Sindhu Pillai, Senior Solutions Architect, AWS

SVS308 | Building low-latency, event-driven applications
Explores building serverless web applications for low latency and event-driven support. The team behind Marvel Snap shares how they achieve low latency in the game using serverless technology.

Marcia Villalba, Principal Developer Advocate, AWS
Brenna Moore, Second Dinner

SVS309 | Improve productivity by shifting more responsibility to developers
Learn about approaches to accelerate serverless development with faster feedback cycles, exploring best practices and tools. Watch a live demo featuring an improved developer experience for building serverless applications while complying with enterprise governance requirements.

Heeki Park, Principal Solutions Architect, AWS
Sam Dengler, Capital One

GBL203-ES | Building serverless-first applications with MAPFRE
This session is delivered in Spanish. Learn what modern, serverless-first applications are and how to implement them with services such as AWS Lambda or AWS Fargate. Find out how MAPFRE have adopted and implemented a serverless strategy.

Jesus Bernal, Senior Solutions Architect, AWS
Iñigo Lacave, MAPFRE
Mat Jovanovic, MAPFRE

Sessions related to AWS Lambda

BOA311 | Unlocking serverless web applications with AWS Lambda Web Adapter
Learn about the AWS Lambda Web Adapter and how it integrates with familiar frameworks and tools. Find out how to migrate existing web applications to serverless or create new applications using AWS Lambda.

Betty Zheng, Senior Developer Advocate, AWS
Harold Sun, Senior Solutions Architect, AWS

OPN305 | The pragmatic serverless Python developer
Covers an opinionated approach to setting up a serverless Python project, including testing, profiling, deployments, and operations. Learn about many open source tools, including Powertools for AWS Lambda—a toolkit that can help you implement serverless best practices and increase developer velocity.

Heitor Lessa, Principal Solutions Architect, AWS
Ran Isenberg, CyberArk

XNT301 | Build production-ready serverless .NET apps with AWS Lambda
Explores development and architectural best practices when building serverless applications with .NET and AWS Lambda, including when to run ASP.NET on Lambda, code structure, and using native AOT to massively increase performance.

James Eastham, Senior Cloud Architect, AWS
Craig Bossie, Solutions Architect, AWS

COM306 | “Rustifying” serverless: Boost AWS Lambda performance with Rust
Discover how to deploy Rust functions using AWS SAM and cargo-lambda, facilitating a smooth development process from your local machine. Explore how to integrate Rust into Python Lambda functions effortlessly using tools like PyO3 and maturin, along with the AWS SDK for Rust. Uncover how Rust can optimize Lambda functions, including the development of Lambda extensions, all without requiring a complete rewrite of your existing code base.

Efi Merdler-Kravitz, Cloudex

COM305 | Demystifying and mitigating AWS Lambda cold starts
Examines the Lambda initialization process at a low level, using benchmarks to compare common architectural patterns and then benchmarking various RAM configurations and payload sizes. It then measures and discusses common mistakes that can increase initialization latency, explores proactive initialization, and covers several strategies you can use to thaw your AWS Lambda cold starts.

AJ Stuyvenberg, Datadog

Sessions related to event-driven architecture

API302 | Building next gen applications with event driven architecture
Learn about common integration patterns and discover how you can use AWS messaging services to connect microservices and coordinate data flow using minimal custom code. Learn how to plan for idempotency, handle duplicate events, and build resiliency into your architectures.

Eric Johnson, Principal Developer Advocate, AWS

API303 | Navigating the journey of serverless event-driven architecture
Learn about the journey businesses undertake when adopting EDAs, from initial design and implementation to ongoing operation and maintenance. The session highlights the many benefits EDAs can offer organizations and focuses on areas of EDA that are challenging and often overlooked. Through a combination of patterns, best practices, and practical tips, this session provides a comprehensive overview of the opportunities and challenges of implementing EDAs and helps you understand how you can use them to drive business success.

David Boyne, Senior Developer Advocate, AWS

API309 | Advanced integration patterns and trade-offs for loosely coupled apps
In this session, learn about common design trade-offs for distributed systems, how to navigate them with design patterns, and how to embed those patterns in your cloud automation.

Dirk Fröhner, Principal Solutions Architect, AWS
Gregor Hohpe, Senior Principal Evangelist, AWS

SVS205 | Getting started building serverless event-driven applications
Learn about the process of prototyping a solution from concept to a fully featured application that uses Amazon API Gateway, AWS Lambda, Amazon EventBridge, AWS Step Functions, Amazon DynamoDB, AWS Application Composer, and more. Learn why serverless is a great tool set for experimenting with new ideas and how the extensibility and modularity of serverless applications allow you to start small and quickly make your idea a reality.

Emily Shea, Head of Application Integration Go-to-Market, AWS
Naren Gakka, Solutions Architect, AWS

API206 | Bringing workloads together with event-driven architecture
Attend this session to learn the steps to bring your existing container workloads closer together using event-driven architecture with minimal code changes and a high degree of reusability. Using a real-life business example, this session walks through a demo to highlight the power of this approach.

Dhiraj Mahapatro, Principal Solutions Architect, AWS
Nicholas Stumpos, JPMorgan Chase & Co

COM301 | Advanced event-driven patterns with Amazon EventBridge
Gain an understanding of the characteristics of EventBridge and how it plays a pivotal role in serverless architectures. Learn the primary elements of event-driven architecture and some of the best practices. With real-world use cases, explore how the features of EventBridge support implementing advanced architectural patterns in serverless.

Sheen Brisals, The LEGO Group

Sessions related to serverless APIs

SVS301 | Building APIs: Choosing the best API solution and strategy for your workloads
Learn about access patterns and how to evaluate the best API technology for your applications. The session considers the features and benefits of Amazon API Gateway, AWS AppSync, Amazon VPC Lattice, and other options.

Josh Kahn, Tech Leader Serverless, AWS
Arthi Jaganathan, Principal Solutions Architect, AWS

SVS323 | I didn’t know Amazon API Gateway did that
This session provides an introduction to Amazon API Gateway and the problems it solves. Learn about the moving parts of API Gateway and how it works, including common and not-so-common use cases. Discover why you should use API Gateway and what it can do.

Eric Johnson, Principal Developer Advocate, AWS

FWM201 | What’s new with AWS AppSync for enterprise API developers
Join this session to learn about all the exciting new AWS AppSync features released this year that make it even more seamless for API developers to realize the benefits of GraphQL for application development.

Michael Liendo, Senior Developer Advocate, AWS
Brice Pellé, Principal Product Manager, AWS

FWM204 | Implement real-time event patterns with WebSockets and AWS AppSync
Learn how the PGA Tour uses AWS AppSync to deliver real-time event updates to their app users; review new features, like enhanced filtering options and native integration with Amazon EventBridge; and provide a sneak peek at what’s coming next.

Ryan Yanchuleff, Senior Solutions Architect, AWS
Bill Fine, Senior Product Manager, AWS
David Provan, PGA Tour

Sessions related to AWS Step Functions

API401 | Advanced workflow patterns and business processes with AWS Step Functions
Learn about architectural best practices, repeatable patterns for building workflows, and cost optimizations, and discover handy cheat codes that you can use to build secure, high-scale, high-performance serverless applications.

Ben Smith, Principal Developer Advocate, AWS

BOA304 | Using AI and serverless to automate video production
Learn how to use Step Functions to build workflows using AI services and how to use Amazon EventBridge real-time events.

Marcia Villalba, Principal Developer Advocate, AWS

SVS204 | Building Serverlesspresso: Creating event-driven architectures
This session explores the design decisions that were made when building Serverlesspresso, how new features influenced the development process, and lessons learned when creating a production-ready application using this approach. Explore useful patterns and options for extensibility that helped in the design of a robust, scalable solution that costs about one dollar per day to operate. This session includes examples you can apply to your serverless applications and complex architectural challenges for larger applications.

James Beswick, Senior Manager Developer Advocacy, AWS

API310 | Scale interactive data analysis with Step Functions Distributed Map
Learn how to build a data processing or other automation once and readily scale it to thousands of parallel processes with serverless technologies. Explore how this approach simplifies development and error handling while improving speed and lowering cost. Hear from an AWS customer that refactored an existing machine learning application to use Distributed Map and the lessons they learned along the way.

Adam Wagner, Principal Solutions Architect, AWS
Roberto Iturralde, Vertex Pharmaceuticals

Sessions related to handling data using serverless services and serverless databases

SVS307 | Scaling your serverless data processing with Amazon Kinesis and Kafka
Explore how to build scalable data processing applications using AWS Lambda. Learn practical insights into integrating Lambda with Amazon Kinesis and Apache Kafka using their event-driven models for real-time data streaming and processing.

Julian Wood, Principal Developer Advocate, AWS

DAT410 | Advanced data modeling with Amazon DynamoDB
This session shows you advanced techniques to get the most out of DynamoDB. Learn how to “think in DynamoDB” by learning the DynamoDB foundations and principles for data modeling. Learn practical strategies and DynamoDB features to handle difficult use cases in your application.

Alex DeBrie, Independent Consultant

COM308 | Serverless data streaming: Amazon Kinesis Data Streams and AWS Lambda
Explore the intricacies of creating scalable, production-ready data streaming architectures using Kinesis Data Streams and Lambda. Delve into tips and best practices essential to navigating the challenges and pitfalls inherent to distributed systems that arise along the way, and observe how AWS services work and interact.

Anahit Pogosova, Solita

Additional resources

If you are attending the event, there are many chalk talks, workshops, and other sessions to visit. See Serverless Land for a full list of all the serverless sessions, and also check out Serverless Hero Danielle Heberling’s re:Invent attendee guide for her top picks.

Visit us in the AWS Village in the Expo Hall where you can find the Serverless and Containers booth and enjoy a free cup of coffee at Serverlesspresso.

For more serverless learning resources, visit Serverless Land.

Enhanced Amazon CloudWatch metrics for Amazon EventBridge

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/enhanced-amazon-cloudwatch-metrics-for-amazon-eventbridge/

This post is written by Vaibhav Shah, Sr. Solutions Architect.

Customers use event-driven architectures to orchestrate and automate their event flows from producers to consumers. Amazon EventBridge acts as a serverless event router for various targets based on event rules. It decouples the producers and consumers, allowing customers to build asynchronous architectures.

EventBridge provides metrics to enable you to monitor your events. Some of these metrics include the number of partner events ingested, the number of invocations that failed permanently, the number of times a target is invoked by a rule in response to an event, and the number of events that matched with any rule.

In response to customer requests, EventBridge has added additional metrics that allow customers to monitor their events and provide additional visibility. This blog post explains these new capabilities.

What’s new?

EventBridge has added new metrics around API calls, events, and invocations. These metrics give you insights into the total number of events published, successful and failed events, the number of events matched with any rule or a specific rule, events rejected because of throttling, latency, and invocation-based metrics.

This allows you to track the entire span of event flow within EventBridge and quickly identify and resolve issues as they arise.

EventBridge now has the following metrics:

  • PutEventsLatency: The time taken per PutEvents API operation. Dimensions: None. Units: Milliseconds.
  • PutEventsRequestSize: The size of the PutEvents API request in bytes. Dimensions: None. Units: Bytes.
  • MatchedEvents: Number of events that matched with any rule, or with a specific rule. Dimensions: None, RuleName, EventBusName, EventSourceName. Units: Count.
  • ThrottledRules: The number of times rule execution was throttled. Dimensions: None, RuleName. Units: Count.
  • PutEventsApproximateCallCount: Approximate total number of PutEvents API calls. Dimensions: None. Units: Count.
  • PutEventsApproximateThrottledCount: Approximate number of throttled PutEvents API calls. Dimensions: None. Units: Count.
  • PutEventsApproximateFailedCount: Approximate number of failed PutEvents API calls. Dimensions: None. Units: Count.
  • PutEventsApproximateSuccessCount: Approximate number of successful PutEvents API calls. Dimensions: None. Units: Count.
  • PutEventsEntriesCount: The number of event entries contained in a PutEvents request. Dimensions: None. Units: Count.
  • PutEventsFailedEntriesCount: The number of event entries contained in a PutEvents request that failed to be ingested. Dimensions: None. Units: Count.
  • PutPartnerEventsApproximateCallCount: Approximate total number of PutPartnerEvents API calls (visible in the partner’s account). Dimensions: None. Units: Count.
  • PutPartnerEventsApproximateThrottledCount: Approximate number of throttled PutPartnerEvents API calls (visible in the partner’s account). Dimensions: None. Units: Count.
  • PutPartnerEventsApproximateFailedCount: Approximate number of failed PutPartnerEvents API calls (visible in the partner’s account). Dimensions: None. Units: Count.
  • PutPartnerEventsApproximateSuccessCount: Approximate number of successful PutPartnerEvents API calls (visible in the partner’s account). Dimensions: None. Units: Count.
  • PutPartnerEventsEntriesCount: The number of event entries contained in a PutPartnerEvents request. Dimensions: None. Units: Count.
  • PutPartnerEventsFailedEntriesCount: The number of event entries contained in a PutPartnerEvents request that failed to be ingested. Dimensions: None. Units: Count.
  • PutPartnerEventsLatency: The time taken per PutPartnerEvents API operation (visible in the partner’s account). Dimensions: None. Units: Milliseconds.
  • InvocationsCreated: Number of times a target is invoked by a rule in response to an event. One invocation attempt represents a single count for this metric. Dimensions: None. Units: Count.
  • InvocationAttempts: Number of times EventBridge attempted invoking a target. Dimensions: None. Units: Count.
  • SuccessfulInvocationAttempts: Number of times a target was successfully invoked. Dimensions: None. Units: Count.
  • RetryInvocationAttempts: The number of times a target invocation has been retried. Dimensions: None. Units: Count.
  • IngestiontoInvocationStartLatency: The time to process events, measured from when an event is ingested by EventBridge to the first invocation of a target. Dimensions: None, RuleName, EventBusName. Units: Milliseconds.
  • IngestiontoInvocationCompleteLatency: The time taken from event ingestion to completion of the first successful invocation attempt. Dimensions: None, RuleName, EventBusName. Units: Milliseconds.

Use-cases for these metrics

These new metrics help you improve observability and monitoring of your event-driven applications. You can proactively monitor metrics that help you understand the event flow, invocations, latency, and service utilization. You can also set up alerts on specific metrics and take necessary actions, which help improve your application performance, proactively manage quotas, and improve resiliency.

Monitor service usage based on Service Quotas

The PutEventsApproximateCallCount metric in the events family helps you identify the approximate number of PutEvents API calls used to publish events on the event bus. The PutEventsApproximateSuccessCount metric shows the approximate number of events successfully published on the event bus.

Similarly, you can monitor throttled and failed calls with PutEventsApproximateThrottledCount and PutEventsApproximateFailedCount, respectively. These metrics allow you to monitor whether you are approaching your quota for PutEvents. You can create a CloudWatch alarm with a threshold close to your account quota and, if it is triggered, send notifications using Amazon SNS to your operations team so they can request a Service Quotas increase.
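As a hedged sketch of that alarm (the alarm name, threshold, period, and SNS topic ARN are assumptions, not values from this post), you could create it programmatically with the AWS SDK for Python (Boto3):

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when throttled PutEvents calls exceed an assumed threshold,
# notifying an existing (hypothetical) SNS topic.
cloudwatch.put_metric_alarm(
    AlarmName="PutEventsThrottledCallsAlarm",
    Namespace="AWS/Events",
    MetricName="PutEventsApproximateThrottledCount",
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=5,
    Threshold=100,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:ops-notifications"],
)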

You can also set an alarm on the PutEvents throttle limit in transactions per second service quota.

  1. Navigate to the Service Quotas console. On the left pane, choose AWS services, search for EventBridge, and select Amazon EventBridge (CloudWatch Events).
  2. In the Monitoring section, you can monitor the percentage utilization of the PutEvents throttle limit in transactions per second.
  3. Go to the Alarms tab, and choose Create alarm. In Alarm threshold, choose 80% of the applied quota value from the dropdown. Set the Alarm name to PutEventsThrottleAlarm, and choose Create.
  4. To be notified if this threshold is breached, navigate to the Amazon CloudWatch Alarms console and choose PutEventsThrottleAlarm.
  5. Select the Actions dropdown from the top right corner, and choose Edit.
  6. On the Specify metric and conditions page, under Conditions, make sure that the Threshold type is Static and the condition is set to Greater/Equal than 80% utilization. Choose Next.
  7. Configure actions to send notifications to an Amazon SNS topic and choose Next.
  8. The Alarm name should already be set to PutEventsThrottleAlarm. Choose Next, then choose Update alarm.

This helps you get notified when the percentage utilization of the PutEvents throttle limit in transactions per second approaches the threshold you set. You can then request Service Quotas increases if required.

Similarly, you can also create CloudWatch alarms on the percentage utilization of the Invocations throttle limit in transactions per second against the service quota.

Enhanced observability

The PutEventsLatency metric shows the time taken per PutEvents API operation. There are two additional latency metrics: IngestiontoInvocationStartLatency and IngestiontoInvocationCompleteLatency. The first shows the time to process events, measured from when an event is ingested by EventBridge to the first invocation of a target. The second shows the time taken from event ingestion to completion of the first successful invocation attempt.

This helps identify latency-related issues from the time of ingestion until the event reaches the target, based on the RuleName dimension. If there is high latency, these two metrics give you visibility into the issue, allowing you to take appropriate action.

You can set thresholds on these metrics, and if a threshold is breached, the actions you define can help recover from potential failures. One such action can be to send subsequent events to EventBridge in a secondary Region using EventBridge global endpoints.

Sometimes, events are not delivered to the target specified in the rule. This can happen because the target resource is unavailable, you don’t have permission to invoke the target, or there are network issues. In such scenarios, EventBridge retries sending these events to the target for up to 24 hours or up to 185 times, both of which are configurable.

The new RetryInvocationAttempts metric shows the number of times EventBridge has retried invoking the target. Retries happen when requests are throttled, the target service has availability issues, there are network issues, or there are service failures. This provides additional observability and can be used to trigger a CloudWatch alarm to notify teams if a defined threshold is crossed. If the retries are exhausted, you can store the failed events in an Amazon SQS dead-letter queue to process them at a later time.

In addition to these, EventBridge supports additional dimensions, such as DetailType, Source, and RuleName, for the MatchedEvents metric. This helps you monitor the number of matched events coming from different sources.

  1. Navigate to the Amazon CloudWatch console. On the left pane, choose Metrics, then All metrics.
  2. In the Browse section, select Events, then Source.
  3. From the Graphed metrics tab, you can monitor matched events coming from different sources; a programmatic sketch of the same query follows these steps.
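The following sketch retrieves the same data programmatically; the source value aws.ec2 is only an example, and the exact dimension combination published for your events may differ:

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

# Sum of matched events over the last hour for one (example) source.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Events",
    MetricName="MatchedEvents",
    Dimensions=[{"Name": "Source", "Value": "aws.ec2"}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Sum"],
)
print(stats["Datapoints"])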

Failover events to secondary Region

The PutEventsFailedEntriesCount metric shows the number of events that failed ingestion. Monitor this metric and set a CloudWatch alarm. If it crosses a defined threshold, you can then take appropriate action.

Also, set an alarm on the PutEventsApproximateThrottledCount metric, which shows the number of events rejected because of throttling constraints. For these event ingestion failures, the client must resend the failed events to the event bus, allowing you to process every event that is critical for your application.
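A minimal sketch of that client-side retry (the event bus name, source, and payload are hypothetical) could look like this:

import time
import boto3

events = boto3.client("events")

# Hypothetical bus, source, and payload; resend entries that fail ingestion.
entries = [
    {
        "EventBusName": "orders-bus",
        "Source": "com.example.orders",
        "DetailType": "OrderCreated",
        "Detail": '{"orderId": "1234"}',
    }
]

for attempt in range(5):
    response = events.put_events(Entries=entries)
    if response["FailedEntryCount"] == 0:
        break
    # Results are positional: keep only the entries whose result has an error.
    entries = [
        entry
        for entry, result in zip(entries, response["Entries"])
        if "ErrorCode" in result
    ]
    time.sleep(2 ** attempt)  # simple backoff before retrying the failures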

Alternatively, send events to EventBridge service in the secondary Region using Amazon EventBridge global endpoints to improve resiliency of your event-driven applications.

Conclusion

This blog shows how to use these new metrics to improve the visibility of event flows in your event-driven applications. They help you monitor events more effectively, from ingestion through delivery to the target, and improve observability by proactively alerting on key metrics.

For more serverless learning resources, visit Serverless Land.

Introducing the Amazon Linux 2023 runtime for AWS Lambda

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/introducing-the-amazon-linux-2023-runtime-for-aws-lambda/

This post is written by Rakshith Rao, Senior Solutions Architect.

AWS Lambda now supports Amazon Linux 2023 (AL2023) as a managed runtime and container base image. Named provided.al2023, this runtime provides an OS-only environment to run your Lambda functions.

It is based on the Amazon Linux 2023 minimal container image release and has several improvements over Amazon Linux 2 (AL2), such as a smaller deployment footprint, updated versions of libraries like glibc, and a new package manager.

What are OS-only Lambda runtimes?

Lambda runtimes define the execution environment where your function runs. They provide the OS, language support, and additional settings such as environment variables and certificates.

Lambda provides managed runtimes for Java, Python, Node.js, .NET, and Ruby. However, if you want to develop your Lambda functions in programming languages that are not supported by Lambda’s managed language runtimes, the ‘provided’ runtime family provides an OS-only environment in which you can run code written in any language. This release extends the provided runtime family to support Amazon Linux 2023.

Customers use these OS-only runtimes in three common scenarios. First, they are used with languages that compile to native code, such as Go, Rust, C++, .NET Native AOT, and Java GraalVM Native. Since you only upload the compiled binary to Lambda, these languages do not require a dedicated language runtime; they only require an OS environment in which the binary can run.

Second, the OS-only runtimes also enable building third-party language runtimes that you can use off the shelf. For example, you can write Lambda functions in PHP using Bref, or Swift using the Swift AWS Lambda Runtime.

Third, you can use the OS-only runtimes to deploy custom runtimes, which you build for a language or language version for which Lambda does not provide a managed runtime, such as Node.js 19 (Lambda only provides managed runtimes for LTS releases, which for Node.js are the even-numbered releases).

New in Amazon Linux 2023 base image for Lambda

Updated packages

AL2023 base image for Lambda is based on the AL2023-minimal container image and includes various package updates and changes compared with provided.al2.

The version of glibc in the AL2023 base image has been upgraded to 2.34, from 2.26 that was bundled in the AL2 base image. Some libraries that developers wanted to use in provided runtimes required newer versions of glibc. With this launch, you can now use an up-to-date version of glibc with your Lambda function.

The AL2 base image for Lambda came pre-installed with Python 2.7. This was needed because Python was a required dependency for some of the packages that were bundled in the base image. The AL2023 base image for Lambda has removed this dependency on Python 2.7 and does not come with any pre-installed language runtime. You are free to choose and install any compatible Python version that you need.

Since the AL2023 base image for Lambda is based on the AL2023-minimal distribution, you also benefit from a significantly smaller deployment footprint. The new image is less than 40MB compared to the AL2-based base image, which is over 100MB in size. You can find the full list of packages available in the AL2023 base image for Lambda in the “minimal container” column of the AL2023 package list documentation.

Package manager

Amazon Linux 2023 uses dnf as the package manager, replacing yum, which was the default package manager in Amazon Linux 2. AL2023 base image for Lambda uses microdnf as the package manager, which is a standalone implementation of dnf based on libdnf and does not require extra dependencies such as Python. microdnf in provided.al2023 is symlinked as dnf. Note that microdnf does not support all options of dnf. For example, you cannot install a remote rpm using the rpm’s URL or install a local rpm file. Instead, you can use the rpm command directly to install such packages.

This example Dockerfile shows how you can install packages using dnf while building a container-based Lambda function:

# Use the Amazon Linux 2023 Lambda base image
FROM public.ecr.aws/lambda/provided.al2023

# Install the required Python version
RUN dnf install -y python3

Runtime support

With the launch of provided.al2023, you can migrate your AL2 custom runtime-based Lambda functions right away. It also sets the foundation for future Lambda managed runtimes. Future releases of managed language runtimes, such as Node.js 20, Python 3.12, Java 21, and .NET 8, are based on Amazon Linux 2023 and will use provided.al2023 as the base image.

Changing runtimes and using other compute services

Previously, the provided.al2 base image was built as a custom image that used a selection of packages from AL2. It included packages like curl and yum that were needed to build functions using a custom runtime. Also, each managed language runtime used different packages based on the use case.

Since future releases of managed runtimes use provided.al2023 as the base image, they contain the same set of base packages that come with AL2023-minimal. This simplifies migrating your Lambda function from a custom runtime to a managed language runtime. It also makes it easier to switch to other compute services, such as AWS Fargate or Amazon Elastic Container Service (Amazon ECS), to run your application.

Upgrading from AL1-based runtimes

For more information on Lambda runtime deprecation, see Lambda runtimes.

AL1 end of support is scheduled for December 31, 2023. The AL1-based runtimes go1.x, java8, and provided will be deprecated from this date. You should migrate your Go-based Lambda functions to the provided runtime family, such as provided.al2 or provided.al2023. Using a provided runtime offers several benefits over the go1.x runtime. First, you can run your Lambda functions on AWS Graviton2 processors, which offer up to 34% better price-performance compared to functions running on x86_64 processors. Second, it offers a smaller deployment package and faster function invoke path. And third, it aligns Go with other languages that also compile to native code and run on the provided runtime family.

The deprecation of the Amazon Linux 1 (AL1) base image for Lambda is also scheduled for December 31, 2023. With provided.al2023 now generally available, you should start planning the migration of your go1.x and AL1 based Lambda functions to provided.al2023.
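As a simple illustration of that migration step (the function name is hypothetical, and the redeployed package must already contain a compiled executable named bootstrap), the runtime can be switched with the AWS SDK for Python (Boto3):

import boto3

lambda_client = boto3.client("lambda")

# Hypothetical function name; the deployment package must ship a compiled
# executable named "bootstrap" for the OS-only runtime to start it.
lambda_client.update_function_configuration(
    FunctionName="my-go-function",
    Runtime="provided.al2023",
    Handler="bootstrap",
)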

Using the AL2023 base image for Lambda

To build Lambda functions using a custom runtime, follow these steps using the provided.al2023 runtime.

AWS Management Console

Navigate to the Create function page in the Lambda console. To use the AL2023 custom runtime, select Provide your own bootstrap on Amazon Linux 2023 as the Runtime value.

AWS Serverless Application Model (AWS SAM) template

If you use the AWS SAM template to build and deploy your Lambda function, use provided.al2023 as the value of the Runtime property:

Resources:
  HelloWorldFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: hello-world/
      Handler: my.bootstrap.file
      Runtime: provided.al2023

Building Lambda functions that compile natively

Lambda’s custom runtime support simplifies building functions in languages that compile to native code, broadening the range of languages you can use. Lambda provides the Runtime API, an HTTP API that custom runtimes use to interact with the Lambda service. Implementations of this API, called Runtime Interface Clients (RICs), allow your function to receive invocation events from Lambda, send the response back to Lambda, and report errors to the Lambda service. RICs are available as language-specific libraries for several popular programming languages, such as Go, Rust, Python, and Java.
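To make the Runtime API loop concrete, here is a minimal, illustrative sketch in Python of what a RIC does under the hood; it is not a production implementation, and the handler logic here simply echoes the event back:

import json
import os
import urllib.request

# The Lambda service exposes the Runtime API endpoint via this variable.
runtime_api = os.environ["AWS_LAMBDA_RUNTIME_API"]
base_url = f"http://{runtime_api}/2018-06-01/runtime"

while True:
    # 1. Fetch the next invocation event (long-polls until one arrives).
    with urllib.request.urlopen(f"{base_url}/invocation/next") as resp:
        request_id = resp.headers["Lambda-Runtime-Aws-Request-Id"]
        event = json.loads(resp.read())

    # 2. Run the handler logic (here, simply echo the event back).
    result = json.dumps({"echo": event}).encode()

    # 3. Post the response for this request ID back to the Lambda service.
    req = urllib.request.Request(
        f"{base_url}/invocation/{request_id}/response", data=result, method="POST"
    )
    urllib.request.urlopen(req)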

For example, you can build functions using Go as shown in the Building with Go section of the Lambda developer documentation. Note that the name of the executable file of your function should always be bootstrap in provided.al2023 when using the zip deployment model. To use AL2023 in this example, use provided.al2023 as the runtime for your Lambda function.

If you are using the AWS CLI, set the --runtime option to provided.al2023:

aws lambda create-function --function-name myFunction \
--runtime provided.al2023 --handler bootstrap \
--role arn:aws:iam::111122223333:role/service-role/my-lambda-role \
--zip-file fileb://myFunction.zip

If you are using AWS Serverless Application Model, use provided.al2023 as the value of the Runtime in your AWS SAM template file:

AWSTemplateFormatVersion: '2010-09-09'
Transform: 'AWS::Serverless-2016-10-31'
Resources:
  HelloWorldFunction:
    Type: AWS::Serverless::Function
    Metadata:
      BuildMethod: go1.x
    Properties:
      CodeUri: hello-world/ # folder where your main program resides
      Handler: bootstrap
      Runtime: provided.al2023
      Architectures: [arm64]

If you run your function as a container image as shown in the Deploy container image example, use this Dockerfile. You can use any name for the executable file of your function when using container images. You need to specify the name of the executable as the ENTRYPOINT in your Dockerfile:

FROM golang:1.20 as build
WORKDIR /helloworld

# Copy dependencies list
COPY go.mod go.sum ./

# Build with optional lambda.norpc tag
COPY main.go .
RUN go build -tags lambda.norpc -o main main.go

# Copy artifacts to a clean image
FROM public.ecr.aws/lambda/provided:al2023
COPY --from=build /helloworld/main ./main
ENTRYPOINT [ "./main" ]

Conclusion

With this launch, you can now build your Lambda functions using Amazon Linux 2023 as the custom runtime or use it as the base image to run your container-based Lambda functions. You benefit from the updated versions of libraries such as glibc, new package manager, and smaller deployment size than Amazon Linux 2 based runtimes. Lambda also uses Amazon Linux 2023-minimal as the basis for future Lambda runtime releases.

For more serverless learning resources, visit Serverless Land.

Scaling improvements when processing Apache Kafka with AWS Lambda

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/scaling-improvements-when-processing-apache-kafka-with-aws-lambda/

AWS Lambda is improving the automatic scaling behavior when processing data from Apache Kafka event-sources. Lambda is increasing the default number of initial consumers, improving how quickly consumers scale up, and helping to ensure that consumers don’t scale down too quickly. There is no additional action that you must take, and there is no additional cost.

Running Kafka on AWS

Apache Kafka is a popular open-source platform for building real-time streaming data pipelines and applications. You can deploy and manage your own Kafka solution on-premises or in the cloud on Amazon EC2.

Amazon Managed Streaming for Apache Kafka (MSK) is a fully managed service that makes it easier to build and run applications that use Kafka to process streaming data. MSK Serverless is a cluster type for Amazon MSK that allows you to run Kafka without having to manage and scale cluster capacity. It automatically provisions and scales capacity while managing the partitions in your topic, so you can stream data without thinking about right-sizing or scaling clusters. MSK Serverless offers a throughput-based pricing model, so you pay only for what you use. For more information, see Using Kafka to build your streaming application.

Using Lambda to consume records from Kafka

Processing streaming data can be complex in traditional, server-based architectures, especially if you must react in real-time. Many organizations spend significant time and cost managing and scaling their streaming platforms. In order to react fast, they must provision for peak capacity, which adds complexity.

Lambda and serverless architectures remove the undifferentiated heavy lifting of processing Kafka streams. You don’t have to manage infrastructure, and you can reduce operational overhead, lower costs, and scale on demand. This helps you focus more on building streaming applications. You can write Lambda functions in a number of programming languages, which provides flexibility when processing streaming data.

Lambda event source mapping

Lambda can integrate natively with your Kafka environments as a consumer to process stream data as soon as it’s generated.

To consume streaming data from Kafka, you configure an event source mapping (ESM) on your Lambda functions. This is a resource managed by the Lambda service, which is separate from your function. It continually polls records from the topics in the Kafka cluster. The ESM optionally filters records and batches them into a payload. Then, it calls the Lambda Invoke API to deliver the payload to your Lambda function synchronously for processing.
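As a sketch (the function name, cluster ARN, and topic are placeholders), an ESM for an Amazon MSK topic could be created with the AWS SDK for Python (Boto3):

import boto3

lambda_client = boto3.client("lambda")

# Hypothetical identifiers; creates an event source mapping that polls an
# Amazon MSK topic and invokes the function with batches of records.
lambda_client.create_event_source_mapping(
    FunctionName="my-kafka-consumer",
    EventSourceArn="arn:aws:kafka:us-east-1:111122223333:cluster/demo/abc-123",
    Topics=["msk-1000p"],
    StartingPosition="TRIM_HORIZON",
    BatchSize=100,
)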

As Lambda manages the pollers, you don’t need to manage a fleet of consumers across multiple teams. Each team can create and configure their own ESM with Lambda handling the polling.

AWS Lambda event source mapping

For more information on using Lambda to process Kafka streams, see the learning guide.

Scaling and throughput

Kafka uses partitions to increase throughput and spread the load of records to all brokers in a cluster.

The Lambda event source mapping resource includes pollers and processors. Pollers have consumers that read records from Kafka partitions. The poller assigners send these records to processors, which batch the records and invoke your function.

When you create a Kafka event source mapping, Lambda allocates consumers to process all partitions in the Kafka topic. Previously, Lambda allocated a minimum of one processor for a consumer.

Lambda previous initial scaling

With these scaling improvements, Lambda allocates multiple processors to improve processing. This reduces the possibility of a single invoke slowing down the entire processing stream.

Lambda updated initial scaling

Each consumer sends records to multiple processors running in parallel to handle increased workloads. Records in each partition are only assigned to a single processor to maintain order.

Lambda automatically scales up or down the number of consumers and processors based on workload. Lambda samples the consumer offset lag of all the partitions in the topic every minute. If the lag is increasing, this means Lambda can’t keep up with processing the records from the partition.

The scaling algorithm accounts for the current offset lag, and also the rate of new messages added to the topic. Lambda can reach the maximum number of consumers within three minutes to lower the offset lag as quickly as possible. Lambda is also scaling down less aggressively to ensure records are processed more quickly and latency is reduced, particularly for bursty workloads.

Total processors for all pollers can only scale up to the total number of partitions in the topic.

After successful invokes, the poller periodically commits offsets to the respective brokers.

Lambda further scaling

You can monitor the throughput of your Kafka topic using consumer metrics consumer_lag and consumer_offset.

To check how many function invocations occur in parallel, you can also monitor the concurrency metrics for your function. The concurrency is approximately equal to the total number of processors across all pollers, depending on processor activity. For example, if three pollers have five processors running for a given ESM, the function concurrency would be approximately 15 (5 + 5 + 5).

Seeing the improved scaling in action

There are a number of Serverless Patterns that you can use to process Kafka streams using Lambda. To set up Amazon MSK Serverless, follow the instructions in the GitHub repo:

  1. Create an example Amazon MSK Serverless topic with 1000 partitions:

     ./kafka-topics.sh --create --bootstrap-server "{bootstrap-server}" --command-config client.properties --replication-factor 3 --partitions 1000 --topic msk-1000p

  2. Add records to the topic using a UUID as a key to distribute records evenly across partitions. This example adds 13 million records:

     for x in {1..13000000}; do echo $(uuidgen -r),message_$x; done | ./kafka-console-producer.sh --broker-list "{bootstrap-server}" --topic msk-1000p --producer.config client.properties --property parse.key=true --property key.separator=, --producer-property acks=all

  3. Create a Python function based on this pattern, which logs the processed records.
  4. Amend the function code to insert a delay of 0.1 seconds to simulate record processing:

    import json
    import base64
    import time
    
    def lambda_handler(event, context):
        # Define a variable to keep track of the number of the message in the batch of messages
        i=1
        # Looping through the map for each key (combination of topic and partition)
        for record in event['records']:
            for messages in event['records'][record]:
                print("********************")
                print("Record number: " + str(i))
                print("Topic: " + str(messages['topic']))
                print("Partition: " + str(messages['partition']))
                print("Offset: " + str(messages['offset']))
                print("Timestamp: " + str(messages['timestamp']))
                print("TimestampType: " + str(messages['timestampType']))
                if None is not messages.get('key'):
                    b64decodedKey=base64.b64decode(messages['key'])
                    decodedKey=b64decodedKey.decode('ascii')
                else:
                    decodedKey="null"
                if None is not messages.get('value'):
                    b64decodedValue=base64.b64decode(messages['value'])
                    decodedValue=b64decodedValue.decode('ascii')
                else:
                    decodedValue="null"
                print("Key = " + str(decodedKey))
                print("Value = " + str(decodedValue))
                i=i+1
                time.sleep(0.1)
        return {
            'statusCode': 200,
        }
    
  5. Configure the ESM to point to the previously created cluster and topic.
  6. Use the default batch size of 100. Set the StartingPosition to TRIM_HORIZON to process from the beginning of the stream.
  7. Deploy the function, which also adds and configures the ESM.
  8. View the Amazon CloudWatch ConcurrentExecutions and OffsetLag metrics to observe the processing.

With the scaling improvements, once the ESM is configured, the ESM and function scale up to handle the number of partitions.

Lambda automatic scaling improvement graph

Increasing data processing throughput

It is important that your function can keep pace with the rate of traffic. A growing offset lag means that the function cannot keep up with processing. If the age of the unprocessed records is high in relation to the stream’s retention period, you can lose data as records expire from the stream.

This value should generally not exceed 50% of the stream’s retention period. When the value reaches 100% of the stream retention period, data is lost. One temporary solution is to increase the retention time of the stream. This gives you more time to resolve the issue before losing data.

There are several ways to improve processing throughput.

  1. Avoid processing unnecessary records by using content filtering to control which records Lambda sends to your function (see the filtering sketch after this list). This helps reduce traffic to your function, simplifies code, and reduces overall cost.
  2. Lambda allocates processors across all pollers based on the number of partitions up to a maximum of one concurrent Lambda function per partition. You can increase the number of processing Lambda functions by increasing the number of partitions.
  3. For compute intensive functions, you can increase the memory allocated to your function, which also increases the amount of virtual CPU available. This can help reduce the duration of a processing function.
  4. Lambda polls Kafka with a configurable batch size of records. You can increase the batch size to process more records in a single invocation. This can improve processing time and reduce costs, particularly if your function has an increased init time. A larger batch size increases the latency to process the first record in the batch, but potentially decreases the latency to process the last record in the batch. There is a tradeoff between cost and latency when optimizing a partition’s capacity and the decision depends on the needs of your workload.
  5. Ensure that your producers evenly distribute records across partitions using an effective partition key strategy. A workload is unbalanced when a single key dominates other keys, creating a hot partition, which impacts throughput.
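For the content filtering mentioned in the first item, here is a hedged sketch of attaching a filter to an existing Kafka event source mapping; the ESM UUID and filter pattern are assumptions, and value-based filtering applies only when record values are JSON:

import json
import boto3

lambda_client = boto3.client("lambda")

# Only deliver records whose (JSON) value has event_type == "purchase".
filter_pattern = json.dumps({"value": {"event_type": ["purchase"]}})

lambda_client.update_event_source_mapping(
    UUID="14e0db71-xxxx-xxxx-xxxx-example",   # hypothetical ESM identifier
    FilterCriteria={"Filters": [{"Pattern": filter_pattern}]},
)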

See Increasing data processing throughput for some additional guidance.

Conclusion

Today, AWS Lambda is improving the automatic scaling behavior when processing data from Apache Kafka event-sources. Lambda is increasing the default number of initial consumers, improving how quickly they scale up, and ensuring they don’t scale down too quickly. There is no additional action that you must take, and there is no additional cost.

You can explore the scaling improvements with your existing workloads or deploy an Amazon MSK cluster and try one of the patterns to measure processing time.

To explore using Lambda to process Kafka streams, see the learning guide.

For more serverless learning resources, visit Serverless Land.

How Gilead used Amazon Redshift to quickly and cost-effectively load third-party medical claims data

Post Syndicated from Rajiv Arora original https://aws.amazon.com/blogs/big-data/how-gilead-used-amazon-redshift-to-quickly-and-cost-effectively-load-third-party-medical-claims-data/

This post was co-written with Rajiv Arora, Director of Data Science Platform at Gilead Life Sciences.

Gilead Sciences, Inc. is a biopharmaceutical company committed to advancing innovative medicines to prevent and treat life-threatening diseases, including HIV, viral hepatitis, inflammation, and cancer. A leader in virology, Gilead historically relied on these drugs for growth. Through strategic investments, the company is now expanding its focus in oncology, having acquired Kite and Immunomedics to increase its exposure to cell therapy and non-cell therapy, making oncology a primary growth engine. Because Gilead is expanding into biologics and large molecule therapies, and has an ambitious goal of launching 10 innovative therapies by 2030, there is heavy emphasis on using data with AI and machine learning (ML) to accelerate the drug discovery pipeline.

Amazon Redshift Serverless is a fully managed cloud data warehouse that allows you to seamlessly create your data warehouse with no infrastructure management required. You pay only for the compute resources and storage that you use. Redshift Serverless measures data warehouse capacity in Redshift Processing Units (RPUs), which are part of the compute resources. All of the data stored in your warehouse, such as tables, views, and users, make up a namespace in Redshift Serverless.

One of the benefits of Redshift Serverless is that you don’t need to size your data warehouse for your peak workload. The peak workload includes loading periodic large datasets in the multi-terabyte range. You can set a base RPU from 8 up to 512, and Redshift Serverless automatically scales the RPUs to meet your workload demands. This makes it straightforward to manage your data warehouse in a cost-effective manner.

In this post, we share how Gilead collaborated with AWS to redesign their data ingestion process. They used Redshift Serverless as their data producer to load third-party medical claims data in a fast and cost-effective way, reducing load times from days to hours.

Gilead use case

Gilead loads a variety of data from hundreds of sources to their R&D data environment. They recently needed to do a monthly load of 140 TB of uncompressed healthcare claims data in under 24 hours after receiving it to provide analysts and data scientists with up-to-date information on a patient’s healthcare journey. This data volume is expected to increase monthly and is fully refreshed each month. The 3-node RA3 16XL provisioned cluster that had previously been hosting their warehouse was taking around 12 hours to ingest this data to Amazon Redshift, and Gilead was looking to optimize the data ingestion process in a more dynamic manner. Working with Amazon Redshift specialists from AWS, Gilead chose Redshift Serverless as a way to cost-effectively load this data and then use Redshift data sharing to share the final dataset to two additional Redshift data warehouses for end-user queries.

Loading data is a key process for any analytical system, including Amazon Redshift. When loading very large datasets, it’s important to not only load the data as quickly as possible but also in a way that optimizes the consumption queries.

Gilead’s healthcare claims data took 40 hours to load, which meant delays in using the data for downstream processes. The teams sought improvements, targeting a maximum 24-hour SLA for the load. They achieved the load in 8 hours, an 80% reduction in time to make data available.

Solution overview

After collaborating, the Gilead and AWS teams decided on a two-step process to load the data to Amazon Redshift. First, the data was loaded without a distkey and sortkey, which let the load process use the full parallel resources of the cluster. Then we used a deep copy to redistribute this data and add the desired distribution and sort characteristics.

The solution uses Redshift Serverless. The team wanted to ingest data to meet the required SLA, and the following approaches were benchmarked:

  • COPY command – The COPY command uses the Amazon Redshift massively parallel processing (MPP) architecture to read and load data in parallel from files on Amazon Simple Storage Service (Amazon S3)
  • Data lake analytics – Amazon Redshift Spectrum is used to query data directly from files on Amazon S3, selecting only a subset of columns and avoiding the intermediate step of copying data to a staging table

Initial Solution approach: Single COPY command

The team determined it would be more effective to apply the distribution and sort keys in a post-copy step. The data was first loaded using automatic distribution, which took roughly 12 hours to complete. The team then created open and closed claims tables with defined distribution keys and with 20% of the columns to alleviate the need to query the larger table. With this success, the team identified that the large copy operation could still be improved, as detailed in the following sections.

Proposed Solution approach 1: Parallel COPY command

Based on the initial solution approach above, the team tested yearly parallel copy commands as illustrated in the following diagram.

Yearly Parallel Copy Commands

Below are the findings and learnings from this approach:

  • Ingesting data for 4 years using parallel copy showed a 25% performance improvement over the single copy command.
  • Compared to the initial solution approach, which took 12 hours to ingest the data, we further optimized the runtime by 67% by segregating the data ingestion into separate yearly staging tables and running parallel COPY commands.
  • After the data was loaded into the yearly staging tables, we created the open and closed claims tables with an automatic distribution key and the subset of columns required for larger reporting groups. Creating these tables took an additional hour.

The team used a manifest file to make sure that the COPY command loads all of the required files for the respective year for ingesting.
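
As an illustration of this pattern, the following sketch uses the Redshift Data API to submit one COPY per year, each reading from its own manifest file into its own staging table. The workgroup, database, table names, S3 paths, and IAM role are hypothetical placeholders, and the file format options depend on the source data.

    import boto3

    redshift_data = boto3.client("redshift-data")

    # Hypothetical workgroup, database, IAM role, and S3 locations for illustration
    for year in (2019, 2020, 2021, 2022):
        sql = f"""
            COPY staging_claims_{year}
            FROM 's3://claims-raw-bucket/manifests/claims_{year}.manifest'
            IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftCopyRole'
            MANIFEST
            CSV
            GZIP;
        """
        # execute_statement is asynchronous, so the yearly COPY commands run in parallel
        response = redshift_data.execute_statement(
            WorkgroupName="claims-serverless-wg",
            Database="claims_db",
            Sql=sql,
        )
        print(year, response["Id"])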

Proposed Solution approach 2: Data Lake analytics

The team used this approach with Redshift Spectrum to load only the required columns to Redshift Serverless, which avoided loading data into multiple yearly tables and directly to a single table. The following diagram illustrates this approach.

Using Spectrum Approach

The workflow consists of the following steps:

  1. Crawl the files using AWS Glue.
  2. Create a data lake external schema and table in Redshift Serverless (see the sketch after these steps).
  3. Create two separate claims tables for open and closed claims, because open claims are the most frequently consumed, need only 20% of the columns, and span 100% of the data.
  4. Create the open and closed tables with only the selected columns needed for optimal consumption performance, instead of all columns in the original third-party dataset. The data volume distribution is as follows:
    • Total number of open claims records = 50 billion
    • Total number of closed claims records = 200 billion
    • Overall, total number of records = 250 billion
  5. Distribute open and closed tables with a customer-identified distkey.
  6. Configure data ingestion into the combined open and closed claims tables using Redshift Serverless with 512 RPUs. This took 1.5 hours, a further 70% improvement compared to approach 1. We chose 512 RPUs in order to load data in the fastest way possible.
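
A minimal sketch of steps 2 and 4, again using the Redshift Data API: it registers the AWS Glue database as an external schema and then loads only selected columns into a distributed open claims table. The schema, table, column, and role names are hypothetical, and the statements run sequentially because the second depends on the first.

    import time
    import boto3

    redshift_data = boto3.client("redshift-data")

    def run(sql):
        # Submit the statement and wait for completion before running the next one
        stmt = redshift_data.execute_statement(
            WorkgroupName="claims-serverless-wg",
            Database="claims_db",
            Sql=sql,
        )
        while True:
            status = redshift_data.describe_statement(Id=stmt["Id"])["Status"]
            if status in ("FINISHED", "FAILED", "ABORTED"):
                return status
            time.sleep(5)

    # Hypothetical Glue database, IAM role, and claims columns for illustration
    run("""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS claims_lake
        FROM DATA CATALOG
        DATABASE 'claims_glue_db'
        IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftSpectrumRole';
    """)

    run("""
        CREATE TABLE open_claims
        DISTKEY (patient_id)
        AS
        SELECT claim_id, patient_id, provider_id, claim_status, claim_amount, service_date
        FROM claims_lake.raw_claims
        WHERE claim_status = 'OPEN';
    """)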

In this method, data ingestion was streamlined by loading only the essential fields from the medical claims dataset and by splitting the table into open and closed claims. Because open claims data is the most frequently accessed and constitutes only 20% of the columns, splitting the tables improved not only ingestion performance but also consumption performance.

Amazon Redshift recently launched automatic mounting of AWS Glue Data Catalog, making it easier to run data lake analytics without manually creating external schemas. You can query data lake tables directly from Amazon Redshift Query Editor v2 or your favorite SQL editors.

Recommendations and best practices

Consider the following recommendations when loading large-scale data in Amazon Redshift.

  • Use Redshift Serverless with a maximum of 512 RPUs to load data quickly and efficiently
  • Depending on consumption use case and query pattern, adopt either of the following approaches:
    • When consumption queries require only selected fields from the dataset and most frequently access a subset of data, use data lake queries to load only the relevant columns from Amazon S3 into Amazon Redshift
    • When consumption queries require all fields, use COPY commands with a manifest file to ingest data in parallel into multiple logically separated tables and create a database view with UNION ALL of all tables
  • Avoid using VARCHAR(MAX) when creating tables; instead, size VARCHAR columns appropriately for the data they hold

Final Architecture

The following diagram shows the high-level final architecture that was implemented.

Final Architecture

Conclusion

With the scalability of Redshift Serverless, data sharing to decouple ingestion from consumption workloads, and data lake analytics to ingest data, Gilead made their 140 TB dataset available to their analysts within hours of it being delivered. The innovative architecture of using a serverless ingestion data warehouse, a serverless consumption data warehouse for power users, and their original 3-node provisioned cluster for standard queries gives Gilead isolation to ensure data loads don’t affect their users. The architecture provides scalability to serve infrequent large queries with their serverless consumer along with the benefit of a fixed-cost and fixed-performance option of their provisioned cluster for their standard user queries. Due to the monthly schedule of the data load and the variable need for large queries by consumers, Redshift Serverless proved to be a cost-effective option compared to simply increasing the provisioned cluster to serve each of these use cases.

This split producer/consumer model of using Redshift serverless can bring benefits to many workloads that have similar performance characteristics to Gilead’s warehouse. Customers regularly run large data loads infrequently, and those processes compete with user queries. With this pattern, you can rely on your queries to perform consistently regardless of whether new data is being loaded to the system. This strikes a balance between minimizing cost while maintaining performance and frees the system administrators to load data without affecting users.


About the Authors

Rajiv Arora is a Director of Clinical Data Science at Gilead Sciences with over 20 years of experience in the industry. He is responsible for the multi-modal data platform for the development organization and supports all statistical and predictive analytical infrastructure for RWE and Advanced Analytical functions.

Ritesh Kumar Sinha is an Analytics Specialist Solutions Architect based out of San Francisco. He has helped customers build scalable data warehousing and big data solutions for over 16 years. He loves to design and build efficient end-to-end solutions on AWS. In his spare time, he loves reading, walking, and doing yoga.

Raks Khare is an Analytics Specialist Solutions Architect at AWS based out of Pennsylvania. He helps customers architect data analytics solutions at scale on the AWS platform.

Brent Strong is a Senior Solutions Architect in the Healthcare and Life Sciences team at AWS. He has more than 15 years of experience in the industry, focusing on data and analytics and DevOps. At AWS, he works closely with large Life Sciences customers to help them deliver new and innovative treatments.

Phil Bates is a Senior Analytics Specialist Solutions Architect at AWS with over 25 years of data warehouse experience.

Introducing faster polling scale-up for AWS Lambda functions configured with Amazon SQS

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/introducing-faster-polling-scale-up-for-aws-lambda-functions-configured-with-amazon-sqs/

This post was written by Anton Aleksandrov, Principal Solutions Architect, and Tarun Rai Madan, Senior Product Manager.

Today, AWS is announcing that AWS Lambda supports up to five times faster polling scale-up rate for spiky Lambda workloads configured with Amazon Simple Queue Service (Amazon SQS) as an event source.

This feature enables customers building event-driven applications using Lambda and SQS to achieve more responsive scaling during a sudden burst of messages in their SQS queues, and reduces the need to duplicate Lambda functions or SQS queues to achieve faster message processing.

Overview

Customers building modern event-driven and messaging applications with AWS Lambda use Amazon SQS as a fundamental building block for creating decoupled architectures. Amazon SQS is a fully managed message queuing service for microservices, distributed systems, and serverless applications. When a Lambda function subscribes to an SQS queue as an event source, Lambda polls the queue, retrieves the messages, and sends them in batches to the function handler for processing. To consume messages efficiently, Lambda detects increases in queue depth and increases the number of poller processes to process the queued messages.

Until today, Lambda added up to 60 concurrent executions per minute for functions subscribed to SQS queues, scaling up to a maximum of 1,250 concurrent executions in approximately 20 minutes. However, customers tell us that some of the modern event-driven applications they build using Lambda and SQS are sensitive to sudden spikes in messages, which may cause noticeable delays in processing messages for end users. In order to harness the power of Lambda for applications that experience a burst of messages in SQS queues, these customers needed Lambda message polling to scale up faster.

With today’s announcement, Lambda functions that subscribe to an SQS queue can scale up to five times faster for queues that see a spike in message backlog, adding up to 300 concurrent executions per minute, and scaling up to a maximum of 1,250 concurrent executions. This scaling improvement helps to use the simplicity of Lambda and SQS integration to build event-driven applications that scale faster during a surge of incoming messages, particularly for real-time systems. It also offers customers the benefit of faster processing during spikes of messages in SQS queues, while continuing to offer the flexibility to limit the maximum concurrent Lambda invocations per SQS event source.

Controlling the maximum concurrent Lambda invocations by SQS

The new improved scaling rates are automatically applied to all AWS accounts using Lambda and SQS as an event source. There is no explicit action that you must take, and there’s no additional cost. This scaling improvement helps customers to build more performant Lambda applications where they need faster SQS polling scale-up. To prevent potentially overloading the downstream dependencies, Lambda provides customers the control to set the maximum number of concurrent executions at a function level with reserved concurrency, and event source level with maximum concurrency.

The following diagram illustrates settings that you can use to control the flow rate of an SQS event source. You use reserved concurrency to control function-level scaling, and maximum concurrency to control event source scaling.

Control the flow rate of an SQS event-source

Reserved concurrency is the maximum concurrency that you want to allocate to a function. When a function has reserved concurrency allocated, no other functions can use that concurrency.

AWS recommends using reserved concurrency when you want to ensure a function has enough concurrency to scale up. When an SQS event source is attempting to scale up concurrent Lambda invocations, but the function has already reached the threshold defined by the reserved concurrency, the Lambda service throttles further function invocations.

This may result in the SQS event source attempting to scale down, reducing the number of concurrently processed messages. Depending on the queue configuration, the throttled messages are returned to the queue for retry, expire based on the retention policy, or are sent to a dead-letter queue (DLQ) or on-failure destination.

The maximum concurrency setting allows you to control concurrency at the event source level. It allows you to define the maximum number of concurrent invocations the event source attempts to send to the Lambda function. For scenarios where a single function has multiple SQS event sources configured, you can define maximum concurrency for each event source separately, providing more granular control. When trying to add rate control to SQS event sources, AWS recommends you start evaluating maximum concurrency control first, as it provides greater flexibility.

Reserved concurrency and maximum concurrency are complementary capabilities, and can be used together. Maximum concurrency can help prevent overwhelming downstream systems and reduce throttled invocations. Reserved concurrency helps to ensure available concurrency for the function.
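
The following is a rough boto3 sketch of configuring both controls together. The function name, queue ARN, and limits are hypothetical values, not recommendations.

    import boto3

    lambda_client = boto3.client("lambda")

    FUNCTION_NAME = "order-processor"  # hypothetical function name

    # Reserve concurrency at the function level so the function always has capacity
    lambda_client.put_function_concurrency(
        FunctionName=FUNCTION_NAME,
        ReservedConcurrentExecutions=100,
    )

    # Cap how many concurrent invocations this SQS event source can drive
    lambda_client.create_event_source_mapping(
        EventSourceArn="arn:aws:sqs:us-east-1:111122223333:orders-queue",  # hypothetical queue
        FunctionName=FUNCTION_NAME,
        BatchSize=10,
        ScalingConfig={"MaximumConcurrency": 50},
    )

Maximum concurrency is set per event source, so a function with multiple queues can have a different cap for each.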

Example scenario

Consider a scenario in which your business must process large volumes of documents from storage. Once every few hours, your business partners upload large volumes of documents to S3 buckets in your account.

For resiliency, you’ve designed your application to send a message to an SQS queue for each of the uploaded documents, so you can efficiently process them without accidentally skipping any. The documents are processed using a Lambda function, which takes around two seconds to process a single document.

Processing these documents is a CPU-intensive operation, so you decide to process a single document per invocation. You want to use the power of Lambda to fan out the processing to as many concurrent execution environments as possible, scale up rapidly to process the documents in parallel as fast as possible, and scale down to zero once all documents are processed to save costs.

When a business partner uploads 200,000 documents, 200,000 messages are sent to the SQS queue. The Lambda function is configured with an SQS event source, and it starts consuming the messages from the queue.

This diagram shows the results of running the test scenario before the SQS event source scaling improvements. As expected, you can see that concurrent executions grow by 60 per minute. It takes approximately 16 minutes to scale up to 900 concurrent executions gradually and process all the messages in the queue.

Results of running the test scenario before the SQS event source scaling improvements

The following diagram shows the results of running the same test scenario after the SQS event source scaling improvements. The timeframe used for both charts is the same, but the performance on the second chart is better. Concurrent executions grow by 300 per minute. It only takes 4 minutes to scale up to 1,250 concurrent executions, and all the messages in the queue are processed in approximately 8 minutes.

The results of running the same test scenario after the SQS event source scaling improvements

Deploying this example

Use the example project to replicate this performance test in your own AWS account. Follow the instructions in README.md for provisioning the sample project in your AWS accounts using the AWS Cloud Development Kit (CDK).

This example project is configured to demonstrate a large-scale workload processing 200,000 messages. Running this sample project in your account may incur charges. See AWS Lambda pricing and Amazon SQS pricing.

Once deployed, use the application under the “sqs-cannon” directory to send 200,000 messages to the SQS queue (or reconfigure to any other number). It takes several minutes to populate the SQS queue with messages. After all messages are sent, enable the SQS event source, as described in the README.md, and monitor the charts in the provisioned CloudWatch dashboard.

The default concurrency quota for new AWS accounts is 1000. If you haven’t requested an increase in this quota, the number of concurrent executions is capped at this number. Use Service Quotas or contact your account team to request a concurrency increase.

Security best practices

Always use least-privileged permissions when granting your Lambda functions access to SQS queues. This reduces the potential attack surface by ensuring that only specific functions have permissions to perform specific actions on specific queues. For example, if your function only polls from the queue, grant it permission to read messages, but not to send new messages. A function execution role defines which actions your function is allowed to perform on other resources. A queue access policy defines the principals that can access this queue, and the actions that are allowed.

Use server-side encryption (SSE) to store sensitive data in encrypted SQS queues. With SSE, your messages are always stored in encrypted form, and SQS only decrypts them for sending to an authorized consumer. SSE protects the contents of messages in queues using SQS-managed encryption keys (SSE-SQS) or keys managed in the AWS Key Management Service (SSE-KMS).
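
The following sketch illustrates both practices: it creates a queue with SQS-managed server-side encryption (SSE-SQS) and attaches a consume-only inline policy to the function’s execution role. The queue name, role name, and account ID are hypothetical.

    import json
    import boto3

    sqs = boto3.client("sqs")
    iam = boto3.client("iam")

    # Hypothetical queue with SQS-managed server-side encryption (SSE-SQS) enabled
    queue = sqs.create_queue(
        QueueName="payments-queue",
        Attributes={"SqsManagedSseEnabled": "true"},
    )
    queue_arn = sqs.get_queue_attributes(
        QueueUrl=queue["QueueUrl"], AttributeNames=["QueueArn"]
    )["Attributes"]["QueueArn"]

    # Least-privileged inline policy: the function may only read and delete messages
    consume_only_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "sqs:ReceiveMessage",
                    "sqs:DeleteMessage",
                    "sqs:GetQueueAttributes",
                ],
                "Resource": queue_arn,
            }
        ],
    }
    iam.put_role_policy(
        RoleName="payments-function-role",  # hypothetical execution role
        PolicyName="ConsumePaymentsQueueOnly",
        PolicyDocument=json.dumps(consume_only_policy),
    )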

Conclusion

The improved Lambda SQS event source polling scale-up capability enables up to five times faster scale-up performance for spiky event-driven workloads using SQS queues, at no additional cost. This improvement offers customers the benefit of faster processing during spikes of messages in SQS queues, while continuing to offer the flexibility to limit the maximum concurrent invocations by SQS as an event source.

For more serverless learning resources, visit Serverless Land.

Orchestrating dependent file uploads with AWS Step Functions

Post Syndicated from Benjamin Smith original https://aws.amazon.com/blogs/compute/orchestrating-dependent-file-uploads-with-aws-step-functions/

This post is written by Nelson Assis, Enterprise Support Lead, Serverless and Jevon Liburd, Technical Account Manager, Serverless

Amazon S3 is an object storage service that many customers use for file storage. With the use of Amazon S3 Event Notifications or Amazon EventBridge customers can create workloads with event-driven architecture (EDA). This architecture responds to events produced when changes occur to objects in S3 buckets.

EDA involves asynchronous communication between system components. This decouples the components, allowing each component to be autonomous.

Some scenarios may introduce coupling in the architecture due to dependency between events. This blog post presents a common example of this coupling and how it can be handled using AWS Step Functions.

Overview

In this example, an organization has two distributed autonomous teams, the Sales team and the Warehouse team. Each team is responsible for uploading a monthly data file to an S3 bucket so it can be processed.

The files generate events when they are uploaded, initiating downstream processes. The processing of the Warehouse file cleans the data and joins it with data from the Shipping team. The processing of the Sales file correlates the data with the combined Warehouse and Shipping data. This enables analysts to perform forecasting and gain other insights.

For this correlation to happen, the Warehouse file must be processed before the Sales file. As the two teams are autonomous, there is no coordination among the teams. This means that the files can be uploaded at any time with no assurance that the Warehouse file is processed before the Sales file.

For scenarios like these, the Aggregator pattern can be used. The pattern collects and stores the events, and triggers a new event based on the combined events. In the described scenario, the combined events are the processed Warehouse file and the uploaded Sales file.

The requirements of the aggregator pattern are:

  1. Correlation – A way to group the related events. This is fulfilled by a unique identifier in the file name.
  2. Event aggregator – A stateful store for the events.
  3. Completion check and trigger – A condition when the combined events have been received and a way to publish the resulting event.

Architecture overview

The architecture uses the following AWS services:

  1. File upload: The Sales and Warehouse teams upload their respective files to S3.
  2. EventBridge: The ObjectCreated event is sent to EventBridge where there is a rule with a target of the main workflow.
  3. Main state machine: This state machine orchestrates the aggregator operations and the processing of the files. It encapsulates the workflows for each file to separate the aggregator logic from the files’ workflow logic.
  4. File parser and correlation: The business logic to identify the file and its type is run in this Lambda function.
  5. Stateful store: A DynamoDB table stores information about the file such as the name, type, and processing status. The state machine reads from and writes to the DynamoDB table. Task tokens are also stored in this table.
  6. File processing: Depending on the file type and any pre-conditions, state machines corresponding to the file type are run. These state machines contain the logic to process the specific file.
  7. Task Token & Callback: The task token is generated when the dependent file arrives for processing before the independent file has been processed. The Step Functions “Wait for a Callback” pattern continues the execution of the dependent file’s workflow after the independent file is processed.

Walkthrough

You need the following prerequisites:

  • AWS CLI and AWS SAM CLI installed.
  • An AWS account.
  • Sufficient permissions to manage the AWS resources.
  • Git installed.

To deploy the example, follow the instructions in the GitHub repo.

This walkthrough shows what happens if the dependent file (Sales file) is uploaded before the independent one (Warehouse file).

  1. The workflow starts with the uploading of the Sales file to the dedicated Sales S3 bucket. The example uses separate S3 buckets for the two files as it assumes that the Sales and Warehouse teams are distributed and autonomous. You can find sample files in the code repository.
  2. Uploading the file to S3 sends an event to EventBridge, which the aggregator state machine acts on. The event pattern used in the EventBridge rule is:
    {
      "detail-type": ["Object Created"],
      "source": ["aws.s3"],
      "detail": {
        "bucket": {
          "name": ["sales-mfu-eda-09092023", "warehouse-mfu-eda-09092023"]
        },
        "reason": ["PutObject"]
      }
    }
  3. The aggregator state machine starts by invoking the file parser Lambda function. This function parses the file type and uses the identifier to correlate the files. In this example, the name of the file contains the file type and the correlation identifier (the year_month). To use other ways of representing the file type and correlation identifier, you can modify this function to parse that information.
  4. The next step in the state machine inserts a record for the event in the event aggregator DynamoDB table. The table has a composite primary key with the correlation identifier as the partition key and the file type as the sort key. The processing status of the file is tracked to give feedback on the state of the workflow.
  5. Based on the file type, the state machine determines which branch to follow. In the example, the Sales branch is run. The state machine tries to get the status of the (dependent) Warehouse file from DynamoDB using the correlation identifier. Using the result of this query, the state machine determines if the corresponding Warehouse file has already been processed.
  6. Since the Warehouse file is not processed yet, the waitForTaskToken integration pattern is used. The state machine waits at this step and creates a task token, which the external services use to trigger the state machine to continue its execution. The Sales record in the DynamoDB table is updated with the Task Token.
  7. Navigate to the S3 console and upload the sample Warehouse file to the Warehouse S3 bucket. This invokes a new instance of the Step Functions workflow, which flows through the other branch after the file type choice step. In this branch, the Warehouse state machine is run and the processing status of the file is updated in DynamoDB.

When the status of the Warehouse file changes to “Completed”, the Warehouse state machine checks DynamoDB for a pending Sales file. If there is one, it retrieves the task token and calls the SendTaskSuccess method. This signals the waiting Sales state machine to continue, and the processing status is updated.
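
A minimal sketch of this callback step, assuming a DynamoDB table keyed by correlation identifier and file type with a stored task token attribute; the table, key, and attribute names are hypothetical.

    import json
    import boto3

    dynamodb = boto3.resource("dynamodb")
    sfn = boto3.client("stepfunctions")

    # Hypothetical aggregator table and attribute names
    table = dynamodb.Table("file-aggregator")

    def handler(event, context):
        correlation_id = event["correlation_id"]  # for example "2023_09"

        # Look up the pending Sales record for the same period
        item = table.get_item(
            Key={"correlation_id": correlation_id, "file_type": "SALES"}
        ).get("Item")

        if item and item.get("status") == "WAITING" and "task_token" in item:
            # Resume the waiting Sales workflow now that the Warehouse file is processed
            sfn.send_task_success(
                taskToken=item["task_token"],
                output=json.dumps({"warehouse_status": "COMPLETED"}),
            )
        return {"statusCode": 200}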

Conclusion

This blog post shows how to handle file dependencies in event driven architectures. You can customize the sample provided in the code repository for your own use case.

This solution is specific to file dependencies in event driven architectures. For more information on solving event dependencies and aggregators read the blog post: Moving to event-driven architectures with serverless event aggregators.

To learn more about event driven architectures, visit the event driven architecture section on Serverless Land.

GoDaddy benchmarking results in up to 24% better price-performance for their Spark workloads with AWS Graviton2 on Amazon EMR Serverless

Post Syndicated from Mukul Sharma original https://aws.amazon.com/blogs/big-data/godaddy-benchmarking-results-in-up-to-24-better-price-performance-for-their-spark-workloads-with-aws-graviton2-on-amazon-emr-serverless/

This is a guest post co-written with Mukul Sharma, Software Development Engineer, and Ozcan IIikhan, Director of Engineering from GoDaddy.

GoDaddy empowers everyday entrepreneurs by providing all the help and tools to succeed online. With more than 22 million customers worldwide, GoDaddy is the place people come to name their ideas, build a professional website, attract customers, and manage their work.

GoDaddy is a data-driven company, and getting meaningful insights from data helps us drive business decisions to delight our customers. At GoDaddy, we embarked on a journey to uncover the efficiency promises of AWS Graviton2 on Amazon EMR Serverless as part of our long-term vision for cost-effective intelligent computing.

In this post, we share the methodology and results of our benchmarking exercise comparing the cost-effectiveness of EMR Serverless on the arm64 (Graviton2) architecture against the traditional x86_64 architecture. EMR Serverless on Graviton2 demonstrated an advantage in cost-effectiveness, resulting in significant savings in total run costs. We achieved 23.85% improvement in price-performance for sample production Spark workloads—an outcome that holds tremendous potential for businesses striving to maximize their computing efficiency.

Solution overview

GoDaddy’s intelligent compute platform envisions simplification of compute operations for all personas, without limiting power users, to ensure out-of-box cost and performance optimization for data and ML workloads. As a part of this vision, GoDaddy’s Data & ML Platform team plans to use EMR Serverless as one of the compute solutions under the hood.

The following diagram shows a high-level illustration of the intelligent compute platform vision.

Benchmarking EMR Serverless for GoDaddy

EMR Serverless is a serverless option in Amazon EMR that eliminates the complexities of configuring, managing, and scaling clusters when running big data frameworks like Apache Spark and Apache Hive. With EMR Serverless, businesses can enjoy numerous benefits, including cost-effectiveness, faster provisioning, simplified developer experience, and improved resilience to Availability Zone failures.

At GoDaddy, we embarked on a comprehensive study to benchmark EMR Serverless using real production workflows at GoDaddy. The purpose of the study was to evaluate the performance and efficiency of EMR Serverless and develop a well-informed adoption plan. The results of the study have been extremely promising, showcasing the potential of EMR Serverless for our workloads.

Having achieved compelling results in favor of EMR Serverless for our workloads, our attention turned to evaluating the utilization of the Graviton2 (arm64) architecture on EMR Serverless. In this post, we focus on comparing the performance of Graviton2 (arm64) with the x86_64 architecture on EMR Serverless. By conducting this apples-to-apples comparative analysis, we aim to gain valuable insights into the benefits and considerations of using Graviton2 for our big data workloads.

By using EMR Serverless and exploring the performance of Graviton2, GoDaddy aims to optimize their big data workflows and make informed decisions regarding the most suitable architecture for their specific needs. The combination of EMR Serverless and Graviton2 presents an exciting opportunity to enhance the data processing capabilities and drive efficiency in our operations.

AWS Graviton2

The Graviton2 processors are specifically designed by AWS, utilizing powerful 64-bit Arm Neoverse cores. This custom-built architecture provides a remarkable boost in price-performance for various cloud workloads.

In terms of cost, Graviton2 offers an appealing advantage. As indicated in the following table, the pricing for Graviton2 is 20% lower compared to the x86 architecture option.

                             x86_64        arm64 (Graviton2)
Per vCPU per hour            $0.052624     $0.042094
Per GB per hour              $0.0057785    $0.004628
Per storage GB per hour*     $0.000111 (both architectures)

*Ephemeral storage: 20 GB of ephemeral storage is available for all workers by default—you pay only for any additional storage that you configure per worker.

For specific pricing details and current information, refer to Amazon EMR pricing.

AWS benchmark

The AWS team performed benchmark tests on Spark workloads with Graviton2 on EMR Serverless using the TPC-DS 3 TB scale performance benchmarks. The summary of their analysis is as follows:

  • Graviton2 on EMR Serverless demonstrated an average improvement of 10% for Spark workloads in terms of runtime. This indicates that the runtime for Spark-based tasks was reduced by approximately 10% when utilizing Graviton2.
  • Although the majority of queries showcased improved performance, a small subset of queries experienced a regression of up to 7% on Graviton2. These specific queries showed a slight decrease in performance compared to the x86 architecture option.
  • In addition to the performance analysis, the AWS team considered the cost factor. Graviton2 is offered at a 20% lower cost than the x86 architecture option. Taking this cost advantage into account, the AWS benchmark set yielded an overall 27% better price-performance for workloads. This means that by using Graviton2, users can achieve a 27% improvement in performance per unit of cost compared to the x86 architecture option.

These findings highlight the significant benefits of using Graviton2 on EMR Serverless for Spark workloads, with improved performance and cost-efficiency. It showcases the potential of Graviton2 in delivering enhanced price-performance ratios, making it an attractive choice for organizations seeking to optimize their big data workloads.

GoDaddy benchmark

During our initial experimentation, we observed that arm64 on EMR Serverless consistently outperformed or performed on par with x86_64. One of the jobs showed a 7.51% increase in resource usage on arm64 compared to x86_64, but due to the lower price of arm64, it still resulted in a 13.48% cost reduction. In another instance, we achieved an impressive 43.7% reduction in run cost, attributed to both the lower price and reduced resource utilization. Overall, our initial tests indicated that arm64 on EMR Serverless delivered superior price-performance compared to x86_64. These promising findings motivated us to conduct a more comprehensive and rigorous study.

Benchmark results

To gain a deeper understanding of the value of Graviton2 on EMR Serverless, we conducted our study using real-life production workloads from GoDaddy, which are scheduled to run at a daily cadence. Without any exceptions, EMR Serverless on arm64 (Graviton2) is significantly more cost-effective compared to the same jobs run on EMR Serverless on the x86_64 architecture. In fact, we recorded an impressive 23.85% improvement in price-performance across the sample GoDaddy jobs using Graviton2.

Like the AWS benchmarks, we observed slight regressions of less than 5% in the total runtime of some jobs. However, given that these jobs will be migrated from Amazon EMR on EC2 to EMR Serverless, the overall total runtime will still be shorter due to the minimal provisioning time in EMR Serverless. Additionally, across all jobs, we observed an average speed up of 2.1% in addition to the cost savings achieved.

These benchmarking results provide compelling evidence of the value and effectiveness of Graviton2 on EMR Serverless. The combination of improved price-performance, shorter runtimes, and overall cost savings makes Graviton2 a highly attractive option for optimizing big data workloads.

Benchmarking methodology

As an extension of the larger benchmarking EMR Serverless for GoDaddy study, in which we divided Spark jobs into brackets based on total runtime (quick-run, medium-run, long-run), we measured the effect of architecture (arm64 vs. x86_64) on total cost and total runtime. All other parameters were kept the same to achieve an apples-to-apples comparison.
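
For reference, the only difference between the two test configurations is the application’s architecture setting. The following hedged boto3 sketch creates one EMR Serverless application per architecture; the application names are illustrative, and the release label mirrors the one used in the tests.

    import boto3

    emr_serverless = boto3.client("emr-serverless")

    # Create one application per architecture; everything else is kept identical
    for arch in ("X86_64", "ARM64"):
        app = emr_serverless.create_application(
            name=f"spark-benchmark-{arch.lower()}",  # hypothetical application name
            releaseLabel="emr-6.9.0",
            type="SPARK",
            architecture=arch,
        )
        print(arch, app["applicationId"])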

The team followed these steps:

  1. Prepare the data and environment.
  2. Choose two random production jobs from each job bracket.
  3. Make necessary changes to avoid interference with actual production outputs.
  4. Run tests to execute scripts over multiple iterations to collect accurate and consistent data points.
  5. Validate input and output datasets, partitions, and row counts to ensure identical data processing.
  6. Gather relevant metrics from the tests.
  7. Analyze results to draw insights and conclusions.

The following table shows the summary of an example Spark job.

Metric  EMR Serverless (Average) – X86_64  EMR Serverless (Average) – Graviton  X86_64 vs Graviton (% Difference) 
Total Run Cost $2.76 $1.85 32.97%

Total Runtime (hh:mm:ss) 00:41:31 00:34:32 16.82%
EMR Release Label emr-6.9.0
Job Type Spark
Spark Version Spark 3.3.0
Hadoop Distribution Amazon 3.3.3
Hive/HCatalog Version Hive 3.1.3, HCatalog 3.1.3

Summary of results

The following table presents a comparison of job performance between EMR Serverless on arm64 (Graviton2) and EMR Serverless on x86_64. For each architecture, every job was run at least three times to obtain the accurate average cost and runtime.

 Job  Average x86_64 Cost Average arm64 Cost Average x86_64 Runtime (hh:mm:ss) Average arm64 Runtime (hh:mm:ss)  Average Cost Savings %  Average Performance Gain % 
1 $1.64 $1.25 00:08:43 00:09:01 23.89% -3.24%
2 $10.00 $8.69 00:27:55 00:28:25 13.07% -1.79%
3 $29.66 $24.15 00:50:49 00:53:17 18.56% -4.85%
4 $34.42 $25.80 01:20:02 01:24:54 25.04% -6.08%
5 $2.76 $1.85 00:41:31 00:34:32 32.97% 16.82%
6 $34.07 $24.00 00:57:58 00:51:09 29.57% 11.76%
Average  23.85% 2.10%

Note that the improvement calculations are based on higher-precision results for more accuracy.

Conclusion

Based on this study, GoDaddy observed a significant 23.85% improvement in price-performance for sample production Spark jobs utilizing the arm64 architecture compared to the x86_64 architecture. These compelling results have led us to strongly recommend internal teams to use arm64 (Graviton2) on EMR Serverless, except in cases where there are compatibility issues with third-party packages and libraries. By adopting an arm64 architecture, organizations can achieve enhanced cost-effectiveness and performance for their workloads, contributing to more efficient data processing and analytics.


About the Authors

Mukul Sharma is a Software Development Engineer on Data & Analytics (DnA) organization at GoDaddy. He is a polyglot programmer with experience in a wide array of technologies to rapidly deliver scalable solutions. He enjoys singing karaoke, playing various board games, and working on personal programming projects in his spare time.

Ozcan Ilikhan is a Director of Engineering on Data & Analytics (DnA) organization at GoDaddy. He is passionate about solving customer problems and increasing efficiency using data and ML/AI. In his spare time, he loves reading, hiking, gardening, and working on DIY projects.

Harsh Vardhan Singh Gaur is an AWS Solutions Architect, specializing in analytics. He has over 6 years of experience working in the field of big data and data science. He is passionate about helping customers adopt best practices and discover insights from their data.

Ramesh Kumar Venkatraman is a Senior Solutions Architect at AWS who is passionate about containers and databases. He works with AWS customers to design, deploy, and manage their AWS workloads and architectures. In his spare time, he loves to play with his two kids and follows cricket.

Transforming transactions: Streamlining PCI compliance using AWS serverless architecture

Post Syndicated from Abdul Javid original https://aws.amazon.com/blogs/security/transforming-transactions-streamlining-pci-compliance-using-aws-serverless-architecture/

Compliance with the Payment Card Industry Data Security Standard (PCI DSS) is critical for organizations that handle cardholder data. Achieving and maintaining PCI DSS compliance can be a complex and challenging endeavor. Serverless technology has transformed application development, offering benefits in agility, performance, cost, and security.

In this blog post, we examine the benefits of using AWS serverless services and highlight how you can use them to help align with your PCI DSS compliance responsibilities. You can remove additional undifferentiated compliance heavy lifting by building modern applications with abstracted AWS services. We review an example payment application and workflow that uses AWS serverless services and showcases the potential reduction in effort and responsibility that a serverless architecture could provide to help align with your compliance requirements. We present the review through the lens of a merchant that has an ecommerce website and include key topics such as access control, data encryption, monitoring, and auditing—all within the context of the example payment application. We don’t discuss additional service provider requirements from the PCI DSS in this post.

This example will help you navigate the intricate landscape of PCI DSS compliance. This can help you focus on building robust and secure payment solutions without getting lost in the complexities of compliance. This can also help reduce your compliance burden and empower you to develop your own secure, scalable applications. Join us in this journey as we explore how AWS serverless services can help you meet your PCI DSS compliance objectives.

Disclaimer

This document is provided for the purposes of information only; it is not legal advice, and should not be relied on as legal advice. Customers are responsible for making their own independent assessment of the information in this document. This document: (a) is for informational purposes only, (b) represents current AWS product offerings and practices, which are subject to change without notice, and (c) does not create any commitments or assurances from AWS and its affiliates, suppliers or licensors. AWS products or services are provided “as is” without warranties, representations, or conditions of any kind, whether express or implied. The responsibilities and liabilities of AWS to its customers are controlled by AWS agreements, and this document is not part of, nor does it modify, any agreement between AWS and its customers.

AWS encourages its customers to obtain appropriate advice on their implementation of privacy and data protection environments, and more generally, applicable laws and other obligations relevant to their business.

PCI DSS v4.0 and serverless

In April 2022, the Payment Card Industry Security Standards Council (PCI SSC) updated the security payment standard to “address emerging threats and technologies and enable innovative methods to combat new threats.” Two of the high-level goals of these updates are enhancing validation methods and procedures and promoting security as a continuous process. Adopting serverless architectures can help meet some of the new and updated requirements in version 4.0, such as enhanced software and encryption inventories. If a customer has access to change a configuration, it’s the customer’s responsibility to verify that the configuration meets PCI DSS requirements. There are more than 20 PCI DSS requirements applicable to Amazon Elastic Compute Cloud (Amazon EC2). To fulfill these requirements, customer organizations must implement controls such as file integrity monitoring, operating system level access management, system logging, and asset inventories. Using AWS abstracted services in this scenario can remove undifferentiated heavy lifting from your environment. With abstracted AWS services, because there is no operating system to manage, AWS becomes responsible for maintaining consistent time settings for an abstracted service to meet Requirement 10.6. This will also shift your compliance focus more towards your application code and data.

This makes more of your PCI DSS responsibility addressable through the AWS PCI DSS Attestation of Compliance (AOC) and Responsibility Summary. This attestation package is available to AWS customers through AWS Artifact.

Reduction in compliance burden

You can use three common architectural patterns within AWS to design payment applications and meet PCI DSS requirements: infrastructure, containerized, and abstracted. We look into EC2 instance-based architecture (infrastructure or containerized patterns) and modernized architectures using serverless services (abstracted patterns). While both approaches can help align with PCI DSS requirements, there are notable differences in how they handle certain elements. EC2 instances provide more control and flexibility over the underlying infrastructure and operating system, assisting you in customizing security measures based on your organization’s operational and security requirements. However, this also means that you bear more responsibility for configuring and maintaining security controls applicable to the operating systems, such as network security controls, patching, file integrity monitoring, and vulnerability scanning.

On the other hand, serverless architectures similar to the preceding example can reduce much of the infrastructure management requirements. This can relieve you, the application owner or cloud service consumer, of the burden of configuring and securing those underlying virtual servers. This can streamline meeting certain PCI requirements, such as file integrity monitoring, patch management, and vulnerability management, because AWS handles these responsibilities.

Using serverless architecture on AWS can significantly reduce the PCI compliance burden. Approximately 43 percent of the overall PCI compliance requirements, encompassing both technical and non-technical tests, are addressed by the AWS PCI DSS Attestation of Compliance.

Customer responsible: 52%  |  AWS responsible: 43%  |  N/A: 5%

The following table provides an analysis of each PCI DSS requirement against the serverless architecture in Figure 1, which shows a sample payment application workflow. You must evaluate your own use and secure configuration of AWS workloads and architectures for a successful audit.

PCI DSS 4.0 requirements Test cases Customer responsible AWS responsible N/A
Requirement 1: Install and maintain network security controls 35 13 22 0
Requirement 2: Apply secure configurations to all system components 27 16 11 0
Requirement 3: Protect stored account data 55 24 29 2
Requirement 4: Protect cardholder data with strong cryptography during transmission over open, public networks 12 7 5 0
Requirement 5: Protect all systems and networks from malicious software 25 4 21 0
Requirement 6: Develop and maintain secure systems and software 35 31 4 0
Requirement 7: Restrict access to system components and cardholder data by business need-to-know 22 19 3 0
Requirement 8: Identify users and authenticate access to system components 52 43 6 3
Requirement 9: Restrict physical access to cardholder data 56 3 53 0
Requirement 10: Log and monitor all access to system components and cardholder data 38 17 19 2
Requirement 11: Test security of systems and networks regularly 51 22 23 6
Requirement 12: Support information security with organizational policies 56 44 2 10
Total 464 243 198 23
Percentage 52% 43% 5%

Note: The preceding table is based on the example reference architecture that follows. The actual extent of PCI DSS requirements reduction can vary significantly depending on your cardholder data environment (CDE) scope, implementation, and configurations.

Sample payment application and workflow

This example serverless payment application and workflow in Figure 1 consists of several interconnected steps, each using different AWS services. The steps are listed in the following text and include brief descriptions. They cover two use cases within this example application — consumers making a payment and a business analyst generating a report.

The example outlines a basic serverless payment application workflow using AWS serverless services. However, it’s important to note that the actual implementation and behavior of the workflow may vary based on specific configurations, dependencies, and external factors. The example serves as a general guide and may require adjustments to suit the unique requirements of your application or infrastructure.

Several factors, including but not limited to, AWS service configurations, network settings, security policies, and third-party integrations, can influence the behavior of the system. Before deploying a similar solution in a production environment, we recommend thoroughly reviewing and adapting the example to align with your specific use case and requirements.

Keep in mind that AWS services and features may evolve over time, and new updates or changes may impact the behavior of the components described in this example. Regularly consult the AWS documentation and ensure that your configurations adhere to best practices and compliance standards.

This example is intended to provide a starting point and should be considered as a reference rather than an exhaustive solution. Always conduct thorough testing and validation in your specific environment to ensure the desired functionality and security.

Figure 1: Serverless payment architecture and workflow

  • Use case 1: Consumers make a payment
    1. Consumers visit the e-commerce payment page to make a payment.
    2. The request is routed to the payment application’s domain using Amazon Route 53, which acts as a DNS service.
    3. The payment page is protected by AWS WAF to inspect the initial incoming request for any malicious patterns, web-based attacks (such as cross-site scripting (XSS) attacks), and unwanted bots.
    4. An HTTPS GET request (over TLS) is sent to the public target IP. Amazon CloudFront, a content delivery network (CDN), acts as a front-end proxy and caches and fetches static content from an Amazon Simple Storage Service (Amazon S3) bucket.
    5. AWS WAF inspects the incoming request for any malicious patterns; if the request is blocked, static content is not returned from the S3 bucket.
    6. User authentication and authorization are handled by Amazon Cognito, providing a secure login and a scalable customer identity and access management (CIAM) system.
    7. AWS WAF processes the request to protect against web exploits, then Amazon API Gateway forwards it to the payment application API endpoint.
    8. API Gateway launches AWS Lambda functions to handle payment requests. AWS Step Functions state machine oversees the entire process, directing the running of multiple Lambda functions to communicate with the payment processor, initiate the payment transaction, and process the response.
    9. The cardholder data (CHD) is temporarily cached in Amazon DynamoDB for troubleshooting and retry attempts in the event of transaction failures.
    10. A Lambda function validates the transaction details and performs necessary checks against the data stored in DynamoDB. A web notification is sent to the consumer for any invalid data.
    11. A Lambda function calculates the transaction fees.
    12. A Lambda function authenticates the transaction and initiates the payment transaction with the third-party payment provider.
    13. A Lambda function is initiated when a payment transaction with the third-party payment provider is completed. It receives the transaction status from the provider and performs multiple actions.
    14. Consumers receive real-time notifications through a web browser and email. The notifications, such as order confirmations or payment receipts, are initiated by a step function and can be integrated with external payment processors through an Amazon Simple Notification Service (Amazon SNS) or Amazon Simple Email Service (Amazon SES) webhook.
    15. A separate Lambda function clears the DynamoDB cache.
    16. The Lambda function makes entries into the Amazon Simple Queue Service (Amazon SQS) dead-letter queue for failed transactions to retry at a later time.
  • Use case 2: An admin or analyst generates the report for non-PCI data
    1. An admin accesses the web-based reporting dashboard using their browser to generate a report.
    2. The request is routed to AWS WAF to verify the source that initiated the request.
    3. An HTTPS GET request (over TLS) is sent to the public target IP. CloudFront fetches static content from an S3 bucket.
    4. AWS WAF inspects incoming requests for any malicious patterns; if the request is blocked, static content is not returned from the S3 bucket. The validated traffic is sent to Amazon S3 to retrieve the reporting page.
    5. The backend requests of the reporting page pass through AWS WAF again to provide protection against common web exploits before being forwarded to the reporting API endpoint through API Gateway.
    6. API Gateway launches a Lambda function for report generation. The Lambda function retrieves data from DynamoDB storage for the reporting mechanism.
    7. The AWS Security Token Service (AWS STS) issues temporary credentials to the Lambda service in the non-PCI serverless account, allowing it to launch the Lambda function in the PCI serverless account. The Lambda function retrieves non-PCI data and writes it into DynamoDB.
    8. The Lambda function fetches the non-PCI data based on the report criteria from the DynamoDB table from the same account.

Additional AWS security and governance services that would be implemented throughout the architecture are shown in Figure 1, Label-25. For example, Amazon CloudWatch monitors and alerts on all the Lambda functions within the environment.

Label-26 demonstrates frameworks that can be used to build the serverless applications.

Scoping and requirements

Now that we’ve established the reference architecture and workflow, let’s delve into how it aligns with PCI DSS scope and requirements.

PCI scoping

Serverless services are inherently segmented by AWS, but they can be used within the context of an AWS account hierarchy to provide various levels of isolation as described in the reference architecture example.

Segregating PCI data and non-PCI data into separate AWS accounts can help in de-scoping non-PCI environments and reducing the complexity and audit requirements for components that don’t handle cardholder data.

PCI serverless production account

  • This AWS account is dedicated to handling PCI data and applications that directly process, transmit, or store cardholder data.
  • Services such as Amazon Cognito, DynamoDB, API Gateway, CloudFront, Amazon SNS, Amazon SES, Amazon SQS, and Step Functions are provisioned in this account to support the PCI data workflow.
  • Security controls, logging, monitoring, and access controls in this account are specifically designed to meet PCI DSS requirements.

Non-PCI serverless production account

  • This separate AWS account is used to host applications that don’t handle PCI data.
  • Since this account doesn’t handle cardholder data, the scope of PCI DSS compliance is reduced, simplifying the compliance process.

Note: You can use AWS Organizations to centrally manage multiple AWS accounts.

AWS IAM Identity Center (successor to AWS Single Sign-On) is used to manage user access to each account and is integrated with your existing identity provider. This helps to ensure you’re meeting PCI requirements for identity and access control of cardholder data and the environment.

Now, let’s look at the PCI DSS requirements that this architectural pattern can help address.

Requirement 1: Install and maintain network security controls

  • Network security controls are limited to AWS Identity and Access Management (IAM) and application permissions because there is no customer-controlled or customer-defined network. VPC-centric requirements aren’t applicable because there is no VPC. The configuration settings for serverless services can be covered under Requirement 6 for secure configuration standards. This supports compliance with Requirements 1.2 and 1.3.

Requirement 2: Apply secure configurations to all system components

  • AWS services are single function by default and exist with only the necessary functionality enabled for the functioning of that service. This supports compliance with much of Requirement 2.2.
  • Access to AWS services is considered non-console and is only available over HTTPS through the service API. This supports compliance with Requirement 2.2.7.
  • The wireless requirements under Requirement 2.3 are not applicable, because wireless environments don’t exist in AWS environments.

Requirement 3: Protect stored account data

  • AWS is responsible for destruction of account data configured for deletion based on DynamoDB Time to Live (TTL) values. This supports compliance with Requirement 3.2.
  • DynamoDB and Amazon S3 offer secure storage of account data, encryption by default in transit and at rest, and integration with AWS Key Management Service (AWS KMS). This supports compliance with Requirements 3.5 and 4.2.
  • AWS is responsible for the generation, distribution, storage, rotation, destruction, and overall protection of encryption keys within AWS KMS. This supports compliance with Requirements 3.6 and 3.7.
  • Because manual cleartext cryptographic keys aren’t used in this solution, Requirement 3.7.6 is not applicable.

Requirement 4: Protect cardholder data with strong cryptography during transmission over open, public networks

  • AWS Certificate Manager (ACM) integrates with API Gateway and enables the use of trusted certificates and HTTPS (TLS) for secure communication between clients and the API. This supports compliance with Requirement 4.2.
  • Requirement 4.2.1.2 is not applicable because there are no wireless technologies in use in this solution. Customers are responsible for ensuring strong cryptography exists for authentication and transmission over other wireless networks they manage outside of AWS.
  • Requirement 4.2.2 is not applicable because no end-user technologies exist in this solution. Customers are responsible for ensuring the use of strong cryptography if primary account numbers (PAN) are sent through end-user messaging technologies in other environments.

Requirement 5: Protect all systems and networks from malicious software

  • Because there are no customer-managed compute resources in this example payment environment, Requirements 5.2 and 5.3 are the responsibility of AWS.

Requirement 6: Develop and maintain secure systems and software

  • Amazon Inspector now supports Lambda functions, adding continual, automated vulnerability assessments for serverless compute. This supports compliance with Requirement 6.2.
  • Amazon Inspector helps identify vulnerabilities and security weaknesses in the payment application’s code, dependencies, and configuration. This supports compliance with Requirement 6.3.
  • AWS WAF is designed to protect applications from common attacks, such as SQL injections, cross-site scripting, and other web exploits. AWS WAF can filter and block malicious traffic before it reaches the application. This supports compliance with Requirement 6.4.2.

Requirement 7: Restrict access to system components and cardholder data by business need to know

  • IAM and Amazon Cognito allow for fine-grained role- and job-based permissions and access control. Customers can use these capabilities to configure access following the principles of least privilege and need-to-know. IAM and Cognito support the use of strong identification, authentication, authorization, and multi-factor authentication (MFA). This supports compliance with much of Requirement 7.

Requirement 8: Identify users and authenticate access to system components

  • IAM and Amazon Cognito also support compliance with much of Requirement 8.
  • Some of the controls in this requirement are usually met by the identity provider for internal access to the cardholder data environment (CDE).

Requirement 9: Restrict physical access to cardholder data

  • AWS is responsible for the destruction of data in DynamoDB based on the customer configuration of content TTL values for Requirement 9.4.7. Customers are responsible for ensuring appropriate removal of data by enabling TTL on the relevant DynamoDB attributes, as shown in the sketch after this list.
  • Requirement 9 is otherwise not applicable for this serverless example environment because there are no physical media, electronic media not already addressed under Requirement 3.2, or hard-copy materials with cardholder data. AWS is responsible for the physical infrastructure under the Shared Responsibility Model.
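
The following is a minimal sketch of how a customer might enable TTL on the DynamoDB table that stores transaction data so that items past their retention period are removed automatically. The table and attribute names are hypothetical; adapt them to your own schema.

import boto3

dynamodb = boto3.client("dynamodb")

# Enable TTL so that items whose expires_at epoch timestamp has passed
# are deleted automatically by DynamoDB.
dynamodb.update_time_to_live(
    TableName="payment-transactions",          # hypothetical table name
    TimeToLiveSpecification={
        "Enabled": True,
        "AttributeName": "expires_at",         # hypothetical attribute holding epoch seconds
    },
)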

Requirement 10: Log and monitor all access to system components and cardholder data

  • AWS CloudTrail provides detailed logs of API activity for auditing and monitoring purposes. This supports compliance with Requirement 10.2 and contains all of the events and data elements listed.
  • CloudWatch can be used for monitoring and alerting on system events and performance metrics. This supports compliance with Requirement 10.4.
  • AWS Security Hub provides a comprehensive view of security alerts and compliance status, consolidating findings from various security services, which helps in ongoing security monitoring and testing. Customers must enable the PCI DSS security standard, which supports compliance with Requirement 10.4.2.
  • AWS is responsible for maintaining accurate system time for AWS services. In this example, there are no compute resources for which customers can configure time. Requirement 10.6 is addressable through the AWS Attestation of Compliance and Responsibility Summary available in AWS Artifact.

Requirement 11: Regularly test security systems and processes

  • Testing for rogue wireless activity within the AWS-based CDE is the responsibility of AWS. AWS is responsible for the management of the physical infrastructure under Requirement 11.2. Customers are still responsible for wireless testing for their environments outside of AWS, such as where administrative workstations exist.
  • AWS is responsible for internal vulnerability testing of AWS services, and supports compliance with Requirement 11.3.1.
  • Amazon GuardDuty is a threat detection service that continuously monitors for malicious activity and unauthorized access, providing continuous security monitoring. This supports the IDS requirements under Requirement 11.5.1 and covers the entire AWS-based CDE.
  • AWS Config allows customers to catalog, monitor and manage configuration changes for their AWS resources. This supports compliance with Requirement 11.5.2.
  • Customers can use AWS Config to monitor the configuration of the S3 bucket hosting the static website. This supports compliance with Requirement 11.6.1.

Requirement 12: Support information security with organizational policies and programs

  • Customers can download the AWS AOC and Responsibility Summary package from AWS Artifact to support Requirement 12.8.5 and the identification of which PCI DSS requirements are managed by the third-party service provider (TPSP) and which by the customer.

Conclusion

Using AWS serverless services when developing your payment application can significantly help reduce the number of PCI DSS requirements you need to meet by yourself. By offloading infrastructure management to AWS and using serverless services such as Lambda, API Gateway, DynamoDB, Amazon S3, and others, you can benefit from built-in security features and help align with your PCI DSS compliance requirements.

Contact us to help design an architecture that works for your organization. AWS Security Assurance Services is a Payment Card Industry-Qualified Security Assessor company (PCI-QSAC) and HITRUST External Assessor firm. We are a team of industry-certified assessors who help you to achieve, maintain, and automate compliance in the cloud by tying together applicable audit standards to AWS service-specific features and functionality. We help you build on frameworks such as PCI DSS, HITRUST CSF, NIST, SOC 2, HIPAA, ISO 27001, GDPR, and CCPA.

More information on how to build applications using AWS serverless technologies can be found at Serverless on AWS.

Want more AWS Security news? Follow us on Twitter.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Serverless re:Post or the Security, Identity, & Compliance re:Post, or contact AWS Support.

Abdul Javid

Abdul is a Senior Security Assurance Consultant and PCI DSS Qualified Security Assessor with AWS Security Assurance Services, and has more than 25 years of IT governance, operations, security, risk, and compliance experience. Abdul leverages his experience and knowledge to provide AWS customers with guidance and advice on their compliance journey. Abdul earned an M.S. in Computer Science from IIT, Chicago and holds various industry-recognized, sought-after certifications in security and program and risk management from prominent organizations like AWS, HITRUST, ISACA, PMI, PCI DSS, and ISC2.

Ted Tanner

Ted is a Principal Assurance Consultant and PCI DSS Qualified Security Assessor with AWS Security Assurance Services, and has more than 25 years of IT and security experience. He uses this experience to provide AWS customers with guidance on compliance and security, and on building and optimizing their cloud compliance programs. He is co-author of the Payment Card Industry Data Security Standard (PCI DSS) v3.2.1 on AWS Compliance Guide and the soon-to-be-released v4.0 edition.

Tristan Watty

Dr. Watty is a Senior Security Consultant within the Professional Services team of Amazon Web Services based in Queens, New York. He is a passionate Tech Enthusiast, Influencer, and Amazonian with 15+ years of professional and educational experience with a specialization in Security, Risk, and Compliance. His zeal lies in empowering customers to develop and put into action secure mechanisms that steer them towards achieving their security goals. Dr. Watty also created and hosts an AWS Security Show named “Security SideQuest!” that airs on the AWS Twitch Channel.

Padmakar Bhosale

Padmakar is a Sr. Technical Account Manager with over 25 years of experience in financial, banking, and cloud services. He provides AWS customers with guidance and advice on payment services, core banking ecosystems, credit union banking technologies, resiliency on AWS, PCI segmentation at the AWS account and network levels, and optimization of their cloud journey on AWS.

Sending and receiving webhooks on AWS: Innovate with event notifications

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/sending-and-receiving-webhooks-on-aws-innovate-with-event-notifications/

This post is written by Daniel Wirjo, Solutions Architect, and Justin Plock, Principal Solutions Architect.

Commonly known as reverse APIs or push APIs, webhooks provide a way for applications to integrate with each other and communicate in near real time. They enable integration for business and system events.

Whether you’re building a software as a service (SaaS) application integrating with your customer workflows, or transaction notifications from a vendor, webhooks play a critical role in unlocking innovation, enhancing user experience, and streamlining operations.

This post explains how to build with webhooks on AWS and covers two scenarios:

  • Webhooks Provider: A SaaS application that sends webhooks to an external API.
  • Webhooks Consumer: An API that receives webhooks with the capacity to handle large payloads.

It includes high-level reference architectures with considerations, best practices, and code samples to guide your implementation.

Sending webhooks

To send webhooks, you generate events, and deliver them to third-party APIs. These events facilitate updates, workflows, and actions in the third-party system. For example, a payments platform (provider) can send notifications for payment statuses, allowing ecommerce stores (consumers) to ship goods upon confirmation.

AWS reference architecture for a webhook provider

The architecture consists of two services:

  • Webhook delivery: An application that delivers webhooks to an external endpoint specified by the consumer.
  • Subscription management: A management API enabling the consumer to manage their configuration, including specifying endpoints for delivery and which events to subscribe to.

Considerations and best practices for sending webhooks

When building an application to send webhooks, consider the following factors:

Event generation: Consider how you generate events. This example uses Amazon DynamoDB as the data source. Events are generated by change data capture for DynamoDB Streams and sent to Amazon EventBridge Pipes. You then simplify the DynamoDB response format by using an input transformer.

With EventBridge, you send events in near real time. If events are not time-sensitive, you can send multiple events in a batch. This can be done by polling for new events at a specified frequency using EventBridge Scheduler. To generate events from other data sources, consider similar approaches with Amazon Simple Storage Service (S3) Event Notifications or Amazon Kinesis.

Filtering: EventBridge Pipes supports filtering by matching event patterns before the event is routed to the target destination. For example, you can filter for status update operations on the payments DynamoDB table and route only those events to the relevant subscriber API endpoint.

Delivery: EventBridge API Destinations deliver events outside of AWS using REST API calls. To protect the external endpoint from surges in traffic, you set an invocation rate limit. In addition, retries with exponential backoff are handled automatically depending on the error. An Amazon Simple Queue Service (SQS) dead-letter queue retains messages that cannot be delivered. These can provide scalable and resilient delivery.
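
As an illustration of the delivery configuration, the following sketch creates an API destination with an invocation rate limit using the EventBridge API via boto3. The name, endpoint, and connection ARN are placeholders; the connection itself is created as described under Security and Authorization below.

import boto3

events = boto3.client("events")

# Create an API destination that targets the consumer's HTTPS endpoint and
# caps the invocation rate to protect the endpoint from traffic surges.
events.create_api_destination(
    Name="webhook-subscriber-endpoint",                                   # placeholder name
    ConnectionArn="arn:aws:events:us-east-1:123456789012:connection/webhook-consumer/abcd1234",  # placeholder
    InvocationEndpoint="https://api.example.com/webhooks",                # consumer-provided endpoint
    HttpMethod="POST",
    InvocationRateLimitPerSecond=50,
)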

Payload Structure: Consider how consumers process event payloads. This example uses an input transformer to create a structured payload, aligned to the CloudEvents specification. CloudEvents provides an industry standard format and common payload structure, with developer tools and SDKs for consumers.
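
To illustrate, the following sketch creates an EventBridge Pipe from the DynamoDB stream described earlier and applies an input transformer that shapes the payload along the lines of the CloudEvents specification. The pipe name, ARNs, and field mappings are assumptions for illustration; the exact template depends on your table schema.

import json
import boto3

pipes = boto3.client("pipes")

# CloudEvents-style template; <$.path> placeholders are resolved by EventBridge Pipes
# from each DynamoDB Streams record.
input_template = json.dumps({
    "specversion": "1.0",
    "type": "payment.status.updated",
    "source": "example.payments",
    "id": "<$.eventID>",
    "data": {
        "paymentId": "<$.dynamodb.NewImage.id.S>",
        "status": "<$.dynamodb.NewImage.status.S>",
    },
})

pipes.create_pipe(
    Name="payments-webhook-pipe",                                          # placeholder
    RoleArn="arn:aws:iam::123456789012:role/pipes-execution-role",         # placeholder
    Source="arn:aws:dynamodb:us-east-1:123456789012:table/payments/stream/2024-01-01T00:00:00.000",  # placeholder
    SourceParameters={"DynamoDBStreamParameters": {"StartingPosition": "LATEST"}},
    Target="arn:aws:events:us-east-1:123456789012:event-bus/webhooks",     # placeholder
    TargetParameters={"InputTemplate": input_template},
)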

Payload Size: For fast and reliable delivery, keep payload size to a minimum. Consider delivering only necessary details, such as identifiers and status. For additional information, you can provide consumers with a separate API. Consumers can then separately call this API to retrieve the additional information.

Security and Authorization: To deliver events securely, you establish a connection using an authorization method such as OAuth. Under the hood, the connection stores the credentials in AWS Secrets Manager, which securely encrypts the credentials.
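
The following sketch shows one way to create such a connection with OAuth client credentials using boto3; EventBridge stores the secret values in Secrets Manager on your behalf. All identifiers and endpoints are placeholders.

import boto3

events = boto3.client("events")

# Create a connection with OAuth client credentials. EventBridge stores the
# credentials in AWS Secrets Manager and uses them when invoking the API destination.
response = events.create_connection(
    Name="webhook-consumer-oauth",                                # placeholder
    AuthorizationType="OAUTH_CLIENT_CREDENTIALS",
    AuthParameters={
        "OAuthParameters": {
            "ClientParameters": {
                "ClientID": "example-client-id",                  # placeholder
                "ClientSecret": "example-client-secret",          # placeholder
            },
            "AuthorizationEndpoint": "https://auth.example.com/oauth2/token",
            "HttpMethod": "POST",
        }
    },
)
# response["ConnectionArn"] is then referenced when creating the API destination shown earlier.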

Subscription Management: Consider how consumers can manage their subscription, such as specifying HTTPS endpoints and event types to subscribe. DynamoDB stores this configuration. Amazon API Gateway, Amazon Cognito, and AWS Lambda provide a management API for operations.

Costs: In practice, sending webhooks incurs cost, which may become significant as you grow and generate more events. Consider implementing usage policies, quotas, and allowing consumers to subscribe only to the event types that they need.

Monetization: Consider billing consumers based on their usage volume or tier. For example, you can offer a free tier to provide a low-friction access to webhooks, but only up to a certain volume. For additional volume, you charge a usage fee that is aligned to the business value that your webhooks provide. At high volumes, you offer a premium tier where you provide dedicated infrastructure for certain consumers.

Monitoring and troubleshooting: Beyond the architecture, consider processes for day-to-day operations. As endpoints are managed by external parties, consider enabling self-service. For example, allow consumers to view statuses, replay events, and search for past webhook logs to diagnose issues.

Advanced Scenarios: This example is designed for popular use cases. For advanced scenarios, consider alternative application integration services noting their Service Quotas. For example, Amazon Simple Notification Service (SNS) for fan-out to a larger number of consumers, Lambda for flexibility to customize payloads and authentication, and AWS Step Functions for orchestrating a circuit breaker pattern to deactivate unreliable subscribers.

Receiving webhooks

To receive webhooks, you require an API to provide to the webhook provider. For example, an ecommerce store (consumer) may rely on notifications provided by their payment platform (provider) to ensure that goods are shipped in a timely manner. Webhooks present a unique scenario because the consumer must be scalable and resilient, and must ensure that all requests are received.

AWS reference architecture for a webhook consumer

In this scenario, consider an advanced use case that can handle large payloads by using the claim-check pattern.

At a high level, the architecture consists of:

  • API: An API endpoint to receive webhooks. An event-driven system then authorizes and processes the received webhooks.
  • Payload Store: S3 provides scalable storage for large payloads.
  • Webhook Processing: EventBridge Pipes provide an extensible architecture for processing. It can batch, filter, enrich, and send events to a range of processing services as targets.

Considerations and best practices for receiving webhooks

When building an application to receive webhooks, consider the following factors:

Scalability: Providers typically send events as they occur. API Gateway provides a scalable managed endpoint to receive events. If the endpoint is unavailable or throttled, providers may retry the request; however, this is not guaranteed. Therefore, it is important to configure appropriate rate and burst limits. Throttling requests at the entry point mitigates impact on downstream services, where each service has its own quotas and limits. In many cases, providers are also aware of the impact on downstream systems. As such, they send events at a threshold rate limit, typically up to 500 transactions per second (TPS).
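
As a sketch of how such limits might be applied, the following boto3 call sets stage-level default throttling on a REST API. The API ID, stage name, and limit values are placeholders that you would size to your providers' expected rates.

import boto3

apigateway = boto3.client("apigateway")

# "/*/*" applies the default method settings to all resources and methods on the stage.
apigateway.update_stage(
    restApiId="a1b2c3d4e5",        # placeholder REST API ID
    stageName="prod",              # placeholder stage name
    patchOperations=[
        {"op": "replace", "path": "/*/*/throttling/rateLimit", "value": "500"},
        {"op": "replace", "path": "/*/*/throttling/burstLimit", "value": "1000"},
    ],
)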

In addition, API Gateway allows you to validate requests, monitor for errors, and protect against distributed denial of service (DDoS) attacks. This includes Layer 7 and Layer 3 attacks, which are common threats to webhook consumers given their public exposure.

Authorization and Verification: Providers can support different authorization methods. Consider a common scenario with Hash-based Message Authentication Code (HMAC), where a shared secret is established and stored in Secrets Manager. A Lambda function then verifies the integrity of the message by processing a signature in the request header. Typically, the signature contains a timestamped nonce with an expiry to mitigate replay attacks, where events are sent multiple times by an attacker. Alternatively, if the provider supports OAuth, consider securing the API with Amazon Cognito.
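
The following is a minimal sketch of such a verification function in Python. The header name, signature encoding, and how the shared secret is loaded vary by provider; here the secret is assumed to be available as an environment variable for brevity, whereas in practice you would retrieve it from Secrets Manager.

import hashlib
import hmac
import json
import os

# Assumed to be fetched from Secrets Manager in a real implementation.
SHARED_SECRET = os.environ["WEBHOOK_SHARED_SECRET"].encode()

def handler(event, context):
    body = event.get("body") or ""
    received_signature = event.get("headers", {}).get("x-signature", "")   # header name is provider-specific

    expected_signature = hmac.new(SHARED_SECRET, body.encode(), hashlib.sha256).hexdigest()

    # compare_digest performs a constant-time comparison to mitigate timing attacks.
    if not hmac.compare_digest(received_signature, expected_signature):
        return {"statusCode": 401, "body": json.dumps({"message": "invalid signature"})}

    return {"statusCode": 200, "body": json.dumps({"message": "accepted"})}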

Payload Size: Providers may send a variety of payload sizes. Events can be batched into a single larger request, or they may contain significant information. Consider payload size limits in your event-driven system. API Gateway and Lambda have payload limits of 10 MB and 6 MB, respectively. However, DynamoDB and SQS are limited to 400 KB and 256 KB (with an extension available for large messages), which can represent a bottleneck.

Instead of passing the entire payload through the event-driven system, the payload is stored in S3 and referenced in DynamoDB via its bucket name and object key. This is known as the claim-check pattern. With this approach, the architecture supports payloads of up to 6 MB, as per the Lambda invocation payload quota.
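
A minimal sketch of the claim-check write path might look like the following; the bucket and table names are hypothetical.

import uuid
import boto3

s3 = boto3.client("s3")
dynamodb = boto3.client("dynamodb")

BUCKET = "webhook-payload-store"     # hypothetical bucket
TABLE = "webhook-events"             # hypothetical table

def store_payload(raw_body: str) -> dict:
    # Store the full payload in S3 and keep only a lightweight reference
    # (the claim check) in DynamoDB for downstream processing.
    object_key = f"payloads/{uuid.uuid4()}.json"
    s3.put_object(Bucket=BUCKET, Key=object_key, Body=raw_body.encode())

    dynamodb.put_item(
        TableName=TABLE,
        Item={
            "pk": {"S": object_key},
            "bucket": {"S": BUCKET},
            "object_key": {"S": object_key},
            "status": {"S": "RECEIVED"},
        },
    )
    return {"bucket": BUCKET, "object_key": object_key}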

Idempotency: For reliability, many providers prioritize at-least-once delivery, even if it means not guaranteeing exactly-once delivery. They can transmit the same request multiple times, resulting in duplicates. To handle this, a Lambda function checks the event’s unique identifier against previous records in DynamoDB. If the event has not already been processed, you create a DynamoDB item.
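
A minimal sketch of this check uses a conditional write, which fails if the identifier has already been recorded; the table name and key attribute are hypothetical.

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")
TABLE = "webhook-events"   # hypothetical table

def record_if_new(event_id: str) -> bool:
    # Returns True if the event has not been seen before, False for duplicates.
    try:
        dynamodb.put_item(
            TableName=TABLE,
            Item={"pk": {"S": event_id}},
            ConditionExpression="attribute_not_exists(pk)",   # reject duplicates
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False
        raise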

Ordering: Consider processing requests in their intended order. As most providers prioritize at-least-once delivery, events can arrive out of order. To indicate order, events may include a timestamp or a sequence identifier in the payload. If not, ordering may be on a best-effort basis based on when the webhook is received. To handle ordering reliably, select event-driven services that ensure ordering. This example uses DynamoDB Streams and EventBridge Pipes.

Flexible Processing: EventBridge Pipes provide integrations to a range of event-driven services as targets. You can route events to different targets based on filters. Different event types may require different processors. For example, you can use Step Functions for orchestrating complex workflows, Lambda for compute operations with less than 15-minute execution time, SQS to buffer requests, and Amazon Elastic Container Service (ECS) for long-running compute jobs. EventBridge Pipes provide transformation to ensure only necessary payloads are sent, and enrichment if additional information is required.

Costs: This example considers a use case that can handle large payloads. However, if you can ensure that providers send minimal payloads, consider a simpler architecture without the claim-check pattern to minimize cost.

Conclusion

Webhooks are a popular method for applications to communicate, and for businesses to collaborate and integrate with customers and partners.

This post shows how you can build applications to send and receive webhooks on AWS. It uses serverless services such as EventBridge and Lambda, which are well-suited for event-driven use cases. It covers high-level reference architectures, considerations, best practices, and code samples to assist in building your solution.

For standards and best practices on webhooks, visit the open-source community resources Webhooks.fyi and CloudEvents.io.

For more serverless learning resources, visit Serverless Land.

Unlock scalable analytics with AWS Glue and Google BigQuery

Post Syndicated from Kartikay Khator original https://aws.amazon.com/blogs/big-data/unlock-scalable-analytics-with-aws-glue-and-google-bigquery/

Data integration is the foundation of robust data analytics. It encompasses the discovery, preparation, and composition of data from diverse sources. In the modern data landscape, accessing, integrating, and transforming data from diverse sources is a vital process for data-driven decision-making. AWS Glue, a serverless data integration and extract, transform, and load (ETL) service, has revolutionized this process, making it more accessible and efficient. AWS Glue eliminates complexities and costs, allowing organizations to perform data integration tasks in minutes, boosting efficiency.

This blog post explores the newly announced managed connector for Google BigQuery and demonstrates how to build a modern ETL pipeline with AWS Glue Studio without writing code.

Overview of AWS Glue

AWS Glue is a serverless data integration service that makes it easier to discover, prepare, and combine data for analytics, machine learning (ML), and application development. AWS Glue provides all the capabilities needed for data integration, so you can start analyzing your data and putting it to use in minutes instead of months. AWS Glue provides both visual and code-based interfaces to make data integration easier. Users can more easily find and access data using the AWS Glue Data Catalog. Data engineers and ETL (extract, transform, and load) developers can visually create, run, and monitor ETL workflows in a few steps in AWS Glue Studio. Data analysts and data scientists can use AWS Glue DataBrew to visually enrich, clean, and normalize data without writing code.

Introducing Google BigQuery Spark connector

To meet the demands of diverse data integration use cases, AWS Glue now offers a native Spark connector for Google BigQuery. Customers can now use AWS Glue 4.0 for Spark to read from and write to tables in Google BigQuery. Additionally, you can read an entire table or run a custom query and write your data using direct and indirect writing methods. You connect to BigQuery using service account credentials stored securely in AWS Secrets Manager.

Benefits of Google BigQuery Spark connector

  • Seamless integration: The native connector offers an intuitive and streamlined interface for data integration, reducing the learning curve.
  • Cost efficiency: Building and maintaining custom connectors can be expensive. The native connector provided by AWS Glue is a cost-effective alternative.
  • Efficiency: Data transformation tasks that previously took weeks or months can now be accomplished within minutes, optimizing efficiency.

Solution overview

In this example, you create two ETL jobs using AWS Glue with the native Google BigQuery connector.

  1. Query a BigQuery table and save the data into Amazon Simple Storage Service (Amazon S3) in Parquet format.
  2. Use the data extracted from the first job to transform and create an aggregated result to be stored in Google BigQuery.

Prerequisites

The dataset used in this solution is the NCEI/WDS Global Significant Earthquake Database, with a global listing of over 5,700 earthquakes from 2150 BC to the present. Copy this public data into your Google BigQuery project or use your existing dataset.

Configure BigQuery connections

To connect to Google BigQuery from AWS Glue, see Configuring BigQuery connections. You must create and store your Google Cloud Platform credentials in a Secrets Manager secret, then associate that secret with a Google BigQuery AWS Glue connection.
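
For reference, the following sketch creates such a secret with boto3. It assumes the connector expects the base64-encoded service account JSON under a credentials key; check Configuring BigQuery connections for the exact secret format, and treat the file path and secret name here as placeholders.

import base64
import json
import boto3

secretsmanager = boto3.client("secretsmanager")

# Base64-encode the Google Cloud service account key file.
with open("service-account.json", "rb") as f:
    encoded_credentials = base64.b64encode(f.read()).decode()

secretsmanager.create_secret(
    Name="bigquery-glue-credentials",                          # placeholder secret name
    SecretString=json.dumps({"credentials": encoded_credentials}),
)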

Set up Amazon S3

Every object in Amazon S3 is stored in a bucket. Before you can store data in Amazon S3, you must create an S3 bucket to store the results.

To create an S3 bucket:

  1. On the AWS Management Console for Amazon S3, choose Create bucket.
  2. Enter a globally unique Name for your bucket; for example, awsglue-demo.
  3. Choose Create bucket.

Create an IAM role for the AWS Glue ETL job

When you create the AWS Glue ETL job, you specify an AWS Identity and Access Management (IAM) role for the job to use. The role must grant access to all resources used by the job, including Amazon S3 (for any sources, targets, scripts, driver files, and temporary directories), and Secrets Manager.

For instructions, see Configure an IAM role for your ETL job.

Solution walkthrough

Create a visual ETL job in AWS Glue Studio to transfer data from Google BigQuery to Amazon S3

  1. Open the AWS Glue console.
  2. In AWS Glue, navigate to Visual ETL under the ETL jobs section and create a new ETL job using Visual with a blank canvas.
  3. Enter a Name for your AWS Glue job, for example, bq-s3-dataflow.
  4. Select Google BigQuery as the data source.
    1. Enter a name for your Google BigQuery source node, for example, noaa_significant_earthquakes.
    2. Select a Google BigQuery connection, for example, bq-connection.
    3. Enter a Parent project, for example, bigquery-public-datasources.
    4. Select Choose a single table for the BigQuery Source.
    5. Enter the table you want to migrate in the form [dataset].[table], for example, noaa_significant_earthquakes.earthquakes.
  5. Next, choose the data target as Amazon S3.
    1. Enter a Name for the target Amazon S3 node, for example, earthquakes.
    2. Select the output data Format as Parquet.
    3. Select the Compression Type as Snappy.
    4. For the S3 Target Location, enter the bucket created in the prerequisites, for example, s3://<YourBucketName>/noaa_significant_earthquakes/earthquakes/.
    5. You should replace <YourBucketName> with the name of your bucket.
  6. Next go to the Job details. In the IAM Role, select the IAM role from the prerequisites, for example, AWSGlueRole.
  7. Choose Save.

Run and monitor the job

  1. After your ETL job is configured, you can run the job. AWS Glue will run the ETL process, extracting data from Google BigQuery and loading it into your specified S3 location.
  2. Monitor the job’s progress in the AWS Glue console. You can see logs and job run history to ensure everything is running smoothly.

Data validation

  1. After the job has run successfully, validate the data in your S3 bucket to ensure it matches your expectations. You can see the results using Amazon S3 Select.

Automate and schedule

  1. If needed, set up job scheduling to run the ETL process regularly. You can use AWS Glue to automate your ETL jobs, ensuring your S3 bucket is always up to date with the latest data from Google BigQuery.

You’ve successfully configured an AWS Glue ETL job to transfer data from Google BigQuery to Amazon S3. Next, you create the ETL job to aggregate this data and transfer it to Google BigQuery.

Finding earthquake hotspots with AWS Glue Studio Visual ETL

  1. Open AWS Glue console.
  2. In AWS Glue navigate to Visual ETL under the ETL jobs section and create a new ETL job using Visual with a blank canvas.
  3. Provide a name for your AWS Glue job, for example, s3-bq-dataflow.
  4. Choose Amazon S3 as the data source.
    1. Enter a Name for the source Amazon S3 node, for example, earthquakes.
    2. Select S3 location as the S3 source type.
    3. Enter the S3 bucket created in the prerequisites as the S3 URL, for example, s3://<YourBucketName>/noaa_significant_earthquakes/earthquakes/.
    4. You should replace <YourBucketName> with the name of your bucket.
    5. Select the Data format as Parquet.
    6. Select Infer schema.
  5. Next, choose Select Fields transformation.
    1. Select earthquakes as Node parents.
    2. Select fields: id, eq_primary, and country.
  6. Next, choose Aggregate transformation.
    1. Enter a Name, for example Aggregate.
    2. Choose Select Fields as Node parents.
    3. Choose eq_primary and country as the group by columns.
    4. Add id as the aggregate column and count as the aggregation function.
  7. Next, choose RenameField transformation.
    1. Enter a Name for the transform node, for example, Rename eq_primary.
    2. Choose Aggregate as Node parents.
    3. Choose eq_primary as the Current field name and enter earthquake_magnitude as the New field name.
  8. Next, choose RenameField transformation
    1. Enter a Name for the transform node, for example, Rename count(id).
    2. Choose Rename eq_primary as Node parents.
    3. Choose count(id) as the Current field name and enter number_of_earthquakes as the New field name.
  9. Next, choose the data target as Google BigQuery.
    1. Provide a name for your Google BigQuery target node, for example, most_powerful_earthquakes.
    2. Select a Google BigQuery connection, for example, bq-connection.
    3. Select Parent project, for example, bigquery-public-datasources.
    4. Enter the name of the Table you want to create in the form [dataset].[table], for example, noaa_significant_earthquakes.most_powerful_earthquakes.
    5. Choose Direct as the Write method.
  10. Next go to the Job details tab and in the IAM Role, select the IAM role from the prerequisites, for example, AWSGlueRole.
  11. Choose Save.

Run and monitor the job

  1. After your ETL job is configured, you can run the job. AWS Glue runs the ETL process, extracting data from Amazon S3 and loading it into Google BigQuery.
  2. Monitor the job’s progress in the AWS Glue console. You can see logs and job run history to ensure everything is running smoothly.

Data validation

  1. After the job has run successfully, validate the data in your Google BigQuery dataset. This ETL job returns a list of countries where the most powerful earthquakes have occurred. It provides these by counting the number of earthquakes for a given magnitude by country.

Automate and schedule

  1. You can set up job scheduling to run the ETL process regularly. AWS Glue allows you to automate your ETL jobs, ensuring your Google BigQuery table is always up to date with the latest data from Amazon S3.

That’s it! You’ve successfully set up an AWS Glue ETL job to transfer data from Amazon S3 to Google BigQuery. You can use this integration to automate the process of data extraction, transformation, and loading between these two platforms, making your data readily available for analysis and other applications.

Clean up

To avoid incurring charges, clean up the resources used in this blog post from your AWS account by completing the following steps:

  1. On the AWS Glue console, choose Visual ETL in the navigation pane.
  2. From the list of jobs, select the job bq-s3-dataflow and delete it.
  3. From the list of jobs, select the job s3-bq-dataflow and delete it.
  4. On the AWS Glue console, choose Connections in the navigation pane under Data Catalog.
  5. Choose the BigQuery connection you created and delete it.
  6. On the Secrets Manager console, choose the secret you created and delete it.
  7. On the IAM console, choose Roles in the navigation pane, then select the role you created for the AWS Glue ETL job and delete it.
  8. On the Amazon S3 console, search for the S3 bucket you created, choose Empty to delete the objects, then delete the bucket.
  9. Clean up resources in your Google account by deleting the project that contains the Google BigQuery resources. Follow the documentation to clean up the Google resources.

Conclusion

The integration of AWS Glue with Google BigQuery simplifies the analytics pipeline, reduces time-to-insight, and facilitates data-driven decision-making. It empowers organizations to streamline data integration and analytics. The serverless nature of AWS Glue means no infrastructure management, and you pay only for the resources consumed while your jobs are running. As organizations increasingly rely on data for decision-making, this native Spark connector provides an efficient, cost-effective, and agile solution to swiftly meet data analytics needs.

If you’re interested in seeing how to read from and write to tables in Google BigQuery in AWS Glue, take a look at the step-by-step video tutorial. In this tutorial, we walk through the entire process, from setting up the connection to running the data transfer flow. For more information on AWS Glue, visit AWS Glue.

Appendix

If you are looking to implement this example, using code instead of the AWS Glue console, use the following code snippets.

Reading data from Google BigQuery and writing data into Amazon S3

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# STEP-1 Read the data from Big Query Table 
noaa_significant_earthquakes_node1697123333266 = (
    glueContext.create_dynamic_frame.from_options(
        connection_type="bigquery",
        connection_options={
            "connectionName": "bq-connection",
            "parentProject": "bigquery-public-datasources",
            "sourceType": "table",
            "table": "noaa_significant_earthquakes.earthquakes",
        },
        transformation_ctx="noaa_significant_earthquakes_node1697123333266",
    )
)
# STEP-2 Write the data read from Big Query Table into S3
# You should replace <YourBucketName> with the name of your bucket.
earthquakes_node1697157772747 = glueContext.write_dynamic_frame.from_options(
    frame=noaa_significant_earthquakes_node1697123333266,
    connection_type="s3",
    format="glueparquet",
    connection_options={
        "path": "s3://<YourBucketName>/noaa_significant_earthquakes/earthquakes/",
        "partitionKeys": [],
    },
    format_options={"compression": "snappy"},
    transformation_ctx="earthquakes_node1697157772747",
)

job.commit()

Reading and aggregating data from Amazon S3 and writing into Google BigQuery

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql import functions as SqlFuncs

def sparkAggregate(
    glueContext, parentFrame, groups, aggs, transformation_ctx
) -> DynamicFrame:
    aggsFuncs = []
    for column, func in aggs:
        aggsFuncs.append(getattr(SqlFuncs, func)(column))
    result = (
        parentFrame.toDF().groupBy(*groups).agg(*aggsFuncs)
        if len(groups) > 0
        else parentFrame.toDF().agg(*aggsFuncs)
    )
    return DynamicFrame.fromDF(result, glueContext, transformation_ctx)

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# STEP-1 Read the data from Amazon S3 bucket
# You should replace <YourBucketName> with the name of your bucket.
earthquakes_node1697218776818 = glueContext.create_dynamic_frame.from_options(
    format_options={},
    connection_type="s3",
    format="parquet",
    connection_options={
        "paths": [
            "s3://<YourBucketName>/noaa_significant_earthquakes/earthquakes/"
        ],
        "recurse": True,
    },
    transformation_ctx="earthquakes_node1697218776818",
)

# STEP-2 Select fields
SelectFields_node1697218800361 = SelectFields.apply(
    frame=earthquakes_node1697218776818,
    paths=["id", "eq_primary", "country"],
    transformation_ctx="SelectFields_node1697218800361",
)

# STEP-3 Aggregate data
Aggregate_node1697218823404 = sparkAggregate(
    glueContext,
    parentFrame=SelectFields_node1697218800361,
    groups=["eq_primary", "country"],
    aggs=[["id", "count"]],
    transformation_ctx="Aggregate_node1697218823404",
)

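# STEP-4 Rename eq_primary to earthquake_magnitude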
Renameeq_primary_node1697219483114 = RenameField.apply(
    frame=Aggregate_node1697218823404,
    old_name="eq_primary",
    new_name="earthquake_magnitude",
    transformation_ctx="Renameeq_primary_node1697219483114",
)

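# STEP-5 Rename count(id) to number_of_earthquakes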
Renamecountid_node1697220511786 = RenameField.apply(
    frame=Renameeq_primary_node1697219483114,
    old_name="`count(id)`",
    new_name="number_of_earthquakes",
    transformation_ctx="Renamecountid_node1697220511786",
)

# STEP-6 Write the aggregated data into Google BigQuery
most_powerful_earthquakes_node1697220563923 = (
    glueContext.write_dynamic_frame.from_options(
        frame=Renamecountid_node1697220511786,
        connection_type="bigquery",
        connection_options={
            "connectionName": "bq-connection",
            "parentProject": "bigquery-public-datasources",
            "writeMethod": "direct",
            "table": "noaa_significant_earthquakes.most_powerful_earthquakes",
        },
        transformation_ctx="most_powerful_earthquakes_node1697220563923",
    )
)

job.commit()


About the authors

Kartikay Khator is a Solutions Architect in Global Life Sciences at Amazon Web Services (AWS). He is passionate about building innovative and scalable solutions to meet the needs of customers, focusing on AWS Analytics services. Beyond the tech world, he is an avid runner and enjoys hiking.

Kamen Sharlandjiev is a Sr. Big Data and ETL Solutions Architect and Amazon AppFlow expert. He’s on a mission to make life easier for customers who are facing complex data integration challenges. His secret weapon? Fully managed, low-code AWS services that can get the job done with minimal effort and no coding.

Anshul Sharma is a Software Development Engineer in the AWS Glue team. He is driving the connectivity charter, which provides Glue customers a native way of connecting any data source (data warehouses, data lakes, NoSQL, and so on) to Glue ETL jobs. Beyond the tech world, he is a cricket and soccer lover.

Archiving and replaying messages with Amazon SNS FIFO

Post Syndicated from Benjamin Smith original https://aws.amazon.com/blogs/compute/archiving-and-replaying-messages-with-amazon-sns-fifo/

This post is written by A Mohammed Atiq, Solutions Architect and Mithun Mallick, Principal Solutions Architect, Serverless

Amazon Simple Notification Service (SNS) offers a flexible, fully managed messaging service, allowing applications to send and receive messages. SNS acts as a channel, delivering events from publishers to subscribers.

Today, AWS is announcing a new capability that enables you to archive and replay messages published to SNS FIFO (first-in, first-out) topics. Now, when enabled with an archive policy, SNS FIFO topics automatically:

  • Archive events, with a no-code, in-place message archive that doesn’t require any external resources. You only need to define an archive policy on your topic, including the required retention period (from 1 to 365 days).
  • Replay events: subscribers benefit from managed, no-code message replay functionality, with built-in progress reporting and message filtering capabilities. To initiate a replay, subscribers apply a replay policy to their subscription, defining a starting point and an ending point using timestamps.

This feature can be useful in failure recovery and state replication scenarios.

Failure recovery

In failure recovery scenarios, developers can use this to reprocess a subset of messages and recover from a downstream application failure or a dependency issue.

Consider a situation where a search application needs to reprocess messages because the search engine’s index has been erased. To initiate recovery, the search application would update the ReplayPolicy attribute in its existing subscription using the SetSubscriptionAttributes API action, to start receiving messages from a specific point in time, rather than from when the Archive policy was applied to the topic.
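
For illustration, the following boto3 sketch applies a ReplayPolicy to an existing subscription. The subscription ARN and timestamp are placeholders, and the same operation is shown with the AWS CLI in the walkthrough later in this post.

import json
import boto3

sns = boto3.client("sns")

sns.set_subscription_attributes(
    SubscriptionArn="arn:aws:sns:us-east-1:123456789012:SearchTopic.fifo:1111aaaa-2222-bbbb-3333-cccc4444dddd",  # placeholder
    AttributeName="ReplayPolicy",
    AttributeValue=json.dumps({
        "PointType": "Timestamp",
        "StartingPoint": "2023-10-01T00:00:00.000Z",   # replay messages archived since this time
    }),
)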

State replication

For state replication scenarios, this feature enables new applications to duplicate the state of previously subscribed applications.

Consider an internal data warehouse application that must replicate the state of an external search application to make the data indexed in the search engine available to product managers and other internal staff. The data warehouse application subscribes its newly created endpoint (for example, an Amazon SQS FIFO queue) to the topic using the Subscribe API action and sets the ReplayPolicy subscription attribute.

If it opts to replicate the full state of the search engine, it might set the timestamp in its ReplayPolicy to coincide with the search engine’s subscription’s creation date and time, ensuring all data ever delivered to the search engine is also delivered to the data warehouse tool.

Enabling the archive policy via the SNS console

When creating a new SNS FIFO topic, you see an option for the archive policy. This policy determines how long SNS stores your messages, making them available for potential resending to a subscription if necessary. The Archive policy does not activate by default – you must enable it for each topic manually or automate the operation.

For instance, the retention period for this FIFO topic is set at 30 days. However, you can adjust this duration anywhere from 1 to 365 days. Once you activate the archive policy, messages sent to this topic are archived for the defined period.

To confirm that the archive policy is in effect after creating the topic, check the topic details: the retention period is shown next to the archive policy, and its status is displayed as Active.

After subscribing an SQS FIFO queue to the SNS FIFO topic, you can replay messages; initially, the Replay status shows Not running. You can subscribe both FIFO and standard SQS queues to SNS FIFO topics, providing flexibility for various use-case requirements. To initiate a replay, navigate to the SNS console, choose Replay, and then choose Start replay.

When you initiate a replay, a window appears that allows you to specify the start and end dates, as well as the exact time from which messages are replayed. This gives you the flexibility to replay only the messages of interest, instead of every archived message, by setting a specific time window. When you choose Start replay, the service begins sending messages to the subscriber.

You can also define settings for the SNS FIFO archive and replay features with both AWS CloudFormation and the AWS Serverless Application Model (AWS SAM).

Use Cases

Replaying events for error recovery in microservices

Consider a scenario where an insurance application uses multiple microservices and one claims processing microservice encounters an error and drops a claim. Such an oversight can cause the workload to be out of sync.

With the archive and replay feature, you can revisit and replay events from the time the error was detected. This allows the microservice to recognize the missed event and complete the necessary actions, ensuring the system remains updated and accurate.

  1. Messages are published to an SNS FIFO topic from an application.
  2. Messages are delivered to an SQS FIFO queue containing claim details to be processed by a downstream microservice.
  3. The microservice fails to process a series of messages due to an exception and discards all of the messages.
  4. The user then initiates a replay from the SNS FIFO topic, specifies the time frame of messages to replay based on when the failure occurred.
  5. The microservice is now able to successfully process the replayed messages and persists data into a DynamoDB table.

Replicating state across Regions

In situations where an application spans multiple Regions, and a microservice encounters difficulties in its primary Region, you can replicate the infrastructure to another Region using an active/standby setup.

You can reroute traffic to the standby microservice in the secondary Region, maintaining synchronization through event replays. You can set an end time in the SNS replay policy, but if this isn’t defined, replaying continues until the most recent messages are sent.

Afterwards, the SNS subscription resumes normal functioning, capturing all new messages. This approach is suitable for many state replication scenarios, such as cross-Region backup strategies, as it helps minimize downtime and prevent message loss.

  1. Messages are published to an SNS FIFO topic from an application.
  2. Messages are delivered to an SQS FIFO queue containing claim details to be processed by a downstream microservice.
  3. The microservice fails to process a series of messages due to an exception and discards all of the messages.
  4. The user then subscribes a new SQS FIFO queue in another Region, initiates a replay from the SNS FIFO topic, and specifies the time frame of messages to replay based on when the failure occurred.
  5. The microservice in the other Region retrieves the replayed messages from the new SQS FIFO queue, successfully processes the series of messages, and persists the data into a DynamoDB table.

Configuring SNS FIFO archive and replay for auto insurance processing

Managing auto insurance claims requires timely coordination. This walkthrough shows the combined benefits of SNS FIFO and SQS FIFO to process claims in the correct sequence.

Both SQS FIFO and SQS standard queues can be subscribed to the SNS FIFO topic, offering versatility in handling claims. The archive and replay functionality of SNS FIFO is paramount; disruptions in downstream microservices don’t compromise claim integrity due to the replay capability.

This walkthrough guides you through deploying an auto insurance claims processing example using the AWS CLI. You create an SNS FIFO topic for claim submissions and two SQS FIFO queues. The first queue is for primary processing of the claims, while the second is specifically for message replays to support application state replication across various system instances.

Prerequisites

To follow this walkthrough, you need the AWS CLI configured with appropriate permissions and the jq utility installed, because the commands below parse CLI output with jq.

Step 1 – Creating resources using the AWS CLI and storing variables

Run the following commands in the terminal.

REGION=$(aws configure get region)

# Create an SNS FIFO topic for auto insurance claims
AUTO_INSURANCE_TOPIC_ARN=$(aws sns create-topic --name "AutoInsuranceClaimsTopic.fifo" --attributes "FifoTopic=true,ContentBasedDeduplication=true,DisplayName=Auto Insurance Claims Topic" --region $REGION | jq -r '.TopicArn')

# Create primary and replay SQS FIFO queues
AUTO_INSURANCE_QUEUE_URL=$(aws sqs create-queue --queue-name "AutoInsuranceClaimsQueue.fifo" --attributes "FifoQueue=true" --region $REGION | jq -r '.QueueUrl')
AUTO_INSURANCE_REPLAY_QUEUE_URL=$(aws sqs create-queue --queue-name "AutoInsuranceReplayQueue.fifo" --attributes "FifoQueue=true" --region $REGION | jq -r '.QueueUrl')

# Get ARNs for both SQS queues
AUTO_INSURANCE_QUEUE_ARN=$(aws sqs get-queue-attributes --queue-url $AUTO_INSURANCE_QUEUE_URL --attribute-names QueueArn --output text --query 'Attributes.QueueArn')
AUTO_INSURANCE_REPLAY_QUEUE_ARN=$(aws sqs get-queue-attributes --queue-url $AUTO_INSURANCE_REPLAY_QUEUE_URL --attribute-names QueueArn --region $REGION | jq -r '.Attributes.QueueArn')

# Define a policy allowing the topic to publish to both queues
SQS_POLICY_TEMPLATE="{\"Policy\" : \"{ \\\"Version\\\": \\\"2012-10-17\\\", \\\"Statement\\\": [ { \\\"Sid\\\": \\\"1\\\", \\\"Effect\\\": \\\"Allow\\\", \\\"Principal\\\": { \\\"Service\\\": \\\"sns.amazonaws.com\\\" }, \\\"Action\\\": [\\\"sqs:SendMessage\\\"], \\\"Resource\\\": [\\\"$AUTO_INSURANCE_QUEUE_ARN\\\", \\\"$AUTO_INSURANCE_REPLAY_QUEUE_ARN\\\"], \\\"Condition\\\": { \\\"ArnLike\\\": { \\\"aws:SourceArn\\\": [\\\"$AUTO_INSURANCE_TOPIC_ARN\\\"] } } } ]}\"}"

# Apply the access policy to the queues
aws sqs set-queue-attributes --queue-url $AUTO_INSURANCE_QUEUE_URL --attributes file://<(echo $SQS_POLICY_TEMPLATE)
aws sqs set-queue-attributes --queue-url $AUTO_INSURANCE_REPLAY_QUEUE_URL --attributes file://<(echo $SQS_POLICY_TEMPLATE)

# Subscribe the primary queue to the created SNS FIFO topic
aws sns subscribe --topic-arn $AUTO_INSURANCE_TOPIC_ARN --protocol sqs --notification-endpoint $AUTO_INSURANCE_QUEUE_ARN --region $REGION

Step 2 – Setting the archive policy on the SNS FIFO topic

Modify the attributes of the SNS FIFO topic to set a retention period. This determines how long a message is retained in the topic archive. This example uses 30 days.

# Set a 30-day retention period for the SNS FIFO topic

aws sns set-topic-attributes --region $REGION --topic-arn $AUTO_INSURANCE_TOPIC_ARN --attribute-name ArchivePolicy --attribute-value "{\"MessageRetentionPeriod\":\"30\"}"

Step 3 – Publishing auto insurance claim details

Publish a sample claim to the SNS FIFO topic. This step mimics a real-world scenario where an insurance claim must be processed by subscribers of the topic.

# Get the current timestamp and publish a sample insurance claim
TIMESTAMP_START=$(date -u +%FT%T.000Z)
aws sns publish --region $REGION --topic-arn $AUTO_INSURANCE_TOPIC_ARN --message "{ \"claim_type\": \"collision\", \"registration\": \"AB123CDE\" }" --message-group-id "group1"

Step 4 – Reading auto insurance claim details

Retrieve the insurance claim details from the primary SQS FIFO queue. This simulates a process reading the insurance claim to take action. After reading the message, the claim is deleted from the queue to avoid reprocessing.

# Fetch the claim details from the primary queue, then delete to avoid redundancy
MESSAGE=$(aws sqs receive-message --region $REGION --queue-url $AUTO_INSURANCE_QUEUE_URL --output json)
MESSAGE_TEXT=$(echo "$MESSAGE" | jq -r '.Messages[0].Body')
MESSAGE_RECEIPT=$(echo "$MESSAGE" | jq -r '.Messages[0].ReceiptHandle')
aws sqs delete-message --region $REGION --queue-url $AUTO_INSURANCE_QUEUE_URL --receipt-handle $MESSAGE_RECEIPT
echo "Received claim details: ${MESSAGE_TEXT}"

Step 5 – Subscribing the replay SQS queue to the SNS FIFO topic

To ensure no claims are lost, configure a replay policy for your SQS FIFO queue subscription. This policy sets the schedule from which messages are replayed to the SQS FIFO queue. Here, you subscribe a replay queue with a replay policy and then monitor the status of the replay queue. Once complete, read the replayed claim details from the secondary SQS FIFO queue. If any processing issues occurred initially, there is a second chance to process the claim.

Subscribe replay queue to SNS FIFO topic:

# Subscribe the replay queue to the topic and define its replay policy
NEW_SUBSCRIPTION_ARN=$(aws sns subscribe --region $REGION --topic-arn $AUTO_INSURANCE_TOPIC_ARN --protocol sqs --return-subscription-arn --notification-endpoint $AUTO_INSURANCE_REPLAY_QUEUE_ARN --attributes "{\"ReplayPolicy\":\"{\\\"PointType\\\":\\\"Timestamp\\\",\\\"StartingPoint\\\":\\\"$TIMESTAMP_START\\\"}\"}" --output json | jq -r '.SubscriptionArn')

To monitor the replay status:

# Wait for the replay to complete
while [[ $(aws sns get-subscription-attributes --region $REGION --subscription-arn $NEW_SUBSCRIPTION_ARN --output text | awk 'END{print $9}') != 'Completed' ]]; do printf "."; sleep 5; done; echo "Replay complete";

To read the replayed message and delete the message from the queue:

# Fetch the replayed message and then remove it from the queue
REPLAYED_MESSAGE=$(aws sqs receive-message --region $REGION --queue-url $AUTO_INSURANCE_REPLAY_QUEUE_URL --output json)
REPLAYED_MESSAGE_TEXT=$(echo "$REPLAYED_MESSAGE" | jq -r '.Messages[0].Body')
REPLAYED_MESSAGE_RECEIPT=$(echo "$REPLAYED_MESSAGE" | jq -r '.Messages[0].ReceiptHandle')
aws sqs delete-message --region $REGION --queue-url $AUTO_INSURANCE_REPLAY_QUEUE_URL --receipt-handle $REPLAYED_MESSAGE_RECEIPT
echo "Received replayed claim details: ${REPLAYED_MESSAGE_TEXT}"

Cleaning up

To avoid incurring unnecessary costs, clean up the resources created in this walkthrough:

# Delete the primary SQS FIFO queue
aws sqs delete-queue --queue-url $AUTO_INSURANCE_QUEUE_URL --region $REGION

# Delete the replay SQS FIFO queue
aws sqs delete-queue --queue-url $AUTO_INSURANCE_REPLAY_QUEUE_URL --region $REGION

# Unset the 'ArchivePolicy' attribute
aws sns set-topic-attributes --region $REGION --topic-arn $AUTO_INSURANCE_TOPIC_ARN --attribute-name ArchivePolicy --attribute-value "{}"

# Delete the SNS FIFO topic
aws sns delete-topic --topic-arn $AUTO_INSURANCE_TOPIC_ARN --region $REGION

Conclusion

The new SNS FIFO archive and replay feature provides a substantial foundation for event-driven applications, emphasizing failure recovery and application state replication. These features allow developers to efficiently manage and recover from disruptions, and ensure state replication across different application instances or environments.

Get started with this new SNS FIFO capability by using the AWS Management Console, AWS CLI, AWS Software Development Kit (SDK), or AWS CloudFormation. For information on cost, see SNS pricing and SQS pricing.

For more serverless learning resources, visit Serverless Land.

Enhancing runtime security and governance with the AWS Lambda Runtime API proxy extension

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/enhancing-runtime-security-and-governance-with-the-aws-lambda-runtime-api-proxy-extension/

This post is written by Anton Aleksandrov, Principal Serverless Solutions Architect,  and Shridhar Pandey, Senior AWS Lambda Product Manager.

AWS Lambda runtimes use the Lambda Runtime API to communicate with the Lambda service. Runtimes use it to retrieve inbound events to be processed by the function handler, return successful handler responses to the Lambda service, and report failures. This post shows how to intercept inbound events and outbound responses without changing function code, using the Runtime API proxy pattern.

This approach enables security vendors and engineering teams to provide enhanced, non-invasive security and governance tools for Lambda functions. Use cases include sanitizing event payload, blocking malicious events, and auditing and augmenting payloads.

Overview

The Lambda Runtime API is an HTTP endpoint available within the Lambda execution environment. It allows the Lambda runtime that executes the function code to communicate with the Lambda service. It is used by managed Lambda runtimes, such as Node.js or Python, as well as custom runtimes, which enable developers to create their own Lambda runtime in any programming language of their choice.

Lambda runtimes use the Runtime API to retrieve the next incoming event to be processed by the function handler and return the handler response to the Lambda service.

Lambda Extensions enable you to integrate Lambda functions with your organization’s preferred tools for monitoring, observability, security, and governance. You can use extensions from AWS, AWS Lambda Ready partners, and open-source projects for a wide range of use cases. Extensions allow adding functionality, such as pre-fetching configurations or dispatching telemetry, without making intrusive changes to function code. Lambda Extensions are packaged as Lambda layers and run as a separate process in the execution environment.

This is how runtimes and extensions communicate with the Lambda service via the Runtime API and Extensions API endpoints:

AWS Lambda Runtime API and Extensions API endpoints

When you register your extension with the Lambda service, you can specify you want to receive the INVOKE event. The Lambda service sends this event to the extension asynchronously when a function is invoked.

The information supplied contains the function invocation metadata, but not the event payload. This makes the event useful for observability, but limited for application security and governance use cases, such as inspecting payloads for vulnerabilities, sanitizing inputs, and blocking malicious events.

The Lambda Runtime API proxy is a pattern that enables you to hook into the function invocation request and response lifecycle. It allows you to use extensions to implement advanced security, compliance, governance, and observability scenarios without changes to function code. You can add runtime security mechanisms, implement audit procedures for data flowing in and out of the function, enhance observability by auto-injecting tracing headers, and many more.

Understanding the Lambda Runtime API workflow

This is how the Lambda Runtime consumes the Lambda Runtime API:

How the Lambda Runtime consumes the Lambda Runtime API

Lambda runtimes use the value of the AWS_LAMBDA_RUNTIME_API environment variable to make Runtime API requests. The two primary endpoints are /next, which is used to retrieve the next event to process, and /response, which is used to return event processing results to the Lambda service. In addition, the Runtime API also provides endpoints for reporting failures. See the full protocol and schema definitions of the Runtime API.
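
To make this workflow concrete, the following is a minimal sketch of the polling loop a custom runtime runs against these endpoints, shown with curl from a bash-based runtime. The endpoint paths and the request ID header are part of the documented Runtime API; the handler logic and response body are placeholders.

# Minimal Runtime API loop (sketch); the handler work is illustrative only
while true; do
  HEADERS="$(mktemp)"
  # Block until the next invocation event is available
  EVENT=$(curl -sS -D "$HEADERS" "http://${AWS_LAMBDA_RUNTIME_API}/2018-06-01/runtime/invocation/next")
  REQUEST_ID=$(grep -Fi Lambda-Runtime-Aws-Request-Id "$HEADERS" | tr -d '[:space:]' | cut -d: -f2)

  # Placeholder for the function handler
  RESPONSE='{"status":"ok"}'

  # Return the handler result for this invocation to the Lambda service
  curl -sS -X POST -d "$RESPONSE" \
    "http://${AWS_LAMBDA_RUNTIME_API}/2018-06-01/runtime/invocation/${REQUEST_ID}/response"
done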

How the Runtime API proxy approach works

The Runtime API proxy is a component that you can build to hook into the invocation workflow. It proxies requests and responses, allowing you to augment them, and control the workflow:

Runtime API proxy hooks

When the Lambda service creates a new execution environment, it starts by initializing the extensions attached to the function. The execution environment waits for all extensions to register with the Lambda service by calling the Extensions API /register endpoint, then proceeds to initialize the runtime. This allows you to start the Runtime API proxy HTTP listener during extension initialization, making it ready to serve the runtime requests.

Runtime API proxy flow

By default, the value of the AWS_LAMBDA_RUNTIME_API environment variable in the runtime process points to the Lambda Runtime API endpoint 127.0.0.1:9001. You can use a wrapper script to change that value to point to the Runtime API proxy endpoint instead.

A wrapper script enables you to customize the runtime startup behavior of your Lambda function by letting you set configuration parameters that cannot be set through language-specific environment variables. You can add a wrapper script to your function by setting the AWS_LAMBDA_EXEC_WRAPPER environment variable. The following wrapper script assumes that the Runtime API Proxy is listening on port 9009.

#!/bin/bash
# Point the runtime at the local Runtime API proxy instead of the Lambda-provided endpoint
export AWS_LAMBDA_RUNTIME_API="127.0.0.1:9009"
# Start the runtime process with its original arguments
exec "$@"

You can either add this export line to an existing wrapper script or create a new one.
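
To activate the wrapper, set the AWS_LAMBDA_EXEC_WRAPPER environment variable to the script path. The following is a short sketch using the AWS CLI, assuming the script ships in a Lambda layer and is therefore available under /opt; the function name and path are illustrative.

# Configure the function to launch the runtime through the wrapper script
# Note: --environment replaces the function's existing environment variables,
# so include any other variables the function needs in the same call
aws lambda update-function-configuration \
    --function-name my-function \
    --environment 'Variables={AWS_LAMBDA_EXEC_WRAPPER=/opt/wrapper.sh}'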

Runtime API proxy example

The Lambda service bootstraps the Runtime API proxy when it creates a new execution environment, so the proxy is ready to forward requests from the Lambda runtime to the Runtime API before the first invocation.

Implementing the Runtime API proxy logic

AWS recommends you implement extensions using a programming language that compiles to a binary executable, such as Golang or Rust. This allows you to use the extension with any Lambda runtime. Extensions implemented in interpreted languages, such as JavaScript and Python, or languages that require additional virtual machines, such as Java and C#, can only be used with that specific runtime.

This diagram shows a scenario where both incoming events and outbound responses are processed by the extension. You can use this workflow for auditing event or response payloads, sanitizing them, or injecting additional properties. You can use it for scenarios like masking account numbers, stripping personally identifiable information (PII), or injecting observability headers.

Runtime API proxy logic

This diagram demonstrates an advanced scenario, where the first inbound event is identified as malicious, and rejected by the Runtime API proxy. The function handler is not invoked. The second event is not flagged as malicious, and is therefore forwarded to the handler for processing. You can use this workflow for security scenarios like runtime application protection.

Runtime API proxy security scenario

AWS Partners using the Runtime API Proxy solution

“Using Lambda Runtime API proxy solution is a game-changing approach for us. It enables us to support multiple Lambda runtimes with a single implementation, provides comprehensive visibility into Lambda execution, and allows to detect attackers targeting serverless applications,” says Julio Guerra, Engineering Manager, Application Security Management, Datadog.

“Lambda Runtime API proxy is a simple solution that gives us a pluggable way to protect Lambda Function URLs. We can implement request authorization and enrichment with no changes to function code,” says Ilya Zilber, Senior Manager, Solutions Engineering, Okta Inc.

Security best practices

Extensions run within the same execution environment as the function, so they have the same level of access to resources such as file system, networking, and environment variables. IAM permissions assigned to the function are shared with extensions. Our guidance is to assign the least required privileges to your functions.

Always install extensions from a trusted source only. Use Infrastructure as Code (IaC) tools, such as AWS CloudFormation, to simplify the task of attaching the same extension configuration, including AWS Identity and Access Management (IAM) permissions, to multiple functions. Additionally, IaC tools allow you to have an audit record of extensions and versions you’ve used previously.

When building extensions, do not log sensitive data. Sanitize payloads and metadata before logging or persisting them for audit purposes.

Considerations

The Runtime API proxy approach allows you to hook into the Lambda request/response workflow, enabling new security and observability use cases. There are several important considerations:

  • This requires you to have a good understanding of the Lambda execution environment lifecycle and the Lambda Runtime API. You must implement proxying for all Runtime API endpoints and handle potential runtime failures.
  • Prepare your extension for composability for scenarios in which more than one extension implements the Runtime API proxy pattern. Allow your extension consumers to configure the extension via environment variables using at least two parameters: the port your proxy listens on and the Runtime API endpoint your proxy forwards requests to. The latter should default to the original value of the AWS_LAMBDA_RUNTIME_API environment variable. See sample implementations below for details.
  • Proxying API requests with default buffered responses requires additional work to support functions with response payload streaming.
  • Proxying API requests adds latency. The added overhead depends on your implementation. AWS recommends using programming languages that can be compiled to an executable binary, such as Rust and Golang, and keeping your extensions lightweight and optimized.

Samples

You can find sample extensions implementing the Runtime API Proxy at https://github.com/aws-samples/aws-lambda-extensions/. See Golang, Rust, and Node.js samples.

Follow the instructions described in README.md for a step-by-step tutorial on running the extension.

Conclusion

This post introduces and illustrates the Lambda Runtime API proxy pattern. You can use this pattern to hook into the Lambda function request and response workflow to intercept, process, audit, modify, and block inbound events and handler responses.

You can use this pattern to implement enhanced runtime security and governance scenarios, as well as scenarios from other domains. AWS customers and partners can use this advanced solution approach to add enhanced security and observability to Lambda functions without requiring code changes.

For more serverless learning resources, visit Serverless Land.

Mask and redact sensitive data published to Amazon SNS using managed and custom data identifiers

Post Syndicated from Otavio Ferreira original https://aws.amazon.com/blogs/security/mask-and-redact-sensitive-data-published-to-amazon-sns-using-managed-and-custom-data-identifiers/

Today, we’re announcing a new capability for Amazon Simple Notification Service (Amazon SNS) message data protection. In this post, we show you how you can use this new capability to create custom data identifiers to detect and protect domain-specific sensitive data, such as your company’s employee IDs. Previously, you could only use managed data identifiers to detect and protect common sensitive data, such as names, addresses, and credit card numbers.

Overview

Amazon SNS is a serverless messaging service that provides topics for push-based, many-to-many messaging for decoupling distributed systems, microservices, and event-driven serverless applications. As applications become more complex, it can become challenging for topic owners to manage the data flowing through their topics. These applications might inadvertently start sending sensitive data to topics, increasing regulatory risk. To mitigate the risk, you can use message data protection to protect sensitive application data using built-in, no-code, scalable capabilities.

To discover and protect data flowing through SNS topics with message data protection, you can associate data protection policies to your topics. Within these policies, you can write statements that define which types of sensitive data you want to discover and protect. Within each policy statement, you can then define whether you want to act on data flowing inbound to an SNS topic or outbound to an SNS subscription, the AWS accounts or specific AWS Identity and Access Management (IAM) principals the statement applies to, and the actions you want to take on the sensitive data found.

Now, message data protection provides three actions to help you protect your data. First, the audit operation reports on the amount of sensitive data found. Second, the deny operation helps prevent the publishing or delivery of payloads that contain sensitive data. Third, the de-identify operation can mask or redact the sensitive data detected. These no-code operations can help you adhere to a variety of compliance regulations, such as Health Insurance Portability and Accountability Act (HIPAA), Federal Risk and Authorization Management Program (FedRAMP), General Data Protection Regulation (GDPR), and Payment Card Industry Data Security Standard (PCI DSS).

This message data protection feature coexists with the message data encryption feature in SNS, both contributing to an enhanced security posture of your messaging workloads.

Managed and custom data identifiers

After you add a data protection policy to your SNS topic, message data protection uses pattern matching and machine learning models to scan your messages for sensitive data, then enforces the data protection policy in real time. The types of sensitive data are referred to as data identifiers. These data identifiers can be either managed by Amazon Web Services (AWS) or custom to your domain.

Managed data identifiers (MDI) are organized into five categories:

In a data protection policy statement, you refer to a managed data identifier using its Amazon Resource Name (ARN), as follows:

{
    "Name": "__example_data_protection_policy",
    "Description": "This policy protects sensitive data in expense reports",
    "Version": "2021-06-01",
    "Statement": [{
        "DataIdentifier": [
            "arn:aws:dataprotection::aws:data-identifier/CreditCardNumber"
        ],
        "..."
    }]
}

Custom data identifiers (CDI), on the other hand, enable you to define custom regular expressions in the data protection policy itself, then refer to them from policy statements. Using custom data identifiers, you can scan for business-specific sensitive data, which managed data identifiers can’t. For example, you can use a custom data identifier to look for company-specific employee IDs in SNS message payloads. Internally, SNS has guardrails to make sure custom data identifiers are safe and that they add only low single-digit millisecond latency to message processing.

In a data protection policy statement, you refer to a custom data identifier using only the name that you have given it, as follows:

{
    "Name": "__example_data_protection_policy",
    "Description": "This policy protects sensitive data in expense reports",
    "Version": "2021-06-01",
    "Configuration": {
        "CustomDataIdentifier": [{
            "Name": "MyCompanyEmployeeId", "Regex": "EID-\d{9}-US"
        }]
    },
    "Statement": [{
        "DataIdentifier": [
            "arn:aws:dataprotection::aws:data-identifier/CreditCardNumber",
            "MyCompanyEmployeeId"
        ],
        "..."
    }]
}

Note that custom data identifiers can be used in conjunction with managed data identifiers, as part of the same data protection policy statement. In the preceding example, both MyCompanyEmployeeId and CreditCardNumber are in scope.

For more information, see Data Identifiers, in the SNS Developer Guide.

Inbound and outbound data directions

In addition to the DataIdentifier property, each policy statement also sets the DataDirection property (whose value can be either Inbound or Outbound) as well as the Principal property (whose value can be any combination of AWS accounts, IAM users, and IAM roles).

When you use message data protection for data de-identification and set DataDirection to Inbound, instances of DataIdentifier published by the Principal are masked or redacted before the payload is ingested into the SNS topic. This means that every endpoint subscribed to the topic receives the same modified payload.

When you set DataDirection to Outbound, on the other hand, the payload is ingested into the SNS topic as-is. Then, instances of DataIdentifier are either masked, redacted, or kept as-is for each subscribing Principal in isolation. This means that each endpoint subscribed to the SNS topic might receive a different payload from the topic, with different sensitive data de-identified, according to the data access permissions of its Principal.

The following snippet expands the example data protection policy to include the DataDirection and Principal properties.

{
    "Name": "__example_data_protection_policy",
    "Description": "This policy protects sensitive data in expense reports",
    "Version": "2021-06-01",
    "Configuration": {
        "CustomDataIdentifier": [{
            "Name": "MyCompanyEmployeeId", "Regex": "EID-\d{9}-US"
        }]
    },
    "Statement": [{
        "DataIdentifier": [
            "MyCompanyEmployeeId",
            "arn:aws:dataprotection::aws:data-identifier/CreditCardNumber"
        ],
        "DataDirection": "Outbound",
        "Principal": [ "arn:aws:iam::123456789012:role/ReportingApplicationRole" ],
        "..."
    }]
}

In this example, ReportingApplicationRole is the authenticated IAM principal that called the SNS Subscribe API at subscription creation time. For more information, see How do I determine the IAM principals for my data protection policy? in the SNS Developer Guide.

Operations for data de-identification

To complete the policy statement, you need to set the Operation property, which informs the SNS topic of the action that it should take when it finds instances of DataIdentifer in the outbound payload.

The following snippet expands the data protection policy to include the Operation property, in this case using the Deidentify object, which in turn supports masking and redaction.

{
    "Name": "__example_data_protection_policy",
    "Description": "This policy protects sensitive data in expense reports",
    "Version": "2021-06-01",
    "Configuration": {
        "CustomDataIdentifier": [{
            "Name": "MyCompanyEmployeeId", "Regex": "EID-\d{9}-US"
        }]
    },
    "Statement": [{
        "Principal": [
            "arn:aws:iam::123456789012:role/ReportingApplicationRole"
        ],
        "DataDirection": "Outbound",
        "DataIdentifier": [
            "MyCompanyEmployeeId",
            "arn:aws:dataprotection::aws:data-identifier/CreditCardNumber"
        ],
        "Operation": { "Deidentify": { "MaskConfig": { "MaskWithCharacter": "#" } } }
    }]
}

In this example, the MaskConfig object instructs the SNS topic to mask instances of CreditCardNumber in Outbound messages to subscriptions created by ReportingApplicationRole, using the MaskWithCharacter value, which in this case is the hash symbol (#). Alternatively, you could have used the RedactConfig object instead, which would have instructed the SNS topic to simply cut the sensitive data off the payload.

The following snippet shows how the outbound payload is masked, in real time, by the SNS topic.

// original message published to the topic:
My credit card number is 4539894458086459

// masked message delivered to subscriptions created by ReportingApplicationRole:
My credit card number is ################

For more information, see Data Protection Policy Operations, in the SNS Developer Guide.
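
To attach a policy like the ones shown above to a topic from the command line, you can use the SNS data protection policy APIs. A minimal sketch follows, assuming the policy JSON is saved locally as policy.json and the topic ARN is stored in the TOPIC_ARN variable.

# Attach the data protection policy to the SNS topic
aws sns put-data-protection-policy \
    --resource-arn "$TOPIC_ARN" \
    --data-protection-policy file://policy.json

# Confirm the policy now attached to the topic
aws sns get-data-protection-policy --resource-arn "$TOPIC_ARN"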

Applying data de-identification in a use case

Consider a company where managers use an internal expense report management application where expense reports from employees can be reviewed and approved. Initially, this application depended only on an internal payment application, which in turn connected to an external payment gateway. However, this workload eventually became more complex, because the company started also paying expense reports filed by external contractors. At that point, the company built a mobile application that external contractors could use to view their approved expense reports. An important business requirement for this mobile application was that specific financial and PII data needed to be de-identified in the externally displayed expense reports. Specifically, both the credit card number used for the payment and the internal employee ID that approved the payment had to be masked.

Figure 1: Expense report processing application

To distribute the approved expense reports to both the payment application and the reporting application that backed the mobile application, the company used an SNS topic with a data protection policy. The policy has only one statement, which masks credit card numbers and employee IDs found in the payload. This statement applies only to the IAM role that the company used for subscribing the AWS Lambda function of the reporting application to the SNS topic. This access permission configuration enabled the Lambda function from the payment application to continue receiving the raw data from the SNS topic.

The data protection policy from the previous section addresses this use case. Thus, when a message representing an expense report is published to the SNS topic, the Lambda function in the payment application receives the message as-is, whereas the Lambda function in the reporting application receives the message with the financial and PII data masked.

Deploying the resources

You can apply a data protection policy to an SNS topic using the AWS Management Console, AWS Command Line Interface (AWS CLI), AWS SDK, or AWS CloudFormation.

To automate the provisioning of the resources and the data protection policy of the example expense management use case, we’re going to use CloudFormation templates. You have two options for deploying the resources:

Deploy using the individual CloudFormation templates in sequence

  1. Prerequisites template: This first template provisions two IAM roles with a managed policy that enables them to create SNS subscriptions and configure the subscriber Lambda functions. You will use these provisioned IAM roles in steps 3 and 4 that follow.
  2. Topic owner template: The second template provisions the SNS topic along with its access policy and data protection policy.
  3. Payment subscriber template: The third template provisions the Lambda function and the corresponding SNS subscription that comprise the Payment application stack. When prompted, select the PaymentApplicationRole in the Permissions panel before running the template. Moreover, the CloudFormation console will require you to acknowledge that a CloudFormation transform might require access capabilities.
  4. Reporting subscriber template: The final template provisions the Lambda function and the SNS subscription that comprise the Reporting application stack. When prompted, select the ReportingApplicationRole in the Permissions panel before running the template. Moreover, the CloudFormation console will once again require you to acknowledge that a CloudFormation transform might require access capabilities.
Figure 2: Select IAM role

Now that the application stacks have been deployed, you’re ready to start testing.

Testing the data de-identification operation

Use the following steps to test the example expense management use case.

  1. In the Amazon SNS console, select the ApprovalTopic, then choose to publish a message to it.
  2. In the SNS message body field, enter the following message payload, representing an external contractor expense report, then choose to publish this message:
    {
        "expense": {
            "currency": "USD",
            "amount": 175.99,
            "category": "Office Supplies",
            "status": "Approved",
            "created_at": "2023-10-17T20:03:44+0000",
            "updated_at": "2023-10-19T14:21:51+0000"
        },
        "payment": {
            "credit_card_network": "Visa",
            "credit_card_number": "4539894458086459"
        },
        "reviewer": {
            "employee_id": "EID-123456789-US",
            "employee_location": "Seattle, USA"
        },
        "contractor": {
            "employee_id": "CID-000012348-CA",
            "employee_location": "Vancouver, CAN"
        }
    }
    

  3. In the CloudWatch console, select the log group for the PaymentLambdaFunction, then choose to view its latest log stream. Now look for the log stream entry that shows the message payload received by the Lambda function. You will see that no data has been masked in this payload, as the payment application requires raw financial data to process the credit card transaction.
  4. Still in the CloudWatch console, select the log group for the ReportingLambdaFunction, then choose to view its latest log stream. Now look for the log stream entry that shows the message payload received by this Lambda function. You will see that the values for properties credit_card_number and employee_id have been masked, protecting the financial data from leaking into the external reporting application.
    {
        "expense": {
            "currency": "USD",
            "amount": 175.99,
            "category": "Office Supplies",
            "status": "Approved",
            "created_at": "2023-10-17T20:03:44+0000",
            "updated_at": "2023-10-19T14:21:51+0000"
        },
        "payment": {
            "credit_card_network": "Visa",
            "credit_card_number": "################"
        },
        "reviewer": {
            "employee_id": "################",
            "employee_location": "Seattle, USA"
        },
        "contractor": {
            "employee_id": "CID-000012348-CA",
            "employee_location": "Vancouver, CAN"
        }
    }
    

As shown, different subscribers received different versions of the message payload, according to their sensitive data access permissions.

Cleaning up the resources

After testing, avoid incurring usage charges by deleting the resources that you created. Open the CloudFormation console and delete the four CloudFormation stacks that you created during the walkthrough.

Conclusion

This post showed how you can use Amazon SNS message data protection to discover and protect sensitive data published to or delivered from your SNS topics. The example use case shows how to create a data protection policy that masks messages delivered to specific subscribers if the payloads contain financial or personally identifiable information.

For more details, see message data protection in the SNS Developer Guide. For information on costs, see SNS pricing.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on AWS re:Post or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Otavio Ferreira

Otavio is the GM for Amazon SNS, and has been leading the service since 2016, responsible for software engineering, product management, technical program management, and technical operations. Otavio has spoken at AWS conferences—AWS re:Invent and AWS Summit—and written a number of articles for the AWS Compute and AWS Security blogs.

SmugMug’s durable search pipelines for Amazon OpenSearch Service

Post Syndicated from Lee Shepherd original https://aws.amazon.com/blogs/big-data/smugmugs-durable-search-pipelines-for-amazon-opensearch-service/

SmugMug operates two very large online photo platforms, SmugMug and Flickr, enabling more than 100 million customers to safely store, search, share, and sell tens of billions of photos. Customers uploading and searching through decades of photos helped turn search into critical infrastructure, growing steadily since SmugMug first used Amazon CloudSearch in 2012, followed by Amazon OpenSearch Service since 2018, after reaching billions of documents and terabytes of search storage.

Here, Lee Shepherd, SmugMug Staff Engineer, shares SmugMug’s search architecture used to publish, backfill, and mirror live traffic to multiple clusters. SmugMug uses these pipelines to benchmark, validate, and migrate to new configurations, including Graviton based r6gd.2xlarge instances from i3.2xlarge, along with testing Amazon OpenSearch Serverless. We cover three pipelines used for publishing, backfilling, and querying without introducing spiky unrealistic traffic patterns, and without any impact on production services.

There are two main architectural pieces critical to the process:

  • A durable source of truth for index data. It’s best practice and part of our backup strategy to have a durable store beyond the OpenSearch index, and Amazon DynamoDB provides scalability and integration with AWS Lambda that simplifies a lot of the process. We use DynamoDB for other non-search services, so this was a natural fit.
  • A Lambda function for publishing data from the source of truth into OpenSearch. Using function aliases helps run multiple configurations of the same Lambda function at the same time and is key to keeping data in sync.

Publishing

The publishing pipeline is driven from events like a user entering keywords or captions, new uploads, or label detection through Amazon Rekognition. These events are processed, combining data from a few other asset stores like Amazon Aurora MySQL Compatible Edition and Amazon Simple Storage Service (Amazon S3), before writing a single item into DynamoDB.

Writing to DynamoDB invokes a Lambda publishing function, through the DynamoDB Streams Kinesis Adapter, that takes a batch of updated items from DynamoDB and indexes them into OpenSearch. There are other benefits to using the DynamoDB Streams Kinesis Adapter such as reducing the number of concurrent Lambdas required.

The publishing Lambda function uses environment variables to determine which OpenSearch domain and index to publish to. A production alias is configured to write to the production OpenSearch domain, triggered off the DynamoDB table or Kinesis stream.

When testing new configurations or migrating, a migration alias is configured to write to the new OpenSearch domain but use the same trigger as the production alias. This enables dual indexing of data to both OpenSearch Service domains simultaneously.

Here’s an example of the DynamoDB table schema:

 "Id": 123456,  // partition key
 "Fields": {
  "format": "JPG",
  "height": 1024,
  "width": 1536,
  ...
 },
 "LastUpdated": 1600107934,

The ‘LastUpdated’ value is used as the document version when indexing, allowing OpenSearch to reject any out-of-order updates.
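
At the OpenSearch API level, this maps to external versioning on the index request. The following sketch is illustrative only: the domain endpoint, index, and document are placeholders, and request authentication is omitted for brevity.

# Index a document using LastUpdated as an external version;
# OpenSearch rejects writes whose version is not higher than the stored one
curl -s -X PUT \
  "https://search-photos-example.us-west-2.es.amazonaws.com/photos/_doc/123456?version=1600107934&version_type=external" \
  -H 'Content-Type: application/json' \
  -d '{"format": "JPG", "height": 1024, "width": 1536}'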

Backfilling

Now that changes are being published to both domains, the new domain (index) needs to be backfilled with historical data. To backfill a newly created index, a combination of Amazon Simple Queue Service (Amazon SQS) and DynamoDB is used. A script populates an SQS queue with messages that contain instructions for parallel scanning a segment of the DynamoDB table.

The SQS queue launches a Lambda function that reads the message instructions, fetches a batch of items from the corresponding segment of the DynamoDB table, and writes them into an OpenSearch index. New messages are written to the SQS queue to keep track of progress through the segment. After the segment completes, no more messages are written to the SQS queue and the process stops itself.

Concurrency is determined by the number of segments, with additional controls provided by Lambda concurrency scaling. SmugMug is able to index more than 1 billion documents per hour on their OpenSearch configuration while incurring zero impact to the production domain.
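
Each worker's read corresponds to a parallel scan of its assigned table segment. This is roughly what that call looks like with the AWS CLI; the table name, segment numbers, and page size are placeholders, and the actual pipeline issues the equivalent call through the AWS SDK inside the Lambda function.

# Scan one segment of the source table in parallel-scan mode
aws dynamodb scan \
    --table-name photo-search-source \
    --segment 3 \
    --total-segments 64 \
    --max-items 500 \
    --output json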

A NodeJS AWS-SDK based script is used to seed the SQS queue. Here’s a snippet of the SQS configuration script’s options:

Usage: queue_segments [options]

Options:
--search-endpoint <url>  OpenSearch endpoint url
--sqs-url <url>          SQS queue url
--index <string>         OpenSearch index name
--table <string>         DynamoDB table name
--key-name <string>      DynamoDB table partition key name
--segments <int>         Number of parallel segments

Along with the format of the resulting SQS message:

{
  searchEndpoint: opts.searchEndpoint,
  sqsUrl: opts.sqsUrl,
  table: opts.table,
  keyName: opts.keyName,
  index: opts.index,
  segment: i,
  totalSegments: opts.segments,
  exclusiveStartKey: <lastEvaluatedKey from previous iteration>
}

As each segment is processed, the ‘lastEvaluatedKey’ from the previous iteration is added to the message as the ‘exclusiveStartKey’ for the next iteration.

Mirroring

Last, we mirror live search traffic by sending each OpenSearch query to an SQS queue, in addition to our production domain. The SQS queue launches a Lambda function that replays the query to the replica domain. The search results from these requests are not sent to any user, but they allow replicating production load on the OpenSearch domain under test without impact to production systems or customers.

Conclusion

When evaluating a new OpenSearch domain or configuration, the main metrics we are interested in are query latency performance, namely the took latencies (the time OpenSearch reports for executing each query), and most importantly the latencies for searching. In our move to Graviton R6gd, we saw about 40 percent lower P50-P99 latencies, along with similar gains in CPU usage compared to i3 instances (not counting Graviton's lower costs). Another welcome benefit was the more predictable and monitorable JVM memory pressure, thanks to the garbage collection changes from the addition of G1GC on R6gd and other new instances.

Using this pipeline, we’re also testing OpenSearch Serverless and finding its best use-cases. We’re excited about that service and fully intend to have an entirely serverless architecture in time. Stay tuned for results.


About the Authors

Lee Shepherd is a SmugMug Staff Software Engineer

Aydn Bekirov is an Amazon Web Services Principal Technical Account Manager

Serverless ICYMI Q3 2023

Post Syndicated from Benjamin Smith original https://aws.amazon.com/blogs/compute/serverless-icymi-q3-2023/

Welcome to the 23rd edition of the AWS Serverless ICYMI (in case you missed it) quarterly recap. Every quarter, we share all the most recent product launches, feature enhancements, blog posts, webinars, live streams, and other interesting things that you might have missed!

In case you missed our last ICYMI, check out what happened last quarter here.

AWS announces the general availability of Amazon Bedrock

Amazon Web Services (AWS) unveils five generative artificial intelligence (AI) innovations to democratize generative AI applications. Amazon Bedrock, now generally available, enables experimentation with top foundation models (FMs) and allows customization with proprietary data.

It supports creating managed agents for complex tasks without code and ensures security and privacy. Amazon Titan Embeddings, another FM, is generally available for various language-related use cases. Meta’s Llama 2, coming soon, enhances dialogue scenarios.

The upcoming Amazon CodeWhisperer customization capability enables secure customization using private code bases. Generative BI authoring capabilities in Amazon QuickSight simplify visualization creation for business analysts.

AWS Lambda

AWS Lambda now detects and stops recursive loops in Lambda functions. AWS Lambda now detects and halts functions caught in recursive or infinite loops, guarding against unexpected costs. Lambda identifies recursive behavior, discontinuing requests after 16 invocations. The feature addresses pitfalls stemming from misconfiguration or coding bugs, introducing detailed error messaging, and allowing users to set maximum limits on retry intervals. Notifications about recursive occurrences are relayed through the AWS Health Dashboard, emails, and CloudWatch Alarms for streamlined troubleshooting. Lambda uses AWS X-Ray trace headers for invocation tracking, requiring supported AWS SDK versions.

AWS simplifies writing .NET 6 Lambda functions with the Lambda Annotations Framework for .NET. This new programming model makes the experience of writing Lambda functions in C# feel more natural for .NET developers by using C# source generator technology. It streamlines the development workflow for .NET developers, making it easier to create serverless applications using the latest version of the .NET framework.

AWS Lambda and Amazon EventBridge Pipes now support enhanced filtering. Additional filtering capabilities include the ability to match against characters at the end of a value (suffix filtering), ignore case sensitivity (equals-ignore-case), and have a single rule match if any conditions across multiple separate fields are true (OR matching).

AWS Lambda Functions powered by AWS Graviton2 are now available in 6 additional Regions. Graviton2 processors are known for their performance benefits, and this expansion provides users with more choices for running serverless workloads.

AWS Lambda adds support for Python 3.11 allowing developers to take advantage of the latest features and improvements in the Python programming language for their serverless functions.

AWS Step Functions

AWS Step Functions enhances Workflow Studio, focusing on an Advanced Starter Template and Code Mode for efficient AWS Step Functions workflow creation. Users benefit from streamlined design-to-code transitions, pasting Amazon States Language (ASL) definitions directly into Workflow Studio, speeding up adjustments. Enhanced workflow execution and configuration allow direct execution and setting adjustments within Workflow Studio, improving user experience.

AWS Step Functions launches enhanced error handling. This update helps users identify errors with precision and refine retry strategies. Step Functions now enables detailed error messages in Fail states and precise control over retry intervals. Use the new maximum limits and jitter functionality to ensure efficient and controlled retries, preventing service overload in recovery scenarios.

AWS Step Functions distributed map is now available in the AWS GovCloud (US) Regions. This release highlights the availability of the distributed map feature in Step Functions specifically tailored for the AWS GovCloud (US) Regions. The distributed map feature is a powerful capability for orchestrating parallel and distributed processing in serverless workflows.

AWS SAM

AWS SAM CLI announces local testing and debugging support on Terraform projects.

Developers can now use AWS SAM CLI to locally test and debug AWS Lambda functions and Amazon API Gateway defined in their Terraform projects. AWS SAM CLI reads infrastructure resource information from the Terraform application, allowing users to start Lambda functions and API Gateway endpoints locally in a Docker container.

This update enables faster development cycles for Terraform users, who can use AWS SAM CLI commands like `sam local start-api`, `sam local start-lambda`, and `sam local invoke`, along with `sam local generate` for generating mock test events.

Amazon EventBridge

Amazon EventBridge Scheduler adds schedule deletion after completion. This feature offers enhanced functionality by supporting the automatic deletion of schedules upon completion of their last invocation. It is applicable to various scheduling types, including one-time, cron, and rate schedules with an end date. Amazon EventBridge Scheduler, a centralized and highly scalable service, enables the creation, execution, and management of schedules.

EventBridge Scheduler can schedule millions of tasks invoking over 270 AWS services and 6,000 API operations. This update streamlines the process of managing completed schedules: the automatic deletion feature reduces the need for manual intervention or custom code, saving time and simplifying scalability for users leveraging EventBridge Scheduler.

Amazon EventBridge Pipes now available in three additional Regions. This update extends the availability of Amazon EventBridge Pipes, a powerful event-routing service, to three additional Regions.

Amazon EventBridge API Destinations is now available in additional Regions, providing users with more options for building scalable and decoupled applications.

Amazon EventBridge Schema Registry and Schema Discovery now in additional Regions. This expansion allows you to discover and store event structure – or schema – in a shared, central location. You can download code bindings for those schemas for Java, Python, TypeScript, and Golang so it’s easier to use events as objects in your code.

Amazon SNS

To enhance message privacy and security, Amazon Simple Notification Service (SNS) implemented Message Data Protection, allowing users to de-identify outbound messages via redaction or masking. Amazon SNS FIFO topics now support message delivery to Amazon SQS Standard queues. This provides users with increased flexibility in managing message delivery and ordering.

Expanding its monitoring capabilities, Amazon SNS introduced Additional Usage Metrics in Amazon CloudWatch. This enhancement allows users to gain more comprehensive insights into the performance and utilization of their SNS resources. SNS extended its global SMS sending capabilities to Israel (Tel Aviv), providing users in that Region with additional options for SMS notifications. SNS also expanded its reach by supporting Mobile Push Notifications in twelve new AWS Regions. This expansion aligns with the growing demand for mobile notification capabilities, offering a broader coverage for users across diverse Regions.

Amazon SQS

Amazon Simple Queue Service (SQS) introduced a number of updates. Attribute-Based Access Control (ABAC) was implemented for scalable access permissions, while message data protection can now de-identify outbound messages via redaction or masking. Amazon SNS FIFO topics now support message delivery to SQS Standard queues, providing enhanced flexibility. Addressing throughput demands, SQS increased the quota for FIFO High Throughput mode. JSON protocol support was previewed, offering improved message format flexibility. These updates underscore SQS’s commitment to advanced security and flexibility.

Amazon API Gateway

Amazon API Gateway undergoes a console refresh, aligning with Cloudscape Design System guidelines. Notable enhancements include improved usability, sortable tables, enhanced API key management, and direct API deployment from the Resource view. The update introduces dark mode, accessibility improvements, and visual alignment with HTTP APIs and AWS Services.

GOTO EDA day Nashville 2023

Join GOTO EDA Day in Nashville on October 26 for insights on event-driven architectures. Learn from industry leaders at Music City Center with talks, panels, and Hands-On Labs. Limited tickets available.

Serverless blog posts

July 2023

July 5- Implementing AWS Lambda error handling patterns

July 6 – Implementing AWS Lambda error handling patterns

July 7 – Understanding AWS Lambda’s invoke throttling limits

July 10 – Detecting and stopping recursive loops in AWS Lambda functions

July 11 – Implementing patterns that exit early out of a parallel state in AWS Step Functions

July 26 – Migrating AWS Lambda functions from the Go1.x runtime to the custom runtime on Amazon Linux 2

July 27 – Python 3.11 runtime now available in AWS Lambda

August 2023

August 2 – Automatically delete schedules upon completion with Amazon EventBridge Scheduler

August 7 – Using response streaming with AWS Lambda Web Adapter to optimize performance

August 15 – Integrating IBM MQ with Amazon SQS and Amazon SNS using Apache Camel

August 15 – Implementing the transactional outbox pattern with Amazon EventBridge Pipes

August 23 – Protecting an AWS Lambda function URL with Amazon CloudFront and Lambda@Edge

August 29 – Enhancing file sharing using Amazon S3 and AWS Step Functions

August 31 – Enhancing Workflow Studio with new features for streamlined authoring

September 2023

September 5 – AWS SAM support for HashiCorp Terraform now generally available

September 14 – Building a secure webhook forwarder using an AWS Lambda extension and Tailscale

September 18 – Building resilient serverless applications using chaos engineering

September 19 – Implementing idempotent AWS Lambda functions with Powertools for AWS Lambda (TypeScript)

September 19 – Centralizing management of AWS Lambda layers across multiple AWS Accounts

September 26 – Architecting for scale with Amazon API Gateway private integrations

September 26 – Visually design your application with AWS Application Composer

Videos

Serverless Office Hours – Tues 10AM PT

July 2023

July 4 – Benchmarking Lambda cold starts

July 11 – Lambda testing: AWS SAM remote invoke

July 18 – Using DynamoDB global tables

July 25 – Serverless observability with SLIC-watch

August 2023

August 1 – Step Functions versions and aliases

August 8 – Deploying Lambda with EKS and Crossplane / Managing Lambda with Kubernetes

August 15 – Serverless caching with Momento

September 2023

September 5 – Run any web app on Lambda

September 12 – Building an API platform on AWS

September 19 – Idempotency: exactly once processing

September 26 – AWS Amplify Studio + GraphQL

FooBar Serverless YouTube channel

July 2023

July 27 – Generative AI and Serverless to create a new story everyday

August 2023

August 3 – Getting started with Data Streaming

August 10 – Amazon Kinesis Data Streams – Shards? Provisioned? On-demand? What does all this mean?

August 17 – Put and consume events with AWS Lambda, Amazon Kinesis Data Stream and Event Source Mapping

August 24 – Create powerful data pipelines with Amazon Kinesis and EventBridge Pipes

August 31 – New Step Functions versions and alias!

September 2023

September 7 – Amazon Kinesis Data Firehose – What is this service for?

September 14 – Kinesis Data Firehose with AWS CDK – Lambda transformations

September 21 – Advanced Event Source Mapping configuration | AWS Lambda and Amazon Kinesis Data Streams

September 28 – Data Streaming Patterns

Still looking for more?

The Serverless landing page has more information. The Lambda resources page contains case studies, webinars, whitepapers, customer stories, reference architectures, and even more Getting Started tutorials.

You can also follow the Serverless Developer Advocacy team on Twitter to see the latest news, follow conversations, and interact with the team.

Filtering events in Amazon EventBridge with wildcard pattern matching

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/filtering-events-in-amazon-eventbridge-with-wildcard-pattern-matching/

This post is written by Rajdeep Banerjee, Sr PSA, and Brian Krygsman, Sr. Solutions Architect.

Amazon EventBridge recently announced support for wildcard filters in rule event patterns. An EventBridge event bus is a serverless event router that helps you decouple your event-driven systems. You can route events between your systems, AWS services, or third-party SaaS services. You attach a rule to your event bus to define logic for routing events from producers to consumers.

You set an event pattern on the rule to filter incoming events to specific consumers. The new wildcard filter lets you build more flexible event matching patterns to reduce rule management and optimize your event consumers. This shows how these EventBridge attributes work together.

How EventBridge features work together

Wildcard filters use the wildcard character (*) to match zero, single, or multiple characters in a string value. For example, a filter string like "*.png"  matches strings that end with ".png".

You can also use multiple wildcard characters in a filter. For example, a filter string like "*Title*" matches string values that include "Title" in the middle. When using wildcard filters, be careful to avoid matching more events than you intend.

This blog post describes how you can use wildcard filters in example scenarios. For more information about event-driven architectures, visit Serverless Land.

Wildcard pattern matching in S3 Event Notifications

Applications must often perform an action when new data is available. One example is processing trading data uploaded to your Amazon S3 bucket. The data may be stored in individual folders depending on the date, time, and stock symbol. Business rules may dictate that when a file arrives for stock XYZ, a notification must be sent to a downstream system.

This is the typical folder structure in an S3 bucket:

s3 folder structure

S3 can send an event to EventBridge when an object is written to a bucket. The S3 event includes the object key (for example, 2023-10-01/T13:22:22Z/XYZ/filename.ext). When any object is uploaded to the XYZ folder, you can use an EventBridge rule to send these events to a downstream service like an Amazon SQS queue.

Before this launch, you would first send the event to an AWS Lambda function. Existing prefix and suffix filters alone are insufficient because of the extra date and time folders. The function would run your code to inspect the object path for the stock symbol. Your code would then forward events to SQS when they matched.

With the new wildcard patterns in EventBridge rules, the logic is simpler. You no longer need to create a Lambda function to run custom matching code. You can instead use wildcard characters in the rule’s filter pattern, matching against portions of the S3 object key.

  1. To use this, start with creating a new rule in the EventBridge console:
    Define rule detail
  2. Choose Next. Keep the standard parameters and move to the Event pattern section. Here you can use a JSON-based event pattern.
    {
      "source": ["aws.s3"],
      "detail": {
        "bucket": {
          "name": ["intraday-trading-data"]
        },
        "object": {
          "key": [{
            "wildcard": "*/XYZ/*"
          }]
        }
      }
    }
    
  3. This pattern looks for Event Notifications from a specific bucket. The pattern then filters the events further by the object keys that match "*/XYZ/*". The rule filters out notifications from other stock symbols, listening only to "XYZ" data, irrespective of the date and time of the data feed. (An AWS CLI equivalent for creating this rule is shown after these steps.)
  4. To use an SQS queue for the filtered event target, you must provide resource-based policies for EventBridge to send messages to the queue.
    Select target(s)
  5. Choose Next and review the rule details before saving.
  6. Before testing, enable S3 event notifications to EventBridge in the S3 console:
    Enable S3 event notifications to EventBridge in the S3 console
  7. To test the new wildcard pattern, upload any sample CSV file in the XYZ folder to launch the Event Notifications.
    Upload CSV
  8. You can monitor EventBridge CloudWatch metrics to check if the rule is invoked from the S3 upload. The SQS CloudWatch metrics show if messages are received from the EventBridge rule.
    CloudWatch metrics
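
The console steps above can also be scripted. The following AWS CLI sketch creates the same rule and target and turns on S3 Event Notifications to EventBridge; the rule name, queue ARN, and account ID are illustrative, and the queue still needs the resource-based policy described in step 4.

# Create the rule with the wildcard event pattern on the default event bus
aws events put-rule \
    --name intraday-xyz-uploads \
    --event-pattern '{
      "source": ["aws.s3"],
      "detail": {
        "bucket": { "name": ["intraday-trading-data"] },
        "object": { "key": [{ "wildcard": "*/XYZ/*" }] }
      }
    }'

# Route matching events to the SQS queue
aws events put-targets \
    --rule intraday-xyz-uploads \
    --targets '[{"Id": "xyz-queue", "Arn": "arn:aws:sqs:us-east-1:123456789012:xyz-notifications"}]'

# Enable S3 Event Notifications to EventBridge for the bucket
aws s3api put-bucket-notification-configuration \
    --bucket intraday-trading-data \
    --notification-configuration '{"EventBridgeConfiguration": {}}'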

Filtering based on Amazon Resource Name (ARN)

Customers often need to perform actions when AWS Identity and Access Management (IAM) policies are added to specific roles. You can achieve this by creating custom EventBridge rules, which filter the event to match or create multiple rules to achieve the same effect. With the newly introduced wildcard filter, the task to invoke an action is simplified.

Consider an IAM role with fine-grained IAM policies attached. You may need to ensure that any new policy attached to this role comes from a specific set of ARNs. This action can be implemented as follows.

When you attach a new IAM policy to a role, it generates an event like this:

{
    "version": "0",
    "id": "0b85984e-ec53-84ba-140e-9e0cff7f05b4",
    "detail-type": "AWS API Call via CloudTrail",
    "source": "aws.iam",
    "account": "123456789012",
    "time": "2023-10-07T20:23:28Z",
    "region": "us-east-1",
    "resources": [],
    "detail": {
        "eventVersion": "1.08",
        "userIdentity": {
            "arn": "arn:aws:sts::123456789012:assumed-role/Admin/UserName",
            // ... additional detail fields
        },
        "eventTime": "2023-10-07T20:23:28Z",
        "eventSource": "iam.amazonaws.com",
        "eventName": "AttachRolePolicy",
        // ... additional detail fields

    }
}

You can create a rule matching against a combination of these event properties. You can filter detail.userIdentity.arn with a wildcard to catch events that come from a particular ARN. You can then route these events to a target like an Amazon CloudWatch Logs stream to record the change. You can also route them to Amazon Simple Notification Service (SNS). You can use the SNS notification to start a review and ensure that the newly attached policies are well-crafted as part of your reconciliation and audit process. The filter looks like this:

{
  "source": ["aws.iam"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventSource": ["iam.amazonaws.com"],
    "eventName": ["AttachRolePolicy"],
    "userIdentity": {
      "arn": [{
        "wildcard": "arn:aws:sts::123456789012:assumed-role/*/*"
      }]
    }
  }
}

Filtering custom events

You can use EventBridge to build your own event-driven systems with loosely coupled, scalable application services. When building event-driven applications in AWS, you can publish events to the default event bus, or create a custom event bus. You define the structure of events emitted from your services.

This structure is known as the event schema. When you attach rules to your bus to route events from producers to consumers, you match against values from properties in your event schema. Wildcard filters allow you to match property values that are unknown ahead of time, or across multiple value variants.

Consider an ecommerce application as an example. You may have several decoupled services working together, like a shopping cart service, an inventory service, and others. Each of these services emits events onto your event bus as your customers shop.

Events may include errors, to record problems customers encounter using your system. You can use a single rule with a wildcard filter to match all error events and send them to a common target. This allows you to simplify observability across your services.

This is the event flow:

Event flow

Your shopping cart service may emit a timeout error event:

{
  "version": "0",
  "id": "24a4b957-570d-590b-c213-2a72e5dc4c66",
  "detail-type": "shopping.cart.error.timeout",
  "source": "com.mybusiness.shopping.cart",
  "account": "123456789012",
  "time": "2023-10-06T03:28:44Z",
  "region": "us-west-2",
  "resources": [],
  "detail": {
    "message": "Operation timed out.",
    "related-entity": {
      "entity-type": "order",
      "id": "123"
    },
    // ... additional detail fields
  }
}

The detail-type property of the example event determines what type of event this is. Other services may emit error events with different prefixes in detail-type. Other error types might have different suffixes in detail-type.

For example, an inventory service may emit an out-of-stock error event like this:

{
  "version": "0",
  "id": "e456f480-cc1e-47fa-8399-ab2e54116958",
  "detail-type": "shopping.inventory.error.outofstock",
  "source": "com.mybusiness.shopping.inventory",
  "account": "123456789012",
  "time": "2023-10-06T03:28:44Z",
  "region": "us-west-2",
  "resources": [],
  "detail": {
    "message": "Product cannot be added to a cart. Out of stock.",
    "related-entity": {
      "entity-type": "product",
      "id": "456"
    }
    // ... additional detail fields
  }
}

To route these events to a common target like an Amazon CloudWatch Logs stream, you can create a rule with a wildcard filter matching against detail-type. You can combine this with a prefix filter on source that filters events down to only services from your shopping system. The filter looks like this:

{
  "source": [{
    "prefix": "com.mybusiness.shopping."
  }],
  "detail-type": [{
    "wildcard": "*.error.*"
  }]
}

Without a wildcard filter you would need to create a more complex matching pattern, possibly across multiple rules.

Conclusion

Wildcard filters in EventBridge rules help simplify your event-driven applications by ensuring that the correct events are passed on to your targets. The feature reduces the need for custom filtering code that was previously required. Try EventBridge rules with wildcard filters and experience the benefits of this new feature in your event-driven serverless applications.

For more serverless learning resources, visit Serverless Land.

Use SAML with Amazon Cognito to support a multi-tenant application with a single user pool

Post Syndicated from Neela Kulkarni original https://aws.amazon.com/blogs/security/use-saml-with-amazon-cognito-to-support-a-multi-tenant-application-with-a-single-user-pool/

Amazon Cognito is a customer identity and access management solution that scales to millions of users. With Cognito, you have four ways to secure multi-tenant applications: user pools, application clients, groups, or custom attributes. In an earlier blog post titled Role-based access control using Amazon Cognito and an external identity provider, you learned how to configure Cognito authentication and authorization with a single tenant. In this post, you will learn to configure Cognito with a single user pool for multiple tenants to securely access a business-to-business application by using SAML custom attributes. With custom-attribute–based multi-tenancy, you can store tenant identification data like tenantName as a custom attribute in a user’s profile and pass it to your application. You can then handle multi-tenancy logic in your application and backend services. With this approach, you can use a unified sign-up and sign-in experience for your users. To identify the user’s tenant, your application can use the tenantName custom attribute.

One Cognito user pool for multiple customers

Customers like the simplicity of using a single Cognito user pool for their multi-customer application. With this approach, your customers will use the same URL to access the application. You will set up each new customer by configuring SAML 2.0 integration with the customer’s external identity provider (IdP). Your customers can control access to your application by using an external identity store, such as Google Workspace, Okta, or Active Directory Federation Service (AD FS), in which they can create, manage, and revoke access for their users.

After SAML integration is configured, Cognito returns a JSON web token (JWT) to the frontend during the user authentication process. This JWT contains attributes your application can use for authorization and access control. The token contains claims about the identity of the authenticated user, such as name and email. You can use this identity information inside your application. You can also configure Cognito to add custom attributes to the JWT, such as tenantName.
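
For illustration, the following Python sketch shows how application code could read the tenant claim from the ID token payload. It decodes without verifying the signature, which is acceptable only for illustration; in production, verify the token against the user pool's JWKS or rely on an API Gateway Cognito authorizer.

import base64
import json

def read_tenant_claim(id_token: str) -> str:
    """Decode the ID token payload and return custom:tenantName (no signature check)."""
    payload_b64 = id_token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore base64 padding
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return claims.get("custom:tenantName", "")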

In this post, we demonstrate the approach of keeping a mapping between a user’s email domain and tenant name in an Amazon DynamoDB table. The DynamoDB table will have an emailDomain field as a key and a corresponding tenantName field.
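
A minimal Python sketch of seeding this mapping table is shown below. The table name TenantTable appears later in this post; the idpId attribute name is an assumption and must match the SAML identity provider name you configure in the Cognito user pool.

import boto3

dynamodb = boto3.resource("dynamodb")
tenant_table = dynamodb.Table("TenantTable")

# One item per customer: email domain -> tenant metadata.
tenant_table.put_item(Item={
    "emailDomain": "anycompany.com",   # partition key
    "tenantName": "AnyCompany",
    "idpId": "AnyCompanyIdP",          # assumed attribute; the IdP name in Cognito
})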

Cognito architecture

To illustrate how this works, we’ll start with a demo application that was introduced in the earlier blog post. The demo application is implemented by using Amazon Cognito, AWS Amplify, Amazon API Gateway, AWS Lambda, Amazon DynamoDB, Amazon Simple Storage Service (Amazon S3), and Amazon CloudFront to achieve a serverless architecture. This architecture is shown in Figure 1.

Figure 1: Demo application architecture

The workflow that happens when you access the web application for the first time using your browser is as follows (the numbered steps correspond to the numbered labels in the diagram):

  1. The client-side/frontend of the application prompts you to enter the email that you want to use to sign in to the application.
  2. The application invokes the Tenant Match API action through API Gateway, which in turn calls a Lambda function that takes the email address as input and queries the DynamoDB table by email domain. Figure 2 shows the data stored in DynamoDB, which includes the tenant name and IdP ID. You can add additional flexibility to this solution by adding web client IDs or custom redirect URLs. For the purpose of this example, we use the same redirect URL for all tenants (the client application).
    Figure 2: DynamoDB tenant table

  3. If a matching record is found, the Lambda function returns the record to the AWS Amplify frontend application.
  4. The client application uses the IdP ID from the response and passes it to Cognito for federated login: the AWS Amplify frontend application redirects the browser to Cognito, which then routes the login request to the corresponding IdP.
  5. At the IdP sign-in page, you sign in with a valid user account (for example, [email protected] or [email protected]). After you sign in successfully, a SAML response is sent back from the IdP to Cognito.

    You can review the SAML content by using the instructions in How to view a SAML response in your browser for troubleshooting, as shown in Figure 3.

    Figure 3: SAML content

  6. Cognito handles the SAML response and maps the SAML attributes to a just-in-time user profile. The SAML groups attribute is mapped to a custom user pool attribute named custom:groups.
  7. To identify the tenant, additional attributes are populated in the JWT. After successful authentication, a PreTokenGeneration Lambda function is invoked, which reads the mapped custom:groups attribute value from SAML, parses it, and converts it to an array. The function then parses the email address, captures the domain name, and queries the DynamoDB table for the tenantName by using the email domain. Finally, the function sets the custom:domainName and custom:tenantName attributes in the JWT, as shown following.
    "email": "[email protected]" ( Standard existing profile attribute )
    New attributes:
    "cognito:groups": [.                           
    "pet-app-users",
    "pet-app-admin"
    ],
    "custom:tenantName": "AnyCompany"
    "custom:domainName": "anycompany.com"

    This attribute conversion is optional and demonstrates how you can use a PreTokenGeneration Lambda invocation to customize your JWT claims, mapping the IdP groups to the attributes your application recognizes. You can also use this invocation to make additional authorization decisions; for example, if a user is a member of multiple groups, you may choose to map only one of them. A Python sketch of this trigger logic follows this list.

  8. Amazon Cognito returns the JWT tokens to the AWS Amplify frontend application. The Amplify client library stores the tokens and handles refreshes. This token is used to make calls to protected APIs in Amazon API Gateway.
  9. API Gateway uses a Cognito user pools authorizer to validate the JWT’s signature and expiration. If this is successful, API Gateway passes the JWT to the application’s Lambda function (also referred to as the backend).
  10. The backend application code reads the cognito:groups claim from the JWT and decides if the action is allowed. If the user is a member of the right group, then the action is allowed; otherwise the action is denied.
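
The following is an illustrative Python sketch of the PreTokenGeneration logic from step 7. The deployed sample implements this in Node.js (index.js), and the group-parsing and table details here are simplified assumptions; it uses the V1 trigger event format.

import boto3

dynamodb = boto3.resource("dynamodb")
tenant_table = dynamodb.Table("TenantTable")  # emailDomain -> tenantName mapping

def lambda_handler(event, context):
    attrs = event["request"]["userAttributes"]

    # The SAML groups attribute is mapped to custom:groups by the user pool.
    # Its serialization depends on the IdP; splitting on commas is a simplification.
    raw_groups = attrs.get("custom:groups", "")
    groups = [g.strip(" []") for g in raw_groups.split(",") if g.strip(" []")]

    # Derive the tenant from the user's email domain.
    domain = attrs["email"].split("@")[-1].lower()
    item = tenant_table.get_item(Key={"emailDomain": domain}).get("Item", {})

    # Add tenant claims and override cognito:groups in the issued tokens.
    event["response"]["claimsOverrideDetails"] = {
        "claimsToAddOrOverride": {
            "custom:domainName": domain,
            "custom:tenantName": item.get("tenantName", ""),
        },
        "groupOverrideDetails": {"groupsToOverride": groups},
    }
    return event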

Implement the solution

You can implement this example application by using an AWS CloudFormation template to provision your cloud application and AWS resources.

To deploy the demo application described in this post, you need the following prerequisites:

  1. An AWS account.
  2. Familiarity with navigating the AWS Management Console or AWS CLI.
  3. Familiarity with deploying CloudFormation templates.

To deploy the template

  • Choose the following Launch Stack button to launch a CloudFormation stack in your account.


    Note: The stack will launch in the N. Virginia (us-east-1) Region. To deploy this solution into other AWS Regions, download the solution’s CloudFormation template from GitHub, modify it, and deploy it to the selected Region.

The stack creates a Cognito user pool called ExternalIdPDemoPoolXXXX in the AWS Region that you have specified. The CloudFormation Outputs field contains a list of values that you will need for further configuration.

IdP configuration

The next step is to configure your IdP. Each IdP has its own procedure for configuration, but there are some common steps you need to follow.

To configure your IdP

  1. Provide the IdP with the values for the following two properties:
    • Single sign on URL / Assertion Consumer Service URL / ACS URL (for this example, https://<CognitoDomainURL>/saml2/idpresponse)
    • Audience URI / SP Entity ID / Entity ID: (For this example, urn:amazon:cognito:sp:<yourUserPoolID>)
  2. Configure the field mapping for the SAML response in the IdP. Map the first name, last name, email, and groups (as a multi-value attribute) into SAML response attributes with the names firstName, lastName, email, and groups, respectively.
    • Recommended: Filter the mapped groups to only those that are relevant to the application (for example, by a prefix filter). There is a 2,048-character limit on the custom attribute, so filtering helps avoid exceeding the character limit, and also helps avoid passing irrelevant information to the application.
  3. In each IdP, create two demo groups called pet-app-users and pet-app-admins, and create two demo users, for example, [email protected] and [email protected], and then assign one to each group, respectively.

To illustrate, we set up three different IdPs to represent three different tenants. Use the following links for instructions on how to configure each IdP:

You will need the metadata URL or file from each IdP, because you will use this to configure your user pool integration. For more information, see Integrating third-party SAML identity providers with Amazon Cognito user pools.

Cognito configuration

After your IdPs are configured and your CloudFormation stack is deployed, you can configure Cognito.

To configure Cognito

  1. Use your browser to navigate to the Cognito console, and for User pool name, select the Cognito user pool.
    Figure 4: Select the Cognito user pool

  2. On the Sign-in experience screen, on the Federated identity provider sign-in tab, choose Add identity provider.
  3. Choose SAML for the sign-in option, and then enter the values for your IdP. You can either upload the metadata XML file or provide the metadata endpoint URL. Add mapping for the attributes as shown in Figure 5.
    Figure 5: Attribute mappings for the IdP

    Upon completion you will see the new IdP displayed as shown in Figure 6.

    Figure 6: List of federated IdPs

  4. On the App integration tab, select the app client that was created by the CloudFormation template.
    Figure 7: Select the app client

  5. Under Hosted UI, choose Edit. Under Identity providers, select the Identity Providers that you want to set up for federated login, and save the change.
    Figure 8: Select identity providers

API Gateway

The example application uses a serverless backend. Two API operations are defined in this example, as shown in Figure 9: one gets tenant details, and the other is the /pets API operation, which fetches information on pets based on user identity. The TenantMatch API operation runs when you sign in with your email address and passes the email address to the backend Lambda function.

Figure 9: Example APIs

Lambda functions

You will see three Lambda functions deployed in the example application, as shown in Figure 10.

Figure 10: Lambda functions

The first one is GetTenantInfo, which is used for the TenantMatch API operation. It reads the data from the TenantTable based on the email domain and passes the record back to the application. The second function is PreTokenGeneration, which reads the mapped custom:groups attribute value, parses it, converts it to an array, and then stores it in the cognito:groups claim. This function is invoked by the Cognito user pool after a successful sign-in. To customize the mapping, you can edit the Lambda function's code in the index.js file and redeploy. The third Lambda function supports the Pets API operation.

DynamoDB tables

You will see three DynamoDB tables deployed in the example application, as shown in Figure 11.

Figure 11: DynamoDB tables

The TenantTable table holds the tenant details, where you must add the mapping between the customer domain and the IdP ID set up in Cognito. This approach can be expanded to add more flexibility in case you want to add custom redirect URLs or Cognito app IDs for each tenant. You must create entries that correspond to the IdPs you have configured, as shown in Figure 12.

Figure 12: Tenant IdP mappings table

In addition to TenantTable, there is the ExternalIdPDemo-ItemsTable table, which holds the data related to the Pets application, based on user identity. There is also ExternalIdPDemo-UsersTable, which holds user details like the username, last forced sign-out time, and TTL required for the application to manage the user session.

You can now sign in to the example application through each IdP by navigating to the application URL found in the CloudFormation Outputs section, as shown in Figure 13.

Figure 13: Cognito sign-in screen

You will be redirected to the IdP, as shown in Figure 14.

Figure 14: Google Workspace sign-in screen

The AWS Amplify frontend application parses the JWT to identify the tenant name and provide authorization based on group membership, as shown in Figure 15.

Figure 15: Application home screen upon successful sign-in

If a different user logs in with a different role, the AWS Amplify frontend application provides authorization based on specific content of the JWT.

Conclusion

You can integrate your application with your customer’s IdP of choice for authentication and authorization and map information from the IdP to the application. By using Amazon Cognito, you can normalize the structure of the JWT that is used for this process, so that you can add multiple IdPs, each for a different tenant, through a single Cognito user pool. You can do all this without changing application code. The native integration of Amazon API Gateway with the Cognito user pools authorizer streamlines your validation of the JWT’s integrity, and after the JWT has been validated, you can use it to make authorization decisions in your application’s backend. By following the example in this post, you can focus on what differentiates your application, and let AWS do the undifferentiated heavy lifting of identity management for your customer-facing applications.

For the code examples described in this post, see the amazon-cognito-example-for-multi-tenant code repository on GitHub. To learn more about using Cognito with external IdPs, see the Amazon Cognito documentation. You can also learn to build software as a service (SaaS) application architectures on AWS. If you have any questions about Cognito or any other AWS services, you may post them to AWS re:Post.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Security, Identity, & Compliance re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Ray Zaman

A Principal Solutions Architect with AWS, Ray has over 30 years of experience helping customers in finance, healthcare, insurance, manufacturing, media, petrochemical, pharmaceutical, public utility, retail, semiconductor, telecommunications, and waste management industries build technology solutions.

Neela Kulkarni

Neela is a Solutions Architect with Amazon Web Services. She primarily serves independent software vendors in the Northeast US, providing architectural guidance and best practice recommendations for new and existing workloads. Outside of work, she enjoys traveling, swimming, and spending time with her family.

Yuri Duchovny

Yuri is a New York–based Solutions Architect specializing in cloud security, identity, and compliance. He supports cloud transformations at large enterprises, helping them make optimal technology and organizational decisions. Prior to his AWS role, Yuri’s areas of focus included application and networking security, DoS, and fraud protection. Outside of work, he enjoys skiing, sailing, and traveling the world.

Abdul Qadir

Abdul is an AWS Solutions Architect based in New Jersey. He works with independent software vendors in the Northeast US and helps customers build well-architected solutions on the AWS Cloud platform.

Building a serverless document chat with AWS Lambda and Amazon Bedrock

Post Syndicated from Pascal Vogel original https://aws.amazon.com/blogs/compute/building-a-serverless-document-chat-with-aws-lambda-and-amazon-bedrock/

This post is written by Pascal Vogel, Solutions Architect, and Martin Sakowski, Senior Solutions Architect.

Large language models (LLMs) are proving to be highly effective at solving general-purpose tasks such as text generation, analysis and summarization, translation, and much more. Because they are trained on large datasets, they can use a broad generalist knowledge base. However, as training takes place offline and uses publicly available data, their ability to access specialized, private, and up-to-date knowledge is limited.

One way to improve LLM knowledge in a specific domain is fine-tuning them on domain-specific datasets. However, this is time and resource intensive, requires specialized knowledge, and may not be appropriate for some tasks. For example, fine-tuning won’t allow an LLM to access information with daily accuracy.

To address these shortcomings, Retrieval Augmented Generation (RAG) is proving to be an effective approach. With RAG, data external to the LLM is used to augment prompts by adding relevant retrieved data to the context. This allows you to integrate disparate data sources and keep them completely separate from the machine learning model.

Tools such as LangChain or LlamaIndex are gaining popularity because of their ability to flexibly integrate with a variety of data sources such as (vector) databases, search engines, and current public data sources.

In the context of LLMs, semantic search is an effective search approach, as it considers the context and intent of user-provided prompts as opposed to a traditional literal search. Semantic search relies on word embeddings, which represent words, sentences, or documents as vectors. Consequently, documents must be transformed into embeddings using an embedding model as the basis for semantic search. Because this embedding process only needs to happen when a document is first ingested or updated, it’s a great fit for event-driven compute with AWS Lambda.

This blog post presents a solution that allows you to ask natural language questions of any PDF document you upload. It combines the text generation and analysis capabilities of an LLM with a vector search on the document content. The solution uses serverless services such as AWS Lambda to run LangChain and Amazon DynamoDB for conversational memory.

Amazon Bedrock is used to provide serverless access to foundational models such as Amazon Titan and models developed by leading AI startups, such as AI21 Labs, Anthropic, and Cohere. See the GitHub repository for a full list of available LLMs and deployment instructions.

You learn how the solution works, what design choices were made, and how you can use it as a blueprint to build your own custom serverless solutions based on LangChain that go beyond prompting individual documents. The solution code and deployment instructions are available on GitHub.

Solution overview

Let’s look at how the solution works at a high level before diving deeper into specific elements and the AWS services used in the following sections. The following diagram provides a simplified view of the solution architecture and highlights key elements:

The process of interacting with the web application looks like this:

  1. A user uploads a PDF document into an Amazon Simple Storage Service (Amazon S3) bucket through a static web application frontend.
  2. This upload triggers a metadata extraction and document embedding process. The process converts the text in the document into vectors. The vectors are loaded into a vector index and stored in S3 for later use.
  3. When a user chats with a PDF document and sends a prompt to the backend, a Lambda function retrieves the index from S3 and searches for information related to the prompt.
  4. An LLM then uses the results of this vector search, previous messages in the conversation, and its general-purpose capabilities to formulate a response to the user.

As shown in the following screenshot, the web application deployed as part of the solution allows you to upload documents and to list uploaded documents and their associated metadata, such as number of pages, file size, and upload date. The document status indicates whether a document is successfully uploaded, is being processed, or is ready for a conversation.

Web application document list view

By clicking on one of the processed documents, you can access a chat interface, which allows you to send prompts to the backend. It is possible to have multiple independent conversations with each document with separate message history.

Web application chat view

Embedding documents

Solution architecture diagram excerpt: embedding documents

When a new document is uploaded to the S3 bucket, an S3 event notification triggers a Lambda function that extracts metadata, such as file size and number of pages, from the PDF file and stores it in a DynamoDB table. Once the extraction is complete, a message containing the document location is placed on an Amazon Simple Queue Service (Amazon SQS) queue. Another Lambda function polls this queue using Lambda event source mapping. Applying this decoupled messaging pattern to the metadata extraction and document embedding functions ensures loose coupling and protects the more compute-intensive downstream embedding function.

The embedding function loads the PDF file from S3 and uses a text embedding model to generate a vector representation of the contained text. LangChain integrates with text embedding models for a variety of LLM providers. The resulting vector representation of the text is loaded into a FAISS index. FAISS is an open source vector store that can run inside the Lambda function memory using the faiss-cpu Python package. Finally, a dump of this FAISS index is stored in the S3 bucket alongside the original PDF document.
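
The core of such an embedding function might look like the following Python sketch, using LangChain's Bedrock and FAISS integrations. The SQS message format, S3 key naming, and the default Titan embedding model are assumptions; see the GitHub repository for the actual implementation.

import boto3
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import BedrockEmbeddings
from langchain.vectorstores import FAISS

s3 = boto3.client("s3")
embeddings = BedrockEmbeddings()  # defaults to an Amazon Titan embedding model

def lambda_handler(event, context):
    # One SQS record per uploaded document; the body carries "bucket/key" (simplified).
    for record in event["Records"]:
        bucket, key = record["body"].split("/", 1)

        # Load the PDF into LangChain documents and embed them into a FAISS index.
        s3.download_file(bucket, key, "/tmp/document.pdf")
        docs = PyPDFLoader("/tmp/document.pdf").load_and_split()
        index = FAISS.from_documents(docs, embeddings)

        # Persist the index dump next to the original PDF for the response function.
        index.save_local("/tmp/index")
        s3.upload_file("/tmp/index/index.faiss", bucket, f"{key}.faiss")
        s3.upload_file("/tmp/index/index.pkl", bucket, f"{key}.pkl")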

Generating responses

Solution architecture diagram excerpt: generating responses

When a prompt for a specific document is submitted via the Amazon API Gateway REST API endpoint, it is proxied to a Lambda function that:

  1. Loads the FAISS index dump of the corresponding PDF file from S3 into function memory.
  2. Performs a similarity search of the FAISS vector store based on the prompt.
  3. If available, retrieves a record of previous messages in the same conversation via the DynamoDBChatMessageHistory integration. This integration can store message history in DynamoDB. Each conversation is identified by a unique ID.
  4. Finally, a LangChain ConversationalRetrievalChain passes the combination of the prompt submitted by the user, the result of the vector search, and the message history to an LLM to generate a response (see the sketch following this list).
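
A condensed Python sketch of these four steps using LangChain is shown below. The bucket, table, and request field names are placeholders, the Claude model ID is one example of a Bedrock model, and the import paths reflect the pre-0.1 LangChain package layout used at the time of writing.

import os

import boto3
from langchain.chains import ConversationalRetrievalChain
from langchain.embeddings import BedrockEmbeddings
from langchain.llms import Bedrock
from langchain.memory import ConversationBufferMemory
from langchain.memory.chat_message_histories import DynamoDBChatMessageHistory
from langchain.vectorstores import FAISS

s3 = boto3.client("s3")
embeddings = BedrockEmbeddings()
llm = Bedrock(model_id="anthropic.claude-v2")  # placeholder model choice

def lambda_handler(event, context):
    document_key = event["document_key"]      # assumed request fields
    conversation_id = event["conversation_id"]
    prompt = event["prompt"]

    # 1. Load the FAISS index dump for this document from S3 into function memory.
    os.makedirs("/tmp/index", exist_ok=True)
    s3.download_file("my-bucket", f"{document_key}.faiss", "/tmp/index/index.faiss")
    s3.download_file("my-bucket", f"{document_key}.pkl", "/tmp/index/index.pkl")
    index = FAISS.load_local("/tmp/index", embeddings)

    # 2 and 3. Wire up the retriever and the per-conversation message history.
    history = DynamoDBChatMessageHistory(
        table_name="MemoryTable", session_id=conversation_id
    )
    memory = ConversationBufferMemory(
        chat_memory=history,
        memory_key="chat_history",
        return_messages=True,
        output_key="answer",
    )

    # 4. Combine the prompt, vector search results, and history into a response.
    chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=index.as_retriever(),
        memory=memory,
        return_source_documents=True,
    )
    return chain({"question": prompt})["answer"]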

Web application and file uploads

Solution architecture diagram excerpt: web application

A static web application serves as the frontend for this solution. It’s built with React, TypeScript, Vite, and TailwindCSS and deployed via AWS Amplify Hosting, a fully managed CI/CD and hosting service for fast, secure, and reliable static and server-side rendered applications. To protect the application from unauthorized access, it integrates with an Amazon Cognito user pool. The API Gateway uses an Amazon Cognito authorizer to authenticate requests.

Users upload PDF files directly to the S3 bucket using S3 presigned URLs obtained via the REST API. Several Lambda functions implement API endpoints used to create, read, and update document metadata in a DynamoDB table.
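
As a small illustration of the upload flow, a Lambda function behind the REST API could generate a presigned upload URL like this minimal sketch (bucket and key come from the request and are placeholders here):

import boto3

s3 = boto3.client("s3")

def create_upload_url(bucket: str, key: str, expires_in: int = 300) -> str:
    """Return a presigned URL the frontend can use to PUT a PDF directly to S3."""
    return s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key, "ContentType": "application/pdf"},
        ExpiresIn=expires_in,
    )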

Extending and adapting the solution

The solution provided serves as a blueprint that can be enhanced and extended to develop your own use cases based on LLMs. For example, you can extend the solution so that users can ask questions across multiple PDF documents or other types of data sources. LangChain makes it easy to load different types of data into vector stores, which you can then use for semantic search.

Once your use case involves searching across multiple documents, consider moving from loading vectors into memory with FAISS to a dedicated vector database. There are several options for vector databases on AWS. One serverless option is Amazon Aurora Serverless v2 with the pgvector extension for PostgreSQL. Alternatively, vector databases developed by AWS Partners such as Pinecone or MongoDB Atlas Vector Search can be integrated with LangChain. Besides vector search, LangChain also integrates with traditional external data sources, such as the enterprise search service Amazon Kendra, Amazon OpenSearch, and many other data sources.

The solution presented in this blog post uses similarity search to find information in the vector database that closely matches the user-supplied prompt. While this works well in the presented use case, you can also use other approaches, such as maximal marginal relevance, to find the most relevant information to provide to the LLM. When searching across many documents and receiving many results, techniques such as MapReduce can improve the quality of the LLM responses.

Depending on your use case, you may also want to select a different LLM to achieve an ideal balance between quality of results and cost. Amazon Bedrock is a fully managed service that makes foundational models (FMs) from leading AI startups and Amazon available via an API, so you can choose from a wide range of FMs to find the model that’s best suited for your use case. You can use models such as Amazon Titan, Jurassic-2 from AI21 Labs, or Anthropic Claude.

To further optimize the user experience of your generative AI application, consider streaming LLM responses to your frontend in real-time using Lambda response streaming and implementing real-time data updates using AWS AppSync subscriptions or Amazon API Gateway WebSocket APIs.

Conclusion

AWS serverless services make it easier to focus on building generative AI applications by providing automatic scaling, built-in high availability, and a pay-for-use billing model. Event-driven compute with AWS Lambda is a good fit for compute-intensive, on-demand tasks such as document embedding and flexible LLM orchestration.

The solution in this blog post combines the capabilities of LLMs and semantic search to answer natural language questions directed at PDF documents. It serves as a blueprint that can be extended and adapted to fit further generative AI use cases.

Deploy the solution by following the instructions in the associated GitHub repository.

For more serverless learning resources, visit Serverless Land.