Tag Archives: Customer Solutions

Powering global payout intelligence: How MassPay uses Amazon Redshift Serverless and zero-ETL to drive deeper analytics.

2025-06-02 Yossi Shlomo

Post Syndicated from Yossi Shlomo original https://aws.amazon.com/blogs/big-data/powering-global-payout-intelligence-how-masspay-uses-amazon-redshift-serverless-and-zero-etl-to-drive-deeper-analytics/

Since the company was founded in 2019, MassPay’s singular objective has been to deliver frictionless global payments that power innovation and lift people, businesses, and quality of life worldwide. Today, the MassPay payment orchestration offering empowers companies to move money across borders effortlessly; enabling local payment experiences in over 175 countries and 70 currencies—including digital wallets, locally preferred alternative payment methods, and cryptocurrencies. From hyper-localized checkout experiences to instant global payouts, we orchestrate seamless financial experiences that reflect how people and businesses transact around the world.

As we have expanded globally, so has the complexity of our data. In this blog post we shall cover how understanding real-time payout performance, identifying customer behavior patterns across regions, and optimizing internal operations required more than traditional business intelligence and analytics tools. And how since implementing Amazon Redshift and Zero-ETL, we’ve seen 90% reduction in data availability latency, payments data available for analytics 1.5x faster, leading to 45% reduction in time-to-insight and 37% fewer support tickets related to transaction visibility and payment inquiries.

Unlocking deeper payout intelligence and global insights

To continue our innovation—and to continue to exceed our partners’ and customers’ expectations—we knew we needed to go beyond basic reporting. We know success is dependent upon developing a truly data-driven organization. This means tracking granular KPIs across payout success rates, payment method adoption, transaction velocity, customer onboarding funnel drop-off, and support ticket correlation. We also wanted to better forecast customer payment expectations, monitor foreign exchange cost trends, and understand market-specific nuances such as how payout timing impacts seller satisfaction in social commerce ecosystems.

“We didn’t just want more data. We wanted faster, smarter insights that would shape decisions in real time. Being a data-driven organization means our teams don’t guess. They know. And that gives us, our partners, and our customers real operational and competitive advantages.”

– Yossi Schlomo, Director of Payment Systems Architecture

MySQL databases, CSV exports, and third-party reporting tools wouldn’t support the scale or speed we needed to deliver.

Choosing AWS: A scalable and integrated analytics foundation

We chose Amazon Web Services (AWS) for our data infrastructure and to accelerate our analytics capabilities.

At the core of our stack is Amazon Redshift Serverless with AI-driven scaling and optimizations enabled, which gives us scalable, fast, and cost-efficient analytics without the burden of managing infrastructure. Coupled with Amazon Aurora MySQL-Compatible Edition as our transactional data store and Amazon Redshift zero-ETL integration, we eliminated manual data pipelines altogether. Transactional data flows into Amazon Redshift in near real-time, instantly powering dashboards, alerts, and machine learning (ML) models.

This data feeds interactive dashboards—both internally and embedded within our platform for customers. Now, executives, operations leads, and customer success teams can drill into payout performance by region, merchant, or payment method, while customers get real-time visibility into their own payout analytics as part of our platform experience. The architecture is shown in the following figure.

MassPay Zero-ETL architecture with Amazon Redshift Serverless

Why it’s different and what it unlocked

Without Amazon Redshift Serverless and zero-ETL, we would have had to invest in costly custom data pipelines, maintain separate exchange, transform, and load (ETL) infrastructure, and manually manage data freshness. The integration with Aurora MySQL-Compatible is seamless and reduces our analytics latency from minutes to seconds.

“Our differentiator is simple: We operationalize not just transactions but analytics for global payments. Most platforms can tell you if a transaction went through. For payments and payouts, MassPay can tell you how fast it went, what it cost, what method was most effective, and what that means for your business in real time.”

– Yossi Schlomo, Director of Payment Systems Architecture

Embedded intelligence, built for scale

Every MassPay customer gets access to comprehensive payment analytics. These are accessed using our API or through a white-label dashboard (shown in the following figure). This detail is core to our product and central to our value proposition. As part of our go-to-market strategy, we showcase these capabilities in every demo, and they’ve proven to be key drivers in conversion and upsell conversations, especially with platforms targeting high-growth ecosystems.We use tiered pricing models based on transaction volume, and our embedded intelligence helps our partners and customers optimize usage and scale efficiently.

MassPay Dashboard

What we’ve gained

Since implementing Amazon Redshift and Zero-ETL, we’ve seen measurable results including:

90% reduction in data availability latency and data available for analytics 1.5x faster
45% reduction in time-to-insight across payment and payout intelligence reports
37% fewer support tickets related to transaction visibility and payment inquiries
Real-time Net Promoter Score (NPS) tracking correlates with payout success metrics, driving faster resolution paths

What’s next

We’re now extending our analytics model to include more advanced ML-based payout failure prediction and ML-based payment authorization prediction, FX optimization alerts, partner-level and network-level benchmarking, and much more.

Conclusion

MassPay isn’t just payments. We aren’t just payouts. We are the engine powering modern commerce. With AWS, we’re turning complex global payments infrastructure into a smart, transparent, and scalable platform for insights. For our partners, and for our customers, this means better decisions, faster payment processing, faster payouts, and truly global reach without guesswork.

We encourage you to leverage below resources to explore these features further

About the authors

Yossi Shlomo serves as the Director of Payment Systems Architecture at MassPay. Yossi is an expert in credit card payment systems, PCI compliance, and secure transaction architecture, helping global platforms process payments at scale with confidence. He specializes in building scalable, cloud-based transaction systems and optimizing global payment gateways for performance and reliability.

Milind Oke is a Amazon Redshift and SageMaker Lakehouse specialist Solutions Architect as AWS. He is based out of New York and has been building enterprise data platforms, data warehousing, and analytics solutions for customers across various domains over two decades. In the 5 years with AWS, Milind has been a speaker at worldwide technical conferences and is co-author of Amazon Redshift: The Definitive Guide: Jump-Start Analytics Using Cloud Data Warehousing 1st Edition.

How Launchpad from Pega enables secure SaaS extensibility with AWS Lambda

2025-05-30 Anton Aleksandrov

Post Syndicated from Anton Aleksandrov original https://aws.amazon.com/blogs/architecture/how-launchpad-from-pega-enables-secure-saas-extensibility-with-aws-lambda/

Large organizations increasingly adopt software as a service (SaaS) solutions to focus on business priorities, reduce infrastructure management overhead, and optimize costs. These organizations expect SaaS vendors to provide customizability facilities for tailoring the solution behavior according to their needs. Although traditional approaches like feature flags and webhooks offer some flexibility, they often fall short of providing a high degree of customizability. A new emerging pattern in this space is tenant-supplied custom code execution, which allows tenants to inject their own code into specific workflow points, enabling deep customization while preserving the core SaaS solutions’ integrity and security.

In this post, we share how Pegasystems (Pega) built Launchpad, its new SaaS development platform, to solve a core challenge in multi-tenant environments: enabling secure customer customization. By running tenant code in isolated environments with AWS Lambda, Launchpad offers its customers a secure, scalable foundation, eliminating the need for bespoke code customizations.

Solution overview

Launchpad, which is built on AWS, is an end-to-end platform on which software providers can build, launch, and operate workflow-centric B2B SaaS applications and AI solutions. It provides a managed, secure, scalable cloud environment for hosting multi-tenant applications and data. It accelerates the build experience with generative AI-powered low code tools, prebuilt capabilities, and subscriber-level configuration. Being a multi-tenant platform at its core, Launchpad had to maintain stringent isolation across tenants in its architecture.

One of the requirements Launchpad had was to allow their tenants to augment the workflows natively by providing custom code. Some common scenarios included communicating with external systems with proprietary non-industry-standard protocols, reuse of existing business logic, and SDK-based custom code development.

The solution necessitated the ability for tenants to provide custom code that would implement the required business logic, which Launchpad would be executing. This required architecting a secure runtime environment for custom code execution that maintains the highest degree of cross-tenant isolation within the multi-tenant architecture, at the same time allowing sufficient access to platform APIs and services. It was essential to build an architecture that would decouple the environment running tenant code from the core SaaS platform, as illustrated in the following diagram.

Architecting the solution topology

To achieve the required high level of compute isolation for running code provided by different tenants, Launchpad has adopted Lambda functions in its architecture as the secure ephemeral compute environment. Each untrusted code snippet provided by tenants is bootstrapped as a stand-alone Lambda function, with strong Firecracker-based isolation across different functions and execution environments addressing Launchpad’s requirements. This isolation provides dedicated resources, customizable access permissions, independent monitoring and operations, and automatic scaling for each function, while maintaining complete separation from other functions and their execution environments, as illustrated in the following diagram.

With Lambda being a serverless compute service, adopting it for the Launchpad architecture yielded several significant benefits. The major business benefit was that tenants could implement thousands of custom workflow augmentations on their own simply by providing code snippets, instead of the Launchpad engineering team being responsible for implementing them in the core platform code. Other benefits included:

Managed runtimes – AWS handles patching and updating the underlying infrastructure, operating system, and runtimes for customers, reducing the potential attack surface.
Fine-grained permissions – Each function can have its own set of access policies to tightly control what resources and actions it can access.
No need to pre-provision and pay for overprovisioned capacity – Lambda functions scale up and down automatically based on traffic patterns.
Built-in monitoring – Lambda functions emit detailed metrics, logs, and traces through Amazon CloudWatch and AWS X-Ray out of the box, making it straightforward to monitor tenant code execution.

To further reduce risks, Launchpad runs these Lambda functions with untrusted code in a dedicated AWS account. This account is separated from the core SaaS platform account. When end-users create a new function in the Launchpad authoring portal, they upload their code and specify the code handler to be executed during the invocation. Users can also map function input and output to Launchpad fields for further processing to enable an even higher degree of customizability and integration. The multi-tenant authoring service is a Control Plane component that runs as a microservice on the Amazon Elastic Kubernetes Service (Amazon EKS) cluster and uses the Lambda API for function lifecycle management, as illustrated in the following diagram. After a function resource is created, it can be used for further invocations.

Runtime architecture

At runtime, when Launchpad needs to invoke a function, it calls the Lambda Invoke API. Before the function is invoked, the multi-tenant runtime service performs a tenancy check to make sure the request is coming from an authorized tenant by doing the token validation. After a successful validation, the service invokes the required Lambda function. To invoke functions hosted in a different AWS account, the multi-tenant runtime service uses an AWS Identity and Access Management (IAM) role to assume the required permissions and invokes the Lambda service using the AWS SDK. The sequence of interactions is shown in the following architecture diagram.

The workflow consists of the following steps:

An incoming user request reaches the application gateway service.
The application gateway authenticates the request using the tenancy security service.
After it’s authenticated, the request is forwarded to the multi-tenant runtime service
The multi-tenant runtime service validates the supplied token and performs a tenancy check. This makes sure tenants can only invoke own functions they have permissions for (for example, functions they own).
The multi-tenant runtime service pod assumes the IAM role required for invoking the tenant-specific Lambda function in a different AWS account.
The multi-tenant runtime service pod invokes the required Lambda function.

Invoking the platform API from custom code is as straightforward as connecting to any external API. The custom code can authenticate with the platform using OAuth2. To facilitate the authentication, the developer can pass along the credentials as input parameters to the function from the core platform. Then the developer can create a corresponding record (isolated by tenant) in the platform that stores the credentials per function, and pass credentials as input parameters during invocation.

Distributed architecture observability

Operating a distributed architecture that runs untrusted code across multiple AWS accounts requires a comprehensive observability strategy. Launchpad’s approach combines centralized logging and monitoring with cross-account aggregation to provide a unified operational view of the platform.

The monitoring architecture uses CloudWatch Metrics to observe the Lambda functions, aggregating them through a centralized observability layer. This setup empowers platform operators to correlate Lambda function metrics with the core platform services running on Amazon EKS. Launchpad also collects per-function telemetry like function invocations, error rates, and execution time, which allows them to observe per-tenant metrics. These telemetry dimensions enable both a platform-wide and tenant-specific monitoring perspective.

For logging and troubleshooting, Launchpad implements a unified logging pipeline that aggregates Lambda function logs with application gateway and runtime service logs. Each request flowing through the system carries a correlation ID, so operators can trace execution paths across the core SaaS services and into the tenant functions running in the AWS account running tenant Lambda functions.

With this multi-layer observability architecture, Launchpad can maintain operational excellence while running tenant code securely at scale. Regular operational reviews drive continuous improvements in monitoring coverage and incident response procedures. Having per-tenant Lambda functions make it possible for Launchpad to use tenant-specific cost allocation tags, further empowering them to understand the cost footprint of running tenant custom code.

Best practices

When building a SaaS solution, maintaining a unified core code base is essential for scalability and manageability. Implementing per-tenant variations within the core platform code can lead to maintenance complexity and technical debt. Instead, architect your SaaS solution to have extension points, which allow your tenants to inject their custom code at specific points in the workflow, enabling customization without compromising the platform’s maintainability. This pattern makes sure the core SaaS platform remains clean and standardized while offering the flexibility that customers demand.

Additional best practices include:

Use separate accounts for running Lambda functions with untrusted tenant-provided code to make sure it’s isolated from your core SaaS platform code.
Grant absolute minimum required access permissions to the execution role assigned to the function. The custom code running within the execution environment gets permissions defined in the execution role when making requests to AWS API endpoints. If the function doesn’t need to reach out to AWS API endpoints, remove all permissions from the execution role and add an explicit AWSDenyAll policy.
Use separate Lambda functions for each code snippet and each tenant. This will provide the highest degree of cross-tenant isolation. Resources are not reused across different functions and execution environments.
Use Lambda layers in case you need to add a layer of vendor-provided code in order to keep it separated from the untrusted tenant-provided code.
Implement additional security controls, such as using Amazon Virtual Private Cloud (Amazon VPC) constructs to restrict network access and VPC Flow Logs for network activity monitoring.

Conclusion

The implementation of a secure untrusted code execution environment within SaaS platforms addresses a critical need for tenant customization while maintaining architectural integrity. Lambda offers a built-in isolation model, fine-grained security controls, and serverless scalability, so SaaS providers such as Launchpad can address the requirements of executing tenant-provided code in a multi-tenant environment and offer robust customization capabilities while maintaining strict security boundaries and operational efficiency. This architectural pattern enables providers to focus on core platform development while confidently supporting tenant-specific workflows through the secure and scalable Lambda execution environment.

To learn more, refer to the Security Overview of AWS Lambda white paper. For additional serverless architectural patterns, see Serverlessland.com.

About the authors

PackScan: Building real-time sort center analytics with AWS Services

2025-05-30 Sairam Vangapally

Post Syndicated from Sairam Vangapally original https://aws.amazon.com/blogs/big-data/packscan-building-real-time-sort-center-analytics-with-aws-services/

Amazon manages a complex logistics network with multiple touch points, from fulfillment centers to sort centers to final customer delivery. Among these, sort centers play a crucial role in the middle mile, providing faster and more efficient package movement. Within Amazon’s Middle Mile operations, high-volume sort centers process millions of packages daily, making immediate access to operational data essential for optimizing efficiency and decision-making. Real-time visibility into key metrics—such as package movements, container statuses, and associate productivity—is critical for smooth logistics operations. To address the need for real-time operational planning, the Amazon Middle Mile team developed PackScan, a cloud-based platform designed to provide instant insights across the network. By significantly reducing data latency, PackScan enables proactive decision-making, so teams can monitor inbound package flows, optimize outbound shipments based on live data, track associate productivity, identify bottlenecks, and enhance overall operational efficiency—all in real time.

In this post, we explore how PackScan uses Amazon cloud-based services to drive real-time visibility, improve logistics efficiency, and support the seamless movement of packages across Amazon’s Middle Mile network.

Prerequisites

This post assumes a foundational understanding of the following services and concepts:

Core services such as Amazon Simple Notification Service (Amazon SNS), Amazon Simple Queue Service (Amazon SQS), AWS Lambda, Amazon Data Firehose, and Amazon OpenSearch Service
Infrastructure components like Amazon Elastic Compute Cloud (Amazon EC2), Amazon Load Balancers, and security configurations including network rules and security groups
Visualization tools such as Grafana hosted on Amazon EC2

Although hands-on experience is not required, a conceptual understanding of these services will help in understanding the architecture, design patterns, and components discussed throughout the article.

Business challenges

Amazon’s sort centers handle over 15 million packages daily across more than 120 facilities in North America. Given this scale, even minor delays in operational insights can lead to inefficiencies, increased costs, and escalations. Traditionally, data latencies of up to an hour have restricted the ability to make proactive decisions, directly affecting productivity, resource allocation, and responsiveness—especially during peak periods like holiday seasons and big deal days.

Without immediate visibility into package movements, container statuses, and associate performance, operational teams face challenges in identifying and resolving bottlenecks in real time. The lack of timely insights can disrupt the flow of packages, leading to shipment delays, reduced throughput, and suboptimal facility performance. Addressing these inefficiencies required a solution capable of delivering real-time, high-fidelity data to support rapid decision-making.

To bridge this gap, Amazon’s Middle Mile organization needed a scalable platform that could enhance visibility, minimize latency, and provide up-to-the-minute insights into logistics operations. PackScan was designed to meet these demands, giving teams access to the real-time data necessary to optimize workflows, mitigate bottlenecks, and improve overall efficiency.

Data flow

In 2024, PackScan was deployed across 80 sort centers in the USA, enabling real-time package analytics. The solution powers Grafana dashboards, which refresh every 10 seconds by fetching live package data from OpenSearch Service. With this near real-time visibility, operations teams can monitor package movement and sorting efficiency across sort centers. The following diagram outlines how package scan data is ingested, processed, and made actionable.

Each sort center is equipped with hardware at inbound stations where packages arrive from trailers. Integrated barcode scanners automatically scan each package as it enters the sorting process. Every scan generates an SNS event, capturing key attributes such as the package ID, dimensions, the associate who performed the scan, and the timestamp and location of the scan.

After they’re generated, these SNS events are ingested into Data Firehose through a Lambda function, where the data undergoes real-time enrichment. During this process, additional attributes are appended, including the business logic rules. The enriched data is then streamed into OpenSearch Service, where events are indexed to enable fast and efficient querying. With the indexed package scan events available in OpenSearch Service, real-time analytics and monitoring become possible. The Grafana dashboards query this data every 10 seconds, providing operational insights into package inflow metrics and associate performance.

Solution overview

PackScan was implemented using a structured and scalable approach, using AWS cloud-based services to enable high-frequency data ingestion, real-time processing, and actionable insights. The architecture is designed to minimize latency while providing reliability, scalability, and operational efficiency. The solution is built around a serverless, event-driven architecture that dynamically scales based on data ingestion volumes. The architecture—illustrated in the following figure—enabled us to build a real-time data solution, utilizing the advantages of various AWS services to provide low-latency analytics, high scalability, and real-time operational insights across Amazon’s sort centers.

The following are the key components and features of the solution:

Real-time data processing – Lambda functions serve as the processing backbone of the system, handling 500,000 scan events per second. Each incoming event is processed by applying data transformations, enrichment, and validation before passing it downstream.
High-frequency data ingestion and streaming – Data Firehose is the primary ingestion pipeline, handling millions of scan events daily from thousands of barcode scanners across multiple sort centers. The Firehose streams handle incoming data of 12,000 PUT requests per second, maintaining smooth ingestion and low-latency streaming. Data retention policies are set to buffer and forward enriched events every 60 seconds or upon reaching 5 MB batch size, optimizing storage and processing efficiency.
Optimized querying and operational insights – OpenSearch Service is used to index and store the processed scan events, providing real-time querying and anomaly detection. The OpenSearch cluster consists of 12 data nodes (r5.4xlarge.search) and 3 primary nodes (r5.large.search), processing up to 10 GB of data per day with a rolling index strategy, where indexes are rotated every 24 hours to maintain query performance. The system supports concurrent queries per second, enabling logistics teams to perform rapid lookups and gain instant visibility into package movements.
Live visualization and dashboarding – Grafana, hosted on an m5.12xlarge EC2 instance, provides real-time visualization of key logistics metrics. The dashboards refresh every 10 seconds, querying OpenSearch and displaying up-to-the-minute package analytics. The setup includes multiple preconfigured dashboards, monitoring package flow at different inbound stations, and workforce efficiency. These dashboards support concurrent users, enabling supervisors and associates to track and optimize operations proactively. The following screenshot shows one of the real-time dashboards, with details of package flow by different routes within sort centers.

The entire PackScan architecture is designed for automatic scaling, adjusting dynamically based on data ingestion volume to maintain efficiency during peak and off-peak operations. This approach provides cost-effective resource utilization while maintaining high availability and performance.

Business outcomes

The implementation of PackScan has led to measurable improvements in operational efficiency, workforce productivity, and real-time decision-making across Amazon’s sort centers. By reducing data latency and enabling real-time insights, PackScan has transformed logistics operations in meaningful ways:

Widespread deployment – PackScan was deployed across 80 sort centers, supporting approximately 1,000 display monitors that provide real-time operational insights.
Significant reduction in data latency – Data latency dropped from approximately 1 hour to less than 1 minute, allowing for real-time operational responsiveness and minimizing workflow disruptions.
Proactive operational management – With dynamic workload balancing and instant bottleneck identification, supervisors can now address issues as they arise, leading to smoother operations and fewer escalations.
Boost in workforce productivity – The real-time performance feedback has enhanced associate engagement, resulting in a 25% increase in throughput per hour and 12% reduction in labor hours.

Overall, PackScan has redefined real-time logistics visibility within Amazon’s Middle Mile operations, empowering operational teams with actionable insights, enhanced workforce efficiency, and a data-driven approach to package movement and sort center performance.

Lessons learned and best practices

The deployment and scaling of PackScan provided valuable insights into optimizing real-time logistics visibility. Several key lessons and best practices emerged from this implementation:

Cloud architecture drives efficiency – Adopting Amazon technologies provides seamless scalability, reduced operational overhead, and lower infrastructure costs, while maintaining high reliability. The following table shows an approximate breakdown of monthly service costs observed in production. This is an estimation based on current pricing; we recommend checking the respective AWS service pricing pages to generate the most up-to-date quote. This architecture demonstrates that with combination of provisioned and serverless design, production-ready solutions can be built and scaled at a fraction of the cost of traditional infrastructure.

AWS Service	Description	Estimated Monthly Cost
Amazon EC2	Three EC2 instances of type m5.12xlarge hosting Grafana	$1,700
AWS Lambda	Streams SNS events to Data Firehose	$4,000
Amazon Data Firehose	Real-time data delivery with 12,000 records streaming to OpenSearch Service	$1,500
Amazon OpenSearch Service	Indexing and querying package scan events	$28,000

Real-time visibility is a game changer – Immediate access to operational data enhances agility, enabling teams to make timely, data-driven decisions that prevent bottlenecks and improve throughput.
Continuous monitoring enhances decision-making – Operational dashboards should evolve with business needs. Regular monitoring and updates provide accuracy, usability, and relevance in driving informed decision-making.

By applying these best practices, PackScan has set a foundation for scalable, real-time logistics management, making sure that Amazon’s Middle Mile operations remain proactive, efficient, and highly responsive to changing business demands.

Conclusion

PackScan has successfully transformed real-time operational visibility within Amazon’s sort centers, addressing critical challenges in data latency, workforce productivity, and logistics efficiency. By using AWS services, particularly Data Firehose for real-time data delivery and OpenSearch Service for analytics, PackScan has enabled proactive decision-making, streamlined operations, and enhanced throughput in high-volume sort environments. Looking ahead, future enhancements will focus on further elevating operational intelligence and scalability, including:

Integrating predictive analytics to anticipate workflow bottlenecks and optimize resource allocation
Scaling the solution across additional operational scenarios, providing greater resilience and adaptability to dynamic logistics environments

With these advancements, PackScan will continue to drive operational excellence, cost-efficiency, and real-time decision-making capabilities, reinforcing Amazon’s commitment to innovation in logistics and supply chain management.

For those interested in implementing similar solutions, we recommend exploring AWS Serverless Architecture Patterns and the AWS Architecture Blog for additional insights and best practices in building scalable, real-time analytics solutions.

About the authors

Sairam Vangapally is a Data Engineer at Amazon with extensive experience architecting real-time, large-scale data platforms that power critical logistics operations across North America. He has led the design and deployment of end-to-end data pipelines, enabling high-throughput ingestion, transformation, and analytics at scale. He is passionate about building resilient data infrastructure and driving cross-functional collaboration to deliver solutions that accelerate operational insights and business impact.

Nitin Goyal serves as a Data Engineering Manager in Amazon’s Sort Center organization, where he leads initiatives to optimize operational efficiency across North American facilities. With over nine years of tenure at Amazon spanning multiple teams, he specializes in architecting high-performance data systems, with particular emphasis on real-time streaming pipelines, artificial intelligence, and low-latency solutions. His expertise drives the development of sophisticated operational workflows that enhance sort center productivity and effectiveness.

Improving platform resilience at Cash App

2025-05-29 Dustin Ellis

Post Syndicated from Dustin Ellis original https://aws.amazon.com/blogs/architecture/improving-platform-resilience-at-cash-app/

This post was coauthored with engineers at Cash App, including Ben Apprederisse (Platform Engineering Technical Lead), Jan Zantinge (Traffic Engineer), and Rachel Sheikh (Compute Engineer).

Cash App, a leading peer-to-peer payments and digital wallet service from Block, Inc., has grown to serve over 57 million users since its launch in 2013. Providing diverse offerings like instant money transfers, debit cards, stock and bitcoin investments, personal loans, and tax filing, Cash App continuously strives to make money more relatable, accessible, and secure. To meet this mission, their platform built on AWS must remain highly scalable and resilient.

Cash App has implemented resilience improvements across the entire technology stack, but for this post, we explore two enhancements implemented in 2024. First, we discuss how Cash App improved the resilience of its compute platform built on Amazon Elastic Kubernetes Service (Amazon EKS) by implementing a dual-cluster topology to reduce single points of failure. We also discuss how Cash App used AWS Fault Injection Service (AWS FIS) to conduct an Availability Zone power interruption scenario in non-production environments, preparing the platform team for real-world failures and ongoing regulatory requirements.

Moving to Amazon EKS

Cash App has used Amazon EKS as a shared platform for years, and in 2024, we migrated the last of our biggest and most critical services to it, driven by the ease of management and our strategy to run workloads in the cloud. This shift empowered our teams to focus more on application delivery and less on cluster management, abstracting away some of the complexity in operating Kubernetes clusters. With Amazon EKS, AWS is responsible for managing the Kubernetes control plane, which includes the control plane nodes, the ETCD database, and other infrastructure necessary for AWS to deliver a service offering with enhanced security features. As a consumer of Amazon EKS, we are still responsible for things like AWS Identity and Access Management (IAM), pod security, runtime security, networking, and cluster automatic scaling in the data plane for worker nodes. AWS refers to this as the Shared Responsibility Model; for more details, refer to Best Practices for Security. For a while, maintaining a single shared EKS cluster for each environment and AWS Region was acceptable and met our technical requirements.

Scaling challenges with a single shared EKS cluster

As Cash App’s shared EKS cluster grew and onboarded hundreds of microservices, we became an early adopter of the Karpenter Cluster Autoscaler, which improved application availability and cluster efficiency. Karpenter was designed to enhance node lifecycle management within Kubernetes clusters. It automates provisioning and deprovisioning of nodes based on the specific scheduling needs of pods, allowing for efficient scaling and cost optimization. We also reviewed and adopted many of the Amazon EKS Best Practices Guides for Reliability, Security, Networking, and Scalability.

Despite these efforts, in 2023, we realized having a single shared EKS cluster with hundreds of worker nodes had become a critical dependency in our cloud platform architecture, whereby standard operations like cluster upgrades posed a risk of impacting production traffic or causing downtime. Although we had a staging environment that allowed us to test new features and changes before rolling them out in production, it didn’t always reflect real traffic patterns that could allow impactful changes to go unnoticed in lower environments. For example, in late 2023, production traffic was briefly impacted due to a relatively new Kubernetes feature called API Priority and Fairness (APF) that introduced temporary control plane throttling following a cluster upgrade.

As a leader in the financial services industry, any interruption to our services directly impacts customers’ ability to transact and participate in the economy, so round-the-clock availability of our platform is essential. Following this incident, we reviewed and implemented the Kubernetes Control Plane best practices pertaining to APF and set up new detections in our observability platform to monitor things like FlowSchemas and queue depth. We also created a business case for investing in platform evolution and moving towards a multi-cluster topology. The desired state was to have a safer way to upgrade clusters, such that traffic is always serviceable regardless of the state of a single cluster in the fleet.

The following diagram illustrates our basic architecture in late 2023, portraying a single, Multi-AZ EKS cluster at Cash App. We had two ingress paths: for public ingress, we used Amazon Route 53 to a Network Load Balancer (NLB). For internal ingress across Amazon Virtual Private Cloud (Amazon VPC) networks (for example, from other business units), we set up a VPC endpoint service. Other components (such as service dependencies like database clusters and other Regions) are intentionally omitted from the diagram for simplicity.

Enterprise network architecture showcasing secure cross-VPC communication via PrivateLink to shared EKS cluster with public/internal ingress

Platform enhancements

In 2024, our cloud platform team implemented architectural changes to improve the reliability of the shared Amazon EKS platform, which was already running hundreds of services and over 10,000 pods. In the process, we also recognized the importance of chaos engineering to provide controlled failure testing and preparedness, which is also an evolving regulatory requirement for financial institutions. Our team has made multiple resilience improvements to apply long-term scalability; the rest of this post focuses on two enhancements we believe other AWS customers can benefit from.

Enhancement 1: Implementing a dual-cluster topology

When researching multi-cluster topologies and interacting with AWS specialists, we learned about architectural patterns to improve platform resilience with varying degrees of complexity. This included cell-based architecture, Route 53 weighted routing, and NLB chaining, among other patterns. Our initial requirements for the multi-cluster architecture were as follows:

Effortlessly deploy a service across two or more clusters
Serve production traffic from multiple clusters at the same time if needed
Seamlessly upgrade a production cluster without impacting production traffic
Have the ability to take a cluster out of the traffic flow during planned and unplanned events
Achieve reliability improvements with minimal architectural and cost changes

Based on these initial requirements and the complexity required to properly implement a cell-based architecture, we decided to start with Route 53 weighted routing for public ingress, and NLB chaining for internal ingress to achieve immediate improvements. Although we plan to implement a cell-based architecture in the future, it requires careful consideration and engineering investment (for example, creating a cell router and cell provisioning system). At the time, we needed a simpler, short-term solution that could improve overall scalability and reliability. From a Kubernetes perspective, we started with two Multi-AZ clusters in our production environment, each operating independently, such that a planned or unplanned event in one cluster would not affect the other cluster. Existing pipelines were extended to deploy the same Kubernetes resources into both clusters, which would actively serve ingress traffic using an NLB. If needed, we could divert all ingress traffic to one cluster during planned or unplanned events by updating the weights in Route 53 (for public ingress) or updating the listener rules and target groups on the frontend NLB (for internal ingress). In the following sections, we dive deeper into both approaches.

Public ingress changes

For public ingress, requests are still handled by Route 53. With the introduction of weighted routing, requests can be evenly routed between the two backend NLBs (one per cluster). This setup allows Cash App to split traffic evenly across clusters or temporarily route all traffic to one cluster during upgrades. As an example, during cluster A’s upgrade, we can direct ingress traffic to cluster B, enabling a sequential upgrade process with minimal disruption. This dual-cluster approach helps mitigate single points of failure and establishes a foundation for reliability improvements. The following diagram illustrates this architecture.

Highly available AWS Cash App architecture using Route 53 for 50/50 traffic distribution to dual NLB and EKS clusters across three AZs

Highly available architecture using Route 53 for 50/50 traffic distribution to dual NLB and EKS clusters across three AZs

Internal ingress changes

Cash App works closely with other business units like Square, and therefore needs to have a secure and scalable ingress path for cross-VPC traffic. To support these use cases, internal ingress requests are still routed from other business unit VPCs to the NLB associated with our VPC endpoint service. From there, requests are load balanced roughly evenly to two backend NLBs (one per cluster) using the listener rules. The following diagram illustrates this architecture.

Cash App architecture featuring secure VPC peering, tiered load balancing, and redundant EKS clusters for high availability

Key considerations

Consider the following when assessing these approaches:

The NLB chaining approach used for internal ingress enables controlled failover and maintenance with minimal disruption by adjusting NLB listeners and target groups, and reduces single points of failure by distributing requests across two or more clusters.
Although this pattern met our initial requirements and made it possible to shift traffic from one cluster to another, it’s not as granular as we would like (for example, shift traffic for only service A to cluster B, and leave all other traffic on cluster A). We will continue to evolve the architecture and implement a more intelligent ingress router and cell provisioning system. Furthermore, at the time of writing, NLB doesn’t support weighted target group routing.
Using the NLB chaining approach adds additional cost due to multiple NLBs in the traffic path. With NLBs, you’re charged for each hour or partial hour that an NLB is running, and the number of Network Load Balancer Capacity Units (NLCUs) used by the NLB per hour.
The NLB chaining approach doesn’t achieve the full benefits of a cell-based architecture, which provides greater isolation and fault isolation boundaries. Moving to a cell-based architecture requires more complexity and coordination, so we decided to start with NLB chaining and Route 53 weighted routing for immediate improvements and minimal re-architecture.

Looking forward, we plan to implement a cell-based architecture that aligns with Block’s compute strategy. This will consist of a cell management layer (to create, update, and delete cells) and a cell routing layer (to route requests to the correct cell), which allows for fine-grained traffic routing.

Enhancement 2: Using AWS FIS for Availability Zone power interruption

After moving to a dual-cluster topology, we researched ways to test ongoing platform resilience. Specifically, we wanted to test what would happen to our environment during single Availability Zone impairments and how our services and data stores would recover. This led us to AWS FIS, which we could use to conduct an Availability Zone power interruption scenario in our staging environment. AWS FIS is a fully managed service for running fault injection experiments on AWS infrastructure, making it straightforward to get started with chaos engineering. Chaos engineering is the practice of stressing an application in testing or production environments by creating disruptive events, such as a sudden increase in CPU or memory consumption, observing how the system responds, and implementing improvements.

Our initial AWS FIS experiment used multiple actions, including network disruption, database failover (targeting Amazon Aurora and Amazon ElastiCache clusters), and EKS worker node disruption (pod eviction and rescheduling). For the full list of AWS FIS actions, refer to AZ Availability: Power Interruption. For our initial experiment, we targeted infrastructure in a single staging account; however, AWS FIS supports multi-account experiments by setting up an orchestrator account that enables centralized configuration, management, and logging. The orchestrator account owns the AWS FIS experiment template. For future AWS FIS experiments, we plan to target more accounts and resources, operating within the constraints of AWS FIS quotas and limits.

For clear communication during the AWS FIS experiment, our platform team coordinated with internal application stakeholders and our AWS account team, monitoring metrics to gauge failover and recovery responses. After the experiment, we debriefed with AWS teams and submitted product feature requests to support ongoing service evolution. Moving forward, we plan to conduct AWS FIS experiments at least twice a year to reinforce reliability best practices and adhere to ongoing regulatory requirements. We will also diversify the types of fault injection experiments we perform, and expand beyond the single Availability Zone failure tests. For example, we might introduce a capability that allows application teams to perform one-time AWS FIS experiments just for their service and backend database infrastructure in a controlled, self-service fashion. An added benefit of using AWS FIS is that there is no agent or infrastructure to deploy or manage, and you’re only billed for the duration of the experiment. This made it both straightforward and affordable to get started with chaos engineering. The following diagram illustrates the architecture of the Availability Zone power interruption scenario targeting subnets, compute, and database clusters.

AWS Fault Injection Service AZ Availability: Power Interruption scenario

Key considerations

Setting up the AWS FIS experiment required us to modify default service limits and quotas to accommodate the scale and scope, such as the number of target resources for ElastiCache, Amazon Relational Database Service (Amazon RDS), and subnets. We recommend reviewing the quotas, service limits, and supported services prior to conducting your own experiment, and collaborate with your AWS account team to achieve a seamless experience.

We also learned that certain failures, such as database failover, require more than network disruption actions alone—specific failover actions had to be configured in the AWS FIS experiment template to trigger the desired behavior, as outlined further in the service documentation. If you’re starting with the Availability Zone power interruption scenario, be sure to include all of the required actions to achieve the desired outcome. Finally, we recommend using AWS FIS in lower environments because AWS FIS actions will have real impact on the targeted resource state and availability.

Conclusion

Through these enhancements, Cash App has improved overall resilience posture and laid the groundwork for future improvements. On the Amazon EKS side, we aim to implement an ingress routing layer through cell-based architecture patterns, enabling more granular traffic routing and improved fault isolation boundaries. We will continue to use Karpenter for cluster automatic scaling, and are in the process of rolling out AWS Graviton based instances to further improve resource efficiency. Finally, we will continue using AWS FIS for chaos engineering at both the platform and application level, performing larger-scale AWS FIS experiments with more complexity.

For other blog posts and related resources, refer to:

About the authors

How Airties achieved scalability and cost-efficiency by moving from Kafka to Amazon Kinesis Data Streams

2025-05-29 Steven Aerts, Reza Radmehr

Post Syndicated from Steven Aerts, Reza Radmehr original https://aws.amazon.com/blogs/big-data/how-airties-achieved-scalability-and-cost-efficiency-by-moving-from-kafka-to-amazon-kinesis-data-streams/

This post was cowritten with Steven Aerts and Reza Radmehr from Airties.

Airties is a wireless networking company that provides AI-driven solutions for enhancing home connectivity. Founded in 2004, Airties specializes in developing software and hardware for wireless home networking, including Wi-Fi mesh systems, extenders, and routers. The flagship software as a service (SaaS) product, Airties Home, is an AI-driven platform designed to automate customer experience management for home connectivity, offering proactive customer care, network optimization, and real-time insights. By using AWS managed services, Airties can focus on their core mission: improving home Wi-Fi experiences through automated optimization and proactive issue resolution. This includes minimizing network downtime, enabling faster diagnostic capabilities for troubleshooting, and enhancing overall Wi-Fi quality. The solution has demonstrated significant impact in reducing both the frequency of help desk calls and average call duration, leading to improved customer satisfaction and reduced operational costs for Airties while delivering enhanced service quality to their customers and the end-users.

In 2023, Airties initiated a strategic migration from Apache Kafka running on Amazon Elastic Compute Cloud (Amazon EC2) to Amazon Kinesis Data Streams. Prior to this migration, Airties operated multiple fixed-size Kafka clusters, each deployed in a single Availability Zone to minimize cross-AZ traffic costs. Although this architecture served its purpose, it required constant monitoring and manual scaling to handle varying data loads. The transition to Kinesis Data Streams marked a significant step in their cloud optimization journey, enabling true serverless operations with automatic scaling capabilities. This migration resulted in substantial infrastructure cost reduction while improving system reliability, eliminating the need for manual cluster management and capacity planning.

This post explores the strategies the Airties team employed during this transformation, the challenges they overcame, and how they achieved a more efficient, scalable, and maintenance-free streaming infrastructure.

Kafka use cases for Airties workloads

Airties continuously ingests data from tens of millions of access points (such as modems and routers) using AWS IoT Core. Before the transition, these messages were queued and stored within multiple siloed Kafka clusters, with each cluster deployed in a separate Availability Zone to minimize cross-AZ traffic costs. This fragmented architecture created several operational challenges. The segmented data storage required complex extract, transform, and load (ETL) processes to consolidate information across clusters, increasing the time to derive meaningful insights. The data collected serves multiple critical purposes—from real-time monitoring and reactive troubleshooting to predictive maintenance and historical analysis. However, the siloed nature of the data storage made it particularly challenging to perform cross-cluster analytics and delayed the ability to identify network-wide patterns and trends.

The data processing architecture at Airties served two distinct use cases. The first was a traditional streaming pattern with a batch reader processing data in bulk for analytical purposes. The second use case used Kafka as a queryable data store—a pattern that, though unconventional, has become increasingly common in large-scale data architectures.

For this second use case, Airties needed to provide immediate access to historical device data when troubleshooting customer issues or analyzing specific network events. This was implemented by maintaining a mapping of data points to their Kafka offsets in a database. When customer support or analytics teams needed to retrieve specific historical data, they could quickly locate and fetch the exact records from high-retention Kafka topics using these stored offsets. This approach eliminated the need for a separate database system while maintaining fast access to historical data.

To handle the massive scale of operations, this solution was horizontally scaled across dozens of Kafka clusters, with each cluster responsible for managing approximately 25 TB of records.

The following diagram illustrates the previous Kafka-based architecture.

Challenges with the Kafka-based architecture

At Airties, managing and scaling Kafka clusters has presented several challenges, hindering the organization from focusing on delivering business value effectively:

Operational overhead: Maintaining and monitoring Kafka clusters requires significant manual effort and operational overhead at Airties. Tasks such as managing cluster upgrades, handling hardware failures and rotation, and conducting load testing constantly demand engineering attention. These operational tasks take away from the team’s ability to concentrate on core business functions and value-adding activities within the company.

Scaling complexities : The process of scaling Kafka clusters involves multiple manual steps that create operational burden for the cloud team. These include configuring new brokers, rebalancing partitions across nodes, and providing proper data distribution—all while maintaining system stability. As data volume and throughput requirements fluctuate, scaling typically involves adding or removing entire Kafka clusters, which is a complex and time-consuming process for the Airties team.

Right-sizing cluster capacity: The static nature of Kafka clusters created a “one-size-fits-none” situation for Airties. For large-scale deployments with high data volumes and throughput requirements, adding new clusters required significant manual work, including capacity planning, broker configuration, and partition rebalancing, making it inefficient for handling dynamic scaling needs. Conversely, for smaller deployments, the standard cluster size was oversized, leading to resource waste and unnecessary costs.

How the new architecture addresses these challenges

The Airties team needed to find a scalable, high-performance, and cost-effective solution for real-time data processing that would allow seamless scaling with increasing data volumes. Data durability was a critical requirement, because losing device telemetry data would create permanent gaps in customer analytics and historical troubleshooting capabilities. Although temporary delays in data access could be tolerated, the loss of any device data point was unacceptable for maintaining service quality and customer support effectiveness.

To address these challenges, Airties implemented two different approaches for different scenarios.

The primary use case was real-time data streaming with Kinesis Data Streams. Airties replaced Kafka with Kinesis Data Streams to handle the continuous ingestion and processing of telemetry data from tens of millions of endpoints. This shift offered significant advantages:

Auto-scaling capabilities : Kinesis Data Streams can be scaled through simple API calls, alleviating the need for complex configurations and manual interventions.

Stream isolation : Each stream operates independently, meaning scaling operations on one stream have no impact on others. This alleviated the risks associated with cluster-wide changes in their previous Kafka setup.

Dynamic shard management : Unlike Kafka, where changing the number of partitions requires creating a new topic, Kinesis Data Streams allows adding or removing shards dynamically without losing message ordering within a partition.
Application Auto Scaling: Airties implemented AWS Application Auto Scaling with Kinesis Data Streams, allowing the system to automatically adjust the number of shards based on actual usage patterns and throughput requirements.

These features empowered Airties to efficiently manage resources, optimizing costs during periods of lower activity while seamlessly scaling up to handle peak loads.

For providing on-demand access to historical device data, Airties implemented a decoupled architecture that separates streaming, storage, and data access concerns. This approach replaced the previous solution where historical data was stored directly in Kafka topics. The new architecture consists of several key components working together:

Data collection and processing : The architecture begins with a consumer application that processes data from Kinesis Data Streams. This application implements analyzing the data, as making it available for detailed historical analysis. The result of the data analysis is written to Amazon Data Firehose, which buffers the data, writing it regularly to Amazon Simple Storage Service (Amazon S3), where it can later be picked up by Amazon EMR. This path is optimized for efficient storage and bulk reading from Amazon S3 by Amazon EMR. For raw data storage, multiple raw data samples are batched together in bulk files, which are stored in a separate Amazon S3 path. This path is optimized for storage efficiency and fetching raw data using Amazon S3 range queries.
Indexing and metadata management: To enable fast data retrieval, the architecture implements a sophisticated indexing system. For each record in the uploaded bulk files, two crucial pieces of information are recorded in an Amazon DynamoDB table: the Amazon S3 location (bucket and key) where the bulk file was written, and the sequence number of the corresponding data record in the Kinesis Data Streams queue. This indexing strategy provides low-latency access to specific data points, efficient querying capabilities for both real-time and historical data, automatic scaling to handle increasing data volumes, and high availability for metadata lookups.
Ad-hoc data retrieval: When specific historical data needs to be accessed, the system follows an efficient retrieval process. First, the application queries the DynamoDB table using the relevant identifiers. The query returns the exact Amazon S3 location and offset where the required data is stored. The application then fetches the specific data directly from Amazon S3 using range queries. This approach enables quick access to historical data points, minimal data transfer costs by retrieving only needed records, efficient troubleshooting and analysis workflows, and reduced latency for customer support operations.

This decoupled architecture uses the strengths of each AWS service: Amazon Kinesis Data Streams provides scalable and reliable real-time data streaming, while Amazon S3 delivers durable and cost-effective object storage for raw data, and Amazon DynamoDB enables fast and flexible storage of metadata and indexing. By separating streaming from storage and utilizing each service for its specific strengths, Airties created a more cost-effective and scalable solution for ad-hoc data access needs, aligning each component with its optimal AWS service. The new architecture not only improved data access performance but also significantly reduced operational complexity. Instead of managing Kafka topics for historical data storage, Airties now benefits from fully managed AWS services that automatically handle scaling, durability, and availability. This approach has proven particularly valuable for customer support scenarios, where quick access to historical device data is crucial for resolving issues efficiently.

Solution overview

Airties’s new architecture involves several critical components, including efficient data ingestion, indexing with AWS Lambda functions, optimized data aggregation and processing, and comprehensive monitoring and management practices using Amazon CloudWatch. The following diagram illustrates this architecture.

The new architecture consists of the following key stages:

Data collection and storage: The data journey begins with Kinesis Data Streams, which ingests real-time data from millions of access points. This streaming data is then processed by a consumer application that batches the data into bulk files (also known as briefcase files) for efficient storage in Amazon S3. This approach of streaming, batching, and then storing minimizes write operations and reduces overall costs, while providing data durability through built-in replication in Amazon S3. When the data is in Amazon S3, it’s readily available for both immediate processing and long-term analysis. The processing pipeline continues with aggregators that read data from Amazon S3, process it, and store aggregated results back in Amazon S3. By integrating AWS Glue for ETL operations and Amazon Athena for SQL-based querying, Airties can process large volumes of data efficiently and generate insights quickly and cost-effectively.
Data aggregation and bulk file creation: The aggregators play a crucial role in the initial data processing. They aggregate the incoming data based on predefined criteria and create bulk files. This aggregation process reduces the volume of data that needs to be processed in subsequent steps, optimizing the overall data processing workflow. The aggregators then write these bulk files directly to Amazon S3.
Indexing: Upon successful upload of a bulk file to Amazon S3 by the aggregators, the aggregator will write an index entry for the bulk file an Amazon DynamoDB table. This indexing mechanism allows for efficient retrieval of data based on device IDs and timestamps, facilitating quick access to relevant data using S3 range queries on the bulk files.
Further processing and analysis: The bulk files stored in Amazon S3 are now in a format optimized for querying and analysis. These files can be further processed using AWS Glue and analyzed using Athena, allowing for complex queries and in-depth data exploration without the need for additional data transformation steps.
Monitoring and management: To maintain the reliability and performance of the Kafka-less architecture, comprehensive monitoring and management practices were implemented. CloudWatch provides real-time monitoring of system performance and resource utilization, allowing for proactive management of potential issues. Additionally, automated alerts and notifications make sure anomalies are promptly addressed.

Results and benefits

The transition to this new architecture yielded significant benefits for Airties:

Scalability and performance: The new architecture empowers Airties to scale seamlessly with increasing data volumes. The ability to independently scale reader and writer operations has reduced performance impacts during high-demand periods. This is a significant improvement over the previous Kafka-based system, where scaling often required complex reconfigurations and could affect the entire cluster. With Kinesis Data Streams, Airties can now handle peak loads effortlessly while optimizing resource usage during quieter periods.
Reliability and fault tolerance: By using AWS managed services, Airties has significantly reduced system latency and improved overall uptime. The automatic data replication and recovery processes of Kinesis Data Streams provide enhanced data durability, a critical requirement for Airties’s operations. The improved high availability means that Airties can now offer more reliable services to their customers, minimizing disruptions and enhancing the overall quality of their home connectivity solutions.
Operational efficiency: The new architecture has dramatically reduced the need for manual intervention in capacity management. This shift has freed up valuable engineering resources, allowing the team to focus on delivering business value rather than managing infrastructure. The simplified operational model has increased the team’s productivity, empowering them to innovate faster and respond more quickly to customer needs. The reduction in operational overhead has also led to faster deployment cycles and more frequent feature releases, enhancing Airties’s competitiveness in the market.
Environmental impact and sustainability: The transition to a serverless architecture demonstrated significant environmental benefits, achieving a remarkable 40% reduction in energy consumption. This substantial decrease in energy usage was achieved by eliminating the need for constantly running EC2 instances and using more efficient, managed AWS services. This improvement in energy efficiency aligns with Airties’s commitment to environmental sustainability and establishes them as an environmentally responsible leader in the tech industry.
Cost optimization: The financial benefits of transitioning to a Kafka-less architecture are clearly demonstrated through comprehensive AWS Cost Explorer data. As shown in the following diagram, the total cost breakdown across all relevant services from January to July includes EC2 instances, DynamoDB, other Amazon EC2 costs, Kinesis Data Streams, Amazon S3, and Amazon Data Firehose. The most notable change was a 33% reduction in total monthly infrastructure costs (compared to January baseline), primarily achieved through significant decrease in Amazon EC2 related costs as the migration progressed, elimination of dedicated Kafka infrastructure, and efficient use of the AWS pay-as-you-go model. Although new costs were introduced for managed services (DynamoDB, Kinesis Data Streams, Amazon Data Firehose, Amazon S3), the overall monthly AWS costs maintained a clear downward trend. With these cost savings, Airties can offer more competitive pricing to their customers. The diagram below shows monthly cost breakdown during the transition.

Conclusion

The transition to this new architecture with Kinesis Data Streams has marked a significant milestone in Airties’s journey towards operational excellence and sustainability. These initiatives have not only enhanced system performance and scalability, but have also resulted in substantial cost savings (33%) and energy efficiency (40%). By using advanced technologies and innovative solutions on AWS, the Airties team continues to set the benchmark for efficient, reliable, and sustainable operations, while paving the way for a sustainable future. In order to explore how you can modernize your streaming architecture with AWS, see the Kinesis Data Streams documentation and watch this re:invent session on serverless data streaming with Kinesis Data Streams and AWS Lambda.

About the Authors

Steven Aerts is a principal software engineer at Airties, where his team is responsible for ingesting, processing, and analyzing the data of tens of millions of homes to improve their Wi-Fi experience. He was a speaker at conferences like Devoxx and AWS Summit Dubai, and is an open source contributor.

Reza Radmehr is a Sr. Leader of Cloud Infrastructure and Operations at Airties, where he leads AWS infrastructure design, DevOps and SRE automation, and FinOps practices. He focuses on building scalable, cost-efficient, and reliable systems, driving operational excellence through smart, data-driven cloud strategies. He is passionate about blending financial insight with technical innovation to improve performance and efficiency at scale.

Ramazan Ginkaya is a Sr. Technical Account Manager at AWS with over 17 years of experience in IT, telecommunications, and cloud computing. He is a passionate problem-solver, providing technical guidance to AWS customers to help them achieve operational excellence and maximize the value of cloud computing.

Unlock self-serve streaming SQL with Amazon Managed Service for Apache Flink

2025-05-28 Sofie Zilberman

Post Syndicated from Sofie Zilberman original https://aws.amazon.com/blogs/big-data/unlock-self-serve-streaming-sql-with-amazon-managed-service-for-apache-flink/

This post is co-written with Gal Krispel from Riskified.

Riskified is an ecommerce fraud prevention and risk management platform that helps businesses optimize online transactions by distinguishing legitimate customers from fraudulent ones.

Using artificial intelligence and machine learning (AI/ML), Riskified analyzes real-time transaction data to detect and prevent fraud while maximizing transaction approval rates. The platform provides a chargeback guarantee, protecting merchants from losses due to fraudulent transactions. Riskified’s solutions include account protection, policy abuse prevention, and chargeback management software, making it a comprehensive tool for reducing risk and enhancing customer experience. Businesses across various industries, including retail, travel, and digital goods, use Riskified to increase revenue while minimizing fraud-related losses. Riskified’s core business of real-time fraud prevention makes low-latency streaming technologies a fundamental part of its solution.

Businesses often can’t afford to wait for batch processing to make critical decisions. With real-time data streaming technologies like Apache Flink, Apache Spark, and Apache Kafka Streams, organizations can react instantly to emerging trends, detect anomalies, and enhance customer experiences. These technologies are powerful processing engines that perform analytical operations at scale. However, unlocking the full potential of streaming data often requires complex engineering efforts, limiting accessibility for analysts and business users.

Streaming pipelines are in high demand from Riskified’s Engineering department. Therefore, a user-friendly interface for creating streaming pipelines is a critical feature to increase analytical precision for detecting fraudulent transactions.

In this post, we present Riskified’s journey toward enabling self-service streaming SQL pipelines. We walk through the motivations behind the shift from Confluent ksqlDB to Apache Flink, the architecture Riskified built using Amazon Managed Service for Apache Flink, the technical challenges they faced, and the solutions that helped them make streaming accessible, scalable, and production-ready.

Using SQL to create streaming pipelines

Customers have a range of open source data processing technologies to choose from, such as Flink, Spark, ksqlDB, and RisingWave. Each platform offers a streaming API for data processing. SQL streaming jobs offer a powerful and intuitive way to process real-time data with minimal complexity. These pipelines use SQL, a widely known and declarative language, to perform real-time transformations, filtering, aggregations, and joins in continuous data streams.

To illustrate the power of streaming SQL in ecommerce fraud prevention, consider the concept of velocity checks, which are a critical fraud detection pattern. Velocity checks are a type of security measure used to detect unusual or rapid activity by monitoring the frequency and volume of specific actions within a given timeframe. These checks help identify potential fraud or abuse by analyzing repeated behaviors that deviate from normal user patterns. Common examples include detecting multiple transactions from the same IP address in a short time span, monitoring bursts of account creation attempts, or tracking the repeated use of a single payment method across different accounts.

Use case: Riskified’s velocity checks

Riskified implemented a real-time velocity check using streaming SQL to monitor purchasing behavior based on user identifier.

In this setup, transaction data is continuously streamed through a Kafka topic. Each message contains user agent information originating from the browser, along with the raw transaction data. Streaming SQL queries are used to aggregate the number of transactions originating from a single user identifier within short time windows.

For example, if the number of transactions from a given user identifier exceeds a certain threshold within a 10-second period, this might signal fraudulent activity. When that threshold is breached, the system can automatically flag or block the transactions before they are completed. The following figure and accompanying code provide a simplified example of the streaming SQL query used to detect this behavior.

Velocity check SQL flow

SELECT userIdentifier,TUMBLE_START(createdAt, INTERVAL '10' SECONDS) 
  AS windowStart,TUMBLE_END(createdAt, INTERVAL '10' SECONDS) 
  AS windowEnd, COUNT(*) AS paymentAttempts
FROM transactions
  WINDOW TUMBLING (SIZE 10 SECONDS)
GROUP BY userIdentifier;

Although defining SQL queries over static datasets might appear straightforward, developing and maintaining robust streaming applications introduces unique challenges. Traditional SQL operates on bounded datasets, which are finite collections of data stored in tables. In contrast, streaming SQL is designed to process continuous, unbounded data streams resembling the SQL syntax.

To address these challenges at scale and make streaming job creation accessible to engineering teams, Riskified implemented a self-serve solution based on Confluent ksqlDB, using its SQL interface and built-in Kafka integration. Engineers could define and deploy streaming pipelines using SQL, chaining ksqlDB streams from source to sink. The system supported both stateless and stateful processing directly on Kafka topics, with Avro schemas used to define the structure of streaming data.

Although ksqlDB provided a fast and approachable starting point, it eventually revealed several limitations. These included challenges with schema evolution, difficulties in managing compute resources, and the absence of an abstraction for managing pipelines as a cohesive unit. As a result, Riskified began exploring alternative technologies that could better support its expanding streaming use cases. The following sections outline these challenges in more detail.

Evolving the stream processing architecture

In evaluating alternatives, Riskified focused on technologies that could address the specific demands of fraud detection while preserving the simplicity that made the original approach appealing. The team encountered the following challenges in maintaining the previous solution:

Schemas are managed in Confluent Schema Registry, and the message format is Avro with FULL compatibility mode enforced. Schemas are constantly evolving according to business requirements. They are version controlled using Git with a strict continuous integration and continuous delivery (CI/CD) pipeline. As schemas grew more complex, ksqlDB’s approach to schema evolution didn’t automatically incorporate newly added fields. This behavior required dropping streams and recreating them to add new fields instead of just restarting the application to incorporate new fields. This approach caused inconsistencies with offset management due to the stream’s tear-down.
ksqlDB enforces a TopicNameStrategy schema registration strategy, which provides 1:1 schema-to-topic coupling. This means the exact schema definition has to be registered multiple times, one time for each topic it is used for. Riskified’s schema registry deployment uses RecordNameStrategy for schema registration. It’s an efficient schema registry strategy that allows for sharing schemas across multiple topics, storing fewer schemas, and reducing registry management overhead. Having mixed strategies in the schema registry caused errors with Kafka consumer clients attempting to decode messages, because the client implementation expected a RecordNameStrategy according to Riskified’s standard.
ksqlDB internally registers schema definitions in specific ways where fields are interpreted as nullable, and Avro Enum types are converted to Strings. This behavior caused deserialization errors when attempting to migrate native Kafka consumer applications to use the ksqlDB output topic. Riskified’s code base uses the Scala programming language, where optional fields in the schema are interpreted as Option. Transforming every field as optional in the schema definition required heavy refactoring, treating all Enum fields as Strings, and handling the Option data type for every field that requires safe handling. This cascading effect made the migration process more involved, requiring additional time and resources to achieve a smooth transition.

Managing resource contention in ksqlDB streaming workloads

ksqlDB queries are compiled into a Kafka Streams topology. The query definition defines the topology’s behavior.

Streaming query resources are shared rather than isolated. This approach typically leads to the overallocation of cluster resources. Its tasks are distributed across nodes in a ksqlDB cluster. This architecture means processing tasks with no resource isolation, and a specific task can impact other tasks running on the same node.

Resource contention between tasks on the same node is common in a production-intensive environment when using a cluster architecture solution. Operation teams often fine-tune cluster configurations to maintain acceptable performance, frequently mitigating issues by over-provisioning cluster nodes.

Challenges with ksqlDB pipelines

A ksqlDB pipeline is a chain of individual streams and lacks flow-level abstraction. Imagine a complex pipeline where a consumer publishes to multiple topics. In ksqlDB, each topic (both input and output) must be managed as a separate stream abstraction. However, there is no high-level abstraction to represent an entire pipeline that chains these streams together. As a result, engineering teams must manually assemble individual streams into a cohesive data flow, without built-in support for managing them as a single, complete pipeline.

This architectural approach particularly impacts operational tasks. Troubleshooting requires examining each stream separately, making it difficult to monitor and maintain pipelines that contain dozens of interconnected streams. When issues occur, the health of each stream needs to be checked individually, with no logical data flow component to help understand the relationships between streams or their role in the overall pipeline. The absence of a unified view of the data flow significantly increased operational complexity.

Flink as an alternative

Riskified began exploring alternatives for its streaming platform. The requirements were clear: a strong processing technology that combines a rich low-level API and a streaming SQL engine, backed by a strong open source community, proven to perform in the most demanding production environments.

Unlike the previous solution, which supported only Kafka-to-Kafka integration, Flink offers an array of connectors for various databases and Streaming platforms. It was quickly recognized that Flink had the potential to handle complex streaming use cases.

Flink offers multiple deployment options, including standalone clusters, native Kubernetes deployments using operators, and Hadoop YARN clusters. For enterprises seeking a fully managed option, cloud providers like AWS offer managed Flink services that help alleviate operational overhead, such as Managed Service for Apache Flink.

Benefits of using Managed Service for Apache Flink

Riskified decided to implement a solution using Managed Service for Apache Flink. This choice offered several key advantages:

It offers a quick and reliable way to run Flink applications and reduces the operational overhead of independently managing the infrastructure.
Managed Service for Apache Flink provides true job isolation by running each streaming application in its dedicated cluster. This means you can manage resources separately for each job and reduce the risk of heavy streaming jobs inflicting resource starvation for other running jobs.
It offers built-in monitoring using Amazon CloudWatch metrics, application state backup with managed snapshots, and automatic scaling.
AWS offers comprehensive documentation and practical examples to help accelerate the implementation process.

With these features, Riskified could focus on what truly matters—getting closer to the business goal and starting to write applications.

Using Flink’s streaming SQL engine

Developers can use Flink to build complex and scalable streaming applications, but Riskified saw it as more than just a tool for experts. They wanted to democratize the power of Flink into a tool for the entire organization, to solve complex business challenges involving real-time analytics requirements without needing a dedicated data professional.

To replace their previous solution, they envisioned maintaining a “build once, deploy many” application, which encapsulates the complexity of the Flink programming and allows the users to focus on the SQL processing logic.

Kafka was maintained as the input and output technology for the initial migration use case, which is similar to the ksqlDB setup. They designed a single, flexible Flink application where end-users can modify the input topics, SQL processing logic, and output destinations through runtime properties. Although ksqlDB primarily focuses on Kafka integration, Flink’s extensive connector ecosystem enables it to expand to diverse data sources and destinations in future phases.

Managed Service for Apache Flink provides a flexible way to configure streaming applications without modifying their code. By using runtime parameters, you can change the application’s behavior without modifying its source code.

Using Managed Service for Apache Flink for this approach includes the following steps:

Apply parameters for the input/output Kafka topic, a SQL query, and the input/output schema ID (assuming you’re using Confluent Schema Registry).
Use AvroSchemaConverter to convert an Avro schema into a Flink table.
Apply the SQL processing logic and save the output as a view.
Sink the view results into Kafka.

The following diagram illustrates this workflow.
Streaming SQL system diagram

Performing Flink SQL query compilation without a Flink runtime environment

Providing end-users with significant control to define their pipelines makes it critical to verify the SQL query defined by the user before deployment. This validation prevents failed or hanging jobs that could consume unnecessary resources and incur unnecessary costs.

A key challenge was validating Flink SQL queries without deploying the full Flink runtime. After investigating Flink’s SQL implementation, Riskified discovered its dependency on Apache Calcite – a dynamic data management framework that handles SQL parsing, optimization, and query planning independently of data storage. This insight enabled using Calcite directly for query validation before job deployment.

You must know how the data is structured to validate a Flink SQL query on a streaming source like a Kafka topic. Otherwise, unexpected errors might occur when attempting to query the streaming source. Although an expected schema is used with relational databases, it’s not enforced for streaming sources.

Schemas guarantee a deterministic structure for the data stored in a Kafka topic when using a schema registry. A schema can be materialized into a Calcite table that defines how data is structured in the Kafka topic. It allows inferring table structures directly from schemas (in this case, Avro format was used), enabling thorough field-level validation, including type checking and field existence, all before job deployment. This table can later be used to validate the SQL query.

The following code is an example of supporting basic field types validation using Calcite’s AbstractTable:

public class FlinkValidator {
    public static void validateSQL(String sqlQuery, Schema avroSchema) throws Exception {
        SqlParser.Config sqlConfig = SqlParser.config()
                .withCaseSensitive(true);
        SqlParser sqlParser = SqlParser.create(sqlQuery, sqlConfig);
        SqlNode parsedQuery = sqlParser.parseQuery();
        RelDataTypeFactory typeFactory = new SqlTypeFactoryImpl(RelDataTypeFactory.DEFAULT);
        CalciteSchema rootSchema = createSchemaWithAvro(avroSchema);
        SqlValidator validator = SqlValidatorUtil.newValidator(
                Frameworks.newConfigBuilder().build().getOperatorTable(),
                rootSchema.createCatalogReader(Collections.emptyList(), typeFactory),
                typeFactory,
                SqlValidator.Config.DEFAULT
        );
        validator.validate(parsedQuery);
    }
    private static CalciteSchema createSchemaWithAvro(Schema avroSchema) {
        CalciteSchema rootSchema = CalciteSchema.createRootSchema(true);
        rootSchema.add("TABLE", new SimpleAvroTable(avroSchema));
        return rootSchema;
    }
    private static class SimpleAvroTable extends org.apache.calcite.schema.impl.AbstractTable {
        private final Schema avroSchema;
        public SimpleAvroTable(Schema avroSchema) {
            this.avroSchema = avroSchema;
        }
        @Override
        public RelDataType getRowType(RelDataTypeFactory typeFactory) {
            RelDataTypeFactory.Builder builder = typeFactory.builder();
            for (Schema.Field field : avroSchema.getFields()) {
                builder.add(field.name(), convertAvroType(field.schema(), typeFactory));
            }
            return builder.build();
        }
        private RelDataType convertAvroType(Schema schema, RelDataTypeFactory typeFactory) {
            switch (schema.getType()) {
                case STRING:
                    return typeFactory.createSqlType(SqlTypeName.VARCHAR);
                case INT:
                    return typeFactory.createSqlType(SqlTypeName.INTEGER);
                default:
                    return typeFactory.createSqlType(SqlTypeName.ANY);
            }
        }
    }
}

You can integrate this validation approach as an intermediate step before creating the application. You can create a streaming job programmatically with the AWS SDK, AWS Command Line Interface (AWS CLI), or Terraform. The validation occurs before submitting the streaming job.

Flink SQL and Confluent Avro data type mapping limitation

Flink provides several APIs designed for different levels of abstraction and user expertise:

Flink SQL sits at the highest level, allowing users to express data transformations using familiar SQL syntax, which is ideal for analysts and teams comfortable with relational concepts.
The Table API offers a similar approach but is embedded in Java or Python, enabling type-safe and more programmatic expressions.
For more control, the DataStream API exposes low-level constructs to manage event time, stateful operations, and complex event processing.
At the most granular level, the ProcessFunction API provides full access to Flink’s runtime features. It’s suitable for advanced use cases that demand detailed control over state and processing behavior.

Riskified initially used the Table API to define streaming transformations. However, when deploying their first Flink job to a staging environment, they encountered serialization errors related to the avro-confluent library and Table API. Riskified’s schemas rely heavily on Avro Enum types, which the avro-confluent integration doesn’t fully support. As a result, Enum fields were converted to Strings, leading to mismatches during serialization and errors when attempting to sink processed data back to Kafka using Flink’s Table API.

Riskified developed an alternative approach to overcome the Enum serialization limitations while maintaining schema requirements. They discovered that Flink’s DataStream API could correctly handle Confluent’s Avro records serialization with Enum fields, unlike the Table API. They implemented a hybrid solution combining both APIs because the pipeline only required SQL processing on the source Kafka topic. It can sink to the output without any additional processing. The Table API is used for data processing and transformations, only converting to the DataStream API at the final output stage.

Managed Service for Apache Flink supports Flink APIs. It can switch between the Table API and the DataStream API.
A MapFunction can convert the Row type of the Table API into a DataStream of GenericRecord. The MapFunction maps Flink’s Row data type into GenericRecord types by iterating over the Avro schema fields and building the GenericRecord from the Flink Row type, casting the Row fields into the correct data type according to the Avro schema. This conversion is required to overcome the avro-confluent library limitation with Flink SQL.

The following diagram and illustrates this workflow.

Flink Table and DataStream APIs

The following code is an example query:

// SQL Query for filtering
Table queryResults = tableEnv.sqlQuery(
       "SELECT * FROM InputTable");
// 1. Convert query results from Table API to a DataStream<Row> and use DataStream API to sink query results to Kafka topic
DataStream<Row> rowStream = tableEnv.toDataStream(queryResults);
// Fetch the schema string from the schema registry
String schemaString = fetchSchemaString(schemaRegistryURL, schemaSubjectName);
// 2. Convert Row to GenericRecord with explicit TypeInformation, using custom AvroMapper
TypeInformation<GenericRecord> typeInfo = new GenericRecordAvroTypeInfo(avroSchema);
DataStream<GenericRecord> genericRecordStream = rowStream
       .map(new AvroMapper(schemaString))
       .returns(typeInfo); // Explicitly set TypeInformation
// 3. Define Kafka sink using ConfluentRegistryAvroSerializationSchema
KafkaSink<GenericRecord> kafkaSink = KafkaSink.<GenericRecord>builder()
       .setBootstrapServers(bootstrapServers)
       .setRecordSerializer(
               KafkaRecordSerializationSchema.builder()
                       .setTopic(sinkTopic)
                       .setValueSerializationSchema(
                               ConfluentRegistryAvroSerializationSchema.forGeneric(
                                       schemaSubjectName,
                                       avroSchema,
                                       schemaRegistryURL
                               )
                       )
                       .build()
       )
       .build();
// Sink to Kafka
genericRecordStream.sinkTo(kafkaSink);

CI/CD With Managed Service for Apache Flink

With Managed Service for Apache Flink, you can run a job by selecting an Amazon Simple Storage Service (Amazon S3) key containing the application JAR. Riskified’s Flink code base was structured as a multi-module repository to support additional use cases besides supporting self-service SQL. Each Flink job source code in the repository is an independent Java module. The CI pipeline implemented a robust build and deployment process consisting of the following steps:

Build and compile each module.
Run tests.
Package the modules.
Upload the artifact to the artifacts bucket twice: one JAR under <module>-<version>.jar and the second as <module>-latest.jar, resembling a Docker registry like Amazon Elastic Container Registry (Amazon ECR). Managed Service for Apache Flink jobs uses the latest tag artifact in this case. However, a copy of old artifacts is kept for code rollback reasons.

A CD process follows this process:

When merged, it lists all jobs for each module using the AWS CLI for Managed Service for Apache Flink.
The application JAR location is updated for each application, which triggers a deployment.
When the application is in a running state with no errors, the following application will be continued.

To allow safe deployment, this process is done gradually for every environment, starting with the staging environment.

Self-service interface for submitting SQL jobs

Riskified believes an intuitive UI is crucial for system adoption and efficiency. However, developing a dedicated UI for Flink job submission requires a pragmatic approach, because it might not be worth investing in unless there’s already a web interface for internal development operations.

Investing in UI development should align with the organization’s existing tools and workflows. Riskified had an internal web portal for similar operations, which made the addition of Flink job submission capabilities a natural extension of the self-service infrastructure.

An AWS SDK was installed on the web server to allow interaction with AWS components. The client receives user input from the UI and translates it into runtime properties to adjust the behavior of the Flink application. The web server then uses the CreateApplication API action to submit the job to Managed Service for Apache Flink.

Although an intuitive UI significantly enhances system adoption, it’s not the only path to accessibility. Alternatively, a well-designed CLI tool or REST API endpoint can provide the same self-service capabilities.

The following diagram illustrates this workflow.

Flow sequence diagram

Production experience: Flink’s implementation upsides

The transition to Flink and Managed Service for Apache Flink proved efficient in numerous aspects:

Schema evolution and data handling – Riskified can either periodically fetch updated schemas or restart applications when schemas evolve. They can use existing schemas without self-registration.
Resource isolation and management – Managed Service for Apache Flink runs each Flink job as an isolated cluster, reducing resource contention between jobs.
Resource allocation and cost-efficiency – Managed Service for Apache Flink enables minimum resource allocation with automatic scaling, proving to be more cost-efficient.
Job management and flow visibility – Flink provides a cohesive data flow abstraction through its job and task model. It manages the entire data flow in a single job and distributes the workload evenly over multiple nodes. This unified approach enables better visibility into the entire data pipeline, simplifying monitoring, troubleshooting, and optimizing complex streaming workflows.
Built-in recovery mechanism – Managed Service for Apache Flink automatically creates checkpoints and savepoints that enable stateful Flink applications to recover from failures and resume processing without data loss. With this feature, streaming jobs are durable and can recover safely from errors.
Comprehensive observability – Managed Service for Apache Flink exposes CloudWatch metrics that monitor Flink application performance and statistics. You can also create alarms based on these metrics. Riskfied decided to use the Cloudwatch Prometheus Exporter to export these metrics to Prometheus and build PrometheusRules to align Flink’s monitoring to the Riskified standard, which uses Prometheus and Grafana for monitoring and alerting.

Next steps

Although the initial focus was Kafka-to-Kafka streaming queries, Flink’s wide range of sink connectors offers the possibility of pluggable multi-destination pipelines. This versatility is on Riskfied’s roadmap for future enhancements.

Flink’s DataStream API provides capabilities that extend far beyond self-serving streaming SQL capabilities, opening new avenues for more sophisticated fraud detection use cases. Riskified is exploring ways to use DataStream APIs to enhance ecommerce fraud prevention strategies.

Conclusions

In this post, we shared how Riskified successfully transitioned from ksqlDB to Managed Service for Apache Flink for its self-serve streaming SQL engine. This move addressed key challenges like schema evolution, resource isolation, and pipeline management. Managed Service for Apache Flink offers features such as including isolated jobs environments, automatic scaling, and built-in monitoring, which proved more efficient and cost-effective. Although Flink SQL limitations with Kafka required workarounds, using Flink’s DataStream API and user-defined functions resolved these issues. The transition has paved the way for future expansion with multi-targets and advanced fraud detection capabilities, solidifying Flink as a robust and scalable solution for Riskified’s streaming needs.

If Riskified’s journey has sparked your interest in building a self-service streaming SQL platform, here’s how to get started:

Learn more about Managed Service for Apache Flink:
Get hands-on experience:
- Amazon Managed Service for Apache Flink Workshop
- Amazon Managed Service for Apache Flink Examples

About the authors

Gal Krispel is a Data Platform Engineer at Riskified, specializing in streaming technologies such as Apache Kafka and Apache Flink. He focuses on building scalable, real-time data pipelines that power Riskified’s core products. Gal is particularly interested in making complex data architectures accessible and efficient across the organization. His work spans real-time analytics, event-driven design, and the seamless integration of stream processing into large-scale production systems.

Sofia Zilberman works as a Senior Streaming Solutions Architect at AWS, helping customers design and optimize real-time data pipelines using open-source technologies like Apache Flink, Kafka, and Apache Iceberg. With experience in both streaming and batch data processing, she focuses on making data workflows efficient, observable, and high-performing.

Lorenzo Nicora works as Senior Streaming Solution Architect at AWS, helping customers across EMEA. He has been building cloud-centered, data-intensive systems for over 25 years, working across industries both through consultancies and product companies. He has used open-source technologies extensively and contributed to several projects, including Apache Flink, and is the maintainer of the Flink Prometheus connector.

Transforming Maya’s API management with Amazon API Gateway

2025-05-27 Arthi Jaganathan

Post Syndicated from Arthi Jaganathan original https://aws.amazon.com/blogs/architecture/transforming-mayas-api-management-with-amazon-api-gateway/

In this post, you will learn how Amazon Web Services (AWS) customer, Maya, the Philippines’ leading fintech company and digital bank, built an API management platform to address the growing complexities of managing multiple APIs hosted on Amazon API Gateway. API Gateway is a fully managed service that you can use to create RESTful and WebSocket APIs.

At Maya, different teams build APIs to expose their services to merchants. As the number of applications grew, the overhead of managing APIs increased. An API platform is a set of tools to simplify and standardize across API management concerns such as security, governance, automated deployments, observability, and integrations with multiple AWS accounts. This frees up application teams to focus on features while offloading management concerns to the API platform.

Initial state

Prior to implementing the API platform, Maya used a decentralized API management approach, which created significant challenges. Individual teams operated independent API gateways, resulting in fragmented infrastructure, leading to several issues:

Lack of standardization: Implementing consistent API standards across the organization proved difficult. Each team maintained its own configurations and practices, leading to inconsistencies in security and documentation.
Security posture maintenance: While Maya maintained a strong security posture, doing so across the numerous independent gateways was unsustainable. The overhead of applying consistent security policies and updates across all gateways was becoming increasingly burdensome.
Inconsistent operational visibility: Observability wasn’t inherently limited, rather inconsistently applied. Having multiple, different gateways makes it challenging to enforce a unified observability strategy and correlate data across the entire API ecosystem.

Solution overview

To address these challenges, Maya implemented an API platform, code-named Unified API Gateway. This centralized API management helps enforce consistent standards and improve overall security and observability. The following image illustrates the architecture of the Unified API Gateway and how it integrates with backend services managed and owned by different teams across different AWS accounts.

Enterprise-level AWS architecture diagram showing secured API gateway with multi-account EKS service distribution

API Platform Architecture

Maya chose to host all APIs in a central API account to centralize governance. This is managed by a dedicated shared services cloud team. Amazon CloudFront with AWS WAF and AWS Shield Advanced integration provides perimeter security. An AWS Lambda authorizer provides application security by managing authentication, authorization, and session management. This mitigates against the OWASP top 10 API security risks.

Integration to backend services is configured through API Gateway private integration and AWS Transit Gateway. In a decentralized API deployment strategy where APIs are co-hosted with the service in the respective AWS account, the integration will be simpler because you won’t need cross-account network connectivity. You will still benefit from the API management techniques covered in this post.

Standardization through structured service on-boarding

OpenAPI Specification (OAS) provides a structured definition for APIs. As shown in the following figure, service teams define the API OAS specification. This is embedded in Terraform infrastructure-as-code template for API Gateway. These are checked into source code repository and deployed using GitLab CI.

End-to-end API infrastructure pipeline showing specification integration through GitLab CI to AWS API Gateway

API Gateway Infrastructure-as-code (IaC) Pipeline

A configuration file used as a Terraform template supplies parameters for components of the solution such as backend integration, Lambda authorizer details, and additional headers for auditing. The following OAS snippets demonstrate this.

Integration with the backend service

x-amazon-apigateway-integration:
   type: "http_proxy"
   connectionId: "${vpc_link_id}"
   httpMethod: "GET"
   uri: "http://$${stageVariables.url}:11620/v1/api/endpoint/{id}" # double $ is not a typo

Adding headers to the request

x-amazon-apigateway-integration:
   type: "http_proxy"
   connectionId: "${vpc_link_id}"
   httpMethod: "GET"
   uri: "http://$${stageVariables.url}:11620/v1/api/endpoint/{id}"
   requestParameters:
      integration.request.header.x-requesting-service-id: "'api-gw'"
      integration.request.header.x-org-customer-id: "context.authorizer.x-org-customer-id"

Security definition

securitySchemes:
   lambda-authorizer:
      type: "${authorizer_type}"  
      name: "${authorizer_name}"
      x-amazon-apigateway-authtype: "custom"
      x-amazon-apigateway-authorizer:
         type: "request"
         authorizerUri: "${authorizer_uri}"
         authorizerCredentials: "${authorizer_credentials}"
         identitySource: "${authorizer_identity_source}"

API Gateway supports most of the OpenAPI 2.0 specification and the OpenAPI 3.0 specification but there are a few exceptions. Maya uses a custom plugin in the pipeline to enforce necessary limiting rules to help ensure compatibility with API Gateway.

To simplify deployment for development teams, a custom Terraform module abstracts away the API Gateway implementation details.

module "test-microservice-api-gateway" {
  # module version parameters
  source = "gitlabinstance.com/platform-engineering/apigw-terraform-module/aws"
  version = "1.2.7"

  # module deployed infrastructure parameters
  api_name = var.api_name
  api_mapping_path = var.api_mapping_path
  environment = var.environment
  aws_region = var.aws_region
  account_id = var.account_id
  tags = var.tags
  domain_name = var.domain_name
  stage_name = var.stage_name

  oas_path = var.oas_path # this value is populated via environment variable in Gitlab CI/CD

  providers = {
     aws = aws.apigw
  }
  authorizer_credentials = var.authorizer_credentials
  authorizer_uri = var.authorizer_uri
  vpc_link_id = var.vpc_link_id
  endpoint_url = var.endpoint_url
}

To use multi-level prefixes for custom domains with REST API Gateway, you need the Terraform module for API Gateway v2.

resource "aws_api_gateway_rest_api" "apigw" {
   name = "${var.environment}-${var.api_name}"
   body = templatefile(
     local.oasFilePath,
     {
       vpc_link_id = var.vpc_link_id
       authorizer_uri = var.authorizer_uri
       authorizer_credentials = var.authorizer_credentials
     }
  )
  description = "API Gateway for ${var.api_name}"
  endpoint_configuration {
    types = ["REGIONAL"]
  }

   # Default endpoint needs to be disabled if CloudFront is used as entry point to API Gateway
  disable_execute_api_endpoint = true
  tags = local.tags
  }

  # Use apigatewayv2 in order to have multi level base path ex. /v1/service_name
  resource "aws_apigatewayv2_api_mapping" "this" {
     domain_name = var.domain_name
    api_id = aws_api_gateway_rest_api.apigw.id
    stage = aws_api_gateway_stage.apigw.stage_name
    api_mapping_key = var.api_mapping_path
  }

Simplify API security with automation

Maya’s Unified API Gateway implements a robust, multi-layered security strategy. This approach helps ensure comprehensive protection from external threats and enforces stringent access control policies.

AWS WAF inspects and filters incoming traffic to protect against common web exploits, including OWASP Top 10, such as SQL injection and cross-site scripting attacks. A combination of custom and managed rule sets blocks malicious requests and enforces security policies. AWS Shield Advanced mitigates distributed denial of service (DDoS) attacks and provides 24/7 access to the AWS Shield Response Team (SRT) for expert support during attack events. This helps ensure high availability and resiliency.

API Gateway is integrated with a Lambda authorizer for authentication and authorization. The custom function implements fine-grained access control based on several factors such as identity, roles, and scopes.

To help ensure the consistency and integrity of the API configurations, all updates and deployments are strictly managed through an automated infrastructure-as-code (IaC) pipeline. This helps eliminate the risk of unauthorized or accidental manual changes to the API Gateway and any underlying infrastructure. The IaC pipeline makes sure that all API configurations, including security settings, are deployed through a controlled and auditable process. This prevents configuration drift and makes sure that security policies are consistently applied across all APIs. This also means that all changes are subject to code reviews and version control, adding another layer of security and traceability.

End-to-end visibility with observability

Maya’s Unified API Gateway prioritizes comprehensive observability to proactively monitor API performance, identify potential issues, and provide a seamless user experience. It uses a combination of AWS services and integrated tools to achieve this.

Amazon CloudWatch is used to monitor key performance metrics, including latency, error rates, and requests counts. CloudWatch provides real-time insights into the health and performance of APIs. Alerts on P95 and P99 values help identify and address performance bottlenecks, ensuring responsiveness.

CloudWatch metrics are streamed to Dynatrace, an application performance monitoring (APM) tool. The centralized view helps correlate data from various sources, create custom dashboards, and configure intelligent alerts based on predefined thresholds.

To help ensure complete visibility into API activity, the Lambda authorizer and API Gateway access logs are centralized in Splunk. This provides a comprehensive audit trail to track authentication and authorization events, identify security incidents, and troubleshoot API requests. Headers generated after authentication and authorization are done are passed down to the backend services for proper log correlation.

Future roadmap

The Unified API Gateway will continue to evolve to meet the growing needs of the organization and its partners and customers. The following are the key future enhancements that will further streamline API management, improve the developer experience, and enhance security.

Integration with the internal developer portal: This will provide a self-service UI for bootstrapping new APIs from scratch and further empower developers. This will also simplify documentation and discovery by cataloging all APIs
A modular, extension-based design for enhanced processing: This will introduce custom processing of requests in-line in the gateway account before integrating with backend services. Examples include digital signature verification, message transformation, and custom business logic. A modular design will offer a flexible and scalable way to enhance the functionality of Maya’s APIs without modifying backend services.
Bring your own (BYO) authorizer: Support a wider range of identity providers and authentication protocols, providing greater flexibility and control over API access.
Centralizing schema validation: Moving schema validation to API Gateway to bring consistency and improve the robustness and security of APIs by preventing malformed or malicious requests from being processed.
API monetization: Create new revenue streams by adding support for usage-based billing, tiered pricing, and subscription models.

Conclusion

This post has described the creation of Maya’s robust API management and governance solution, using a combination of native AWS services and powerful partner tools such as Terraform and Dynatrace. We’ve demonstrated how this Unified API Gateway has streamlined and automated core API processes, transforming Maya’s previously fragmented infrastructure into a secure and observable ecosystem. By establishing clear guardrails, the API solution team empowers developers to rapidly deploy APIs while maintaining consistent standards.

With the recent implementation of this solution across more teams, Maya is focused on defining and tracking key performance indicators (KPIs). We anticipate measuring critical metrics such as API onboarding efficiency, developer experience, API latency, and security incident rates. These insights will serve as a foundation for continuous improvement and optimization, ensuring the solution’s sustained effectiveness and evolution.

Visit the API platform guidance on Serverlessland to learn more about building API platforms. See the API Gateway pattern collection to learn more about designing REST API integrations on AWS.

About the Authors

How Smartsheet boosts developer productivity with Amazon Bedrock and Roo Code

2025-05-19 Rony Blum

Post Syndicated from Rony Blum original https://aws.amazon.com/blogs/architecture/how-smartsheet-boosts-developer-productivity-with-amazon-bedrock-and-roo-code/

This post was co-written with JB Brown from Smartsheet.

Organizations are often seeking ways to enhance developer productivity while maintaining cost-efficiency. AI-powered coding assistants are effective solutions to accelerate development cycles, but implementing these solutions at scale while managing costs presents unique challenges. Roo Code, an AI-powered autonomous coding agent located directly in developers’ editors, represents the latest evolution in AI-assisted development. It helps developers code faster and smarter by offering capabilities ranging from code generation and refactoring to documentation writing and code base analysis through natural language interaction. This post explores how Smartsheet successfully deployed Roo Code with Amazon Bedrock and Anthropic’s Claude, achieving significant improvements in developer efficiency while optimizing costs through innovative caching strategies.

The Amazon Bedrock prompt caching capability, which became generally available in April 2025, represents a significant advancement in optimizing AI model interactions. With this feature, organizations can cache frequently used prompts across multiple API calls, reducing costs by up to 90% and lowing latency by up to 85%. The technology is particularly valuable for applications that repeatedly use similar context, such as coding assistants that need to maintain context about code files. Prompt caching supports multiple foundation models (FMs), including Anthropic’s Claude 3.5 Haiku and Anthropic’s Claude 3.7 Sonnet, as well as the Amazon Nova family of models. The cached context remains available for up to 5 minutes after each access, with each cache hit resetting this countdown. For organizations implementing AI-assisted development workflows, they can use this capability to balance performance with cost-efficiency.

Benefits of Amazon Bedrock prompt caching

Smartsheet is a leading cloud-based enterprise work management service, enabling millions of users worldwide to plan, manage, track, automate, and report on work at scale. Their engineering team identified an opportunity to enhance developer productivity by implementing AI-assisted coding capabilities across their organization. The solution needed to scale efficiently while maintaining reasonable costs, leading them to choose Roo Code with Amazon Bedrock with Anthropic’s Claude as their FM provider.

“Our engineering team has achieved a 60% reduction in operational costs and 20% improvement in response latency when using Roo Code with Amazon Bedrock prompt caching. These improvements directly translate to enhanced developer productivity across our organization,” says JB Brown, VP of Engineering at Smartsheet.

This fact is underscored by the team’s ability to rapidly generate comprehensive code documentation and architecture diagrams in a mere four hours, a task that previously consumed 2 weeks of manual effort. This newfound efficiency extends to critical areas like AWS cost management, where a vital analysis tool was built in just 30 minutes, and is projected to save Smartsheet $450,000 annually. Furthermore, the speed and accessibility of information have been significantly amplified, as evidenced by the 6–8 hours of senior engineer time saved by quickly explaining complex Terraform/Terragrunt infrastructure. These examples highlight how Amazon Bedrock prompt caching is not just about incremental gains, but about unlocking substantial time and resource savings, empowering our developers to focus on innovation rather than time-consuming manual processes.

Solution overview

When implementing AI-assisted coding at scale, Smartsheet faced several key challenges around managing costs associated with large language model (LLM) API calls, reducing latency for developer interactions and handling repetitive elements in prompts efficiently. The development team recognized that a significant portion of the prompts sent to Amazon Bedrock were repeated elements: the system prompt and the continuously building message history, presenting an opportunity for optimization.

To address these challenges, Smartsheet implemented a comprehensive solution centered around Roo Code integration with Amazon Bedrock and prompt caching. Smartsheet deployed Roo Code, powered by Amazon Bedrock and Anthropic’s Claude 3.7, across the organization and integrated seamlessly into existing development workflows. This provided developers with immediate access to AI assistance for code writing, reviewing, and debugging tasks.

The team then contributed a sophisticated Amazon Bedrock prompt caching system to Roo Code that identifies and stores common prompt elements, significantly reducing redundant data transmission to Bedrock. This caching layer proved crucial in optimizing both cost and performance metrics. The caching decision flow incorporates two key checks: first, it verifies if the selected model supports prompt caching, then it makes sure the system prompt meets a minimum token threshold to make caching worthwhile.

The solution integrates Roo Code directly into developers’ integrated development environments (IDE), using the Amazon Bedrock prompt caching capabilities. The architecture separates content into static elements (like system instructions and code context) and dynamic elements (like user queries). When a developer makes a request, static content is automatically marked with cache checkpoints and stored for 5 minutes, with each access refreshing this window.

For subsequent queries, the system checks for matching cached content before processing. When matches are found, it bypasses recomputation of these sections, significantly reducing both response time and costs.

The following diagram shows the integration points between Roo Code and Amazon Bedrock.

Technical workflow diagram showing data flow between user query, cache decision logic, and Amazon Bedrock with response handling

The following screenshot highlights the model provider settings and connection parameters required for implementation.

Amazon Bedrock configuration interface showing AWS authentication, region settings, and Claude model with caching enabled

Optimizing performance with prompt caching

When analyzing code files, much of the context—such as file contents and environment details—remains consistent across multiple queries. By caching this content, subsequent requests can reuse it from the cache, significantly reducing cost and latency.

In our example, we first asked Roo Code to analyze a Python file. We then sent multiple follow-up queries about different aspects of the code. The first query explained the overall structure, followed by questions about specific classes, unused modules, and potential improvements.

The usage section of the responses showed significant improvements through caching. During the first query, the model processed the input and wrote the cached content. For subsequent queries, 90% of the input was served from the cache. Because cache reads are 90% cheaper than processing new input tokens, this resulted in an 83% cost reduction for follow-up queries.

Here’s how the caching worked across queries:

First query – The model processed the entire file content and environment details, writing them to cache
Subsequent queries – The model loaded the cached content, only processing the new question text

To illustrate this, let’s look at the usage statistics from a series of queries:

Query 1: "Explain my @/src/models/product.py code"
{
    "inputTokens": 1051,
    "outputTokens": 902,
    "totalTokens": 19374,
    "cacheReadInputTokenCount": 0,
    "cacheWriteInputTokenCount": 12421
}

Query 2: "Describe what is ProductUpdate class used for"
{
    "inputTokens": 1078,
    "outputTokens": 694,
    "totalTokens": 21664,
    "cacheReadInputTokenCount": 19892,
    "cacheWriteInputTokenCount": 0
}

Query 3: "Which module is not in use?"
{
    "inputTokens": 4,
    "outputTokens": 774,
    "totalTokens": 22753,
    "cacheReadInputTokenCount": 19892,
    "cacheWriteInputTokenCount": 2083
}

Query 4: "Any improvement to implement to my code making it more efficient?"
{
    "inputTokens": 1097,
    "outputTokens": 1270,
    "totalTokens": 24342,
    "cacheReadInputTokenCount": 21975,
    "cacheWriteInputTokenCount": 0
}

We can see from the results that the first query didn’t return cacheReadInputTokenCount but wrote to the cache the whole input. For subsequent queries, most of the input was read from the cache, except for the specific question.

The results clearly demonstrate the benefits of prompt caching:

70% of total input was served from cache across queries
90% of input for follow-up queries came from cache
Overall input costs reduced by 60%
Follow-up query costs reduced by 83%

This approach is particularly effective for development workflows where developers frequently query the same code base with different questions. The cached content provides necessary context for each query while significantly reducing processing overhead.

The following screenshot displays Roo Code’s interface, showing detailed metrics including token usage, cache hit rates, and associated API costs. The interface presents the cost savings achieved through caching and provides comprehensive usage analytics.

IDE showing Roo Code analysis with metrics, API costs, and detailed Python code explanation

Results and future innovation

The following line graph demonstrates the cost reduction trend over multiple queries runs, having the first query uncached, and the subsequent input using significant query responses from the cache. The y-axis shows costs in dollars, and the x-axis shows the cost per query. The graph shows a clear downward trend, with costs decreasing by 60% from the baseline.

Dual-axis graph displaying per-query and cumulative costs, demonstrating cost savings through caching over 4 queries

The implementation delivered remarkable improvements across key metrics. The prompt caching system achieved a 60% reduction in operational costs while simultaneously improving response latency by 20%.

The choice of Amazon Bedrock prompt caching proved particularly effective for Smartsheet’s coding assistance workflows. The 5-minute cache Time to Live (TTL) aligns perfectly with the natural rhythm of developer interactions with Roo Code, where multiple related queries typically occur within short timeframes. During active coding sessions, developers engage in rapid back-and-forth exchanges with the AI assistant—reviewing code, asking follow-up questions, and requesting modifications—while maintaining context through the cache. This agentic workflow, where each interaction builds upon previous context, takes full advantage of the prompt caching mechanism—each query refreshes the cache timer while preserving valuable context about the code being discussed.

Best practices and lessons learned

Through their implementation journey, Smartsheet developed several key best practices for organizations looking to implement similar solutions. Early implementation of prompt caching proved crucial, allowing the team to analyze prompt patterns and design efficient caching strategies from the start. The team found that focusing on developer experience through response time optimization and seamless tool integration was essential for successful adoption.

Continuous monitoring and analysis of usage patterns became a cornerstone of their optimization strategy. By tracking key metrics like response times and costs, the team could identify opportunities for further optimization and regularly adjust their caching strategies to maintain optimal performance.

Looking ahead

Smartsheet continues to innovate with Amazon Bedrock and Roo Code, exploring new ways to enhance developer productivity. Their engineering team is investigating advanced caching strategies and evaluating new FMs as they become available on Amazon Bedrock. The success of this implementation has established a strong foundation for future AI-assisted development initiatives.

Conclusion

Smartsheet’s implementation of Roo Code with Amazon Bedrock and prompt caching demonstrates how organizations can successfully deploy AI-assisted coding solutions while maintaining cost-efficiency. Their approach provides a blueprint for others looking to enhance developer productivity through AI while optimizing operational costs. The combination of strategic implementation, innovative caching solutions, and continuous optimization has enabled Smartsheet to achieve significant improvements in performance and cost metrics.

To learn more about implementing AI solutions at scale, refer to the Amazon Bedrock documentation and explore the AWS Machine Learning Blog. The frameworks and strategies outlined in this post can help organizations of varying size implement efficient, cost-effective AI-assisted development workflows.

About the Authors

How LaunchDarkly migrated to Amazon MWAA to achieve efficiency and scale

2025-05-16 Asena Uyar, Dean Verhey

Post Syndicated from Asena Uyar, Dean Verhey original https://aws.amazon.com/blogs/big-data/how-launchdarkly-migrated-to-amazon-mwaa-to-achieve-efficiency-and-scale/

This is a guest post coauthored with LaunchDarkly.

The LaunchDarkly feature management platform equips software teams to proactively reduce the risk of shipping bad software and AI applications while accelerating their release velocity. In this post, we explore how LaunchDarkly scaled the internal analytics platform up to 14,000 tasks per day, with minimal increase in costs, after migrating from another vendor-managed Apache Airflow solution to AWS, using Amazon Managed Workflows for Apache Airflow (Amazon MWAA) and Amazon Elastic Container Service (Amazon ECS). We walk you through the issues we ran into during the migration, the technical solution we implemented, the trade-offs we made, and lessons we learned along the way.

The challenge

LaunchDarkly has a mission to enable high-velocity teams to release, monitor, and optimize software in production. The centralized data team is responsible for tracking how LaunchDarkly is progressing toward that mission. Additionally, this team is responsible for the majority of the company’s internal data needs, which include ingesting, warehousing, and reporting on the company’s data. Some of the large datasets we manage include product usage, customer engagement, revenue, and marketing data.

As the company grew, our data volume increased, and the complexity and use cases of our workloads expanded exponentially. While using other vendor-managed Airflow-based solutions, our data analytics team faced new challenges on time to integrate and onboard new AWS services, data locality, and a non-centralized orchestration and monitoring solution across different engineering teams within the organization.

Solution overview

LaunchDarkly has a long history of using AWS services to solve business use cases, such as scaling our ingestion from 1 TB to 100 TB per day with Amazon Kinesis Data Streams. Similarly, migrating to Amazon MWAA helped us scale and optimize our internal extract, transform, and load (ETL) pipelines. We used existing monitoring and infrastructure as code (IaC) implementations and eventually extended Amazon MWAA to other teams, establishing it as a centralized batch processing solution orchestrating multiple AWS services.

The solution for our transformation jobs include the following components:

A central Amazon MWAA environment responsible for orchestration
An ECS cluster and AWS Fargate task definition for running tasks
Amazon CloudWatch and Datadog for monitoring
Amazon Elastic Container Registry (Amazon ECR) as the container registry
Amazon Simple Storage Service (Amazon S3) for artifact storage
AWS Secrets Manager as a secret store for Amazon MWAA and Amazon ECS

Our original plan for the Amazon MWAA migration was:

Create a new Amazon MWAA instance using Terraform following LaunchDarkly service standards.
Lift and shift (or rehost) our code base from Airflow 1.12 to Airflow 2.5.1 on the original cloud provider to the same version on Amazon MWAA.
Cut over all Directed Acyclic Graph (DAG) runs to AWS.
Upgrade to Airflow 2.
With the flexibility and ease of integration within AWS ecosystem, iteratively make enhancements around containerization, logging, and continuous deployment.

Steps 1 and 2 were executed quickly—we used the Terraform AWS provider and the existing LaunchDarkly Terraform infrastructure to build a reusable Amazon MWAA module initially at Airflow version 1.12. We had an Amazon MWAA instance and the supporting pieces (CloudWatch and artifacts S3 bucket) running on AWS within a week.

When we started cutting over DAGs to Amazon MWAA in Step 3, we ran into some issues. At the time of migration, our Airflow code base was centered around a custom operator implementation that created a Python virtual environment for our workload requirements on the Airflow worker disk assigned to the task. By trial and error in our migration attempt, we learned that this custom operator was essentially dependent on the behavior and isolation of Airflow’s Kubernetes executors used in the original cloud provider platform. When we began to run our DAGs concurrently on Amazon MWAA (which uses Celery Executor workers that behave differently), we ran into a few transient issues where the behavior of that custom operator could affect other running DAGs.

At this time, we took a step back and evaluated solutions for promoting isolation between our running tasks, eventually landing on Fargate for ECS tasks that could be started from Amazon MWAA. We had initially planned to move our tasks to their own isolated system rather than having them run directly in Airflow’s Python runtime environment. Due to the circumstances, we decided to advance this requirement, transforming our rehosting project into a refactoring migration.

We chose Amazon ECS on Fargate for its ease of use, existing Airflow integrations (ECSRunTaskOperator), low cost, and lower management overhead compared to a Kubernetes-based solution such as Amazon Elastic Kubernetes Service (Amazon EKS). Although a solution using Amazon EKS would improve the task provisioning time even further, the Amazon ECS solution met the latency requirements of the data analytics team’s batch pipelines. This was acceptable because these queries run for several minutes on a periodic basis, so a couple more minutes for spinning up each ECS task didn’t significantly impact overall performance.

Our first Amazon ECS implementation involved a single container that downloads our project from an artifacts repository on Amazon S3, and runs the command passed to the ECS task. We trigger those tasks using the ECSRunTaskOperator in a DAG in Amazon MWAA, and created a wrapper around the built-in Amazon ECS operator, so analysts and engineers on the data analytics team could create new DAGs just by specifying the commands they were already familiar with.

The following diagram illustrates the DAG and task deployment flows.

When our initial Amazon ECS implementation was complete, we were able to cut all of our existing DAGs over to Amazon MWAA without the prior concurrency issues, because each task ran in its own isolated Amazon ECS task on Fargate.

Within a few months, we proceeded to Step 4 to upgrade our Amazon MWAA instance to Airflow 2. This was a major version upgrade (from 1.12 to 2.5.1), which we implemented by following the Amazon MWAA Migration Guide and subsequently tearing down our legacy resources.

The cost increase of adding Amazon ECS to our pipelines was minimal. This was because our pipelines run on batch schedules, and therefore aren’t active at all times, and Amazon ECS on Fargate only charges for vCPU and memory resources requested to complete the tasks.

As a part of Step 5 for continuous assessment and improvements, we enhanced our Amazon ECS implementation to push logs and metrics to Datadog and CloudWatch. We could monitor for errors and model performance, and catch data test failures alongside existing LaunchDarkly monitoring.

Scaling the solution beyond internal analytics

During the initial implementation for the data analytics team, we created an Amazon MWAA Terraform module, which enabled us to quickly spin up more Amazon MWAA environments and share our work with other engineering teams. This allowed the use of Airflow and Amazon MWAA to power batch pipelines within the LaunchDarkly product itself in a couple of months shortly after the data analytics team completed the initial migration.

The numerous AWS service integrations supported by Airflow, the built-in Amazon provider package, and Amazon MWAA allowed us to expand our usage across teams to use Amazon MWAA as a generic orchestrator for distributed pipelines across services like Amazon Athena, Amazon Relational Database Service (Amazon RDS), and AWS Glue. Since adopting the service, onboarding a new AWS service to Amazon MWAA has been straightforward, typically involving the identification of the existing Airflow Operator or Hook to use, and then connecting the two services with AWS Identity and Access Management (IAM).

Lessons and results

Through our journey of orchestrating data pipelines at scale with Amazon MWAA and Amazon ECS, we’ve gained valuable insights and lessons that have shaped the success of our implementation. One of the key lessons learned was the importance of isolation. During the initial migration to Amazon MWAA, we encountered issues with our custom Airflow operator that relied on the specific behavior of the Kubernetes executors used in the original cloud provider platform. This highlighted the need for isolated task execution to maintain the reliability and scalability of our pipelines.

As we scaled our implementation, we also recognized the importance of monitoring and observability. We enhanced our monitoring and observability by integrating with tools like Datadog and CloudWatch, so we could better monitor errors and model performance and catch data test failures, improving the overall reliability and transparency of our data pipelines.

With the previous Airflow implementation, we were running approximately 100 Airflow tasks per day across one team and two services (Amazon ECS and Snowflake). As of the time of writing this post, we’ve scaled our implementation to three teams, four services, and execution of over 14,000 Airflow tasks per day. Amazon MWAA has become a critical component of our batch processing pipelines, increasing the speed of onboarding new teams, services, and pipelines to our data platform from weeks to days.

Looking ahead, we plan to continue iterating on this solution to expand our use of Amazon MWAA to additional AWS services such as AWS Lambda and Amazon Simple Queue Service (Amazon SQS), and further automate our data workflows to support even greater scalability as our company grows.

Conclusion

Effective data orchestration is essential for organizations to gather and unify data from diverse sources into a centralized, usable format for analysis. By automating this process across teams and services, businesses can transform fragmented data into valuable insights to drive better decision-making. LaunchDarkly has achieved this by using managed services like Amazon MWAA and adopting best practices such as task isolation and observability, enabling the company to accelerate innovation, mitigate risks, and shorten the time-to-value of its product offerings.

If your organization is planning to modernize its data pipelines orchestration, start assessing your current workflow management setup, exploring the capabilities of Amazon MWAA, and considering how containerization could benefit your workflows. With the right tools and approach, you can transform your data operations, drive innovation, and stay ahead of growing data processing demands.

About the Authors

Asena Uyar is a Software Engineer at LaunchDarkly, focusing on building impactful experimentation products that empower teams to make better decisions. With a background in mathematics, industrial engineering, and data science, Asena has been working in the tech industry for over a decade. Her experience spans various sectors, including SaaS and logistics, and she has spent a significant portion of her career as a Data Platform Engineer, designing and managing large-scale data systems. Asena is passionate about using technology to simplify and optimize workflows, making a real difference in the way teams operate.

Dean Verhey is a Data Platform Engineer at LaunchDarkly based in Seattle. He’s worked all across data at LaunchDarkly, ranging from internal batch reporting stacks to streaming pipelines powering product features like experimentation and flag usage charts. Prior to LaunchDarkly, he worked in data engineering for a variety of companies, including procurement SaaS, travel startups, and fire/EMS records management. When he’s not working, you can often find him in the mountains skiing.

Daniel Lopes is a Solutions Architect for ISVs at AWS. His focus is on enabling ISVs to design and build their products in alignment with their business goals with all advantages AWS services can provide them. His areas of interest are event-driven architectures, serverless computing, and generative AI. Outside work, Daniel mentors his kids in video games and pop culture.

How Flutter UKI optimizes data pipelines with AWS Managed Workflows for Apache Airflow

2025-04-29 Monica Cujerean, Ionut Hedesiu

Post Syndicated from Monica Cujerean, Ionut Hedesiu original https://aws.amazon.com/blogs/big-data/how-flutter-uki-optimizes-data-pipelines-with-aws-managed-workflows-for-apache-airflow/

This post is co-written with Monica Cujerean and Ionut Hedesiu from Flutter UKI.

In this post, we share how Flutter UKI transitioned from a monolithic Amazon Elastic Compute Cloud (Amazon EC2)-based Airflow setup to a scalable and optimized Amazon Managed Workflows for Apache Airflow (Amazon MWAA) architecture using features like Kubernetes Pod Operator, continuous integration and delivery (CI/CD) integration, and performance optimization techniques.

About Flutter UKI

As a division of Flutter Entertainment, Flutter UKI stands at the forefront of the sports betting and gaming industry. Flutter UKI offers a diverse portfolio of entertainment options, encompassing sports wagering, casino games, bingo, and poker experiences. Flutter UKI’s digital presence is robust, operating through an array of renowned online brands. These include the iconic Paddy Power, Sky Betting and Gaming, and Tombola. While Flutter UKI has established a strong online foothold, it maintains a significant physical presence with a network of 576 Paddy Power betting shops strategically located across the United Kingdom and Ireland.

The Data team at Flutter UKI is integral to the company’s mission of using data to drive business success and innovation. Specializing in data, their teams are dedicated to ensuring the seamless integration, management, and accessibility of data across multiple facets of the organization. By developing robust data pipelines and maintaining high data quality standards, Flutter UKI empowers stakeholders with reliable insights, optimizes operational efficiencies, and enhances the user experience. Its commitment to data excellence underpins its efforts to remain at the forefront of the online gaming and entertainment industry, delivering value and strategic advantage to the business.

The journey from self managing Airflow on Amazon EC2 to operating Airflow workloads at scale using Amazon MWAA

Flutter UKI’s data orchestration story began in 2017 with a modest Apache Airflow deployment on EC2 instances. As the company’s digital footprint expanded, so did their data pipeline requirements, leading to an increasingly complex monolithic cluster that demanded constant attention and resource scaling. The operational overhead of managing these EC2 instances became a significant challenge for their engineering teams. In 2022, Flutter UKI reached a crossroads. They needed to choose between re-architecting their service on Amazon Elastic Kubernetes Service (Amazon EKS) or embracing Amazon Managed Workflows for Apache Airflow (MWAA).

Flutter UKI was looking to transform their data orchestration service from a resource-intensive, self-managed system to a more efficient, managed service that would allow them to focus on their core business objectives rather than infrastructure management. Through extensive proof-of-concept (POC) testing and close collaboration with AWS Enterprise Support, Flutter UKI gained confidence in the ability of Amazon MWAA to handle their sophisticated workloads at scale. Their choice of MWAA over a self-managed solution on Amazon EKS reflected Flutter UKI’s strategic focus on using managed services to reduce operational complexity and accelerate innovation.

The migration to Amazon MWAA followed a methodical approach. There was extensive testing of multiple POCs. During the POCs, the engineering team found MWAA to have a good ease of use, which helped them reduce the learning curve resulting in faster. Learning from each POC, they iterated on the final architecture by making data-driven decisions. Starting with a small subset of directed acyclic graphs (DAG), the Flutter UKI team expanded their deployment over time, gradually moving hundreds and eventually thousands of workflows to the managed service. This careful, phased transition allowed them to validate the performance and reliability of MWAA while minimizing operational risk.

High-level architecture design

During the service re-architecture, the data team strategically managed over 3,500 dynamically generated DAGs by implementing a sophisticated distribution approach across multiple Amazon MWAA environments to create a workload isolated environment. Another reason for having multiple environments was to make sure that no one MWAA environment doesn’t get overloaded by multiple DAGs. By placing DAG files across diverse Amazon Simple Storage Service (Amazon S3) locations and configuring unique DAG_FOLDER paths for each environment, the data team created an intelligent load balancing mechanism that allocates workflows based on complex criteria including environment type, task volume, and environment-specific DAG affinity. A round-robin distribution strategy was designed to minimize single environment load, ensuring scalable infrastructure with zero performance degradation. This approach allowed the team to optimize workflow orchestration, maintaining high performance while efficiently managing an extensive collection of dynamically generated DAGs across multiple MWAA environments. To provide more compute to individual tasks and to keep the MWAA efficient, Flutter UKI delegated the DAG execution to an external compute environment using Amazon Elastic Kubernetes Service (Amazon EKS). The resulting high-level architecture is shown in the following figure.

Kubernetes Pod Operator (KPO) for tasks: Flutter UKI transitioned from using custom operators and many native Airflow operators to exclusively utilizing the Kubernetes Pod Operator (KPO). This decision simplified their architecture by eliminating unnecessary complexity, reducing maintenance overhead, and mitigating potential bugs. Additionally, this approach enabled them to allocate compute resources on a per-task basis, optimizing overall service performance. It also enabled the use of different container images for different tasks, thereby avoiding library dependency conflicts.
Kubernetes Pod Operator wrapper (KPOw): Instead of using KPO directly, they developed a wrapper (KPOw) around it. This wrapper abstracts the underlying complexity and minimizes the impact of signature changes in Airflow, Amazon MWAA, Amazon EKS, or operator versions. By centralizing these changes, they only need to update the wrapper rather than thousands of individual DAGs. The wrapper also simplifies DAGs by hiding repetitive parameters, such as node affinity, pod resources, and EKS cluster configurations. Furthermore, it enforces company-specific naming conventions and allows for parameter validation at task execution time rather than during DagBag refresh. They also introduced profiles and image files, where profile files contain necessary KPO parameters, and the corresponding image files link to the repository for the task’s container image. This setup ensures consistency across tasks using the same profile and facilitates simultaneous updates across tasks.
Monthly image updates in Kubernetes: Enforcing a policy of monthly image updates made sure that their code remained current, preventing security vulnerabilities and avoiding extensive code changes due to deprecated libraries.
Continuous Airflow updates: Flutter UKI maintains a cutting-edge infrastructure by implementing new Airflow versions shortly after release, while following a carefully orchestrated deployment strategy. Their approach uses standard Amazon MWAA configurations and employs a systematic testing protocol. New versions are first deployed to development and test environments for thorough validation before reaching production systems. This methodical progression significantly reduces the risk of disruptions to business-critical workflows.

To achieve operational excellence, Flutter UKI has implemented a comprehensive monitoring framework centered on Amazon CloudWatch metrics. Their monitoring solution includes strategically configured alarms that provide early warning signals for potential issues. This proactive monitoring approach enables their teams to quickly identify and investigate anomalies in production workload executions, ensuring high availability and performance of their data pipelines. The combination of careful version management and robust monitoring exemplifies Flutter UKI’s commitment to operational excellence in their cloud infrastructure.

CI/CD integration: By managing their code in GitLab, with mandatory code reviews and using Argo Events and Argo Workflows for image updates in AWS ECR, they streamlined their development processes.
Performance Optimization: A significant portion of the DAGs are dynamically generated based on database metadata. This generation process runs outside Amazon MWAA, with its own CI/CD pipeline, and the resulting DAG files are stored in the S3 DAG. Placing code outside of tasks was avoided, including parameter evaluation. Parameters and secrets are stored in AWS Secrets Manager and retrieved at task runtime. Engineers aim to minimize or eliminate inter-service dependencies within MWAA.

DAGs are scheduled to distribute execution times as evenly as possible. Task code and common modules are hosted on Amazon S3 and retrieved at runtime. For larger codebases, Amazon Elastic File System (Amazon EFS) volumes are mounted to task pods are used.

Results

Today, Flutter UKI’s infrastructure comprises four Amazon MWAA clusters, each executing tasks on dedicated Amazon EKS node groups. They manage approximately 5,500 DAGs encompassing over 30,000 tasks, handling more than 60,000 DAG runs daily with a concurrency exceeding 450 tasks running simultaneously across clusters. They anticipate a 10% monthly increase in this workload in the short to medium term. During major events like Cheltenham and Grand National, where data load increases by 30%, their MWAA service has demonstrated stability and scalability, achieving a 100% success rate for critical processes in 2025, a significant improvement over previous years.

Conclusion

Flutter UKI’s journey with AWS Managed Workflows for Apache Airflow (Amazon MWAA) has resulted in a stable, scalable, and resilient production environment. The careful re-architecting of Flutter UKI’s service, combined with strategic decisions around task execution and infrastructure management, has not only simplified their operations, but also enhanced performance and reliability. Security and compliance benefits were also noticed, because MWAA provides managed security updates, built-in encryption, and integration with AWS security services. Perhaps most importantly, the shift to MWAA has allowed Flutter UKI’s engineering teams to redirect their efforts from infrastructure maintenance to business-critical tasks, focusing on DAG development and improving data pipeline efficiency, ultimately accelerating innovation in their core business operations.

If you’re looking to reduce operational overhead and migrate to a fully managed Airflow solution on AWS, consider using Amazon MWAA. Get in touch with your Technical Account Manager or your Solutions Architect to discuss a solution specific to your use-case. You can also reach out to AWS Support by creating a case if you’re facing an issues setting up the service.

Ready to see what Amazon MWAA is like? Visit the AWS Management Console for Amazon MWAA. For more information, see What Is Amazon Managed Workflows for Apache Airflow. Additionally, Using Amazon MWAA with Amazon EKS shows you how to integrate Amazon MWAA with Amazon EKS.

About the authors

Monica Cujerean is a Principal Data Engineer at Flutter UKI, focusing on service related initiatives that cover performance optimization, cost effectiveness, and new feature adoption on most AWS service in our stack: Amazon MWAA, Amazon Redshift, Amazon Aurora, and Amazon SageMaker.

Ionut Hedesiu is a Senior Data Architect at Flutter UKI, responsible for designing strategic solutions to cover complex and varied business needs. His main expertise is on Amazon MWAA, Kubernetes, Amazon Sagemaker, and ETL solutions.

Nidhi Agrawal is a Technical Account Manager at AWS and works with large enterprise customers to provide the technical guidance, best practices, and strategic support to customers, helping them optimize their environments in the AWS Cloud.

John Kellett is a Senior Customer Solutions Manager with 25 years of experience across private and public sectors. John helps drive end-to-end customer engagement through program management excellence. By understanding and representing customers’ strategic visions, John aligns to develop the people, organizational readiness, and technology competencies to meet the desired outcomes.

Sidhanth Muralidhar is a Principal Technical Account Manager at AWS. He works with large enterprise customers who run their workloads on AWS. He is passionate about working with customers and helping them architect workloads for cost, reliability, performance, and operational excellence at scale in their cloud journey. He has a keen interest in data analytics as well.

How BMW Group built a serverless terabyte-scale data transformation architecture with dbt and Amazon Athena

2025-04-29 Philipp Karg

Post Syndicated from Philipp Karg original https://aws.amazon.com/blogs/big-data/how-bmw-group-built-a-serverless-terabyte-scale-data-transformation-architecture-with-dbt-and-amazon-athena/

Businesses increasingly require scalable, cost-efficient architectures to process and transform massive datasets. At the BMW Group, our Cloud Efficiency Analytics (CLEA) team has developed a FinOps solution to optimize costs across over 10,000 cloud accounts. While enabling organization-wide efficiency, the team also applied these principles to the data architecture, making sure that CLEA itself operates frugally. After evaluating various tools, we built a serverless data transformation pipeline using Amazon Athena and dbt.

This post explores our journey, from the initial challenges to our current architecture, and details the steps we took to achieve a highly efficient, serverless data transformation setup.

Challenges: Starting from a rigid and costly setup

In our early stages, we encountered several inefficiencies that made scaling difficult. We were managing complex schemas with wide tables that required significant effort in maintainability. Initially, we used Terraform to create tables and views in Athena, allowing us to manage our data infrastructure as code (IaC) and automate deployments through continuous integration and delivery (CI/CD) pipelines. However, this method slowed us down when changing data models or dealing with schema changes, therefore requiring high development efforts.

As our solution grew, we faced challenges with query performance and costs. Each query scanned large amounts of raw data, resulting in increased processing time and higher Athena costs. We used views to provide a clean abstraction layer, but this masked underlying complexity because seemingly simple queries against these views scanned large volumes of raw data, and our partitioning strategy wasn’t optimized for these access patterns. As our datasets grew, the lack of modularity in our data design increased complexity, making scalability and maintenance increasingly difficult. We needed a solution for pre-aggregating, computing, and storing query results of computationally intensive transformations. The absence of robust testing and lineage solutions made it challenging to identify the root causes of data inconsistencies when they occurred.

As part of our business intelligence (BI) solution, we used Amazon QuickSight to build our dashboards, providing visual insights into our cloud cost data. However, our initial data architecture led to challenges. We were building dashboards on top of large, wide datasets, with some hitting the QuickSight per-dataset SPICE limit of 1 TB. Additionally, during SPICE ingest, our largest datasets required 4–5 hours of processing time due to performing full scans each time, often scanning over a terabyte of data. This architecture wasn’t helping us be more agile and quick while scaling up. The long processing times and storage limitations hindered our ability to provide timely insights and expand our analytics capabilities.

To address these issues, we enhanced the data architecture with AWS Lambda, AWS Step Functions, AWS Glue, and dbt. This tool stack significantly enhanced our development agility, empowering us to quickly modify and introduce new data models. At the same time, we improved our overall data processing efficiency with incremental loads and better schema management.

Solution overview

Our current architecture consists of a serverless and modular pipeline coordinated by GitHub Actions workflows. We chose Athena as our primary query engine for several strategic reasons: it aligns perfectly with our team’s SQL expertise, excels at querying Parquet data directly in our data lake, and alleviates the need for dedicated compute resources. This makes Athena an ideal fit for CLEA’s architecture, where we process around 300 GB daily from a data lake of 15 TB, with our largest dataset containing 50 billion rows across up to 400 columns. The capability of Athena to efficiently query large-scale Parquet data, combined with its serverless nature, enables us to focus on writing efficient transformations rather than managing infrastructure.

The following diagram illustrates the solution architecture.

Using this architecture, we’ve streamlined our data transformation process using dbt. In dbt, a data model represents a single SQL transformation that creates either a table or a view—essentially a building block of our data transformation pipeline. Our implementation includes around 400 such models, 50 data sources, and around 100 data tests. This setup enables seamless updates—whether creating new models, updating schemas, or modifying views—triggered simply by creating a pull request in our source code repository, with the rest handled automatically.

Our workflow automation includes the following features:

Pull request – When we create a pull request, it’s deployed to our testing environment first. After passing validation and being approved or merged, it’s deployed to production using GitHub workflows. This setup enables seamless model creation, schema updates, or view changes—triggered just by creating a pull request, with the rest handled automatically.
Cron scheduler – For nightly runs or multiple daily runs to reduce data latency, we use scheduled GitHub workflows. This setup allows us to configure specific models with different update strategies based on data needs. We can set models to update incrementally (processing only new or changed data), as views (querying without materializing data), or as full loads (completely refreshing the data). This flexibility optimizes processing time and resource usage. We can target only specific folders—like source, prepared, or semantic layers—and run the dbt test afterward to validate model quality.
On demand – When adding new columns or changing business logic, we need to update historical data to maintain consistency. For this, we use a backfill process, which is a custom GitHub workflow created by our team. The workflow allows us to select specific models, include their upstream dependencies, and set parameters like start and end dates. This makes sure that changes are applied accurately across the entire historical dataset, maintaining data consistency and integrity.

Our pipeline is organized into three primary stages—Source, Prepared, and Semantic—each serving a specific purpose in our data transformation journey. The Source stage maintains raw data in its original form. The Prepared stage cleanses and standardizes this data, handling tasks like deduplication and data type conversions. The Semantic stage transforms this prepared data into business-ready models aligned with our analytical needs. An additional QuickSight step handles visualization requirements. To achieve low cost and high performance, we use dbt models and SQL code to manage all transformations and schema changes. By implementing incremental processing strategies, our models process only new or changed data rather than reprocessing the entire dataset with each run.

The Semantic stage (not to be confused with dbt’s semantic layer feature) introduces business logic, transforming data into aggregated datasets that are directly consumable by BMW’s Cloud Data Hub, internal CLEA dashboards, data APIs, or In-Console Cloud Assistant (ICCA) chatbot. The QuickSight step further optimizes data by selecting only necessary columns by using a column-level lineage solution and setting a dynamic date filter with a sliding window to ingest only relevant hot data into SPICE, avoiding unused data in dashboards or reports.

This approach aligns with BMW Group’s broader data strategy, which includes streamlining data access using AWS Lake Formation for fine-grained access control.

Overall, as a high-level structure, we’ve fully automated schema changes, data updates, and testing through GitHub pull requests and dbt commands. This approach enables controlled deployment with robust version control and change management. Continuous testing and monitoring workflows uphold data accuracy, reliability, and quality across transformations, supporting efficient, collaborative model iteration.

Key benefits of the dbt-Athena architecture

To design and manage dbt models effectively, we use a multi-layered approach combined with cost and performance optimizations. In this section, we discuss how our approach has yielded significant benefits in five key areas.

SQL-based, developer-friendly environment

Our team already had strong SQL skills, so dbt’s SQL-centric approach was a natural fit. Instead of learning a new language or framework, developers could immediately start writing transformations using familiar SQL syntax with dbt. This familiarity aligns well with the SQL interface of Athena and, combined with dbt’s added functionality, has increased our team’s productivity.

Behind the scenes, dbt automatically handles synchronization between Amazon Simple Storage Service (Amazon S3), the AWS Glue Data Catalog, and our models. When we need to change a model’s materialization type—for example, from a view to a table—it’s as simple as updating a configuration parameter rather than rewriting code. This flexibility has reduced our development time dramatically, allowed us to focus on building better data models rather than managing infrastructure.

Agility in modeling and deployment

Documentation is crucial for any data platform’s success. We use dbt’s built-in documentation capabilities by publishing them to GitHub Pages, which creates an accessible, searchable repository of our data models. This documentation includes table schemas, relationships between models, and usage examples, enabling team members to understand how models interconnect and how to use them effectively.

We use dbt’s built-in testing capabilities to implement comprehensive data quality checks. These include schema tests that verify column uniqueness, referential integrity, and null constraints, as well as custom SQL tests that validate business logic and data consistency. The testing framework runs automatically on every pull request, validating data transformations at each step of our pipeline. Additionally, dbt’s dependency graph provides a visual representation of how our models interconnect, helping us understand the upstream and downstream impacts of any changes before we implement them. When stakeholders need to modify models, they can submit changes through pull requests, which, after they’re approved and merged, automatically trigger the necessary data transformations through our CI/CD pipeline. This streamlined process enabled us to create new data products within days compared to weeks and reduced ongoing maintenance work by catching issues early in the development cycle.

Athena workgroup separation

We use Athena workgroups to isolate different query patterns based on their execution triggers and purposes. Each workgroup has its own configuration and metric reporting, allowing us to monitor and optimize separately. The dbt workgroup handles our scheduled nightly transformations and on-demand updates triggered by pull requests through our Source, Prepared, and Semantic stages. The dbt-test workgroup executes automated data quality checks during pull request validation and nightly builds. The QuickSight workgroup manages SPICE data ingestion queries, and the Ad-hoc workgroup supports interactive data exploration by our team.

Each workgroup can be configured with specific data usage quotas, enabling teams to implement granular governance policies. This separation provides several benefits: it enables clear cost allocation, provides isolated monitoring of query patterns across different use cases, and helps enforce data governance through custom workgroup settings. Amazon CloudWatch monitoring per workgroup helps us track usage patterns, identify query performance issues, and adjust configurations based on actual needs.

Using QuickSight SPICE

QuickSight SPICE (Super-fast, Parallel, In-memory Calculation Engine) provides powerful in-memory processing capabilities that we’ve optimized for our specific use cases. Rather than loading entire tables into SPICE, we create specialized views on top of our materialized semantic models. These views are carefully crafted to include only the necessary columns, relevant metadata joins, and appropriate time filtering to have only recent data available in dashboards.

We’ve implemented a hybrid refresh strategy for these SPICE datasets: daily incremental updates keep the data fresh, and weekly full refreshes maintain data consistency. This approach strikes a balance between data freshness and processing efficiency. The result is responsive dashboards that maintain high performance while keeping processing costs under control.

Scalability and cost-efficiency

The serverless architecture of Athena eliminates manual infrastructure management, automatically scaling based on query demand. Because costs are based solely on the amount of data scanned by queries, optimizing queries to scan as little data as possible directly reduces our costs. We use the distributed query execution capabilities of Athena through our dbt model structure, enabling parallel processing across data partitions. By implementing effective partitioning strategies and using Parquet file format, we minimize the amount of data scanned while maximizing query performance.

Our architecture offers flexibility in how we materialize data through views, full tables, and incremental tables. With dbt’s incremental models and partitioning strategy, we process only new or modified data instead of entire datasets. This approach has proven highly effective—we’ve observed significant reductions in data processing volume as well as data scanning, particularly in our QuickSight workgroup.

The effectiveness of these optimizations implemented at the end of 2023 is visible in the following diagram, showing costs by Athena workgroups.

The workgroups are illustrated as follows:

Green (QuickSight): Shows reduced data scanning post-optimization.
Light blue (Ad-hoc): Varies based on analysis needs.
Dark blue (dbt): Maintains consistent processing patterns
Orange (dbt-test): Shows regular, efficient test execution.

The increased dbt workload costs directly correlate with decreased QuickSight costs, reflecting our architectural shift from using complex views in QuickSight workgroups (which previously masked query complexity but led to repeated computations) to using dbt for materializing these transformations. Although this increased the dbt workload, the overall cost-efficiency improved significantly because materialized tables reduced redundant computations in QuickSight. This demonstrates how our optimization strategies successfully manage growing data volumes while achieving net cost reduction through efficient data materialization patterns.

Conclusion

Our data architecture uses dbt and Athena to provide a scalable, cost-efficient, and flexible framework for building and managing data transformation pipelines. Athena’s ability to query data directly in Amazon S3 alleviates the need to move or copy data into a separate data warehouse, and its serverless model and dbt’s incremental processing minimize both operational overhead and processing costs. Given our team’s strong SQL expertise, expressing these transformations in SQL through dbt and Athena was a natural choice, enabling rapid model development and deployment. With dbt’s automatic documentation and lineage, troubleshooting and identifying data issues is simplified, and the system’s modularity allows for quick adjustments to meet evolving business needs.

Starting with this architecture is quick and straightforward: all that is needed is the dbt-core and dbt-athena libraries, and Athena itself requires no setup, because it’s a fully serverless service with seamless integration with Amazon S3. This architecture is ideal for teams looking to rapidly prototype, test, and deploy data models, optimizing resource usage, accelerating deployment, and providing high-quality, accurate data processing.

For those interested in a managed solution from dbt, see From data lakes to insights: dbt adapter for Amazon Athena now supported in dbt Cloud.

About the Authors

Philipp Karg is a Lead FinOps Engineer at BMW Group and has a strong background in data engineering, AI, and FinOps. He focuses on driving cloud efficiency initiatives and fostering a cost-aware culture within the company to leverage the cloud sustainably.

Selman Ay is a Data Architect specializing in end-to-end data solutions, architecture, and AI on AWS. Outside of work, he enjoys playing tennis and engaging outdoor activities.

Cizer Pereira is a Senior DevOps Architect at AWS Professional Services. He works closely with AWS customers to accelerate their journey to the cloud. He has a deep passion for cloud-based and DevOps solutions, and in his free time, he also enjoys contributing to open source projects.

How Smartsheet reduced latency and optimized costs in their serverless architecture

2025-04-18 Anton Aleksandrov

Post Syndicated from Anton Aleksandrov original https://aws.amazon.com/blogs/architecture/how-smartsheet-reduced-latency-and-optimized-costs-in-their-serverless-architecture/

Cloud software as a service (SaaS) companies are often looking for ways to enhance their architectures for performance and cost-efficiency. Serverless technologies offload infrastructure management, allowing development teams to focus on innovation and delivering business value. As application architectures grow and face more demanding requirements, continued optimization helps maximize both the technical and financial advantages of the serverless approach.

In this post, we discuss Smartsheet’s journey optimizing its serverless architecture. We explore the solution, the stringent requirements Smartsheet faced, and how they’ve achieved an over 80% latency reduction. This technical journey offers valuable insights for organizations looking to enhance their serverless architectures with proven enterprise-grade optimization techniques.

Solution overview

Smartsheet is a leading cloud-based enterprise work management platform, enabling millions of users worldwide to plan, manage, track, automate, and report on work at scale. At the core of the platform lies an event-driven architecture that processes real-time user activity across various document types. Given the collaborative nature of the platform, multiple users can work on these documents concurrently. Every document interaction triggers a series of events that must be processed with minimal latency to maintain data consistency and provide immediate feedback. Processing delays can impact user experience and productivity, making consistently low latency a fundamental business requirement.

Smartsheet’s traffic pattern is spiky during business hours and mostly dormant during nights and weekends. Within peak periods, traffic can fluctuate as users collaborate in real time. To efficiently manage dynamic workloads, which can surge from hundreds to tens of thousands of events per second within minutes, Smartsheet implements a serverless event processing architecture using services such as Amazon Simple Queue Service (Amazon SQS) and AWS Lambda. This architecture uses the elasticity of serverless services and the ability to automatically scale dynamically based on the traffic volume. It makes sure Smartsheet can efficiently handle sudden traffic surges while automatically scaling down during off-peak hours, optimizing for both performance and cost-efficiency.

The following diagram illustrates the high-level architecture of the Smartsheet event processing pipeline.

high-level architecture of the Smartsheet event processing pipeline

Optimization opportunity

Smartsheet uses Lambda functions to serve both batch jobs and API requests. The primary runtime used for building those functions is Java. Lambda automatically scales the number of execution environments allocated to your function on demand to accommodate traffic volume. When Lambda receives an incoming request, it attempts to serve it with an existing execution environment first. If no execution environments are available, the service initializes a new one. During initialization, the Smartsheet’s function code commonly sends several requests to external dependencies, such as databases and REST APIs, which might take time to reply.

The following diagram illustrates how Lambda functions reach out to external dependencies during initialization.

Lambda functions reach out to external dependencies during initialization

These tasks introduced execution environment initialization latency, commonly referred to as a cold start. Although cold starts typically affect less than 1% of requests, Smartsheet had stringent low latency requirements for their architecture to further prioritize the best possible end-user experience.

“To reduce customer request latency while keeping costs low, our engineering team utilized Lambda provisioned concurrency with auto scaling and Graviton, which resulted in an 83% reduction in P95 latency while providing a high quality of service as we continue to scale our platform and its limits,” says Abhishek Gurunathan, Sr Director of Engineering at Smartsheet.

Addressing the cold start with provisioned concurrency

To reduce cold start latency, the Smartsheet team adopted provisioned concurrency in their architecture, a capability that allows developers to specify the number of execution environments that Lambda should keep warm to instantly handle invocations. The following diagram illustrates the difference. Without provisioned concurrency, execution environments are created on demand, which means some invocations (typically less than 1%) need to wait for the execution environment to be created and initialization code to be run. With provisioned concurrency, Lambda creates execution environments and runs initialization code preemptively, making sure invocations are served by warm execution environments.

invocations are served by warm execution environments

Provisioned concurrency includes a dynamic spillover mechanism, making your serverless architecture highly resilient to traffic spikes. When incoming traffic exceeds the preconfigured provisioned concurrency, additional requests are automatically served by on-demand concurrency rather than being throttled. This provides seamless scalability and maintains service availability even during traffic surges, while still providing the performance benefits of pre-warmed execution environments for the majority of requests.

The Smartsheet team configured provisioned concurrency to match their historical P95 concurrency needs. This resulted in immediate improvements—the number of cold starts dropped dramatically and P95 invocation latency dropped by 83%. As the team monitored system performance, they quickly identified another architecture optimization opportunity—the Lambda functions were heavily used during work hours but had significantly fewer invocations at night and on weekends, as illustrated in the following graph.

Lambda functions were heavily used during work hours but had significantly fewer invocations at night and on weekends

Setting a static provisioned concurrency configuration worked great for busy periods, but was underutilized during off-times. The Smartsheet team wanted to further fine-tune their architecture and increase provisioned concurrency utilization rates to achieve higher cost-efficiency. This led them to look into provisioned concurrency auto scaling to match traffic patterns as well as adopting an AWS Graviton architecture.

Auto scaling provisioned concurrency and Graviton architecture

Two common approaches to enable provisioned concurrency are setting a static value and using auto scaling. With static configuration, you specify a fixed number of pre-initialized execution environments that remain continuously warm to serve invocations. This approach is highly effective for architectures that handle predictable traffic patterns. Unpredictable traffic patterns, however, can lead to under-provisioning during peak periods (with spillover to on-demand concurrency resulting in more cold starts) or underutilization during low-usage periods. To address that, provisioned concurrency with auto scaling dynamically adjusts the configuration based on utilization metrics, automatically scaling the number of execution environments up or down to match the actual demand. This dynamic approach optimizes for cost-efficiency and is particularly recommended for architectures with fluctuating traffic patterns.

The following figure compares static and dynamic provisioned concurrency.

static and dynamic provisioned concurrency

To further optimize the architecture for cost-efficiency, the Smartsheet team has implemented provisioned concurrency auto scaling based on utilization metrics. Smartsheet used an infrastructure as code (IaC) approach with Terraform to define auto scaling policies for maximum reusability across hundreds of functions. The policies track the LambdaProvisionedConcurrencyUtilization metric and define the scaling threshold according to the function purpose. For functions implementing interactive APIs, the auto scale threshold is 60% utilization to pre-provision execution environments early, keeping latency extra-low, and making functions more resilient towards traffic surges. For functions that implement asynchronous data processing, Smartsheet’s goal was to achieve the highest utilization rate and cost-efficiency, so they’ve defined the auto scale threshold at 90%.

The following diagram illustrates the architecture of auto scaling policies based on provisioned concurrency utilization rate and workload type.

Another optimization technique Smartsheet employed was switching the CPU architecture used by their Lambda functions from x86_64 to arm64 Graviton. To achieve this, Smartsheet adopted the ARM versions of Lambda layers they’ve used, such as Datadog and Lambda Insights extensions. This was required because binaries built using one architecture might be incompatible with a different one. Because Smartsheet functions were implemented with Java and packaged as JAR files, they didn’t have any compatibility issues when moving to Graviton. With Terraform used for codifying the infrastructure, this architecture switch was a simple property change in aws_lambda_function resources, as illustrated in the following code:

property change in aws_lambda_function resources

By switching to a Graviton architecture, Smartsheet saved 20% on function GB-second costs. See AWS Lambda pricing for details.

Best practices

Use the following techniques and best practices to optimize your serverless architectures, reduce cold starts, and increase cost-efficiency:

Fine-tune your Lambda functions to find the optimal balance between cost and performance. Increasing memory allocation also adds CPU capacity, which often means faster execution and can lead to reduced overall costs.
Use a Graviton2 architecture for compatible workloads to benefit from a better price-performance ratio. Depending on the workload type, switching to Graviton can yield up to 34% improvement.
Use provisioned concurrency and Lambda SnapStart to reduce cold starts in your serverless architectures. Start with static provisioned concurrency based on your historical concurrency requirements, monitor utilization, and introduce auto scaling into your architecture to achieve the optimal cost-performance profile.

Conclusion

Serverless architectures using services like Lambda and Amazon SQS offload the infrastructure management and scaling concerns to AWS, allowing teams to focus on innovation and delivering business value. As Smartsheet’s journey demonstrates, using provisioned concurrency and Graviton in your architectures can help significantly improve user experience by reducing latencies while also achieving better cost-efficiency, providing a practical blueprint for optimization across the organization. Whether you’re running large-scale enterprise applications or building new cloud solutions, these proven techniques can help you unlock similar performance gains and cost-efficiencies in your serverless architectures.

To learn more about serverless architectures, see Serverless Land.

About the authors

Simplifying Code Documentation with Amazon Q Developer

2025-04-11 Jehu Gray

Post Syndicated from Jehu Gray original https://aws.amazon.com/blogs/devops/simplifying-code-documentation-with-amazon-q-developer/

In the fast-paced world of software development, maintaining comprehensive documentation often falls to the bottom of priority lists in favor of delivering functionality. Amazon Q Developer’s /doc agent changes this equation by automating README generation and updates. With this tool, the variable of time spent producing documentation is reduced to the point that it’s no longer a burden to the detriment of functionality.

Understanding Amazon Q Developer’s Documentation Generation

The /doc agent leverages generative AI to analyze your codebase and generate comprehensive documentation. Additionally, the agent respects your .gitignore file and excludes files you don’t want to be included in documentation review.

Solution Overview:
As an example, imagine a cloud infrastructure team at a technology consultancy had been working for weeks on their AWS DataSync project. The solution they built provided an elegant CDK implementation that automated data transfer between Amazon S3 buckets using AWS DataSync. The lead engineer had just finished implementing the final IAM role configurations when the product manager requested comprehensive documentation for the next day’s client handoff meeting. The team realized this would typically take hours of focused work. Instead, they decided to try Amazon Q Developer /doc agent.

Getting Started with /doc:

To begin using the /doc agent, you’ll need to:

Set up Amazon Q: Follow these steps
Open your IDE with the Amazon Q extension installed
Click the Amazon Q icon to open the chat panel
Enter /doc to start the documentation process
Select your documentation task:
1. Create a new README
2. Update an existing README with recent code changes

This image shows the user entering /doc to start the documentation process

Figure 1 – Entering /doc to start the documentation process.

Example: Creating a New README:

For projects without documentation, simply select: Create a README. It will confirm the project you are creating the README for.

This image shows the user selecting the Create a README option

Figure 2 – Select the Create a README option.

Once you verify the folder and select yes, the agent begins creating the README document for the folder.
Here are the steps it works through: scanning the source files, summarizing the source files, and generating the documentation.

This image shows the user verifying the folder and select yes

Figure 3 – Verify the folder and select yes.

When the document is created, you can preview the README file. The agent then presents you with the ability to either accept the changes or request modifications before implementation.

This image shows the preview of the created README file.

Figure 4 – Preview the created README file.

If you choose to accept, the README file is added to your project.

This image shows the user accepting the changes so the README is added to your project folder

Figure 5 – Accept the changes so the README is added to your project folder.

Example: Updating Documentation with Code Changes

When your code evolves, you can keep the documentation synchronized by using /doc. The agent will review your recent code modifications and suggest appropriate documentation updates.

This image shows when the user selects update an existing README

Figure 6 – Select Update an existing README to make changes to a README file.

Then you can describe the changes you want the agent to make to your README file.

This image shows how you can describe changes you’d like the agent to make

Figure 7 – Describe changes to your README files.

For targeted documentation updates, you can provide specific instructions:

This image shows the user verifying the folder

Figure 8 – Verify the folder and select yes.

Once you’ve made the changes, the agent asks you to verify them by selecting yes.

This image shows the user verifying the changes made

Figure 9 – Verify the changes and select yes.

Advanced Documentation Management

Multi-step Documentation Refinement:

This image shows the steps in Documentation Refinement

Figure 10 – Multi-step Documentation Refinement.

Amazon Q Developer /doc agent allows for iterative improvement of your documentation through feedback loops. After generating initial documentation, you can:

Review the generated content for gaps or inaccuracies
Provide specific feedback to refine particular sections
Request additional sections on complex topics
Gradually build comprehensive documentation through multiple iterations

This iterative approach is particularly valuable for complex projects where documentation needs to evolve alongside the codebase.

Documentation for Specific Components

For modular projects, you can create targeted documentation at different levels:

Root-level README for project overview
Component-level READMEs for specific modules
Service-level documentation for microservices
API documentation for interfaces

By combining these documentation levels, you can maintain a hierarchical documentation structure that remains manageable and specific.

Handling Documentation Inheritance

This image shows how to handle Documentation Inheritance.

Figure 11 – Handling Documentation Inheritance.

When working with derived or extended codebases:

Generate base documentation for the parent project
Create specialized documentation for extensions
Cross-reference related documentation to maintain consistency
Use the /doc agent to update specific sections when inheritance patterns change

Documentation Syncing Strategy

Figure 12 – Documentation Syncing Strategy.

For teams working on rapidly evolving projects:

Establish a documentation update schedule aligned with sprints
Assign documentation reviews as part of code review processes
Use /doc to generate change summaries after significant updates
Implement a verification process to ensure generated documentation accurately reflects code changes

Best Practices for /doc Agent

To improve results from documentation generation with Amazon Q Developer, follow these best practices:

Optimize repository size: Amazon Q Developer supports documentation generation across your codebase, accommodating projects up to the specified size limits. While documentation for larger repositories may require additional processing time and could provide more generalized results, you have the option to request documentation for specific subsets of code or individual files to receive more detailed insights.
Maintain high-quality code: The quality of documentation Amazon Q Developer generates improves significantly when your code is well-commented and organized, has meaningful naming conventions for programming entities, and follows standard coding conventions.
Be specific with change requests: When requesting specific README changes in natural language, choose to update an existing README and select the option to make a specific change. After initial documentation generation, you can request additional modifications by describing exactly what updates you want.
Craft effective change descriptions: When describing desired updates, include:
1. Specific sections you want to modify
2. Exact content you want to add or remove
3. Particular issues that need correcting
4. How project functionality should be reflected in the README
5. References to content available in your codebase.
Understand system limitations: Amazon Q Developer doesn’t have access to private or internal platforms and might lack knowledge of third-party tools, specialized software, or custom tooling in your code. Content requiring this knowledge won’t be documented automatically. In these cases, you’ll need to manually edit the README to include information Amazon Q Developer cannot generate.

Documentation Quotas and Limitations

When working with Amazon Q Developer /doc agent, be aware of these important constraints:

Document generations per task: There’s a limit to the number of feedback iterations allowed per documentation session. This quota resets each time you start a new documentation task.
File filtering: Amazon Q Developer filters out files or folders defined in your .gitignore file. This helps streamline the documentation process by focusing only on relevant code files.

Conclusion

Amazon Q Developer /doc agent transforms the documentation process from a tedious chore to an automated, efficient workflow. By generating and maintaining READMEs based on your actual code, it ensures documentation remains accurate and up-to-date without consuming precious development time.

As part of Amazon Q Developer’s free tier, the /doc agent is readily available to integrate into your development process. Start using it today to improve your project documentation and enhance team collaboration.

About the Authors:

How UNiDAYS achieved AWS Region expansion in 3 weeks

2025-04-08 Sanhawat Taongern

Post Syndicated from Sanhawat Taongern original https://aws.amazon.com/blogs/architecture/how-unidays-achieved-aws-region-expansion-in-3-weeks/

UNiDAYS is a fast, free digital platform that provides exclusive student offers and benefits to over 29 million verified members worldwide. With a rapidly growing user base and an increasing number of global partnerships, UNiDAYS recognized the need to enhance its platform’s performance to deliver a seamless consumer experience in geographic regions far from its original base of operations.

In this post, we share how UNiDAYS achieved AWS Region expansion in just 3 weeks using AWS services.

Business challenges

In response to growth opportunities, UNiDAYS faced a pressing business requirement: deliver low-latency responses and provide high availability for users across diverse geographic regions. At the same time, the platform needed to guarantee global data consistency while adhering to tight deadlines—all within just a few weeks. However, the existing monolithic application, although built on Amazon Web Services (AWS), wasn’t optimized for active-active multi-Region deployments.

The challenge was further complicated by the need to extend functionality from this legacy system, which used the AWS global network for improved user experience but fell short of meeting new business requirements. Re-architecting the entire platform to support multi-Region deployments within the given timeframe wasn’t feasible.

Solution overview

UNiDAYS opted to create complementary services tailored to these new requirements, using AWS services for a multi-Region, active-active architecture. The key services used included:

Amazon CloudFront and Amazon Route 53 – For global delivery and lowest-latency responses
Amazon DynamoDB – To provide consistent, highly available data across Regions
Amazon Elastic Container Service (Amazon ECS) – For scalable, deployable containerized resources
Amazon EventBridge – To enable asynchronous integration with existing systems through event-driven patterns

This approach allowed UNiDAYS to meet its latency, availability, and consistency goals while seamlessly integrating with existing infrastructure. The following diagram is the architecture for the solution.

Global delivery and resiliency

To provide the lowest latency and multi-Region resiliency, CloudFront was used with latency-based routing configured in Route 53. This routing directs requests to the Regional Application Load Balancers with the lowest latency, automatically providing resiliency in the event of Regional issues. Security was a key consideration. AWS WAF integration with CloudFront provided application-layer protection at the edge. Additional security measures included:

Custom HTTP headers on origin requests, enforced using Application Load Balancer listener rule conditions
Prefix lists to restrict access to Application Load Balancers, making sure that traffic originated from the intended CloudFront distributions

Rapid Regional deployment

The core infrastructure is deployed through Terraform, and applications are deployed using custom tooling that wraps AWS CloudFormation. This hybrid approach enabled rapid delivery by using existing patterns without disrupting established workflows. Resources were organized into tiers: platform, global, and Regional. Platform and global resources were deployed one time, and Regional resources were rolled out to each activated Region, streamlining expansion efforts.

One technical challenge involved CloudFormation exports, which are Regional by design. To address this, we implemented a custom CloudFormation macro to enable cross-Region access to exported values, providing consistency across deployments.

Amazon ECS enabled progressive application deployments within each Region, allowing teams to focus on scaling applications rather than managing infrastructure. For cost-efficiency, we used Spot Instances. During testing, container start-up latency was observed due to cross-Region image downloads from Amazon Elastic Container Registry (Amazon ECR). This issue was resolved by enabling private image replication in Amazon ECR so that container images were available locally in each Region. This solution significantly reduced start-up times, improving application responsiveness during deployments and scaling events.

Data consistency and performance

DynamoDB global tables were instrumental in providing eventual data consistency and Regional replication. With DynamoDB handling these aspects, we could focus on application logic.

The result was a substantial reduction in latency at key locations. For example, client-experienced latency in one Region dropped from approximately 200 milliseconds to 50 milliseconds upon deployment, as shown in the following screenshot.

Key technical hurdles

We addressed the following technical obstacles while developing the solution:

Cross-Region CloudFormation exports – CloudFormation exports are Regional by design. We addressed this by creating a custom CloudFormation macro to read exports across Regions.
Container start-up latency – Latency caused by cross-Region image downloads was mitigated by implementing Amazon ECR private image replication. This meant that container images were readily available in each Region, reducing deployment times and improving overall performance.
Security assurance – By using CloudFront, AWS WAF, and Application Load Balancer security features, we made sure that traffic and data remained secure.

Why AWS?

UNiDAYS chose AWS due to its comprehensive global infrastructure and robust service offerings, which allowed the platform to:

Seamlessly expand compute operations to Regions closer to its user base
Take advantage of a full stack of services for reliable, secure, and low-latency content delivery
Meet tight delivery deadlines without compromising on performance or security
Maintain flexibility where required, with the ability to use more managed services, which allowed a focus on our applications

Conclusion

By adopting a multi-Region, active-active architecture on AWS, UNiDAYS successfully met its business goals within only 3 weeks, rapidly expanding to new Regions while promoting platform resiliency. The solution improved latency by 75% in new Regions (from 200 milliseconds to 50 milliseconds), provided Regional data availability through DynamoDB global tables, and maintained 100% service uptime during resiliency tests, even in cases of Regional connectivity loss. Additionally, deployment velocity increased by over 40%, allowing faster feature releases and improved agility. This architecture not only provides a scalable and resilient platform for current operations but also establishes a strong foundation for future global expansion.

Learn more

Is your organization looking to expand into new Regions while maintaining performance and reliability?

Contact AWS experts to explore tailored solutions for your multi-Region strategy.
Use AWS Global Infrastructure to optimize your expansion.
Share your challenges and successes in the comments—we’d love to hear your insights!

About the Authors

How generative AI is transforming developer workflows at Amazon

2025-04-02 Erin Kraemer

Post Syndicated from Erin Kraemer original https://aws.amazon.com/blogs/devops/how-generative-ai-is-transforming-developer-workflows-at-amazon/

Introduction

Software engineering stands at an inflection point. While previous technological shifts enhanced what developers could build, AI is fundamentally changing how we build. Amazon Q has driven a shift in how developers at Amazon approach software development. At re:Invent 2024, our breakout session Unleashing generative AI: Amazon’s journey with Amazon Q Developer (DOP214) shared insights from Amazon’s internal journey that reveal not just tactical benefits, but our strategic reimagining of the software development process itself. In the months since our talk, the capabilities of this technology have only accelerated in sophistication and reliability. Our hope is that these learnings can inform the approaches organizations are taking to adopt AI, through guidance on scaling implementation, measuring impact and considerations around meaningful data, and best practices every step of the way. For all of us, this is just the beginning of what is looking to be a very exciting journey as AI becomes not only an active assistant in our day-to-day innovation, but starts to become a peer and companion as AI agents take on full tasks.

Rethinking our mental models: The Productivity Paradox

For too long, we’ve approached software development optimization through the lens of industrial-age thinking – treating code like widgets on an assembly line. This mental model has led organizations to chase simplified metrics like lines of code or story points, missing the true nature of software development as knowledge work.

Our experience at Amazon has shown that the real opportunity isn’t just about making current processes faster – it’s about changing how developers interact with code, documentation, and knowledge. In the 2024 Stack Overflow Developer Survey, 53% of respondents agreed that waiting on answers disrupts their workflow, even if they know where to go find those answers. Similarly, 30% of respondents said knowledge silos impact their productivity ten times or more per week.

In 2024, to solve this problem for Amazon developers, we ingested our internal Amazon knowledge repository consisting of millions of documents into Amazon Q Business so our developers could get answers based on information spread across those repositories. By simply asking Amazon Q their questions in existing tooling instead of manually searching or needing to ask an expert, we reduced the time Amazon developers spent waiting for technical answers by over 450k hours and reduced the interruptions to “flow state” of existing team members.

Today’s developers face a striking paradox: while they’re equipped with more powerful tools than ever, we know that across the industry, developers can spend only 1-2 hours daily writing code. The rest is consumed by what we call “toil” – necessary but undifferentiated work that scales linearly with software complexity. This represents not just a productivity challenge, but a strategic liability for organizations trying to accelerate innovation.

Amazon’s scale has provided unique insights into the transformative potential of AI in software engineering. Using AI for software transformations tied tightly into our internal development tools, we didn’t just save 4,500 developer years of effort – we unlocked new possibilities for large-scale technical modernization that previously seemed impractical. This experience revealed something profound: AI isn’t just making existing processes more efficient; it’s making previously impossible tasks feasible. For instance, our ability to automatically handle complex dependency updates across thousands of services has fundamentally changed how we think about technical debt and system modernization. Over the coming years as these AI agents become increasingly capable and autonomous, we will get increasingly bold with the types of modernization work we ask them to perform, ensuring reliability by complementing them with other agentic capabilities such as correctness validation, testing, and even advanced anomaly detection and production system operations.

The evolution of developer cognition

Perhaps the most fascinating insight from our journey has been observing how AI is changing the way developers think about and solve problems. The traditional model of a developer working in isolation, relying solely on their own knowledge and documentation, is evolving into a more collaborative model where AI serves as an intelligent thinking partner. We’ve seen this manifest in unexpected ways. Developers report that having AI tools available changes not just how they write code, but how they approach problem-solving itself. The ability to rapidly explore multiple approaches or instantly access contextual knowledge has enabled more creative and experimental development practices.

In particular, we are seeing seasoned developers playing with new-to-them programming languages and coding techniques that previously would not have been worthwhile due to ramp time. One developer reported cutting their typical three-week ramp-up time for learning a new programming language down to just one week using Q Developer. This significant reduction in a developer’s initial time investment allows creativity with more suitable programming languages for nuanced project requirements, or jumping into work with unfamiliar and complex systems, without sacrificing code quality. For example, with our recently launched Amazon Q Developer Agentic CLI, another internal developer was able to work with an unfamiliar codebase to build and implement a non-trivial feature within 2 days using Rust, a programming language they didn’t know, stating, “If I’d done this ‘the old fashioned way,’ I would estimate it would have taken me 5-6 weeks due to language and codebase ramp up time. More realistically, I wouldn’t have done it at all, because I don’t have that kind of time to devote.”

As we look to the future, we see several emerging frontiers that will further transform software engineering. The rise of AI agents capable of handling increasingly complex development tasks is shifting the developer’s role from implementation to orchestration. We’re moving toward a model where developers spend more time defining what needs to be built and validating approaches, rather than handling every implementation detail. System architecture, traditionally considered too nuanced for automation, is emerging as a new frontier for AI assistance. Application security reviews or software testing, frequent bottlenecks to software release due to specialist bandwidth, will increasingly be accelerated by AI agents amplifying the capacity of those specialists. While we’re just beginning to explore this space, early experiments suggest AI could help architects evaluate trade-offs and identify potential issues in complex systems more effectively than ever before.

Strategic implications for organizations

We believe the most successful organizations will be those that view AI not just as a tool for automation, but as a catalyst for transforming how they approach software development entirely. The real strategic advantage will come from reimagining software development processes and culture to fully leverage AI’s capabilities. This includes rethinking traditional metrics, redefining developer productivity, and creating space and cultural change for teams to experiment with new ways of working. Amazon Q represents a new class of development tools that augment developer capabilities in fundamental ways, beyond just writing code more efficiently. Organizations that understand and embrace this transformation will be best positioned to lead in the next era of software innovation.

To learn more about Amazon Q Developer and explore innovative ways of accelerating software development refer to the Q Developer documentation. Individual users can get started with Q Developer in the AWS Console, CLI, or in their IDE on the perpetual Free Tier. And remember: give yourself and your team “Permission to Play!” We’re at the heart of a technological revolution; as technologists this a moment where we get to immerse ourselves in the unknown and learn and be curious!

Implement Amazon EMR HBase Graceful Scaling

2025-03-18 Yu-Ting Su

Post Syndicated from Yu-Ting Su original https://aws.amazon.com/blogs/big-data/implement-amazon-emr-hbase-graceful-scaling/

Apache HBase is a massively scalable, distributed big data store in the Apache Hadoop ecosystem. We can use Amazon EMR with HBase on top of Amazon Simple Storage Service (Amazon S3) for random, strictly consistent real-time access for tables with Apache Kylin. It ingests data through spark jobs and queries the HTables through Apache Kylin cubes. The HBase cluster uses HBase write-ahead logs (WAL) instead of Amazon EMR WAL.

A time goes by, companies may want to scale in long-running Amazon EMR HBase clusters because of issues such as Amazon Elastic Compute Cloud (Amazon EC2) scheduling events and budget concerns. Another issue is that companies may use Spot Instances and auto scaling for task nodes for short-term parallel computation power, like MapReduce tasks and spark executors. Amazon EMR also runs HBase region servers on task nodes for Amazon EMR on S3 clusters. Spot interruptions will lead to an unexpected shutdown on HBase region servers. For an Amazon EMR HBase cluster without enabling write-ahead logs (WAL) for Amazon EMR feature, an unexpected shutdown on HBase region servers will cause WAL splits with server recovery process, and it will bring extra load to the cluster and sometimes makes HTables inconsistent.

For these reasons, administrators look for a way to scale-in Amazon EMR HBase cluster gracefully and stop all HBase region servers on the task nodes.

This post demonstrates how to gracefully decommission target region servers programmatically. The scripts do the following tasks. The script also tests successfully in Amazon EMR 7.3.0, Amazon EMR 6.15.0, and 5.36.2.

Automatically move the HRegions through a script
Raise the decommission priority
Decommission HBase region servers gracefully
Prevent Amazon EMR provisioning region servers on task nodes by Amazon EMR software configurations
Prevent Amazon EMR provisioning region servers on task nodes by Amazon EMR steps

Overview of solution

For graceful scaling in, the script uses HBase built-in graceful_stop.sh to move regions to other region servers to avoid WAL splits when decommissioning nodes. The script uses HDFS CLI and web interface to make sure there are no missing and corrupted HDFS block during the scaling events. To prevent Amazon EMR provisions HBase region servers on task nodes, administrators need to specify software configurations per instance groups when launching a cluster. For existing clusters, administrators can either use a step to terminate HBase region servers on task nodes, or reconfigure the task instance group’s HBase storagerootdir.

Solution

For a running Amazon EMR cluster, administrators can use AWS Command Line Interface (AWS CLI) to issue a modify-instance-groups with EC2InstanceIdsToTerminate to terminate specified instances immediately. But terminating an instance in this way can cause a data loss and unpredictable cluster behavior when HDFS blocks have not enough copies or there are ongoing tasks on those decommissioned nodes. To avoid these risks, administrators can send a modify-instance-groups with a new instance request count without a specific instance ID that administrators want to terminate. This command triggers a graceful decommission process on the Amazon EMR side. However, Amazon EMR only supports graceful decommission for YARN and HDFS. Amazon EMR doesn’t support graceful decommission for HBase.

Hence, administrators can try method 1, as described later in this post, to raise the decommission priority of the decommission targets as the first step. In case tweaking the decommissions priority didn’t work, move forward to the second approach, method 2. Method 2 is to stop the resizing request, and move the HRegions manually before terminating the target core nodes. Note that Amazon EMR is a managed service. Amazon EMR service will terminate the EC2 instance after anyone stops it or detach its Amazon Elastic Block Store (Amazon EBS) volumes. Therefore, don’t try to detach EBS volumes on the decommission targets and attach them to new nodes.

Method 1: Decommission HBase region servers through resizing

To decommission Hadoop nodes, administrators can add decommission targets to HDFS’s and YARN’s exclude list, which were dfs.hosts.exclude and yarn.nodes.exclude.xml. However, Amazon EMR disallows manual update to these files. The reason is that the Amazon EMR service daemon, master instance controller, is the only valid process to update these two files on master nodes. Manual updates to these two files will be reset.

Thus, one of the most accessible ways to raise a core node’s decommission priority according to Amazon EMR is having less instance controller heartbeat.

As the first step, pass move_regions to the following script on Amazon S3, blog_HBase_graceful_decommission.sh, as an Amazon EMR step to move HRegions to other region servers and shutdown processes of region server and instance controller. Please also provide targetRS and S3Path to blog_HBase_graceful_decommission.sh. targetRS represents to the private DNS of the decommission target region server. S3Path represents the location of the region migration script.

This step needs to be run in off-peak hours. After all HRegions on the target region server are moved to other nodes, splitting WAL activities after stopping the HBase region server will generate a very low workload to the cluster because it serves 0 regions.

For more information , refer to blog_HBase_graceful_decommission.sh.

Taking a closer look at the move_regions option in blog_HBase_graceful_decommission.sh, this script disables the region balancer and moves the regions to other region servers. The script retrieves Secure Shell (SSH) credentials from AWS Secrets Manager to access worker nodes.

In addition, the script included some AWS CLI operations. Please make sure the instance profile, EMR_EC2_DefaultRole, can operate the following APIs and have SecretsManagaerReadWrite permission.

Amazon EMR APIs:

describe-cluster
list-instances
modify-instance-groups

Amazon S3 APIs:

cp

Secrets Manager APIs:

get-secret-value

In Amazon EMR 5.x, HBase on Amazon S3 will make the master node also work as a region server hosting hbase:meta regions. This script will get stuck when trying to move non-hbase:meta HRegions to the master. To automate the script, the parameter, maxthreads, is increased to move regions through multiple threads. By moving regions in a while loop, one of the threads got a runtime error because it tries to move non-hbase:meta HRegions to the master node. Other threads can keep on moving HRegions to other region servers. After the only stuck thread timed out after 300 seconds, it moves forward to the next run. After six retries, manual actions will be required, such as using a move action through the HBase shell for the remaining regions’ movement or resubmitting the step.

The following is the syntax to use the script to invoke the move_regions function through blog_HBase_graceful_decommission.sh as an Amazon EMR step:

Step type: Custom JAR
Name: Move HRegions
JAR location :s3://<region>.elasticmapreduce/libs/script-runner/script-runner.jar
Main class :None
Arguments :s3://yourbucket/your/step/location/blog_HBase_graceful_decommission.sh move_regions <your-secret-id> <targetRS: target_region_server_private_DNS> <S3Path: S3 location>
Action on failure:Continue

Here’s an Amazon EMR step example to move regions:

Step type: Custom JAR
Name: Move HRegions
JAR location :s3://us-west-2.elasticmapreduce/libs/script-runner/script-runner.jar
Main class :None
Arguments :s3://yourbucket/your/step/location/blog_HBase_graceful_decommission.sh move_regions your-secret-id ip-172-0-0-1.us-west-2.compute.internal s3://yourbucket/yourpath/
Action on failure:Continue

In the HBase web UI, the target region server will serve 0 regions after the evacuation, as shown in the following screenshot.

After that, the stop_RS_IC function in the script stopped the HBase region server and instance controller process on the decommission target after making sure that there is no running YARN container on that node.

Note that the script is for Amazon EMR 5.30.0 and later release versions. For Amazon EMR 4.x-5.29.0 release versions, stop_RS_IC in the script needs to be updated by referring to How do I restart a service in Amazon EMR? In the AWS Knowledge Center. Also, in Amazon EMR versions earlier than 5.30.0, Amazon EMR uses a service nanny to watch the status of other processes. If a service nanny automatically restarts the instance controller, please stop the service nanny using the stop_RS_IC function before stopping the instance controller on that node. Here’s an example:

if [ "\$runningContainers" -eq 0 ]; then
        echo "0 container is running on \${targetRS}" | tee -a /tmp/graceful_stop.log;
        echo "Shutdown IC" | tee -a /tmp/graceful_stop.log;
        sudo /etc/init.d/service-nanny stop | tee -a /tmp/graceful_stop.log;
        sudo /etc/init.d/instance-controller stop | tee -a /tmp/graceful_stop.log;
        sudo /etc/init.d/instance-controller status | tee -a /tmp/graceful_stop.log;
else
        echo "Still have \${runningContainers} containers running on \${targetRS}" | tee -a /tmp/graceful_stop.log;
     	echo "Not to shutdown IC" | tee -a /tmp/graceful_stop.log;
fi

After the step is successfully completed, scale in and define (current core node amount is −1) as the desired target node amount using the Amazon EMR console. Amazon EMR might pick up the target core node to decommission it because the instance controller isn’t running on that node. There can be a few minutes of delay for Amazon EMR to detect the heartbeat loss of that target node through polling the instance controller. Thus, make sure the workload is very low and there will be no container to the target node for a while.

Stopping the instance controller merely increases the decommissioning priority. But method 1 doesn’t guarantee that the target core node will be picked up as the decommissioning target by Amazon EMR. If Amazon EMR doesn’t pick up the decommission target as the decommissioning victim after using method 1, administrators can stop the resize activity using the AWS Management Console. Then, proceed to method 2.

Method 2: Manually decommission the target core nodes

Administrators can terminate the node using the EC2InstanceIdsToTerminate option in the modify-instance-groups API. But this action will directly terminate the EC2 instance and will risk losing HDFS blocks. To mitigate the risk of having a data loss, administrators can use the following steps in off-peak hours with zero or very few running jobs.

First, run the move_hregions function through blog_HBase_graceful_decommission.sh as an Amazon EMR step in method 1. The function moves HRegions to other region servers and stopped the HBase region server as well as the instance controller process.

Then, run the terminate_ec2 function in blog_HBase_graceful_decommission.sh as an Amazon EMR step. To run this function successfully, please provide the target instance group ID and target instance ID to the script. This function merely terminates one node at a time by specifying the EC2InstanceIdsToTerminate option in the modify-instance-groups API. This makes sure that the core nodes are not terminated back-to-back and lowered the risks of missing HDFS blocks. It inspects HDFS and makes sure all HDFS blocks had at least two copies. If an HDFS block have only one copy, the script will exit with an error message similar to, “Some HDFS blocks have only 1 copy. Please increase HDFS replication factor through the following command for existing HDFS blocks.”

$ hdfs dfs -setrep -R -w 2 <the-file-or-directory-you-want-to-modify>

To make sure all upcoming HDFS blocks have at least two copies, reconfigure the core instance group with the following software configuration:

[{
    "classification": "hdfs-site",
    "properties": {
        "dfs.replication": "2"
    },
    "configurations": []
}]

In addition, the terminateEC2 function compares the metadata of the replicating blocks before and after terminating the core node using hdfs dfsadmin -report. This makes sure no under-replicating, corrupted, or missing HDFS block increased.

The terminateEC2 function tracked decommission status. The script will complete after the decommission completes. It can take some time to recover HDFS blocks. The elapsed time depends on several factors such as the total number of blocks, I/O, bandwidth, HDFS handler amount, and name node resources. If there are many HDFS blocks to be recovered, it may take a few hours to complete. Before running the script, please make sure that the instance profile, EMR_EC2_DefaultRole, have permission of elasticmapreduce:ModifyInstanceGroups.

The following is the syntax to use the script to invoke the terminate_ec2 function through blog_HBase_graceful_decommission.sh as an Amazon EMR step:

Step type: Custom JAR
Name: Terminate EC2
JAR location :s3://<region>.elasticmapreduce/libs/script-runner/script-runner.jar
Main class :None
Arguments :s3://yourbucket/your/step/location/blog_HBase_graceful_decommission.sh terminate_ec2 <your-secret-id> <instance_groupID> <target_EC2_Instance_ID>
Action on failure:Continue

Here’s an Amazon EMR step example to move regions:

Step type: Custom JAR
Name: Terminate EC2
JAR location :s3://us-west-2.elasticmapreduce/libs/script-runner/script-runner.jar
Main class :None
Arguments :s3://yourbucket/your/step/location/blog_HBase_graceful_decommission.sh terminate_ec2 your-secret-id ig-ABCDEFGH12345 i-1234567890abcdef
Action on failure:Continue

While invoking terminate_ec2, the script checks HDFS Name Node Web UI for the decommission target to understand how many blocks need to be recovered on other nodes after submitting the decommission request. Here are the steps:

On the Amazon EMR console, version 6.x, find HDFS NameNode web UI. For example, enter http://<master-node-public-DNS>:9870
On the top menu bar, choose Datanodes
In the In operation section, check the on-service data nodes and the total number of data blocks on the nodes, as shown in the following screenshot.
To view the HDFS decommissioning progress, go to Overview, as shown in the following screenshot.

On the Datanodes page, the decommission target node will not have a green checkmark, and the node will be in the Decommissioning section, as shown in the following screenshot.

The step’s STDOUT also reveals the decommission status:

Hostname: ip-172-31-4-197.us-west-2.compute.internal
Decommission Status : Decommission in progress

The decommission target will transit from Decommissioning to Decommissioned in the HDFS NameNode web UI, as shown in the following screenshot.

The decommissioned target will appear in the Dead datanodes section in the step’s STDOUT after the process is completed:

Dead datanodes (1):
Name: 172.31.4.197:50010 (ip-172-31-4-197.us-west-2.compute.internal)
Hostname: ip-172-31-4-197.us-west-2.compute.internal
Decommission Status : Decommissioned
Configured Capacity: 62245027840 (57.97 GB)
DFS Used: 394412032 (376.14 MB)
Non DFS Used: 0 (0 B)
DFS Remaining: 61179640063 (56.98 GB)
DFS Used%: 0.63%
DFS Remaining%: 98.29%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 0
Last contact: Tue Jan 14 06:09:17 UTC 2025

After the target node is decommissioned, the hdfs dfsadmin report will be displayed in the last section in the step’s STDOUT . There should be no difference between rep_blocks_${beforeDate} and rep_blocks_${afterDate} as described in the script. It means no additional amount of under-replicated, missing, or corrupt blocks after the decommission. In HBase web UI, the decommissioned region server will be moved to dead region servers. The dead region server records will be reset after restarting HMaster during routine maintenance.

After the Amazon EMR step is completed without errors, please repeat the preceding steps to decommission the next target core node because administrators may have more than one core nodes to decommission.

After administrators complete all decommission tasks, administrators can manually enable the HBase balancer through the HBase shell again:

$ echo "balance_switch true" | sudo -u hbase hbase shell
## To make sure balance_switch is enabled, submit the same command again. The output should say it’s already in “true” status.
$ echo "balance_switch true" | sudo -u hbase hbase shell

Prevent Amazon EMR from provisioning HBase region servers on task nodes

For new clusters, configure HBase settings for master and core groups only and keep the HBase settings empty when launching an Amazon EMR HBase on an S3 cluster. This prevents provisioning HBase region servers on task nodes.

For example, define configurations for applications other than HBase settings in the software configuration textbox in the Software settings section on the Amazon EMR console, as shown in the following screenshot.

Then, configure HBase settings in Node configuration – optional for each instance group in the Cluster configuration – required section, as shown in the following screenshot.

For master and core instance groups, HBase configurations will be like the following screenshot.

Here’s a json formatted example:

[
    {
        "Classification": "hbase",
        "Properties": {
            "hbase.emr.storageMode": "s3"
         }
    },
    {
        "Classification": "hbase-site",
        "Properties": {
            "hbase.rootdir": "s3://my/HBase/on/S3/RootDir/"
        }
    }
]

For task instance groups, there will be no HBase configuration, as shown in the following screenshot.

Here’s a json formatted example:

[]

Here’s an example in AWS CLI:

$ aws emr create-cluster \
--applications Name=Hadoop Name=HBase Name=ZooKeeper \
... (skip) \
--instance-groups '[ {"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":2}]},"InstanceGroupType":"MASTER","InstanceType":"m5.xlarge","Configurations":[{"Classification":"hbase","Properties":{"hbase.emr.storageMode":"s3"}},{"Classification":"hbase-site","Properties":{"hbase.rootdir":"s3://my/HBase/on/S3/RootDir/"}}],"Name":"Master - 1"},\
{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":2}]},"InstanceGroupType":"TASK","InstanceType":"m5.xlarge","Name":"Task - 3"},\
{"InstanceCount":2,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":64,"VolumeType":"gp2"},"VolumesPerInstance":1}],"EbsOptimized":true},"InstanceGroupType":"CORE","InstanceType":"m5.2xlarge","Configurations":[{"Classification":"hbase","Properties":{"hbase.emr.storageMode":"s3"}},{"Classification":"hbase-site","Properties":{"hbase.rootdir":"s3://my/HBase/on/S3/RootDir/"}}],"Name":"Core - 2"}]' --configurations '[{"Classification":"hdfs-site","Properties":{"dfs.replication":"2"}}]' \
--auto-scaling-role Amazon EMR_AutoScaling_DefaultRole \
... (skip) \
--scale-down-behavior TERMINATE_AT_TASK_COMPLETION --region us-west-2

Stop decommission the HBase region servers on task nodes

For an existing Amazon EMR HBase on an S3 cluster, pass stop_and_check_task_rs to blog_HBase_graceful_decommission.sh as an Amazon EMR step to stop HBase region servers on nodes in a task instance group. The script requirs a task instance group ID and an S3 location to place sharing scripts for task nodes.

The following is the syntax to pass stop_and_check_task_rs to blog_HBase_graceful_decommission.sh as an Amazon EMR step:

Step type: Custom JAR
Name: Stop Hbase Region servers on Task Nodes
JAR location: s3://<region>.elasticmapreduce/libs/script-runner/script-runner.jar
Arguments: s3://yourbucket/your/step/location/blog_HBase_graceful_decommission.sh stop_and_check_task_rs <your-secret-id> <instance_groupID> <S3Path: S3 location>
Action on failure:Continue

Here’s an Amazon EMR step example to stop HBase regions on nodes in a task group:

Step type: Custom JAR
Name: Stop Hbase Region servers on Task Nodes
JAR location :s3://us-west-2.elasticmapreduce/libs/script-runner/script-runner.jar
Main class :None
Arguments :s3://yourbucket/your/step/location/ blog_HBase_graceful_decommission.sh your-secret-id stop_and_check_task_rs ig-ABCDEFGH12345 s3://yourbucket/yourpath/
Action on failure:Continue

This step above not only stops HBase region servers on existing task nodes. To avoid provisioning HBase region servers on new task nodes, the script also reconfigures and scales in the task group. Here are the steps:

Using the move_regions function, in blog_HBase_graceful_decommission.sh, move HRegions on the task group to other nodes and stop region servers on those task nodes.

After making sure that the HBase region servers are stopped at these task nodes, the script reconfigures the task instance group. The reconfiguration details are to let HBase rootdir point to a non-existing location. These settings only apply to the task group. Here’s an example:

[
    {
        "Classification": "hbase-site",
        "Properties": {
            "hbase.rootdir": "hdfs://non/existing/location"
        }
    },
    {
        "Classification": "hbase",
        "Properties": {
            "hbase.emr.storageMode": "hdfs"
        }
    }
]

When the task group’s state returns to RUNNING, the script scales in these task nodes to 0. New task nodes in the upcoming scaling out events will not run HBase region servers.

Conclusion

These scaling steps demonstrate how to handle Amazon EMR HBase scaling gracefully. The functions in the script can help administrators to resolve problems when companies want to gracefully scale the Amazon EMR HBase on S3 clusters without Amazon EMR WAL.

If you have a similar request to scale in an Amazon EMR HBase on an S3 cluster gracefully because the cluster doesn’t enable Amazon EMR WAL, you can refer to this post. Please test the steps in the testing environment for verifications first. After you confirm the steps can meet your production requirements, you can proceed and apply the steps to production environment.

About the Authors

Yu-Ting Su is a Sr. Hadoop Systems Engineer at Amazon Web Services (AWS). Her expertise is in Amazon EMR and Amazon OpenSearch Service. She’s passionate about distributing computation and helping people to bring their ideas to life.

Hsing-Han Wang is a Cloud Support Engineer at Amazon Web Services (AWS). He focuses on Amazon EMR and AWS Lambda. Outside of work, he enjoys hiking and jogging, and he is also an Eorzean.

Cheng Wang is a Technical Account Manager at AWS who has over 10 years of industry experience, focusing on enterprise service support, data analysis, and business intelligence solutions.

Chris Li is an Enterprise Support manager at AWS. He leads a team of Technical Account Managers to solve complex customer problems and implement well-structured solutions.

Amazon Prime Video advances search for sports using Amazon OpenSearch Service

2025-02-27 Radhika Chandak

Post Syndicated from Radhika Chandak original https://aws.amazon.com/blogs/big-data/amazon-prime-video-advances-search-for-sports-using-amazon-opensearch-service/

Passionate sports viewers expect to easily discover and access sports events and their favorite teams, leagues, and players. Providing a robust and intuitive search experience is crucial for the success of Prime Video Sports. With a vast, rapidly growing catalog of live and on-demand sports offerings, a well-designed search architecture allows Prime Video Sports to cater to this engaged audience, streamlining navigation and reducing friction in the user experience. The Prime Video search experience is one of the most clicked on elements in the global navigation bar. Search enables highly relevant recommendations and drives increased viewership and engagement. By prioritizing a seamless search experience that caters to the needs of sports fans, Prime Video has enhanced the overall customer experience, fostering trust and loyalty that contributes to the platform’s long-term growth and success. In this post, we will walk you through how Prime Video used Amazon OpenSearch Service and its AI and machine learning (AI/ML) capabilities to build a more intuitive and enhanced sports search experience.

Challenges

The Prime Video search experience was originally designed to help customers discover trending movies and TV shows that carry durable stats including ratings, viewership, and so on. As Prime Video began to acquire sports rights, they needed to rethink the approach, which was focused primarily on TV shows and movies, to understand the customers’ intent and surface the right content. The approach for TV shows and movies didn’t work as well for live sports because of the more temporal and seasonal nature of sports content making every title a cold start. For example, a search for “soccer live” surfaced documentaries such as “This is football: Season 1” and “Ronaldo VS Messi – Face Off!” rather than live soccer matches. While those entertainment options are perfectly fine on their own, they didn’t fulfill the customers’ goal of finding and watching live or upcoming games for their favorite sports. This disconnect between search queries and relevant results created challenges for customers trying to access the sports content they wanted. By surfacing these relevant sports events in search results, Prime Video enhanced the customer experience, helping customers discover the full breadth of sports coverage available on Prime Video and finding their favorite sports events. To address these issues and better serve the needs of sports fans, in 2024, Prime Video enhanced its sports-specific search capabilities, incorporating deeper sports understanding and using state-of-the-art search techniques, creating an improved and intelligent search system.

Solution overview

In 2024, Prime Video Sports Search delivered the first version of an enhanced sports search functionality powering the experience through a two layer solution comprised of coarse retrieval using semantic search and binary search relevance classification. Semantic search is a technique of searching for information that goes beyond just matching keywords. It matches queries to data (sports events in this case) based on vector embeddings, which capture the meaning of words, phrases, and sentences. The vectors can have n dimensions; when mapped into an n-dimensional space, data that is close in semantic meaning (not a direct text match) will be close to each other in the space, as shown in the following diagram of a two-dimensional vector space of sports matches (in yellow) and search queries (in green).

The foundation of using vector search for sports is the creation of vector embeddings for each sport event present in the Prime Video Sports Catalog. As event data is ingested, textual information including title, sports, team names, leagues, and other event details are used to generate a unique vector representation for each sports event. This allows the system to capture the semantic meaning and relationships between different events—including abbreviations, nicknames, and so on—that are often used by customers to search. When a customer searches for something related to sports, their query is also converted into a vector. The system then performs a K-nearest neighbor (KNN) search, comparing the customer’s query vector to the vectors of all sports events in the catalog. The events with vectors that are closest to the query vector are identified as the most relevant matches, even if the searched words were not directly indexed. For example, Thursday Night Football events might be indexed without the abbreviation tnf, however these games will be returned by semantic search if a customer searches using “tnf” as their search query.

The following figure shows a high level indexing and query flow for a KNN vector search.

Finding the nearest vectors isn’t enough—the system also runs each of these potentially relevant events through a custom binary relevance classification machine learning (ML) model, trained in-house. This allows the system to filter out any events that might be only tangentially related to the original search, leaving behind a refined list of the most pertinent and relevant results for the customer.

Finally, these highly relevant events are ranked and surfaced to the customer with factors like the event’s current live status and upcoming schedule playing a key role in determining the optimal order to display the results. This combined use of vector semantic search and relevance classification enables Prime Video to provide customers with a sports search experience that accurately surfaces the content they’re looking for, significantly enhancing their ability to discover and access the live, upcoming, and recently ended games that they’re most interested in.

Procedure

The vector semantic search implementation we developed consists of two main components: a KNN search index and an endpoint to invoke the text embedding model. To host these components, we used AWS services—the custom text embedding model was deployed on Amazon SageMaker, while the KNN index was created using OpenSearch Service, and hosted on a managed cluster consisting of more than 50 data nodes.

Both of these components are designed to handle real-time customer traffic at a scale of thousands of requests per second. We simplified our system’s application layer by using ready-to-use solutions available in AWS. The Amazon OpenSearch Ingestion pipeline enabled a seamless, code-free integration, allowing us to write sports data from an Amazon DynamoDB table directly into the OpenSearch Service index, eliminating the need for traditional extract, transform, and load (ETL) processes. Furthermore, we used the Neural Search feature of OpenSearch Service instead of directly integrating our application layer with SageMaker for text-to-vector conversion. This approach enables internal text-to-vector transformation, facilitating vector search during both ingestion and search phases. The Neural Search plugin of OpenSearch Service directly communicates with a text embedding model deployed on SageMaker as a real-time inference endpoint using ML connectors.

This architecture—illustrated in the following figure—enabled us to build a scalable and efficient vector search solution, taking advantage of the strengths of various AWS services to simplify the implementation and improve performance.

OpenSearch Ingestion : No-ETL data transfer from DynamoDB to an OpenSearch Service index

Before indexing the sports data in OpenSearch Service, the data is first stored in a DynamoDB table. This layer of storage allows us to maintain a database of all sports events and their metadata required to enable search. This layer acts as a source of truth for sports data that isn’t impacted by the evolution of customer use cases and their respective implementation.

To seamlessly transfer this data from DynamoDB to the OpenSearch Service index, we used an OpenSearch Ingestion pipeline. This allowed us to set up real-time data transfer with a zero ETL integration, abstracting away the data indexing from the application layer. The OpenSearch Ingestion pipeline configuration enables us to specify a schema mapping between the DynamoDB table and the expected document schema in OpenSearch Service. This configuration also allows us to perform data formatting operations on specific fields and configure a dead-letter queue (DLQ) if needed. The steps to setup an OpenSearch Ingestion pipeline can be found in this blog post.

Embedding model setup on SageMaker

At the core of our vector search implementation is the text-embedding model, which plays a crucial role in capturing the semantic meaning of sports-related data. The Sports Search Science team developed this text-embedding model and deployed it on SageMaker as a real-time inference endpoint using AWS Cloud Development Kit (AWS CDK).

The process of creating the SageMaker endpoint requires two key artifacts:

A Docker image of the text-embedding model, which is uploaded to Amazon Elastic Container Registry (Amazon ECR).
A model zip file, which is stored in an Amazon Simple Storage Service (Amazon S3) bucket.

With these two components in place, we used the AWS CDK to programmatically provision the SageMaker endpoint, ensuring a seamless and consistent deployment of the text-embedding model. By using the capabilities of AWS services, such as SageMaker, Amazon ECR, and Amazon S3, we were able to build a scalable and efficient text-embedding model infrastructure to power the vector search solution.

ML connectors

To facilitate access to machine learning models hosted on platforms, such as SageMaker or Amazon Bedrock, OpenSearch Service provides ML connectors. These connectors enable direct integration between OpenSearch Service and external machine learning models.

In our case, the ML connector allows OpenSearch Service to directly invoke the SageMaker endpoint where our custom text-embedding model is deployed. This built-in integration between OpenSearch Service and the SageMaker hosted model simplifies the overall architecture and eliminates the need for the application layer to manage the communication between these two components.

By using the ML connectors provided by the OpenSearch Service ML plugin, we were able to seamlessly integrate our text-embedding model—which is hosted on SageMaker—into the OpenSearch-powered vector search solution. This integration streamlines the data ingestion and querying pipeline making the implementation simpler and more intuitive.

Neural search

To simplify the application layer of our vector search solution, we used the Neural Search capabilities provided by OpenSearch Service. This feature allows us to send only the text data to the index, without the need to explicitly manage the vector embedding generation and indexing. Using neural search helped simplify the application layer of the system by abstracting the generations and management of vectors required to perform a KNN search. During ingestion, neural search transforms document text into vector embeddings and indexes both the text and its vector embeddings in a vector index. When you use a neural query during search, neural search converts the query text into vector embeddings, uses vector search to compare the query and sports event embeddings, and returns the closest results. This abstracts away the need to integrate with SageMaker in the application layer to generate vector embeddings during ingestion and search.

The process of setting up a neural search index with a SageMaker-hosted inference endpoint involves the following detailed steps:

Create an ML connector and register your model in OpenSearch Service: This step generates a model ID that you’ll need in the subsequent neural index setup.
Create a neural ingest pipeline: An ingest pipeline is a sequence of processors that are applied to documents as they’re ingested into an index. To enable neural search, you can define the text_embedding processor in the pipeline. This processor converts the text in a document field to vector embeddings, and the field_map configuration determines the input and output fields for this process.
Create the neural search index: To use the text embedding processor defined in the ingest pipeline, you can create a KNN index and specify the pipeline created in the previous step as the default pipeline.
Run a neural query: To verify your neural search setup, run a neural query by providing a search text and evaluate the results.

By following these steps, you can set up a neural search index in OpenSearch Service and run a neural query. The neural query can perform KNN vector search internally, while only requiring the input of text data during both indexing and querying. This simplifies the application layer and uses the built-in vector embedding generation and indexing capabilities provided by the OpenSearch Service Neural Search feature.

Outcomes

The initial launch of this architecture for sports search had a measurably positive impact on customer experience. We observed a statistically significant increase in search-attributed conversions including streams, purchases, subscriptions, and so on. Offline analysis of the results delivered to customers indicated an improvement in the precision of search results and a reduction in the irrelevance rate of the content shown.

Additionally, we saw that customers engaged with the search feature more frequently, as it was now surfacing results that much more closely aligned with what they were looking for. This increased engagement led to greater discovery of relevant titles on the Prime Video service, including titles that had received little engagement prior to the changes.

Overall, the data clearly demonstrated that by tailoring the specific needs of sports fans into the search experience, we significantly improved their ability to find and access desired content. By developing a smarter search system that better understands sports intent, we have driven more meaningful customer activity and increased conversions directly from search interactions.

Conclusion

By using the innovative AI/ML capabilities of Amazon OpenSearch Service, Prime Video was able to create a cutting-edge search experience that effectively addressed the unique challenges presented by highly dynamic, high-volume sports content. In addition, by overcoming the hurdles that come with such large scale, Prime Video Sports Search was able to contribute valuable improvements and enhancements back to the OpenSearch open source community. These contributions help to pave the way for other developers to more readily use the advanced AI/ML features that OpenSearch Service offers.

This collaboration between Prime Video Sports Search and OpenSearch Service has resulted in a best-in-class search capability that can seamlessly accommodate the unique requirements of live sports content. It’s a partnership that has allowed the products to grow and innovate in tandem, to the benefit of customers seeking exceptional search and discovery experiences.

If you want to build a search experience that understands user intent beyond keyword matching, try the semantic search algorithm with OpenSearch Service and its AI/ML capabilities. If you have any questions, leave a comment below.

About the authors

Radhika Chandak is a Software Development Engineer at Amazon Prime Video, where she has been working for the past 3 years. Her focus is on creating high-velocity customer experiences, with a particular emphasis on building state-of-the-art search experiences for sports content. Radhika is passionate about developing solutions that solve customer problems and delight users. Her expertise lies in crafting innovative approaches to enhance the Prime Video Sports platform, ensuring seamless and engaging experiences for sports enthusiasts.

Anna Chalupowicz is a Software Development Manager at Amazon Prime Video Sports, with 6 years of diverse experience within Amazon. For the last 3.5 years, Anna has been working in Prime Video Sports, where she focuses on developing high-scale solutions and architectural approaches that directly benefit customers. With a passion for collaborative learning and knowledge sharing, Anna finds joy in tackling complex technical challenges and using data-driven insights to enhance the customer experience.

Yaliang Wu is a Software Engineering Manager at AWS, focusing on OpenSearch projects, machine learning, and generative AI applications.

WellRight modernizes to an event-driven architecture to manage bursty and unpredictable traffic

2025-02-24 John Lee

Post Syndicated from John Lee original https://aws.amazon.com/blogs/architecture/wellright-modernizes-to-an-event-driven-architecture-to-manage-bursty-and-unpredictable-traffic/

WellRight is a leading comprehensive corporate wellness platform provider that helps organizations and employees drive meaningful outcomes through personalized wellness programs. The platform increases engagement and benefit utilization by delivering engaging challenges across multiple dimensions of wellness, from physical activities like step tracking to mental health initiatives and team-building exercises.

In this post, we share how WellRight optimized the cost and performance of their application through a ground-up modernization to an event-driven architecture.

The challenge

WellRight’s infrastructure often experiences bursty and unpredictable traffic patterns. For instance, clients can upload bulk user data at any time, which can impact tens of thousands of users, which then cascade into millions of changes. WellRight’s legacy monolithic infrastructure had several challenges when faced with such traffic:

Multiple processes such as registration, progress calculation, and reward distribution relied on a single server, leading to a noisy neighbor problem.
Certain core services were isolated to avoid the noisy neighbor problem, but with high burst workloads, auto scaling didn’t react fast enough to meet the demand. This led to queues backing up with millions of requests. In addition, the database also had to be overprovisioned to avoid throttling, adding to the overall cost.
Parts of the application were not designed with auto scaling in mind, leading to overprovisioning of resources.

The following figure shows the Number of Messages Received metric from a sample Amazon Simple Queue Service (Amazon SQS) queue. WellRight would often receive burst of events at an unpredictable time.

A line graph showing the number of messages received in an SQS queue, with a sharp spike amid otherwise zero activity.

Solution overview

To address the challenges, WellRight made the strategic decision to transition to an event-driven architecture using fully managed AWS services. WellRight’s platform is driven by asynchronous state changes that propagate through multiple wellness programs, which is well suited for an event-driven architecture and can be broken down into microservices. Managed services such as AWS Lambda, Amazon SQS, and Amazon DynamoDB were appealing because they would eliminate the need to manage servers and allow WellRight to focus on core business logic and reduce the operational burden to their engineering team. It also has the added benefit of avoiding overprovisioning of infrastructure or continuously right-sizing resources. Each microservice would scale automatically as needed with no manual efforts, minimizing costs. The loosely coupled architecture would allow the WellRight team to be flexible, being able to add or make modifications to existing programs without affecting existing workflows.

Design

WellRight’s initial event-driven architecture was centered around using serverless and fully managed services. DynamoDB was used as a primary data store for user information. For instance, when a user makes progress on their step challenge, the update in the DynamoDB table would propagate through DynamoDB Streams to Amazon EventBridge. Then, the event would be routed to the appropriate SQS queue, which functions as a buffer and provides fault tolerance to the events. A Lambda function would then process individual user metrics and update the Programs table. The Programs table uses DynamoDB Streams to send out updates using Amazon Simple Notification Service (Amazon SNS), keeping users informed about their progress and rankings.

The following diagram illustrates the flow of an event after a user update.

The first iteration of the event-driven architecture fared better than the monolithic legacy application, but the bursty nature of the traffic was still an issue. Lambda functions triggered by SQS queues scaled rapidly, handling requests in under 15 minutes that previously required 30 servers and took hours to process. Lambda provided WellRight the scalability that they needed, but the rapid scaling introduced a new challenge. This resulted in the throttling of DynamoDB and reaching Lambda concurrency limits during times of extremely high load, which led to many unprocessed messages in the dead-letter queue (DLQ).

Maximum concurrency solution

In January 2023, AWS introduced the maximum concurrency feature for Lambda functions using Amazon SQS as an event source. This new feature allowed WellRight to control the concurrency of their Lambda functions for each SQS queue. Prior to this launch, Lambda functions would continue to scale as long as there were messages in the SQS queue. At times, Lambda functions would scale to its concurrency limits, resulting in it throttling itself. However, with this feature in place, the scaling Lambda functions would not exceed the set maximum concurrency value. This provided WellRight fine-grained control over the overall throughput of the system. WellRight would adjust the maximum concurrency value as needed to protect downstream processes from being overwhelmed, while responding to customer requests in a timely manner.

The following screenshot of the Lambda console shows the maximum concurrency for the function is set to 100 for an SQS trigger.

An AWS Lambda configuration screen showing a trigger from an SQS progress-calculation-queue with maximum concurrency set to 100, alongside a diagram illustrating the SQS to Lambda connection.

WellRight converted all Amazon SQS to Lambda integrations to use this feature. This provided WellRight with full control over the throughput of customer requests while preventing overloading the system. With the maximum concurrency feature, WellRight reduced failed processed messages by 99%, and eliminated DynamoDB throttling events. The feature was enabled for all Amazon SQS and Lambda integrations, including those without scaling issues, as a safeguard for potential future scaling demands.

Performance and cost savings

WellRight’s event-driven architecture significantly improved their ability to handle bursty and unpredictable traffic patterns. The managed serverless services can scale instantaneously to handle these traffic spikes, providing a seamless experience for their clients. With their previous legacy architecture, clients experienced lags in challenge progress, leaderboards, and reward processing.

Now, clients continue to upload updates with over 1 million entries at any time, and WellRight can maintain up-to-the-minute leaderboards and reward processing. The transition to the new architecture has also yielded significant cost savings for WellRight. Prior to the serverless architecture, their baseline architecture required several large Amazon Elastic Compute Cloud (Amazon EC2) instances to handle the initial burst of traffic. After implementing the event-driven architecture, WellRight reduced their costs by 70% on the progress calculation service.

Future plans

WellRight is currently in the process of rolling out the new event-driven architecture to the remaining clients. By the end of 2024, WellRight plans to retire the majority of their remaining servers, further reducing their infrastructure costs.

Conclusion

WellRight’s transition to an event-driven architecture on AWS has been a successful endeavor. By using fully managed services such as Lambda, Amazon SQS, and DynamoDB, they have been able to handle bursty and unpredictable traffic patterns efficiently, while providing a seamless experience for their clients. The introduction of maximum concurrency for Lambda functions has been a game changer, allowing WellRight to control the throughput of their Lambda functions and avoid overwhelming downstream resources.

Overall, the event-driven architecture has enabled WellRight to scale efficiently, improve performance, and reduce costs of their progress calculation service by over 70%. As they continue to optimize their serverless architecture and migrate remaining clients, WellRight is well-positioned to further enhance their platform and provide an exceptional experience to their customers.

To learn more about building event-driven architectures, including key concepts, best practices, AWS services, and getting started resources, visit Serverless Land.

About the authors

AWS CloudFormation: 2024 Year in Review

2025-02-18 Idriss Laouali Abdou

Post Syndicated from Idriss Laouali Abdou original https://aws.amazon.com/blogs/devops/aws-cloudformation-2024-year-in-review/

AWS CloudFormation: 2024 Year in Review

AWS CloudFormation enables you to model and provision your cloud application infrastructure as code-base templates. Whether you prefer writing templates directly in JSON or YAML, or using programming languages like Python, Java, and TypeScript with the AWS Cloud Development Kit (CDK), CloudFormation and CDK provide the flexibility you need. For organizations adopting multi-account strategies, CloudFormation StackSets offers a powerful capability to deploy resources across multiple regions and accounts in parallel.

Last year, we delivered broad set of enhancements that accelerated the development cycle, simplified troubleshooting, and introduced new deployment safety and configuration governance capabilities. Let’s dive into the key launches that shaped CloudFormation in 2024.

Development cycle improvements

Deploy stacks up to 40% faster with optimistic stabilization and configuration complete

In March, we introduced optimistic stabilization with the new CONFIGURATION_COMPLETE event, delivering up to 40% faster stack creation times. This new event signals that CloudFormation has created the resource and applied the configuration as defined in the stack template, allowing us to begin parallel creation of dependent resources. For example, if your stack contains resource B that depends on resource A, CloudFormation will now start provisioning resource B when resource A reaches the CONFIGURATION_COMPLETE state, rather than waiting for full stabilization. Read How we sped up AWS CloudFormation deployments with optimistic stabilization to learn more.

CloudFormation’s old and new deployment strategy

Figure 1: CloudFormation’s old and new deployment strategy

Catch template errors before deployment with early validation

In March, we launched early resource properties validation checks. This feature validates your stack operation upfront for invalid resource property errors, helping you fail fast and minimize the steps required for a successful deployment. Previously, you had to wait until CloudFormation attempted to provision a resource before discovering property-related errors. Now, we validate your template before deploying the first resource and provide clear error messages upfront.

CloudFormation’s early template properties validation feature

Figure 2: CloudFormation’s early template properties validation feature

Safely clean up failed stacks with enhanced deletion controls

In May, we enhanced the DeleteStack API with a new DeletionMode parameter, allowing you to safely delete stacks that are in DELETE_FAILED state. By passing the FORCE_DELETE_STACK value to this parameter, you can now resolve stuck stacks more efficiently during your development and testing cycles.

Accelerate feedback loops with CloudFormation custom resource timeout controls

In June, we introduced the ServiceTimeout property for custom resources. This new capability allows you to set custom timeout values for your custom resource logic execution. Previously, custom resources had a fixed one-hour timeout, which could lead to long wait times when debugging custom resource logic. Now, you can set appropriate timeout values to accelerate your development feedback loops. Refer to the custom resources documentation to learn more about the ServiceTimeout property.

CloudFormation’s ServiceTimeout property for Custom resource

Figure 3: CloudFormation’s ServiceTimeout property for Custom resource

Streamlined Troubleshooting Experience

Resolve deployment issues faster with one-click CloudTrail access

In May, we launched integration with AWS CloudTrail in the Events tab of the CloudFormation console. Troubleshooting some failed stack operations can be time-consuming, so we have streamlined this process by providing direct links from stack operation events to relevant CloudTrail events. When you click ‘Detect Root Cause’ in the CloudFormation Console, you’ll now see a pre-configured CloudTrail deep-link to the API events generated by your stack operation, eliminating multiple manual steps from the troubleshooting process.

Figure 4: CloudFormation troubleshooting with CloudTrail integration

Visualize your entire deployment process with timeline view

In November, we launched deployment timeline view. It gives you unprecedented visibility into your stack operations. This visual tool shows the sequence of actions CloudFormation takes during a deployment, helping you understand resource dependencies and provisioning duration. You can see which resources are being created in parallel, track their status through color-coding, and quickly identify bottlenecks in your deployments.

CloudFormation’s deployment timeline view

Figure 5: CloudFormation’s deployment timeline view

Get instant troubleshooting help with Amazon Q Developer

We integrated Amazon Q Developer to provide AI-powered assistance for troubleshooting. When you encounter a failed stack operation, you can now click “Diagnose with Q” to receive a clear, human-readable analysis of the error. Need more help? The “Help me resolve” button provides actionable steps tailored to your specific scenario.

Figure 6: CloudFormation troubleshooting with Q feature

Enhanced Deployment Safety

In April, we improved change sets to help you better understand the impact of your stack operations. Change sets now show detailed before-and-after values of resource properties and attributes, such as deletion policies. This enhancement helps you detect unintended resource property-level changes, such as a Lambda MemorySize or Runtime values change, during your change sets reviews.

We’ve also improved how change sets handle references. When referenced values are available before deployment, Change sets can now resolve them to their expected values, giving you a more accurate preview of your planned changes.

CloudFormation’s change sets feature

Figure 7: CloudFormation’s change sets feature

Easy onboarding to Infrastructure-as-Code (IaC)

Eliminate weeks of manual effort with IaC Generator

In February, we launched the CloudFormation IaC Generator, a capability addressing one of our customers’ biggest challenges: onboarding existing cloud resources to CloudFormation. This feature makes it easier to generate CloudFormation templates for existing AWS resources. You can now onboard workloads to IaC in minutes instead of spending weeks writing templates manually.

The IaC generator supports over 600 AWS resource types and provides recommendations for related resources. For instance, when you select an S3 bucket, it automatically suggests including associated bucket policies. You can use the generated templates to import resources into CloudFormation, download them for deployment.

CloudFormation’s IaC Generator

Figure 8: CloudFormation’s IaC Generator

In August, we enhanced the IaC Generator with two improvements. First, we added a graphical summary view that helps you quickly find resources after the account scan completes. Second, we integrated with AWS Infrastructure Composer to visualize your application architecture, making it easier to understand resource relationships and configurations.

Figure 9: IaC generator resource scan

Proactive Control Improvements

In November, we launched major enhancements to CloudFormation Hooks, giving you easier ways to author proactive configuration controls and more points to enforce them with your cloud infrastructure provisioning.

CloudFormation Hooks for stack and change set target invocation points

First, we introduced stack and change set target invocation points for CloudFormation Hooks. This extends Hooks beyond individual resource validation, allowing you to run validation checks against entire templates and examine resource relationships. For example, you can now create hooks that validate architectural patterns across multiple resources or enforce team-specific deployment standards. With the change set invocation point, you can automate your change set reviews and reduce the time needed to resolve compliance issues. Refer to the Hooks developer guide to learn more.

Figure 10: CloudFormation’s Hooks for stack and change set target invocation points

Managed hooks for the CloudFormation Guard domain specific language

We introduced the managed hooks to author configuration controls using CloudFormation Guard domain-specific language. This simplifies the hook creation process—you can now write hooks by providing your Guard rule set stored as an S3 object. This is particularly valuable if you’re already using Guard for static template validation, as you can extend these rules to dynamic checks before deployments. To learn more about the Guard hook, check out the AWS DevOps Blog or refer to the Guard Hook User Guide.

CloudFormation Hooks’ Guard language feature

Figure 11: CloudFormation Hooks’ Guard language feature

Managed hooks for AWS Lambda functions

For extended flexibility, we also introduced the managed hooks to implement configuration controls using Lambda function. You can now simply point to a Lambda function with the Lambda Amazon Resource Names (ARNs) for Hooks to invoke. To learn more about the Lambda hook, check out the AWS DevOps Blog or refer to the Lambda Hooks User Guide.

Figure 12: CloudFormation Hooks’ Lambda function feature

CloudFormation Hooks for AWS Cloud Control API target invocation points

Lastly, we extended Hooks to support AWS Cloud Control API (CCAPI) resource configurations. This means your existing resource Hooks can now evaluate configurations from CCAPI create and update operations, allowing you to standardize your proactive control evaluation regardless your IaC tool. If you’re already using pre-built Lambda or Guard hooks, you simply need to specify “Cloud_Control” as a target in your hooks’ configuration to extend their coverage. Learn the detail of this feature from this AWS DevOps Blog.

Figure 13: CloudFormation Hooks for AWS Cloud Control API target invocation point

Additional Platform Improvements

StackSets ListStackSetAutoDeploymentTargets API

In March, we enhanced StackSets with the ListStackSetAutoDeploymentTargets API. This new capability gives you better visibility into your auto-deployment configurations by allowing you to list existing target Organizational Units (OUs) and AWS Regions for a given stack set. Instead of logging into individual accounts to understand your deployment scope, you can now get this information in a single API call.

CloudFormation Git sync with request review support

In September, we improved CloudFormation Git sync with pull request workflow support. When you create or update a pull request in a linked repository, CloudFormation automatically posts change set information as PR comments. This integration provides a clear overview of proposed changes within your familiar Git workflow, allowing team members to review infrastructure changes alongside code changes. Visit our user guide and launch blog to learn more.

Figure 14: CloudFormation Git sync with request review support feature

Early 2025 improvements

Reshape your AWS CloudFormation stacks seamlessly with stack refactoring

In February 2025, CloudFormation introduced a new capability called stack refactoring that makes it easy to reorganize cloud resources across your CloudFormation stacks. Stack refactoring enables you to move resources from one stack to another, split monolithic stacks into smaller components, and rename the logical name of resources within a stack. This enables you to adapt your stacks to meet architectural patterns, operational needs, or business requirements. To explore an example scenario, read Introducing AWS CloudFormation Stack Refactoring.

Learn more

Here are some resources to help you get started learning and using CloudFormation to manage your cloud infrastructure:

Conclusion

As we are starting 2025, our focus remains on making infrastructure deployment faster, safer, and more manageable. These enhancements reflect our commitment to solving real customer challenges and improving the CloudFormation experience. We are excited about the roadmap ahead and look forward to bringing you more innovations in 2025.

We encourage you to try these new features and share your feedback. For more detailed information about any of these launches, visit our documentation or check out the AWS DevOps Blog.

About the author:

Use generative AI on AWS for efficient clinical document analysis

2025-02-05 Alex Boudreau

Post Syndicated from Alex Boudreau original https://aws.amazon.com/blogs/architecture/use-generative-ai-on-aws-for-efficient-clinical-document-analysis/

Clinical trials involve the ingestion and processing of vast amounts of highly regulated data, including complex protocol documents that describe how the trial will be conducted. Managing this volume of information can be overwhelming, but generative AI offers a solution by helping automate the process and enabling clinical researchers to quickly focus on the most relevant information. Currently, the drug approval process takes on average 10–12 years, with clinical trial study startup time accounting for 1 year of that timeframe. Much of the challenge with study startup lies in the complex and non-standard nature of protocol documents. These often require weeks or months of effort to review and assess. This review time adds to the already long cycle time to bring a new drug to market.

In this post, we show how Clario uses the AWS platform to accelerate clinical document analysis.

About Clario

Clario is a leading provider of endpoint data solutions to the clinical trials industry providing regulatory-grade clinical evidence for pharmaceutical, biotech, and medical device partners. Since Clario’s founding more than 50 years ago, their endpoint data solutions have supported clinical trials more than 26,000 times with over 700 regulatory approvals across more than 100 countries. One of the critical challenges Clario faces is the time-consuming process of generating documentation for clinical trials, which can take weeks or months.

The business challenge

Clinical trials are essential for the approval of new health innovations, including treatments, procedures, and medical devices. They require the collection of vast quantities of complex data from dispersed clinical trial sites to support assessments of medical benefits and risks, all while maintaining privacy and regulatory compliance. To make matters even more challenging, capturing data in clinical trial occurs not only in healthcare centers but also through remote capture through various aspects of trial participants’ daily activities.

Partners like Clario understand the challenges faced by life sciences companies when it comes to analyzing large volumes of complex clinical documents, such as study protocols. These documents often contain a mix of structured and unstructured data, including tables, images, and diagrams, making it difficult to accurately interpret and extract key information at scale. In this post, we explore how Clario has used the power of generative AI on AWS to efficiently analyze clinical documents and drive better outcomes for its clients.

Harnessing the power of large language models

The rapid progress in large language models (LLMs) has expanded the potential applications of natural language processing beyond simple conversational AI assistants. Clario has experimented with various techniques, such as zero-shot learning, few-shot learning, classification, entity extraction, and summarization, for the effective use of LLMs in specialized use cases. By employing prompt engineering, AI orchestration, and content retrieval, Clario can guide the models to accurately generate insights and extract relevant information from key clinical research documents, including complex clinical trial protocols.

Four pillars of effective document analysis on AWS

Through its research and development efforts, Clario has identified four core pillars that enable effective document analysis using generative AI on AWS:

Parsing – Clario uses AWS services such as Amazon Textract and Amazon Comprehend to extract text, images, and tables from clinical documents, maintaining both data privacy and security.
Retrieval – By using embedding models and vector databases like Amazon OpenSearch Service, Clario efficiently stores and retrieves relevant information from large document collections based on similarity search. The team has experimented with various chunking and retrieval strategies to optimize accuracy and performance.
Prompting – Using techniques like zero-shot and few-shot learning, Clario has enhanced the accuracy of LLMs for classifying and extracting information . AWS services such as and Amazon Bedrock simplify experimentation with different prompting strategies and the evaluation of model performance.
Generation – Clario carefully considers factors such as context size, reasoning capabilities, and latency when selecting the appropriate LLMs for generating structured outputs. AWS offers a range of pre-trained models and frameworks that seamlessly integrate into Clario’s pipeline.

Solution overview

To tackle the unique challenges associated with analyzing clinical documents, Clario has built a custom generative AI platform on AWS. This platform incorporates an orchestration engine that combines multiple LLMs and deep learning models, enabling it to extract key information accurately and at scale. By using AWS services such as Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Simple Storage Service (Amazon S3), SageMaker, and AWS Lambda, Clario can efficiently process thousands of documents in a matter of seconds.

The following diagram illustrates the solution architecture.

The workflow consists of the following steps:

Documents are collected on premises (1) and uploaded using AWS Direct Connect (2) with encryption in transit to Amazon S3 (3). All uploaded documents are then automatically and securely stored with server-side object-level encryption.
After the documents are uploaded and the user has reviewed them, the Clario AI Orchestration Engine (4) determines the best document parsing strategy based on file type, and extracts text using Amazon Textract (5). Once extracted, the text is vectorized and stored in the Amazon OpenSearch Service vector engine (6) for later semantic retrieval.
After vectorization, the Clario AI Orchestration Engine (4), which runs as a distributed service in Amazon EKS, launches a document classification async task using Amazon MQ. Amazon EC2 and Lambda are used for additional processing if needed. This triggers the Document Classification Agent, which uses Amazon Bedrock LLMs (8), for automatically determining the document type.
After the documents are classified, the Clario AI Orchestration Engine (4) launches the appropriate document analysis agent for further background processing. In the case of study protocols, the engine launches the Protocol Analysis agent, which uses a predefined analysis graph configuration stored in Amazon Relational Database Service (Amazon RDS) (7), as well as a combination of retrieval strategies and AI models, including custom deep learning models on SageMaker (9), and pre-trained LLMs on Amazon Bedrock (8). This orchestration powers advanced document analysis, transforming massive amounts of unstructured multi-modal data into structured data and insights.
Following the analysis, all structured data is then persisted to Amazon RDS (7) for later visualization, review, and querying.

Recommendations and best practices

Based on their experience developing and deploying generative AI solutions on AWS, Clario learned the following best practices:

Adopt an incremental and iterative development approach to gradually build and refine your models
Follow a standard machine learning approach for evaluating and validating model performance using representative test sets
Optimize the four pillars of document analysis before investing in fine-tuning and continuous pre-training of LLMs
Tailor your approaches to specific use cases, because not all problems require the same models or techniques

Conclusion

By using the power of generative AI on AWS, Clario has been able to efficiently analyze complex clinical trial documents and extract valuable insights for its clients in the life sciences industry. Through a combination of careful model selection, iterative development, and adherence to best practices, Clario has built a scalable and accurate document analysis pipeline using AWS. Unlock the full potential of your clinical trial data by applying these best practices with an AWS generative AI solution today.