Tag Archives: Best practices

Enhancing data durability in Amazon EMR HBase on Amazon S3 with the Amazon EMR WAL feature

Post Syndicated from Suthan Phillips original https://aws.amazon.com/blogs/big-data/enhancing-data-durability-in-amazon-emr-hbase-on-amazon-s3-with-the-amazon-emr-wal-feature/

Apache HBase, an open source NoSQL database, enables quick access to massive datasets. Amazon EMR, from version 5.2.0, lets you use HBase on Amazon Simple Storage Service (Amazon S3). This combines HBase’s speed with the durability advantages of Amazon S3. Also, it helps achieve the data lake architecture benefits such as the ability to scale storage and compute requirements separately. We see our customers choosing Amazon S3 over Hadoop Distributed File Systems (HDFS) when they want to achieve greater durability, availability, and simplified storage management. Amazon EMR continually improves HBase on Amazon S3, focusing on performance, availability, and reliability.

Despite these durability benefits of HBase on Amazon S3 architecture, a critical concern remains regarding data recovery when the Write-Ahead Log (WAL) is lost. Within the EMR framework, HBase data attains durability when it’s flushed, or written, to Amazon S3. This flushing process is triggered by reaching specific size and time thresholds or through manual initiation. Until data is successfully flushed to S3, it persists within the WAL, which is stored in HDFS. In this post, we dive deep into the new Amazon EMR WAL feature to help you understand how it works, how it enhances durability, and why it’s needed. We explore several scenarios that are well-suited for this feature.

HBase WAL overview

Each RegionServer in HBase is responsible for managing data from multiple tables. These tables are horizontally partitioned into regions, where each region represents a contiguous range of row keys. A RegionServer can host multiple such regions, potentially from different tables. At the RegionServer level, there is a single, shared WAL that records all write operations across all regions and tables in a sequential, append-only manner. This shared WAL makes sure durability is maintained by persisting each mutation before applying it to in-memory structures, enabling recovery in case of unexpected failures. Within each region, the memory structure of the MemStore is further divided by column families, which are the fundamental units of physical storage and I/O in HBase. Each column family maintains:

  • Its own MemStore, which holds recently written data in memory for fast access and buffering before it flushes to disk.
  • A set of HFiles, which are immutable data files stored on HDFS (or Amazon S3 in HBase on S3 mode) that hold the persistent, flushed data.

Although all column families within a region are served by the same RegionServer process, they operate independently in terms of memory buffering, flushing, and compaction. However, they still share the same WAL and RegionServer-level resources, which introduces a degree of coordination, hence they operate semi-independently within the broader region context. This architecture is shown in the following diagram.

Architecture diagram of HBase Region Server showing WAL, two regions with in-memory Memstores and persistent HFiles

Understanding the HBase write process: WAL, MemStore, and HFiles

The HBase write path initiates when a client issues a write request, typically through an RPC call directed to the appropriate RegionServer that hosts the target region. Upon receiving the request, the RegionServer identifies the correct HBase region based on the row key and forwards the KeyValue pair accordingly. The write operation follows a two-step process. First, the data is appended to the WAL, which promotes durability by recording every change before it’s committed to memory. The WAL resides on HDFS by default and exists independently on each RegionServer. Its primary purpose is to provide a recovery mechanism in the event of a failure, particularly for edits that have not yet been flushed to disk. When the WAL append is successful, the data is written to the MemStore, an in-memory store for each column family within the region. The MemStore accumulates updates until it reaches a predefined size threshold, controlled by the hbase.hregion.memstore.flush.size parameter (default is 128 MB). When this threshold is exceeded, a flush is triggered.Flushing is handled asynchronously by a background thread in the RegionServer. The thread writes the contents of the MemStore to a new HFile, which is then persisted to long-term storage. In Amazon EMR, the location of this HFile depends on the deployment mode: for HBase on Amazon S3, HFiles are stored in Amazon S3, but for HBase on HDFS, they’re stored in HDFS.This workflow is shown in the following diagram.

HBase write process workflow showing data path through WAL, Memstore, and HFile with AWS services

A region server serves multiple regions, and they all share a common WAL. The WAL records all data changes, storing them in local HDFS. Puts and deletes are initially logged to the WAL by the region server before being recorded in the MemStore for the affected store. Scan and get operations in HBase don’t require the use of the WAL. In the event of a region server crash or unavailability before MemStore flushing, the WAL is crucial for replaying data changes, which promotes data integrity. Because this log by default resides on a replicated filesystem, it enables an alternate server to access and replay the log, requiring nothing from the physically failed server for a complete recovery. When a RegionServer fails abruptly, HBase initiates an automated recovery process orchestrated by the HMaster. First, the ZooKeeper session timeout detects the RegionServer failure, notifying the HMaster. The HMaster then identifies all regions previously hosted on the failed RegionServer and marks them as unassigned. The WAL files from the failed RegionServer are split by region, and these split WAL files are distributed to the new RegionServers that will host the reassigned regions. Each new RegionServer replays its assigned WAL segments to recover the MemStore state that existed before the failure, preventing data loss. When WAL replay is complete, the regions become operational on their new RegionServers, and the recovery process concludes.

HBase recovery workflow from RegionServer failure through WAL splitting to regions online

The effectiveness of the HDFS WAL model relies on the successful completion of the write request in the WAL and the subsequent data replication in HDFS. In cases where some nodes are terminated, HDFS can still recover from the WAL files, allowing HBase to autonomously heal by replaying data from the WALs and rebalancing the regions. However, if all CORE nodes are simultaneously terminated, achieving complete cluster recovery is a challenge because the data to replay from the WAL is lost. The issue arises when WALs are lost due to CORE node shutdown (for example, all three replicas of a file block). In this scenario, HBase enters a loop attempting to replay data from the WALs. Unfortunately, the absence of available blocks in this case causes the HBase server crash procedure to fail and retry indefinitely.

Amazon EMR WAL

To address the mentioned challenge of HDFS WAL and to provide data durability in HBase, Amazon EMR introduces a new EMR WAL feature starting from versions emr-7.0 and emr-6.15. This feature facilitates the recovery of data that hasn’t been flushed to Amazon S3 (HFile). Using this feature provides thorough backup for your HBase clusters. Behind the scenes, the RegionServer writes WAL data to EMR WAL, which is a service outside the EMR cluster. With this feature enabled, concerns about loss of WAL data in HDFS are alleviated. Also, in the event of cluster or Availability Zone failure issues, you can create a new cluster, directing it to the same Amazon S3 root directory and EMR WAL workspace. This enables the automatic recovery of data in the WAL in the order of minutes. Recovery of unflushed data is supported for a duration of 30 days, after which remaining unflushed data is deleted. This workflow is shown in the following diagram.

Detailed sequence diagram of HBase write operations with EMR WAL service integration and S3 storage

Key benefits

Upon enabling EMR WAL, the WALs are located external to the EMR cluster. The key benefits are:

  • High availability – You can remain confident about data integrity even in the face of Availability Zone failures. Their HFiles are stored in Amazon S3, and the WALs are externally stored in EMR WAL. This setup enables cluster recovery and WAL replay in the same or a different Availability Zone within the region. However, for true high availability with zero downtime, relying solely on EMR WAL is not sufficient because recovery still involves brief interruptions. To provide seamless failover and uninterrupted service, HBase replication across multiple Availability Zones is essential along with EMR WAL, providing robust zero-downtime high availability.
  • Data durability improvement – Customers no longer need to concern themselves with potential data loss in scenarios involving WAL data corruption in HDFS or the removal of all replicas in HDFS due to instance terminations.

The following flow diagram compares the sequence of events with and without EMR WAL enabled.

HBase write process and failure recovery workflow with EMR WAL and S3 integration

Key EMR WAL features

In this section, we explore the key enhancements introduced in the EMR WAL service across recent Amazon EMR versions. From grouping multiple HBase regions into a single EMR WAL to advanced configuration options, these new capabilities address specific usage scenarios.

Grouping multiple HBase regions into a single Amazon EMR WAL

In Amazon EMR versions up to 7.2, a separate EMR WAL is created for each region, which can become expensive due to the EMR-WAL-WALHours pricing model, especially when the HBase cluster contains many regions. To address this, starting from Amazon EMR 7.3, we introduced the EMR WAL grouping feature, which enables consolidating multiple HBase regions per EMR WAL, offering significant cost savings (over 99% cost savings in our sample evaluation) and improved operational efficiency. By default, each HBase RegionServer has two Amazon EMR WALs. If you have many regions per RegionServer and want to increase throughput, you can customize the number of WALs per RegionServer by configuring the hbase.wal.regiongrouping.numgroups property. For instance, to set 10 EMR WALs per HBase RegionServer, you can use the following configuration:

[
  {
    "Classification": "hbase-site",
    "Properties": {
      "hbase.wal.regiongrouping.numgroups": "10"
    }
  }
]

The two HBase system tables hbase:meta and hbase:master (masterstore) don’t participate in the WAL grouping mechanisms.

In a performance test using m5.8xlarge instances with 1,000 regions per RegionServer, we observed a significant increase in throughput as the number of WALs grew from 1 to 20 per RegionServer (from 1,570 to 3,384 operations per sec). This led to a 54% improvement in average latency (from 40.5 ms to 18.8 ms) and a 72% reduction in 95th percentile latency (from 231 ms to 64 ms). However, beyond 20 WALs, we noted diminishing returns, with only slight performance improvements between 20 and 50 WALs, and average latency stabilized around 18.7ms. Based on these results, we recommend maintaining a lower region density (around 10 regions per WAL) for optimal performance. Nonetheless, it’s crucial to fine-tune this configuration according to your specific workload characteristics and performance requirements and conduct tests in your lower environment to validate the best setup.

Configurable maximum record size in EMR WAL

Until Amazon EMR version 7.4, the EMR WAL had a record size limit of 4 MB, which was insufficient for some customers. Starting from EMR 7.5, the maximum record size in EMR WAL is configurable through the emr.wal.max.payload.size property. The default value is set to 1 GB. The following is an example of how to set the maximum record size to 2 GB:

[
  {
    "Classification": "hbase-site",
    "Properties": {
      "emr.wal.max.payload.size": "2147483648"
    }
  }
]

AWS PrivateLink support

EMR WAL supports AWS PrivateLink, if you want to keep your connection within the AWS network. To set it up, create a virtual private cloud (VPC) endpoint using the AWS Management Console or AWS Command Line Interface (AWS CLI) and select the service labeled com.amazonaws.region.emrwal.prod. Make sure your VPC endpoint uses the same security groups as the EMR cluster. You have two DNS configuration options: enabling private DNS, which uses the standard endpoint format and automatically routes traffic privately, or using the provided VPC endpoint-specific DNS name for more explicit control. Regardless of the DNS option chosen, both methods mean that traffic remains within the AWS network, enhancing security. To implement this in the EMR cluster, update your cluster configuration to use the PrivateLink endpoint, as shown in the following code sample (for private DNS):

[
    {
        "Classification": "hbase-site",
        "Properties": {
            "emr.wal.client.endpoint": "https://prod.emrwal.region.amazonaws.com"
        }
    }
]

For more details, refer to Access Amazon EMR WAL through AWS PrivateLink in the Amazon EMR documentation.

Encryption options for WAL in Amazon EMR

Amazon EMR automatically encrypts data in transit in the EMR WAL service. You can enable server-side encryption (SSE) for WAL (data at rest) with two key management options:

  • SSE-EMR-WAL: Amazon EMR manages the encryption keys
  • SSE-KMS-WAL: You use an AWS Key Management Service (AWS KMS) key for encryption policies

EMR WAL cross-cluster replication

From EMR 7.5, EMR WAL supports cross-cluster replay, allowing clusters in an active-passive HBase replication setup to use EMR WAL.

For more details on the setup, refer to EMR WAL cross-cluster replication in the Amazon EMR documentation.

EMR WAL enhancement: Minimizing CPU load from HBase sync threads

Starting from EMR 7.9, we’ve implemented code optimizations in EMR WAL to address the high CPU utilization caused by sync threads used by HBase processes to write WAL edits, leading to improved CPU efficiency.

Sample use cases benefitting from this feature

Based on our customer interactions and feedback, this feature can help in the following scenarios.

Continuity during service disruptions

If your business demands disaster recovery with no data loss for an HBase on an S3 cluster due to unexpected service disruptions, such as an Availability Zone failure, the newly introduced feature means you don’t have to rely on a persistent event store solution using Amazon Managed Streaming for Apache Kafka (Amazon MSK) or Amazon Kinesis. Without EMR WAL, you had to set up a complex event-streaming pipeline to retain the most recently ingested data and enable replay from the point of failure. This new feature eliminates that dependency by storing Hbase WALs in the EMR WAL service.

Note: During an Availability Zone (AZ) failure or service-level issue, make sure to fully terminate the original Hbase cluster before launching a new one that points to the same S3 root directory. Running two active Hbase clusters that access the same S3 root can lead to data corruption.

Upgrading to the latest EMR releases or cluster rotations

Without EMR WAL, moving to the latest EMR version or managing cluster rotations with HBase on Amazon S3 necessitated manual interruptions for data flushing to S3. With the new feature, the requirement for data flushing is eliminated. However, during cluster termination and the subsequent launch of a new HBase cluster, there is an inevitable service downtime, during which data producers or ingestion pipelines must handle write disruptions or buffer incoming data until the system is fully restored. Also, the downstream services should account for temporary unavailability, which can be mitigated using a read replica cluster.

Overcoming HDFS challenges during HBase auto scaling

Without EMR WAL feature, having HDFS for your WAL files was a requirement. When implementing custom auto scaling for your HBase clusters, it sometimes resulted in WAL data corruption due to issues linked to HDFS. This is because, to prevent data loss, data blocks had to be moved to different HDFS nodes when one HDFS node was being decommissioned. When nodes continued to be terminated swiftly during scale-down process without allowing sufficient time for graceful decommissioning, it could result in WAL data corruption issues, primarily attributed to missing blocks.

Addressing HDFS disk space issues due to old WALs

When a WAL file is no longer required for recovery, indicating that HBase has made sure all data within the WAL file has been flushed, it’s transferred to the oldWALs folder for archival purposes. The log remains in this location until all other references to the WAL file are completed. In HBase use cases with high write activity, some customers have expressed concerns about the oldWALs directory (/usr/hbase/oldWALs) expanding and occupying excessive disk space and eventually causing disk space issues. With the complete relocation of these WALs to an external EMR WAL service, you will no longer encounter this issue.

Assessing HBase in Amazon EMR clusters with and without EMR WAL for fault tolerance

We conducted a data durability test employing two scripts. The first was for installing YCSB, creating a pre-split table, and loading 8 million records on the master node. The second was for terminating a core node every 90 seconds after a 3-minute wait, totaling five terminations. Two EMR clusters with eight core nodes each were created, one configured with EMR WAL enabled and the other as a standard EMR HBase cluster with the WAL stored in HDFS. After completion of EMR steps, a count was run on the HBase table. In the EMR cluster with EMR WAL enabled, all records were successfully inserted without corruption. In the cluster not using EMR WALs, regions in HBase remained “OPENING” if the node hosting the meta was terminated. For other core node terminations, inserts failed, resulting in a lower record count during validation.

Understanding when EMR WAL read charges apply in HBase

In HBase, standard table read operations such as Get and Scan don’t access WALs. Therefore, EMR WAL read (GiB) charges are only incurred during operations that involve reading from WALs, such as:

  • Restoring data from EMR WALs in a newly launched cluster
  • Replaying WALs to recover data on a crashed RegionServer
  • Performing HBase replication, which involves reading WALs to replicate data across clusters

In a normal scenario, you’re billed only for the following two components related to EMR WAL usage:

  • EMR-WAL-WALHours – Represents the hourly cost of WAL storage, calculated based on the number of WALs maintained. You can use the EMRWALCount metric in Amazon CloudWatch to monitor the number of WALs and track associated usage over time.
  • EMR-WAL-WriteRequestGiB – This reflects the volume of data written to the WAL service, charged by the amount of data written in GiB.

For further details on pricing, refer to Amazon EMR pricing and Amazon EMR Release Guide.

To monitor and analyze EMR WAL related costs in the AWS Cost and Usage Reports (CUR), look under product_servicecode = ‘ElasticMapReduce’, where you’ll find the following product_usagetype entries associated with WAL usage:

  • USE1-EMR-WAL-ReadRequestGiB
  • USE1-EMR-WAL-WALHours
  • USE1-EMR-WAL-WriteRequestGiB

The prefix USE1 indicates the Region (in this case, us-east-1) and will vary depending on where your EMR cluster is deployed.

Summary

This new EMR WAL feature allows you to improve durability of your Amazon EMR HBase on S3 clusters, addressing critical workload scenarios by eliminating the need for streaming solutions for Availability Zone level service disruptions, streamlining processes for upgrading or rotating clusters, preventing data corruption during HBase auto scaling or node termination events, and resolving disk space issues associated with old WALs. Because many of the EMR WAL features are added on the latest releases of Amazon EMR, we recommend that customers use Amazon EMR version 7.9 or later to fully benefit from these improvements.


About the authors

Suthan Phillips is a Senior Analytics Architect at AWS, where he helps customers design and optimize scalable, high-performance data solutions that drive business insights. He combines architectural guidance on system design and scalability with best practices to ensure efficient, secure implementation across data processing and experience layers. Outside of work, Suthan enjoys swimming, hiking and exploring the Pacific Northwest.

How Launchpad from Pega enables secure SaaS extensibility with AWS Lambda

Post Syndicated from Anton Aleksandrov original https://aws.amazon.com/blogs/architecture/how-launchpad-from-pega-enables-secure-saas-extensibility-with-aws-lambda/

Large organizations increasingly adopt software as a service (SaaS) solutions to focus on business priorities, reduce infrastructure management overhead, and optimize costs. These organizations expect SaaS vendors to provide customizability facilities for tailoring the solution behavior according to their needs. Although traditional approaches like feature flags and webhooks offer some flexibility, they often fall short of providing a high degree of customizability. A new emerging pattern in this space is tenant-supplied custom code execution, which allows tenants to inject their own code into specific workflow points, enabling deep customization while preserving the core SaaS solutions’ integrity and security.

In this post, we share how Pegasystems (Pega) built Launchpad, its new SaaS development platform, to solve a core challenge in multi-tenant environments: enabling secure customer customization. By running tenant code in isolated environments with AWS Lambda, Launchpad offers its customers a secure, scalable foundation, eliminating the need for bespoke code customizations.

Solution overview

Launchpad, which is built on AWS, is an end-to-end platform on which software providers can build, launch, and operate workflow-centric B2B SaaS applications and AI solutions. It provides a managed, secure, scalable cloud environment for hosting multi-tenant applications and data. It accelerates the build experience with generative AI-powered low code tools, prebuilt capabilities, and subscriber-level configuration. Being a multi-tenant platform at its core, Launchpad had to maintain stringent isolation across tenants in its architecture.

One of the requirements Launchpad had was to allow their tenants to augment the workflows natively by providing custom code. Some common scenarios included communicating with external systems with proprietary non-industry-standard protocols, reuse of existing business logic, and SDK-based custom code development.

The solution necessitated the ability for tenants to provide custom code that would implement the required business logic, which Launchpad would be executing. This required architecting a secure runtime environment for custom code execution that maintains the highest degree of cross-tenant isolation within the multi-tenant architecture, at the same time allowing sufficient access to platform APIs and services. It was essential to build an architecture that would decouple the environment running tenant code from the core SaaS platform, as illustrated in the following diagram.

Architecting the solution topology

To achieve the required high level of compute isolation for running code provided by different tenants, Launchpad has adopted Lambda functions in its architecture as the secure ephemeral compute environment. Each untrusted code snippet provided by tenants is bootstrapped as a stand-alone Lambda function, with strong Firecracker-based isolation across different functions and execution environments addressing Launchpad’s requirements. This isolation provides dedicated resources, customizable access permissions, independent monitoring and operations, and automatic scaling for each function, while maintaining complete separation from other functions and their execution environments, as illustrated in the following diagram.

With Lambda being a serverless compute service, adopting it for the Launchpad architecture yielded several significant benefits. The major business benefit was that tenants could implement thousands of custom workflow augmentations on their own simply by providing code snippets, instead of the Launchpad engineering team being responsible for implementing them in the core platform code. Other benefits included:

  • Managed runtimes – AWS handles patching and updating the underlying infrastructure, operating system, and runtimes for customers, reducing the potential attack surface.
  • Fine-grained permissions – Each function can have its own set of access policies to tightly control what resources and actions it can access.
  • No need to pre-provision and pay for overprovisioned capacity – Lambda functions scale up and down automatically based on traffic patterns.
  • Built-in monitoring – Lambda functions emit detailed metrics, logs, and traces through Amazon CloudWatch and AWS X-Ray out of the box, making it straightforward to monitor tenant code execution.

To further reduce risks, Launchpad runs these Lambda functions with untrusted code in a dedicated AWS account. This account is separated from the core SaaS platform account. When end-users create a new function in the Launchpad authoring portal, they upload their code and specify the code handler to be executed during the invocation. Users can also map function input and output to Launchpad fields for further processing to enable an even higher degree of customizability and integration. The multi-tenant authoring service is a Control Plane component that runs as a microservice on the Amazon Elastic Kubernetes Service (Amazon EKS) cluster and uses the Lambda API for function lifecycle management, as illustrated in the following diagram. After a function resource is created, it can be used for further invocations.

Runtime architecture

At runtime, when Launchpad needs to invoke a function, it calls the Lambda Invoke API. Before the function is invoked, the multi-tenant runtime service performs a tenancy check to make sure the request is coming from an authorized tenant by doing the token validation. After a successful validation, the service invokes the required Lambda function. To invoke functions hosted in a different AWS account, the multi-tenant runtime service uses an AWS Identity and Access Management (IAM) role to assume the required permissions and invokes the Lambda service using the AWS SDK. The sequence of interactions is shown in the following architecture diagram.

The workflow consists of the following steps:

  1. An incoming user request reaches the application gateway service.
  2. The application gateway authenticates the request using the tenancy security service.
  3. After it’s authenticated, the request is forwarded to the multi-tenant runtime service
  4. The multi-tenant runtime service validates the supplied token and performs a tenancy check. This makes sure tenants can only invoke own functions they have permissions for (for example, functions they own).
  5. The multi-tenant runtime service pod assumes the IAM role required for invoking the tenant-specific Lambda function in a different AWS account.
  6. The multi-tenant runtime service pod invokes the required Lambda function.

Invoking the platform API from custom code is as straightforward as connecting to any external API. The custom code can authenticate with the platform using OAuth2. To facilitate the authentication, the developer can pass along the credentials as input parameters to the function from the core platform. Then the developer can create a corresponding record (isolated by tenant) in the platform that stores the credentials per function, and pass credentials as input parameters during invocation.

Distributed architecture observability

Operating a distributed architecture that runs untrusted code across multiple AWS accounts requires a comprehensive observability strategy. Launchpad’s approach combines centralized logging and monitoring with cross-account aggregation to provide a unified operational view of the platform.

The monitoring architecture uses CloudWatch Metrics to observe the Lambda functions, aggregating them through a centralized observability layer. This setup empowers platform operators to correlate Lambda function metrics with the core platform services running on Amazon EKS. Launchpad also collects per-function telemetry like function invocations, error rates, and execution time, which allows them to observe per-tenant metrics. These telemetry dimensions enable both a platform-wide and tenant-specific monitoring perspective.

For logging and troubleshooting, Launchpad implements a unified logging pipeline that aggregates Lambda function logs with application gateway and runtime service logs. Each request flowing through the system carries a correlation ID, so operators can trace execution paths across the core SaaS services and into the tenant functions running in the AWS account running tenant Lambda functions.

With this multi-layer observability architecture, Launchpad can maintain operational excellence while running tenant code securely at scale. Regular operational reviews drive continuous improvements in monitoring coverage and incident response procedures. Having per-tenant Lambda functions make it possible for Launchpad to use tenant-specific cost allocation tags, further empowering them to understand the cost footprint of running tenant custom code.

Best practices

When building a SaaS solution, maintaining a unified core code base is essential for scalability and manageability. Implementing per-tenant variations within the core platform code can lead to maintenance complexity and technical debt. Instead, architect your SaaS solution to have extension points, which allow your tenants to inject their custom code at specific points in the workflow, enabling customization without compromising the platform’s maintainability. This pattern makes sure the core SaaS platform remains clean and standardized while offering the flexibility that customers demand.

Additional best practices include:

  • Use separate accounts for running Lambda functions with untrusted tenant-provided code to make sure it’s isolated from your core SaaS platform code.
  • Grant absolute minimum required access permissions to the execution role assigned to the function. The custom code running within the execution environment gets permissions defined in the execution role when making requests to AWS API endpoints. If the function doesn’t need to reach out to AWS API endpoints, remove all permissions from the execution role and add an explicit AWSDenyAll policy.
  • Use separate Lambda functions for each code snippet and each tenant. This will provide the highest degree of cross-tenant isolation. Resources are not reused across different functions and execution environments.
  • Use Lambda layers in case you need to add a layer of vendor-provided code in order to keep it separated from the untrusted tenant-provided code.
  • Implement additional security controls, such as using Amazon Virtual Private Cloud (Amazon VPC) constructs to restrict network access and VPC Flow Logs for network activity monitoring.

Conclusion

The implementation of a secure untrusted code execution environment within SaaS platforms addresses a critical need for tenant customization while maintaining architectural integrity. Lambda offers a built-in isolation model, fine-grained security controls, and serverless scalability, so SaaS providers such as Launchpad can address the requirements of executing tenant-provided code in a multi-tenant environment and offer robust customization capabilities while maintaining strict security boundaries and operational efficiency. This architectural pattern enables providers to focus on core platform development while confidently supporting tenant-specific workflows through the secure and scalable Lambda execution environment.

To learn more, refer to the Security Overview of AWS Lambda white paper. For additional serverless architectural patterns, see Serverlessland.com.


About the authors

How Smartsheet boosts developer productivity with Amazon Bedrock and Roo Code

Post Syndicated from Rony Blum original https://aws.amazon.com/blogs/architecture/how-smartsheet-boosts-developer-productivity-with-amazon-bedrock-and-roo-code/

This post was co-written with JB Brown from Smartsheet.

Organizations are often seeking ways to enhance developer productivity while maintaining cost-efficiency. AI-powered coding assistants are effective solutions to accelerate development cycles, but implementing these solutions at scale while managing costs presents unique challenges. Roo Code, an AI-powered autonomous coding agent located directly in developers’ editors, represents the latest evolution in AI-assisted development. It helps developers code faster and smarter by offering capabilities ranging from code generation and refactoring to documentation writing and code base analysis through natural language interaction. This post explores how Smartsheet successfully deployed Roo Code with Amazon Bedrock and Anthropic’s Claude, achieving significant improvements in developer efficiency while optimizing costs through innovative caching strategies.

The Amazon Bedrock prompt caching capability, which became generally available in April 2025, represents a significant advancement in optimizing AI model interactions. With this feature, organizations can cache frequently used prompts across multiple API calls, reducing costs by up to 90% and lowing latency by up to 85%. The technology is particularly valuable for applications that repeatedly use similar context, such as coding assistants that need to maintain context about code files. Prompt caching supports multiple foundation models (FMs), including Anthropic’s Claude 3.5 Haiku and Anthropic’s Claude 3.7 Sonnet, as well as the Amazon Nova family of models. The cached context remains available for up to 5 minutes after each access, with each cache hit resetting this countdown. For organizations implementing AI-assisted development workflows, they can use this capability to balance performance with cost-efficiency.

Benefits of Amazon Bedrock prompt caching

Smartsheet is a leading cloud-based enterprise work management service, enabling millions of users worldwide to plan, manage, track, automate, and report on work at scale. Their engineering team identified an opportunity to enhance developer productivity by implementing AI-assisted coding capabilities across their organization. The solution needed to scale efficiently while maintaining reasonable costs, leading them to choose Roo Code with Amazon Bedrock with Anthropic’s Claude as their FM provider.

“Our engineering team has achieved a 60% reduction in operational costs and 20% improvement in response latency when using Roo Code with Amazon Bedrock prompt caching. These improvements directly translate to enhanced developer productivity across our organization,” says JB Brown, VP of Engineering at Smartsheet.

This fact is underscored by the team’s ability to rapidly generate comprehensive code documentation and architecture diagrams in a mere four hours, a task that previously consumed 2 weeks of manual effort. This newfound efficiency extends to critical areas like AWS cost management, where a vital analysis tool was built in just 30 minutes, and is projected to save Smartsheet $450,000 annually. Furthermore, the speed and accessibility of information have been significantly amplified, as evidenced by the 6–8 hours of senior engineer time saved by quickly explaining complex Terraform/Terragrunt infrastructure. These examples highlight how Amazon Bedrock prompt caching is not just about incremental gains, but about unlocking substantial time and resource savings, empowering our developers to focus on innovation rather than time-consuming manual processes.

Solution overview

When implementing AI-assisted coding at scale, Smartsheet faced several key challenges around managing costs associated with large language model (LLM) API calls, reducing latency for developer interactions and handling repetitive elements in prompts efficiently. The development team recognized that a significant portion of the prompts sent to Amazon Bedrock were repeated elements: the system prompt and the continuously building message history, presenting an opportunity for optimization.

To address these challenges, Smartsheet implemented a comprehensive solution centered around Roo Code integration with Amazon Bedrock and prompt caching. Smartsheet deployed Roo Code, powered by Amazon Bedrock and Anthropic’s Claude 3.7, across the organization and integrated seamlessly into existing development workflows. This provided developers with immediate access to AI assistance for code writing, reviewing, and debugging tasks.

The team then contributed a sophisticated Amazon Bedrock prompt caching system to Roo Code that identifies and stores common prompt elements, significantly reducing redundant data transmission to Bedrock. This caching layer proved crucial in optimizing both cost and performance metrics. The caching decision flow incorporates two key checks: first, it verifies if the selected model supports prompt caching, then it makes sure the system prompt meets a minimum token threshold to make caching worthwhile.

The solution integrates Roo Code directly into developers’ integrated development environments (IDE), using the Amazon Bedrock prompt caching capabilities. The architecture separates content into static elements (like system instructions and code context) and dynamic elements (like user queries). When a developer makes a request, static content is automatically marked with cache checkpoints and stored for 5 minutes, with each access refreshing this window.

For subsequent queries, the system checks for matching cached content before processing. When matches are found, it bypasses recomputation of these sections, significantly reducing both response time and costs.

The following diagram shows the integration points between Roo Code and Amazon Bedrock.

Technical workflow diagram showing data flow between user query, cache decision logic, and Amazon Bedrock with response handling

The following screenshot highlights the model provider settings and connection parameters required for implementation.

Amazon Bedrock configuration interface showing AWS authentication, region settings, and Claude model with caching enabled

Optimizing performance with prompt caching

When analyzing code files, much of the context—such as file contents and environment details—remains consistent across multiple queries. By caching this content, subsequent requests can reuse it from the cache, significantly reducing cost and latency.

In our example, we first asked Roo Code to analyze a Python file. We then sent multiple follow-up queries about different aspects of the code. The first query explained the overall structure, followed by questions about specific classes, unused modules, and potential improvements.

The usage section of the responses showed significant improvements through caching. During the first query, the model processed the input and wrote the cached content. For subsequent queries, 90% of the input was served from the cache. Because cache reads are 90% cheaper than processing new input tokens, this resulted in an 83% cost reduction for follow-up queries.

Here’s how the caching worked across queries:

  • First query – The model processed the entire file content and environment details, writing them to cache
  • Subsequent queries – The model loaded the cached content, only processing the new question text

To illustrate this, let’s look at the usage statistics from a series of queries:

Query 1: "Explain my @/src/models/product.py code"
{
    "inputTokens": 1051,
    "outputTokens": 902,
    "totalTokens": 19374,
    "cacheReadInputTokenCount": 0,
    "cacheWriteInputTokenCount": 12421
}

Query 2: "Describe what is ProductUpdate class used for"
{
    "inputTokens": 1078,
    "outputTokens": 694,
    "totalTokens": 21664,
    "cacheReadInputTokenCount": 19892,
    "cacheWriteInputTokenCount": 0
}

Query 3: "Which module is not in use?"
{
    "inputTokens": 4,
    "outputTokens": 774,
    "totalTokens": 22753,
    "cacheReadInputTokenCount": 19892,
    "cacheWriteInputTokenCount": 2083
}

Query 4: "Any improvement to implement to my code making it more efficient?"
{
    "inputTokens": 1097,
    "outputTokens": 1270,
    "totalTokens": 24342,
    "cacheReadInputTokenCount": 21975,
    "cacheWriteInputTokenCount": 0
} 

We can see from the results that the first query didn’t return cacheReadInputTokenCount but wrote to the cache the whole input. For subsequent queries, most of the input was read from the cache, except for the specific question.

The results clearly demonstrate the benefits of prompt caching:

  • 70% of total input was served from cache across queries
  • 90% of input for follow-up queries came from cache
  • Overall input costs reduced by 60%
  • Follow-up query costs reduced by 83%

This approach is particularly effective for development workflows where developers frequently query the same code base with different questions. The cached content provides necessary context for each query while significantly reducing processing overhead.

The following screenshot displays Roo Code’s interface, showing detailed metrics including token usage, cache hit rates, and associated API costs. The interface presents the cost savings achieved through caching and provides comprehensive usage analytics.

IDE showing Roo Code analysis with metrics, API costs, and detailed Python code explanation

Results and future innovation

The following line graph demonstrates the cost reduction trend over multiple queries runs, having the first query uncached, and the subsequent input using significant query responses from the cache. The y-axis shows costs in dollars, and the x-axis shows the cost per query. The graph shows a clear downward trend, with costs decreasing by 60% from the baseline.

Dual-axis graph displaying per-query and cumulative costs, demonstrating cost savings through caching over 4 queries

The implementation delivered remarkable improvements across key metrics. The prompt caching system achieved a 60% reduction in operational costs while simultaneously improving response latency by 20%.

The choice of Amazon Bedrock prompt caching proved particularly effective for Smartsheet’s coding assistance workflows. The 5-minute cache Time to Live (TTL) aligns perfectly with the natural rhythm of developer interactions with Roo Code, where multiple related queries typically occur within short timeframes. During active coding sessions, developers engage in rapid back-and-forth exchanges with the AI assistant—reviewing code, asking follow-up questions, and requesting modifications—while maintaining context through the cache. This agentic workflow, where each interaction builds upon previous context, takes full advantage of the prompt caching mechanism—each query refreshes the cache timer while preserving valuable context about the code being discussed.

Best practices and lessons learned

Through their implementation journey, Smartsheet developed several key best practices for organizations looking to implement similar solutions. Early implementation of prompt caching proved crucial, allowing the team to analyze prompt patterns and design efficient caching strategies from the start. The team found that focusing on developer experience through response time optimization and seamless tool integration was essential for successful adoption.

Continuous monitoring and analysis of usage patterns became a cornerstone of their optimization strategy. By tracking key metrics like response times and costs, the team could identify opportunities for further optimization and regularly adjust their caching strategies to maintain optimal performance.

Looking ahead

Smartsheet continues to innovate with Amazon Bedrock and Roo Code, exploring new ways to enhance developer productivity. Their engineering team is investigating advanced caching strategies and evaluating new FMs as they become available on Amazon Bedrock. The success of this implementation has established a strong foundation for future AI-assisted development initiatives.

Conclusion

Smartsheet’s implementation of Roo Code with Amazon Bedrock and prompt caching demonstrates how organizations can successfully deploy AI-assisted coding solutions while maintaining cost-efficiency. Their approach provides a blueprint for others looking to enhance developer productivity through AI while optimizing operational costs. The combination of strategic implementation, innovative caching solutions, and continuous optimization has enabled Smartsheet to achieve significant improvements in performance and cost metrics.

To learn more about implementing AI solutions at scale, refer to the Amazon Bedrock documentation and explore the AWS Machine Learning Blog. The frameworks and strategies outlined in this post can help organizations of varying size implement efficient, cost-effective AI-assisted development workflows.


About the Authors

Mastering Amazon Q Developer Part 1: Crafting Effective Prompts

Post Syndicated from Will Matos original https://aws.amazon.com/blogs/devops/mastering-amazon-q-developer-part-1-crafting-effective-prompts/

As organizations increasingly adopt AI-powered tools to enhance developer productivity, your ability to effectively communicate with these assistants becomes a valuable skill. This guide explores how you can craft prompts that deliver accurate, useful results when working with Amazon Q Developer.

Your success with Amazon Q Developer depends directly on how well you communicate with it. Through my work as a Principal Specialist Solutions Architect on the Next Generation Developer Experience team at AWS, I’ve observed that developers experience varying degrees of success based primarily on their approach to prompt construction. The difference between a vague request and a well-structured prompt can be the difference between wasted time and a productivity breakthrough.

Recent McKinsey research reveals that developers can complete tasks up to twice as fast with generative AI when using proper prompting techniques [1]. Even more impressive, developers tackling complex tasks are 25-30% more likely to complete them within given time-frames when using these tools effectively. These productivity gains aren’t automatic—they depend on mastering the art and science of prompt engineering.

Based on patterns observed across numerous customer interactions, this guide provides practical techniques to help you maximize the value of your AI-assisted development experience. You’ll learn how to transform your interactions to consistently produce helpful, relevant assistance that can dramatically improve your development workflow.

Key Takeaways

  • Structure your prompts with clear context, specific requirements, and desired output format
  • Include relevant technical details about your environment and constraints
  • Avoid vague requests and provide specific examples when possible
  • Use the provided prompt template to ensure consistent results

Getting Started with Amazon Q Developer

Already using Amazon Q Developer? Great! This guide will help you get more value from your interactions. If you haven’t set up Amazon Q Developer yet, check out the getting started guide.

Understanding the Impact of Good Prompts

The rapid adoption of AI technologies makes prompt engineering skills essential for today’s developers. McKinsey’s latest global survey reveals that 65% of organizations regularly use generative AI, nearly double from their previous survey. When developers master prompt engineering, they’re 25-30% more likely to complete complex tasks within given timeframes.

What Makes an Effective Prompt?

  • Specific Request: State exactly what you need
  • Clear Background: Describe your project, requirements, and constraints
  • Additional Context: Provide code, configuration, or other additional context
  • Expected Output: Specify how you want the information presented

Here’s how this works in practice:

Poor prompt:

How do I deploy a container on AWS?

Effective prompt:


I need to deploy a containerized Node.js e-commerce application that handles 
50,000 daily users with peak loads during promotional events.
Requirements:
- High availability across multiple regions
- MongoDB for persistence
- Auto-scaling capabilities

Please provide:
1. AWS architecture diagram
2. List of required services with configurations
3. Security best practices
4. Operational monitoring recommendations

Common Patterns to Avoid

Short or Vague Requests:

  • Add Docs
  • Make this better
  • Check this
'Add docs' simple prompt with generic response.

Not much to go on here. Amazon Q Developer will likely provide generic documentation.

'Check this' simple prompt with generic response.

Another vague prompt with a generic response.

Overly Broad Questions:

  • How do I use AWS?
  • What's the best practice?
  • Help with Lambda
Image showing the Amazon Q Developer IDE Chat panel where the user entered the vague prompt: 'Help with Lambda'. Amazon Q Developer responds by asking clarifying questions.

The prompt is so vague that Amazon Q Developer responds by asking clarifying questions.

Image showing the Amazon Q Developer chat pane where the user entered the prompt: "Create a Lambda function that processes S3 events."

The more specific prompt allows Amazon Q Developer to provide a more precise response.

Remember: The quality of information you receive directly correlates with the quality of the information you provide.

Proven Techniques for Better Results

To help you apply these principles consistently, I’ve developed a template structure that incorporates all the key elements of an effective prompt. This framework can be adapted for various scenarios and serves as a starting point for your interactions with Amazon Q Developer. While Amazon Q Developer will fill in some parts of this context (see the next post in this series), you just need to make sure this information is available.

These are the principles demonstrated in the template:

  • Technical Context Requirements
    1. Specify your technology stack and versions
    2. Include environment details
    3. Mention compliance requirements
    4. Define scale expectations
  • Example Specifications
    1. Include relevant code snippets
    2. Paste error messages
    3. Reference configuration files
    4. Show current architecture
  • Output Format Guidelines
    1. Request specific documentation formats
    2. Ask for diagrams when needed
    3. Specify code language preferences
    4. Indicate level of detail needed
Image showing the Amazon Q Developer chat panel with the user submitted prompt: "Document the requirements for an application that will process images. Format as a technical requirements document using markdown markup. Output as a single markdown code-block." The response is much more detailed, and aligns with the user's request.

The specification of the output format ensure the response is what you expect.

Quick Reference Prompt Template

Use this template to structure your prompts:


[Business Context] 
- Project description: 
- Performance requirements: 
- Compliance needs: 
- Scale expectations: 

[Technical Details] 
- Current technology stack: 
- Versions/dependencies: 
- Technical constraints: 
- Environment details: 

[Specific Request] 
- Task description: 
- Expected outcome: 
- Special considerations: 

[Output Format] 
- Desired format: 
- Level of detail: 
-  Examples needed: 
- Additional requirements:

Best Practices for Daily Use

Successfully working with Amazon Q Developer requires consistent application of proven practices. These guidelines, developed through extensive customer interactions, will help you maximize the value of your AI-assisted development experience.

  • Start with clear business objectives
  • Include relevant technical constraints
  • Specify performance requirements
  • Request specific output formats
  • Provide examples when possible

Through extensive customer interactions, we’ve found that following these practices consistently produces better results and reduces the need for follow-up clarification.

Take Action Now

Additional Resources

What’s Next?

In the next part of this series, we’ll explore advanced context management in Amazon Q Developer and dive into the new prompt catalog features. You’ll learn how to:

  • Build and maintain context across multiple interactions
  • Use the prompt catalog effectively
  • Handle complex, multi-step development tasks
  • Optimize responses for your specific use cases

Stay tuned, and start applying these techniques today to transform how you build on AWS!

About the author:

Will Matos

Will Matos is a Principal Specialist Solutions Architect at AWS, revolutionizing developer productivity through Generative AI, AI-powered chat interfaces, and code generation. With 25 years of tech experience, he collaborates with product teams to create intelligent solutions that streamline workflows and accelerate software development cycles. A thought leader engaging early adopters, Will bridges innovation and real-world needs.

Navigate Bulk Sender Requirements with Amazon SES

Post Syndicated from Vinay Ujjini original https://aws.amazon.com/blogs/messaging-and-targeting/navigate-bulk-sender-requirements-with-amazon-ses/

Introduction

Email communication remains a critical component of business operations and customer engagement. As the digital landscape evolves, major mailbox providers continually update their policies to enhance security and user experience. This blog will explore the changes implemented by Microsoft for bulk senders trying to reach Outlook.com (supporting Hotmail.com, live.com consumer domain addresses). This follows the Google & Yahoo! bulk sender requirements changes in February of 2024. Microsoft is implementing the enforcement of sender requirements for bulk email senders, particularly those sending over 5,000 messages daily, starting May 5, 2025. These requirements focus on improving email authentication and trust. This will ensure Outlook and Hotmail recipients are receiving messages that are authenticated and from who they claim to be from. These measures will help reduce spoofing, phishing, and spam, and safeguarding individuals and businesses relying on email.

This blog will discuss what these changes mean for you, and how Amazon Simple Email Service (Amazon SES) can help you maintain compliance and optimize your email sending practices.

Background

In February 2024, Google and Yahoo implemented new requirements for bulk email senders, building upon industry efforts to combat spam and improve email deliverability. These changes aligned with Google’s 2024 bulk sender requirements initiative, signaling a unified approach among major mailbox providers to enhance the privacy and compliance in email.

What does this mean for customers and email senders?

What’s Changing?

Microsoft’s New Requirements

  1. DMARC enforcement with at least a p=none policy
  2. Sender domain authentication (SPF, DKIM)
  3. Functional unsubscribe links required in the email
  4. Requirement for From and Reply-to addresses to be deliverable

Why These Changes Matter?

These new requirements serve several crucial purposes:

  1. Enhances trust in your sending domain: Validates that the sender is who they are claiming to be. Enhances trust by delivering messages that are authenticated and aligned with the bulk sender requirements.
  2. Improved Deliverability: Ensuring legitimate emails reach the recipients who have subscribed to sender’s messages.
  3. User-Centric: Providing recipients with control over their inboxes.
  4. Industry Standardization: Aligning sender requirements across major email providers

Best Practices for Compliance

To adhere to these new requirements and optimize your email sending practices, consider the following best practices:

1. Implement Strong Authentication

  • Configure SPF: SPF (Sender Policy Framework) is an email authentication standard that’s designed to prevent email spoofing. Domain owners use SPF to tell email providers which servers are allowed to send email from their domains. Follow setup instructions to authenticate your email with SPF. Must pass SPF for sending domain.
    • Configure “custom MAIL FROM“, which is how senders can ensure that the SPF-authenticated domain is aligned with the From header domain’s DMARC policy.
  • Enable DKIM signing: DomainKeys Identified Mail (DKIM) is an email security standard. It is designed to ensure that an email that claims to have come from a specific domain, was indeed authorized by the owner of that domain. It uses public-key cryptography to sign an email with a private key. Recipient servers use a public key, published to a domain’s DNS to verify that parts of the email have not been modified during the transit. Follow these set up instructions to authenticate email with DKIM in SES. Must pass to validate email integrity and authenticity.
    • Verify your domain with Easy DKIM. If currently using email identities, you have to move to domain
    • If you utilize email identities only, you will default all authentication to amazonses.com. That will not align with your friendly from address which will not satisfy the bulk sender requirements. This means that when you send email to mailbox providers, your messages will be rejected because you do not have proper authentication on your emails. To satisfy the bulk sender requirements, you must use domain verified identities which ensure that you have ownership of or permission to use the sending domain. That will allow SES to sign the outgoing emails with a DKIM signature that aligns with the friendly from domain.
  • Set up DMARC with an appropriate policy: Domain-based Message Authentication, Reporting and Conformance (DMARC) is an email authentication protocol that uses SPF and DKIM to detect email spoofing and phishing. To comply with DMARC, messages must be authenticated through either SPF or DKIM. Ideally, when both are used with DMARC, you’ll be ensuring the highest level of protection possible for your email sending.
Name Type Value
_dmarc.example.com TXT “v=DMARC1;p=none;rua=mailto:[email protected]

In the preceding records:

    • example.com is your domain
    • Value of the TXT record contains the DMARC policy that applies to your domain.
    • In this example, the policy tells email providers to do the following:
      • At least p=none should be implemented.

2. Optimize Email Content

  • Clearly identify yourself as the sender: Use a recognizable “From” name and email address that accurately represents your brand or organization. For example, use “[email protected]” instead of a generic or misleading address.
  • Implement user friendly unsubscribe mechanisms: Include a visible, easy-to-use unsubscribe link in every email, typically in the header. Ensure the unsubscribe process is simple and honors requests promptly, ideally within 24-48 hours. Visit this guide on how Amazon SES helps you do that.
  • Subject line aligns with content: Avoid deceptive subject lines that don’t match the email content.
  • Clearly identify commercial content: If your email is promotional, make it obvious. Use clear language in the subject line and body that indicates the nature of the email, such as “Special Offer” or “Newsletter.”
  • Include a valid physical address: Add your company’s physical mailing address in the email footer. This is not only a legal requirement in many jurisdictions but builds trust with recipients.
  • Verify URLS in the emails: Verify that links in the emails you send work and are not misleading to the reader/subscriber. Be transparent with URLs/links in the email content.

3. Monitor and Maintain

  • Monitor bounces: A bounce typically indicates why a message was not delivered. The SMTP response in the bounce message will have details on why the message was bounced. For example: if it is missing authentication records (fix: include authentication records for the domain – quick fix) versus an IP or domain reputation bounce reason (this maybe a longer term fix).
    • Track both hard bounces (permanent delivery failures) and soft bounces (temporary issues). High bounce rates can indicate list quality problems or delivery issues. Visit this blog to set up notifications for bounces & complaints. Virtual Deliverability Manager (VDM) is an Amazon SES feature that helps you enhance email deliverability. It helps increasing inbox deliverability and email conversions, by providing insights into your sending and delivery data. VDM advices on how to fix the issues that are negatively affecting your delivery success rate and reputation.
  • Track complaint rates: Regularly monitor the number of spam complaints your emails receive with a goal of keeping the complaint rate under 0.2%. Not all mailbox providers have complaint feedback loop data, so use aggregate data from the mailbox providers that do, such as Hotmail and Yahoo. Email providers that don’t provide complaint feedback loops, such as Gmail may have alternative dashboards or tools available like Google Postmaster tools.
  • Perform regular authentication checks: Periodically verify that your SPF, DKIM, and DMARC records are correctly set up and functioning. Alternative to manual DNS checks, Amazon SES has a feature in Virtual Deliverability Manager that performs authentication checks for your sending identities.
  • Maintain list hygiene: Regularly clean your email list by removing inactive subscribers, correcting typos in email addresses, and honoring unsubscribe requests. This helps improve deliverability and engagement rates.

How Amazon SES Helps

Amazon SES provides a robust set of features to help you meet these new requirements and optimize your email sending practices:

Authentication Support

  • Easy DKIM configuration
  • SPF record management
  • DMARC implementation guidance

Comprehensive Monitoring

  • Virtual Deliverability Manager
  • Complaint tracking
  • Bounce rate monitoring
  • Event publishing to Amazon CloudWatch, SNS , Kinesis Firehose and Event Bridge
  • Detailed sending statistics

Compliance Tools

  • List management capabilities (included with SES)
  • Suppression list handling (included with SES)
  • Feedback loop processing (included with SES)
  • Authentication status tracking: This is done through Amazon SES feature Virtual Deliverability Manager (VDM).

Implementation Strategy

To successfully implement these changes, consider the following strategy:

  1. Assessment: Audit your current email practices, review authentication status, and evaluate compliance gaps.
  2. Technical Implementation: Configure authentication protocols, update DNS records, and implement required unsubscribe mechanisms.
  3. Monitoring and Optimization: Track deliverability metrics, monitor complaint rates, and adjust sending practices as needed.

Measuring Success

To ensure ongoing compliance and optimize your email practices, track these key metrics:

  1. Delivery rates
  2. Complaint rates
  3. Authentication pass rates
  4. Engagement metrics (open rates, click-through rates)

Conclusion

The new bulk sender requirements from Microsoft and Yahoo represent an important step towards a more secure and reliable email ecosystem. By leveraging Amazon SES’s powerful features and following industry best practices, you can maintain compliance, improve deliverability, and enhance the overall effectiveness of your email communications.

Amazon SES is committed to helping you navigate these changes and optimize your email sending practices. For the most up-to-date guidance and support, please consult SES’s documentation or contact Amazon SES support.

Additional Resources

The email landscape is constantly evolving. Stay informed and adaptable to ensure your email practices remain effective and compliant.

About the authors:

Monitoring and optimizing the cost of the unused access analyzer in IAM Access Analyzer

Post Syndicated from Oscar Diaz original https://aws.amazon.com/blogs/security/monitoring-and-optimizing-the-cost-of-the-unused-access-analyzer-in-iam-access-analyzer/

AWS Identity and Access Management (IAM) Access Analyzer is a feature that you can use to identify resources in your AWS organization and accounts that are shared with external entities and to identify unused access. In this post, we explore how the unused access analyzer in IAM Access Analyzer works, dive into the cost implications, and share practical approaches to manage and optimize how you use it with a primary focus on cost optimization.

Note: While security best practices for managing AWS Identity and Access Management (IAM) resources are critical, this post emphasizes cost-saving strategies rather than detailed security guidance. We don’t cover step-by-step implementation details for the recommendations here; instead, we provide links to resources that you can use as guides for the process.

Understanding the unused access analyzer in IAM Access Analyzer

IAM Access Analyzer has two capabilities to generate findings:

  • External access analysis (no additional charge): Identifies resources shared with external entities. It requires one analyzer per AWS Region where you have resources.
  • Unused access analysis (paid): Detects unused roles, access keys, and permissions. It requires only one analyzer per AWS account and analyzes IAM roles and users across Regions from a single analyzer.

Both external access analysis and unused access analysis support AWS Organizations and you can create a single analyzer per organization (in the case of external access analysis, per organization per Region).

IAM Access Analyzer unused access analysis costs $0.20 per IAM role or user analyzed each month. The charges for existing roles and users happen at the beginning of the month. As new roles and users are added throughout the month, they are analyzed and charged at a rate of $0.20 per role or user. To help avoid duplicate charges, create only one unused access analyzer per account if using an account-level analyzer, or one unused access analyzer for the entire organization if using an organizational-level analyzer. You should avoid deleting and recreating an analyzer. If you recreate an analyzer, you will be charged again for the analysis.

Reviewing and optimizing your usage

Before taking any actions to reduce costs, it’s crucial to understand your current usage. You can use the AWS Cost and Usage Report (AWS CUR) to identify how many unused access analyzers you have in your environment. To learn more, see Querying Cost and Usage Reports using Amazon Athena.

Use the following Athena query on your CUR data to identify the unused access analyzers within your organization. Replace <CUR_TABLE> with the name of your CUR table.

SELECT
line_item_usage_type,
product_region,
line_item_resource_id,
bill_payer_account_id,
line_item_usage_account_id,
SUM(line_item_unblended_cost)
FROM <CUR_TABLE>
WHERE line_item_product_code = 'AWSIAMAccessAnalyzer'
AND line_item_line_item_type = 'Usage'
GROUP BY
line_item_usage_type,
product_region,
line_item_resource_id,
bill_payer_account_id,
line_item_usage_account_id

This query will give you a comprehensive view of your IAM Access Analyzer usage across your organization, including the cost per analyzer.

Now, let’s walk through four things that you can do today to optimize your IAM Access Analyzer unused access analysis costs.

Consolidate unused analyzers

Review your AWS CUR analysis results to identify opportunities for consolidation. If you’re using an organizational unused access analyzer, you should use a single analyzer. If you’re using an unused access analyzer per account, make sure a single account doesn’t have more than one analyzer.

Use tags to exclude some roles or users

Consider using tags to exclude certain roles or users from analysis. This approach can help scope your analysis and reduce costs by avoiding roles and users that you don’t want to analyze. To do this, you’ll need to implement a tagging strategy for your IAM roles and users, identifying principals that might not require regular access analysis. Then, when creating or modifying an analyzer, use exclusion to skip analysis of tagged IAM roles and users. Regularly review your exclusion strategy to validate that it aligns with your organization’s security policies and compliance requirements.

For a deeper dive into this process, including step-by-step guidance and practical examples, see Customize the scope of IAM Access Analyzer unused access analysis.

Regular clean-up of IAM roles and users

Periodically review and remove unnecessary IAM roles and users. Because IAM Access Analyzer unused access analysis charges are based on the number of roles and users analyzed, removing unused roles and users will help reduce unused access findings cost. This is also a security best practice for IAM.

Monitor and adjust

Set up AWS Budgets or AWS Cost Anomaly Detection to track your IAM Access Analyzer unused access analysis costs. Create alerts for when costs exceed expected thresholds. By using the proactive approach, you can quickly identify and address unexpected cost increases.

Conclusion

IAM Access Analyzer is a valuable tool for improving your organization’s security posture by detecting unused IAM roles, unused access keys for IAM users, unused passwords for IAM users, and unused services and actions for active IAM roles and users. You can then act based on those findings and support your effort to achieve least privilege access. By understanding the billing model and implementing these cost optimization strategies, you can maximize benefits while keeping costs under control. Remember, cost optimization is an ongoing process. Regularly review your usage and adjust your strategy as your needs evolve.

To learn more about IAM Access Analyzer and its pricing, see Getting started with AWS Identity and Access Management Access Analyzer. We’re here to help you optimize your AWS environment, so reach out to AWS Support and your AWS account team if you need further assistance.

If you have feedback about this post, submit comments in the Comments section below.

Oscar Diaz

Oscar Diaz Cordovez

Oscar is a Senior Technical Account Manager specializing in cloud operations and security. His passion for technology and innovation drives his expertise in cloud-native architectures, DevOps practices, and automation.

Avi Harari

Avi Harari

Avi is a Senior Technical Account Manager at AWS supporting Enterprise customers with the adoption and use of AWS services. He is part of the AWS Cloud Operations technical community, focusing on Cloud governance and compliance on AWS.

Petabyte-scale data migration made simple: AppsFlyer’s best practice journey with Amazon EMR Serverless

Post Syndicated from Roy Ninio original https://aws.amazon.com/blogs/big-data/petabyte-scale-data-migration-made-simple-appsflyers-best-practice-journey-with-amazon-emr-serverless/

This post is co-written with Roy Ninio from Appsflyer.

Organizations worldwide aim to harness the power of data to drive smarter, more informed decision-making by embedding data at the core of their processes. Using data-driven insights enables you to respond more effectively to unexpected challenges, foster innovation, and deliver enhanced experiences to your customers. In fact, data has transformed how organizations drive decision-making, but historically, managing the infrastructure to support it posed significant challenges and required specific skill sets and dedicated personnel. The complexity of setting up, scaling, and maintaining large-scale data systems impacted agility and pace of innovation. This reliance on experts and intricate setups often diverted resources from innovation, slowed time-to-market, and hindered the ability to respond to changes in industry demands.

AppsFlyer is a leading analytics and attribution company designed to help businesses measure and optimize their marketing efforts across mobile, web, and connected devices. With a focus on privacy-first innovation, AppsFlyer empowers organizations to make data-driven decisions while respecting user privacy and compliance regulations. AppsFlyer provides tools for tracking user acquisition, engagement, and retention, delivering actionable insights to enhance ROI and streamline marketing strategies.

In this post, we share how AppsFlyer successfully migrated their massive data infrastructure from self-managed Hadoop clusters to Amazon EMR Serverless, detailing their best practices, challenges to overcome, and lessons learned that can help guide other organizations in similar transformations.

Why AppsFlyer embraced a serverless approach for big data

AppsFlyer manages one of the largest-scale data infrastructures in the industry, processing 100 PB of data daily, handling millions of events per second, and running thousands of jobs across nearly 100 self-managed Hadoop clusters. The AppsFlyer architecture is comprised of many data engineering open source technologies, including but not limited to Apache Spark, Apache Kafka, Apache Iceberg, and Apache Airflow. Although this setup has powered operations for years, the growing complexity of scaling resources to meet fluctuating demands, coupled with the operational overhead of maintaining clusters, prompted AppsFlyer to rethink their big data processing strategy.

EMR Serverless is a modern, scalable solution that alleviates the need for manual cluster management while dynamically adjusting resources to match real-time workload requirements. With EMR Serverless, scaling up or down happens within seconds, minimizing idle time and interruptions like spot terminations.

This shift has freed engineering teams to focus on innovation, improved resilience and high availability, and future-proofed the architecture to support their ever-increasing demands. By only paying for compute and memory resources used during runtime, AppsFlyer also optimized costs and minimized charges for idle resources, marking a significant step forward in efficiency and scalability.

Solution overview

AppsFlyer’s previous architecture was built around self-managed Hadoop clusters running on Amazon Elastic Compute Cloud (Amazon EC2) and handled the scale and complexity of the data workflows. Although this setup supported operational needs, it required substantial manual effort to maintain, scale, and optimize.

AppsFlyer orchestrated over 100,000 daily workflows with Airflow, managing both streaming and batch operations. Streaming pipelines used Spark Streaming to ingest real-time data from Kafka, writing raw datasets to an Amazon Simple Storage Service (Amazon S3) data lake while simultaneously loading them into BigQuery and Google Cloud Storage to build logical data layers. Batch jobs then processed this raw data, transforming it into actionable datasets for internal teams, dashboards, and analytics workflows. Additionally, some processed outputs were ingested into external data sources, enabling seamless delivery of AppsFlyer insights to customers across the web.

For analytics and fast queries, real-time data streams were ingested into ClickHouse and Druid to power dashboards. Additionally, Iceberg tables were created from Delta Lake raw data and made accessible through Amazon Athena for further data exploration and analytics.

With the migration to EMR Serverless, AppsFlyer replaced its self-managed Hadoop clusters, bringing significant improvements to scalability, cost-efficiency, and operational simplicity.

Spark-based workflows, including streaming and batch jobs, were migrated to run on EMR Serverless and take advantage of the elasticity of EMR Serverless, dynamically scaling to meet workload demands.

This transition has significantly reduced operational overhead, alleviating the need for manual cluster management, so teams can focus more on data processing and less on infrastructure.

The following diagram illustrates the solution architecture.

This post reviews the main challenges and lessons learned by the team at AppsFlyer from this migration.

Challenges and lessons learned

Migrating a large-scale organization like AppsFlyer, with dozens of teams, from Hadoop to EMR Serverless was a significant challenge—especially because many R&D teams had limited or no prior experience managing infrastructure. To provide a smooth transition, AppsFlyer’s Data Infrastructure (DataInfra) team developed a comprehensive migration strategy that empowered the R&D teams to seamlessly migrate their pipelines.

In this section, we discuss how AppsFlyer approached the challenge and achieved success for the entire organization.

Centralized preparation by the DataInfra team

To provide a seamless transition to EMR Serverless, the DataInfra team took the lead in centralizing preparation efforts:

  • Clear ownership – Taking full responsibility for the migration, the team planned, guided, and supported R&D teams throughout the process.
  • Structured migration guide – A detailed, step-by-step guide was created to streamline the transition from Hadoop, breaking down the complexities and making it accessible to teams with limited infrastructure experience.

Building a strong support network

To make sure the R&D teams had the resources they needed, AppsFlyer established a robust support environment:

  • Data community – The primary resource for answering technical questions. It encouraged knowledge sharing across teams and was spearheaded by the DataInfra team.
  • Slack support channel – A dedicated channel where the DataInfra team actively responded to questions and guided teams through the migration process. This real-time support significantly reduced bottlenecks and helped teams resolve issues quickly.

Infrastructure templates with best practices

Recognizing the complexity of the team’s migration, the DataInfra team had standardized templates to help teams start quickly and efficiently:

  • Infrastructure as code (IaC) templates – They developed Terraform templates with best practices for building applications on EMR Serverless. These templates included code examples and real production workflows already migrated to EMR Serverless. Teams could quickly bootstrap their projects by using these ready-made templates.
  • Cross-account access solutions – Operating across multiple AWS accounts required managing secure access between EMR Serverless accounts (where jobs run) and data storage accounts (where datasets reside). To streamline this, a step-by-step module was developed for setting up cross-account access using Assume Role permissions. Additionally, a dedicated repository was created, so teams can define and automate role and policy creation, providing seamless and scalable access management.

Airflow integration

As AppsFlyer’s primary workflow scheduler, Airflow plays a critical role, making it essential to provide a seamless transition for its users.

AppsFlyer developed a dedicated Airflow operator for executing Spark jobs on EMR Serverless, carefully designed to replicate the functionality of the existing Hadoop-based Spark operator. In addition, a Python package was made available across all Airflow clusters with the relevant operators. This approach minimized code changes, allowing teams to transition seamlessly with minimal modifications.

Solving common permission challenges

To streamline permissions management, AppsFlyer developed targeted solutions for frequent use cases:

  • Comprehensive documentation – Provided detailed instructions for handling permissions for services like Athena, BigQuery, Vault, GIT, Kafka, and many more.
  • Standardized Spark defaults configuration for teams to apply to their applications – Included built-in solutions for collecting lineage from Spark jobs running on EMR Serverless, providing accountability and traceability.

Continuous engagement with R&D teams

To promote progress and maintain alignment across teams, AppsFlyer introduced the following measures:

  • Weekly meetings – Weekly status meetings to review the status of each team’s migration efforts. Teams shared updates, challenges, and commitments, fostering transparency and collaboration.
  • Assistance – Proactive assistance was provided for issues raised during meetings to minimize delays. This made sure that the teams were on track and had the support they needed to meet their commitments.

By implementing these strategies, AppsFlyer transformed the migration process from a daunting challenge into a structured and well-supported journey. Key outcomes included:

  • Empowered teams – R&D teams with minimal infrastructure experience were able to confidently migrate their pipelines.
  • Standardized practices – Infrastructure templates and predefined solutions provided consistency and best practices across the organization.
  • Reduced downtime – The custom Airflow operator and detailed documentation minimized disruptions to existing workflows.
  • Cross-account compatibility – With seamless cross-account access, teams could run jobs and access data efficiently.
  • Improved collaboration – The data community and Slack support channel fostered a sense of collaboration and shared responsibility across teams.

Migrating an entire organization’s data workflows to EMR Serverless is a complex task, but by investing in preparation, templates, and support, AppsFlyer successfully streamlined the process for all R&D teams in the company.

This approach can serve as a model for organizations undertaking similar migrations.

Spark application code management and deployment

For AppsFlyer data engineers, developing and deploying Spark applications is a core daily responsibility. The Data Platform team focuses on identifying and implementing the right set of tools and safeguards that would not only simplify the migration to EMR Serverless, but also streamline ongoing operations.

There are two different approaches available for running Spark code on EMR Serverless: custom container images and JARs or Python files. At the beginning of the exploration, custom images looked promising because it allows greater customization than JARs, which should allow the DataInfra team smoother migration for existing workloads. After deeper research, it was realized that custom images have great power, but come with a cost that in large scale would need to be evaluated. Custom images presented the following challenges:

  • Custom images are supported as of version 6.9.0, but some of AppsFlyer’s workloads used earlier versions.
  • EMR Serverless resources run from the moment EMR Serverless begins downloading the image until workers are stopped. This means a payment is done for aggregate vCPU, memory, and storage resources during the image download phase.
  • They required a different continuous integration and delivery (CI/CD) approach than compiling a JAR or Python file, leading to operational work that should be minimized as much as possible.

AppsFlyer decided to go all in with JARs and allow only in unique cases, where the customization required the use of custom images. Eventually, it was realized that using non-custom images was suitable for AppsFlyer use cases.

CI/CD perspective

From a CI/CD perspective, AppsFlyer’s DataInfra team decided to align with AppsFlyer’s GitOps vision, making sure that both infrastructure and application code are version-controlled, built, and deployed using Git operations.

The following diagram illustrates the GitOps approach AppsFlyer adopted.

JARs continuous integration

For CI, the process in charge of building the application artifacts, several options have been explored. The following key considerations drove the exploration process:

  • Use Amazon S3 as the native JAR source for EMR Serverless
  • Support different versions for the same job
  • Support staging and production environments
  • Allow hotfixes, patches, and rollbacks

Using AppsFlyer’s current external package repository led to challenges, because it required them to build a custom delivery into Amazon S3 or a complex runtime ability to fetch the code externally.

Using Amazon S3 directly also had several alternative approaches:

  • Buckets – Use single vs. separated buckets for staging and production
  • Versions – Use Amazon S3 native object versioning vs. uploading a new file
  • Hotfix – Override the same job’s JAR file vs. uploading a new one

Finally, the decision was to go with immutable builds for consistent deployment across the environments.

Each Spark job git repository pushes to the main branch, triggers a CI process to validate the semantic versioning (semver) assignment, compiles the JAR artifact, and uploads it to Amazon S3. Each artifact is uploaded to three different paths according to the version of the JAR, and also include a version tag for the S3 object:

  • <BucketName>/<SparkJobName>/<major>"."<minor>"."<patch>/app.jar
  • <BucketName>/<SparkJobName>/<major>"."<minor>"/app.jar
  • <BucketName>/<SparkJobName>/<major>/app.jar

AppsFlyer can now have deep granularity and assign each EMR Serverless job to a pinpointed version. Some jobs can run with the latest major version, and other stability and SLA sensitive jobs require a lock to a specific patch version.

EMR Serverless continuous deployment

Uploading the files to Amazon S3 was the final step in the CI process, which then leads to a different CD process.

CD is done by changing the infrastructure code, which is Terraform based, to point to the new JAR that was uploaded to Amazon S3. Then the staging or production application can start using the newly uploaded code and the process can be considered deployed.

Spark application rollbacks

If they need an application rollback, AppsFlyer points the EMR Serverless job IaC configuration from the current impaired JAR version to the previous stable JAR version in the relevant Amazon S3 path.

AppsFlyer believes that every automation impacting production, like CD, requires a breaking glass mechanism for an emergency situation. In such cases, AppsFlyer can manually override the needed S3 object (JAR file) while still using Amazon S3 versions in order to have better visibility and manual version control.

Single-job vs. multi-job applications

When using EMR Serverless, one important architectural decision is whether to create a separate application for each Spark job or use an automatic scaling application shared across multiple Spark jobs. The following table summarizes these considerations.

Aspect Single-Job Application Multi-Job Application
Logical Nature Dedicated application for each job. Shared application for multiple jobs.
Shared Configurations Limited shared configurations; each application is independently configured. Allows shared configurations through spark-defaults, including executors, memory settings, and JARs.
Isolation Maximum isolation; each job runs independently. Maintains job-level isolation through distinct IAM roles despite sharing the application.
Flexibility Flexible for unique configurations or resource requirements. Reduces overhead by reusing configurations and using automatic scaling.
Overhead Higher setup and management overhead due to multiple applications. Lower administrative overhead but requires careful resource contention management.
Use Cases Suitable for jobs with unique requirements or strict isolation needs. Ideal for related workloads that benefit from shared settings and dynamic scaling.

By balancing these considerations, AppsFlyer tailored its EMR Serverless usage to efficiently meet the demands of diverse Spark workloads across their teams.

Airflow operator: Simplifying the transition to EMR Serverless

Before the migration to EMR Serverless, AppsFlyer’s teams relied on a custom Airflow Spark operator created by the DataInfra team.

This operator, packaged as a Python library, was integrated into the Airflow environment and became a key component of the data workflows.

It provided essential capabilities, including:

  • Retries and alerts – Built-in retry logic and PagerDuty alert integration
  • AWS role-based access – Automatic fetching of AWS permissions based on role names
  • Custom defaults – Setting Spark configurations and package defaults tailored for each job
  • State management – Job state tracking

This operator streamlined running Spark jobs on Hadoop and was highly tailored to AppsFlyer’s requirements.

When moving to EMR Serverless, the team chose to build a custom Airflow operator to align with their existing Spark-based workflows. They already had dozens of Directed Acyclic Graphs (DAGs) in production, so with this approach, they could maintain their familiar interface, including custom handling for retries, alerting, and configurations—all without requiring broad changes across the board.

This abstraction provided a smoother migration by preserving the same development patterns and minimizing the migration efforts of adapting to the native operator semantics.

The DataInfra team developed a dedicated, custom, EMR Serverless operator to support the following goals:

  • Seamless migration – The operator was designed to closely mimic the interface of the existing Spark operator on Hadoop. This made sure that teams could migrate with minimal code changes.
  • Feature parity – They added the features missing from the native operator:
    • Built-in retry logic.
    • PagerDuty integration for alerts.
    • Automatic role-based permission fetching.
    • Default Spark configurations and package support for each job.
  • Simplified integration – It’s packaged as a Python library available in Airflow clusters. Teams could use the operator just like they did with the previous Spark operator.

The custom operator abstracts some of the underlying configurations required to submit jobs to EMR Serverless, aligning with AppsFlyer’s internal best practices and adding essential features.

The following is from an example DAG using the operator:

return SparkBatchJobEmrServerlessOperator(
    task_id=task_id,  # Unique task identifier in the DAG

    jar_file=jar_file,  # Path to the Spark job JAR file on S3
    main_class="<main class path>",

    spark_conf=spark_conf,

    app_id=default_args["<emr_serverless_application_id>"],  # EMR Serverless app ID
    execution_role=default_args["<job_execution_role_arn>"],  # IAM role for job execution

    polling_interval_sec=120,  # How often to poll for job status
    execution_timeout=timedelta(hours=1),  # Max allowed runtime

    retries=5,  # Retry attempts for failed jobs
    app_args=[],  # Arguments to pass to the Spark job

    depends_on_past=True,  # Ensure sequential task execution

    tags={'owner': '<team_tag>'},  # Metadata for ownership
    aws_assume_role="<my_aws_role>",  # Role for cross-account access

    alerting_policy=ALERT_POLICY_CRITICAL.with_slack_channel(sc),  # Alerting integration
    owner="<team_owner>",

    dag=dag  # DAG this task belongs to
)

Cross-account permissions on AWS: Simplifying EMRs workflows

AppsFlyer operates across multiple AWS accounts, creating a need for secure and efficient cross-account access. EMR Serverless jobs are executed in the production account, and the data they process resides in a separate data account. To enable seamless operation, Assume Role permissions are used to verify that EMR Serverless jobs running in the production account can access the data and services in the data account. The following diagram illustrates this architecture.

Below is a diagram demonstrating the cross-account permissions AppsFlyer adopted:

Role management strategy

To manage cross-account access efficiently, three distinct roles were created and maintained:

  • EMR role – Used for executing and managing EMR Serverless applications in the production account. Integrated directly into Airflow workers to make it available for the DAGs on the dedicated team Airflow cluster.
  • Execution role – Assigned to the Spark job running on EMR Serverless. Passed by the EMR role in the DAG code to provide seamless integration.
  • Data role – Resides in the data account and is assumed by the execution role to access data stored in Amazon S3 and other AWS services.

To enforce access boundaries, each role and policy is tagged with team-specific identifiers.
This makes sure that teams can only access their own data and roles, minimizing unauthorized access to other teams’ resources.

Simplifying Airflow migration

A streamlined process to make cross-account permissions transparent for teams migrating their workloads to EMR Serverless was developed:

  1. The EMR role is embedded into Airflow workers, making it available for DAGs in the dedicated Airflow cluster for each team:
{
   "Version":"2012-10-17",
   "Statement":[
      "..."{
         "Effect":"Allow",
         "Action":"iam:PassRole",
         "Resource":"arn:aws:iam::account-id:role/execution-role",
         "Condition":{
            "StringEquals":{
               "iam:ResourceTag/Team":"team-tag"
            }
         }
      }
   ]
}
  1. The EMR role automatically passes the execution role to the job within the DAG code:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": "arn:aws:iam::data-account-id:role/data-role",
      "Condition": {
        "StringEquals": {
          "iam:ResourceTag/Team": "team-tag"
        }
      }
    }
  ]
}
  1. The execution role assumes the data role dynamically during job execution to access the required data and services in the data account:

Allows the Execution Role in the Production account to assume the Data Role.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::production-account-id:role/execution-role"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
  1. Policies, trust relationships, and role definitions are managed in a dedicated GitLab repository. GitLab CI/CD pipelines automate the creation and integration of roles and policies, providing consistency and reducing manual overhead.

Benefits of AppsFlyer’s approach

This approach offered the following benefits:

  • Seamless access – Teams no longer need to handle cross-account permissions manually because these are automated through preconfigured roles and policies, providing seamless and secure access to resources across accounts.
  • Scalable and secure – Role-based and tag-based permissions provide security and scalability across multiple teams and accounts. By using roles and tags, it alleviates the need to create separate hardcoded policies for each team or account. Instead, they can define generalized policies that scale automatically as new resources, accounts, or teams are added.
  • Automated management – GitLab CI/CD streamlines the deployment and integration of policies and roles, reducing manual effort while enhancing consistency. It also minimizes human errors, improves change transparency, and simplifies version management.
  • Flexibility for teams – Teams have the flexibility to use their own or native EMR Serverless operators while maintaining secure access to data.

By implementing a robust, automated cross-account permissions system, AppsFlyer has enabled secure and efficient access to data and services across multiple AWS accounts. This makes sure that teams can focus on their workloads without worrying about infrastructure complexities, accelerating their migration to EMR Serverless.

Integrating lineage into EMR Serverless

AppsFlyer developed a robust solution for column-level lineage collection to provide comprehensive visibility into data transformations across pipelines. Lineage data is stored in Amazon S3 and subsequently ingested into DataHub, AppsFlyer’s lineage and metadata management environment.

Currently, AppsFlyer collects column-level lineage from a variety of sources, including Amazon Athena, BigQuery, Spark, and more.

This section focuses on how AppsFlyer collects Spark column-level lineage specifically within the EMR Serverless infrastructure.

Collecting Spark lineage with Spline

To capture lineage from Spark jobs, AppsFlyer uses Spline, an open source tool designed for automated tracking of data lineage and pipeline structures.

AppsFlyer modified Spline’s default behavior to output a customized Spline object that aligns with AppsFlyer’s specific requirements. AppsFlyer adapted the Spline integration into both legacy and modern environments. In the pre-migration phase, they injected the Spline agent into Spark jobs through their customized Airflow Spark operator. In the post-migration phase, they integrated Spline directly into EMR Serverless applications.

The lineage workflow consists of the following steps:

  1. As Spark jobs execute, Spline captures detailed metadata about the queries and transformations performed.
  2. The captured metadata is exported as Spline object files to a dedicated S3 bucket.
  3. These Spline objects are processed into column-level lineage objects customized to fit AppsFlyer’s data architecture and requirements.
  4. The processed lineage data is ingested into DataHub, providing a centralized and interactive view of data dependencies.

The following figure is an example of a lineage diagram from DataHub.

Challenges and how AppsFlyer addressed them

AppsFlyer encountered the following challenges:

  • Supporting different EMR Serverless applications – Each EMR Serverless application has its own Spark and Scala version requirements.
  • Diverse operator usage – Teams often use custom or native EMR Serverless operators, making uniform Spline integration challenging.
  • Confirming universal adoption – They need to make sure Spark jobs across multiple accounts use the Spline agent for lineage tracking.

AppsFlyer addressed these challenges with the following solutions:

  • Version-specific Spline agents – AppsFlyer created a dedicated Spline agent for each EMR Serverless application version to match its Spark and Scala versions. For example, EMR Serverless application version 7.0.1 and Spline.7.0.1.
  • Spark defaults integration – They integrated the Spline agent into EMR Serverless application Spark defaults to verify lineage collection for jobs executed on the application—no job-specific modifications needed.
  • Automation for compliance – This process consists of the following steps:
    • Detect a newly created EMR Serverless application across accounts.
    • Verify that Spline is properly defined in the application’s Spark defaults.
    • Send a PagerDuty alert to the dedicated team if misconfigurations are detected.

Example integration with Terraform

To automate Spline integration, AppsFlyer used Terraform and local-exec to define Spark defaults for EMR Serverless applications. With Amazon EMR, you can set unified Spark configuration properties through spark-defaults, which are then applied to Spark jobs.

This configuration makes sure the Spline agent is automatically applied to every Spark job without requiring modifications to the Airflow operator or the job itself.

This robust lineage integration provides the following benefits:

  • Full visibility – Automatic lineage tracking provides detailed insights into data transformations
  • Seamless scalability – Version-specific Spline agents provide compatibility with EMR Serverless applications
  • Proactive monitoring – Automated compliance checks verify that lineage tracking is consistently enabled across accounts
  • Enhanced governance – Ingesting lineage data into DataHub provides traceability, supports audits, and fosters a deeper understanding of data dependencies

By integrating Spline with EMR Serverless applications, AppsFlyer has provided comprehensive and automated lineage tracking, so teams can understand their data pipelines better while meeting compliance requirements. This scalable approach aligns with AppsFlyer’s commitment to maintaining transparency and reliability throughout their data landscape.

Monitoring and observability

When embarking on a large migration, and as a day-to-day best-practice process, monitoring and observability are key parts of being able to run workloads successfully for stability, debugging, and cost.

AppsFlyer’s DataInfra team set several KPIs for monitoring and observability in EMR Serverless:

  • Monitor infrastructure-level metrics and logs:
    • EMR Serverless resource usage, including cost
    • EMR Serverless API usage
  • Monitor Spark application-level metrics and logs:
    • stdout and stderr logs
    • Spark engine metrics
  • Centralized observability over the existing environments, Datadog

Metrics

Using EMR Serverless native metrics, AppsFlyer’s DataInfra team set up several dashboards to support tracking both the migration and the day-to-day usage of EMR Serverless across the company. The following are the main metrics that were monitored:

  • Service quota usage metrics:
    • vCPU usage tracking (ResourceCount with vCPU dimension)
    • API usage tracking (API actual usage vs. API limits)
  • Application status metrics:
    • RunningJobs, SuccessJobs, FailedJobs, PendingJobs, CancelledJobs
  • Resource limits tracking:
    • MaxCPUAllowed vs. CPUAllocated
    • MaxMemoryAllowed vs. MemoryAllocated
    • MaxStorageAllowed vs. StorageAllocated
  • Worker-level metrics:
    • WorkerCpuAllocated vs. WorkerCpuUsed
    • WorkerMemoryAllocated vs. WorkerMemoryUsed
    • WorkerEphemeralStorageAllocated vs. WorkerEphemeralStorageUsed
  • Capacity allocation tracking:
    • Metrics filtered by CapacityAllocationType (PreInitCapacity vs. OnDemandCapacity)
    • ResourceCount
  • Worker type distribution:
    • Metrics filtered by WorkerType (SPARK_DRIVER vs. SPARK_EXECUTORS)
  • Job success rates over time:
    • SuccessJobs vs. FailedJobs ratio
    • SubmitedJobs vs. PendingJobs

The following screenshot shows an example of the tracked metrics.

Logs

For logs management, AppsFlyer’s DataInfra team explored several options:

Streamlining EMR Serverless log shipping to Datadog

Because AppsFlyer decided to keep their logs in an external logging environment, the DataInfra team aimed to reduce the number of components involved in the shipping process and minimize maintenance overhead. Instead of managing a Lambda based log shipper, they developed a custom Spark plugin that seamlessly exports logs from EMR Serverless to Datadog.

Companies already storing logs in Amazon S3 or CloudWatch Logs can take advantage of EMR Serverless native support for those environments. However, for teams needing a direct, real-time integration with Datadog, this approach alleviates the need for extra infrastructure, providing a more efficient and maintainable logging solution.

The custom Spark plugin offers the following capabilities:

  • Automated log export – Streams logs from EMR Serverless to Datadog
  • Fewer extra components – Alleviates the need for Lambda based log shippers
  • Secure API key management – Uses Vault instead of hardcoding credentials
  • Customizable logging – Supports custom Log4j settings and log levels
  • Full integration with Spark – Works on both driver and executor nodes

How the plugin works

In this section, we walk through the components of how the plugin works and provide a pseudocode overview:

  • Driver pluginLoggerDriverPlugin runs on the Spark driver to configure logging. The plugin fetches EMR job metadata, calls Vault to retrieve the Datadog API key, and configures logging settings.
initialize() {
  if (user provided log4j.xml) {
     Use custom log configuration
  } else {
     Fetch EMR job metadata (application name, job ID, tags)
     Retrieve Datadog API key from Vault
     Apply default logging settings
  }
}
  • Executor plugin – LoggerExecutorPlugin provides consistent logging across executor nodes. It inherits the driver’s log configuration and makes sure the executors use consistent logging
initialize() {
   fetch logging config from Driver
   apply log settings (log4j, log levels)
}
  • Main plugin – LoggerSparkPlugin registers the driver and executor plugins in Spark. It serves as the entry point for Spark and applies custom logging settings dynamically.
function registerPlugin() {
  return (driverPlugin, executorPlugin);
}
loginToVault(role, vaultAddress) {
    create AWS signed request
    authenticate with Vault
    return vault token
}

getDatadogApiKey(vaultToken, secretPath) {
    fetch API key from Vault
    return key
}

Set up the plugin

To set up the plugin, complete the following steps:

  1. Add the following dependencies to your project:
<dependency>
  <groupId>com.AppsFlyer.datacom</groupId>
  <artifactId>emr-serverless-logger-plugin</artifactId>
  <version><!-- insert version here --></version>
</dependency>
  1. Configure the Spark plugin. The following code enables the custom Spark plugin and assigns the Vault role to access the Datadog API key:

--conf "spark.plugins=com.AppsFlyer.datacom.emr.plugin.LoggerSparkPlugin"

--conf "spark.datacom.emr.plugin.vaultAuthRole=your_vault_role"

  1. Use a custom or default Log4j configuration:

--conf "spark.datacom.emr.plugin.location=classpath:my_custom_log4j.xml"

  1. Set the environment variables for different log levels. This adjusts the logging for specific packages.

--conf "spark.emr-serverless.driverEnv.ROOT_LOG_LEVEL=WARN"

--conf "spark.executorEnv.ROOT_LOG_LEVEL=WARN"

--conf "spark.emr-serverless.driverEnv.LOG_LEVEL=DEBUG"

--conf "spark.executorEnv.LOG_LEVEL=DEBUG"

  1. Configure the Vault and Datadog API key and verify secure Datadog API key retrieval.

By adopting this plugin, AppsFlyer was able to significantly simplify log shipping, reducing the number of moving parts while maintaining real-time log visibility in Datadog. This approach provides reliability, security, and ease of maintenance, making it an ideal solution for teams using EMR Serverless with Datadog.

Summary

Through their migration to EMR Serverless, AppsFlyer achieved a significant transformation in team autonomy and operational efficiency. Individual teams now have greater freedom to choose and build their own resources without depending on a central infrastructure team, and can work more independently and innovatively. The minimization of spot interruptions, which were common in their previous self-managed Hadoop clusters, has substantially improved stability and agility in their operations. Thanks to this autonomy and reliability, combined with the automatic scaling capabilities of EMR Serverless, the AppsFlyer teams can focus more on data processing and innovation rather than infrastructure management. The result is a more efficient, flexible, and self-sufficient development environment where teams can better respond to their specific needs while maintaining high performance standards.

Ruli Weisbach, AppsFlyer EVP of R&D, says,

“EMR-Serverless is a game changer for AppsFlyer; we are able to save significantly our cost with remarkably lower management overhead and maximal elasticity.”

If the AppsFlyer approach sparked your interest and you are thinking about implementing a similar solution in your organization, refer to the following resources:

Migrating to EMR Serverless can transform your organization’s data processing capabilities, offering a fully managed, cloud-based experience that automatically scales resources and eases the operational complexity of traditional cluster management, while enabling advanced analytics and machine learning workloads with greater cost-efficiency.


About the authors

Roy Ninio is an AI Platform Lead with deep expertise in scalable data platform and cloud-native architectures. At AppsFlyer, Roy led the design of a high-performance Data Lake handling PB of daily events, driven the adoption of EMR Serverless for dynamic big data processing, and architected lineage and governance systems across platforms.

Avichay Marciano is a Sr. Analytics Solutions Architect at Amazon Web Services. He has over a decade of experience in building large-scale data platforms using Apache Spark, modern data lake architectures, and OpenSearch. He is passionate about data-intensive systems, analytics at scale, and it’s intersection with machine learning.

Eitav Arditti is AWS Senior Solutions Architect with 15 years in AdTech industry, specializing in Serverless, Containers, Platform engineering, and Edge technologies. Designs cost-efficient, large-scale AWS architectures that leverage the cloud-native and edge computing to deliver scalable, reliable solutions for business growth.

Yonatan Dolan is a Principal Analytics Specialist at Amazon Web Services. Yonatan is an Apache Iceberg evangelist, helping customers design scalable, open data lakehouse architectures and adopt modern analytics solutions across industries.

Introducing Amazon Q Developer in Amazon OpenSearch Service

Post Syndicated from Muthu Pitchaimani original https://aws.amazon.com/blogs/big-data/introducing-amazon-q-developer-in-amazon-opensearch-service/

Customers use Amazon OpenSearch Service to store their operational and telemetry signal data. They use this data to monitor the health of their applications and infrastructure, so that when a production issue happens, they can identify the cause quickly. The sheer volume and variety in data often makes this process complex and time-consuming, leading to high mean time to repair (MTTR).

To expedite this process and transform how developers interact with their operational data, today we introduced Amazon Q Developer support in OpenSearch Service. With this AI-assisted analysis, both new and experienced users can navigate complex operational data without training, analyze issues, and gain insights in a fraction of the time. Amazon Q Developer in OpenSearch Service reduces MTTR by integrating generative AI capabilities directly into OpenSearch workflows so you can improve your operational capabilities without scaling your specialist teams. You can now investigate issues, analyze patterns, and create visualizations using in-context assistance and natural language interactions.

In this post, we share how to get started using Amazon Q Developer in OpenSearch Service and explore some of its key capabilities.

Solution overview

Setting up observability signal data for analysis involves many steps, including instrumenting application code, creating complex queries, creating visualizations and dashboards, configuring appropriate alerts, and often machine learning-based anomaly detectors. This requires significant upfront investment in time, resources, and expertise. Amazon Q Developer in OpenSearch Service introduces natural language exploration and generative AI-based tooling throughout OpenSearch, simplifying both initial setup and ongoing operations. Customers already use natural language based query generation to aid constructing OpenSearch queries; Amazon Q in OpenSearch Service brings in the following additional capabilities:

  • Natural language-based visualizations
  • Result summarization for queries generated with natural language queries
  • Anomaly detector suggestions
  • Alert summarization and insights
  • Best practices guidance

Let’s explore each of these capabilities in detail to understand how they help transform traditional observability workflows and streamline the process of data analysis in the centralized OpenSearch UI.

Natural language-based visualization

Natural language-based visualizations with Amazon Q for OpenSearch Service fundamentally transform how users create and interact with data visualizations. You don’t need to know specialized query languages currently used in OpenSearch Service dashboards to create complex visualizations. For example, you can input requests like “show me a chart of error rates over the last 24 hours broken down by region” or “create a chart showing the distribution of HTTP response codes,” and Amazon Q will automatically generate the appropriate visualization.

To get started with this feature, choose Visualizations in the navigation pane and choose Create New Visualization. The OpenSearch UI has many built-in visualization types. To use the new natural language-based visualization, choose Natural language previewer.

This will bring will bring a new visualization page with a text field where you can enter a query in natural language.

Choose an index pattern on the dropdown menu (openSearch_dashabords_sample_data_logs in this case). Amazon Q interprets your intent, identifies relevant fields, automatically selects the most appropriate visualization type, and applies proper formatting and styling. Amazon Q can also understand multiple dimensions in the data, various aggregation methods, and different time ranges.

Now you’re ready to build your visualization in natural language. For example, for the query “Show me number of distinct IP addresses per day in logs,” we see the following visualization.

Amazon Q generates the visualization as per the instruction. The UI also gives the option to update any component of data, transformations, marks and encoding for the visualization. This window also shows the generated query for the data in PPL. For this example Amazon Q generated this query

source=opensearch_dashboards_sample_data_logs*| stats DISTINCT_COUNT(`ip`) as unique_ips by span(`timestamp`, 1d)

Using this interactive UI, you can customize different aspects of the visualization if needed. For example, if you prefer to use a bar type instead of what Amazon Q generated, you can change the mark type to bar and choose Update, or choose Edit visual and specify new set of instructions for this visualization (for example, “change to bar chart”).

After you have adjusted the visualization to your satisfaction, you can save it to retrieve later. What makes this feature particularly powerful is its ability to understand context and suggest refinements by updating your prompts—if the initial visualization doesn’t quite meet your needs, you can describe the desired changes using the Edit visual option.

Result summarization

Amazon Q acts as an interpretation layer that processes query results into a condensed, structured summary. It can also identify patterns and other significant trends in the data by observing both the qualitative and quantitative characteristics of the results. The system’s effectiveness largely depends on the quality of the underlying data, the specificity of the initial query, and the characteristics of query generation, among other things. Amazon Q also samples the result set for generating this result summarization. These summaries are a good starting point for analysis. For example, for the same query we used last time (“Show me number of distinct IP addresses per day in logs”), Amazon Q will analyze the result set in the Amazon Q Summary section.

Anomaly detector suggestions

As it responds to your query, Amazon Q can make suggestions for creating an anomaly detector based upon your data source selected. It does that by recommending relevant fields of your operational data patterns with a one-click confirmation to create the detector.

Features are aggregation of fields or scripts that determines what constitutes an anomaly. Identifying features and creating a detector to use those features typically requires deep technical understanding of spikes, dips, thresholds and inter-relationship between multiple features. Amazon Q helps reduce this traditional complexity when creating a detector by automatically identifying these features as shown below. You can also make changes to the suggested detector to fine-tune to your needs.

Alerts summarization and insights

Choosing the Amazon Q icon next to alerts generates a concise summary that includes alert definitions, the specific conditions that led to its activation, and an overview of the current state of the monitored system or service.

The insights component provides a higher-level insight into the alerts by highlighting the significance of these alerts, typical conditions that results in these alerts, along with recommendations to help mitigate the conditions of these alerts. To get an insight for an alert, you need to provide additional information about your environment with a knowledge base. For instructions on generating insights, see View alert summaries and insights.

By choosing View in Discover, you can dive deeper into the data behind the alert with a single click, facilitating a seamless transition from alert notification to detailed investigation in Discover. The insights and summarization feature helps accelerate your investigations; care must be taken to identify the root cause of the problem because it will likely require human intervention.

Best practices guidance

Amazon Q Developer in OpenSearch Service not only simplifies operations, but also serves as an intelligent assistant for implementing OpenSearch Service best practices. Amazon Q for OpenSearch Service has been trained on the developer and product documentation, so that it can suggest best practices for operating OpenSearch Service domains, Amazon OpenSearch Serverless collections, and configurations based on your needs for capacity and compliance. To get started, choose the Amazon Q icon on the top right. The assistant maintains the history of the conversations. For the guidance it provides, the assistant cites its sources, providing a helpful link to the documentation. It also provides suggestions to continue the conversation. You can ask questions regarding data access policies, index state managements, sizing leader nodes, or other best practices or operational questions about OpenSearch.

Cost considerations

OpenSearch UI is available for use without other associated costs. Amazon Q Developer for OpenSearch Service is available within OpenSearch UI in the following AWS Regions: US East (N. Virginia), US West (Oregon), Asia Pacific (Mumbai), Asia Pacific (Sydney), Asia Pacific (Tokyo), Canada (Central), Europe (Frankfurt), Europe (London), Europe (Paris), and South America (São Paulo). Because it’s included at the Free Tier, there is no associated cost.

Conclusion

Amazon Q Developer support in OpenSearch Service brings in AI-powered capabilities to help alleviate the traditional barriers that teams face when setting up, monitoring, and troubleshooting their applications. This allows teams of all experience levels to harness the full power of OpenSearch.

We’re excited to see how you will use these new capabilities to transform your observability workflows and drive better operational outcomes. To get started with Amazon Q Developer in OpenSearch Service, refer to Amazon Q Developer is now generally available in Amazon OpenSearch Service


About the Authors

Muthu Pitchaimani is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.

Dagney Braun is a Senior Manager of Product on the Amazon Web Services OpenSearch team. She is passionate about improving the ease of use of OpenSearch and expanding the tools available to better support all customer use cases.

Optimizing cold start performance of AWS Lambda using advanced priming strategies with SnapStart

Post Syndicated from Shan Kandaswamy original https://aws.amazon.com/blogs/compute/optimizing-cold-start-performance-of-aws-lambda-using-advanced-priming-strategies-with-snapstart/

Introduced at re:Invent 2022, SnapStart is a performance optimization that makes it easier to build highly responsive and scalable applications using AWS Lambda. The largest contributor to startup latency (often referred to as cold-start time) is the time spent initializing a function. This includes loading the function’s code and initializing dependencies. For latency-sensitive workloads such as APIs and real-time data processing applications, high startup latency can result in a suboptimal end user experience. Lambda SnapStart can reduce startup duration from several seconds to as low as sub-second, with minimal or no code changes. This post discusses ‘Priming’, a technique to further optimize startup times for AWS Lambda functions built using Java and Spring Boot.

Spring Boot applications typically experience high cold start latency during JVM and framework initialization, where significant time is spent loading classes and performing Just-In-Time (JIT) compilation of Java bytecode. This blog post uses a Spring Boot application as an example that retrieves 10 records from a ‘UnicornEmployee’ table in an Amazon RDS for PostgreSQL database, where each employee record includes employee name, location, and hire date.

The sample application uses Amazon API Gateway which triggers an AWS Lambda function that connects to the database through RDS Proxy to return the employee data. While this sample application uses dummy employee data for demonstration, the patterns and optimization techniques discussed in this post are applicable to real-world scenarios with similar data access patterns. Sample code for this implementation can be found in our GitHub repository at lambda-priming-crac-java-cdk.

Background: How SnapStart works

The post assumes familiarity with SnapStart and provides a short background. For additional details, refer to the SnapStart documentation.

To quickly recap, the INIT phase for a Lambda function involves downloading the function’s code, starting the runtime and any external dependencies, and running the function’s initialization code. For functions that don’t use SnapStart, this phase occurs each time your application scales up to create a new execution environment. When SnapStart is activated, the INIT phase happens when you publish a function version.

The following image shows a comparison of a Lambda request lifecycle with and without SnapStart.

Figure 1 – comparison of a Lambda request lifecycle with and without SnapStart

At the end of the INIT phase, Lambda executes your before-checkpoint runtime hooks. Lambda then snapshots the memory and disk state of the initialized execution environment, persists the encrypted snapshot, and caches it for low-latency access. When the function is subsequently invoked, new execution environments are resumed from the cached snapshot (during the RESTORE phase), speeding up function startup.

Figure 2 – new execution environments are resumed from the cached snapshot.

You can validate this speedup by comparing the RESTORE duration with the INIT duration recorded before SnapStart in your Lambda function’s Amazon CloudWatch Logs. As demonstrated in the following table, enabling SnapStart reduces the startup latency of our sample Spring Boot application by 4.3x from 6.1s to 1.4s. The 6.1s cold start latency for ON_DEMAND is primarily due to the combination of (1) initializing the JVM and Spring Boot framework, (2) JIT compilation of lazy loaded application code during initial invocation and (3) the time needed to establish a database connection with RDS through Amazon RDS Proxy. By enabling SnapStart, Lambda initializes the JVM and Spring Boot prior to the function invocation – resulting in the significantly reduced latency of 1.4s.

Method Cold Start Invocations p50 P90 P99 p99.9
PrimingLogGroup-1_ON_DEMAND 128 5047.94 ms 5386.78 ms 6158.80 ms 6195.84 ms
PrimingLogGroup-2_SnapStart_NO_PRIMING 111 1177.87 ms 1288.73 ms 1419.94 ms 1425.63 ms

You can reduce cold starts even further for your latency-sensitive Spring Boot applications by using priming techniques on Lambda functions. Let’s explore how to implement priming techniques.

Priming explained

Priming is the process of preloading dependencies and initializing resources during the INIT phase, rather than during the INVOKE phase to further optimize startup performance with SnapStart. This is required because Java frameworks that use dependency injection load classes into memory when these classes are explicitly invoked, which typically happens during Lambda’s INVOKE phase. You can proactively load classes using Java runtime hooks, that are part of the open source CRaC (Coordinated Restore at Checkpoint) project. This post demonstrates how to use this hook, called beforeCheckpoint(), to prime SnapStart-enabled Java functions, in two ways:

  1. Invoke Priming: This approach involves directly invoking application endpoints or methods in your pre-snapshotting hook so that they are JIT compiled during the INIT phase and included in the snapshot. This can include operations such as invoking API Gateway endpoints or fetching data from an S3 bucket or RDS database to proactively execute the code paths, ensuring that the underlying classes are included in the snapshot.
  2. Class Priming: This approach involves proactive initialization of classes during the INIT phase, ensuring that they are included in the function’s snapshot without risking unwanted changes to application state or data. This can be achieved by leveraging Java’s forName() method, which loads, links, and initializes the specified class. Initialization refers to the JVM process of loading the class definition into memory, verifying the bytecode, preparing static fields with default values, and executing static initializers. This is different from instantiation, which creates objects of the class using constructors. To generate a list of the classes required for pre-loading, you can use the following VM option, writing the list to a file called classes-loaded.txt:
    -Xlog:class+load=info:classes-loaded.txt

While invoke priming can offer better performance, it requires additional effort to ensure that the actions performed are idempotent and do not have unintended side effects, for instance processing financial transactions in a banking application. For this reason, invoke priming should only be used when code executed during priming is either idempotent or does not modify state. For scenarios where this is not possible, class priming provides a safer alternative by only initializing classes without executing their methods. Note that this assumes your application does not execute state-modifying code during class initialization.

With this context, let’s look at how to implement Invoke and Class priming for a Spring Boot sample application.

Example priming Implementation using CRaC runtime hooks before taking a Lambda snapshot

This post demonstrates both Invoke priming and Class priming using the sample Spring Boot application. The choice between the two approaches depends on the specific requirements and complexities of your application.

Step 1: Set up your Spring Boot Application using the aws-serverless-springboot3-archetype as explained in our Quick Start Spring Boot3 guide, adding database connectivity code, or simply clone the sample project from GitHub repository.

  1. Create a Spring Boot Application.
    // src/main/java/software/amazon/awscdk/examples/unicorn/UnicornApplication.java
    package software.amazon.awscdk.examples.unicorn;
    …
    @Import({ UnicornConfig.class })
    @SpringBootApplication
    public class UnicornApplication {
    
        private static final Logger log = LoggerFactory.getLogger(UnicornApplication.class);
    
        public static void main(String... arguments) {
            SpringApplication.run(UnicornApplication.class, arguments);
        }
    
    }

  2. Add all the necessary Maven dependencies for Spring Boot, AWS Lambda, and Database Connection in your pom.xml file. The following, highlighted, dependency contains the classes required to use the CRaC runtime hooks.
    ...
            <dependency>
                <groupId>org.crac</groupId>
                <artifactId>crac</artifactId>
            </dependency>
    ...

  3. Configure Database Connection – Set up the database connection details in application.properties.
    spring.datasource.password=${SPRING_DATASOURCE_PASSWORD} 
    spring.datasource.url=${SPRING_DATASOURCE_URL} 
    spring.datasource.username=postgres 
    spring.datasource.hikari.maximumPoolSize=1 

Step 2: Implement Lambda Function Handler with CRaC runtime hooks and Invoke Priming Approach:

Create Lambda Function Handler and integrate CRaC runtime hooks to execute beforeCheckpoint() and afterRestore() methods in your application for before taking and after restoring the snapshot.

  1. Implement the RequestHandler<UnicornRequest, UnicornResponse> interface in the Lambda function handler class.
  2. Implement the CRaC resource interface with two methods: beforeCheckpoint() and afterRestore(), which defines actions performed before Lambda creates the snapshot and after the snapshot is restored.
  3. Add invoke priming by creating a UnicornRequest object with a GET request to a specific endpoint (such as, /unicorn) and call the handleRequest(unicornRequest, null) method.

This ensures that the code paths associated with the specified endpoint are JIT compiled and optimized for faster execution during the first invocation after the snapshot is restored.

/src/main/java/software/amazon/awscdk/examples/unicorn/handler/InvokePriming.java
package software.amazon.awscdk.examples.unicorn.handler;

import org.crac.Core;
import org.crac.Resource;
...
public class InvokePriming implements RequestHandler<APIGatewayV2HTTPEvent, APIGatewayV2HTTPResponse>, Resource {
	...

@Override
public APIGatewayV2HTTPResponse handleRequest(APIGatewayV2HTTPEvent event, Context context) {
    var awsLambdaInitializationType = System.getenv("AWS_LAMBDA_INITIALIZATION_TYPE");
    var unicorns = getUnicorns();
    var body = gson.toJson(unicorns);
    return APIGatewayV2HTTPResponse.builder().withStatusCode(200).withBody(body).build();
}

@Override
public void beforeCheckpoint(org.crac.Context<? extends Resource> context)
        throws Exception {
    var event = APIGatewayV2HTTPEvent.builder().build();
    handleRequest(event, null);
}
...
}

Step 3: Implement Class priming Approach:

The class priming approach focuses on pre-loading required classes to achieve optimal performance. To implement class priming, generate the list of classes that are loaded during the application startup and function execution by running the application locally using the following JVM argument: -Xlog:class+load=info:classes-loaded.txt

  1. Ensure that your application classes included in the generated classes-loaded.txt file are not mutating state during static initialization.
    Note: the generated classes-loaded.txt contains class entries in the following format:

    [0.068s][info][class,load] software.amazon.awscdk.examples.unicorn.handler.ClassPriming source: file:/var/task/

  2. Extract only the fully qualified class names from each line and remove the additional logging information. For Example:
    software.amazon.awscdk.examples.unicorn.handler.ClassPriming

  3. Use the ClassLoaderUtil.loadClassesFromFile() utility method to extract the generated class entries.
    	     //src/main/java/software/amazon/awscdk/examples/unicorn/service/ClassLoaderUtil.java
    package software.amazon.awscdk.examples.unicorn;
    	...
    public class ClassLoaderUtil {
    	...
        public static void loadClassesFromFile() {
            log.info("loadClassesFromFile->started");
            Path path = Paths.get("classes-loaded.txt");
    
            try (BufferedReader bufferedReader = Files.newBufferedReader(path)) {
                Stream<String> lines = bufferedReader.lines();
                lines.forEach(line -> {
                    var index1 = line.indexOf("[class,load] ");
                    var index2 = line.indexOf(" source: ");
    
                    if (index1 < 0 || index2 < 0) {
                        return;
                    }
    
                    var className = line.substring(index1 + 13, index2);
                    try {
                        Class.forName(className, true,
                                ClassPriming.class.getClassLoader());
                    } catch (Throwable ignored) {
                    }
                });
    
                log.info("loadClassesFromFile->finished");
            } catch (IOException exception) {
                log.error("Error on newBufferedReader", exception);
            }
        }
    ...
    }

  4. Read a file (such as, /classes-loaded.txt) that contains a list of classes that have been loaded during the application’s execution in the beforeCheckpoint() method.
  5. Use the Class.forName() method to load and initialize the class, ensuring that it is ready during the snapshot.
    Note: by systematically pre-loading these classes, the Class priming approach simplifies the optimization process and reduces the complexities associated with Invoke priming.

    //src/main/java/software/amazon/awscdk/examples/unicorn/handler/ClassPriming.java
    package software.amazon.awscdk.examples.unicorn.handler;
    
    ...
    import org.crac.Core;
    import org.crac.Resource;
    
    public class ClassPriming implements RequestHandler<APIGatewayV2HTTPEvent, APIGatewayV2HTTPResponse>, Resource {
    
    ...
            ConfigurableApplicationContext configurableApplicationContext =
    				SpringApplication.run(UnicornApplication.class);
    
            this.unicornService = configurableApplicationContext.getBean(UnicornService.class);
            this.gson = configurableApplicationContext.getBean(Gson.class);
    
            Core.getGlobalContext().register(this);
        }
    
        @Override
        public APIGatewayV2HTTPResponse handleRequest(APIGatewayV2HTTPEvent event, Context context) {
            var unicorns = getUnicorns();
            var body = gson.toJson(unicorns);
    
            return APIGatewayV2HTTPResponse.builder().withStatusCode(200).withBody(body).build();
        }
    
        @Override
        public void beforeCheckpoint(org.crac.Context<? extends Resource> context)
                throws Exception {
    
            ClassLoaderUtil.loadClassesFromFile();
    
        }
    ...
    }

Step 4: AWS CDK Infrastructure Setup

Before proceeding, review the prerequisites in the project README file.

The CDK stack deploys a serverless application and required infrastructure for testing different Lambda optimization strategies. It creates a VPC with private subnets, an RDS for PostgreSQL instance with a database proxy, and five Lambda functions implementing different optimization approaches (ON_DEMAND without SnapStart, SnapStart without priming, SnapStart with invoke priming, and SnapStart with class priming). Each Lambda function is integrated with API Gateway for HTTP access, configured with Java 21 runtime on ARM64 architecture, and includes CloudWatch log groups for monitoring.

Follow these steps to deploy the infrastructure:

  1. Clone the sample repository:
    git clone https://github.com/aws-samples/lambda-priming-crac-java-cdk.git

  2. Deploy the CDK stack:
    cd lambda-priming-crac-java-cdk/infrastructure
    cdk synth
    cdk deploy --require-approval never --all 2>&1 | tee cdk_output.txt

  3. Save the API Gateway URLs:
    The deployment will output five URLs in this format:

    # ON_DEMAND endpoint (without SnapStart)
    LambdaPrimingCracJavaCdkStack.PrimingJavaRestApi1ONDEMANDEndpoint = https://[id].execute-api.us-east-1.amazonaws.com/prod/
    
    # SnapStart without priming endpoint
    LambdaPrimingCracJavaCdkStack.PrimingJavaRestApi2SnapStartNOPRIMINGEndpoint = https://[id].execute-api.us-east-1.amazonaws.com/prod/
    
    # SnapStart with invoke priming endpoint
    LambdaPrimingCracJavaCdkStack.PrimingJavaRestApi3SnapStartINVOKEPRIMINGEndpoint = https://[id].execute-api.us-east-1.amazonaws.com/prod/
    
    # SnapStart with class priming endpoint
    LambdaPrimingCracJavaCdkStack.PrimingJavaRestApi4SnapStartCLASSPRIMINGEndpoint = https://[id].execute-api.us-east-1.amazonaws.com/prod/
    
    # Database setup endpoint
    LambdaPrimingCracJavaCdkStack.PrimingJavaRestApi5DBLOADEREndpoint = https://[id].execute-api.us-east-1.amazonaws.com/prod/

  4.  Extract the URLs into variables for testing:
    ONDEMAND_URL=$(grep -oE 'https://[a-zA-Z0-9.-]+\.execute-api\.[a-zA-Z0-9-]+\.amazonaws\.com/prod/' "cdk_output.txt" | head -n 1) \
    
    NOPRIMING_URL=$(grep -oE 'https://[a-zA-Z0-9.-]+\.execute-api\.[a-zA-Z0-9-]+\.amazonaws\.com/prod/' "cdk_output.txt" | head -n 2 | tail -n 1) \
    
    INVOKEPRIMING_URL=$(grep -oE 'https://[a-zA-Z0-9.-]+\.execute-api\.[a-zA-Z0-9-]+\.amazonaws\.com/prod/' "cdk_output.txt" | head -n 3 | tail -n 1) \
    
    CLASSPRIMING_URL=$(grep -oE 'https://[a-zA-Z0-9.-]+\.execute-api\.[a-zA-Z0-9-]+\.amazonaws\.com/prod/' "cdk_output.txt" | head -n 4 | tail -n 1) \
    
    SETUP_URL=$(grep -oE 'https://[a-zA-Z0-9.-]+\.execute-api\.[a-zA-Z0-9-]+\.amazonaws\.com/prod/' "cdk_output.txt" | head -n 5 | tail -n 1)

Step 5: Load database and run performance testing using artillery:

  1. Initialize the database with sample data.
    curl -X GET "$SETUP_URL"
    
    #Expected output: {"message":"Database schema initialized and data loaded"}

  2. Run performance tests for all endpoints
    artillery run -t "$ONDEMAND_URL" -v '{ "url": "/unicorn" }' ./loadtest.yaml && \
    artillery run -t "$NOPRIMING_URL" -v '{ "url": "/unicorn" }' ./loadtest.yaml && \
    artillery run -t "$INVOKEPRIMING_URL" -v '{ "url": "/unicorn" }' ./loadtest.yaml && \
    artillery run -t "$CLASSPRIMING_URL" -v '{ "url": "/unicorn" }' ./loadtest.yaml

Step 6: Compare the load test results for On-demand (non-SnapStart), SnapStart, Invoke priming, and Class priming

The performance test results in the table below are sorted from slowest to fastest startup latency. The function without SnapStart performs the slowest due to JVM initialization, class loading and JIT compilation that occurs when the function is invoked. Notice a 4.3x improvement with SnapStart, which resumes invocations from a pre-initialized snapshot thereby avoiding JVM initialization and initial JIT compilation. SnapStart with class priming achieves a 1.4x speed-up over SnapStart, by proactively loading/initializing classes during INIT so that they are included in your function’s snapshot. Finally, SnapStart with invoke priming achieves the fastest performance – with a 781.68ms p99.9 cold-start latency that is 1.8x faster than SnapStart. This is because in addition to initializing classes, it also executes methods on the instances of those classes, resulting in even more components being included in the function’s snapshot.

Note that with invoke priming, any application code you execute must either be idempotent or modify stub data only. For instance, consider application code that triggers a financial transaction. If this code is executed during invoke priming with real user data, it may drive unintended effects with potentially serious consequences. Class priming avoids this, since application classes are initialized rather than being instantiated and their methods executed. This assumes that application code does not execute state modifying logic during class initialization. We recommend that you keep these considerations in mind when using invoke and/or class priming, and choose the appropriate approach for your use case.

Method Cold Start Invocations p50 P90 P99 p99.9
PrimingLogGroup-1_ON_DEMAND 128 5047.94 ms 5386.78 ms 6158.80 ms 6195.84 ms
PrimingLogGroup-2_SnapStart_NO_PRIMING 111 1177.87 ms 1288.73 ms 1419.94 ms 1425.63 ms
PrimingLogGroup-4_SnapStart_CLASS_PRIMING 82 857.81 ms 997.49 ms 1085.94 ms 1085.94 ms
PrimingLogGroup-3_SnapStart_INVOKE_PRIMING 66 608.42 ms 688.88 ms 781.68 ms 781.68 ms

 Conclusion

This post showed how AWS Lambda SnapStart, enhanced by CRaC runtime hooks, unlocks granular control over cold-start optimization for Java applications through two distinct priming strategies:

  • Invoke Priming: improves performance by executing critical endpoints during snapshot creation, ideal for idempotent workflows.
  • Class Priming: preloads classes without triggering business logic, mitigating side-effect risks.

To implement these optimization techniques in your applications evaluate your use case and opt for the optimal priming approach. Track latency reductions and resource utilization of your application via Amazon CloudWatch metrics to quantify performance improvements. By integrating these strategies, developers can achieve sub-second cold starts while maintaining the scalability and cost-efficiency of serverless architecture using Java.

To dive deeper, check out the GitHub repository with the full example code, including setup instructions and reusable patterns you can adapt to your own projects. For more examples of Java applications running on AWS Lambda, visit serverlessland.com and explore a wide range of resources, tutorials, and real-world use cases.

Combining Snyk’s Insight with Amazon Q Developer’s Assistance to Streamline Secure Development

Post Syndicated from Omar Faruk original https://aws.amazon.com/blogs/devops/combining-snyks-insight-with-amazon-q-developers-assistance-to-streamline-secure-development/

Developers today face a constant balancing act – building new features and functionality while also ensuring the security and reliability of their codebase. Two powerful tools, Snyk and Amazon Q Developer, can work in tandem to help developers navigate this challenge with greater efficiency and efficacy.

Snyk is a leading developer security platform that empowers developers to seamlessly secure their code, open-source dependencies, container images, and cloud infrastructure all from a single, unified platform. Amazon Q Developer is a generative AI-powered assistant designed to accelerate a variety of tasks across the software development lifecycle. By combining the security insights from Snyk with the assistive capabilities of Amazon Q Developer, developers can streamline their workflows and focus on delivery.

Getting started with Amazon Q Developer and Snyk IDE Plugins

To get started with Amazon Q Developer, you need to have an AWS Builder ID or be part of an organization with an AWS IAM Identity Center instance that allows you to use Amazon Q. To use Amazon Q Developer agents for software development in Visual Studio Code, start by installing the Amazon Q extension. Find the latest version of the extension on the Amazon Q Developer page. The extension is also available for JetBrains, Eclipse (Preview), and Visual Studio IDEs. For a detailed list of supported IDEs and the features available in each, refer to the Amazon Q Developer documentation.

To get started with Snyk, sign up for a free Snyk account or log in with your existing account. To use Snyk in your IDE to automatically find security issues, review the IDE documentation and install Snyk using your IDE extension marketplace. After Snyk is installed, navigate to the Snyk panel in your IDE and follow the on-screen instructions to authenticate with your Snyk account.

After authenticating, Snyk will automatically scan your entire codebase for security issues. Snyk will continue scanning periodically as you write code or generate code with Amazon Q Developer.

Walkthrough

Let’s explore how Snyk and Amazon Q Developer can be used together through a few examples. Imagine that you maintain an open-source project. As a new Snyk user, you would like to find and fix the security issues in the project. In this first and simple scenario, Snyk has identified many cases of security vulnerabilities in specific lines of code. Among the vulnerabilities, we’ll focus on the Information Exposure vulnerability.

Snyk's IDE plugin shares a list of vulnerabilities and an overview, such as the line of code with the vulnerability and detail, of the vulnerability when it is selected.

Figure 1 – Snyk IDE Plugin displaying vulnerability analysis of an Information Exposure issue, showing severity, affected code, and prevention tips.

Rather than manually researching and implementing the fix, you can simply highlight the flagged line, invoke Amazon Q Developer’s inline chat by pressing ⌘+I (Mac) or Ctrl+I (Windows), and request assistance. Amazon Q Developer will analyze the issue, propose the necessary code changes, and provide you with an inline diff to review and accept. This allows for rapid remediation of security flaws saving time while improving the code.

Activating inline Q Developer and making a prompt for the agent to resolve the information exposure vulnerability identified by Snyk.

Figure 2 – Activating Amazon Q Developer inline code generation to fix the detected information exposure vulnerability.

We are happy with the change Amazon Q Developer proposed, so we’ll simply hit enter to accept the suggestions. Of course, we could always hit escape to reject the suggestion if needed.

Q Developer makes an inline code suggestion to resolve the information exposure vulnerability.

Figure 3 – Amazon Q Developer displaying an inline code generation to fix the detected information exposure vulnerability.

In addition to the inline chat, you can pass the vulnerability details directly from the Snyk plugin’s Problems view into the Amazon Q Developer /dev agentic capability.

In the chat interface of Q Developer, the /dev agentic capability allows longer conversation, broader workspace context, and handle changes within multiple files and topics. When this workflow is invoked, the Amazon Q Developer Agent will generate code based on the description and existing code in the workspace, provide a list of suggestions to review and add to the workspace, and if needed, iterate on the code based on feedback.

Copying the information of the information exposure vulnerability from Snyk plugin and requesting a fix using /dev agent capability.

Figure 4 – Using Amazon Q’s /dev agent to implement project-wide fixes for Snyk-detected vulnerabilities across multiple files.

Not all issues are trivial as the prior example. In a more complex case, Snyk may surface a vulnerability that requires a deeper understanding of the code and the potential risk. Let’s look at another issue that Snyk identified in the project we have been discussing.

Snyk identified cross-site scripting vulnerability and explains the line of code, details, and prevention tips of the vulnerability.

Figure 5 – Snyk Plugin highlighting a cross-site scripting (XSS) vulnerability, showing the affected code line and prevention recommendations.

Here, you can switch to Amazon Q Developer’s chat interface, provide the details of the issue, and ask for a more thorough explanation. Amazon Q Developer can then dive into the codebase, explain the problem in detail, and walk you through the appropriate fixes. This collaborative approach empowers developers to make informed decisions and gain broader knowledge, rather than simply implementing a suggestion.

Chat interface that takes a prompt from user to explain why Snyk flagged an cross-site scripting vulnerability and its impact.

Figure 6 – Amazon Q Developer’s chat interface explaining an XSS vulnerability and its security implications through natural language dialogue.

Note that Amazon Q Developer provides links to documentation and other sources for further reading. In addition, you can continue discussing the issue to learn more. For example, imagine that you want to understand real world breaches that have occurred as a result of the issues that Synk has identified. Q provides a few examples for me to learn more.

A natural language query in the chat interface if there has been any major breaches caused by the issue of cross-site scripting. Q responds with popular and impactful incidents.

Figure 7 – Amazon Q Developer discussing notable real-world XSS breach examples and their security impacts.

Beyond fixing issues, Amazon Q Developer can also assist with other development tasks identified by Snyk, such as updating dependencies, refactoring code, or optimizing cloud infrastructure. By integrating these two tools, developers can streamline security scanning, issue investigation, and remediation, dramatically increasing their overall productivity.

Conclusion

In this blog, we took a look at how Snyk and Amazon Q Developer are a powerful duo in the modern developer’s toolkit. Integrating Snyk’s leading security insights with the generative AI capabilities of Amazon Q Developer empowers developers to more efficiently identify, comprehend, and address security vulnerabilities. This combination enables developers to upskill and enhance their own abilities as they work to resolve security issues. Get started with installing the Amazon Q Developer in the IDE and Snyk plugin.



Connect with AWS Partner Snyk.

Snyk – AWS Partner Spotlight

Snyk empowers the world’s developers to build secure applications and equip security teams to meet the demands of the digital world. Used by 1,200 customers worldwide, Snyk’s Developer Security Platform automatically integrates with a developer’s workflow and is purpose-built for security teams to collaborate with their development teams.

Snyk on AWS Marketplace

About the authors:

Omar Faruk

Omar Faruk is a DevOps Partner Solutions Architect at Amazon Web Services. He helps DevSecOps partners to design, build and operate their and shared customers’ workloads in AWS. He is passionate about CI/CD, Infrastructure as Code, and next-generation developer experience. Outside work, he enjoys family time and travel.

David Schott

David is a seasoned DevSecOps Solutions Engineer with 15+ years of experience helping Fortune 100 companies optimize their software delivery security and efficiency. After driving DevOps adoption and CI development at CloudBees, he now focuses on DevSecOps at Snyk, where he collaborates with strategic partners to enable secure innovation at scale.

Melting the ice — How Natural Intelligence simplified a data lake migration to Apache Iceberg

Post Syndicated from Yonatan Dolan original https://aws.amazon.com/blogs/big-data/melting-the-ice-how-natural-intelligence-simplified-a-data-lake-migration-to-apache-iceberg/

This post is co-written with Haya Axelrod Stern, Zion Rubin and Michal Urbanowicz from Natural Intelligence.

Many organizations turn to data lakes for the flexibility and scale needed to manage large volumes of structured and unstructured data. However, migrating an existing data lake to a new table format such as Apache Iceberg can bring significant technical and organizational challenges

Natural Intelligence (NI) is a world leader in multi-category marketplaces. NI’s leading brands, Top10.com and BestMoney.com, help millions of people worldwide to make informed decisions every day. Recently, NI embarked on a journey to transition their legacy data lake from Apache Hive to Apache Iceberg.

In this blog post, NI shares their journey, the innovative solutions developed, and the key takeaways that can guide other organizations considering a similar path.

This article details NI’s practical approach to this complex migration, focusing less on Apache Iceberg’s technical specifications, but rather on the real-world challenges and solutions encountered during the transition to Apache Iceberg, a challenge that many organizations are grappling with.

Why Apache Iceberg?

The architecture at NI followed the commonly used medallion architecture, comprised of a bronze-silver-gold layered framework, shown in the figure that follows:

  • Bronze layer: Unprocessed data from various sources, stored in its raw format in Amazon Simple Storage Service (Amazon S3), ingested through Apache Kafka brokers.
  • Silver layer: Contains cleaned and enriched data, processed using Apache Flink.
  • Gold layer: Holds analytics-ready datasets designed for business intelligence (BI) and reporting, produced using Apache Spark pipelines, and consumed by services such as Snowflake, Amazon Athena, Tableau, and Apache Druid. The data is stored in Apache Parquet format with AWS Glue Catalog providing metadata management.

BDB4681-Arch1

While this architecture supported NI analytical needs, it lacked the flexibility required for a truly open and adaptable data platform. The gold layer was coupled only with query engines that supported Hive and AWS Glue Data Catalog. It was possible to use Amazon Athena however Snowflake required maintaining another catalog in order to query those external tables. This issue made it difficult to evaluate or adopt alternative tools and engines without costly data duplication, query rewrite data catalog synchronization. As business scaled, NI needed a data platform that could seamlessly support multiple query engines simultaneously with a single data catalog and avoiding any vendor lock-in.

The power of Apache Iceberg

Apache Iceberg emerged as the perfect solution—a flexible, open table format that aligns with NI’s approach of Data Lake First. Iceberg offers several critical advantages such as ACID transactions, schema evolution, time travel, performance improvements and more. But the key strategic benefits lay in the ability to support multiple query engines simultaneously. It also has the following advantages:

  • Decoupling of storage and compute: The open table format enables you to separate the storage layer from the query engine, allowing an easy swap and support for multiple engines concurrently without data duplication.
  • Vendor independence: As an open table format, Apache Iceberg prevents vendor lock-in, giving you the flexibility to adapt to changing analytics needs.
  • Vendor adoption: Apache Iceberg is widely supported by major platforms and tools, providing seamless integration and long-term ecosystem compatibility.

By transitioning to Iceberg, NI was able to embrace a truly open data platform, providing long-term flexibility, scalability, and interoperability while maintaining a unified source of truth for all analytics and reporting needs.

Challenges faced

Migrating a live production data lake to Iceberg was challenging because of operational complexities and legacy constraints. The data service at NI runs hundreds of Spark and machine learning pipelines, manages thousands of tables, and supports over 400 dashboards—all operating 24/7. Any migration would need to be done without production interruptions; and coordinating such a migration while operations continue seamlessly was daunting.

NI needed to accommodate diverse users with varying requirements and timelines from data engineers to data analysts all the way to data scientists and BI teams.

Adding to the challenge were legacy constraints. Some of the existing tools didn’t fully support Iceberg, so there was a need to maintain Hive-backed tables for compatibility. As NI realized that not all consumers could adopt Iceberg immediately. A plan was required to allow for incremental transitions without downtime or disruption to ongoing operations.

Key pillars for migration

To help ensure a smooth and successful transition, six critical pillars were defined:

  • Support ongoing operations: Maintain uninterrupted compatibility with existing systems and workflows during the migration process.
  • User transparency: Minimize disruption for users by preserving existing table names and access patterns.
  • Gradual consumer migration: Allow consumers to adopt Iceberg at their own pace, avoiding a forced, simultaneous switchover.
  • ETL flexibility: Migrate ETL pipelines to Iceberg without imposing constraints on development or deployment.
  • Cost effectiveness: Minimize storage and compute duplication and overhead during the migration period.
  • Minimize maintenance: Reduce the operational burden of managing dual table formats (Hive and Iceberg) during the transition.

Evaluating traditional migration approaches

Apache Iceberg supports two main approaches for migration: In-place and rewrite-based migration.

In-place migration

How it works: Converts an existing dataset into an Iceberg table without duplicating data by creating Iceberg metadata on top of the existing files while preserving their layout and format.

Advantages:

  • Cost-effective in terms of storage (no data duplication)
  • Simplified implementation
  • Maintains existing table names and locations
  • No data movement and minimal compute requirements, translating into lower cost

Disadvantages:

  • Downtime required: All write operations must be paused during conversion, which was unacceptable in NI cases because data and analytics are considered mission critical and run 24/7
  • No gradual adoption: All consumers must switch to Iceberg simultaneously, increasing the risk of disruption
  • Limited validation: No opportunity to validate data before cutover; rollback requires restoring from backups
  • Technical constraints: Schema evolution during migration can be challenging; data type incompatibilities can halt the entire process

Rewrite-based migration

How it works: Rewrite-based migration in Apache Iceberg involves creating a new Iceberg table by rewriting and reorganizing existing dataset files into Iceberg’s optimized format and structure for improved performance and data management.

Advantages:

  • Zero downtime during migration
  • Supports gradual consumer migration
  • Enables thorough validation
  • Simple rollback mechanism

Disadvantages:

  • Resource overhead: Double storage and compute costs during migration
  • Maintenance complexity: Managing two parallel data pipelines increases operational burden
  • Consistency challenges: Maintaining perfect consistency between the two systems is challenging
  • Performance impact: Increased latency because of dual writes; potential pipeline slowdowns

Why neither option alone was good enough

NI decided that neither option could meet all critical requirements:

  • In-place migration fell short because of unacceptable downtime and lack of support for gradual migration.
  • Rewrite-based migration fell short because of prohibitive cost overhead and complex operational management.

This analysis led NI to develop a hybrid approach that combines the advantages of both methods while mitigating and minimizing limitations.

The hybrid solution

The hybrid migration strategy was designed around five foundational elements, using AWS analytical services for orchestration, processing, and state management.

  1. Hive-to-Iceberg CDC: Automatically synchronize Hive tables with Iceberg using a custom change data capture (CDC) process to support existing consumers. Unlike traditional CDC focusing on row-level changes, the process was done at the partition-level to preserve Hive’s behavior of updating tables by overwriting partitions. This helps ensure that data consistency is maintained between Hive and Iceberg without logic changes at the migration phase, making sure that the same data exists on both tables.
  2. Continuous schema synchronization: Schema evolution during the migration introduced maintenance challenges. Automated schema sync processes compared Hive and Iceberg schemas, reconciling differences while maintaining type compatibility.
  3. Iceberg-to-Hive reverse CDC: To enable the data team to transition extract, transform, and load (ETL) jobs to write directly to Iceberg while maintaining compatibility with existing Hive-based processes not yet migrated, a reverse CDC from Iceberg to Hive was implemented. This allowed ETLs to write to Iceberg while maintaining Hive tables for downstream processes that had not yet migrated and still relied on them during the migration period.
  4. Alias management in Snowflake: Snowflake aliases made sure that Iceberg tables retained their original names, making the transition transparent to users. This approach minimized reconfiguration efforts across dependent teams and workflows.
  5. Table replacement: Swap production tables while retaining original names, completing the migration.

Technical deep dive

The migration to from Hive to Iceberg was constructed of several steps:

1. Hive-to-Iceberg CDC pipeline

Objective: Keep Hive and Iceberg tables synchronized without duplicating effort.

The preceding figure demonstrates how every partition written to the Hive table is automatically and transparently copied to the Iceberg table using a CDC process. This process makes sure that both tables are synchronized, enabling a seamless and incremental migration without disrupting downstream systems. NI chose partition-level synchronization because the legacy Hive ETL jobs already wrote updates by overwriting entire partitions and updating the partition location. Adopting that same approach in the CDC pipeline helped ensure that it remained consistent with how data was originally managed, making the migration smoother and avoiding the need to rework row-level logic.

Implementation:

  • To keep Hive and Iceberg tables synchronized without duplicating effort, a streamlined pipeline was implemented. Whenever partitions in Hive tables are updated, the AWS Glue Catalog emits events such as UpdatePartition. Amazon EventBridge captured these events, filtered them for the relevant databases and tables according to the event bridge rule, and triggered an AWS Lambda This function parsed the event metadata and sent the partition updates to an Apache Kafka topic.
  • A Spark job running on Amazon EMR consumed the messages from Kafka, which contained the updated partition details from the Data Catalog events. Using that event metadata, the Spark job queried the relevant Hive table, and wrote it to Iceberg table in Amazon S3 using the Spark Iceberg overwritePartitions API, as shown in the following example:
{
   "id":"10397e54-c049-fc7b-76c8-59e148c7cbfc",
   "detail-type":"Glue Data Catalog Table State Change",
   "source":"aws.glue",
   "time":"2024-10-27T17:16:21Z",
   "region":"us-east-1",
   "detail":{
      "databaseName":"dlk_visitor_funnel_dwh_production",
      "changedPartitions":[
         "2024-10-27"
      ],
      "typeOfChange":"UpdatePartition",
      "tableName":"fact_events"
   }
}
  • By targeting only modified partitions, the pipeline (shown in the following figure) significantly reduced the need for costly full-table rewrites. Iceberg’s robust metadata layers, including snapshots and manifest files, were seamlessly updated to capture these changes, providing efficient and accurate synchronization between Hive and Iceberg tables.

2. Iceberg-to-Hive reverse CDC pipeline

Objective: Support Hive consumers while allowing ETL pipelines to transition to Iceberg.

BDB4681-arch4

The preceding figure shows the reverse process, where every partition written to the Iceberg table is automatically and transparently copied to the Hive table using a CDC mechanism. This process helps ensure synchronization between the two systems, enabling seamless data updates for legacy systems that still rely on Hive while transitioning to Iceberg.

Implementation:

Synchronizing data from Iceberg tables back to Hive tables presented a different challenge. Unlike Hive tables, Data Catalog doesn’t track partition updates for Iceberg tables because partitions in Iceberg are managed internally and not within the catalog. This meant NI couldn’t rely on Glue Catalog events to detect partition changes.

To address this, NI implemented a solution similar to the previous flow but adapted to Iceberg’s architecture. Apache Spark was used to query Iceberg’s metadata tables—specifically the snapshots and entries tables—to identify the partitions modified since the last synchronization. The query used was:

SELECT e.data_file.partition, MAX(s.committed_at) AS last_modified_time 
FROM $target_table.snapshots JOIN $target_table.entries e ON s.snapshot_id = e.snapshot_id 
WHERE s.committed_at &amp;gt; '$last_sync_time' 
GROUP BY e.data_file.partition;

This query returned only the partitions that had been updated since the last synchronization, enabling it to focus exclusively on the changed data. Using this information, similar to the earlier process, a Spark job retrieved the updated partitions from Iceberg and wrote them back to the corresponding Hive table, providing seamless synchronization between both tables.

3. Continuous schema synchronization

Objective: Automate schema updates to maintain consistency across Hive and Iceberg.

BDB4681-arch5

The preceding figure shows how the automatic schema sync process helps ensure consistency between Hive and Iceberg tables schemas by automatically synchronizing schema changes. In this example adding the Channel column, minimizing manual work and double maintenance during the extended migration period.

 Implementation:

To handle schema changes between Hive and Iceberg, a process was implemented to detect and reconcile differences automatically. When a schema change happens in a Hive table, Data Catalog emits an UpdateTable event. This event triggers a Lambda function (routed through EventBridge), which retrieves the updated schema from Data Catalog for the Hive table and compares it to the Iceberg schema. It’s important to call out that in NI’s setup, schema changes originate from Hive because the Iceberg table is hidden behind aliases across the system. Because Iceberg is primarily used for Snowflake, a one-way sync from Hive to Iceberg is sufficient. As a result, there is no mechanism to detect or handle schema changes made directly in Iceberg, because they aren’t needed in the current workflow.

During the schema reconciliation (shown in the following figure), data types are normalized to help ensure compatibility—for example, converting Hive’s VARCHAR to Iceberg’s STRING. Any new fields or type changes are validated and applied to the Iceberg schema using a Spark job running on Amazon EMR. Amazon DynamoDB stores schema synchronization checkpoints which allow tracking changes over time and maintain consistency between the Hive and Iceberg schemas.

BDB4681-arch6

By automating this schema synchronization, maintenance overhead was significantly reduced and freed developers from manually keeping schemas in sync, making the long migration period significantly more manageable.

The preceding figure depicts an automated workflow to maintain schema consistency between Hive and Iceberg tables. AWS Glue captures table state change events from Hive, which trigger an EventBridge event. The event invokes a Lambda function that fetches metadata from DynamoDB and compares schemas fetched from AWS Glue for both Hive and Iceberg tables. If a mismatch is detected, the schema in Iceberg is updated to help ensure alignment, minimizing manual intervention and supporting smooth operation during the migration.

4. Alias management in Snowflake

Objective: Enable Snowflake consumers to adopt Iceberg without changing query references.

The preceding figure shows how Snowflake aliases enable seamless migration by mapping queries like SELECT platform, COUNT(clickouts) FROM funnel.clickouts to Iceberg tables in the Glue Catalog. Even with suffixes added during the Iceberg migration, existing queries and workflows remain unchanged, minimizing disruption for BI tools and analysts.

Implementation:

To help ensure a seamless experience for BI tools and analysts during the migration, Snowflake aliases were used to map external tables to the Iceberg metadata stored in Data Catalog. By assigning aliases that matched the original Hive table names, existing queries and reports were preserved without interruption. For example, an external table was created in Snowflake and aliased it to the original table name, as shown in the following query:

CREATE OR REPLACE ICEBERG TABLE dlk_visitor_funnel_dwh_production.aggregated_cost 
EXTERNAL_VOLUME = 's3_dlk_visitor_funnel_dwh_production_iceberg_migration' 
CATALOG = 'glue_dlk_visitor_funnel_dwh_production_iceberg_migration' 
CATALOG_TABLE_NAME = 'aggregated_cost'; 
ALTER ICEBERG TABLE dlk_visitor_funnel_dwh_production.aggregated_cost REFRESH;

When migration was completed, a simple change back to the alias was done to point to the new location or schema, making the transition seamless and minimizing any disruption to user workflows.

5. Table replacement

Objective: When all ETLs and related data workflows were successfully transitioned to use Apache Iceberg’s capabilities, and everything was functioning correctly with the synchronization flow, it was time to move on to the final phase of the migration. The primary objective was to maintain the original table names, avoiding the use of any prefixes like those employed in the earlier, intermediate migration steps. This helped ensure that the configuration remained tidy and free from unnecessary naming complications.

The preceding figure shows the table replacement to complete the migration, where Hive on Amazon EMR was used to register Parquet files as Iceberg tables while preserving original table names and avoiding data duplication, helping to ensure a seamless and tidy migration.

Implementation:

One of the challenges was that renaming tables isn’t possible within AWS Glue, which prevents the use of a straightforward renaming approach for the existing synchronization flow tables. In addition, AWS Glue doesn’t support the Migrate procedure, which creates Iceberg metadata on top of the existing data file while preserving the original table name. The strategy to overcome this limitation was to use a Hive metastore on an Amazon EMR cluster. By using Hive on Amazon EMR, NI was able to create the final tables with their original names because it operates in a separate metastore environment, giving the flexibility to define any required schema and table names without interference.

The add_files procedure was used to methodically register all the existing Parquet files, thus constructing all necessary metadata within Hive. This was a crucial step, because it helped ensure that all data files were appropriately cataloged and linked within the metastore.

The preceding figure shows the transition of a production table to Iceberg by using the add_files procedure to register existing Parquet files and create Iceberg metadata. This helped ensure a smooth migration while preserving the original data and avoiding duplication.

This setup allowed the use of existing Parquet files without duplicating data, thus saving resources. Although the sync flow used separate buckets for the final architecture, NI chose to maintain the original buckets and cleaned the intermediate files. This resulted in a different folder structure on Amazon S3. The historical data had subfolders for each partition under the root table directory, while the new Iceberg data organizes subfolders within a data folder. This difference was acceptable to avoid data duplication and preserve the original Amazon S3 buckets.

Technical recap

The AWS Glue Data Catalog served as the primary source of truth for schema and table updates, with Amazon EventBridge capturing Data Catalog events to trigger synchronization workflows. AWS Lambda parsed event metadata and managed schema synchronization, while Apache Kafka buffered events for real-time processing. Apache Spark on Amazon EMR handled data transformations and incremental updates, and Amazon DynamoDB maintained state, including synchronization checkpoints and table mappings. Finally, Snowflake seamlessly consumed Iceberg tables via aliases without disrupting existing workflows.

Migration outcome

The migration was completed with zero downtime; continuous operations were maintained throughout the migration, supporting hundreds of pipelines and dashboards without interruption. The migration was done with a cost optimized mindset with incremental updates and partition-level synchronization that minimized the usage of compute and storage resources. Lastly, NI Established a modern, vendor-neutral platform that enables scaling their evolving analytics and machine learning needs. It enables seamless integration with multiple compute and query engines, supporting flexibility and further innovation.

Conclusion

Natural intelligence migration to Apache Iceberg was a pivotal step in modernizing the company’s data infrastructure. By adopting a hybrid strategy and using the power of event-driven architectures, NI helped ensure a seamless transition that balanced innovation with operational stability. The journey underscored the importance of careful planning, understanding the data ecosystem, and focusing on an organization-first approach.

Above all, business was kept in focus and continuity prioritized the user experience. By doing so, NI unlocked the flexibility and scalability of their data lake while minimizing disruption, allowing teams to use cutting-edge analytics capabilities, positioning the company at the forefront of modern data management and readiness for the future.

If you’re considering an Apache Iceberg migration or facing similar data infrastructure challenges, we encourage you to explore the possibilities. Embrace open formats, use automation, and design with your organization’s unique needs in mind. The journey might be complex, but the rewards in scalability, flexibility, and innovation are well worth the effort. You can use the AWS prescriptive guide to help learn more about how to best use Apache Iceberg for your organization


About the Authors

Yonatan DolanYonatan Dolan is a Principal Analytics Specialist at Amazon Web Services. Yonatan is an Apache Iceberg evangelist.

Haya Stern is a Senior Director of Data at Natural Intelligence. She leads the development of NI’s large-scale data platform, with a focus on enabling analytics, streamlining data workflows, and improving dev efficiency. In the past year, she led the successful migration from the previous data architecture to a modern lake house based on Apache Iceberg and Snowflake.

Zion Rubin is a Data Architect at Natural Intelligence with ten years of experience architecting large‑scale big‑data platforms, now focused on developing intelligent agent systems that turn complex data into real‑time business insight.

Michał Urbanowicz is a Cloud Data Engineer at Natural Intelligence with expertise in migrating data warehouses and implementing robust retention, cleanup, and monitoring processes to ensure scalability and reliability. He also develops automations that streamline and support campaign management operations in cloud-based environments.

Maintaining spare capacity during host failures on AWS Outposts with dynamic monitoring

Post Syndicated from Adam Duffield original https://aws.amazon.com/blogs/compute/maintaining-spare-capacity-during-host-failures-on-aws-outposts-with-dynamic-monitoring/

AWS Outposts Rack is a fully managed service that extends AWS infrastructure, services, and APIs to user managed locations. Although you may be used to the seemingly infinite capacity that AWS offers in region, those using Outposts rack for their workloads are limited to the capacity that they order. You will need to closely manage and monitor usage of the available resources as part of capacity management. It is also important to make sure that there is sufficient available capacity in the event of an impactful hardware failure. Although spare capacity is often planned for in the initial Outposts rack configuration order, scaling events and deployments of new workloads can often lead to capacity shortages that only become visible during a failure event.

In this post, we review best practices for capacity management and fault tolerance with Outposts rack followed by an example of how the Outposts API can be used to build an automated monitoring and alerting system to highlight potential resiliency issues.

Planning for failures

The AWS Outposts High Availability Design and Architecture whitepaper discusses the principals of capacity planning within Outposts rack, such as how instance families are mapped to hosts through capacity planning.

When looking to determine resiliency levels, we refer to having N+M capacity, where N represents the number of deployed hosts of a particular instance family (such as C5 or M5), and M represents the number of hosts that can fail while still meeting workload capacity requirements.

The capacity configuration that is applied to each host will impact the necessary recovery process in the event of a failure, depending on the number of configured or running instances. With this in mind, there are three potential recovery scenarios that can apply in the event of a host hardware failure:

  1. Sufficient capacity exists within all instance pools to tolerate the failure of M hosts. This is the most ideal operational position to be in because, in the event of a failure, instances can be recovered to free capacity quickly either through automated features, such as EC2 Auto Scaling groups and instance recovery, or through manual stop/start of the instances.
  2. The required instance type is not available within the available instance pools, however, there is sufficient vCPU available to execute capacity tasks to create the required instance capacity to fulfill the shortfall. As this requires changes to existing capacity, this results in a longer recovery time overall
  3. Insufficient capacity within the Outpost at both the instance pool and vCPU level means that either workloads need to be stopped to fit within the available capacity, or more Outpost hardware needs to be added. This further extends the recovery time for workloads.

Consider the following example of an Outpost configured with four M5 hosts that have been designed with an N+1 resiliency model.

Figure 1: Example configuration with sufficient instance pool capacity

In this example, there are five configured instance pools with the following usages:

Instance size Total instance pool capacity Total free instance pool capacity Max configured instances per host
M5.large 16 6 4
M5.xlarge 8 3 2
M5.2xlarge 8 3 2
M5.4xlarge 8 3 2
M5.8xlarge 4 2 1

For all instance pools, the number of available instances is greater than the maximum number of instances configured on a single host. Therefore, in the event of a failure of any host, instances can be moved to the existing available capacity without any reconfiguration.

We can consider another scenario of running instances on the same set of hosts:

Figure 2: Example configuration with sufficient vCPU capacity

With the usage as shown, four of the configured instance pools have sufficient available capacity. However, the m5.4xlarge instance pool only has one available instance placement, resulting in no tolerance to a single host failure. A single m5 host has a total of 96 vCPU, and in this example the overall capacity of the available slots is 156 vCPU. This means that, with the execution of a capacity task to rebalance the available slots, instances could be restarted after a host failure.

Automating a capacity observability solution

With the release of the capacity task functionality for Outposts, details of instance placement and slot configuration per host are now available within both the AWS Management Console and through the API. With the addition of capacity tasks for Outposts, an automated solution can be created to query this data and provide notifications when the N+M resiliency requirements for your workloads are at risk.

The following diagram shows an example solution to achieve this, with the sample code provided in the AWS Samples GitHub repository. The solution is deployed using an AWS Serverless Application Model (AWS SAM) template.

Figure 3: Sample code architectural diagram

  1. Amazon EventBridge scheduler initiates an AWS Lambda function on a user defined time basis.
  2. The Lambda function evaluates the Outpost rack capacity, creates and updates Amazon CloudWatch alarms, and initiates regular reporting.
  3. An Amazon Simple Notification Service (Amazon SNS) Topic sends the report to user defined endpoints such as email or Slack.
  4. CloudWatch alarms continually monitor for changes to Outpost capacity.
  5. In the event of alarm thresholds being breached, a Lambda function is invoked to send notifications via SNS to the user defined endpoints.

At the core of the solution are two Lambda functions:

Monitoring stack manager: This Lambda function sets up the dynamic monitoring of the desired N+M resiliency level. It achieves this by creating and updating CloudWatch alarms based on the current capacity configuration of the Outposts being monitored, and the capacity usage for each instance family and type. The function generates detailed reports for each Outpost, identifying any potential resiliency issues for each instance family based on the M value that is specified at the time of deployment.

The detailed report, which is issued via the configured SNS topic, starts with an overall summary that clearly details the status of each instance family and the resiliency status.

Figure 4: Resiliency report summary section

Following the overall summary section, a more detailed analysis is provided for each instance family, looking at resiliency from both instance type and vCPU capacity perspectives. As part of this detailed analysis, the level of risk for each capacity pool is provided alongside a review of available instance capacity and suggested mitigation options.

Figure 5: Resiliency report instance pool analysis section

Figure 6: Resiliency report vCPU analysis section

This summary report is generated on every execution of the Monitoring Stack Manager function, with the default configuration that is triggered by the EventBridge Scheduler set to daily.

Process alarm: When the alarm that is configured by the Monitoring Stack Manager Lambda triggers, the Process Alarm Lambda analyzes Outpost capacity, checking for available free vCPUs within the hosts running the affected instance family. Then, a report is sent via SNS to immediately draw attention to the capacity risk, providing guidance if the resiliency risk can be mitigated through the application of an alternate capacity configuration.

Figure 7: Resiliency alarm notification report

Similar to the report generated by the Monitoring Stack Manager function, a more detailed breakdown of the capacity issue is provided that allows for easy identification of any necessary follow up actions. These actions are recommendations for manual resolution of the issue and require you to take action to implement.

When the available capacity returns to a level that matches the N+M resiliency requirements you defined, a further notification report is sent to confirm this, and the alarm is reset.

You may also prefer to integrate notifications into platforms such as Slack or Microsoft Teams. One option for this is to use a Lambda function to rewrite the Amazon SNS notification to publish the message through a Webhook. For more information on this, go to How do I use webhooks to publish Amazon SNS messages to Amazon Chime, Slack, or Microsoft Teams?. Alternatively, for sending messages to Slack, users can use Slack’s email-to-channel integration, which allows Slack to accept email messages and forward them to a Slack channel. For more information, go to Configure Amazon SNS to send messages for alerts to other destinations.

Considerations for deploying this solution

The sample solution provided has been designed to work for users who are operating Outposts at any scale. However, there are some considerations for deploying:

  1. The solution is deployed within the AWS account that owns the Outpost, rather than workload/consumer accounts that might be using Outposts resources through AWS Resource Access Manager (AWS RAM)
  2. The deployment is AWS Region-specific. Therefore, it would need to be deployed in each AWS Region you’re using Outposts in.
  3. Each stack deployment supports dedicated N+M configuration monitoring, allowing you to create separate deployments to match the desired resilience requirements across multiple Outposts.

Cleaning up

Because this solution is implemented through AWS SAM, the only clean up required is to execute the AWS SAM deployment using the cleanup parameter as documented in the code repository readme file.

Conclusion

In this post, we reviewed how to calculate N+M resilience for Outposts rack deployments, and provided a sample solution that can dynamically monitor and report on capacity constraints. Making sure that there is sufficient available capacity within an Outpost rack to tolerate failures is critical to running resilient applications and minimizing any potential downtime. Combining good capacity management practices with service functionality, such as EC2 Auto Scaling, automatic instance recovery, and placement groups, gives you several options to make sure workloads can continue to run even during failure events. If you need any assistance calculating your Outposts rack resiliency, or further information on deploying and running fault tolerant workloads, reach out to your AWS Account team.

Empower your teams with modern architecture governance

Post Syndicated from Rostislav Markov original https://aws.amazon.com/blogs/architecture/empower-your-teams-with-modern-architecture-governance/

Agile product teams thrive on autonomy and rapid iteration, especially in the cloud where they can quickly deploy and test systems. However, traditional architecture governance often stands in their way, because many enterprises still impose centralized, one-off architecture signoffs early in the process. Historically, these signoffs verified design compliance with corporate standards in a slower, on-premises world. In cloud environments, such signoffs quickly become obsolete—along with their associated architectural documents—and discourage teams from considering new insights.

Modern cloud architectures demand a new governance approach. In this post, we show how collaborative architecture oversight can transform team performance through automation, self-service platforms, and distributed decision-making. We explore how key stakeholders (developers, architects, security specialists, and shared services teams) can participate in architectural decisions through asynchronous approval workflows, while making sure non-negotiable controls such as encryption at rest and in transit are consistently enforced through automation and policy as code. This approach empowers teams to experiment and adapt quickly while maintaining robust enterprise standards.

The promise of traditional architecture signoffs

Traditional architecture governance centers around formal reviews where teams submit detailed design documents to a central architecture board. These artifacts often include comprehensive diagrams, technology selections, security plans, and integration specifications. Architects and a variety of stakeholders such as security specialists, compliance officers, quality assurance, and operations teams review these documents in scheduled meetings before issuing a signoff. These approvals represent point-in-time validations against enterprise standards, assuming minimal deviation during implementation.

Why traditional signoffs fall short

Signoffs can create challenges in modern cloud architectures:

  • Substituting for continuous compliance when automated verification is missing, creating false assurance through one-off reviews
  • Creating a restrictive “check-the-box” mentality where meeting minimum documented requirements becomes the goal instead of exploring the best solutions
  • Removing decision authority from implementation teams who often have the most contextual knowledge
  • Delaying implementation feedback loops and reducing organizational agility

Consider an agile team responsible for a strategic cloud application that’s moved beyond minimal viable product and is now scaling to support growing business demands. The system architecture must evolve to handle increasing data volumes, performance requirements, and unanticipated integrations. However, corporate stakeholders insist on rigid adherence to the originally approved architecture documents. Although governance is essential for production systems, this inflexible approach with an early architecture signoff prevents the team from implementing architectural improvements as they go. What appears as meaningful control to stakeholders becomes a stifling constraint for builders, ultimately compromising the system stability the governance process aims to protect.

Modern cloud architecture support

A modern architecture function operates around evolving capabilities across three core areas: preapproved blueprints, distributed governance, and automated insights—with traditional signoffs reserved as an exception path for unique use cases.

A modern architecture function operates around evolving capabilities across three core areas: preapproved blueprints, distributed governance, and automated insights—with traditional signoffs reserved as an exception path for unique use cases.

Preapproved blueprints

Preapproved blueprints like reference architectures and code templates enable your teams to move faster while maintaining corporate standards. This approach supports a use-case-focused assessment model, where architects can concentrate on evaluating specific workflows or threats relevant to a unique use case—rather than having to understand the entire system or threat model from scratch. In this way, the architecture function shifts to managing by exception and refocus reviews on deviations from the standard. Blueprints should have gravity, guiding teams towards standardized patterns while preventing fragmentation through too many tools, databases, middleware options, or SDKs. Consider the following:

  • Pattern-based reference architectures – These set clear principles for security and resilience without micromanaging. These standards align teams while allowing innovation within a reliable framework. The cloud-driven enterprise transformation at BMW Group exemplifies how moving from signoffs to enablement through pattern-based architectures can be successful.
  • Self-service platforms – These provide standardized resources that empower teams to build independently. A self-service platform with preapproved templates for deployment toolchains and infrastructure code enables confident and rapid development. Most companies host these on internal developer platforms like Backstage or AWS Service Catalog. This also allows controlled changes to the blueprints and to track their adoption.
  • Blueprint lifecycle – Blueprints require their own approval process. Although this creates significant efficiency by reducing individual system reviews, it introduces the challenge of managing existing deployments when patterns are updated. Include versioning and migration strategies when introducing new blueprints.

Distributed governance

Distributed governance treats architectural decision-making as a continuous, collaborative process with clear accountability, delegating decision-making and empowering your builders within established blueprints. Consider the following:

  • Architecture decision records (ADRs) – These replace formal, one-off signoffs by documenting decisions for each build iteration. This approach promotes transparency and maintains team agility without compromising accountability for decisions and approvals with key stakeholders. It also allows teams to defer decisions until they are most relevant. For practical implementation guidance, see Using architectural decision records to streamline decision-making for a software development project. To learn about how to write concise ADRs and avoid duplication, consult the ADR templates GitHub repository and When Should I Write an Architecture Decision Record.
  • Community-driven consultations – Architecture departments can foster self-organization by creating architecture communities of practice for peer knowledge-sharing. These communities enable collaboration on best practices, challenges, and standards, cultivating a culture of distributed decision-making, without eliminating the lines of responsibility and accountability for final decisions. This approach works because deep architectural expertise often resides with builders who have hands-on experience with specific use cases and technologies. The role of the architecture function shifts towards providing the necessary infrastructure and identifying thought leaders in the organization.

Automated insights

Automated insights enable compliance with corporate standards through real-time monitoring and adaptation:

Comparison

The modern view improves the overall value-add and support model of your architecture function. The following table compares the traditional and modern views.

Aspect Traditional View Modern View
Purpose Centralized signoff to enforce control and reduce risk Empower teams with preapproved standards to prevent sprawl and manage their distribution such as through Backstage
Architecture approach Fixed, one-time design Evolving, treated as a parametrized, reusable code product refined through feedback
Team empowerment Limited, decisions approved by centralized authority High, teams make decisions within clear standards
Team speed and agility Slower, due to dependency on signoff Faster, continuous iteration without waiting for approvals
Risk management Early signoff to lock in decisions and reduce uncertainty Risk managed through continuous control validation with automated evidence collection, providing stronger assurance for second and third lines of defense than point-in-time assessments
Compliance Manual checks by experts Automated through policy as code and AI tools
Transparency Limited, focused on approval documentation High, lightweight decision records for technical stakeholders and visualizations or dashboards for non-technical oversight functions
Collaboration Centralized control, limited cross-team collaboration Peer-led communities (collective governance) such as Security Guardians
Innovation Restricted, focus on following signed-off designs Encouraged, teams explore within a standards-based framework

Despite the benefits, many organizations struggle to let go of signoffs for a number of reasons, including:

  • Cultural resistance – In risk-averse cultures where failing fast is not accepted, leaders hesitate to let go of centralized control mechanisms.
  • Compliance concerns – In regulated industries, centralized approvals serve as control gates. The modern view replaces point-in-time trust with continuous compliance mechanisms—automated guardrails, real-time monitoring, and evidence collection—enabling even highly regulated environments to achieve compliance with small, autonomous teams operating within clear boundaries (“two-pizza team”).
  • Lack of infrastructure – Some organizations lack self-service platforms, automated compliance, or observability, so they fall back on signoff to manage risk.
  • Governance concerns – Traditional teams often view distributed decision-making as no governance rather than transformed governance.

The modern view offers significant benefits, though with governance considerations:

  • Speed and flexibility – Teams move faster without waiting for approvals, deploying AWS resources iteratively.
  • Empowerment and ownership – Builders using standards and ADRs feel accountable and actively shape architecture.
  • Innovation and experimentation – Self-service tools and AI guidance foster experimentation without delays.

Conclusion

You can empower your builders by rethinking your architecture signoff. In the modern view discussed in this post, architecture governance aligns with the pace and flexibility of the cloud, allowing teams to innovate within a shared framework. This approach values standards and autonomy over control, and transforms your architecture function into a strategic partner in a fast-evolving landscape.

To learn how to establish and maintain cloud-centered principles and patterns, refer to the platform architecture chapter of the AWS Cloud Adoption Framework and the AWS Culture of Security resources.

Related resources


About the author

How Smartsheet reduced latency and optimized costs in their serverless architecture

Post Syndicated from Anton Aleksandrov original https://aws.amazon.com/blogs/architecture/how-smartsheet-reduced-latency-and-optimized-costs-in-their-serverless-architecture/

Cloud software as a service (SaaS) companies are often looking for ways to enhance their architectures for performance and cost-efficiency. Serverless technologies offload infrastructure management, allowing development teams to focus on innovation and delivering business value. As application architectures grow and face more demanding requirements, continued optimization helps maximize both the technical and financial advantages of the serverless approach.

In this post, we discuss Smartsheet’s journey optimizing its serverless architecture. We explore the solution, the stringent requirements Smartsheet faced, and how they’ve achieved an over 80% latency reduction. This technical journey offers valuable insights for organizations looking to enhance their serverless architectures with proven enterprise-grade optimization techniques.

Solution overview

Smartsheet is a leading cloud-based enterprise work management platform, enabling millions of users worldwide to plan, manage, track, automate, and report on work at scale. At the core of the platform lies an event-driven architecture that processes real-time user activity across various document types. Given the collaborative nature of the platform, multiple users can work on these documents concurrently. Every document interaction triggers a series of events that must be processed with minimal latency to maintain data consistency and provide immediate feedback. Processing delays can impact user experience and productivity, making consistently low latency a fundamental business requirement.

Smartsheet’s traffic pattern is spiky during business hours and mostly dormant during nights and weekends. Within peak periods, traffic can fluctuate as users collaborate in real time. To efficiently manage dynamic workloads, which can surge from hundreds to tens of thousands of events per second within minutes, Smartsheet implements a serverless event processing architecture using services such as Amazon Simple Queue Service (Amazon SQS) and AWS Lambda. This architecture uses the elasticity of serverless services and the ability to automatically scale dynamically based on the traffic volume. It makes sure Smartsheet can efficiently handle sudden traffic surges while automatically scaling down during off-peak hours, optimizing for both performance and cost-efficiency.

The following diagram illustrates the high-level architecture of the Smartsheet event processing pipeline.

high-level architecture of the Smartsheet event processing pipeline

Optimization opportunity

Smartsheet uses Lambda functions to serve both batch jobs and API requests. The primary runtime used for building those functions is Java. Lambda automatically scales the number of execution environments allocated to your function on demand to accommodate traffic volume. When Lambda receives an incoming request, it attempts to serve it with an existing execution environment first. If no execution environments are available, the service initializes a new one. During initialization, the Smartsheet’s function code commonly sends several requests to external dependencies, such as databases and REST APIs, which might take time to reply.

The following diagram illustrates how Lambda functions reach out to external dependencies during initialization.

Lambda functions reach out to external dependencies during initialization

These tasks introduced execution environment initialization latency, commonly referred to as a cold start. Although cold starts typically affect less than 1% of requests, Smartsheet had stringent low latency requirements for their architecture to further prioritize the best possible end-user experience.

“To reduce customer request latency while keeping costs low, our engineering team utilized Lambda provisioned concurrency with auto scaling and Graviton, which resulted in an 83% reduction in P95 latency while providing a high quality of service as we continue to scale our platform and its limits,” says Abhishek Gurunathan, Sr Director of Engineering at Smartsheet.

Addressing the cold start with provisioned concurrency

To reduce cold start latency, the Smartsheet team adopted provisioned concurrency in their architecture, a capability that allows developers to specify the number of execution environments that Lambda should keep warm to instantly handle invocations. The following diagram illustrates the difference. Without provisioned concurrency, execution environments are created on demand, which means some invocations (typically less than 1%) need to wait for the execution environment to be created and initialization code to be run. With provisioned concurrency, Lambda creates execution environments and runs initialization code preemptively, making sure invocations are served by warm execution environments.

invocations are served by warm execution environments

Provisioned concurrency includes a dynamic spillover mechanism, making your serverless architecture highly resilient to traffic spikes. When incoming traffic exceeds the preconfigured provisioned concurrency, additional requests are automatically served by on-demand concurrency rather than being throttled. This provides seamless scalability and maintains service availability even during traffic surges, while still providing the performance benefits of pre-warmed execution environments for the majority of requests.

The Smartsheet team configured provisioned concurrency to match their historical P95 concurrency needs. This resulted in immediate improvements—the number of cold starts dropped dramatically and P95 invocation latency dropped by 83%. As the team monitored system performance, they quickly identified another architecture optimization opportunity—the Lambda functions were heavily used during work hours but had significantly fewer invocations at night and on weekends, as illustrated in the following graph.

Lambda functions were heavily used during work hours but had significantly fewer invocations at night and on weekends

Setting a static provisioned concurrency configuration worked great for busy periods, but was underutilized during off-times. The Smartsheet team wanted to further fine-tune their architecture and increase provisioned concurrency utilization rates to achieve higher cost-efficiency. This led them to look into provisioned concurrency auto scaling to match traffic patterns as well as adopting an AWS Graviton architecture.

Auto scaling provisioned concurrency and Graviton architecture

Two common approaches to enable provisioned concurrency are setting a static value and using auto scaling. With static configuration, you specify a fixed number of pre-initialized execution environments that remain continuously warm to serve invocations. This approach is highly effective for architectures that handle predictable traffic patterns. Unpredictable traffic patterns, however, can lead to under-provisioning during peak periods (with spillover to on-demand concurrency resulting in more cold starts) or underutilization during low-usage periods. To address that, provisioned concurrency with auto scaling dynamically adjusts the configuration based on utilization metrics, automatically scaling the number of execution environments up or down to match the actual demand. This dynamic approach optimizes for cost-efficiency and is particularly recommended for architectures with fluctuating traffic patterns.

The following figure compares static and dynamic provisioned concurrency.

static and dynamic provisioned concurrency

To further optimize the architecture for cost-efficiency, the Smartsheet team has implemented provisioned concurrency auto scaling based on utilization metrics. Smartsheet used an infrastructure as code (IaC) approach with Terraform to define auto scaling policies for maximum reusability across hundreds of functions. The policies track the LambdaProvisionedConcurrencyUtilization metric and define the scaling threshold according to the function purpose. For functions implementing interactive APIs, the auto scale threshold is 60% utilization to pre-provision execution environments early, keeping latency extra-low, and making functions more resilient towards traffic surges. For functions that implement asynchronous data processing, Smartsheet’s goal was to achieve the highest utilization rate and cost-efficiency, so they’ve defined the auto scale threshold at 90%.

The following diagram illustrates the architecture of auto scaling policies based on provisioned concurrency utilization rate and workload type.auto scaling policies based on provisioned concurrency utilization rate and workload type

Another optimization technique Smartsheet employed was switching the CPU architecture used by their Lambda functions from x86_64 to arm64 Graviton. To achieve this, Smartsheet adopted the ARM versions of Lambda layers they’ve used, such as Datadog and Lambda Insights extensions. This was required because binaries built using one architecture might be incompatible with a different one. Because Smartsheet functions were implemented with Java and packaged as JAR files, they didn’t have any compatibility issues when moving to Graviton. With Terraform used for codifying the infrastructure, this architecture switch was a simple property change in aws_lambda_function resources, as illustrated in the following code:

property change in aws_lambda_function resources

By switching to a Graviton architecture, Smartsheet saved 20% on function GB-second costs. See AWS Lambda pricing for details.

Best practices

Use the following techniques and best practices to optimize your serverless architectures, reduce cold starts, and increase cost-efficiency:

  • Fine-tune your Lambda functions to find the optimal balance between cost and performance. Increasing memory allocation also adds CPU capacity, which often means faster execution and can lead to reduced overall costs.
  • Use a Graviton2 architecture for compatible workloads to benefit from a better price-performance ratio. Depending on the workload type, switching to Graviton can yield up to 34% improvement.
  • Use provisioned concurrency and Lambda SnapStart to reduce cold starts in your serverless architectures. Start with static provisioned concurrency based on your historical concurrency requirements, monitor utilization, and introduce auto scaling into your architecture to achieve the optimal cost-performance profile.

Conclusion

Serverless architectures using services like Lambda and Amazon SQS offload the infrastructure management and scaling concerns to AWS, allowing teams to focus on innovation and delivering business value. As Smartsheet’s journey demonstrates, using provisioned concurrency and Graviton in your architectures can help significantly improve user experience by reducing latencies while also achieving better cost-efficiency, providing a practical blueprint for optimization across the organization. Whether you’re running large-scale enterprise applications or building new cloud solutions, these proven techniques can help you unlock similar performance gains and cost-efficiencies in your serverless architectures.

To learn more about serverless architectures, see Serverless Land.


About the authors

 

Build and operate an effective architecture review board

Post Syndicated from Darrin Weber original https://aws.amazon.com/blogs/architecture/build-and-operate-an-effective-architecture-review-board/

The rapid change of pace in computing landscapes because of cloud, artificial intelligence, and technology innovation has challenged organizations to keep up while making sure that their initiatives and projects remain compliant with enterprise guidelines and policies. An effective architecture review board (ARB) can help an organization maintain compliance with enterprise guardrails while accelerating implementation of initiatives in their project pipeline.

In this post, we identify the components of an efficient architecture review process, define what an ARB is, and describe how to build and operate an effective enterprise ARB.

What is an architecture review board?

An ARB is a multi-disciplinary team responsible for reviewing solution architectures to help ensure compliance with enterprise guidelines, best practices, and supportability. Team members include stakeholders from different disciplines throughout your organization, which typically include Security, Development, Enterprise Architecture, Infrastructure, and Operations. Including a broad set of stakeholders reduces the amount of project recycle that happens when stakeholder representation is overlooked.

An ARB isn’t a standalone group, it operates within the context of your project implementation process, reviewing solution architectures, custom development, and purchased solutions to maintain enterprise compliance and alignment with goals. As shown in the following diagram, architecture review typically occurs after the design phase—before a build or purchase decision—and again before deployment to validate that the reviewed architecture matches the solution that was built.

Project implementation process with architecture review checkpoints

Most organizations recognize the benefits and value of establishing an ARB. However, they often struggle to define and operate one in a manner that maximizes the benefits, integrates with overall project execution processes, and satisfies the needs of all the stakeholders. An efficient architecture review process imparts organizational benefits such as reduced costs, minimized security events, and diminished technical debt.

Life without a formal architecture review process

One of the most pronounced issues with implementing and maintaining software architecture is the difficulty in achieving human consensus. In any organization, you’ll find a diverse range of team members—each with their own priorities, perspectives, and pain points. Without a formalized review process, these differences can lead to prolonged debates and stalled projects. We often find that many members tend to fall into one of these personas:

The Not Invented Here The Not Invented Here – This individual doesn’t trust any software unless it was built and operated by members of their company. They’re generally wary of any cloud solution and will expend development time to avoid capital expenditure.
The Wait a Minute The Wait a Minute – This individual has good feedback and their input is welcome, but they tend to wait until the last minute before providing any feedback, making it difficult to have productive conversations and act on any constructive criticism.
The Bottleneck The Bottleneck – This individual craves control and insists that all reviews, decisions, and conversations go through them. This makes scaling the architecture review process very challenging and decisions will often come down to the whim of this one person.
The Creative The Creative – This individual has passion for software and for creating things, but will often choose complexity over simplicity and turn their architectures into art projects.
The Perfectionist The Perfectionist – This individual tends to let the perfect be the enemy of the good. While their intentions are pure, this approach can result in delayed decision making and debates on topics that might not be worth the time of the board.
The Historian The Historian – This individual has been at the company for a long time and remembers every success and failure along the way. While the context this individual brings to the table is invaluable, teams must guard against only looking to the past as they try to shape the future.

Benefits of an architecture review board

Establishing an ARB within your organization can yield substantial benefits, enhancing both the quality and efficiency of your architecture. Some key advantages are:

Improved compliance

By systematically reviewing architectural decisions, the ARB helps ensure that designs adhere to company best practices, open standards, and regulatory requirements as set forth by your enterprise architecture.

Reduced technical debt

Technical debt—taking shortcuts in the development process that lead to future complications—is a common issue in software development. The ARB helps identify and mitigate technical debt early in the design phase. By enforcing architectural standards and promoting best practices, the board helps ensure that decisions are made with long-term sustainability in mind. This approach results in more robust, maintainable codebases and reduces the likelihood of future rework.

Efficiency with lowered costs

While a formal architecture review sounds like it might have the potential for increased red tape and lowered efficiency, the ARB instead contributes to operational efficiency by standardizing architectural practices across the organization. This uniformity allows for better resource allocation, faster deployment cycles, and more predictable project timelines. By catching potential issues early in the design phase, the ARB helps avoid costly rewrites and rework, which can lead to significant cost savings over time.

Supportability

Designing for supportability is crucial for the long-term success of any application. The ARB makes sure that architectures are built with maintainability in mind, making it easier for operations teams to manage and troubleshoot systems. This focus on supportability leads to fewer downtime incidents, faster resolution times, and overall higher system reliability. By making sure the composition of the ARB crosses all parts of the organization, supportability concerns can be surfaced earlier and help ensure that changes are properly socialized.

Security

Above all, security is the most critical output of an effective ARB. The ARB plays a pivotal role in embedding security into the architectural fabric from the outset. By conducting thorough security reviews and incorporating security best practices into every design, the ARB makes sure that applications are resilient against unintended disclosure, inadvertent access, and threat actors. This proactive approach not only protects sensitive data, but also builds trust with your customers and stakeholders.

Steps for effective architecture review boards

Whether looking to establish a new architecture review process or improve the effectiveness of a current ARB, we’ve identified eight key steps to make sure that an ARB operates in a way which realizes the benefits of a robust architecture review process while maintaining enterprise compliance. With the exception of leadership support, the steps aren’t presented in a particular order and can be implemented in parallel or in whatever order fits your organization and resource availability.

Leadership support

Identifying a sponsor on the executive leadership team is crucial to the success of the ARB. An executive sponsor fosters participation from stakeholders, representing key organizations such as Security, Development, and Operations, along with gaining their commitment to the review processes. The executive sponsor helps embed the ARB function within the enterprise’s project implementation process. Supported by the executive sponsor, the ARB’s reviews serve as a formal gate within the project process, reducing attempts to bypass the review processes.

Single source for guidance, policies, and best practices

Establish a single, well-known repository or index so that the entire enterprise has a single source of truth that establishes the basis for designing and reviewing architecture. A common repository doesn’t need to be complex. It can be a central document location, wiki, or file share that’s quickly discoverable. Commonly, an enterprise’s collection of guidelines and policies are dispersed and managed by each organization using different mechanisms and repositories. Best practices are often treated as folklore passed between team members. Project teams and ARB stakeholders need to share a common understanding of the enterprise’s collective intelligence consisting of guidelines, policies, and best practices.

As the project community’s collective understanding of the enterprise guidelines and policies grows, initial solution designs are better aligned, and reviews through the ARB accelerate. After a common repository is established, consider using generative AI to create a natural language chatbot, a design chatbot, to simplify access to the collective guidelines, policies, and best practices. See Amazon Bedrock or Amazon Q – Generative AI Assistant.

Defined stakeholders

Make sure that your disciplines have defined stakeholders on the ARB. A good starting point is to identify stakeholders from the Security, Enterprise Architecture, Development, Infrastructure, and Operations teams. Broad representation on the board minimizes recycles and delays later in the project, which can occur when stakeholders aren’t engaged in the review process from the beginning. A stakeholder’s responsibility is to focus on their area of subject matter expertise and commit a portion of their time to the ARB. Consider rotating stakeholders periodically to distribute knowledge and workload through the organization.

Gated process with documented decisions

As previously described, architecture reviews typically occur after design and before solution implementation or purchase. Optionally, another architecture review takes place before deployment to validate that the solution matches what was reviewed and approved. It’s important to complete the review before implementation or the purchase decision and to get stakeholder sign off. Otherwise, projects risk rework and delay later in the process, often impacting cost or schedule to a greater degree. Document each ARB action, including approvals, reasons for recycles, exceptions required, follow-ups needed, and so on. Documented decisions should be added to the project’s overall lifecycle documentation to benefit future inspection of project or similar solution architectures.

Establish an exception process

There will always be exceptions to your enterprise guidelines or policies. Plan for exceptions with a defined process for reviewing, escalating, and gaining approval. Include leadership from both IT and business areas in the assessment and sign-off on an exception. Most importantly, set expiration dates on the exceptions–they should not be granted indefinitely. Exceptions are typically granted to accommodate a temporary nonconformance to provide time to plan for and implement a better, long-term solution.

Architecture central repository

Establish a well known, central repository for solution architecture documents. Solution documentation should be treated as living artifacts that are maintained for the lifecycle of the use case. A central architecture repository benefits teams responsible for operating and maintaining solutions, along with design teams chartered with new solution design. After a repository is established, consider including your architecture documentation in the generative AI design chatbot mentioned previously.

Automate review process

Employ automated architecture review processes wherever possible. Automated review processes allow stakeholders to focus their time on their subject matter expertise instead of administrative tasks. Consider separate review processes based on an initiative’s complexity, cost, and impact. Schedule live meetings with the ARB for the most complex and impactful solutions, and use offline mechanisms, such as email, for other efforts. Define a universal architecture template to capture areas of interest for review and automate the Q&A and sign-off processes. Consider using generative AI to do initial automated design reviews against enterprise core best practices and policies to further streamline stakeholder review processes.

Architecture review process shepherd

Identify a shepherd to help ensure that solution architectures are reviewed and the ARB review processes are broadly understood. The shepherd functions as a liaison with executive sponsors for exceptions. While the shepherd can also be a stakeholder on the board, the shepherd is not the single overall decision maker. The shepherd champions the continuous improvement of the architecture review process and mechanisms.

Conclusion

In this post, we explored the benefits of establishing an architecture review board within an organization, emphasizing its role in maintaining compliance, reducing technical debt, and enhancing operational efficiency. We discussed the challenges organizations face in setting up an effective ARB and provided guidance on the essential components and steps required to build and operate a successful ARB. By following the outlined steps, organizations can maximize the benefits of an ARB, making sure that architectural decisions align with enterprise goals and standards while fostering a culture of continuous improvement and stakeholder collaboration.

For additional guidance on garnering the leadership support necessary for an effective ARB, see Well-Architected Framework: Provide executive sponsorship. For more details on the review process, see Well-Architected Framework: The review process and AWS Well-Architected Tool, an AWS Management Console-based service that provides a consistent process for measuring your architecture using AWS best practices. If you’re interested in establishing a natural language chatbot interface for your enterprise architecture information, see Amazon Bedrock, Amazon Q Business, or Build a contextual chatbot application using Amazon Bedrock Knowledge Bases.


About the authors

Simplifying Code Documentation with Amazon Q Developer

Post Syndicated from Jehu Gray original https://aws.amazon.com/blogs/devops/simplifying-code-documentation-with-amazon-q-developer/

In the fast-paced world of software development, maintaining comprehensive documentation often falls to the bottom of priority lists in favor of delivering functionality. Amazon Q Developer’s /doc agent changes this equation by automating README generation and updates. With this tool, the variable of time spent producing documentation is reduced to the point that it’s no longer a burden to the detriment of functionality.

Understanding Amazon Q Developer’s Documentation Generation

The /doc agent leverages generative AI to analyze your codebase and generate comprehensive documentation. Additionally, the agent respects your .gitignore file and excludes files you don’t want to be included in documentation review.

Solution Overview:
As an example, imagine a cloud infrastructure team at a technology consultancy had been working for weeks on their AWS DataSync project. The solution they built provided an elegant CDK implementation that automated data transfer between Amazon S3 buckets using AWS DataSync. The lead engineer had just finished implementing the final IAM role configurations when the product manager requested comprehensive documentation for the next day’s client handoff meeting. The team realized this would typically take hours of focused work. Instead, they decided to try Amazon Q Developer /doc agent.

Getting Started with /doc:

To begin using the /doc agent, you’ll need to:

  1. Set up Amazon Q: Follow these steps
  2. Open your IDE with the Amazon Q extension installed
  3. Click the Amazon Q icon to open the chat panel
  4. Enter /doc to start the documentation process
  5. Select your documentation task:
    1. Create a new README
    2. Update an existing README with recent code changes
This image shows the user entering /doc to start the documentation process

Figure 1 – Entering /doc to start the documentation process.

Example: Creating a New README:

For projects without documentation, simply select: Create a README. It will confirm the project you are creating the README for.

This image shows the user selecting the Create a README option

Figure 2 – Select the Create a README option.

Once you verify the folder and select yes, the agent begins creating the README document for the folder.
Here are the steps it works through: scanning the source files, summarizing the source files, and generating the documentation.

This image shows the user verifying the folder and select yes

Figure 3 – Verify the folder and select yes.

When the document is created, you can preview the README file. The agent then presents you with the ability to either accept the changes or request modifications before implementation.

This image shows the preview of the created README file.

Figure 4 – Preview the created README file.

If you choose to accept, the README file is added to your project.

This image shows the user accepting the changes so the README is added to your project folder

Figure 5 – Accept the changes so the README is added to your project folder.

 

Example: Updating Documentation with Code Changes

When your code evolves, you can keep the documentation synchronized by using /doc. The agent will review your recent code modifications and suggest appropriate documentation updates.

This image shows when the user selects update an existing README

Figure 6 – Select Update an existing README to make changes to a README file.

Then you can describe the changes you want the agent to make to your README file.

This image shows how you can describe changes you’d like the agent to make

Figure 7 – Describe changes to your README files.

For targeted documentation updates, you can provide specific instructions:

This image shows the user verifying the folder

Figure 8 – Verify the folder and select yes.

Once you’ve made the changes, the agent asks you to verify them by selecting yes.

This image shows the user verifying the changes made

Figure 9 – Verify the changes and select yes.

Advanced Documentation Management

Multi-step Documentation Refinement:

This image shows the steps in Documentation Refinement

Figure 10 – Multi-step Documentation Refinement.

 

Amazon Q Developer /doc agent allows for iterative improvement of your documentation through feedback loops. After generating initial documentation, you can:

  1. Review the generated content for gaps or inaccuracies
  2. Provide specific feedback to refine particular sections
  3. Request additional sections on complex topics
  4. Gradually build comprehensive documentation through multiple iterations

This iterative approach is particularly valuable for complex projects where documentation needs to evolve alongside the codebase.

Documentation for Specific Components

For modular projects, you can create targeted documentation at different levels:

  • Root-level README for project overview
  • Component-level READMEs for specific modules
  • Service-level documentation for microservices
  • API documentation for interfaces

By combining these documentation levels, you can maintain a hierarchical documentation structure that remains manageable and specific.

Handling Documentation Inheritance

This image shows how to handle Documentation Inheritance.

Figure 11 – Handling Documentation Inheritance.

When working with derived or extended codebases:

  1. Generate base documentation for the parent project
  2. Create specialized documentation for extensions
  3. Cross-reference related documentation to maintain consistency
  4. Use the /doc agent to update specific sections when inheritance patterns change

Documentation Syncing Strategy

This image shows Documentation Syncing Strategy

Figure 12 – Documentation Syncing Strategy.

For teams working on rapidly evolving projects:

  1. Establish a documentation update schedule aligned with sprints
  2. Assign documentation reviews as part of code review processes
  3. Use /doc to generate change summaries after significant updates
  4. Implement a verification process to ensure generated documentation accurately reflects code changes

Best Practices for /doc Agent

To improve results from documentation generation with Amazon Q Developer, follow these best practices:

  1. Optimize repository size: Amazon Q Developer supports documentation generation across your codebase, accommodating projects up to the specified size limits. While documentation for larger repositories may require additional processing time and could provide more generalized results, you have the option to request documentation for specific subsets of code or individual files to receive more detailed insights.
  2. Maintain high-quality code: The quality of documentation Amazon Q Developer generates improves significantly when your code is well-commented and organized, has meaningful naming conventions for programming entities, and follows standard coding conventions.
  3. Be specific with change requests: When requesting specific README changes in natural language, choose to update an existing README and select the option to make a specific change. After initial documentation generation, you can request additional modifications by describing exactly what updates you want.
  4. Craft effective change descriptions: When describing desired updates, include:
    1. Specific sections you want to modify
    2. Exact content you want to add or remove
    3. Particular issues that need correcting
    4. How project functionality should be reflected in the README
    5. References to content available in your codebase.
  5. Understand system limitations: Amazon Q Developer doesn’t have access to private or internal platforms and might lack knowledge of third-party tools, specialized software, or custom tooling in your code. Content requiring this knowledge won’t be documented automatically. In these cases, you’ll need to manually edit the README to include information Amazon Q Developer cannot generate.

Documentation Quotas and Limitations

When working with Amazon Q Developer /doc agent, be aware of these important constraints:

  • Document generations per task: There’s a limit to the number of feedback iterations allowed per documentation session. This quota resets each time you start a new documentation task.
  • File filtering: Amazon Q Developer filters out files or folders defined in your .gitignore file. This helps streamline the documentation process by focusing only on relevant code files.

Conclusion

Amazon Q Developer /doc agent transforms the documentation process from a tedious chore to an automated, efficient workflow. By generating and maintaining READMEs based on your actual code, it ensures documentation remains accurate and up-to-date without consuming precious development time.

As part of Amazon Q Developer’s free tier, the /doc agent is readily available to integrate into your development process. Start using it today to improve your project documentation and enhance team collaboration.

About the Authors:

Jehu Gray

Jehu Gray is a Prototyping Architect at Amazon Web Services where he helps customers design solutions that fits their needs. He enjoys exploring what’s possible with IaC.

Abiola Olanrewaju

Abiola Olanrewaju is a Solutions Architect at AWS, specializing in helping Financial services customers implement scalable solutions that drive business outcomes. He has a keen interest in Data Analytics, Security and Generative AI.

Adeogo Olajide

Adeogo is a Solutions Architect at AWS, where he supports GovTech customers and other public sector customers in their cloud transformation journey. He specializes in designing secure, scalable, and compliant architectures that help public sector organizations modernize their digital services. Outside of work, he enjoys playing and watching soccer.

Joyce Muya

Joyce Muya is a Solutions Architect at AWS where she supports Enterprise Engaged customers in the media and entertainment sector. She specializes in Analytics and AI/ML workloads.

Damola Oluyemo

Damola Oluyemo is a Solutions Architect at Amazon Web Services focused on Enterprise customers. He helps customers design cloud solutions while exploring the potential of Infrastructure as Code and generative AI in software development.

Enhance Agentforce data security with Private Connect for Salesforce Data Cloud and Amazon Redshift – Part 3

Post Syndicated from Yogesh Dhimate original https://aws.amazon.com/blogs/big-data/enhance-agentforce-data-security-with-private-connect-for-salesforce-data-cloud-and-amazon-redshift-part-3/

Data protection is a high priority, particularly as organizations face increasing cybersecurity threats. Maintaining the security of customer data is top priority for AWS and Salesforce. With AWS PrivateLink, Salesforce Private Connect eliminates common security risks associated with public endpoints. Salesforce Private Connect now works with Salesforce Data Cloud to keep your customer data secure when using with key services like Agentforce.

In Part 2 of this series, we discussed the architecture and implementation details of cross-Region data sharing between Salesforce Data Cloud and AWS accounts. In this post, we discuss how to create AWS endpoint services to improve data security with Private Connect for Salesforce Data Cloud.

Solution overview

In this example, we configure PrivateLink for an Amazon Redshift instance to enable direct, private connectivity from Salesforce Data Cloud. AWS recommends that organizations use an Amazon Redshift managed VPC endpoint (powered by PrivateLink) to privately access a Redshift cluster or serverless workgroup. For details about best practices, refer to Enable private access to Amazon Redshift from your client applications in another VPC.

However, some organizations might prefer to use PrivateLink managed by themselves—for example, a Redshift managed VPC endpoint is not yet available in Salesforce Data Cloud, and you need to manage your PrivateLink connection. This post focuses on the solution to configure self-managed PrivateLink between Salesforce Data Cloud and Amazon Redshift in your AWS account to establish private connectivity.

The following architecture diagram shows the steps for setting up private connectivity between Salesforce Data Cloud and Amazon Redshift in your AWS account.

To set up private connectivity between Salesforce Data Cloud and Amazon Redshift, we use the following resources:

Prerequisites

To complete the steps in this post, you must already have Amazon Redshift running in a private subnet and have the permissions to manage it.

Create a security group for the Network Load Balancer

The security group acts as a virtual firewall. The only traffic that reaches the instance is the traffic allowed by the security group rules. To enhance the security posture, you only want to allow traffic to Redshift instances. Complete the following steps to create a security group for your Network Load Balancer (NLB):

  1. On the Amazon VPC console, choose Security groups in the navigation pane.
  2. Choose Create security group.
  3. Enter a name and description for the security group.
  4. For VPC, use the same virtual private cloud (VPC) as your Redshift cluster.
  5. For Inbound rules, add a rule to allow traffic to ingress the listening port 5439 on the load balancer.

  1. For Outbound rules, add a rule to allow traffic to your Redshift instance.

  1. Choose Create security group.

Create a target group

Complete the following steps to create a target group:

  1. On the Amazon EC2 console, under Load balancing in the navigation pane, choose Target groups.
  2. Choose Create target group.
  3. For Choose a target type, select IP addresses.

  1. For Protocol: Port, choose TCP and port 5436 (if your Redshift cluster runs on a different port, change the port accordingly).
  2. For IP address type, select IPv4.
  3. For VPC, choose the same VPC as your Redshift cluster.
  4. Choose Next.

  1. For Enter an IPv4 address from a VPC subnet, enter your Amazon Redshift IP address.

To locate this address, navigate to your cluster details on the Amazon Redshift console, choose the Properties tab, and under Network and security settings, expand VPC endpoint connection details and copy the private address of the network interface. If you’re using Amazon Redshift Serverless, navigate to the workgroup home page. The Amazon Redshift IPv4 addresses can be located in the Network and security section under Data access when you choose VPC endpoint ID.

  1. After you add the IP address, choose Include as pending below, then choose Create target group.

Create a load balancer

Complete the following steps to create a load balancer:

  1. On the Amazon EC2 console, choose Load balancers in the navigation pane.
  2. Choose Create load balancer.
  3. Choose Network.
  4. For Load balancer name, enter a name.
  5. For Scheme, select Internal.
  6. For Load balancer address type, select IPv4.
  7. For VPC, use the VPC that your target group is in.

  1. For Availably Zones, select the Availability Zone where the Redshift cluster is running.
  2. For Security groups, choose the security group you created in the previous step.
  3. For Listener details, add a listener that points to the target group created in the last step:
    1. For Protocol, choose TCP.
    2. For Port, use 5439.
    3. For Default action, choose Redshift-TargetGroup.
  4. Choose Create load balancer.

Make sure that the registered targets in the target group are healthy before proceeding. Also make sure that the target group has a target for all Availability Zones in your AWS Region or the NLB has the Cross-zone load balancing attribute enabled.

In the load balancer’s security setting, make sure that Enforce inbound rules on PrivateLink traffic is off.

Create an endpoint service

Complete the following steps to create an endpoint service:

  1. On the Amazon VPC console, choose Endpoint services in the navigation pane.
  2. Choose Create endpoint service.
  3. For Load balancer type, choose Network.
  4. For Available load balancers, select the load balancer you created in the last step
  5. From Supported Regions, select an additional region if Data Cloud isn’t hosted in the same AWS region as the Redshift instance.  For additional settings leave Acceptance required.

If this is selected, later, when the Salesforce Data Cloud endpoint is created to connect to the endpoint service, you will need to come back to this page to accept the connection. If not selected, the connection will be built directly.

  1. For Supported IP address type, select IPv4.
  2. Choose Create.

Next, you need to allow Salesforce principals.

  1. After you create the endpoint service, choose Allow principals.
  2. In another browser, navigate to Salesforce Data Cloud Setup.
  3. Under External Integrations, access the new Private Connect menu item.
  4. Create a new private network route to Amazon Redshift.

  1. Copy the principal ID.

  1. Return to the endpoint service creation page.
  2. For Principals to add, enter the principal ID.
  3. Copy the endpoint service name.
  4. Choose Allow principals.

  1. Return to the Salesforce Data Cloud private network configuration page.
  2. For Route Name, enter the endpoint service name.
  3. Choose Save.

The route status should show as Allocating.

If you opted to accept connections in the previous step, you will now need to accept the connection from Salesforce Data Cloud.

  1. On the Amazon VPC console, navigate to the endpoint service.
  2. On the Endpoint connections tab, locate your pending connection request.

  1. Accept the endpoint connection request from Salesforce Data Cloud.

Navigate to the Salesforce Data Cloud setup and wait 30 seconds, then refresh the private connect route so the status shows as Ready.

You can now use this route when creating a connection with Amazon Redshift. For additional details, refer to Part 1 of this series.

Amazon Redshift federation PrivateLink failover

Now that we have discussed how to configure PrivateLink to use with Private Connect for Salesforce Data Cloud, let’s discuss Amazon Redshift federation PrivateLink failover scenarios.

You can choose to deploy your Redshift clusters in three different deployment modes:

  • Amazon Redshift provisioned in a Single-AZ RA3 cluster
  • Amazon Redshift provisioned in a Multi-AZ RA3 cluster
  • Amazon Redshift Serverless

PrivateLink relies on a customer managed NLB connected to service endpoints using IP address target groups. The target group has the IP addresses of your Redshift instance. If there is a change in IP address targets, the NLB target group must be updated to the new IP addresses associated with the service. Failover behavior for Amazon Redshift will differ based on the deployment mode you employ.

This section describes PrivateLink failover scenarios for these three deployment modes.

Amazon Redshift provisioned in a Single-AZ RA3 cluster

RA3 nodes support provisioned cluster VPC endpoints, which decouple the backend infrastructure from the cluster endpoint used for access. When you create or restore an RA3 cluster, Amazon Redshift uses a port within the ranges of 5431–5455 or 8191–8215. When the cluster is set to a port in one of these ranges, Amazon Redshift automatically creates a VPC endpoint in your AWS account for the cluster and attaches network interfaces with a private IP for each Availability Zone in the cluster. For the PrivateLink configuration, you use the IP associated with the VPC endpoint as the target for the frontend NLB. You can identify the IP address of the VPC endpoint on the Amazon Redshift console or by doing a describe-clusters query on the Redshift cluster.

Amazon Redshift will not remove a network interface associated with a VPC endpoint unless you add an additional subnet to an existing Availability Zone or remove a subnet using Amazon Redshift APIs. We recommend that you don’t add multiple subnets to an Availability Zone to avoid disruption. There might be failover scenarios where additional network interfaces are added to a VPC endpoint.

In RA3 clusters, the nodes are automatically recovered and replaced as needed by Amazon Redshift. The cluster’s VPC endpoint will not change even if the leader node is replaced.

Cluster relocation is an optional feature that allows Amazon Redshift to move a cluster to another Availability Zone without any loss of data or changes to your applications. When cluster relocation is turned on, Amazon Redshift might choose to relocate clusters in some situations. In particular, this happens where issues in the current Availability Zone prevent optimal cluster operation or to improve service availability. You can also invoke the relocation function in cases where resource constraints in a given Availability Zone are disrupting cluster operations. When a Redshift cluster is relocated to a new Availability Zone, the new cluster has the same VPC endpoint but a new network interface is added in the new Availability Zone. The new private address should be added to the NLB’s target group to optimize availability and performance.

In the case that a cluster has failed and can’t be recovered automatically, you have to initiate a restore of the cluster from a previous snapshot. This action generates a new cluster with a new DNS name, connection string, and VPC endpoint and IP address for the cluster. You have to update the NLB with the new IP for the VPC endpoint of the new cluster.

Amazon Redshift provisioned in a Multi-AZ RA3 cluster

Amazon Redshift supports Multi-AZ deployments for provisioned RA3 clusters. By using Multi-AZ deployments, your Redshift data warehouse can continue operating in failure scenarios when an unexpected event happens in an Availability Zone. A Multi-AZ deployment deploys compute resources in two Availability Zones, and these compute resources can be accessed through a single endpoint. In the case of a failure of the primary nodes, Multi-AZ clusters will make secondary nodes primary and deploy a new secondary stack in another Availability Zone. The following diagram illustrates this architecture.

Multi-AZ clusters deploy VPC endpoints that point to network interfaces in two Availability Zones, which should be configured as a part of the NLB target group. To configure the VPC endpoints in the NLB target group, you can identify the IP addresses of the VPC endpoint using the Amazon Redshift console or by doing a describe-clusters query on the Redshift cluster. In a failover scenario, VPC endpoint IPs will not change and the NLB doesn’t require an update.

Amazon Redshift will not remove a network interface associated with a VPC endpoint unless you add an additional subnet in to an existing Availability Zone or remove a subnet using Amazon Redshift APIs. We recommend that you don’t add multiple subnets to an Availability Zone to avoid disruption.

Amazon Redshift Serverless

Redshift Serverless provides managed infrastructure. You can perform the get-workgroup query to get the workgroup’s VpcEndpoint IPs. IPs should be configured in the target group of the PrivateLink NLB. Because this is a managed service, the failover is managed by AWS. During the event of an underlying Availability Zone failure, the workgroup might get a new set of IPs. You can frequently query the workgroup configuration or DNS record for the Redshift cluster to check if IP addresses have changed and update the NLB accordingly.

Automating IP address management

In scenarios where Amazon Redshift operations might change the IP address of the endpoint needed for Amazon Redshift connectivity, you can automate the update of NLB network targets by monitoring the results for cluster DNS resolution, using describe-cluster or get-workgroup queries, and using an AWS Lambda function to update the NLB target group configuration.

You can periodically (on a schedule) query the DNS of the Redshift cluster for IP address resolution. Use a Lambda function to compare and update the IP target groups for the NLB. For an example of this solution, see Hostname-as-Target for Network Load Balancers.

For legacy DS2 clusters where the IP address of the leader node must be explicitly monitored, you can configure Amazon CloudWatch metrics to monitor the HealthStatus of the leader node. You can configure the metric to trigger an alarm, which alerts an Amazon Simple Notification Service (Amazon SNS) topic and invokes a Lambda function to reconcile the NLB target group.

For backup and restore patterns, you can create a rule in Amazon EventBridge triggered on the RestoreFromClusterSnapshot API action, which invokes a Lambda function to update the NLB with the new IP addresses of the cluster.

For a cluster relocation pattern, you can trigger an event based on the Amazon Redshift ModifyCluster availability-zone-relocation API action.

Conclusion

In this post, we discussed how to use AWS endpoint services to improve data security with Private Connect for Salesforce Data Cloud. If you are currently using the Salesforce Data Cloud zero-copy integration with Amazon Redshift, we recommend you follow the steps provided in this post to make the network connection between Salesforce and AWS secure. Reach out to your Salesforce and AWS support teams if you need additional support to implement this solution.


About the authors

Yogesh Dhimate is a Sr. Partner Solutions Architect at AWS, leading technology partnership with Salesforce. Prior to joining AWS, Yogesh worked with leading companies including Salesforce driving their industry solution initiatives. With over 20 years of experience in product management and solutions architecture Yogesh brings unique perspective in cloud computing and artificial intelligence.

Avijit Goswami is a Principal Solutions Architect at AWS specialized in data and analytics. He supports AWS strategic customers in building high-performing, secure, and scalable data lake solutions on AWS using AWS managed services and open source solutions. Outside of his work, Avijit likes to travel, hike, watch sports, and listen to music.

Ife Stewart is a Principal Solutions Architect in the Strategic ISV segment at AWS. She has been engaged with Salesforce Data Cloud over the last 2 years to help build integrated customer experiences across Salesforce and AWS. Ife has over 10 years of experience in technology. She is an advocate for diversity and inclusion in the technology field.

Mike Patterson is a Senior Customer Solutions Manager in the Strategic ISV segment at AWS. He has partnered with Salesforce Data Cloud to align business objectives with innovative AWS solutions to achieve impactful customer experiences. In his spare time, he enjoys spending time with his family, sports, and outdoor activities.

Drew Loika is a Director of Product Management at Salesforce and has spent over 15 years delivering customer value via data platforms and services. When not diving deep with customers on what would help them be more successful, he enjoys the acts of making, growing, and exploring the great outdoors.

Adopting Amazon Q Developer in Enterprise Environments

Post Syndicated from Rene-Martin Tudyka original https://aws.amazon.com/blogs/devops/adopting-amazon-q-developer-in-enterprise-environments/

Increasing developer productivity has been a persistent challenge for senior leaders over the past decades. With the rise of generative artificial intelligence (AI), a new wave of innovation is transforming how software teams work. Generative AI tools like Amazon Q Developer are emerging as game-changers, supporting developers across the entire software development lifecycle. But how are large-scale organizations successfully adopting this AI-assisted approach? This document shares good practices I have discovered through working with enterprise customers navigating this technological transformation.

A common misperception still is that a developer is more productive when they write code faster. But a median developer spends less than one hour per day writing code, as a study conducted by software.com in 2022 shows. This makes clear that there are other aspects to consider when it comes to building an application and running it in production. Generative AI tools for developers, such as Amazon Q Developer, started as coding companions that provided inline completions. As the technology evolved, Amazon Q Developer is now able to support developers across the entire software development life cycle.

The Change Challenge

Generative AI offers new and interesting ways for developers to solve challenges and to support them in their daily work. But taking advantage of those opportunities takes some time. It introduces a noticeable change to their familiar ways of working and therefore it is much about forming a new habit. This, according to science, usually takes a minimum of two months. Teams need the space and permission to play with this new approach and to find out what works best for them. Expecting them to adopt a new way of working while expecting the same (or higher) level of output at the same time will only lead to teams falling back to what they know works. For customers that successfully have adopted Amazon Q Developer I have seen them reducing the delivery expectations to give space to learn, and having required teams to share learnings in return.

Additionally, as in any other large change project there is a significant cultural aspect to consider. If people feel intrinsically convinced by the value and benefits of an AI-supported approach to software engineering they will use it. “Simply ordering from above” will not help making an adoption successful. But building community, experimenting, learning, and sharing successes will. “Culture eats strategy for breakfast”, as the management visionary Peter Drucker said.

Keeping that in mind it is clear that there is no one-size-fits-all prescriptive way of successfully introducing generative AI tools for developers in your organization. It very much depends on your individual culture, goals, challenges, people, skills, and technology stack for example. However, there are a number of principles and good practices that work well with several of our large-scale customers who have successfully adopted Amazon Q Developer.

Best Practices for Successful Adoption

The following illustration describes the change management cycle and recommended activities for adopting AI-assisted software development.

The change management cycle – align, plan, roll out, reflect and repeat

Figure 1. The change management cycle

1. Secure Top-Down Commitment

Secure executive buy-in for adopting AI-assisted software development because this is a powerful sign for the organization. It helps to remove roadblocks and to promote successes across the organization. An executive sponsor links the adoption of AI-assisted software development to the strategic goals of the organization, will help resolving prioritization and capacity conflicts. Invite the sponsor to a kick-off meeting. Include them for the discussion of goals and success criteria, and keep them updated on progress, success stories, and challenges.

2. Become Clear on Goals and Success Criteria

Simply rolling out Generative AI tools to a large developer base won’t be enough. You need to be clear what you expect from such an adoption for your organization, for your developers, and for your business. It is important for you to understand your organization’s pain points from a developer experience perspective and how they affect your development productivity. These likely differ between different personas, like Software Developers and DevOps Engineers. For example, are your applications lacking test coverage or is your code lacking documentation, which creates a maintenance burden, and makes it difficult to onboard new developers? Is handling legacy code risky and time-consuming so that you avoid necessary upgrades because of the complexity and effort? Is troubleshooting applications locally or in production eating up your developers’ time? Or are you facing challenges caused by hard-to-find security issues in the code? What is it that you want to achieve by introducing an AI-assisted development approach, and why?

3. Establish Ownership

Adopting a new approach like the use of generative AI in software development at scale leads to many change management and coordination tasks. Those are related to getting access to the tool, enablement for new users, budget planning, measuring success and creating transparency on problems, creating momentum and making sure the adoption sustains, amongst others. Therefore, introducing a Customer Champion as a leader who coordinates the business and technology related aspects of the adoption is a common approach. They will connect the strategic goals of the organization with the tactical activities necessary for the implementation. If your adoption is spanning multiple different business units with larger developer bases consider establishing this role for each of them, forming a team that collaborates across your whole organization. Key responsibilities of the Customer Champion are to bring the right people together to successfully work on the adoption, to create transparency on status, and to address impediments early on.

4. Introduce Metrics

Once you have gained clarity on the specific pain points and goals for your organization, the next step is to determine the appropriate metrics to measure the success of your efforts. For example, you can measure the impact of using Amazon Q Developer on onboarding new developers by tracking the time to first commit against a previously established baseline average. Comparing the time it takes to complete certain tasks – like writing and integrating code, fixing bugs, setting up or upgrading environments, the time it takes to build a new feature, or by comparing sprint velocity before and after the introduction of Amazon Q Developer will give you further indications. But keep in mind that there is ambiguity in these metrics because they are impacted by different factors.

Focusing on developer experience and productivity, monitor the development of your established measurement framework, like DORA, SPACE, or GSM. SPACE in particular, with its “Satisfaction & Well-Being” dimension, pays close attention to how software developers perceive their work and how satisfied they are with their own productivity. Tools like Amazon Q Developer have shown a positive effect here, as for example a McKinsey study shows. They help to free developers from tasks that are perceived as toil, like repetitive and boring grunt work that doesn’t necessarily bring business value. To measure this impact on perceived productivity, customers sometimes design simple surveys, asking developers questions like “in percentage, how would you estimate the impact of using Amazon Q Developer on your productivity” or “on a scale from 1-5, how is Amazon Q Developer impacting the satisfaction with your day-to-day work?”

As a last dimension, understanding the general usage of Amazon Q Developer across your organization is important. Monitor the number active subscriptions, accepted code recommendations, or the number of executed security scans from the Amazon Q Developer dashboard. That way you will get an understanding for the acceptance of the tool in your developer base. Correlate this information with the success metrics you defined.

5. Start Small

When “rolling out” an AI-assisted software development approach, keep the technology adoption lifecycle and Everett Roger’s bell curve in mind. It describes that adoption of a new technology usually starts with a few “innovators”, followed by a small fraction of “early adopters”. Only if those early groups demonstrate convincing success, the majority of users will follow the adoption. To settle on this model will help you making the adoption of AI-assisted development successful in your organization.

Start small. Identify your tech innovators, or champions, who are enthusiastic to support the introduction of Amazon Q Developer and to advocate the approach in your organization. With them, build a team or a Center of Excellence (COE) that will help with identifying early adopters across your development teams, building technical and enablement foundations for onboarding users, creating roll-out plans, and evangelizing and sustaining the adoption. The champions will act as a bridge between your adoption program team and your users. They can provide insights and feedback on the adoption and come up with recommendations.

6. Create Momentum

Now that you have your foundations in place it is time to create momentum and onboard your early adopters to Amazon Q Developer. Start with a communication plan to raise awareness, to share updates and success stories, and to collect feedback from the users. You will need to create training material covering organization-specific resources (how to get access, for example, or which communication channels and contact points exist), user guides, tutorials, and pointers to relevant documentation and online resources. Consider creating your own prompt library, documenting prompts for Amazon Q Developer that your users find particularly helpful. There are community projects like promptz.dev that can deliver inspiration. Especially for new users this will have a lot of value for getting up to speed.

Consider how to integrate learning pathways with your organization’s learning platform moving forward. These should include the company-specific experience and knowledge you collect. In addition, it might be helpful to integrate external offers, like classroom trainings or workshop formats offered by AWS.

Hands-on technical enablement workshops will help your early adopters jumpstarting their experience. AWS offers multiple formats that can be tailored to your individual context. These include service deep-dives with Amazon Q Developer specialists, Immersion Days and hackathon support for teams to hands-on experience the tool in a guided, interactive format, or joined proof-of-value engagements. Your AWS account team will provide you more information.

7. Sustain and Scale Adoption

At that point you will have established Amazon Q Developer among an initial segment of your developer base. However, keeping Roger’s adoption curve in mind the largest part of the adoption still lies ahead of you. To facilitate the it across your whole organization, focus on delivering enablement workshops for the teams to be onboarded. These will be led by your champions and incorporate your organization-specific practices and knowledge.

To sustain the adoption, make sure the existing user base stays active. Continue offering exchange and support, collecting feedback, updating your enablement material, sharing updates on the latest developments for Amazon Q Developer, and reviewing your metrics. Create visibility across your organization for the accomplishments and positive experiences. Let user talk about the impact Amazon Q Developer has on their work. This will help building momentum. Nurture community building around your users of AI-assisted development.

Establish regular office hours with your champions for providing support and enablement to your users of Amazon Q Developer. This will facilitate the continuous gathering of feedback to improve documentation and enablement materials, collect and promote success stories, and validate the adoption approach. Additionally, establish a consistent communications and reporting cadence to keep all relevant stakeholders informed of key metrics, updates, feedback, and success stories. This ensures the alignment of the adoption of AI-assisted development with your strategic goals and expectations.

8. Inspect and Adapt

In addition, keep reflecting on the goals and metrics you came up with from a strategic perspective. Are they developing in the right direction? Are your pain points still the same? Should you move your focus to different aspects of the software development life cycle? And how might new capabilities of Amazon Q Developer support your goals?

Conclusion

By following the outlined approach, large organizations can successfully navigate the challenges of adopting Amazon Q Developer. The approach addresses technical, cultural, and organizational aspects of the adoption, increasing the likelihood of realizing the full potential of AI-assisted development across the enterprise.

If you are looking for advice or support during your adoption journey, your AWS Account Team will connect you with experts on Amazon Q Developer, provide you information on training and enablement, and support you with setting up a successful rollout program for your organization. To learn more about how Amazon Q Developer’s capabilities support the whole software development lifecycle, visit the product page or have a look at the documentation.

About the Author

Rene-Martin Tudyka is a Senior Customer Solutions Manager at AWS and provides guidance and support for enterprise customers on their cloud maturity journey. He has a long background in developing highly performant IT organizations and in successful large-scale cloud adoption.

Enhancing cloud security in AI/ML: The little pickle story

Post Syndicated from Nur Gucu original https://aws.amazon.com/blogs/security/enhancing-cloud-security-in-ai-ml-the-little-pickle-story/

As AI and machine learning (AI/ML) become increasingly accessible through cloud service providers (CSPs) such as Amazon Web Services (AWS), new security issues can arise that customers need to address. AWS provides a variety of services for AI/ML use cases, and developers often interact with these services through different programming languages. In this blog post, we focus on Python and its pickle module, which supports a process called pickling to serialize and deserialize object structures. This functionality simplifies data management and the sharing of complex data across distributed systems. However, because of potential security issues, it’s important to use pickling with care (see the warning note in pickle — Python object serialization). In this post, we’re going to show you ways to build secure AI/ML workloads that use this powerful Python module, ways to detect that it’s in use that you might not know about, and when it might be getting abused, and finally highlight alternative approaches that can help you avoid these issues.

Quick tips

Understanding insecure pickle serialization and deserialization in Python

Effective data management is crucial in Python programming, and many developers turn to the pickle module for serialization. However, issues can arise when deserializing data from untrusted sources. The Python bytestream that pickling uses, is proprietary to Python. Until it’s unpickled, the data in the bytestream can’t be thoroughly evaluated. This is where security controls and validation become critical. Without proper validation, there’s a risk that an unauthorized user could inject unexpected code, potentially leading to arbitrary code execution, data tampering, or even unintended access to a system. In the context of AI model loading, secure deserialization is particularly important—it helps prevent outside parties from modifying model behavior, injecting backdoors, or causing inadvertent disclosure of sensitive data.

Throughout this post, we will refer to pickle serialization and deserialization collectively as pickling. Similar issues can be present in other languages (for example, Java and PHP) when untrusted data is used to recreate objects or data structures, resulting in potential security issues such as arbitrary code execution, data corruption, and unauthorized access.

Static code analysis compared to dynamic testing for detecting pickling

Security code reviews, including static code analysis, offer valuable early detection and thorough coverage of pickling-related issues. By examining source code (including third-party libraries and custom code) before deployment, teams can minimize security risks in a cost-effective way. Tools that provide static analysis can automatically flag unsafe pickling patterns, giving developers actionable insights to address issues promptly. Regular code reviews also help developers improve secure coding skills over time.

While static code analysis provides a comprehensive white-box approach, dynamic testing can uncover context-specific issues that only appear during runtime. Both methods are important. In this post, we focus primarily on the role of static code analysis in identifying unsafe pickling.

Tools like Amazon CodeGuru and Semgrep are effective at detecting security issues early. For open source projects, Semgrep is a great option to maintain consistent security checks.

The risks of insecure pickling in AI/ML

Pickling issues in AI/ML contexts can be especially concerning.

  • Invalidated object loading: AI/ML models are often serialized for future use. Loading these models from untrusted sources without validation can result in arbitrary code execution. Libraries such as pickle, joblib, and some yaml configurations allow serialization but must be handled securely.
    • For example: If a web application stores user input using pickle and unpickles it later with no validation, an unauthorized user could craft a harmful payload that executes arbitrary code on the server.
  • Data integrity: The integrity of pickled data is critical. Unexpectedly crafted data could corrupt models, resulting in incorrect predictions or behaviors, which is especially concerning in sensitive domains such as finance, healthcare, and autonomous systems.
    • For example: A team updates its AI model architecture or preprocessing steps but forgets to retrain and save the updated model. Loading the old pickled model under new code might trigger errors or unpredictable outcomes.
  • Exposure of sensitive information: Pickling often includes all attributes of an object, potentially exposing sensitive data such as credentials or secrets.
    • For example: An ML model might contain database credentials within its serialized state. If shared or stored without precautions, an unauthorized user who unpickles the file might gain unintended access to these credentials.
  • Insufficient data protection: When sent across networks or stored without encryption, pickled data can be intercepted, leading to inadvertent disclosure of sensitive information.
    • For example: In a healthcare environment, a pickled AI model containing patient data could be transmitted over an unsecured network, enabling an outside party to intercept and read sensitive information.
  • Performance overhead: Pickling can be slower than other serialization formats (such as, JSON or Protocol Buffers), which can affect ML and large language model (LLM) applications when inference speed is critical.
    • For example: In a real-time natural language processing (NLP) application using an LLM, heavy pickling or unpickling operations might reduce responsiveness and degrade the user experience.

Detecting unsafe unpickling with static code analysis tools

Static code analysis (SCA) is a valuable practice for applications dealing with pickled data, because it helps detect insecure pickling before deployment. By integrating SCA tools into the development workflow, teams can spot questionable deserialization patterns as soon as code is committed. This proactive approach reduces the risk of events involving unexpected code execution or unintended access due to unsafe object loading.

For instance, in a financial services application where objects are routinely pickled, a SCA tool can scan new commits to detect unvalidated unpickling. If identified, the development team can quickly address the issue, protecting both the integrity of the application and sensitive financial data.

Patterns in the source code

There are various ways to load a pickle object in Python. In this context, methods for detection can be tailored for secure coding habits and needed package dependencies. Many Python libraries include a function to load pickle objects. An effective approach can be to catalog all Python libraries used in the project, then create custom rules in your static code analysis tool to detect unsafe pickling or unpickling within those libraries.

CodeGuru and other static analysis tools continue to evolve their capability to detect unsafe pickling patterns. Organizations can use these tools and create custom rules to identify potential security issues in AI/ML pipelines.

Let’s define the steps for creating a safe process for addressing pickling issues:

  1. Generate a list of all the Python libraries that are used in your repository or environment.
  2. Check the static code analysis tool in your pipeline for current rules and the ability to add custom rules. If the tool is capable of discovering all the libraries used in your project, you can rely on it. However, if it’s not able to discover all the libraries used in your project, you should consider adding user-provided custom rules in your static code analysis tool.
  3. Most of the issues can be identified with well-designed, context-driven patterns in the static code analysis tool. For addressing the pickling issues, you need to identify pickling and unpickling functions.
  4. Implement and test the custom rules to verify full coverage of pickling and unpickling risks. Let’s identify patterns for a few libraries:
    • NumPy can efficiently pickle and unpickle arrays; useful for scientific computing workflows requiring serialized arrays. To catch potential unsafe pickle usage in NumPy, custom rules could target patterns like:
      import numpy as np
      data = np.load('data.npy', allow_pickle=True)

    • npyfile is a utility for loading NumPy arrays from pickled files. You can add the following patterns to your custom rules to discover potentially unsafe pickle object usage.
      import npyfile
      data = npyfile.load('example.pkl')

    • pandas can pickle and unpickle DataFrames using pickle, allowing for efficient storage and retrieval of tabular data. You can add the following patterns to your custom rules to discover potentially unsafe pickle object usage.
      import pandas as pd
      df = pd.read_pickle('dataframe.pkl')

    • joblib is often used for pickling and unpickling Python objects that involve large data, especially NumPy arrays, more efficiently than standard pickle. You can add the following patterns to your custom rules to discover potentially unsafe pickle object usage.
      from joblib import load
      data = load('large_data.pkl')

    • Scikit-learn provides joblib for pickling and unpickling objects and is particularly useful for models. You can add the following patterns to your custom rules to discover potentially unsafe pickle object usage.
      from sklearn.externals import joblib
      data = joblib.load('example.pkl')

    • PyTorch provides utilities for loading pickled objects that are especially useful for ML models and tensors. You can add the following patterns to your custom rule format to discover potentially unsafe pickle object usage.
      import torch
      data = torch.load('example.pkl')

By searching for these functions and parameters in code, you can set up targeted rules that highlight potential issues with pickling.

Effective mitigation

Addressing pickling issues requires not only detection, but also clear guidance on remediation. Consider recommending more secure formats or validations where possible as follows:

  • PyTorch
    • Use Safetensors to store tensors. If pickling remains necessary, add integrity checks (for example, hashing) for serialized data.
  • pandas
    • Verify data sources and integrity when using pd.read_pickle. Encourage safer alternatives (for example, CSV, HDF5, or Parquet) to help avoid pickling risks.
  • scikit-learn (via joblib)
    • Consider Skops for safer persistence. If switching formats isn’t feasible, implement strict validation checks before loading.
  • General advice
    • Identify safer libraries or methods whenever possible.
    • Switch to formats such as CSV or JSON for data, unless object-specific serialization is absolutely required.
    • Perform source and integrity checks before loading pickle files—even those considered trusted.

Example

The following is an example implementation that shows safe pickle implementation as a representation of the preceding information.

import io
import base64
import pickle
import boto3
import numpy as np
from cryptography.fernet import Fernet

###############################################################################
# 1) RESTRICTED UNPICKLER
###############################################################################
#
# By default, pickle can execute arbitrary code when loading. Here we implement
# a custom Unpickler that only allows certain safe modules/classes. Adjust this
# to your application's requirements.
#

class RestrictedUnpickler(pickle.Unpickler):
    """
    Restricts unpickling to only the modules/classes we explicitly allow.
    """
    allowed_modules = {
        "numpy": set(["ndarray", "dtype"]),
        "builtins": set(["tuple", "list", "dict", "set", "frozenset", "int", "float", "bool", "str"])
    }

    def find_class(self, module, name):
        if module in self.allowed_modules:
            if name in self.allowed_modules[module]:
                return super().find_class(module, name)
        # If not allowed, raise an error to prevent arbitrary code execution.
        raise pickle.UnpicklingError(f"Global '{module}.{name}' is forbidden")

def restricted_loads(data: bytes):
    """Helper function to load pickle data using the RestrictedUnpickler."""
    return RestrictedUnpickler(io.BytesIO(data)).load()

###############################################################################
# 2) AWS KMS & ENCRYPTION HELPERS
###############################################################################

def generate_data_key(kms_key_id: str, region: str = "us-east-1"):
    """
    Generates a fresh data key using AWS KMS. 
    Returns (plaintext_key, encrypted_data_key).
    """
    kms_client = boto3.client("kms", region_name=region)
    response = kms_client.generate_data_key(KeyId=kms_key_id, KeySpec='AES_256')
    
    # Plaintext data key (use to encrypt the pickle data locally)
    plaintext_key = response["Plaintext"]
    # Encrypted data key (store along with your ciphertext)
    encrypted_data_key = response["CiphertextBlob"]
    return plaintext_key, encrypted_data_key

def decrypt_data_key(encrypted_data_key: bytes, region: str = "us-east-1"):
    """
    Decrypts the encrypted data key via AWS KMS, returning the plaintext key.
    """
    kms_client = boto3.client("kms", region_name=region)
    response = kms_client.decrypt(CiphertextBlob=encrypted_data_key)
    return response["Plaintext"]

def build_fernet_key(plaintext_key: bytes) -> Fernet:
    """
    Construct a Fernet instance from a 32-byte data key.
    Fernet requires a 32-byte key *encoded* in URL-safe base64.
    """
    if len(plaintext_key) < 32:
        raise ValueError("Data key is smaller than 32 bytes; cannot build a Fernet key.")
    fernet_key = base64.urlsafe_b64encode(plaintext_key[:32])
    return Fernet(fernet_key)

###############################################################################
# 3) MAIN LOGIC
###############################################################################

def upload_pickled_data_s3(
    numpy_obj: np.ndarray,
    bucket_name: str,
    s3_key: str,
    kms_key_id: str,
    region: str = "us-east-1"
):
    """
    Pickle a numpy object, encrypt it locally, and upload the ciphertext + 
    encrypted data key to S3.
    """
    # 1. Generate data key from KMS
    plaintext_key, encrypted_data_key = generate_data_key(kms_key_id, region)
    
    # 2. Build Fernet from plaintext data key
    fernet = build_fernet_key(plaintext_key)
    
    # 3. Serialize the numpy object with pickle
    pickled_data = pickle.dumps(numpy_obj, protocol=pickle.HIGHEST_PROTOCOL)
    
    # 4. Encrypt the pickled data
    encrypted_data = fernet.encrypt(pickled_data)
    
    # 5. Upload to S3 along with the encrypted data key (in metadata)
    s3_client = boto3.client("s3", region_name=region)
    s3_client.put_object(
        Bucket=bucket_name,
        Key=s3_key,
        Body=encrypted_data,
        Metadata={
            "encrypted_data_key": base64.b64encode(encrypted_data_key).decode("utf-8")
        }
    )
    print(f"Encrypted pickle uploaded to s3://{bucket_name}/{s3_key}")

def download_and_unpickle_data_s3(
    bucket_name: str,
    s3_key: str,
    region: str = "us-east-1"
) -> np.ndarray:
    """
    Download the ciphertext and the encrypted data key from S3. Decrypt the data 
    key with KMS, use it to decrypt the pickled data, then load with a restricted 
    unpickler for safety.
    """
    s3_client = boto3.client("s3", region_name=region)
    
    # 1. Get object from S3
    response = s3_client.get_object(Bucket=bucket_name, Key=s3_key)
    
    # 2. Extract the encrypted data key from metadata
    metadata = response["Metadata"]
    encrypted_data_key_b64 = metadata.get("encrypted_data_key")
    if not encrypted_data_key_b64:
        raise ValueError("Missing encrypted_data_key in S3 object metadata.")
    
    encrypted_data_key = base64.b64decode(encrypted_data_key_b64)
    
    # 3. Decrypt data key via KMS
    plaintext_key = decrypt_data_key(encrypted_data_key, region)
    fernet = build_fernet_key(plaintext_key)
    
    # 4. Decrypt the pickled data
    encrypted_data = response["Body"].read()
    decrypted_pickled_data = fernet.decrypt(encrypted_data)
    
    # 5. Use restricted unpickler to load the numpy object
    numpy_obj = restricted_loads(decrypted_pickled_data)
    
    return numpy_obj

###############################################################################
# DEMO USAGE
###############################################################################

if __name__ == "__main__":
    # --- Replace with your actual values ---
    KMS_KEY_ID = "arn:aws:kms:us-east-1:123456789012:key/your-kms-key-id"
    BUCKET_NAME = "your-secure-bucket"
    S3_OBJECT_KEY = "encrypted_npy_demo.bin"
    AWS_REGION = "us-east-1"  # or region of your choice
    
    # Example numpy array
    original_array = np.random.rand(2, 3)
    print("Original Array:")
    print(original_array)
    
    # Upload (pickle + encrypt) to S3
    upload_pickled_data_s3(
        numpy_obj=original_array,
        bucket_name=BUCKET_NAME,
        s3_key=S3_OBJECT_KEY,
        kms_key_id=KMS_KEY_ID,
        region=AWS_REGION
    )
    
    # Download (decrypt + unpickle) from S3
    retrieved_array = download_and_unpickle_data_s3(
        bucket_name=BUCKET_NAME,
        s3_key=S3_OBJECT_KEY,
        region=AWS_REGION
    )
    
    print("\nRetrieved Array:")
    print(retrieved_array)
    
    # Verify integrity
    assert np.allclose(original_array, retrieved_array), "Arrays do not match!"
    print("\nSuccess! The retrieved array matches the original array.")

Conclusion

With the rapid expansion of cloud technologies, integrating static code analysis into your AI/ML development process is increasingly important. While pickling offers a powerful way to serialize objects for AI/ML and LLM applications, you can mitigate potential risks by applying manual secure code reviews, setting up automated SCA with custom rules, and following best practices such as using alternative serialization methods or verifying data integrity.

When working with ML models on AWS, see the AWS Well-Architected Framework’s Machine Learning Lens for guidance on secure architecture and recommended practices. By combining these approaches, you can maintain a strong security posture and streamline the AI/ML development lifecycle.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.
 

Nur Gucu
Nur Gucu

Nur is a Security Engineer at Amazon with 10 years of offensive security expertise, specializing in generative AI security, security architecture, and offensive testing. Nur has developed security products at startups and banks, and has created frameworks for emerging technologies. She brings practical experience to solve complex AI security challenges, and yet believes as many as six impossible things before breakfast.
Matt Schwartz
Matt Schwartz

Matt is an Amazon Principal Security Engineer specializing in generative AI security, risk management, and cloud computing expertise. His two-decade expertise enables organizations to implement advanced AI while maintaining strict security protocols. Matthew develops strategic frameworks that safeguard critical assets and ensure compliance in the evolving digital landscape, securing complex systems during transformations.

Building a voice interface for generative AI assistants

Post Syndicated from Reynaldo Hidalgo original https://aws.amazon.com/blogs/messaging-and-targeting/building-voice-interface-for-genai-assistant/

Generative AI is revolutionizing how businesses interact with their customers through natural conversational interfaces. While organizations can implement AI assistants across various channels, phone calls remain a preferred method for many customers seeking support or information.

We’ll demonstrate how to create a voice interface for your existing Amazon Bedrock generative AI assistant, enabling customers to engage in phone-based conversations with your AI implementation.

Solution overview

Using Workflow Studio for Amazon Web Services (AWS) Step Functions, we built a voice communication interface that connects with the Amazon Nova Micro model in Amazon Bedrock (Figure 1). The demo application uses the base model to enable open-ended questions. Organizations can implement either Amazon Bedrock Agents or Flows to address specific business requirements.

A Step Functions workflow diagram illustrating a voice communication system integrated with Amazon Bedrock. The workflow shows a sequential process starting with call handling, followed by parallel branches: one for managing hold music and another for processing voice input through Amazon Transcribe and Amazon Nova Micro model. The diagram demonstrates the complete call flow from initial welcome message through question-answer cycles to call completion.

Figure 1 – Step Functions workflow that enables voice communication to a generative AI assistant

How it works:

  1. Inbound call arrives
  2. System plays welcome message
  3. System asks caller for questions
  4. Voice recording starts, stopping when silence is detected
  5. Parallel flows begin:
    • First flow
      1. Plays some music while the caller is on-hold
    • Second flow
      1. Transcribes the recording using Amazon Transcribe
      2. Sends transcribed question to the Amazon Nova Micro model in Amazon Bedrock
      3. Upon receiving the response, stops the on-hold music
  6. Text-to-speech plays the model’s answer
  7. System asks for additional questions and loops to Step 4 or ends the call

 Expanded capabilities and optimizations

These are potential improvements, additional functionalities, and advanced features that can enhance the demo application:

  • The transcription component is interchangeable with any speech-to-text generative AI model (including Whisper Large V3 Turbo on Amazon Bedrock Marketplace)
  • The PSTN audio service RecordAudio Action can be tuned to adjust silence duration and background noise levels
  • Enabling the PSTN audio service VoiceFocus feature to improve call clarity by reducing background noise and enhancing voice quality
  • PSTN audio service Session Initiation Protocol (SIP) media applications can also handle calls through SIP trunking by using Amazon Chime SDK Voice Connector, streamlining integration with existing phone systems
  • The UpdateSipMediaApplicationCall API is a PSTN audio service feature that lets you regain call control and apply new actions during active calls
  • Parallel workflow states allow user-friendly handling of API service calls by playing music during processing
  • PSTN audio service provides pay-per-minute rates with serverless, scalable telephony infrastructure

Deploying the solution

 The following steps allow you to deploy the voice communication interface workflow (Figure 1) together with the supporting serverless architecture for Step Functions and PSTN audio service integration. In a previous blog, we demonstrated how combining Step Functions and Amazon Chime SDK PSTN audio service streamlines the development of reliable telephony applications through a visual workflow design.

 Prerequisites:

  1. AWS Management Console access
  2. Node.js and npm installed
  3. AWS Command Line Interface (AWS CLI) installed and configured
  4. Enable access to the Amazon Nova Micro model through the Amazon Bedrock console

 Walkthrough:

The AWS Cloud Development Kit (AWS CDK) project on the AWS GitHub repository will deploy the following resources:

  • phoneNumberBedrock – Provisioned phone number for the demo application
  • sipMediaApp – SIP media application that routes calls to lambdaProcessPSTNAudioServiceCalls
  • sipRule – SIP rule that directs calls from phoneNumberBedrock to sipMediaApp
  • lambdaProcessPSTNAudioServiceCallsAWS Lambda function for call processing
  • roleLambdaProcessPSTNAudioServiceCalls – AWS Identity and Access Management (IAM) Role for lambdaProcessPSTNAudioServiceCalls
  • stepfunctionBedrockWorkflow – Step Functions workflow for the telephony application
  • roleStepfuntionBedrockWorkflow – IAM Role for stepfunctionBedrockWorkflow
  • s3BucketApp – Amazon Simple Storage Service (Amazon S3) bucket for storing customer questions recordings
  • s3BucketPolicy IAM Policy granting PSTN audio service access to s3BucketApp
  • lambdaAudioTranscription – Lambda function for audio transcription
  • lambdaLayerForTranscription – Lambda layer required for lambdaAudioTranscription
  • roleLambdaAudioTranscription – IAM Role for lambdaAudioTranscription

Follow these steps to deploy the CDK stack:

  1. Clone the repository.
git clone https://github.com/aws-samples/sample-chime-sdk-bedrock-voice-interface
cd sample-chime-sdk-bedrock-voice-interface
npm install
  1. Bootstrap the stack.
#default AWS CLI credentials are used, otherwise use the –-profile parameter
#provide the <account-id> and <region> to deploy this stack
cdk bootstrap aws://<account-id>/<region>
  1. Deploy the stack.
#default AWS CLI credentials are used, otherwise use the –-profile parameter
#phoneAreaCode: the United States area code used to provision the phone number
cdk deploy –-context phoneAreaCode=NPA
  1. Call the provisioned phone number to test the sample application.

Cleaning up:

To clean up this demo, execute:

cdk destroy

Conclusion

We demonstrated how organizations can add voice capabilities to their existing generative AI implementations using Amazon Bedrock. The solution enables customers to interact with AI assistants through traditional phone calls, expanding accessibility and user engagement. The demo application showcases an architecture combining AWS Step Functions and Amazon Chime SDK PSTN audio service, delivering natural voice conversations with AI models through quick deployment using visual workflows.

Organizations benefit from cost optimization with pay-per-minute pricing, enterprise-ready telephony integration through PSTN or SIP trunking, and automatic scaling to match customer demand. This foundation enables businesses to build practical AI applications ranging from all day customer service agents, to multi-language support services, and knowledge base assistants. By following this solution, you can quickly extend your generative AI investments to voice channels, providing more value to your customers while maintaining operational efficiency.

Contact an AWS Representative to know how we can help accelerate your business.