Tag Archives: Architecture

Secure Amazon Elastic VMware Service (Amazon EVS) with AWS Network Firewall

Post Syndicated from Sheng Chen original https://aws.amazon.com/blogs/architecture/secure-amazon-elastic-vmware-service-amazon-evs-with-aws-network-firewall/

Amazon Elastic VMware Service (Amazon EVS) helps organizations migrate, run, and scale VMware workloads natively on AWS. It delivers a VMware Cloud Foundation (VCF) environment that operates directly within your Amazon Virtual Private Cloud (Amazon VPC) on Amazon EC2 bare-metal instances. The solution helps customers accelerate cloud migrations and data center exits without needing to refactor existing applications.

For customers considering a hybrid cloud architecture, a unified network security solution is required to protect application traffic across Amazon EVS environments, Amazon VPCs, on-premises data centers and the internet. It also needs to provide a single point of control for firewall policy management, centralized logging, and monitoring to streamline network security operations.

AWS Network Firewall is a managed firewall and intrusion detection and prevention service (IDS/IPS) that can help address these requirements. Built on AWS managed infrastructure, it automatically scales with traffic demands while maintaining high availability and consistent performance. The service provides centralized policy management and traffic inspection across multiple VPCs and AWS accounts. Additionally, it provides comprehensive visibility and reporting through firewall log collections to Amazon Simple Storage Service (Amazon S3), Amazon CloudWatch Logs, or Amazon Data Firehose.

In this post, we demonstrate how to utilize AWS Network Firewall to secure an Amazon EVS environment, using a centralized inspection architecture across an EVS cluster, VPCs, on-premises data centers and the internet. We walk through the implementation steps to deploy this architecture using AWS Network Firewall and AWS Transit Gateway.

Architecture overview

AWS Network Firewall operates as a “bump-in-the-wire” solution, which transparently inspects and filters network traffic across Amazon VPCs. It is inserted directly into the traffic path by updating VPC or Transit Gateway route tables, allowing it to examine all packets without requiring any changes to the existing application flow patterns.

The following diagram depicts the architecture overview of our centralized inspection model using AWS Network Firewall.

Figure 1: Secure Amazon EVS with AWS Network Firewall using centralized inspection architecture

Figure 1: Secure Amazon EVS with AWS Network Firewall using centralized inspection architecture

The Amazon EVS environment is deployed directly within a customer VPC (i.e. EVS VPC), which consists of EVS VLAN subnets that form the underlay networks for VCF deployment. This infrastructure provides connectivity for NSX overlay networks, host management, vMotion, and vSANAmazon VPC Route Server enables dynamic routing between the underlay networks and overlay networks. For more information, see Concepts and components of Amazon EVS.

The architecture also includes a standard workload VPC (i.e. VPC01), and a Direct Connect Gateway connects to the on-premises data center via an AWS Direct Connect connection. We use a dedicated egress VPC with NAT gateways for centralized internet egress, and a separate ingress VPC with Application Load Balancers to terminate ingress web traffic and steer flows back to the target services.

With this architecture, the following traffic flow patterns can be inspected:

East-West Traffic:

  • Between EVS VPCs and Workload VPCs
  • Between Workload VPCs

North-South Traffic:

  • Between EVS/Workload VPCs and on-premises
  • Between EVS/Workload VPCs and internet
  • Between on-premises and internet

The centralized inspection architecture provides several benefits:

  • Single point of control for network security inspection across multiple VPCs
  • Enhanced rule enforcement across AWS infrastructure, on-premises resources, and the internet
  • Centralized logging and monitoring

For this demo we use the AWS Network Firewall native integration with AWS Transit Gateway capability to streamline firewall deployment and management. With a native firewall attachment, AWS automatically provisions and manages all the necessary VPC resources, reducing the operational overhead of managing subnets, route tables, and firewall endpoints within the inspection VPC.

Prerequisites

This post assumes familiarity with: AWS Command Line Interface (AWS CLI), Amazon VPC, Amazon EC2, NAT gateway, Application Load Balancer, Internet gateway, AWS Direct Connect, AWS Transit Gateway and the VMware VCF platform.

The following prerequisites are necessary to complete this solution.

  • An EVS VPC includes:
    • An Amazon EVS cluster (minimum 4x i4i nodes)
    • VPC CIDR: 10.0.0.0/16
    • NSX Segments CIDR: 192.168.0.0/19 (summarized)
    • A VPC Route Server deployed in the EVS VPC to receive NSX segment routes via BGP dynamic routing. Refer to the EVS User Guide for more details.
  • A Workload VPC (VPC01):
    • CIDR: 172.21.0.0/16
  • An Egress VPC:
    • CIDR: 172.23.0.0/16
    • 1x Internet Gateway
    • 1x NAT Gateway
  • An Ingress VPC:
    • CIDR: 172.24.0.0/16
    • 1x Internet Gateway
    • 1x Application Load Balancer
  • Optional: a Direct Connect Gateway:
    •  connecting to the on-premises environment (10.0.0.0/8)

Note: The CIDR blocks used in this example are for demo purposes only; change the address spaces to match your own networking environment. The design can also be scaled to include additional EVS environments and/or other VPCs based on workload needs.

Walkthrough

In this section, we walk through the implementation steps to deploy the centralized inspection architecture with AWS Network Firewall and AWS Transit Gateway. We focus on the overall network integration of the architecture without diving into the detailed configurations of AWS Network Firewall or Transit Gateway.

1. Create an AWS Transit Gateway

In the VPC console, create a Transit Gateway. Make sure to deselect the following options:

  • Default route table association
  • Default route table propagation

Create two empty transit gateway route tables and associate them with the Transit Gateway.

  • Pre-inspection route table: steers traffic into the AWS Network Firewall for centralized inspection
  • Post-inspection route table: returns traffic back to its original destination after inspection and is permitted by the AWS Network Firewall

2. Attach VPCs to the Transit Gateway

Attach all four VPCs (EVS, VPC01, Ingress, Egress) to the same Transit Gateway. The Direct Connect Gateway can also be attached to the Transit Gateway if AWS Network Firewall is needed to inspect traffic between the on-premises environment and AWS or the internet.

Figure 2: Attach VPCs to the Transit Gateway

Figure 2: Attach VPCs to the Transit Gateway

Associate all attachments to the pre-inspection Transit Gateway route table.

Figure 3: Associate VPC attachments to the pre-inspection route table

Figure 3: Associate VPC attachments to the pre-inspection route table

3. Create an AWS Network Firewall with Transit Gateway native integration

In the Network Firewall section of the VPC console, choose Create firewall.

At the Attachment type section, select Transit Gateway to enable native integration with the existing Transit Gateway.

Figure 4: Enable AWS Network Firewall native integration with Transit Gateway

Figure 4: Enable AWS Network Firewall native integration with Transit Gateway

At the Logging configuration, enable the following log types with CloudWatch log group as the log destination. Create a log group for each log type in the CloudWatch Console.

  • Alert: /anfw-centralized/anfw01/alert
  • Flow: /anfw-centralized/anfw01/flow

Create and associate an empty firewall policy to deploy the AWS Network Firewall instance. The firewall policy contains a list of rule groups that define how the firewall inspects and manages traffic. This empty firewall policy can be configured later.

With the Transit Gateway native integration enabled, a Transit Gateway attachment is automatically created for the AWS Network Firewall, with the resource type shown as Network Function. In addition, the Appliance Mode is automatically enabled for the firewall attachment to make sure the Transit Gateway continues to use the same Availability Zone (AZ) for the attachment over the lifetime of a flow.

Associate the firewall attachment to the post-inspection Transit Gateway route table.

Figure 5: AWS Network Firewall native attachment

Figure 5: AWS Network Firewall native attachment

4. Update Transit Gateway route tables

Update the pre-inspection Transit Gateway route table with a default route that points to the AWS Network Firewall attachment. This makes sure traffic that arrives to the Transit Gateway from all VPC attachments and the Direct Connect Gateway attachment is sent to the firewall for centralized inspection.

Figure 6: Transit Gateway pre-inspection route table

Figure 6: Transit Gateway pre-inspection route table

Add the following static routes to the post-inspection route table to direct return traffic back to each VPC and the Direct Connect Gateway accordingly.

Figure 7: Transit Gateway post-inspection route table

Figure 7: Transit Gateway post-inspection route table

5. Update VPC route tables

Finally, update route tables at each VPC as per the following table.

Make sure to add the following routes at the relevant VPC route tables:

  • EVS VPC and VPC01 have a default route (marked in blue) to steer all egress flows into AWS Network Firewall for centralized inspection.
  • Ingress VPC and Egress VPC have RFC-1918 routes (marked in green) to direct return traffic to the Transit Gateway.

Within the EVS VPC, notice the NSX segment routes are automatically propagated to the NSX uplink subnet route table and the private subnet route table via the VPC Route Server.

Figure 8: NSX uplink subnet route table within EVS VPC

Figure 8: NSX uplink subnet route table within EVS VPC

A centralized security inspection architecture has now been deployed for the EVS environment, using AWS Network Firewall with Transit Gateway native integration.

6. Testing

Egress inspection (FQDN filtering)

To test egress inspection from EVS VPC or VPC01 to the internet, create a stateful rule group for the firewall instance using FQDN filtering:

  • Rule group format: Domain list
  • Domain names: .google.com
  • Source IPs: 192.168.0.0/19, 172.21.0.0/16
  • Protocols: HTTP & HTTPS
  • Action: Allow

As expected, testing web access from a virtual machine (192.168.12.10) within the EVS environment to the allowed domain (i.e. google.com) is permitted by the AWS Network Firewall. However, access to unauthorized domain (i.e. facebook.com) is blocked at the firewall with an alert trigged, which can be verified at the CloudWatch log group at /aws/network-firewall/alert/.

Figure 9: Egress inspection from EVS to internet with FQDN filtering

Figure 9: Egress inspection from EVS to internet with FQDN filtering

Ingress inspection

Create another stateful rule group to allow Application Load Balancers deployed within the Ingress VPC to access a web server running in the EVS environment via HTTP protocol:

  • Rule group format: Standard stateful rule
  • Geographic IP Filtering: Disable Geographic IP filtering
  • Protocol: HTTP
  • Source: 172.24.0.0/16
  • Source Port: ANY
  • Destination: 192.168.12.10/32
  • Destination Port ANY
  • Traffic direction: Forward
  • Action: Alert

The CloudWatch firewall logs show an Application Load Balancer (172.24.6.45) from the Ingress VPC can establish HTTP connection to the EVS web server (192.168.12.10). Additionally, the Application Load Balancer has successfully registered the EVS web server as a remote IP target.

Figure 10: Ingress inspection from Ingress VPC to EVS

Figure 10: Ingress inspection from Ingress VPC to EVS

East-West inspection

For East-West inspection testing, update the previous stateful rule group to add a new rule to block ICMP traffic from VPC01 to the EVS VPC.

  • Rule group format: Standard stateful rule
  • Geographic IP Filtering: Disable Geographic IP filtering
  • Protocol: ICMP
  • Source: 172.21.0.0/16
  • Source Port: ANY
  • Destination: 192.168.0.0/19
  • Destination Port: ANY
  • Action: Drop

As a result, pings from an EC2 instance (172.21.128.4) from VPC01 to the EVS web server (192.168.12.10) are being dropped.

Figure 11: East-West Inspection from VPC01 to EVS

Figure 11: East-West Inspection from VPC01 to EVS

Conclusion

In this post, we demonstrated how to utilize AWS Network Firewall to secure Amazon EVS workloads and to provide centralized traffic inspection between Amazon EVS environments, Amazon VPCs, on-premises data centers, and the internet. We walked through the implementation steps for deploying the centralized inspection architecture using AWS Network Firewall and AWS Transit Gateway.

To learn more, review these resources:


About the authors

Enhancing API security with Amazon API Gateway TLS security policies

Post Syndicated from Anton Aleksandrov original https://aws.amazon.com/blogs/compute/enhancing-api-security-with-amazon-api-gateway-tls-security-policies/

As compliance frameworks evolve and cryptographic standards advance, organizations are looking for additional controls to improve their cloud security posture. One of the neccesary controls is a more granular TLS configuration, for example when regulatory requirements mandate disabling older ciphers like CBC or enforcing TLS 1.3 as a minimum version.

In this post, you will learn how the new Amazon API Gateway’s enhanced TLS security policies help you meet standards such as PCI DSS, Open Banking, and FIPS, while strengthening how your APIs handle TLS negotiation. This new capability increases your security posture without adding operational complexity, and provides you with a single, consistent way to standardize TLS configuration across your API Gateway infrastructure.

Overview

Previously, API Gateway offered limited control over TLS configuration, and only for custom domain names. Default endpoints used fixed security policies, which meant you often had to introduce additional infrastructure, such as custom Amazon CloudFront distributions, to meet your organization’s security or compliance requirements.

With this launch, you can configure TLS behavior directly on all REST API endpoint types, including Regional, edge-optimized, and private, and apply consistent TLS settings across both your APIs and their custom domain names. You can choose from predefined enhanced security policies to enforce the minimum TLS versions and cipher suites that your workloads require. For example, you can enforce TLS 1.3, use hardened TLS 1.2 without CBC ciphers, adopt FIPS-aligned suites for government workloads, or prepare for the future with policies that include post-quantum cryptography (PQC). The new security policies provide finer-grained control without adding operational complexity, helping you align your APIs with evolving security and compliance expectations.

Understanding API Gateway security policies

A security policy in API Gateway is a predefined combination of a minimum TLS version and a curated set of cipher suites. When a client connects to your REST API or custom domain name, API Gateway uses the selected policy to determine which protocol versions and ciphers it will accept during the TLS handshake. This gives you a predictable and enforceable way to control how clients establish encrypted connections to your APIs.

API Gateway supports two categories of security policies. Legacy policies, such as TLS_1_0 or TLS_1_2, remain available for backwards compatibility. Enhanced policies, identified by the SecurityPolicy_* prefix, provide stricter and more modern controls for regulated workloads, advanced governance, or cryptographic hardening. When you use an enhanced policy, you must also specify an endpoint access mode, which adds additional validation for how traffic reaches your API, as described in the following sections.

Enhanced policies follow a consistent naming patterns that helps you quickly understand what each policy enforces. For example, for REGIONAL and PRIVATE endpoint types, the following pattern applies:

SecurityPolicy_[TLS-Versions]_[Variant]_[YYYY-MM]

From this structure, you can identify the minimum TLS versions supported, any specialized cryptographic variants (such as FIPS, PFS, or PQ), and the release date of the policy. For example, SecurityPolicy_TLS13_1_3_2025_09 accepts only TLS 1.3 traffic, while SecurityPolicy_TLS13_1_2_PFS_PQ_2025_09 supports TLS 1.2 as lowest and TLS 1.3 as highest TLS version with forward secrecy and post-quantum enhancements.

Each policy maps to a curated combination of ciphers. For instance, SecurityPolicy_TLS13_1_3_2025_09 accepts only three TLS 1.3 cipher suites (TLS_AES_128_GCM_SHA256, TLS_AES_256_GCM_SHA384, and TLS_CHACHA20_POLY1305_SHA256) and rejects any other protocol versions or ciphers. For a full list of supported policies and ciphers, and naming pattern for the EDGE endpont type, see the API Gateway documentation.

How security policies apply to default endpoints and custom domains

You can use API Gateway to attach different security policies to your default API endpoint and custom domain names. During TLS negotiation, API Gateway selects the policy based on the Server Name Indication (SNI) value in the client’s TLS handshake, not the HTTP Host header. This means the policy depends on the hostname the client uses when initiating TLS.

For example, if a client connects directly to your default endpoint, such as:

https://abcdef1234.execute-api.us-east-1.amazonaws.com

API Gateway uses the policy attached to that default endpoint because the SNI value matches its hostname.

If the client instead connects through a custom domain name, such as:

https://api.example.com

API Gateway uses the policy attached to that custom domain. In this case, the SNI value api.example.com determines which policy is enforced.

This distinction is important even if you disable your default endpoint. TLS negotiation always occurs before API Gateway evaluates endpoint settings, so the default endpoint security policy still applies to clients that connect directly to its hostname. To avoid unexpected client behavior, you should keep the API and its custom domain name aligned with the same security policy whenever possible.

Understanding endpoint access mode

When you use an enhanced security policy (SecurityPolicy_*), you must also specify an endpoint access mode. Endpoint access mode defines how strictly API Gateway validates the network path a request takes before it reaches your API. This gives you an additional layer of governance and helps you prevent unauthorized or misrouted traffic.

You can choose between two modes:

  • BASIC mode provides standard API Gateway behavior. It is the recommended starting point when you migrate an existing API to an enhanced security policy. Clients can continue reaching your API as they do today, without additional validation.
  • STRICT mode adds enforcement checks to ensure that requests originate from the correct endpoint type, and TLS negotiation aligns with your configuration.

When you enable STRICT mode, API Gateway performs additional validations, such as:

  • The SNI and HTTP Host header values match
  • The request originates from the same endpoint type as your API (Regional, edge-optimized, or private)

If any of these validations fail, API Gateway rejects the request. STRICT is a viable choice when you need stronger security guarantees, such as when running regulated or sensitive workloads. See API Gateway documentation for additional details.

When you switch from BASIC to STRICT mode, it takes up to 15 minutes for the change to fully propagate. Your API remains available during this period. If your endpoint access mode is set to STRICT, you cannot change the endpoint type until you revert the mode back to BASIC.

Applying security policies to new and existing APIs

You can apply a security policy when you create a new REST API or custom domain name, or update an existing resource to use one of the enhanced SecurityPolicy_* options. When migrating existing APIs, the recommended approach is to start with BASIC mode, validate client behavior (SNI and HTTP Host header values match, request originates from the same endpoint type as your API), and then move to STRICT mode once you confirm compatibility.

The following code snippets illustrate how to apply security policies to different scenarios:

Create a REST API with a security policy and STRICT endpoint access mode

You can attach a security policy directly during API creation, removing the need for extra infrastructure just to control TLS negotiation.

aws apigateway create-rest-api \
  --name "your-private-api-name" \
  --endpoint-configuration '{"types":["PRIVATE"]}' \
  --security-policy "SecurityPolicy_TLS13_1_3_2025_09" \
  --endpoint-access-mode STRICT \
  --policy file://api-policy.json

Create a custom domain name with a security policy and STRICT endpoint access mode

You can also specify the security policy when creating a custom domain name. API Gateway applies the selected policy during TLS negotiation based on the SNI value the client provides.

aws apigateway create-domain-name \
  --domain-name api.example.com \
  --regional-certificate-arn arn:aws:acm:region:account-id:certificate/certificate-id \
  --endpoint-configuration '{"types":["REGIONAL"]}' \
  --security-policy SecurityPolicy_TLS13_1_3_2025_09 \
  --endpoint-access-mode STRICT

Updating existing REST API

If you are migrating an existing API, start by applying the enhanced security policy with BASIC mode. After confirming that your clients can connect with BASIC mode as expected, proceed to enable the STRICT mode.

1. Apply the new policy with BASIC mode

aws apigateway update-rest-api --rest-api-id abcd123 --patch-operations '[
    {
         "op": "replace",
         "path": "/securityPolicy",
         "value": "SecurityPolicy_TLS13_1_3_2025_09"
    },
    {
         "op": "replace",
         "path": "/endpointAccessMode",
         "value": "BASIC"
     }
]'

Verify your clients can consume the API as expected using access logs and performance metrics in Amazon CloudWatch.

2. Enable the STRICT mode after validation

aws apigateway update-rest-api --rest-api-id abcd123 --patch-operations '[
    {
        "op": "replace",
        "path": "/endpointAccessMode",
        "value": "STRICT"
     }
]'

Updating existing custom domain name

Custom domain names follow the same migration approach as REST APIs.

1. Apply the new policy with BASIC mode and validate clients can successfully connect.

aws apigateway update-domain-name --domain-name api.example.com --patch-operations '[
    {
        "op": "replace",
        "path": "/securityPolicy",
        "value": "SecurityPolicy_TLS13_1_3_2025_09"
    },
    {
        "op": "replace",
        "path": "/endpointAccessMode",
        "value": "BASIC"
     }
]'

2. Enable the STRICT mode after validation

aws apigateway update-domain-name --domain-name api.example.com --patch-operations '[
    {
        "op": "replace",
        "path": "/endpointAccessMode",
        "value": "STRICT"
     }
]'

After you update your REST API or custom domain configuration, redeploy your API so that stages receive the new settings. When you change a security policy, the update takes up to 15 minutes to complete. The API status appears as UPDATING while the change propagates and returns to AVAILABLE when complete. Your API remains fully functional throughout this process.

Rolling back endpoint access mode

If you notice clients failing to connect to your API after applying the STRICT mode, you can revert the endpoint access mode back to BASIC at any time. Below code snippet illustrates doing this for a REST API.

aws apigateway update-rest-api --rest-api-id abcd123 --patch-operations '[
    {
      "op": "replace",
      "path": "/endpointAccessMode",
      "value": "BASIC"
    }
  ]'

You can use the same approach to update a custom domain name.

Monitoring TLS usage and policy migrations

As you adopt enhanced security policies, it is important to understand how clients negotiate encrypted connections with your API. Monitoring helps you verify client readiness, identify legacy consumers that may require updates, and validate that STRICT endpoint access mode behaves as expected during rollout. Use the following API Gateway access logs variables to monitor protocol and cipher usage over time.

  • $context.tlsVersion – the negotiated TLS version
  • $context.cipherSuite – the cipher suite selected during the handshake

You can use these variables to confirm that:

  • Clients are using the expected minimum TLS version
  • BC-based ciphers are no longer used after you move to a hardened policy
  • PQC and FIPS-aligned policies are being exercised by the appropriate clients

Access logs are especially useful during migrations, where validating the actual client behavior is a prerequisite before enabling STRICT mode. For example, if you still observe live clients negotiating TLS 1.0 or TLS 1.2 CBC ciphers after applying a hardened policy in BASIC mode, you can identify the affected clients and plan remediation before switching to STRICT mode.

Future-proof security configurations

Some of the new policies combine TLS 1.3 with post-quantum cryptography (PQC) to help you prepare for a future where quantum-capable threat actors exist. With these policies you can start testing and adopting quantum-resistant algorithms without redesigning your API architecture.

As standards evolve and new cipher suites are introduced, API Gateway’s policy model provides you with a clear path for adding new variants while keeping your configuration simple and predictable.

Conclusion and next steps

Enhanced TLS security policies and endpoint access mode in the Amazon API Gateway gives you direct control over how clients establish secure connections to your APIs. You can choose the policies that match your compliance needs, such as PCI DSS, FIPS, Open Banking, PQC, and use STRICT mode to control how traffic reaches your endpoints and apply additional domain-level validations, further hardening security of your APIs

To get started:

  1. Review the list of available security policies in the API Gateway documentation.
  2. Identify which REST APIs and domains require stronger TLS controls.
  3. Apply an appropriate SecurityPolicy-* policy with BASIC mode.
  4. Validate client behavior using access logs and CloudWatch metrics.
  5. Move to STRICT mode when you are ready to enforce additional connection-level protection.

For more information about building Serverless architectures, see ServerlessLand.com

Improving throughput of serverless streaming workloads for Kafka

Post Syndicated from Anton Aleksandrov original https://aws.amazon.com/blogs/compute/improving-throughput-of-serverless-streaming-workloads-for-kafka/

Event-driven applications often need to process data in real-time. When you use AWS Lambda to process records from Apache Kafka topics, you frequently encounter two typical requirements: you need to process very high volumes of records in close to real-time, and you want your consumers to have the ability to scale rapidly to handle traffic spikes. Achieving both necessitates understanding how Lambda consumes Kafka streams, where the potential bottlenecks are, and how to optimize configurations for high throughput and best performance.

In this post, we discuss how to optimize Kafka processing with Lambda for both high throughput and predictable scaling. We explore the Lambda’s Kafka Event Source Mappings (ESMs) scaling, optimization techniques available during record consumption, how to use ESM Provisioned Mode for bursty workloads, and which observability metrics you need to use for performance optimization.

Overview

To start processing records from a Kafka topic with a Lambda function, whether using Amazon Managed Streaming for Apache Kafka (Amazon MSK) or a self-managed Kafka cluster, you create an ESM: a lightweight serverless resource that consumes records from Kafka topics and invokes your function.

The scaling behavior of Kafka ESMs is based on the offset lag. This is a metric indicating the number of records in the topic that have not yet been consumed by the Lambda function. This metric typically grows when producers publish new records faster than consumers process them. As the lag grows, the Lambda service gradually adds more Kafka consumers (also known as pollers) to your ESM. To preserve ordering guarantees, the maximum number of pollers is capped by the number of partitions in the topic. Lambda also scales pollers down automatically when lag decreases.

Each ESM follows a consistent polling workflow: poll -> filter -> batch -> invoke, as shown in the following diagram. Every stage has configurable options that directly affect performance, latency, and cost.


Figure 1. ESM processing workflow.

Polling: Increasing predictability with Provisioned Mode

By default, Kafka ESM uses the on-demand polling mode. In this mode, ESM starts with one poller, automatically adds more pollers when the offset lag grows, and scales the number of pollers down as lag decreases. On-demand mode does not need upfront scaling configuration and is the lowest-cost option for steady workloads. For many applications, this behavior is sufficient: scaling up can take several minutes, but the throughput eventually catches up, and you only pay for the resources you use, such as number of invocations.

However, if your workloads are bursty and latency-sensitive, then on-demand scaling may not be fast enough and can result in a rapidly growing lag. This can be addressed by switching to Provisioned Mode, which gives you more fine-grained control to configure a minimum and maximum number of always-on pollers for your Kafka ESM. These pollers remain connected even when traffic is low, so consumption begins immediately when a spike occurs, and scaling within the configured range is faster and more predictable.

The following diagram shows the performance improvements of using the ESM in Provisioned Mode for bursty workloads. You can see that in on-demand mode it took ESM over 15 minutes to eventually catch up to the new traffic volume, while in Provisioned Mode the ESM handled the traffic increase instantly.


Figure 2. Comparing Kafka ESM on-demand and Provisioned Mode.

Best practices for using Provisioned Mode:

  • Start small: Provisioned Mode is a paid capability. AWS recommends that for smaller topics (less than 10 partitions) you start with a single provisioned poller to evaluate throughput and observe workload behavior. For larger topics, you can start with a higher number of provisioned pollers to accommodate the baseline consumption. You can adjust this configuration at any time as you learn traffic patterns and refine your performance targets.
  • Estimate throughput: A single provisioned poller can process up to 5 MB/s of Kafka data. Monitor your average record size and per-record processing time to establish a baseline for minimum and maximum pollers, then validate with real workload metrics.
  • Set a low floor and flexible ceiling: Choose a minimum number of pollers that makes sure that latency targets are met when a traffic burst occurs, then allow the ESM to scale toward a higher maximum as needed.

See Low latency processing for Kafka event sources for more information.

To summarize:

  • Use Provisioned Mode for bursty traffic, strict SLOs, or when backlogs pose downstream risk.
  • Use on-demand polling mode for steady traffic, flexible latency requirements, or when minimizing cost is the primary objective.

Filtering: Drop irrelevant records early

By default, all records from Kafka are delivered to your Lambda function. This approach is direct and flexible. Your handler code decides which records to process and which to ignore. This default behavior is highly efficient for workloads where nearly all records are valuable.

When you find yourself discarding a large portion of records in your handler code, you can use native ESM filtering capabilities to drop irrelevant records before they reach your function. You can filter early to reduce cost, free up concurrency, increase throughput, and make sure that your Lambda function spends cycles on valuable work only.

The following diagram shows the application of an ESM filter to only process telemetry that meets a specified condition.


Figure 3. ESM filtering configuration.

Batching: Processing more records per invocation

You can batch multiple Kafka records together to process more data per invocation and increase the efficiency of your Lambda functions. Larger batches help you achieve higher throughput and reduce costs by making better use of each invocation run. To get the best results, you should balance batch size and latency targets and adjust the configuration based on your workload’s specific traffic patterns and SLOs.

Lambda gives you two primary controls for configuring ESM batching behavior:

  • Batch window: This is how long the ESM waits to accumulate records before invoking your function. A shorter window produces smaller batches and more frequent invocations. A longer window (up to 5 minutes) produces larger batches and less frequent invocations.
  • Batch size: This is the maximum number of records that the ESM can accumulate before invoking your function, up to 10,000.

There’s no single setting that universally works for all workloads. Your optimal configuration depends on workload characteristics such as latency tolerance and record size. AWS recommends starting with the default values and then gradually adjusting the configuration based on your requirements. For example, you can increase the batch size while monitoring function duration, error rates, and end-to-end latency.

The following diagram shows how to configure batch window and size using Terraform:


Figure 4. ESM batch window and batch size configuration with Terraform.

The ESM invokes your function when one of the following three conditions is met:

  1. The batch window elapses.
  2. The accumulated batch reaches the configured maximum batch size.
  3. The accumulated payload approaches the 6 MB maximum invocation payload limit of Lambda.

When using higher batch window values during traffic spikes, you typically see more records-per-batch and longer function invocation durations. This is normal: larger batches can take longer to process. Always interpret the Duration metric in the context of the batch size being processed.

Invoke: Process each batch faster and more efficiently

You control how quickly each batch completes through two main factors: the efficiency of your function code and the compute resources you allocate to your functions. You can improve both to process more records per second, reduce the necessary concurrency, and lower cost.

Optimize your code: Review your function handler code to identify where you can reduce work per record. For example, eliminate redundant serialization, initialize dependencies once during function startup, and consider parallel processing within the handler (where applicable). For performance-critical workloads, you can also choose languages that compile to binary, such as Go or Rust, which typically deliver high performance with lower resource usage.

Tune compute resources: Increasing the memory function allocation proportionally increases vCPU. Use the Lambda PowerTuning tool to find the memory configuration that best balances performance and cost for your workload.

Correlate metrics: As you optimize, monitor Duration and Concurrency. You should see the concurrency drop as duration improves. That correlation confirms that your changes are improving the system throughput and efficiency.

When you combine handler optimizations with early filtering and efficient batching, even small improvements can make your pipeline noticeably faster to operate under load.

Observability drives good decisions

You can’t optimize what you can’t see. To tune your data processing pipeline, use a combination of OffsetLag, function invocation metrics, and Kafka broker metrics to understand your data processing performance. OffsetLag tells you whether your function is keeping up with incoming records, as shown in the following figure. Function metrics such as Duration, Concurrency, Errors, and Throttles show how efficiently your code is processing record batches. If you use Provisioned Mode, then you can use the Provisioned Pollers metric to track the poller capacity.


Figure 5. Kafka consumption observability with Amazon CloudWatch.

Always interpret function duration in the context of batch size. During traffic spikes, you can typically observe both duration and actual batch size increase, which is expected amortization, not a regression. For alerting, monitor lag growth, unexpected drops in invocation rate, and error spikes. With these signals in place, you can detect issues early and tune your configuration with confidence.

A sample step-by-step optimization loop

  1. Establish a clean baseline: Make your handler idempotent and batch-aware, start with a short batch window and moderate batch size. Monitor your ESM and confirm offset lag stays near zero at steady state.
  2. Filter early: Move static checks (record type, version, other custom properties) into ESM filtering and verify invoked counts drop relative to polled counts, proving the filter saves cost and concurrency.
  3. Increase batch size gradually while monitoring the duration, error rates, and latency metrics. Extend the batch window slightly if spikes cause too many invocations.
  4. Speed up the handler: Increase memory for more CPU, reduce per-record I/O, remove redundant serialization, and parallelize safely inside the batch while tracking duration and concurrency metrics together.
  5. Prove spike readiness: Replay realistic surges, monitor offset lag and drain time, and enable Provisioned Mode with a small minimum if recovery takes too long, adjusting with MB/s-per-poller estimates.
  6. Implement alerting: Watch for sustained lag growth, unexpected gaps between polled and invoked, and error spikes tied to partitions or large batches. Always read metrics in context with batch size.
  7. Re-evaluate periodically: Re-measure system throughput, confirm filter effectiveness, and retune batch and memory settings regularly as workloads evolve.

Conclusion

Optimizing Kafka streams processing with AWS Lambda necessitates understanding how ESMs work and tuning consumption components: polling, filtering, batching, and invoking. Filtering redundant records early removes unnecessary work, batching helps you process more records per invocation, and handler optimizations make sure that you make the most of the compute that you allocate. Together, these adjustments let you scale efficiently and keep offset lag under control.

When your workload is bursty, use Provisioned Mode to absorb spikes without long recovery times. With the right alerts on lag, errors, and unexpected polled versus invoked behavior, you can spot problems early and adjust before they impact users. Following this optimization guide gives you a practical way to measure, tune, and revisit your setup as traffic patterns change.

To learn more about optimizing Kafka consumption, see the AWS re:Invent 2024 session about Improving throughput and monitoring of serverless streaming workloads.

To learn more about building Serverless architectures see Serverless Land.

Build scalable REST APIs using Amazon API Gateway private integration with Application Load Balancer

Post Syndicated from Christian Silva original https://aws.amazon.com/blogs/compute/build-scalable-rest-apis-using-amazon-api-gateway-private-integration-with-application-load-balancer/

This post is written by Vijay Menon, Principal Solutions Architect, and Christian Silva, Senior Solutions Architect.

Today, we announced Amazon API Gateway REST API’s support for private integration with Application Load Balancers (ALBs). You can use this new capability to securely expose your VPC-based applications through your REST APIs without exposing your ALBs to the public internet.

Prior to this launch, if you wanted to connect API Gateway to private ALBs, you would have had to use a Network Load Balancer (NLB) as an intermediary, increasing cost and complexity. Now, you can directly integrate API Gateway with private ALBs without requiring an NLB, reducing operational overhead and optimizing cost.

Previous architecture: Connecting API Gateway to private ALBs

Before this launch, API Gateway REST APIs connect to private ALB resources through an NLB positioned in front of the ALB. Many customers have successfully built and operated production workloads using this architecture, demonstrating its reliability for business-critical applications. The following architecture demonstrates this setup.

Figure 1. Previous architecture: API Gateway to private ALB via intermediary NLB

In response to customer feedback for a simplified architecture and reduced costs, we’ve extended VPC link v2 support to REST APIs. This feature now enables direct private ALB integration for REST APIs, eliminating the need for an intermediary NLB.

New architecture: Connecting API Gateway to private ALBs

With direct private ALB integration, this architecture becomes simpler and more efficient. The integration removes the need for an intermediate NLB, reducing the number of hops between client and your services. This streamlined setup simplifies the architecture for applications, allowing more efficient use of ALB’s layer-7 load-balancing capabilities, authentication, and authorization features. While these ALB features were technically accessible before, the new architecture removes the overhead and complexity of managing an additional NLB. Here’s how the simplified architecture looks now:

Figure 2. Direct integration between API Gateway and private ALB

Benefits of a direct integration between your API Gateway endpoint and your private ALB

  • Architectural simplification and operational excellence: Now that your API Gateway can directly connect to your private ALB, you no longer need an NLB to act as a bridge between your API Gateway and your private ALB. This eliminates the need to provision, configure, manage, or monitor an intermediate load balancer. The reduction in infrastructure components translates to reduced operational overhead and fewer potential failure points. Traffic flows directly from API Gateway to your ALB within the Amazon Web Services (AWS) network, reducing network hops and latency.
  • Improved scalability: VPC link v2 supports a one-to-many relationship with load balancers. A single VPC link v2 allows API Gateway to integrate with multiple ALBs or NLBs within your VPC. This architectural advantage is particularly valuable for organizations managing complex applications with multiple microservices, each potentially behind its own ALB, or those running numerous APIs. The ability to consolidate multiple load balancer connections through a single VPC link not only reduces administrative overhead but also provides greater flexibility in scaling your architecture. As your application grows and you add more services or load balancers, you won’t need to provision additional VPC links, making it easier to expand your infrastructure while maintaining operational efficiency.
  • Cost optimization: You can remove the NLB from your architecture and thereby eliminate both the hourly charges for running the NLB and the associated Network Load Balancer Capacity Units (NLCU) used per hour. For organizations running multiple environments or numerous APIs, these savings can accumulate to thousands of dollars annually. Moreover, your data transfer patterns become more efficient. Traffic flows directly from API Gateway to your ALB within the AWS network, which avoids any unnecessary hops that could incur more data transfer charges. This streamlined path not only reduces costs but also improves performance by minimizing network latency.

Getting started

This tutorial demonstrates the setup using both the AWS Management Console and AWS Command Line Interface (AWS CLI). Before you begin, make sure that you have an internal ALB configured in your VPC. For resources that need naming, use appropriate names for your environment.

Step 1: Create a VPC link v2
The first step in our process is to create a VPC link v2, which will enable API Gateway to route traffic to your internal ALB. Here’s how to set it up:

  1. Navigate to the API Gateway console.
  2. In the left navigation pane, choose VPC links.
  3. Choose Create VPC link.
  4. Choose VPC link v2 as the VPC link type.
  5. Provide a descriptive name for your VPC link.
  6. Choose your VPC and subnets where your ALB resides. For high availability, choose subnets in multiple AWS Availability Zones (AZs) that match your ALB configuration.
  7. Assign one or more security groups to your VPC link. These security groups will control the traffic flow between API Gateway and your VPC.
  8. Choose Create and wait for the VPC link status to become Available. This process can take a few minutes.

Alternatively, you can create a VPC link v2 using the AWS CLI:

# Create VPC link v2
aws apigatewayv2 create-vpc-link \
    --name "test-vpc-link-v2" \
    --subnet-ids "<your-subnet1-id>" "<your-subnet2-id>" \
    --security-group-ids "<your-security-group-id>" \
    --region <your-AWS-region>

# Check VPC link v2 status
aws apigatewayv2 get-vpc-link \
    --vpc-link-id "<your-vpc-link-v2-id>" \
    --region <your-AWS-region>

Step 2: Create a REST API and configure integration
With your VPC link v2 now available, the next step is to create a REST API and configure it to use the VPC Link. This process involves creating the API, setting up resources and methods, and configuring the integration with your internal ALB.

  1. In the API Gateway console, choose Create API.
  2. Choose REST API.
  3. Enter an API name and choose Create API.
  4. Create a new resource by choosing Actions, then choose Create resource. This resource will represent the endpoint for your API.
  5. Create a method by choosing Actions, then choose Create method. The method defines the type of request your API will accept (GET, POST, etc.).
  6. Now, configure the integration. This is where you’ll connect your API to your internal ALB via the VPC link v2:
    1. Choose VPC link as the integration type.
    2. Choose the HTTP method for your backend integration.
    3. Choose your newly created VPC link v2.
    4. Specify your ALB as the Integration target.
    5. Enter the endpoint URL for your integration. The port specified in the URL is used to route requests to the backend.
    6. Set the Integration timeout.

Using the AWS CLI:

# Create REST API
aws apigateway create-rest-api \
    --name "test-rest-api" \
    --description "REST API integration with internal ALB via VPC link v2" \
    --region <your-AWS-region>

# Get REST API’s root resource ID
aws apigateway get-resources \
    --rest-api-id "<your-rest-api-id>" \
    --region <your-AWS-region>

# Create a new resource
aws apigateway create-resource \
    --rest-api-id "<your-rest-api-id>" \
    --parent-id "<your-parent-id>" \
    --path-part "internal-alb" \ 
    --region <your-AWS-region>

# Create a new method
aws apigateway put-method \
    --rest-api-id "<your-rest-api-id>" \
    --resource-id "<your-resource-id>" \
    --http-method ANY \
    --authorization-type NONE \
    --region <your-AWS-region>

# Create the integration
aws apigateway put-integration \
    --rest-api-id "<your-rest-api-id>" \
    --resource-id "<your-resource-id>" \
    --http-method ANY \
    --type HTTP_PROXY \
    --integration-http-method ANY \
    --uri "http://test-internal-alb.com/test" \
    --connection-type VPC_LINK \
    --connection-id "<your-vpc-link-v2-id>" \
    --integration-target "<your-ALB-arn>" \
    --region <your-AWS-region>

Step 3: Deploy and test
With your API configured, it’s time to deploy it and verify that it’s working correctly.

  1. Choose Deploy API to create a new deployment of your API.
  2. Create a new stage (for example “test”). Stages allow you to manage multiple versions of your API.
  3. After deployment, you’ll receive an API endpoint URL. Copy this URL as you’ll need it for testing.

Test your API using your preferred API client or a simple curl command.

Using the AWS CLI:

# Create a new deployment to a test stage
aws apigateway create-deployment \
    --rest-api-id "<your-rest-api-id>" \
    --stage-name "test" \
    --region <your-AWS-region>

Test your API integration using a curl command:

curl https://<rest-api-id>.execute-api.<your-aws-region>.amazonaws.com/internal-alb
{"message": "Hello from internal ALB"}

Step 4: Scale your VPC link v2
A single VPC link can now connect to multiple ALBs or NLBs within your VPC, simplifying infrastructure management. This AWS CLI snippet demonstrates API Gateway integrating with multiple internal services, for example orders and payments services, each behind its own ALB, using a single VPC link v2. Note how the same VPC link ID is used across both integrations.

# Orders service integration (ALB-1)
aws apigateway put-integration \
    --rest-api-id "<your-rest-api-id>" \
    --resource-id "<orders-resource-id>" \
    --http-method ANY \
    --type HTTP_PROXY \
    --integration-http-method ANY \
    --uri "<your-orders-alb-endpoint>" \
    --connection-type VPC_LINK \
    --connection-id "<your-vpc-link-v2-id>" \
    --integration-target "<your-orders-alb-arn>" \
    --region "<your-aws-region>"

# Payments service integration (ALB-2)
aws apigateway put-integration \
    --rest-api-id "<your-rest-api-id>" \
    --resource-id "<payments-resource-id>" \
    --http-method ANY \
    --type HTTP_PROXY \
    --integration-http-method ANY \
    --uri "<your-payments-alb-endpoint>" \
    --connection-type VPC_LINK \
    --connection-id "<your-vpc-link-v2-id>" \
    --integration-target "<your-payments-alb-arn>" \
    --region "<your-aws-region>"

For a detailed, step-by-step guide, please see our official documentation in the API Gateway Developer Guide.

Use cases

Private ALB integration with API Gateway enables architectural patterns that solve enterprise challenges. These are three key scenarios where organizations can use this new capability:

  • Microservices on Amazon ECS and Amazon EKS: Exposing microservices running on Amazon ECS or Amazon EKS becomes simpler with this integration. It allows secure, path-based routing to different services without exposing your ALB to the public internet or using complex NLB proxy patterns.
  • Hybrid cloud architectures: Seamless and secure connectivity between cloud-native APIs and on-premises resources is achieved via AWS Direct Connect or AWS Site-to-Site VPN. This setup allows flexible routing based on HTTP methods and headers to various internal systems.
  • Enterprise modernization: Gradual application modernization is facilitated by enabling phased migration from monolithic architectures to microservices. Organizations can route traffic between legacy and new components while maintaining operational continuity and minimizing risk.

Conclusion

Direct private integration between API Gateway REST APIs and ALBs enhances API architecture on AWS. By simplifying infrastructure and reducing operational overhead, this capability improves performance and efficiency for API-driven applications.

This feature is available today in all AWS Regions where VPC link v2 and ALBs are present. We can’t wait to see what you build with it and how it transforms your API architectures. Get started now by visiting the API Gateway console and creating your first VPC link v2 for direct ALB integration.

For more information, visit the API Gateway product page, review our pricing details, and explore the comprehensive developer documentation to learn about all the powerful features available to help you build world-class APIs on AWS.

Building multi-tenant SaaS applications with AWS Lambda’s new tenant isolation mode

Post Syndicated from Anton Aleksandrov original https://aws.amazon.com/blogs/compute/building-multi-tenant-saas-applications-with-aws-lambdas-new-tenant-isolation-mode/

Today, AWS announced a new tenant isolation mode for AWS Lambda, that allows you to process function invocations in separate execution environments for each application end-user or tenant invoking your Lambda function. This capability simplifies building secure multi-tenant SaaS applications by managing tenant-level compute environment isolation and request routing for you. As a result, you can focus on your core business logic rather than implementing your own tenant-aware compute environment isolation.

Overview

Lambda runs your function code in secure execution environments that leverage Firecracker virtualization to provide isolation. These execution environments never share or reuse virtual resources (such as vCPU, disk, or memory) across functions, or even across different versions of the same function. However, Lambda can reuse execution environments for multiple invocations of the same function version, as these execution environments are fully set-up and can therefore deliver faster request processing for your functions.

Figure 1. Incoming invocations processed by a collection of execution environments that belong to a single function.

Figure 1. Incoming invocations processed by a collection of execution environments that belong to a single function.

Multi-tenant SaaS applications that handle sensitive tenant-specific data or execute code supplied dynamically by tenants may need a higher degree of isolation—at the individual application tenant level rather than at the function level—for secure code execution and to reduce the risk of cross-tenant data access.

Prior to today’s launch, developers would implement custom solutions, such as SDKs or application logic to manage isolation within function code. This approach was bug-prone, required more work from application development teams, and didn’t ensure isolation at the compute environment level.

Alternatively, developers adopted the approach of creating separate functions per application tenant, replicating the same code across hundreds or thousands of tenants. This approach provided stronger compute environment isolation than sharing compute environments across multiple tenants of the same function, but increased implementation overhead and operational complexity as workloads grew to support a larger number of tenants over time.

Figure 2. Using function-per-tenant model, each tenant’s requests are processed by a separate function.

Figure 2. Using function-per-tenant model, each tenant’s requests are processed by a separate function.

Starting today, AWS Lambda offers a new tenant isolation mode that lets you isolate execution environments used across different tenants of your multi-tenant SaaS applications, even when all of the tenants invoke the same function. When you enable the new tenant isolation mode, you include a tenant identifier with each function invocation. Lambda uses this identifier to route the request to the correct execution environment. As a result, each execution environment is reused only for invocations from the same tenant. This means you still get the performance benefits of warm execution environments, while ensuring that each tenant’s workloads remain isolated.

Figure 3. With the new tenant isolation capability, Lambda creates separate execution environments per tenant for a single function.

Figure 3. With the new tenant isolation capability, Lambda creates separate execution environments per tenant for a single function.

For organizations handling sensitive tenant-specific data or running untrusted code supplied dynamically by end-users, Lambda’s new tenant isolation mode provides the security benefits of per-tenant compute environment separation without the operational complexity of managing individual functions or infrastructure for each tenant.

Example scenario

Consider building a multi-tenant serverless SaaS application. To optimize performance, your function handler can retrieve tenant-specific configuration and data, cache it in memory, and reuse it for subsequent invocations from the same tenant. For example, you might cache tenant-specific database location, feature flags, or business rules that are frequently accessed during request processing. You may store this information within the application runtime process as global variables or as files in the /tmp directory. However, if the underlying execution environment is used to serve multiple tenants, this approach can potentially expose data across tenants.

With tenant isolation mode you can address this risk with much simpler architecture and configuration. This built-in capability makes Lambda an excellent choice for multi-tenant SaaS applications needing isolated compute environments for individual tenants.

Getting Started with Lambda Tenant Isolation Mode

Use the new tenancy-config parameter to configure tenant isolation mode when you create your function. You can only apply this configuration at function creation time; it cannot be updated for existing functions. The following snippet creates a function with tenancy config using the AWS CLI.

aws lambda create-function \
   --function-name my-function1 \
   --runtime nodejs22.x \
   --zip-file fileb://my-function1.zip \
   --handler index.handler \
   --role arn:aws:iam:1234567890:role/my-function-role \
   --tenancy-config '{"TenantIsolationMode": "PER_TENANT"}'

After the function is created, you must provide the tenant ID parameter with each invocation. Lambda uses this identifier to ensure that the execution environment used for a particular tenant is never reused for other tenants. For subsequent invocations from the same tenant, Lambda may reuse the execution environment to optimize performance. Specify this tenant-id parameter as illustrated below:

aws lambda invoke \
   --function-name my-function \
   --tenant-id BlueTenant \
   response.json

The new tenant-id parameter is required for functions using the tenant isolation mode. Function invocations omitting this parameter will fail with an invocation error, as shown below:

aws lambda invoke --function-name multitenant-function out.json

An error occurred (InvalidParameterValueException) when calling the Invoke operation:
The invoked function is enabled with tenancy configuration. 
Add a valid tenant ID in your request and try again.

Lambda makes the tenant ID parameter available through your function handler’s context object. This allows you to access tenant-specific information in your code, for example if you wish to implement custom logic based on the tenant identity, as shown below:

exports.handler = async function (event, context) {
   const tenantId = context.tenantId;

   // Process tenant-specific logic

   return {
      statusCode: 200,
      body: `OK for tenantId=${tenantId}`
   };
};

The following table outlines differences between Lambda functions with and without tenant isolation mode enabled:

Feature Without the new
tenant isolation mode
With the new
tenant isolation mode
Execution environment isolation Isolated per function version. Isolated per end-user or tenant invoking a function version.
Execution environment reuse Can be reused to process all invocations of a function version. Can only be reused to process invocations from the same tenant invoking a function version.
Data stored on local disk and in-memory Potentially accessible across all invocations of a function version. Potentially accessible across invocations from the same tenant. Not accessible for invocations from other tenants.
Cold starts Occur when there are no warm execution environments available to process incoming invocation. Occur when there are no tenant-specific warm execution environments available to process incoming invocation. More cold starts expected due to tenant-specific execution environments.

Integrating with Amazon API Gateway

Amazon API Gateway uses Lambda’s Invoke API to invoke Lambda functions. When using the Invoke API, Lambda expects the tenant ID parameter to be passed using the X-Amz-Tenant-Id HTTP header. You can configure API Gateway to inject this HTTP header into the Lambda invocation request with a value obtained from client request properties such as HTTP header, query parameter, or path parameter. When using Lambda Authorizers, you can obtain the value from authorization context information returned by the authorizer, such as principal ID or JWT claim. See API Gateway documentation to learn how you can return authorization information from Lambda authorizers to be used for the X-Amz-Tenant-Id header value.

Figure 4. Obtaining X-Amz-Tenant-Id header value from authentication sources.

Figure 4. Obtaining X-Amz-Tenant-Id header value from authentication sources.

The following screenshot illustrates API Gateway Lambda integration configuration, where the incoming request to API Gateway includes an x-tenant-id header that is mapped to the X-Amz-Tenant-Id request header to invoke a Lambda function using tenant isolation mode.

Figure 5. Mapping client request header to Lambda tenant-id header.

Figure 5. Mapping client request header to Lambda tenant-id header.

The following code snippet illustrates this configuration implemented with the AWS CDK.

const lambdaIntegration = new ApiGw.LambdaIntegration(fn, {
   requestParameters: {
      // This configures API Gateway to inject X-Amz-Tenant-Id header
      // into downstream requests. The header value is obtained from 
      // x-tenant-id header in the client request.
      'integration.request.header.X-Amz-Tenant-Id': 'method.request.header.x-tenant-id'
   }
});

resource.addMethod('GET', lambdaIntegration, {
   requestParameters: {
      // This enables API Gateway to use the x-tenant-id header value 
      // obtained from the client request. The header name is arbitrary.
      // you can use any other header name. 
      'method.request.header.x-tenant-id': true
   }
});

Tenant-aware observability

For functions using tenant isolation, Lambda automatically includes the tenant ID in function logs when you have JSON logging enabled, making it easier to monitor and debug tenant-specific issues. Note that the tenantId property is available during function invocation, rather than during function initialization. The tenantId property is included for both platform events (like platform.start and platform.report) and custom logs you print in your function code, as shown in the following screenshot:

Figure 6. Lambda function logs with tenantId.

Figure 6. Lambda function logs with tenantId.

Lambda creates a separate CloudWatch log stream for each execution environment. You can use CloudWatch Log Insights to find log streams that belong to a particular tenant by filtering by tenant Id:

fields @logStream, @message
| filter tenantId=='BlueTenant' or record.tenantId=='BlueTenant'
| stats count() as logCount by @logStream
| sort @timestamp desc

You can also retrieve tenant-specific logs across all log streams:

fields @message
| filter tenantId=='BlueTenant' or record.tenantId=='BlueTenant'
| limit 1000

Each log stream starts with function initialization logs followed by the invocation logs. This structure helps you to debug tenant-specific issues and understand the lifecycle of each tenant’s execution environments.

Considerations

When using the new tenant isolation for Lambda functions, consider the following:

  • Each tenant’s execution environments are isolated from other tenants so that tenant-specific data stored on disk or in memory remain separated from other tenants invoking the same Lambda function.
  • All tenants share the function’s execution role. For more fine-grained permissions for individual tenants, consider propagating tenant-scoped credentials from the upstream application components invoking your Lambda function.
  • Your application may experience higher percentage of cold starts, as Lambda processes requests in separate execution environments for each tenant invoking your functions.
  • You pay a fee for each new tenant-specific execution environment created, depending on the memory configured for your function. See Lambda pricing page for details.

Best practices

When using the new tenant isolation mode for Lambda functions, AWS recommends the following best practices:

  • Implement robust tenant ID validation at the application layer to prevent unauthorized access through tenant ID manipulation. Consider using a dedicated service or database to maintain valid tenant IDs.
  • Monitor and audit tenant access patterns regularly to detect potential security anomalies or unauthorized cross-tenant access attempts.
  • Be aware of Lambda concurrency quotas when building multi-tenant applications. You might need to request quota increases based on your tenant count and usage patterns.

Sample code

Follow the instructions in this GitHub repository to provision a sample project in your own account and see the new Lambda tenant isolation mode in action. The sample project illustrates how to integrate a function using the new tenant isolation mode with Amazon API Gateway and propagate tenant identity from client requests.

Conclusion

The new tenant isolation mode for Lambda simplifies building serverless multi-tenant SaaS applications on AWS. By automatically managing application tenant-level compute environment isolation, this capability eliminates the need for custom isolation logic or separate tenant functions, allowing you to focus on the core business logic while AWS handles the complexities of tenant-aware compute environment isolation.

Combined with the existing security features in Lambda, rapid scaling, and pay-per-use pricing, tenant isolation mode makes Lambda an even more compelling choice for modern SaaS applications, whether you’re building new solutions or enhancing existing ones.

To learn more, refer to the documentation for tenant isolation. For details on pricing, refer to Lambda’s pricing page.

Build priority-based message processing with Amazon MQ and AWS App Runner

Post Syndicated from Aritra Nag original https://aws.amazon.com/blogs/architecture/build-priority-based-message-processing-with-amazon-mq-and-aws-app-runner/

Organizations need message processing systems that can prioritize critical business operations while handling routine tasks efficiently. When handling time-sensitive tasks like rush orders from key customers, critical system alerts, or multi-step business processes, you need to prioritize urgent messages while making sure other routine requests are processed reliably.

In this post, we show you how to build a priority-based message processing system using Amazon MQ for priority queuing, Amazon DynamoDB for data persistence, and AWS App Runner for serverless compute. We demonstrate how to implement application-level delays that high-priority messages can bypass, create real-time UIs with WebSocket connections, and configure dual-layer retry mechanisms for maximum reliability.

This solution addresses three critical challenges in modern data processing systems:

  • Implementing configurable delay processing at the application level
  • Supporting priority-based message routing that respects business requirements
  • Providing real-time feedback to users through WebSocket connections

The use of AWS managed services reduces operational complexity, so teams can focus on business logic rather than infrastructure management. Message handling with priority-based processing makes sure operations receive attention while routine tasks are processed in the background. Users will experience status updates that provide visibility into their requests, while retry mechanisms provide reliability during failures. The infrastructure as code (IaC) approach supports deployments across different environments, from development through production.

Solution overview

The solution consists of several AWS managed services to create a serverless, priority-based message processing system with real-time user feedback. The architecture implements intelligent routing based on three message priority levels, to make sure critical messages receive immediate processing:

  • High-priority path – Messages bypass delays and queue immediately with JMS priority 9
  • Standard-priority path – Messages undergo configured delays before queuing with JMS priority 4
  • Low-priority path – Messages process after all higher priority messages with JMS priority 0

The following diagram illustrates this architecture.

The solution uses the following AWS managed services to deliver a scalable, serverless architecture:

  • AWS App Runner is a fully managed container application service that automatically builds, deploys, and scales containerized applications. It provides automatic scaling based on traffic, built-in load balancing and HTTPS, seamless integration with container registries, and zero infrastructure management overhead.
  • Amazon MQ is a managed message broker service for Apache ActiveMQ that offers priority-based message queuing, automatic failover for high availability, message persistence and durability, and JMS protocol support for enterprise applications.
  • Amazon DynamoDB is a fully managed NoSQL database service providing single-digit millisecond performance at any scale, automatic scaling with on-demand pricing, built-in security and backup capabilities, and global tables for multi-Region deployments.

The system uses JMS priority levels with High=9, Medium=4, and Low=0 for automatic ordering, combined with conditional delay processing based on priority classification. Amazon MQ provides reliable message delivery and persistence with dead-letter queue (DLQ) configuration for failed message handling.

Asynchronous delay processing uses CompletableFuture implementation for non-blocking delays, thread pool management for concurrent processing, graceful error handling with retry mechanisms, and configurable delay periods per message type to optimize resource utilization. For real-time status updates, the solution provides WebSocket connections for bidirectional communication, Amazon DynamoDB Streams for change data capture (CDC), comprehensive status tracking throughout the processing lifecycle, and a React frontend integration for live updates, so users have complete visibility into their message processing status.

The standard priority messaging flow (shown in the following diagram) handles messages with configurable delays using JMS asynchronous processing capabilities. Messages wait for their specified delay period before entering the Amazon MQ queue, where they’re processed.

The high-priority messaging flow (shown in the following diagram) provides an express lane for critical messages. These messages skip the delay mechanism entirely and proceed directly to the queue, providing immediate processing for time-sensitive operations.

To make it even more straightforward to get started, we’ve prepared an example application that you can use to observe the Amazon MQ behavior with varying message volumes. You can find the source code repository, IaC implementation, and instructions to run the sample on GitHub.

In the following sections, we walk you through deploying the complete processing system.

Prerequisites

Make sure you have the following tools, permissions, and knowledge to successfully deploy the priority-based message processing system. You must have an active AWS account with the following configurations:

# JSON
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
"apprunner:CreateService",
"apprunner:UpdateService",
"apprunner:DeleteService"
      ],
      "Resource": "arn:aws:apprunner:*:*:service/reactive-demo-*"
    },
    {
      "Effect": "Allow",
      "Action": [
"mq:SendMessage",
"mq:ReceiveMessage",
"mq:DeleteMessage"
      ],
      "Resource": "arn:aws:mq:*:*:broker/reactive-demo-broker/*"
    },
    {
      "Effect": "Allow",
      "Action": [
"dynamodb:PutItem",
"dynamodb:GetItem",
"dynamodb:UpdateItem",
"dynamodb:Query"
      ],
      "Resource": "arn:aws:dynamodb:*:*:table/reactive-items*"
    }
  ]
}

Install and configure the following development tools on your local machine:

To successfully implement this solution, you should have basic familiarity with the following:

  • Spring Boot applications
  • Message queue concepts
  • WebSocket protocols
  • React development

Configure the infrastructure stack

This step involves creating the core AWS services using the AWS Cloud Development Kit (AWS CDK). This modular approach enables independent stack management and environment-specific configurations.

  1. Create a new AWS CDK project:
# Bash
mkdir priority-processing && cd priority-processing
cdk init app --language python
pip install aws-cdk-lib constructs
  1. Create the infrastructure stack:
# Python
from aws_cdk import (
    Stack,
    aws_dynamodb as dynamodb,
    aws_amazonmq as mq,
    aws_kms as kms,
    Duration,
    RemovalPolicy,
    CfnOutput
)
from constructs import Construct

class MessageProcessingStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
super().__init__(scope, construct_id, **kwargs)

# Create KMS key for encryption
self.kms_key = kms.Key(
    self, "ProcessingKey",
    description="Key for message processing encryption",
    enable_key_rotation=True
)

# DynamoDB table with comprehensive configuration
self.items_table = dynamodb.Table(
    self, "ItemsTable",
    table_name="reactive-items",
    partition_key=dynamodb.Attribute(
name="id",
type=dynamodb.AttributeType.STRING
    ),
    stream=dynamodb.StreamViewType.NEW_AND_OLD_IMAGES,
    billing_mode=dynamodb.BillingMode.ON_DEMAND,
    encryption=dynamodb.TableEncryption.CUSTOMER_MANAGED,
    encryption_key=self.kms_key,
    point_in_time_recovery=True,
    removal_policy=RemovalPolicy.DESTROY
)

# Add Global Secondary Index for status queries
self.items_table.add_global_secondary_index(
    index_name="StatusIndex",
    partition_key=dynamodb.Attribute(
name="status",
type=dynamodb.AttributeType.STRING
    ),
    sort_key=dynamodb.Attribute(
name="createdAt",
type=dynamodb.AttributeType.STRING
    )
)

# Amazon MQ broker configuration
self.mq_broker = mq.CfnBroker(
    self, "MessageBroker",
    broker_name="reactive-demo-broker",
    engine_type="ACTIVEMQ",
    engine_version="5.18",
    host_instance_type="mq.t3.micro",
    deployment_mode="SINGLE_INSTANCE",
    publicly_accessible=False,
    logs=mq.CfnBroker.LogListProperty(
audit=True,
general=True
    ),
    encryption_options=mq.CfnBroker.EncryptionOptionsProperty(
use_aws_owned_key=False,
kms_key_id=self.kms_key.key_id
    ),
    users=[mq.CfnBroker.UserProperty(
username="admin",
password="SecurePassword123!",
console_access=True
    )]
)

# Output values for application configuration
CfnOutput(self, "TableName", 
    value=self.items_table.table_name,
    description="DynamoDB table name")
CfnOutput(self, "MQBrokerEndpoint",
    value=self.mq_broker.attr_amqp_endpoints[0],
    description="Amazon MQ broker endpoint")
  1. Run the following commands to deploy the stack:
# Bash
cdk bootstrap
cdk deploy MessageProcessingStack

You can verify the infrastructure on the AWS Management Console.

Configure the message processing application

In this step, we create the Spring Boot application with priority-based message processing capabilities. First, we configure the application.properties file to incorporate environment variables, including AWS credentials, AWS Regions, and other configuration parameters such as log levels into the application and business logic implementation. Next, we implement the message service using a JMS template with comprehensive error handling, followed by enhancing the JMS configuration with connection pooling for improved performance.

The following code illustrates an example message service implementation:

// Example message service implementation
@Service
public class MessageService {
    @Autowired
    private JmsTemplate jmsTemplate;
    
    public void sendPriorityMessage(Message message) {
jmsTemplate.send(session -> {
    Message jmsMessage = session.createTextMessage(message.getContent());
    jmsMessage.setJMSPriority(message.getPriority());
    return jmsMessage;
});
    }
}

For proper timestamp update implementation, we integrate the DynamoDB SDK service with caching capabilities. Finally, after implementing the REST controller for the API with asynchronous processing support, we can deploy the message processing application. This implementation includes Java code application-level delay processing for demonstration purposes. Although this approach effectively showcases the priority-based message routing capabilities and real-time WebSocket updates in our demo environment, AWS recommends using Amazon MQ delay processing features for production workloads. For production implementations, use Amazon MQ delay and scheduling capabilities instead of application-level delays through features like Amazon MQ delay queues, ActiveMQ scheduling features, and appropriate message Time-to-Live (TTL) configurations.

The following code is an example snippet showcasing the Amazon MQ feature:

// Create connection factory with Amazon MQ endpoint
ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory(brokerUrl);
factory.setUserName("admin");
factory.setPassword("your-password");
try (Connection connection = factory.createConnection();
     Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE)) {
    
    // Create destination and producer
    Destination destination = session.createQueue(queueName);
    MessageProducer producer = session.createProducer(destination);
    
    // Create message
    TextMessage message = session.createTextMessage(messageContent);
    
    // Set native delay using ActiveMQ scheduled delivery
    message.setLongProperty(ScheduledMessage.AMQ_SCHEDULED_DELAY, delayMillis);
    
    // Optionally set priority for delayed message
    message.setJMSPriority(4);
    
    // Send the message - it will be delivered after the specified delay
    producer.send(message);
}

Build and deploy the Spring Boot application to App Runner

In this step, we push the application to Amazon Elastic Container Registry (Amazon ECR) to run it in App Runner:

  1. Build and push the Docker image to Amazon ECR:
# Bash

# Build the Docker image
docker build -t reactive-demo .

# Create ECR repository
aws ecr create-repository --repository-name reactive-demo --region us-east-1

# Get login token and login to ECR
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin $ECR_URI

# Tag and push image
ECR_URI=$(aws ecr describe-repositories --repository-names reactive-demo --query 'repositories[0].repositoryUri' --output text)
docker tag reactive-demo:latest $ECR_URI:latest
docker push $ECR_URI:latest
  1. Create the App Runner service with environment variables for the DynamoDB table and Amazon MQ broker endpoint:
# Python

from aws_cdk import (
    aws_apprunner as apprunner,
    aws_iam as iam
)

class AppRunnerStack(Stack):
    def __init__(self, scope: Construct, id: str, 
 table_name: str, mq_endpoint: str, **kwargs):
super().__init__(scope, id, **kwargs)

# Create IAM role for App Runner
app_runner_role = iam.Role(
    self, "AppRunnerRole",
    assumed_by=iam.ServicePrincipal("tasks.apprunner.amazonaws.com"),
    managed_policies=[
iam.ManagedPolicy.from_aws_managed_policy_name(
    "AmazonDynamoDBFullAccess"
),
iam.ManagedPolicy.from_aws_managed_policy_name(
    "AmazonMQFullAccess"
)
    ]
)

# Create App Runner service
self.service = apprunner.CfnService(
    self, "ReactiveProcessingService",
    service_name="reactive-processing-service",
    source_configuration=apprunner.CfnService.SourceConfigurationProperty(
authentication_configuration=apprunner.CfnService.AuthenticationConfigurationProperty(
    access_role_arn=app_runner_role.role_arn
),
image_repository=apprunner.CfnService.ImageRepositoryProperty(
    image_identifier=f"{ECR_URI}:latest",
    image_configuration=apprunner.CfnService.ImageConfigurationProperty(
port="8080",
runtime_environment_variables=[
    {"name": "DYNAMODB_TABLE_NAME", "value": table_name},
    {"name": "MQ_BROKER_URL", "value": mq_endpoint}
]
    ),
    image_repository_type="ECR"
)
    ),
    health_check_configuration=apprunner.CfnService.HealthCheckConfigurationProperty(
path="/actuator/health",
protocol="HTTP",
interval=10,
timeout=5,
healthy_threshold=1,
unhealthy_threshold=5
    ),
    instance_configuration=apprunner.CfnService.InstanceConfigurationProperty(
cpu="0.5 vCPU",
memory="1 GB"
    )
)

Set up real-time updates

For this step, we implement WebSocket support for real-time status updates using AWS Lambda to process DynamoDB streams and send updates to connected clients using Amazon API Gateway WebSocket connections. You can find the code snippet for this in this link

Deploy the React application to Amazon S3 and Amazon CloudFront

In this step, we create a frontend application to enable the WebSocket connection for seeing the messaging getting updated in the DynamoDB and API Gateway WebSocket connections.

Similar to the above section, here is the AWS cdk code for building the frontend for proceeding towards the validation of the solution

Validate the solution

This section provides comprehensive testing procedures to validate the priority-based message processing system.

Automated testing script

After you have completed the preceding steps, you can initiate a comprehensive testing script to validate priority processing and delay behavior:

# Bash
#!/bin/bash
curl -X POST "$API_URL/api/items" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "High Priority Task",
    "priority": "High",
    "delay": 10
  }'

Validation through the web interface

The following screenshot of the UI illustrates how the queueing mechanism can work with the real-time updates using WebSockets.

The web interface provides validation of the priority-based message processing system. Access the Amazon CloudFront URL to view the following information:

  • Real-time message processing with live status updates
  • Queue statistics showing message distribution by priority
  • Processing timeline demonstrating priority bypass behavior
  • WebSocket connection status indicating real-time connectivity

Amazon CloudWatch dashboards and alarms

AWS recommends creating Amazon CloudWatch dashboards to track your priority-based message processing system’s performance across multiple dimensions. Monitor message processing by priority levels to make sure high-priority messages are processed first and identify any bottlenecks in your priority routing logic. The following screenshot shows an example dashboard.

You can track queue depth and processing times to understand system load and latency patterns, helping you optimize resource allocation and identify when scaling is needed. Observe DynamoDB performance metrics including read/write capacity consumption, throttling events, and latency to make sure your database layer maintains optimal performance under varying loads.

Additionally, implement application-specific custom metrics such as message processing success rates, retry counts, and business-specific KPIs to gain deeper insights into your application’s behavior and make data-driven decisions for continuous improvement.

Security considerations

AWS recommends implementing comprehensive security measures to safeguard your message processing system. Start by implementing least privilege IAM policies that grant only the minimum permissions required for each component to function, making sure services like App Runner can only access the specific DynamoDB tables and Amazon MQ queues they need. Configure your network architecture using a virtual private cloud (VPC) with private subnets for Amazon MQ, isolating your message broker from direct internet access while maintaining connectivity through NAT gateways for necessary outbound connections.

Enable encryption at rest using AWS Key Management Service (AWS KMS) for DynamoDB tables and Amazon MQ data and enforce encryption in transit by configuring SSL/TLS connections for all service communications, particularly for ActiveMQ broker connections. Finally, configure security groups with minimal access rules that explicitly define allowed traffic between components, restricting inbound connections to only the ports and protocols required for your application to function, such as port 61617 for ActiveMQ SSL connections from App Runner instances.

Cost considerations

The following table contains cost estimates based on the US East (N. Virginia) Region. Actual costs might vary based on your Region, usage patterns, and pricing changes.

Service Small (1,000 msg/day) Medium (10,000 msg/day) Large (100,000 msg/day)
Amazon DynamoDB $5–10 $25–50 $200–400
Amazon MQ $15 (t3.micro) $30 (m5.large) $120 (m5.xlarge)
AWS App Runner $20–40 $50–150 $400–800
Amazon API Gateway WebSocket $3–5 $10–25 $50–100
Amazon CloudWatch Logs $5–10 $10–20 $30–50
Data Transfer $5 $10-20 $50-100
Total Estimated Cost $53–95 $135–295 $850–1,570

Troubleshooting

The following are common issues and their solutions when implementing the priority-based message processing system:

  • Messages not processing in priority order:
    • Verify JMS priority is configured correctly: message.setJMSPriority(priority)
    • Check ActiveMQ broker configuration for priority queue support
    • Confirm CLIENT_ACKNOWLEDGE mode is properly configured
    • Review queue consumer concurrency settings
  • WebSocket updates not working:
    • Verify DynamoDB Streams is enabled on the table
    • Check the Lambda function is triggered by stream events
    • Validate API Gateway WebSocket configuration and IAM permissions
    • Test the WebSocket connection using browser developer tools
  • Application scaling issues:
    • Monitor App Runner metrics in CloudWatch
    • Adjust auto scaling configuration based on traffic patterns
    • Consider Amazon MQ broker capacity and upgrade if needed
    • Review DynamoDB capacity settings and enable auto scaling

Clean up

To avoid incurring ongoing AWS charges, delete the resources you created in this walkthrough:

  1. Delete the CDK stacks:
cdk destroy MessageProcessingStack
cdk destroy FrontendStack
  1. Remove the App Runner service:
aws apprunner delete-service --service-arn <your-service-arn>
  1. Delete the ECR repositories and container images.
  2. Remove CloudWatch log groups if not set to auto-delete.
  3. Delete S3 buckets used for frontend hosting.

Next steps

To extend this solution and add additional capabilities, consider the following enhancements:

Conclusion

This solution demonstrates how to build a production-ready priority-based message processing system using AWS managed services. By combining Amazon MQ priority queuing with DynamoDB real-time streams and App Runner serverless compute, you create a resilient architecture that intelligently handles messages based on business priorities.The implementation of application-level delays with priority bypass makes sure critical messages receive immediate attention, and the dual-layer retry mechanism provides maximum reliability. Real-time WebSocket updates keep users informed of processing status, creating a responsive and transparent system.To learn more about the services and patterns used in this solution, explore the following resources:


About the authors

Know before you go – AWS re:Invent 2025 guide to Well-Architected and Cloud Optimization sessions

Post Syndicated from Anitha Selvan original https://aws.amazon.com/blogs/architecture/know-before-you-go-aws-reinvent-2025-guide-to-well-architected-and-cloud-optimization-sessions/

Are you ready to maximize your Well-Architected and Cloud Optimization learning and networking time at re:Invent 2025? We have put together this comprehensive guide to help you plan your schedule and make the most of the Well-Architected and cloud optimization sessions available this year. These sessions will deliver the practical guidance your teams need to lead strategic cloud initiatives, design next-generation architectures, optimize costs, or secure AI-powered systems.

Key themes at re:Invent for Well-Architected and Cloud Optimization – You can expect to see the following themes at re:Invent 2025

AI-powered architecture and governance

The sessions in this theme showcase how AWS is integrating AI technologies to transform traditional architectural practices. From using AI services for automated Well-Architected reviews to implementing self-evolving systems with agentic AI, these sessions demonstrate how you can use AI to automate architectural decisions, streamline governance processes, and scale best practices across the enterprise.

Sessions: ARC324-R, ARC317-R, SPS320, ARC302-R (session details are posted in the following section)

Well-Architected Framework evolution and implementation

These sessions highlight how the AWS Well-Architected Framework has evolved beyond its original scope to address modern architectural challenges. Attendees will learn how to implement the framework principles across different domains—from IoT security to backup strategies—while focusing on enterprise-scale governance and compliance.

Sessions: ARC204, SEC337, STG313-R, ARC323-R (session details are posted in the following section)

Cost optimization and FinOps

The cost optimization track focuses on innovative approaches to cloud financial management, with a strong emphasis on AI-powered FinOps solutions. Sessions range from hands-on workshops like the Frugal Architect GameDay to chalk talks on establishing effective cost governance models.

Sessions: ARC318-R, COP309-R, ARC309, DEV318 (session details are posted in the following section)

Session formats to fit your learning style

This year’s catalog features an exciting mix of content across different formats: from breakout, chalk talks, workshops, builder sessions to code talks.

Breakout sessions – Stay in the know

Sit back and enjoy these presentations to stay current with the latest solution enhancements and practical applications. AWS experts and guest speakers will share valuable insights and real-world examples.

From ideas to impact: Architecting with cloud best practices

ARC204 | Breakout session | December 1, 8:30 AM

Discover how foundational frameworks like the AWS Well-Architected Framework, AWS Cloud Adoption Framework, and AWS Cloud Operating Model evolved through customer feedback and real-world learnings from thousands of organizations, transforming from structured guidance into dynamic insights for optimizing cloud environments. Learn practical strategies for applying unified best practices to accelerate cloud transformation while managing large-scale architectural changes and maintaining operational excellence.

Build a well-architected foundation for scaling generative AI and agentic apps

AIM310 | Breakout session | December 1, 10:00 AM

Move beyond proof-of-concepts to build a production-ready foundation supporting all AI applications across your organization, addressing the critical transition from experimentation to enterprise-scale AI deployment. Navigate model access and management, tool discovery, memory and state handling, and observability at scale while building foundations that seamlessly integrate model access, orchestration workflows, agents, and tools with enterprise-grade governance controls.

AI-Powered Enterprise Architecture with ServiceNow & AWS 

ARC337-S | Breakout session | December 2, 3:00 PM

Enterprises face a core challenge: translating architectural vision into resilient cloud reality. See how integrating ServiceNow’s Enterprise Architecture Workspace with the AWS Well-Architected Tool transforms traditional design processes. Through elegant “shift-left” methodologies, architects gain contextual insights that seamlessly blend enterprise modeling with cloud best practices. This presentation is brought to you by ServiceNow, an AWS Partner.

The AI revolution in customer support: Building predictive service systems

SPS315 | Breakout session | December 3, 5:30 PM

Discover how AWS is using generative AI to transform customer support from reactive to proactive. We’ll show how large language models and AI agents are improving customer satisfaction and efficiency. Topics include smart case routing, context-aware support, early problem detection, and responsible AI use. We’ll share real results and discuss balancing AI capabilities with human touch.

Optimize AWS Costs: Developer Tools and Techniques

DEV318 | Breakout session | December 1, 3:00 PM

As cloud applications grow in complexity, optimizing costs becomes crucial for developers. This session explores AWS native tools and coding practices that reduce expenses without compromising performance or scalability.

Chalk talks

AWS speakers set the stage at the beginning of the talk and then open up for discussion. Bring your questions and dive deep into the topic with AWS experts and other customers.

Architecting agentic systems: Self-evolving patterns with AWS AI

ARC324-R | Chalk talk | December 2, 1:30 PM

Learn to architect self-evolving systems using agentic AI that align with AWS Well-Architected principles, exploring cutting-edge patterns for systems that adapt, heal, and optimize themselves autonomously while maintaining architectural integrity. Implement autonomous monitoring and self-healing capabilities with Amazon Bedrock Agents, design AI-driven security controls and automated recovery mechanisms and create systems that continuously adapt to workload patterns while maintaining reliability and performance standards.

Building Well-Architected agentic AI applications

ARC317-R/R1 | Chalk talk | December 2, 3:00 PM and December 4, 1:00 PM

Navigate generative AI agent development with robust architectural practices for security and compliance, focusing on proven patterns for building production-ready agentic AI applications that meet enterprise requirements. Design agent architectures with guardrails, monitoring systems, and access controls using the AWS Well-Architected Generative AI Lens while implementing governance patterns that ensure regulatory compliance and enable systems to scale from prototype to enterprise-wide deployment.

Using generative AI to automate architectural guidance

ARC315 | Chalk talk | December 1, 4:30 PM

Replace time-intensive manual processes with AI-powered systems that generate strategic recommendations, design principles, and best practices at scale while maintaining quality and consistency. Generate organization-specific design principles using AI analysis of architectural patterns, implement AI-driven guidance systems with effective quality control mechanisms, and build knowledge bases that feed AI-powered architectural guidance while maintaining human oversight and addressing ethical considerations.

Agentic architecting: From prototype to production-ready systems

ARC330-R/R1 | Chalk talk | December 2, 5:30PM and December 4, 2:30 PM

Transform prototypes into production-ready systems by incorporating security, monitoring, and CI/CD through agentic architecting, focusing on practical challenges of moving from experimental AI systems to production-grade architectures. Use AI agents to generate and optimize AWS CDK infrastructure and application code, implement automated security improvements and CI/CD pipeline creation, and maintain AWS Well-Architected principles while enabling teams to focus on business logic as AI handles infrastructure complexity.

AI-powered FinOps: Agent-based cloud cost management

ARC318-R/R1 | Chalk talk | December 1, 4:00 PM and December 3, 4:00 PM

Learn how intelligent agents tackle fragmented cost data and optimization processes in complex multi-account environments, moving beyond traditional FinOps approaches to autonomous, intelligent financial optimization. Architect solutions using Amazon OpenSearch Service for data aggregation and Amazon Bedrock for contextual reasoning to design secure, scalable FinOps solutions that continuously optimize costs while delivering measurable business outcomes.

Supercharge your well-architected reviews with AWS Generative AI

SPS320 | Chalk talk | December 3, 4:00 PM

Discover how Koch Industries revolutionized AWS Well-Architected reviews using generative AI, transforming weeks-long manual processes into automated, intelligent systems. Automate architectural assessments using Amazon Bedrock Knowledge Bases and Model Context Protocol (MCP) to scale best practice reviews and optimize workloads in minutes instead of days while achieving more comprehensive, consistent, and actionable recommendations through proven change management and organizational adoption strategies.

Architecting enterprise-scale governance beyond AWS Control Tower

ARC323-R/R1 | Chalk talk | December 3, 11:30 AM and December 4, 2:00PM

Discover advanced governance strategies that build upon AWS Control Tower for enterprise-scale environments requiring sophisticated compliance, security, and operational controls. Implement infrastructure across six Well-Architected Foundations capabilities with critical trade-off understanding, build efficient multi-account structures balancing security requirements with innovation needs, and architect automated compliance checks and policy enforcement at scale while enabling self-service capabilities with centralized governance and security controls.

Securing IoT Workloads with AWS IoT Lens and AWS Security Reference Architecture

SEC337 | Chalk talk | December 3, 11:30 AM

Industrial environments are reaching new levels of connectivity, automation, efficiency, and real-time data insights. However, this increased connectivity also introduces significant security challenges. Unaddressed security concerns can expose vulnerabilities and slow down companies looking to accelerate digital transformation using IoT and cloud. This chalk talk explores relevant techniques, architecture patterns, best practices and AWS security services for securing complex OT/IT environments, IoT devices, edge and cloud using the AWS Well-Architected IoT Lens and AWS Security Reference Architecture (SRA).

Establishing effective cost governance

COP309-R/R1 | Chalk talk | December 3, 3:00 PM and December 4, 12:30 PM

Generative AI agent development demands robust architectural practices for security and compliance. This chalk talk explores proven patterns for architecting secure, efficient AI agents using the AWS Well-Architected Generative AI Lens. Through collaborative discussion and whiteboarding, examine architectural governance and best practices for production environments. Learn to design agent architectures incorporating guardrails, monitoring systems, access controls, and sustainable deployment practices. Gain actionable insights for building secure, efficient, and cost-effective agentic AI applications that scale.

Break down monoliths, modernizing applications on Amazon ECS

CNS346 | Chalk talk | December 2, 4:30 PM

Join this interactive chalk talk to solve a common challenge where monolithic applications take months to deploy new features, and scaling becomes increasingly difficult. We’ll start with a real scenario, an application running on servers with a shared database. Together, we’ll design the modernization path using Amazon ECS and Well-Architected Framework principles. You’ll explore common architecture patterns, containerization strategies, CI/CD automation, and blue/green deployment approaches for ECS. After this session, you’ll walk away with a practical roadmap to transform your monolithic application into scalable microservices. Bring your curiosity and help us build the architecture live.

Hands-on workshop and Builders’ sessions

AWS speakers will introduce the use-case and tools designed to tackle the challenge. You will follow instructions, complete the tasks, and walk away with better understanding of the capabilities.

AI-powered Well-Architected reviews: Building automated governance

ARC302-R | Builders’ session | December 1, 9:00 AM; December 2, 11:30 AM and December 3, 4:00 PM

Build intelligent systems that automate AWS Well-Architected Framework reviews using generative AI, transforming manual architectural assessments into continuous, intelligent governance processes. Evaluate architecture against Well-Architected pillars while incorporating organization-specific requirements, implement continuous analysis of architecture and infrastructure as code templates, and enhance AI understanding of architectural context using Model Context Protocol servers to transform time-intensive reviews into scalable, automated processes with consistent governance.

AI-powered troubleshooting: From chaos to Well-Architected

ARC301-R | Builders’ session | December 1, 8:30 AM; December 2, 3:00 PM and December 3, 10:00 AM

Tackle complex scenarios using AI-powered tools to diagnose and resolve architectural problems, gaining practical experience using AI to transform poorly designed systems into well-architected solutions. Troubleshoot and optimize architectures with scaling bottlenecks and database inefficiencies using Amazon Q, apply Well-Architected principles to enhance performance and security under pressure, and accelerate problem identification and resolution while building AWS optimization expertise and learning to identify architectural anti-patterns before they become critical issues.

The Frugal Architect GameDay: Building cost-aware architectures

ARC309 | Workshop | December 1, 8:00 AM

Compete to implement cost efficiency improvements across multiple AWS services in this interactive GameDay, applying the Laws of the Frugal Architect to real-world scenarios for practical experience in transforming high-cost infrastructure into efficient, sustainable architectures. Address challenges spanning compute, networking, storage, serverless, and observability domains while learning to reduce cloud unit costs and improve profitability without compromising service quality through gamified scenarios that build rapid cost optimization decision-making skills.

Optimize AWS Backup using AI evaluation and Well-Architected best practices

STG313-R | Builders’ session | December 2, 1:30 PM and December 3, 8:30 AM

Enhance AWS Backup implementation using the AWS Backup Evaluator Solution, an AI agent that synthesizes data from multiple sources to provide intelligent backup optimization recommendations. Assess backup environments against the Well-Architected AWS Backup lens using Strands Agents SDK, create unified visibility across backup landscapes to identify optimization opportunities, and implement AI agents that continuously monitor backup efficiency while aligning with AWS best practices to enhance efficiency and cost-effectiveness.


Visit the AWS Cloud Support kiosk in the Venetian

Important notes:

Session dates, times, and locations listed in the post are subject to change as we continue to optimize the schedule on session popularity and venue capacity. Please check this blog post and the re:Invent session catalog regularly for the most up-to-date information about your registered sessions and newly added activities. For a full view of Well-architected content, including sessions with partners, explore the AWS re:Invent catalog and filter on the Well-Architected Framework area of interest.

Remember to reserve your seats early as popular sessions fill up quickly and bring your laptop for hands-on builders’ sessions and workshops. Register today


About the authors

How Yelp modernized its data infrastructure with a streaming lakehouse on AWS

Post Syndicated from Umesh Dangat original https://aws.amazon.com/blogs/big-data/how-yelp-modernized-its-data-infrastructure-with-a-streaming-lakehouse-on-aws/

This is a guest post by Umesh Dangat, Senior Principal Engineer for Distributed Services and Systems at Yelp, and Toby Cole, Principle Engineer for Data Processing at Yelp, in partnership with AWS.

Yelp processes massive amounts of user data daily—over 300 million business reviews, 100,000 photo uploads, and countless check-ins. Maintaining sub-minute data freshness with this volume presented a significant challenge for our Data Processing team. Our homegrown data pipeline, built in 2015 using then-modern streaming technologies, scaled effectively for many years. As our business and data needs evolved, we began to encounter new challenges in managing observability and governance across an increasingly complex data ecosystem, prompting the need for a more modern approach. This affected our outage incidents, making it harder to both assess impact and restore service. At the same time, our streaming framework struggled with Kafka for data streaming and permanent data storage. In addition, our connectors to analytical data stores experienced latencies exceeding 18 hours.

This came to a head when our efforts to comply with General Data Protection Regulation (GDPR) requirements revealed gaps in our infrastructure that would require us to clean up our data, while simultaneously maintaining operational reliability and reducing data processing times. Something had to change.

In this post, we share how we modernized our data infrastructure by embracing a streaming lakehouse architecture, achieving real-time processing capabilities at a fraction of the cost while reducing operational complexity. With this modernization effort, we reduced analytics data latencies from 18 hours to mere minutes, while also removing the need for using Kafka as a permanent storage for our change log streams.

The problem: Why we needed change

We started this transformation by initiating a migration from self-managed Apache Kafka to Amazon Managed Streaming for Apache Kafka (Amazon MSK), which significantly reduced our operational overhead and enhanced security. Amazon MSK’s express brokers also provided better elasticity for our Apache Kafka clusters. While these improvements were a promising start, we recognized the need for a more fundamental architectural change

Legacy architecture pain points

Let’s examine the specific challenges and limitations of our previous architecture that prompted us to seek a modern solution.

The following diagram depicts Yelp’s original data architecture.

Kafka topics proliferated across our infrastructure, creating long processing chains. As a result, each hop added latency, operational overhead, and storage costs. The system’s reliance on Kafka for both ingestion and storage created a fundamental bottleneck—Kafka’s architecture, optimized for high-throughput messaging, wasn’t designed for long-term storage and to handle complex querying patterns.

Another challenge was our custom “Yelp CDC” format—a proprietary change data capture language—was powerful and tailored to our needs. However, as our team grew and our use cases expanded, it introduced complexity and a steeper learning curve for new engineers. It also made integrations with off-the-shelf systems more complex and maintenance intensive.

The cost and latency trade-off

The traditional trade-off between real-time processing and cost efficiency had us caught in an expensive bind. Real-time streaming systems demand significant resources to maintain state within compute engines like Apache Flink, keep multiple copies of data across Kafka clusters, and run always-on processing jobs. Our infrastructure costs were growing, and it was largely driven by:

  • Long Kafka chains: Data often traversed 4-5 Kafka topics before reaching its destination and each topic was replicated for reliability
  • Duplicate data storage: The same data existed in multiple formats across different systems—raw in Kafka, processed in intermediate topics, and final forms in data warehouses and Flink RocksDB for join-like use cases
  • Complex custom tooling maintenance: The proprietary nature of our tools meant engineering resources were focused on maintenance rather than building new capabilities

Meanwhile, our business requirements became more demanding. Teams at Yelp needed faster insights, near-real-time results, and the ability to quickly run complex historical analyses without delay. This pushed us to shape our new architecture to improve streaming discovery and metadata visibility, provide more flexible transformation tooling, and simplify operational workflows with faster recovery times.

Understanding the streamhouse concept

To understand how we solved our data infrastructure challenges, it’s important to first grasp the concept of a streamhouse and how it differs from traditional architectures.

Evolution of data architecture

To understand why a streaming lakehouse or streamhouse was the answer to our challenges, it’s helpful to trace the evolution of data architectures. The journey from data warehouses to modern streaming systems reveals why each generation solved certain problems while creating new ones.

Data warehouses like Amazon Redshift and Snowflake brought structure and reliability to analytics, but their batch-oriented nature meant accepting hours or days of latency. Data lakes emerged to handle the volume and variety of big data, using low-cost object storage like Amazon S3, but often became “data swamps” without proper governance. The lakehouse architecture, pioneered by technologies like Apache Iceberg and Delta Lake, promised to combine the best of both, the structure of warehouses with the flexibility and economics of lakes.

But even lakehouses were designed with batch processing in mind. While they added streaming capabilities, these were often bolted on rather than fundamental to the architecture. What we needed was something different: a reimagining that treated streaming as a first-class citizen while maintaining lakehouse economics.

What makes a streamhouse different

A streamhouse, as we define it, is “a stream processing framework with a storage layer that leverages a table format, making intermediate streaming data directly queryable.” This seemingly simple definition represents a fundamental shift in how we think about data processing.

Traditional streaming systems maintain dynamic tables like materialized views in databases, but these aren’t directly queryable. You can only consume them as streams, limiting their utility for ad-hoc analysis or debugging. Lakehouses, conversely, excel at queries but struggle with low-latency updates and complex streaming operations like out-of-order event handling or partial updates.

The streamhouse bridges this gap by:

  • Treating batch as a special case of streaming, rather than a separate paradigm
  • Making data, including intermediate processing results, queryable via SQL
  • Providing streaming-native features like database change-data capture (CDC) and temporal joins
  • Leveraging cost-effective object storage while maintaining minute-level latencies

Core capabilities we needed

Our requirements for a streaming lakehouse were shaped by years of operating at scale:

Real-time processing with minute-level latency: While sub-second latency wasn’t necessary for most use cases, our previous hours-long delays weren’t acceptable. The sweet spot was processing latencies measured in minutes fast enough for real-time decision-making but relaxed enough to leverage cost-effective storage.

Efficient CDC handling: With numerous MySQL databases powering our applications, the ability to efficiently capture and process database changes was crucial. The solution needed to handle both initial snapshots and ongoing changes seamlessly, without manual intervention or downtime.

Cost-effective scaling: The architecture had to break the linear relationship between data volume and cost. This meant leveraging tiered storage, with hot data on fast storage and cold data on low-cost object storage, all while maintaining query performance.

Built-in data management: Schema evolution, data lineage, time travel queries, and data quality controls needed to be first-class features, not afterthoughts. Our experience maintaining our custom Schematizer taught us that these capabilities were essential for operating at scale.

The solution architecture

Our modernized data infrastructure combines several key technologies into a cohesive streamhouse architecture that addresses our core requirements while maintaining operational efficiency.

Our technology stack selection

We carefully selected and integrated several proven technologies to build our streamhouse solution.The following diagram depicts Yelp’s new data architecture.

After extensive evaluation, we assembled a modern streaming lakehouse stack, streamhouse, built on proven open source technologies:

Amazon MSK continues to deliver existing streams as they did before from source applications and services.

Apache Flink on Amazon EKS served as our compute engine, a natural choice given our existing expertise and investment in Flink-based processing. Its powerful stream processing capabilities, exactly-once semantics, and mature framework made it ideal for the computational layer.

Apache Paimon emerged as the key innovation, providing the streaming lakehouse storage layer. Born from the Flink community’s FLIP-188 proposal for built-in dynamic table storage, Paimon was designed from the ground up for streaming workloads. Its LSM-tree-based architecture provided the high-speed ingestion capabilities we needed.

Amazon S3 serves as our streamhouse storage layer, offering highly scalable capacity at a fraction of the cost. The shift from compute-coupled storage (Kafka brokers) to object storage represented a fundamental architectural change that unlocked massive cost savings.

Flink CDC connectors replaced our custom CDC implementations, providing battle-tested integrations with databases like MySQL. These connectors handled the complexity of initial snapshots, incremental updates, and schema changes automatically.

Architectural transformation

The transformation from our legacy architecture to the streamhouse model involved three key architectural shifts:

1. Decoupling ingestion from storage

In our old world, Kafka handled both data ingestion and storage, creating an expensive coupling. Every byte ingested had to be stored on Kafka brokers with replication for reliability. Our new architecture separated these concerns: Flink CDC handled ingestion by immediately writing to Paimon tables backed by S3. This separation reduced our storage costs by over 80% and improved reliability through the 11 nines of durability of S3.

2. Unified data format

The migration from our proprietary CDC format to the industry-standard Debezium format was more than a technical change. It reflected a broader move toward community-supported standards. We built a Data Format Converter that bridged the gap, allowing legacy streams to continue functioning while new streams leveraged standard formats. This approach facilitated backward compatibility while paving the way for future simplification.

3. Streamhouse tables

Perhaps the most radical change was replacing some of our Kafka topics with Paimon tables. These weren’t just storage locations—they were dynamic, versioned, queryable entities that supported:

  • Time travel queries in the table’s snapshot retention period
  • Automatic schema evolution without downtime
  • SQL-based access for both streaming and batch workloads
  • Built-in compaction and optimization

Key design decisions

Several key design decisions shaped our implementation:

SQL as the primary interface: Rather than requiring developers to write Java or Scala code for every transformation, SQL became our lingua franca. This democratized access to streaming data, allowing analysts and data scientists to work with real-time data using familiar tools.

Separation of compute and storage: By decoupling these layers, we could scale them independently. A spike in processing needs no longer meant provisioning more storage, and historical data could be kept indefinitely without impacting compute costs.

Embracing open source standards: The shift from home-grown formats and tools to community-supported projects reduced our maintenance burden and accelerated feature development. When issues arose, our engineers could leverage community knowledge rather than debugging in isolation.

Implementation journey

Our transition to the new streamhouse architecture followed a carefully planned path, encompassing prototype development, phased migration, and systematic validation of each component.

Migration strategy

Our migration to the streamhouse architecture required careful planning and execution. The strategy had to balance the need for transformation with the reality of maintaining critical production systems.

1. Prototype development

Our journey began with building foundational components:

  • Pure Java client library: Removing Scala dependencies were crucial for broader adoption. Our new library removed reliance on Yelp-specific configurations, allowing it to run in many environments.
  • Data Format Converter: This bridge component translated between our proprietary CDC format and the standard Debezium format, making sure existing consumers could continue operating during the migration.
  • Paimon ingestor: A Flink job that could ingest data from Kafka sources into Paimon tables, handling schema evolution automatically.

2. Phased rollout approach

Rather than attempting a “big bang” migration, we adopted a per-use case approach—moving a vertical slice of data rather than the entire system at once. Our phased rollout followed these steps:

  • Select a representative, real-world use case that provides broad coverage of the existing feature set.
    • In our use case, this included data sourced from both databases and event streams, with writes going to Cassandra and Nrtsearch
  • Re-implement the use case on the new stack in a development environment using sample data to test the logic
  • Shadow-launch the new stack in production to test it at scale
    • This was a critical step for us, as we had to iterate through various configuration tweaks before the system could reliably sustain our production traffic.
  • Verify the new production deployment against the legacy system’s output
  • Switch live traffic to the new system only after both the Yelp Platform team and data owners are confident in its performance and reliability
  • Decommission the legacy system for that use case once the migration is complete

This phased approach allowed our team to build confidence, identify issues early, and refine our processes before touching business-critical systems in production.

Technical challenges we overcame

The migration surfaced several technical challenges that required innovative solutions:

System integration: We developed comprehensive monitoring to track end-to-end latencies and built automated alerting to detect any degradation in performance.

Performance tuning: Initial write performance to Paimon tables was suboptimal for our higher-throughput streams. After careful analysis, we identified that Paimon was re-reading manifest files from S3 on every commit. To alleviate this, we enabled Paimon’s sink writer coordinator cache setting, which is disabled by default. This massively reduced the number of S3 calls during commits. We also found that writing parallelism in Paimon is limited by the number of “buckets” within a partition. Selecting the right number of buckets to allow you to scale horizontally, but also not spread your data too thinly is important for balancing write performance against query performance.

Data validation: Validating data consistency between our legacy Yelp CDC streams and the new Debezium-based format presented notable challenges. During the parallel run phase, we implemented comprehensive validation frameworks to make sure the Data Format Convertor accurately transformed messages, while maintaining data integrity, ordering guarantees, and schema compatibility across both systems.

Data migration complexity: For consistency, we developed custom tooling to verify ordering guarantees and implemented parallel running of old and new systems. We chose Spark as the framework to implement our validations as every data source and sink in our framework has mature connectors, and Spark is a well-supported system at Yelp.

Practical wins we achieved

Our implementation delivered transformative results:

Simplified streaming stack: By replacing multiple custom components with standardized tools, we avoided years of technical debt in one migration. We reduced our complexity and thereby simplified our entire streaming architecture, leading to higher reliability and less maintenance overhead. Our Schematizer, encryption layer, and custom CDC format were all replaced by built-in features from Paimon and standard Kafka, along with IAM controls across S3 and MSK.

Fine-grained access management: Moving our analytical use cases read via Iceberg unlocked a huge win for us: the ability to enable AWS Lake Formation on our data lake. Previously, our access management relied on large, complex S3 bucket policy documents that were approaching their size limits. By moving to Lake Formation we could build an access request lifecycle into our in-house Access Hub to automate access granting and revocation.

Built-in data management features: Capabilities that would have required months of custom development came out-of-the-box, such as automatic schema evolution, time travel queries, and incremental snapshots for efficient processing.

Potential for reduced operational costs: We anticipate that transitioning from Kafka storage to S3 in a streamhouse architecture will significantly reduce storage costs. Avoiding long Kafka chains will also simplify data pipelines and reduce compute costs.

Enhanced troubleshooting capabilities: The streamhouse architecture promises built-in observability features that will make debugging easier. Rather than having to manually look through event streams for problematic data, which can be time-consuming and complex for multi-stream pipelines, engineers can now query live data directly from tables using standard SQL.

Lessons learned and best practices

Throughout this transformation, we gained valuable insights about both technical implementation and organizational change management that can benefit others undertaking similar modernization efforts.

Technical insights

Our journey revealed several crucial technical lessons:

Battle-tested open source wins: Choosing Apache Paimon and Flink CDC over custom solutions proved wise. The community support, continuous improvements, and shared knowledge base accelerated our development and reduced risk.

SQL interfaces democratize access: Making streaming data accessible via SQL transformed who could work with real-time data. Engineers and analysts familiar with SQL can now understand how streaming pipelines work. The barrier to entry has been significantly lowered as engineers no longer need to understand Flink-specific APIs to create a streaming application.

Separation of storage and compute is fundamental: This architectural principle unlocked cost savings and operational flexibility that wouldn’t have been possible otherwise. Our teams can now optimize storage and compute independently based on their specific needs.

Organizational learnings

The human side of the transformation was equally important:

Phased migration reduces risk: Our gradual approach allowed teams to build confidence and expertise, while maintaining business continuity. Each successful phase created momentum for the next. Building trust with newer systems helps gain velocity in later stages of migrations.

Backward compatibility enables progress: By maintaining compatibility layers, our teams could migrate at their own pace without forcing synchronized changes across the organization.

Investment in learning pays dividends: Giving our teams space to learn new technologies like Paimon and streaming SQL had some opportunity cost, but they paid off through increased productivity and reduced operational burden.

Conclusion

Our transformation to a streaming lakehouse architecture (streamhouse) has revolutionized Yelp’s data infrastructure, delivering impressive results across multiple dimensions. By implementing Apache Paimon with AWS services like Amazon S3 and Amazon MSK, we reduced our analytics data latencies from 18 hours to just minutes while cutting storage costs by 80%. The migration also simplified our architecture by replacing multiple custom components with standardized tools, significantly reducing maintenance overhead and improving reliability.

Key achievements include the successful implementation of real-time processing capabilities, streamlined CDC handling, and enhanced data management features like automatic schema evolution and time travel queries. The shift to SQL-based interfaces has democratized access to streaming data, while the separation of compute and storage has given us unprecedented flexibility in resource optimization. These improvements have transformed not just our technology stack, but also how our teams work with data.

For organizations facing similar challenges with data processing latency, operational costs, and infrastructure complexity, we encourage you to explore the streamhouse approach. Start by evaluating your current architecture against modern streaming solutions, particularly those leveraging cloud services and open-source technologies like Apache Paimon. Make sure to leverage security best practices when implementing your solution. You can find AWS security best practices here. Visit the Apache Paimon website or AWS documentation to learn more about implementing these solutions in your environment.


About the authors

Umesh Dangat

Umesh Dangat

Umesh is a Senior Principal Engineer for Distributed Services and Systems at Yelp, where he architects and leads the evolution of Yelp’s high-performance distributed systems—spanning streaming, storage, and real-time data retrieval. He oversees the core infrastructure powering Yelp’s search, ranking, and data platforms, driving engineering efficiency by improving platform scalability, reliability, and alignment with business needs. Umesh is also an active open source contributor to projects such as Elasticsearch, NrtSearch, Apache Paimon, and Apache Flink CDC.

Toby Cole

Toby Cole

Toby is a Principle Engineer for Data Processing at Yelp, and has been working with distributed systems for the past 18 years. He has lead groundbreaking efforts to containerize datastores like Cassandra and is currently building Yelps next generation of streaming infrastructure. You can often find him petting cats and taking apart electrical devices for no apparent reason.

Ali Alemi

Ali Alemi

Ali is a Principal Streaming Solutions Architect at AWS. Ali advises AWS customers with architectural best practices and helps them design real-time analytics data systems which are reliable, secure, efficient, and cost-effective. Prior to joining AWS, Ali supported several public sector customers and AWS consulting partners in their application modernization journey and migration to the Cloud.

Bryan Spaulding

Bryan Spaulding

Bryan is a Senior Solutions Architect at AWS. Bryan works with AdTech customers to advise on their technology strategy, apply best practice AWS architecture patterns, and champion their interests within AWS. Prior to joining AWS, Bryan served in technology leadership roles in various Media & Entertainment and EdTech startups where he was also an AWS customer himself, and early in his career was a consultant in multiple professional services firms.

Behind the Streams: Real-Time Recommendations for Live Events Part 3

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/behind-the-streams-real-time-recommendations-for-live-events-e027cb313f8f

By: Kris Range, Ankush Gulati, Jim Isaacs, Jennifer Shin, Jeremy Kelly, Jason Tu

This is part 3 in a series called “Behind the Streams”. Check out part 1 and part 2 to learn more.

Picture this: It’s seconds before the biggest fight night in Netflix history. Sixty-five million fans are waiting, devices in hand, hearts pounding. The countdown hits zero. What does it take to get everyone to the action on time, every time? At Netflix, we’re used to on-demand viewing where everyone chooses their own moment. But with live events, millions are eager to join in at once. Our job: make sure our members never miss a beat.

When Live events break streaming records ¹ ² ³, our infrastructure faces the ultimate stress test. Here’s how we engineered a discovery experience for a global audience excited to see a knockout.

Why are Live Events Different?

Unlike Video on Demand (VOD), members want to catch live events as they happen. There’s something uniquely exciting about being part of the moment. That means we only have a brief window to recommend a Live event at just the right time. Too early, excitement fades; too late, the moment is missed. Every second counts.

To capture that excitement, we enhanced our recommendation delivery systems to serve real-time suggestions, providing members richer and more compelling signals to hit play in the moment when it matters most. The challenge? Sending dynamic, timely updates concurrently to over a hundred million devices worldwide without creating a thundering herd effect that would overwhelm our cloud services. Simply scaling up linearly isn’t efficient and reliable. For popular events, it could also divert resources from other critical services. We needed a smarter and more scalable solution than just adding more resources.

Orchestrating the moment: Real-time Recommendations

With millions of devices online and live event schedules that can shift in real time, the challenge was to keep everyone perfectly in sync. We set out to solve this by building a system that doesn’t just react, but adapts by dynamically updating recommendations as the event unfolds. We identified the need to balance three constraints:

  • Time: the duration required to coordinate an update.
  • Request throughput: the capacity of our cloud services to handle requests.
  • Compute cardinality: the variety of requests necessary to serve a unique update.
Visualizing constraints for real-time updates

We solved this constraint optimization problem by splitting the real-time recommendations into two phases: prefetching and real-time broadcasting. First, we prefetch the necessary data ahead of time, distributing the load over a longer period to avoid traffic spikes. When the Live event starts or ends, we broadcast a low cardinality message to all connected devices, prompting them to use the prefetched data locally. The timing of the broadcast also adapts when event times shift to preserve accuracy with the production of the Live event. By combining these two phases, we’re able to keep our members’ devices in sync and solve the thundering herd problem. To maximize device reach, especially for those with unstable networks, we use “at least once” broadcasts to ensure every device gets the latest updates and can catch up on any previously missed broadcasts as soon as they’re back online.

The first phase optimizes request throughput and compute cardinality by prefetching materialized recommendations, displayed title metadata, and artwork for a Live event. As members naturally browse their devices before the event, this data is prepopulated and stored locally in device cache, awaiting the notification trigger to serve the recommendations instantaneously. By distributing these requests naturally over time ahead of the event, we can eliminate any related traffic spikes and avoid the need for large-scale, real-time system scaling.

A phased approach, smoothing traffic requests over time with a real-time low-cardinality broadcast

The second phase optimizes request throughput and time to update devices by broadcasting a low-cardinality, real-time message to all connected devices at critical moments in a Live event’s lifecycle. Each broadcast payload includes a state key and a timestamp. The state key indicates the current stage of the Live event, allowing devices to use their pre-fetched data to update cached responses locally without additional server requests. The timestamp ensures that if a device misses a broadcast due to network issues, it can catch up by replaying missed updates upon reconnecting. This mechanism guarantees devices receive updates at least once, significantly increasing delivery reliability even on unstable networks.

A phased approach optimizes each constraint to ensure we can deliver for the big moment!

Moment in Numbers: During peak load, we have successfully delivered updates at multiple stages of our events to over 100 million devices in under a minute.

Under the Hood: How It Works

With the big picture in mind, let’s examine how these pieces interact in practice.

In the diagram below, the Message Producer microservice centralizes all of the business logic. It continuously monitors live events for setup and timing changes. When it detects an update, it schedules broadcasts to be sent at precisely the right moment. The Message Producer also standardizes communication by providing a concise GraphQL schema for both device queries and broadcast payloads.

Rather than sending broadcasts directly to devices via WebSocket, the Message Producer hands them off to the Message Router. The Message Router is part of a robust two-tier pub/sub architecture built on proven technologies like Pushy (our WebSocket proxy), Apache Kafka, and Netflix’s KV key-value store. The Message Router tracks subscriptions at the Pushy node granularity, while Pushy nodes map the subscriptions to individual connections, creating a low-latency fanout that minimizes compute and bandwidth requirements.

Devices interface with our GraphQL Domain Graph Service (DGS). These schemas offer multiple query interfaces for prefetching, allowing devices to tailor their requests to the specific experience being presented. Each response adheres to a consistent API that resolves to a map of stage keys, enabling fast lookups and keeping business logic off the device. Our broadcast schema specifies WebSocket connection parameters, the current event stage, and the timestamp of the last broadcast message. When a device receives a broadcast, it injects the payload directly into its cache, triggering an immediate update and re-render of the interface.

Balancing the Moment: Throughput Management

In addition to building the new technology to support real-time recommendations, we also evaluated our existing systems for potential traffic hotspots. Using high-watermark traffic projections for live events, we generated synthetic traffic to simulate game-day scenarios and observed how our online services handled these bursts. Through this process, several common patterns emerged:

Breaking the Cache Synchrony

Our game-day simulations revealed that while our approach mitigated the immediate thundering herd risks driven by member traffic during the events, live events introduced unexpected mini thundering herds in our systems hours before and after the actual events. The surge of members joining just in time for these events led to concentrated cache expirations and recomputations, which created traffic spikes well outside the event window that we did not anticipate. This was not a problem for VOD content because the member traffic patterns are a lot smoother. We found that fixed TTLs caused cache expirations and refresh-traffic spikes to happen all at once. To address this, we added jitter to server and client cache expirations to spread out refreshes and smooth out traffic spikes.

Adaptive Traffic Prioritization

While our services already leverage traffic prioritization and partitioning based on factors such as request type and device type, live events introduced a distinct challenge. These events generated brief traffic bursts that were intensely spiky and placed significant strain on our systems. Through simulations, we recognized the need for an additional event-driven layer of traffic management.

To tackle this, we improved our traffic sharding strategies by using event-based signals. This enabled us to route live event traffic to dedicated clusters with more aggressive scaling policies. We also added a dynamic traffic prioritization ruleset that activates whenever we see high requests per second (RPS) to ensure our systems can handle the surge smoothly. During these peaks, we aggressively deprioritize non-critical server-driven updates so that our systems can devote resources to the most time-sensitive computations. This approach ensures smooth performance and reliability when demand is at its highest.

Snapshot of non-critical traffic volume decline (in %) for a member-facing service during a live event — achieved via aggressive de-prioritization

Looking Ahead

When we set out to build a seamlessly scalable scheduled viewing experience, our goal was to create a dynamic and richer member experience for live content. Popular live events like the Crawford v. Canelo fight and the NFL Christmas games truly put our systems to the test. Along the way, we also uncovered valuable learnings that continue to shape our work. Our attempts to deprioritize traffic to other non-critical services caused unexpected call patterns and spikes in traffic elsewhere. Similarly, in hindsight, we also learned that the high traffic volume from popular events caused excessive non-essential logging and was putting unnecessary pressure on our ingestion pipelines.

None of this work would have been possible without our stunning colleagues at Netflix who collaborated across multiple functions to architect, build, and test these approaches, ensuring members can easily access events at the right moment: UI Engineering, Cloud Gateway, Data Science & Engineering, Search and Discovery, Evidence Engineering, Member Experience Foundations, Content Promotion and Distribution, Operations and Reliability, Device Playback, Experience and Design and Product Management.

As Netflix’s content offering expands to include new formats like live titles, free-to-air linear content, and games, we’re excited to build on what we’ve accomplished and look ahead to even more possibilities. Our roadmap includes extending the capabilities we developed for scheduled live viewing to these emerging formats. We’re also focused on enhancing our engineering tooling for greater visibility into operations, message delivery, and error handling to help us continue to deliver the best possible experience for our members.

Join Us for What’s Next

We’re just scratching the surface of what’s possible as we bring new live experiences to members around the world. If you are looking to solve interesting technical challenges in a unique culture, then apply for a role that captures your curiosity.

Look out for future blog posts in our “Behind the Streams” series, where we’ll explore the systems that ensure viewers can watch live streams once they manage to find and play them.


Behind the Streams: Real-Time Recommendations for Live Events Part 3 was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Modernization of real-time payment orchestration on AWS

Post Syndicated from Neeraj Kaushik original https://aws.amazon.com/blogs/architecture/modernization-of-real-time-payment-orchestration-on-aws/

The global real-time payments market is experiencing significant growth. According to Fortune Business Insights, the market was valued at USD 24.91 billion in 2024 and is projected to grow to USD 284.49 billion by 2032, with a CAGR of 35.4%. Similarly, Grand View Research reports that the global mobile payment market, valued at USD 88.50 billion in 2024, is expected to grow at a CAGR of 38.0% from 2025 to 2030. (Disclaimer: Third-party market research and statistics are provided for informational purposed only. AWS and IBM make no representations about the accuracy of this information.)

This rapid expansion underscores the urgency for financial institutions to modernize their payment processing infrastructure. Financial institutions often need to process high volume of transactions with near-zero latency to meet stringent service level agreements (SLAs) to support surging mobile payments volume.

However, traditional payment orchestration systems, often built on monolithic architectures, struggle to meet these demands due to latency, availability, and scalability challenges. Additionally, their reliance on on-premises infrastructure leads to higher costs and an impediment to innovation, reinforcing the need for modernization.

As sustainability becomes a priority, organizations are turning to cloud-based solutions to optimize infrastructure, reduce carbon footprints, and enhance energy efficiency. This shift provides scalability and performance, and aligns with global sustainability goals, securing the future of real-time payments.

In this post, we discuss the real-time payment orchestration framework. It uses an event-driven architecture and AWS serverless services to enhance the resiliency, efficiency, and scalability of real-time payments. By decomposing payment processing into distinct business capabilities, financial institutions can improve modularity and flexibility. Implementing tenant-based segregation helps with data isolation and security. Additionally, adopting asynchronous communication through Amazon Managed Streaming for Apache Kafka (Amazon MSK) enhances scalability and resilience.

Traditional real-time payment orchestration

Payment orchestration serves as a middleware solution, streamlining transaction processing across multiple payment methods, gateways, and financial institutions. It orchestrates key business functions such as payment authorization, payment processing, settlement and clearing, compliance and risk management, and account management for both inbound and outbound payment flows.

The following diagram depicts the high-level business capabilities supported by payment orchestrators across various payment flows, including real-time payments, digital disbursements, tax payments, wires, and more.

Payment processing system flowchart showing main components from acceptance to billing

Detailed flowchart depicting a payment processing system with multiple components. The diagram shows primary payment types at the top (including Realtime Payments, Digital Disbursement, Credit Transfer, and Peer to Peer Payments) flowing down through core processing stages including Payment Acceptance, Execution, Clearing, Reporting, Tracking, Reversals, and Billing.

Many financial institutions adopt a tenant-based approach organized by geography due to varying clearing processes, localized regulations, and transaction requirements across AWS Regions. However, without proper separation of services, teams often continue to add region-specific logic to existing services, gradually increasing their monolithic complexity and using the same infrastructure for all payment flows.

Traditional payment systems process transactions linearly, with each step waiting for the previous one to complete. However, analysis of payment workflows reveals numerous opportunities for parallel execution:

  • Sanctions screening and fraud detection – Compliance and fraud checks can run simultaneously with initial routing decisions, rather than sequentially blocking all subsequent processing
  • Payment routing and authorization requests – When basic validations are complete, routing and authorization can proceed in parallel rather than one after another
  • Payment execution and ledger updates – The actual payment execution doesn’t need to wait for ledger records to be updated—these can occur concurrently
  • Settlement, reconciliation, and tracking – These post-transaction processes can be initiated independently as soon as the primary transaction is complete

This parallel approach can dramatically improve throughput and reduce latency compared to traditional queue-based systems where operations form a sequential chain that extends processing time and creates bottlenecks.

Most legacy payment orchestration systems rely heavily on on-premises virtual machines (VMs), leading to several challenges:

  • Multi-Region support for disaster recovery and multi-tenancy resulting in significant capital expenditure and operational overhead
  • High latency and SLA issues caused by sequential message processing and delays between globally separated data centers
  • Limited reusability of payment flows as monolithic architectures require region-specific changes for local clearing mechanisms and regulations, increasing complexity and costs
  • Scalability challenges and high memory consumption due to inefficient resource utilization and execution of irrelevant logic across regions
  • Complex cross-border payment routing caused by variations in clearing rules, transaction limits, and local regulations, increasing latency and routing errors
  • Integration challenges with diverse data formats because legacy systems rely on proprietary standards (for example, ISO 20022, SWIFT MT), complicating data conversion and compliance
  • High deployment complexity for new payment flows due to monolithic architectures requiring extensive region-specific modifications, slowing time to market
  • Environmental impact and high carbon footprint from on-premises infrastructure consuming excessive energy, whereas cloud-based approaches improve efficiency

Solution overview

To overcome these challenges, the proposed architecture embraces the following design principles to build a future-ready, real-time payment orchestration solution:

  • Performance at scale – Handling over 1,000 transactions per second (TPS) with consistent low latency under varying load conditions.
  • High availability – Achieving 99.999% uptime to meet the strict requirements of financial transactions.
  • Geographic resilience – Supporting global operations with region-specific compliance while maintaining consistent performance.
  • Cost optimization – Reducing total cost of ownership through efficient resource utilization and serverless technologies.
  • Security and compliance – Supporting data protection and regulatory adherence across different jurisdictions.
  • Operational simplicity – Streamlining deployment, monitoring, and maintenance across the payment ecosystem.
  • Microservices – Decomposing payment processing into distinct business capabilities, so financial institutions can improve modularity and flexibility. This microservices-based approach allows for independent scaling and development of critical components.

The following diagram depicts the high-level solution architecture for real-time payments. The existing channels using synchronous or asynchronous APIs can be modified to use edge-optimized endpoints to reduce latency.

Event-driven payment orchestration system with pub/sub channels connecting multiple payment processing modules

Architecture diagram detailing an AWS-based payment orchestration platform utilizing event-driven principles. Features reusable components across two regions, with dedicated modules for payment initiation, execution, reconciliation, billing, and risk management. Implements pub/sub messaging patterns for inter-component communication and connects to enterprise systems including accounting, compliance, and analytics.

An event-driven architecture is used for payment orchestration, which handles communication through a pub/sub pattern. This architecture maintains persistent connections, improving performance of the end-to-end real-time payment processing.

The event-driven architecture for real-time payment processing allows multiple payment operations to occur simultaneously using different adaptors, as opposed to the traditional systems where payment processes are sequential and flow through a single pipeline. Payment events are distributed to specialized payment processor microservices based on their function (initiation, execution, tracking, settlements), enabling each to process independently without waiting for others to complete.

Because we’re transitioning from sequential processing to distributed, maintaining transaction traceability is crucial. The payment tracking adapters shown in the preceding diagram connect to enterprise analytics systems, creating a specialized layer for monitoring transactions. The pub/sub model allows for attaching correlation IDs to events, enabling systems to track related events across different topics and processing stages.

A standardized event schema serves as the foundation for this architecture, providing consistency across regional deployments while allowing for customization at the adapter level. This schema defines uniform event structures containing tenant-specific metadata and supports versioning to accommodate evolving requirements. By isolating region-specific variations to the adapter layer, the solution maintains core functionality while interfacing with diverse enterprise systems through configuration-driven customization rather than code changes.

For most payment processes, especially those with independent processing steps that can run in parallel, this architecture delivers net performance gains despite the topic switching overhead, particularly for complex transactions where multiple independent validations or processing steps are required.

Deployment on the AWS Cloud

The solution uses edge-optimized Amazon API Gateway for channels. An edge-optimized API endpoint routes requests to the nearest Amazon CloudFront Point of Presence (POP), which can help in cases where your clients are geographically distributed to enable efficient routing within each geographical region, enhancing global responsiveness by minimizing network round trips and making sure requests take the shortest possible path before transitioning from the public internet to the client network.

The following diagram illustrates the high-level solution architecture for real-time payments.

Multi-region AWS payment architecture with managed Kafka topics connecting Lambda microservices and DynamoDB storage

Comprehensive AWS payment orchestration solution implementing modern cloud-native architecture principles. Core processing logic implemented as Lambda functions covering initiation, execution, reconciliation, billing, tracking, risk management, and settlement workflows. Leverages Amazon MSK for reliable event streaming between components, with dedicated Kafka topics for each processing stage. Data persistence handled by Amazon DynamoDB, supporting cross-region operations. Architecture demonstrates AWS best practices for financial services, including regional redundancy, serverless computing, managed services, and event-driven design patterns. System integrates with external banking infrastructure and enterprise systems while maintaining separation of concerns through microservices architecture. Features built-in support for compliance monitoring, risk management, and payment tracking through specialized Lambda functions.

The solution uses Amazon MSK to implement an event-driven architecture that efficiently handles both inbound and outbound channels traffic through API requests and asynchronous message-based events. Amazon MSK communicates using a high-performance binary protocol between producers, consumers, and brokers, providing low latency and high throughput. Real-time payments are logically partitioned across multiple tenants within geographical regions—North America, EMEA, LATAM, and Asia-Pacific.

Each real-time payment tenant follows an active/active disaster recovery strategy by deploying MSK clusters across multiple AWS Regions, designed to achieve high availability and resilience. Amazon MSK offer both serverless and provisioned cluster options. The team can decide to select one or the other depending on the non-functional requirements and team expertise. Amazon MSK automatically manages partition leadership with leaders in primary Regions and followers in secondary Regions. During failover, leaders are re-elected in healthy Regions, designed to help maintain processing capabilities during regional incidents. Sticky partitioning uses consistent hashing for deterministic routing, and cooperative rebalancing enables efficient failover. Multi-AZ deployment provides zone redundancy and isolated clusters per Region for data sovereignty compliance through programmatic AWS Identity and Access Management (IAM) and virtual private cloud (VPC) boundaries.

To support seamless cross-Region replication and maintain message continuity, Amazon MSK Replicator—a fully managed feature of Amazon MSK—is used to replicate topics and synchronize consumer group offsets across clusters. MSK Replicator simplifies the process of building multi-Region Kafka applications by not needing custom code, open-source tool configuration, or infrastructure management. It automatically provisions and scales the necessary resources, so teams can focus on business logic while only paying for the data being replicated. In the event of a regional outage or failover, traffic can be automatically redirected to a healthy Region without data loss or service disruption, providing near-zero Recovery Time Objectives (RTOs) and uninterrupted operations for downstream services such as payment processors and audit trail consumers.

In addition to regional redundancy, the architecture uses an event-driven architecture to enable parallel and decoupled processing of payment transactions. Events such as transaction initiation, validation, and settlement are emitted asynchronously and consumed by various microservices independently, which drastically reduces end-to-end latency.

To process these events at scale, the architecture can use AWS Lambda, Amazon Elastic Container Service (Amazon ECS), or Amazon Elastic Kubernetes Service (Amazon EKS) depending upon non-functional requirements. Automatic scaling responds to Amazon CloudWatch metrics, and exponential backoff retry logic with dead-letter queues (DLQs) handles throttling scenarios. Circuit breakers prevent cascade failures during high error rates.

One of the key benefits of the solution is the reusability of payment flows across different regions. Although each region has its own unique compliance requirements and settlement rules, the core functionalities of real-time payments (payment authorization, payment processing, settlement and clearing) are largely similar. This reusability enables rapid deployment of payment solutions across new regions without rearchitecting the entire system. For example, the real-time payment system in the US and UK might share similar business logic for real-time gross settlement but differ in the clearing and compliance requirements. The solution treats these as bounded contexts within the microservices architecture, providing flexibility while making sure each region can handle its own specific rules and regulations.

Sustainability

AWS relentlessly innovates its infrastructure design, build, and operations to make progress towards net-zero carbon by 2040 and being water positive by 2030. Amazon MSK with AWS Graviton based instances use up to 60% less energy than comparable M5 instances, helping you achieve your sustainability goals. Lambda is inherently sustainable by design. Its serverless model makes sure compute resources are only used when needed, drastically reducing idle infrastructure and wasted energy. Instead of keeping always-on servers for infrequent tasks, Lambda provisions compute power just-in-time, achieving near-zero idle capacity.

Security and compliance in financial services

Given the sensitive nature of payment transactions and financial data, you should apply the security controls required to meet financial regulations such as AWS PCI DSS and AWS Federal Information Processing Standard (FIPS) 140-3 according to your organization’s needs.

The solution should incorporate multi-layered security controls, continuous monitoring, and automated compliance auditing to meet the rigorous expectations of banking regulators and internal risk teams. For more information, refer to Security Guidance.

Conclusion

The modernization of payment orchestration systems using an event-driven architecture and AWS serverless technologies marks a significant advancement in meeting the demands of today’s rapidly evolving financial services landscape. This solution addresses the key challenges faced by traditional payment systems while delivering substantial benefits in performance, scalability, cost optimization, global resilience, sustainability, and compliance. By using cutting-edge cloud technologies and robust security controls, financial institutions can now build a future-ready foundation that adapts to evolving business needs while maintaining the highest standards of performance, security, and reliability. As the real-time payments market continues its explosive growth, this modern architecture provides a solution that meets today’s demands and is also well-positioned to support tomorrow’s payment innovations. Organizations looking to modernize their payment infrastructure can use this blueprint to accelerate their digital transformation journey, supporting sustainable, secure, and efficient payment processing at scale in an increasingly competitive global marketplace.

The architecture presented here is for reference purposes only. IBM will work closely with you to deploy the solution in accordance with industry standards and compliance requirements.For additional resources, refer to:

IBM Consulting is an AWS Premier Tier Services Partner that helps customers who use AWS to harness the power of innovation and drive their business transformation. They are recognized as a Global Systems Integrator (GSI) for over 22 competencies, including Financial Services Consulting. For additional information, please contact an IBM Representative.

Create an OpenSearch dashboard with Amazon OpenSearch Service

Post Syndicated from Smita Singh original https://aws.amazon.com/blogs/big-data/create-an-opensearch-dashboard-with-amazon-opensearch-service/

Effective log analysis is essential for maintaining the health and performance of modern applications. Amazon OpenSearch Service stands out as a powerful, fully managed solution for log analytics and observability. With its advanced indexing, full-text search, and real-time analytics capabilities, OpenSearch Service makes it possible for organizations to seamlessly ingest, process, and search log data across diverse sources—including AWS services like Amazon CloudWatch, VPC Flow Logs, and more.

With OpenSearch Dashboards, you can turn indexed log data into actionable visualizations that reveal insights and help detect anomalies. By querying data stored in OpenSearch Service, you can extract relevant information and display it using a variety of visualization types—such as line charts, bar graphs, pie charts, heatmaps, and more. These tools make it effortless to monitor system behavior, spot trends, and quickly identify issues in your environment.

This post demonstrates how to harness OpenSearch Dashboards to analyze logs visually and interactively. With this solution, IT administrators, developers, and DevOps engineers can create custom dashboards to monitor system behavior, detect anomalies early, and troubleshoot issues faster through interactive charts and graphs.

Solution overview

In this post, we show how to create an index pattern in OpenSearch Dashboards, create two types of visualizations, and display these visualizations on a custom dashboard. We also demonstrate how to export and import visualizations.

Prerequisites

Before diving into log analysis with OpenSearch Dashboards, you must have the following:

  • A properly configured OpenSearch Service domain
  • A working log collection and ingestion pipeline

Amazon OpenSearch Service 101: Create your first search application with OpenSearch guides you through setting up your OpenSearch Service domain and configuring the log ingestion pipeline.

For this post, we work with the following log sources, which have already been ingested into an OpenSearch Service cluster as part of the prerequisite steps:

Access OpenSearch Dashboards

Complete the following steps to access OpenSearch Dashboards:

  1. On the OpenSearch Service console, choose Domains in the navigation pane.
  2. Check if your domain status shows as Active.
  3. Choose your domain to open the domain details page.
  4. Choose the OpenSearch Dashboards URL to open it in a new browser window.

  1. Authenticate into OpenSearch Dashboards using one of the supported methods.

Create an index pattern

After you’re logged in to OpenSearch Dashboards, you must create an index pattern. An index pattern allows OpenSearch Dashboards to locate indexes to search. Complete the following steps

  1. In OpenSearch Dashboards, expand the navigation pane and choose Dashboard Management under Management.
  2. Choose Index patterns in the navigation pane.

  1. Choose Create index pattern.
  2. For Index pattern name, enter a name (for example, log-aws-cloudtrail-*).
  3. Choose Next step.

  1. For Time field¸ choose @timestamp.
  2. Choose Create index pattern.

Create visualizations

Now that the index pattern is created, let’s create some visualizations. For this post, we create a pie chart and an area graph.

Create a pie chart

Complete the following steps to create a pie chart:

  1. In OpenSearch Dashboards, choose Visualize in the navigation pane.

  1. Choose Create visualization.

  1. Choose Pie as the visualization type.
  2. For Source¸ choose log-aws-cloudtrail-*.

  1. Under Buckets¸ choose Add and Split slices.

  1. For Aggregation, choose Terms.

  1. For Field, choose eventName.
  2. For Size, enter 10.

  1. Leave all other parameters as default and choose Update.
  2. Choose Save to save the visualization.

Sample ndjson file for the pie chart – EventNamePie.ndjson

Please refer Export and import visualizations for how to import the samples.

The following screenshot shows our pie chart, which displays different types of events and their occurrence percentage in the last 30 minutes.

Create an area graph

Complete the following steps to create an area graph:

  1. In OpenSearch Dashboards, choose Visualize in the navigation pane.
  2. Choose Create visualization.
  3. Choose Area as the visualization type.

  1. For Source¸ choose log-aws-cloudtrail-*.

  1. Under Buckets¸ choose Add and X-axis.

  1. For Aggregation, choose Date Histogram.
  2. For Field, choose @timestamp.
  3. Leave all other parameters as default and choose Update

  1. Under Advanced¸ choose Add and Split series.

  1. For Aggregation, choose Terms.
  2. For Field, choose eventName.
  3. For Size, enter 10.
  4. Leave all other parameters as default and choose Update.

  1. Choose Save.
  2. Update the time range to Last 60 minutes.
  3. Choose Refresh and Save.

The following screenshot shows an area graph with different types of events and their occurrence count in the last 60 minutes.

Sample ndjson file for Area chart – EventNameArea.ndjson

Please refer Export and import visualizations for how to import the samples.

Create a dashboard

Now we will combine the visualizations we just created into a dashboard. A dashboard serves as a customizable interface that consolidates multiple visualizations, saved searches, and various content into a comprehensive view of data. Users can combine diverse visual elements—including charts, graphs, metrics, and tables—into a single cohesive display that can be arranged and resized on a flexible grid layout. You can simultaneously apply filters and time ranges across multiple visualizations, creating a coordinated analytical experience. Complete the following steps to create a dashboard:

  1. In OpenSearch Dashboards, choose Dashboards in the navigation pane.
  2. Choose Create new dashboard.

  1. Choose Add on the menu bar.

  1. Search for and choose the visualizations you created.

You can resize panels by dragging their corners to adjust dimensions. To modify the layout arrangement, you can drag the top portion of panels, which allows you to organize them horizontally in a row formation. When working with tabular visualizations, the system provides a convenient option to export your results in CSV format for further analysis or reporting purposes.

  1. Choose Save.
  2. Change the time range to Last 60 minutes.
  3. Choose Refresh and Save.

Sample ndjson file for dashboard – CloudTrailSummary.ndjson

Please refer Export and import visualizations for how to import the samples.

The following screenshot shows the CloudTrail dashboard displaying both visualizations.

Export and import visualizations

In OpenSearch, an NDJSON file is used to import and export saved objects, such as dashboards, visualizations, maps, and index template. The NDJSON file provides a streamlined approach for handling large datasets by representing each JSON object on a separate line. This format enables efficient import/export operations, simplified data migration between environments, and seamless sharing of complex dashboard configurations. Organizations can back up and restore critical visualizations, saved searches, and dashboard settings while maintaining their integrity. The format’s structure reduces memory overhead during large transfers and improves processing speed for bulk operations. NDJSON’s human-readable nature also facilitates troubleshooting and manual editing when necessary, making it an invaluable tool for maintaining OpenSearch Dashboards deployments across development, testing, and production environments.

Export a visualization

Complete the following steps to export a visualization:

  1. In OpenSearch Dashboards, choose Saved objects in the navigation pane.
  2. Search for and select your object (in this case, a visualization), then choose Export.

The NDJSON file is downloaded in your local host.

Import a visualization

Complete the following steps to import a visualization:

  1. In OpenSearch Dashboards, choose Saved objects in the navigation pane.
  2. Choose Import.
  3. Choose the first NDJSON file to be imported from your local host.
  4. Select Create new objects with random IDs.
  5. Choose Import.

  1. Choose Done.

  1. Choose Import.

You can now open the imported object.

The following screenshot shows our updated dashboard.

Clean up

To clean up your resources, delete the OpenSearch Service domain and relevant information stored or visualizations created on the domain. You will not be able to recover the data after you delete it.

  1. On the OpenSearch Service console, choose Domains in the navigation pane.
  2. Select the domain you created and choose Delete.

Conclusion

OpenSearch Dashboards is a powerful tool for transforming raw log data into actionable visualizations that drive insights and decision-making. In this post, we’ve shown how to create visualizations like pie charts and area graphs, build comprehensive dashboards, and efficiently export and import your work using NDJSON files. By using the fully managed OpenSearch Service features, organizations can focus on extracting valuable insights rather than managing infrastructure, ultimately enhancing their observability posture and operational efficiency.

To further enhance your OpenSearch proficiency, consider exploring advanced visualization options such as heat maps, gauge charts, and geographic maps that can represent your data in more specialized ways. Implementing automated alerting based on predefined thresholds will help you proactively identify anomalies before they become critical issues. You can also use OpenSearch’s powerful machine learning capabilities for sophisticated anomaly detection and predictive analytics to gain deeper insights from your log data. As your implementation grows, customizing security settings with fine-grained access controls will provide appropriate data visibility across different teams in your organization.

For comprehensive learning resources, refer to the Amazon OpenSearch Service Developer Guide, watch Create your first OpenSearch Dashboard on YouTube, explore best practices in Amazon OpenSearch blog posts, and gain hands-on experience through workshops available in AWS Workshops.


About the Authors

Smita Singh is a Senior Solutions Architect at AWS. She focuses on defining technical strategic vision and works on architecture, design, and implementation of modern, scalable platforms for large-scale global enterprises and SaaS providers. She is a data, analytics, and generative AI enthusiast and is passionate about building innovative, highly scalable, resilient, fault-tolerant, self-healing, multi-tenant platform solutions and accelerators.

Dipayan Sarkar is a Specialist Solutions Architect for Analytics at AWS, where he helps customers modernize their data platform using AWS analytics services. He works with customers to design and build analytics solutions, enabling businesses to make data-driven decisions.

Secure file sharing solutions in AWS: A security and cost analysis guide: Part 2

Post Syndicated from Swapnil Singh original https://aws.amazon.com/blogs/security/secure-file-sharing-solutions-in-aws-a-security-and-cost-analysis-guide-part-2/

As introduced in Part 1 of this series, implementing secure file sharing solutions in AWS requires a comprehensive understanding of your organization’s needs and constraints. Before selecting a specific solution, organizations must evaluate five fundamental areas: access patterns and scale, technical requirements, security and compliance, operational requirements, and business constraints. These areas cover everything from how files will be shared and what protocols are needed, to security measures, day-to-day operations, and business limitations.

See Part 1 of this series for detailed information about each of these fundamental areas and their specific considerations. Part 1 also covers solutions including AWS Transfer Family, Transfer Family web apps, and Amazon Simple Storage Service (Amazon S3) pre-signed URLs. This part continues our analysis with additional AWS file sharing solutions to help you make an informed decision based on your specific requirements.

Solutions

Let’s start by looking at the various file sharing mechanisms that AWS supports. The following table identifies the key AWS services needed for each solution, describes the security and cost implications of the solutions, and describes their complexity and protocol support capabilities.

Solution AWS services Security features Cost* Region control
CloudFront signed URLs CloudFront, Amazon S3, and Lambda Optional edge security using AWS Lambda@Edge, WAF integration, SSL/TLS, geo restrictions, and AWS Shield Standard (included automatically) Content delivery network (CDN) costs, request pricing, and data transfer fees Global service by design; origin can be AWS Region-specific
Amazon VPC endpoint service AWS PrivateLink, Amazon VPC, and Network Load Balancer (NLB) Complete network isolation, private connectivity, and multi-layer security Endpoint hourly charges, NLB costs, and data processing fees Service endpoints are strictly Region-specific; must create endpoints in each Region where access is needed
S3 Access Points Amazon S3, IAM, Amazon VPC (for VPC-specific access points)
  • Dedicated IAM policies per access point
  • VPC-only access restrictions available
  • Works with bucket policies for layered security
  • Supports PrivateLink for private network access
  • Compatible with S3 Block Public Access settings
  • No additional charge for S3 Access Points
  • Standard S3 request pricing applies
  • Data transfer fees apply based on standard S3 rates
  • Amazon VPC endpoint charges apply when using VPC endpoints with access points
  • Access points are Region-specific
  • Each access point is created in the same Region as its S3 bucket
  • Cross-Region access requires separate access points in each Region
  • VPC-specific access points are limited to the VPC’s Region

The following table shows the solutions described in Part 1.

Solution AWS services Security features Cost* Region control
AWS Transfer Family Transfer Family, Amazon S3, API Gateway, and Lambda Managed security, encryption in transit and at rest, IAM integration, and custom authentication $0.30 per hour per protocol, data transfer fees, and storage costs Can deploy to specific AWS Regions, can only transfer files to and from S3 buckets in the same Region
Transfer Family web apps Transfer Family, S3, and CloudFront Browser-based access, IAM Identity Center integration, and S3 Access Grants Pay-per-file operation, CloudFront costs, and storage costs Uses CloudFront (global) for web access, but backend components can be Region-specific
Amazon S3 pre-signed URLs S3 Time-limited URLs, IAM controls for URL generation, and HTTPS S3 request and data transfer fees Can be restricted to specific Regions
Serverless application with Amazon S3 presigned URLs S3, Lambda, and API Gateway Time-limited URLs, HTTPS, IAM controls, customizable authentication Pay per request and minimal infrastructure cost Components can be Region-specific

* Pricing information provided is based on AWS service rates at the time of publication and is intended as an estimation only. Additional costs may be incurred depending on your specific implementation and usage patterns. For the most current and accurate pricing details, please consult the official AWS pricing pages for each service mentioned.

Let’s examine each of the solutions in detail. Part 1 talked about AWS Transfer Family, Transfer Family web apps, and Amazon S3 pre-signed URLs. Here in Part 2, we explain the remaining solutions to help you make the right choice for your use case.

CloudFront signed URLs with Amazon S3

Amazon CloudFront signed URLs combine Amazon S3 storage with the global edge network of CloudFront to deliver files securely with lower latency.

CloudFront edge locations cache content geographically closer to users, which usually reduces latency and gives better performance for users. CloudFront also reduces the number of origin requests to Amazon S3. CloudFront integration with AWS Shield and AWS WAF provides options for additional security layers, helping to protect against DDoS events and unintended requests. You can use custom domains with AWS-provided or your own SSL/TLS certificates managed through AWS Certificate Manager (ACM), helping to facilitate secure connections from users to edge locations.

When a user requests a file, the system generates a signed URL using either a CloudFront key pair or a custom trusted signer (such as Lambda Edge) that includes security parameters such as IP restrictions, time windows, and custom policies. The major difference is the content distribution network (CDN) making performance faster by caching data geographically close to the user downloading it.

The built-in logging and monitoring capabilities of CloudFront provide detailed insights into content access patterns, cache hit ratios, and security events. CloudFront integrates seamlessly with Amazon S3 to support origin access identity (OAI), helping to make sure that the S3 objects can be accessed only through CloudFront and not directly through S3 APIs.

Figure 1: CloudFront signed URLs with Amazon S3 architecture

Figure 1: CloudFront signed URLs with Amazon S3 architecture

Pros

If Amazon S3 pre-signed URLs sound good, but you need higher performance at a global scale, CloudFront signed URLs are the right choice. The AWS global edge network has points of presence (POPs) all over the world, which significantly reduces latency for users and minimizes data transfer costs through caching. This architecture provides substantial cost savings for frequently accessed content, because edge locations serve cached copies without retrieving objects from the S3 origin. The integration with AWS security services offers protection against various threats, including sophisticated distributed denial of service (DDoS) events and web application issues, making it particularly suitable for public-facing file sharing applications. Choose CloudFront instead of S3 if you tend to make the same file available to many people who download it many times, such as in software distribution or documentation distribution.

The solution’s security model provides extensive flexibility in access control implementation. You can define granular permissions through custom policies, implement geo-restriction rules, and enforce IP-based access controls. The ability to use custom TLS certificates and domains maintains brand consistency while helping to facilitate secure communications. The integration with AWS WAF enables advanced request filtering and rate limiting, while detailed access logging and real-time metrics provide visibility into content delivery and security events. The solution’s support for both signed URLs and signed cookies offers flexibility in implementing various access control scenarios. Signed cookies are used when you want to provide access to multiple restricted files. For example, if you need to provide access to many files in a private directory, you can use signed cookies to avoid having to create individual signed URLs for each file. When choosing between CloudFront signed URLs (ideal for individual file access) or signed cookies (better for providing access to multiple files, like a subscriber’s content library), consider your content distribution needs and whether your clients support cookies.

Cons

If you implement CloudFront, you must develop expertise in its configuration options, including robust key management processes and secure key rotation procedures. Self-managed certificates don’t automatically renew. You must track expiration dates and make sure you renew on time, or your users will get warnings and errors when they try to download. ACM can simplify TLS certificate management and automatically renew certificates before they expire. while trusted signer workflows enhance your security posture.

Note: To create signed URLs, you need a signer. A signer is either a trusted key group that you create in CloudFront, or an AWS account that contains a CloudFront key pair.

Misconfigured web caches have many surprising and frustrating effects for users. Understanding and configuring CloudFront cache behavior is key to helping to prevent unintended content exposure or availability issues. You need to add cache invalidation to your publication workflows so that old versions are no longer available from the cache. This might introduce additional costs and operational overhead, especially in scenarios with frequent content changes. If you frequently change the content that you share, if the content is unique to an individual (such as a personalized report), or if the same content isn’t downloaded many times by many people in many locations, you won’t realize much cost savings or reduced latency from CloudFront caching. The additional complexity added by cache configuration might not be justified unless the cache is used a lot.

If you use the CloudFront global content delivery network, your content will be stored in caches in hundreds of locations around the world. ACM will store your TLS certificates for CloudFront (whether ACM is issuing them or you manage them yourself) in the us-east-1 AWS Region. Because CloudFront is a global service, it automatically distributes the certificate from the us-east-1 Region to the Regions associated with your CloudFront distribution. Caching data and keys around the world might not be acceptable if you have data sovereignty requirements to keep your data in one country.

From a cost perspective, while CloudFront can provide savings through caching, the pricing model has other variables to consider. Data transfer costs vary by Region and can be significant for large-scale distributions. If you need custom domain names and custom TLS certificates, that might introduce additional costs. Implementation expertise is needed when dealing with dynamic content or when specific origin request handling is required. CloudFront only delivers via HTTPS and HTTP protocols, so you won’t be able to use it if you require support for other file transfer protocols. CloudFront distributions provide statistics on cache hit-and-miss rates—pay attention to these because low cache hit rates mean that you’re pulling data from the origin frequently, which limits the possible cost savings.

Amazon VPC endpoint service with custom application

Amazon VPC endpoint services, powered by AWS PrivateLink, enable private connectivity between VPCs without requiring internet access, VPN connections, or direct physical connections. This solution creates a highly secure, private network path for file sharing by exposing services through Network Load Balancers (NLB) and allowing other VPCs to access them through interface endpoints. The architecture isolates the file sharing service from the public internet, operating entirely within the AWS private network infrastructure.

The best use cases for this architecture involve sharing data or distributing software around your AWS infrastructure without exposing it to the public internet.

Figure 2: Amazon VPC endpoint service architecture

Figure 2: Amazon VPC endpoint service architecture

The solution, shown in Figure 2, typically involves deploying a custom file sharing application behind an NLB in the service VPC, which is then exposed as an endpoint service. Consumer VPCs create interface endpoints to connect to this service, establishing private connectivity through the AWS backbone network. Traffic remains within the AWS network, is encrypted in transit, and is subject to security controls at both the endpoint and VPC levels. The architecture supports many TCP-based protocols, making it versatile for various file transfer requirements.

This architecture provides secure pathways for data to travel by using multiple layers, including VPC security groups, network access control lists (ACLs), endpoint policies, and the custom application’s authentication mechanisms. The built-in security features of PrivateLink are designed so that only approved AWS principals can create interface endpoints to connect to the service, while detailed VPC flow logs provide network traffic visibility.

Pros

Amazon VPC endpoint services provide complete network isolation and private connectivity that’s inaccessible from the public internet. This reduces the exposure footprint and helps meet security requirements for sensitive data transfer operations. The solution maintains private connectivity across different AWS accounts and Regions while keeping traffic within the AWS network infrastructure.

This solution also provides the most flexible protocol support. Other solutions require you to use HTTPS, AWS API calls (which are HTTPS), or one of the protocols supported by Transfer Family (such as SFTP). If you have software that uses custom protocols, and you need security controls and network isolation, this architecture provides predictable performance through dedicated network paths and supports high throughput requirements without internet bandwidth constraints. The granular control over network security through VPC security groups, network ACLs, and endpoint policies enables organizations to implement defense-in-depth strategies effectively. Additionally, the solution’s integration with AWS Organizations facilitates centralized management and governance across multiple accounts.

Cons

Setting up and maintaining VPC endpoints requires significant expertise in AWS networking, including VPC design, PrivateLink configuration, and network security controls. The initial architecture design must carefully consider IP address management, service quotas, and Regional availability to provide scalability and reliability. Organizations must also develop and maintain the custom file sharing application in addition to the VPC endpoints.

This solution has many components that incur hourly and bandwidth-related charges. Each interface endpoint incurs hourly charges and data processing fees, which can accumulate significantly in multi-VPC or multi-Region deployments. NLBs add another cost component, and you must maintain sufficient capacity for peak loads. The solution also has operational costs because of the need for specialized expertise and ongoing maintenance. Additionally, while the private connectivity model provides superior security, it can make troubleshooting more challenging and might require additional tooling for effective monitoring and diagnostics. The Regional nature of VPC endpoints might necessitate additional architecture for multi-Region deployments, potentially increasing both costs and operational overhead. This solution is most suitable when private network security considerations are the highest priority, and cost considerations are secondary.

Amazon S3 Access Points

Amazon S3 Access Points simplify managing data access at scale for applications using shared data sets on S3. Access points are named network endpoints attached to S3 buckets that streamline managing access to shared datasets. Each access point has its own AWS Identity and Access Management (IAM) policy that controls access to the data, allowing you to create custom access permissions for different applications or user groups accessing the same bucket.

The architecture uses S3 buckets with access points providing dedicated access paths to the data. Each access point has its own hostname (URL) and access policy that works in conjunction with the bucket policy. You can create access points that only allow connections from your Amazon Virtual Private Cloud (Amazon VPC) for private network access to Amazon S3 or create access points with Internet connectivity. You can use this flexibility to implement sophisticated access control patterns while maintaining a single source of truth in S3.

Figure 3: S3 Access Points with VPC endpoints

Figure 3: S3 Access Points with VPC endpoints

Pros

Amazon S3 Access Points simplify permissions management and security to accommodate multiple access patterns and use cases. For example, if an S3 bucket contains data that needs to be accessed by multiple applications, each requiring different levels of access, you can create a dedicated access point for each application with precisely the permissions it needs, rather than managing a long monolithic bucket policy.

You can implement access control workflows, such as restricting access to specific VPCs, encryption, or limit access to specific objects or prefixes. The service requires no new infrastructure management, reducing operational overhead and allowing you to focus on business logic implementation.

Access points provide a way to enforce network controls through VPC-only access points, helping to make sure that data can only be accessed from within your private network. IAM permissions management becomes more granular and straightforward to audit when each application or user group has its own access point with a dedicated policy. You can associate different access points with different network origins.

Another possible use case is when you need to provide temporary access to specific data within a bucket without modifying the bucket policy. You can create a temporary access point with the necessary permissions and delete it when the access is no longer needed.

Cons

Access points add another layer to your Amazon S3 architecture that needs to be managed and monitored. Each access point has its own Amazon Resource Name (ARN) and hostname that applications need to use instead of the bucket name, which might require changes to your application code.

There are limits to the number of access points you can create for each bucket, which might be a constraint for large-scale applications. Access points can only control access to the bucket they’re associated with, not across multiple buckets, so if your application needs to access data across buckets, you’ll need multiple access points.

When implementing this solution, you need to design your access point policies to make sure that they work correctly with your bucket policy. Think of your S3 bucket policy as the primary security framework, while access point policies act as specialized gatekeepers. These two layers of security must work in harmony. The bucket policy takes precedence. For example, if your bucket policy explicitly denies access from specific IP ranges, an access point policy can’t override this restriction. This hierarchical relationship requires strategic planning. Start by defining your broad security boundaries in the bucket policy—perhaps allowing access only from specific VPCs or requiring encryption. Then create your access point policies within these boundaries.

While Amazon S3 Access Points offer powerful flexibility, understanding their boundaries is crucial. Cross-account scenarios, common in large enterprises or partner collaborations, require careful configuration. Imagine you’re working with an external auditing firm that needs temporary access to your financial data stored in S3. Setting up a cross-account access point requires creating the access point in your account, configuring a trust policy to allow the external account, verifying that the bucket policy permits access from the access point, and providing the auditors with the access point ARN and necessary IAM permissions in their account. This process maintains tight control over your data while enabling secure cross-account access.

Some Amazon S3 operations are only controlled at the bucket level and can’t be controlled by access points. Core bucket operations such as configuring versioning, logging, managing lifecycle policies, and setting up cross-Region replication require direct bucket access. For these operations, you need to interact directly with the bucket through the appropriate permissions. This limitation helps make sure that fundamental bucket configurations remain centralized and controlled by bucket owners.

Creating a dedicated IAM role for bucket administration tasks—separate from the roles that interact with data through access points—enhances security and aligns with the principle of least privilege.

Conclusion

In this second part of a two-part post, you’ve learned about multiple solutions for secure file sharing using AWS services and the pros and cons of each. You can find additional options and a full decision matrix in Part 1. The optimal solution depends on your specific organizational requirements, technical capabilities, and budget constraints. You don’t have to choose just one option, you can implement multiple solutions to address different use cases, creating a file sharing strategy that balances security, cost, and operational efficiency.

Additional resources:

If you have feedback about this post, submit comments in the Comments section below.

Swapnil Singh

Swapnil Singh

Swapnil is a Senior Solutions Architect for AWS World Wide Public Sector. As a Product Acceleration Solutions Architect at AWS, she currently works with GovTech customers to ideate, design, validate, and launch products using cloud-native technologies and modern development practices.

Sumit Bhati

Sumit Bhati

Sumit is a Senior Customer Solutions Manager at AWS, specializing in expediting the cloud journey for enterprise customers. Sumit is dedicated to assisting customers through every phase of their cloud adoption, from accelerating migrations to modernizing workloads and facilitating the integration of innovative practices.

How Zapier runs isolated tasks on AWS Lambda and upgrades functions at scale

Post Syndicated from Anton Aleksandrov original https://aws.amazon.com/blogs/architecture/how-zapier-runs-isolated-tasks-on-aws-lambda-and-upgrades-functions-at-scale/

Zapier is a leading no-code automation provider whose customers use their solution to automate workflows and move data across over 8,000 applications such as Slack, Salesforce, Asana, and Dropbox. Zapier runs these automations through integrations called Zaps, which are implemented using a serverless architecture running on Amazon Web Services (AWS). Each Zap is powered by an AWS Lambda function.

In this post, you’ll learn how Zapier has built their serverless architecture focusing on three key aspects: using Lambda functions to build isolated Zaps, operating over a hundred thousand Lambda functions through Zapier’s control plane infrastructure, and enhancing security posture while reducing maintenance efforts by introducing automated function upgrades and cleanup workflows into their platform architecture.

Architecting a secure and isolated runtime environment

Zaps created by Zapier’s users implement tenant-specific business logic, hence they require cross-tenant compute isolation. Code implementing one Zap can’t share an execution environment with code implementing another Zap. Moreover, the same Zap type used by two different tenants can’t share execution environments as well.

To achieve the required level of isolation, Zapier’s engineering team adopted AWS Lambda, a serverless compute service that runs code in response to events and automatically manages cloud compute resources. Minimal operational overhead, built-in high availability, automated scaling, high level of isolation, and pay-per-use model made Lambda a great fit for this use case. Currently, Zapier’s architecture is running over a hundred thousand Lambda functions to support their customer’s integration workflows.

Because they’re powered by the open source Firecracker microVMs, each function is completely isolated from the others. Moreover, each execution environment belonging to the same function (sometimes referred to as function instances) is also isolated from other execution environments. The following architecture topology diagram uses red lines to represent isolation boundaries. Each execution environment of every function is isolated from its peers and is getting its own virtual resources such as disk, memory, and CPU. For more details, read Security in AWS Lambda.

Isolation boundary

Zapier’s control plane is architected using Amazon Elastic Kubernetes Service (Amazon EKS). A designated database is used to maintain the up-to-date function inventory. Whenever a user creates a new Zap, the control plane creates a corresponding Lambda function and stores a reference in the inventory database. When a Zap is triggered, the control plane retrieves information about a relevant Lambda function and invokes it to facilitate the integration workflow, as illustrated in the following diagram.

Control and Data planes

Understanding the runtime deprecation process

When building architectures using the traditional non-serverless compute, cloud engineers are the ones responsible for keeping operating systems and software on their compute instances up to date and applying security and maintenance patches. With serverless architectures and Lambda functions, security patches and minor runtime upgrades are handled by AWS automatically, which means customers can focus on delivering business value instead of the undifferentiated heavy lifting of infrastructure management.

When a major Lambda managed runtime version reaches end-of-life, AWS initiates a deprecation process through the AWS Health Dashboard and direct email communications to affected customers. Because deprecated runtimes eventually lose access to security updates and support, organizations must upgrade to supported runtime versions to avoid potential security risks. Read more about the shared responsibility model, runtime use after deprecation, and receiving runtime deprecation notifications.

As Zapier’s user base and architectural complexity – and consequently the number of Zaps – were growing, keeping all functions on the most up-to-date major runtime versions became a laborious task. Top contributing factors were:

  • High number of functions. At its peak, the Zapier platform was running Zaps using hundreds of thousands of unique Lambda functions. Approximately 35% of these functions were using a runtime that was scheduled for deprecation in the next 12 months.
  • Zapier architected their data plane environment to be ephemeral – the control plane creates and deletes Lambda functions on demand and manages their lifecycle dynamically. Identifying a specific owner for each affected function wasn’t always straightforward.
  • Security is paramount at Zapier and upgrading affected functions runtime prior to the deprecation date was an absolute must. At no point could Zapier functions use runtimes after their deprecation date. This was a task which required extra resources.
  • The upgrade process shouldn’t have had any impact on the end customer experience. At no point should customer experience be affected.

With a short runway, high-volume workload, and the strict requirements of not impacting customer experience, Zapier’s Platform Engineering team took on this challenge of maintaining high security posture in their platform architecture.

Applying the solution

The solution had three work streams:

  1. Reducing the risk by analyzing the architecture and identifying and cleaning up unused functions.
  2. Prioritizing upgrades by identifying the most critical and impactful functions.
  3. Empowering engineering teams with automated tools and knowledge to streamline the upgrade process in future.

Identify and clean up unused functions

The first step in streamlining the upgrade process was identifying and removing unused functions. This reduced the total number of functions in Zapier’s architecture that required upgrades, eliminating unnecessary work for the team.

Zapier started by augmenting the function inventory with runtime information using AWS Trusted Advisor and Amazon Cloud Intelligence Trusted Advisor dashboards, as illustrated in the following diagram.

Gathering data

This meant the team could build a detailed inventory of functions that were running on soon-to-be deprecated runtimes. Using Amazon CloudWatch, Zapier’s platform team started to monitor metrics such as number of invocations. They identified which functions were active, which functions weren’t used for an extended period, and which functions didn’t have an active owner and could be removed.

One of the primary mechanisms for ownership validation within the organization was using resource tags. Functions that were active, but didn’t have clear ownership, were flagged for additional review before removal. Functions that were confirmed as unused or didn’t have an active owner were marked for deletion. Removing such functions allowed Zapier to significantly simplify their architecture and reduce the number of functions that had to be upgraded.

Prioritizing upgrades

With a smaller volume of functions to upgrade, Zapier’s platform team prioritized function upgrades based on usage patterns, criticality, and potential customer impact. Three primary prioritization categories were:

  • Customer-facing functions – Any functions directly involved in executing user Zaps were marked as high priority. These had to be upgraded first to avoid service disruptions.
  • Backend infrastructure functions – Internal functions that supported system operations were evaluated based on their importance to platform stability.
  • High-volume functions – Functions with the highest execution frequency were prioritized because upgrading them would have the greatest impact on reducing operational risk.

Using these factors, Zapier’s platform team has created an upgrade roadmap, ensuring that critical assets were addressed first while minimizing potential disruptions.

Refer to Retrieve data about Lambda functions that use a deprecated runtime in the Lambda Developer Guide to learn how to identify most commonly and most frequently used Lambda functions in your serverless architecture.

Empowering engineering teams with automated tools and knowledge

To ensure a smooth and efficient upgrade process across their serverless architecture, Zapier’s team empowered engineering teams with clear guidelines and automated solutions. The platform incorporated two main approaches: Terraform-managed functions and a custom-built Lambda runtime canary tool. Implementing and adopting these tools and practices resulted in reducing the number of functions using soon-to-be deprecated runtimes by 95%.

For functions managed through infrastructure-as-code (IaC), Zapier’s team developed standardized Terraform modules that specified supported runtime versions. Development teams implemented these modules in their configurations:

resource "aws_lambda_function" "example" {
    runtime = "python3.13"  # Updated to supported runtime
}

After applying the new module version, teams validated changes by testing the new runtime in staging environments and monitoring Terraform plan outputs to ensure proper runtime version updates.

To efficiently manage most Lambda functions in their architecture, Zapier developed the Lambda runtime canary tool suite. Using this solution, they automated the runtime upgrade process for thousands of active Lambda functions with minimal manual intervention. The tool suite implements several key features:

  • Architected for gradual traffic shifting with the Lambda built-in routing mechanism through function version and aliasing. The tool can gradually shift traffic distribution from an old to a new function version. During this gradual traffic shift, the system monitors CloudWatch metrics for errors and automatically rolls back if error rates exceed acceptable thresholds.
  • Optimistic upgrade strategy implements direct upgrades for infrequently used functions using a flag value stored in a cache to detect potential issues during the first post-upgrade invocation. If this invocation fails, the control plane retries it using the previous function version. If the retried invocation succeeds, Zapier’s control plane initiates a rollback, assuming the error is most likely due to the runtime upgrade. After rollback, it will log the error and alert relevant stakeholders.
  • Integration with existing infrastructure uses an administrative interface and task queue for automated traffic shifting. A database ledger maintains tracking of function states and rollback information.
  • Operational controls provide manual rollback capabilities and implement centralized control switches for process management. After a function was upgraded to a new runtime and no rollback activity was detected within a set time period, an automated pruning task cleans up older versions.

Zapier’s Lambda canary tool, through its integration of gradual traffic shifting, real-time CloudWatch monitoring, and automated rollback mechanisms, established a sustainable framework for managing runtime upgrades across their serverless architecture. This approach not only automated the upgrade process and minimized operational risks but also created a scalable solution that provides continuous runtime upgrades, preventing the use of deprecated runtimes at any point. By allowing continuous function runtime updates with minimal disruption to end user experience, Zapier maintains security and stability while requiring minimal manual intervention. This framework efficiently manages their growing serverless infrastructure, providing both security and operational efficiency for future runtime updates.

Conclusion

In this post, you’ve learned how Zapier architected their software-as-a-service (SaaS) platform to provide secure, isolated execution environments using AWS Lambda and Amazon EKS, enabling their customers to create hundreds of thousands of Zaps. You’ve learned how Zapier’s team implemented the function runtime upgrade process at scale and reduced the number of functions running on soon-to-be deprecated runtimes by 95%. You’ve seen best practices that were established and techniques that helped Zapier to keep high security posture without impacting customer experience.

Use the following links to learn more about Lambda runtimes and upgrading your functions to the latest runtime versions:


About the authors

Behind the Streams: Live at Netflix. Part 1

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/behind-the-streams-live-at-netflix-part-1-d23f917c2f40

Behind the Streams: Three Years Of Live at Netflix. Part 1.

By Sergey Fedorov, Chris Pham, Flavio Ribeiro, Chris Newton, and Wei Wei

Many great ideas at Netflix begin with a question, and three years ago, we asked one of our boldest yet: if we were to entertain the world through Live — a format almost as old as television itself — how would we do it?

What began with an engineering plan to pave the path towards our first Live comedy special, Chris Rock: Selective Outrage, has since led to hundreds of Live events ranging from the biggest comedy shows and NFL Christmas Games to record-breaking boxing fights and becoming the home of WWE.

In our series Behind the Streams — where we take you through the technical journey of our biggest bets — we will do a multiple part deep-dive into the architecture of Live and what we learned while building it. Part one begins with the foundation we set for Live, and the critical decisions we made that influenced our approach.

| But First: What Makes Live Streaming Different?

While Live as a television format is not new, the streaming experience we intended to build required capabilities we did not have at the time. Despite 15 years of on-demand streaming under our belt, Live introduced new considerations influencing architecture and technology choices:

References: 1. Content Pre-Positioning on Open Connect, 2.Load-Balancing Netflix Traffic at Global Scale

This means that we had a lot to build in order to make Live work well on Netflix. That starts with making the right choices regarding the fundamentals of our Live Architecture.

| Key Pillars of Netflix Live Architecture

Our Live Technology needed to extend the same promise to members that we’ve made with on-demand streaming: great quality on as many devices as possible without interruptions. Live is one of many entertainment formats on Netflix, so we also needed to seamlessly blend Live events into the user experience, all while scaling to over 300 million global subscribers.

When we started, we had nine months until the first launch. While we needed to execute quickly, we also wanted to architect for future growth in both magnitude and multitude of events. As a key principle, we leveraged our unique position of building support for a single product — Netflix — and having control over the full Live lifecycle, from Production to Screen.

Dedicated Broadcast Facilities to Ingest Live Content from Production

Live events can happen anywhere in the world, but not every location has Live facilities or great connectivity. To ensure secure and reliable live signal transport, we leverage distributed and highly connected broadcast operations centers, with specialized equipment for signal ingest and inspection, closed-captioning, graphics and advertisement management. We prioritized repeatability, conditioning engineering to launch live events consistently, reliably, and cost-effectively, leaning into automation wherever possible. As a result, we have been able to reduce the event-specific setup to the transmission between production and the Broadcast Operations Center, reusing the rest across events.

Cloud-based Redundant Transcoding and Packaging Pipelines

The feed received at the Broadcast Center contains a fully produced program, but still needs to be encoded and packaged for streaming on devices. We chose a Cloud-based approach to allow for dynamic scaling, flexibility in configuration, and ease of integration with our Digital Rights Management (DRM), content management, and content delivery services already deployed in the cloud. We leverage AWS MediaConnect and AWS MediaLive to acquire feeds in the cloud and transcode them into various video quality levels with bitrates tailored per show. We built a custom packager to better integrate with our delivery and playback systems. We also built a custom Live Origin to ensure strict read and write SLAs for Live segments.

Scaling Live Content Delivery to millions of viewers with Open Connect CDN

In order for the produced media assets to be streamed, they need to be transferred from a few AWS locations, where Live Origin is deployed, to hundreds of millions of devices worldwide. We leverage Netflix’s CDN, Open Connect, to scale Live asset delivery. Open Connect servers are placed close to the viewers at over 6K locations and connected to AWS locations via a dedicated Open Connect Backbone network.

18K+ servers in 6K+ locations, in Internet Exchanges, or embedded into ISP networks
Open Connect Backbone connects servers in Internet Exchange locations to 5 AWS regions

By enabling Live delivery on Open Connect, we build on top of $1B+ in Netflix investments over the last 12 years focused on scaling the network and optimizing the performance of delivery servers. By sharing capacity across on-demand and Live viewership we improve utilization, and by caching past Live content on the same servers used for on-demand streaming, we can easily enable catch-up viewing.

Optimizing Live Playback for Device Compatibility, Scale, Quality, and Stability

To make Live accessible to the majority of our customers without upgrading their streaming devices, we settled on using HTTPS-based Live Streaming. While UDP-based protocols can provide additional features like ultra-low latency, HTTPS has ubiquitous support among devices and compatibility with delivery and encoding systems. Furthermore, we use AVC and HEVC video codecs, transcode with multiple quality levels up from SD to 4K, and use a 2-second segment duration to balance compression efficiency, infrastructure load, and latency. While prioritizing streaming quality and playback stability, we have also achieved industry standard latency from camera to device, and continue to improve it.

To configure playback, the device player receives a playback manifest at the play start. The manifest contains items like the encoding bitrates and CDN servers players should use. We deliver the manifest from the cloud instead of the CDN, as it allows us to personalize the configuration for each device. To reference segments of the stream, the manifest includes a segment template that is used by devices to map a wall-clock time to URLs on the CDN. Using a segment template vs periodic polling for manifest updates minimizes network dependencies, CDN server load, and overhead on resource-constrained devices, like smart TVs, thus improving both scalability and stability of our system. While streaming, the player monitors network performance and dynamically chooses the bitrate and CDN server, maximizing streaming quality while minimizing rebuffering.

Run Discovery and Playback Control Services in the Cloud

So far, we have covered the streaming path from Camera to Device. To make the stream fully work, we also need to orchestrate across all systems, and ensure viewers can find and start the Live event. This functionality is performed by dozens of Cloud services, with functions like playback configuration, personalization, or metrics collection. These services tend to receive disproportionately higher loads around Live event start time, and Cloud deployment provides flexibility in dynamically scaling compute resources. Moreover, as Live demand tends to be localized, we are able to balance load across multiple AWS regions, better utilizing our global footprint. Deployment in the cloud also allows us to build a user experience where we embed Live content into a broader selection of entertainment options in the UI, like on-demand titles or Games.

Centralize Real-time Metrics in the Cloud with Specialized Tools and Facilities

With control over ingest, encoding pipelines, the Open Connect CDN, and device players, we have nearly end-to-end observability into the Live workflow. During Live, we collect system and user metrics in real-time (e.g., where members see the title on Netflix and their quality of experience), alerting us to poor user experiences or degraded system performance. Our real-time monitoring is built using a mix of internally developed tools, such as Atlas, Mantis, and Lumen, and open-source technologies, such as Kafka and Druid, processing up to 38 million events per second during some of our largest live events while providing critical metrics and operational insights in a matter of seconds. Furthermore, we set up dedicated “Control Center” facilities, which bring key metrics together to the operational team that monitors the event in real-time.

| Our key learnings so far

Building new functionality always brings fresh challenges and opportunities to learn, especially with a system as complex as Live. Even after three years, we’re still learning every day how to deliver Live events more effectively. Here are a few key highlights:

Extensive testing: Prior to Live we heavily relied on the predictable flow of on-demand traffic for pre-release canaries or A/B tests to validate deployments. But Live traffic was not always available, especially not at the scale representative of a big launch. As a result, we spent considerable effort to:

  1. Generate internal “test streams,” which engineers use to run integration, regression, or smoke tests as part of the development lifecycle.
  2. Build synthetic load testing capabilities to stress test cloud and CDN systems. We use 2 approaches, allowing us to generate up to 100K starts-per-second:
     — Capture, modify, and replay past Live production traffic, representing a diversity of user devices and request patterns.
     — Virtualize Netflix devices and generate traffic against CDN or Cloud endpoints to test the impact of the latest changes across all systems.
  3. Run automated failure injection, forcing missing or corrupted segments from the encoding pipeline, loss of a cloud region, network drop, or server timeouts.

Regular practice: Despite rigorous pre-release testing, nothing beats a production environment, especially when operating at scale. We learned that having a regular schedule with diverse Live content is essential to making improvements while balancing the risks of member impact. We run A/B tests, perform chaos testing, operational exercises, and train operational teams for upcoming launches.

Viewership predictions: We use prediction-based techniques to pre-provision Cloud and CDN capacity, and share forecasts with our ISP and Cloud partners ahead of time so they can plan network and compute resources. Then we complement them with reactive scaling of cloud systems powering sign-up, log-in, title discovery, and playback services to account for viewership exceeding our predictions. We have found success with forward-looking real-time viewership predictions during a live event, allowing us to take steps to mitigate risks earlier, before more members are impacted.

Graceful degradation: Despite our best efforts, we can (and did!) find ourselves in a situation where viewership exceeded our predictions and provisioned capacity. In this case, we developed a number of levers to continue streaming, even if it means gradually removing some nice-to-have features. For example, we use service-level prioritized load shedding to prioritize live traffic over non-critical traffic (like pre-fetch). Beyond that, we can lighten the experience, like dialing down personalization, disabling bookmarks, or lowering the maximum streaming quality. Our load tests include scenarios where we under-scale systems to validate desired behavior.

Retry storms: When systems reach capacity, our key focus is to avoid cascading issues or further overloading systems with retries. Beyond system retries, users may retry manually — we’ve seen a 10x increase in traffic load due to stream restarts after viewing interruptions of as little as 30 seconds. We spent considerable time understanding device retry behavior in the presence of issues like network timeouts or missing segments. As a result, we implemented strategies like server-guided backoff for device retries, absorbing spikes via prioritized traffic shedding at Cloud Edge Gateway, and re-balancing traffic between cloud regions.

Contingency planning: Everyone has a plan until they get punched in the mouth” is very relevant for Live. When something breaks, there is practically no time for troubleshooting. For large events, we set up in-person launch rooms with engineering owners of critical systems. For quick detection and response, we developed a small set of metrics as early indicators of issues, and have extensive runbooks for common operational issues. We don’t learn on launch day; instead, launch teams practice failure response via Game Day exercises ahead of time. Finally, our runbooks extend beyond engineering, covering escalation to executive leadership and coordination across functions like Customer Service, Production, Communications, or Social.

Our commitment to enhancing the member experience doesn’t end at the “Thanks for Watching!” screen. Shortly after each live stream, we dive into metrics to identify areas for improvement. Our Data & Insights team conducts comprehensive analyses, A/B tests, and consumer research to ensure the next event is even more delightful for our members. We leverage insights on member behavior, preferences, and expectations to refine the Netflix product experience and optimize our Live technology — like reducing latency by ~10 seconds through A/B tests, without affecting quality or stability.

| What’s next on our Live journey?

Despite three years of effort, we are far from done! In fact, we are just getting started, actively building on the learnings shared above to deliver more joy to our members with Live events. To support the growing number of Live titles and new formats, like FIFA WWC in 2027, we keep building our broadcast and delivery infrastructure and are actively working to further improve the Live experience.

In this post, we’ve provided a broad overview and have barely scratched the surface. In the upcoming posts, we will dive deeper into key pillars of our Live systems, covering our encoding, delivery, playback, and user experience investments in more detail.

Getting this far would not have been possible without the hard work of dozens of teams across Netflix, who collaborate closely to design, build, and operate Live systems: Operations and Reliability, Encoding Technologies, Content Delivery, Device Playback, Streaming Algorithms, UI Engineering, Search and Discovery, Messaging, Content Promotion and Distribution, Data Platform, Cloud Infrastructure, Tooling and Productivity, Program Management, Data Science & Engineering, Product Management, Globalization, Consumer Insights, Ads, Security, Payments, Live Production, Experience and Design, Product Marketing and Customer Service, amongst many others.


Behind the Streams: Live at Netflix. Part 1 was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Revenue NSW modernises analytics with AWS, enabling unified and scalable data management, processing, and access

Post Syndicated from Saeed Barghi original https://aws.amazon.com/blogs/big-data/revenue-nsw-modernises-analytics-with-aws-enabling-unified-and-scalable-data-management-processing-and-access/

Revenue NSW, in Australia, is New South Wales (NSW) state’s principal revenue management agency and aspires to be the world’s most innovative and customer-centric revenue agency. Revenue NSW exists to administer grants, resolve fines, and collect revenue to fund essential state services for the over 8 million people of NSW in a fair, efficient, and timely manner.

Analytics at Revenue NSW plays a key role in enabling the organization’s goals and purpose by delivering reliable, secure, and authoritative insights. These insights are key to:

  • Understanding customer attributes to enable empathetic and informed actions
  • Supporting policy development
  • Assisting in the sequencing of millions of decisions
  • Maintaining compliance and education
  • Fostering transparency by providing open data and insights directly to the public

The challenge

Revenue NSW Analytics consumes data from a multitude of operational databases and real-time interfaces and through internally generated reports and files received from external data partners such as other government departments and agencies. The varying technologies, formats, and complexities of these data sources created friction and inefficiencies in data transformation, consolidation, and analysis in an environment that is often time-critical. In addition, these analytics systems were previously hosted on dedicated hardware on-premises that was nearing end-of-life and wasn’t easy to scale efficiently. To address these challenges, Revenue NSW Analytics used their partnership with AWS to build a strategic, unified, scalable, frictionless and modern data environment to help them standardize data transformation and consolidation pipelines from the hundreds of data sources. Additionally, the modern data environment must provide a single source of truth and enable secure and seamless access to the data through a unified SQL interface regardless of the data’s original format or technology.

After understanding other offerings, Revenue NSW Analytics decided on a proof of concept (PoC) using Amazon Web Services (AWS) cloud-based services, including Amazon Redshift. The key goals of the PoC were to assess the completeness of the solution, its performance, and the potential change in total cost of ownership compared to their on-premises setup.

Amazon Redshift, with its integration options, columnar storage, and massively parallel processing (MPP) architecture, offered the desired end-state solution. Tests demonstrated a typical speed increase between 5- and 50-fold in query execution, with many results 100 times faster than the existing on-premises solution. Amazon Redshift also performed significantly better compared with other cloud-based solutions, offering up to 6 times better performance. The success of the initial PoC led Revenue NSW Analytics to further collaborate with AWS, working towards developing a prototype that incorporated Amazon Redshift alongside various data ingestion patterns.

The solution

To simplify data ingestion from the operational databases—which run on different database engines including Oracle, PostgreSQL, and Microsoft SQL—Revenue NSW Analytics used AWS Database Migration Service (AWS DMS) to perform a bulk initial load, followed by capturing ongoing changes from these databases into Amazon Redshift in near real time.

For data from Salesforce’s real-time API, Revenue NSW Analytics used Amazon AppFlow to automate the continuous pulling and ingesting of data into Amazon Redshift.

The hundreds of structured and semi-structured data files were handled using AWS Glue. These files are regularly uploaded to Amazon Simple Storage Service (Amazon S3), triggering the relevant AWS Glue extract, transform, and load (ETL) jobs in an event-based architecture to transfer the data into Amazon Redshift.

To facilitate repeatability and enable iteration, Revenue NSW Analytics used infrastructure-as-code (IaC) and continuous integration and delivery (CI/CD) pipelines to deploy the different components of the solution.

The following is a high-level architecture demonstrating how these different components and services fit together.

Along with standardization and unified access, the success criteria of the new data environment included the ease of transition, consolidation of processes to the new standardised pipelines, scalability, language uniformity, and availability. The combination of supporting standard SQL, AWS DMS, and Amazon AppFlow low-code capabilities, and supporting Python in AWS Glue, a popular programming language, played a crucial role in facilitating the successful transition and adoption of the cloud-based data environment.

Other success factors of this environment include the ability to work within current budgets, and the extendibility and modularity of the solution. As shown in the preceding high-level architecture, the solution runs on multiple building blocks that are decoupled, modular, and either serverless—like AWS Glue—or managed services that support seamless scalable configurations that don’t require rebuilds. This allowed Revenue NSW Analytics to start small with each use case, expand and grow as required, and pay only for what they need.

Moreover, with the new cloud-based data environment, Revenue NSW Analytics can access to up-to-date data in near real time, which is essential to fulfilling critical use cases such as information requests and assisting with compliance case identification. The automated data ingestion pipelines removed much of the boilerplate and heavy lifting, allowing Revenue NSW teams to work more efficiently and focus on the differentiators of their business, and in some cases, shorten workflow times from months to weeks or days.

Another significant factor contributing to the project’s success is the people at the heart of Revenue NSW Analytics. The teams allocated to own and deliver this platform are cross-functional, with adjoining responsibilities and skills, and were prepared through multiple in-person and online training sessions. The teams were empowered to trial individual services to deliver new use cases and iterate on the solution to learn from successes and innovate progressively. This approach, together with support Revenue NSW received from AWS specialist solution architects, helped to minimize the risk of knowledge gaps that often arise when separate teams are responsible for building and operating a system.

The hard work of the Analytics team, the investment of Revenue NSW Analytics leadership in its people, and the continuous support from AWS can truly be seen throughout the delivery of the data environment, resulting in the achievement of the intended outcomes.

Conclusion and call to action

Since going live with their cloud-based data environment on AWS, Revenue NSW has onboarded dozens of analysts who can get more done in less time. This is a result of establishing a single source of truth from different data sources in Amazon Redshift, so that analysts and data consumers don’t need to shop around to find the data that they need to complete their tasks. This new data environment also provides Revenue NSW with the ability to create improved conditions for:

  • Increasing agility by exposing reusable, trusted data services for people and AI
  • Empowering operational systems with services best provided by analytical approaches
  • Decommissioning heritage, costly infrastructure and data practices.

Successful delivery of the cloud-based data environment on AWS has led to further collaboration between AWS and Revenue NSW. This includes exploring the adoption of AI and machine learning (AI/ML) and generative AI to further improve the delivery of services for the people of NSW.

To learn more about customer success stories like this or how to get started with building a data environment on AWS, contact your AWS account team. You can read about similar customers by browsing Customer Success Stories on our website.


About the authors

Saeed BarghiSaeed Barghi is a Sr. Specialist Solutions Architect at Amazon Web Services (AWS) specializing in architecting enterprise data platforms and AI solutions. Based in Melbourne, Australia, Saeed works with public sector customers in Australia and New Zealand and helps his customers build fit-for-purpose and future-proof data platforms and AI solutions.

Miroslaw (Mick) Mioduszewski is the Director of Analytics at Revenue NSW Department of Customer service in NSW. He held multiple C-level roles in private and public companies as well as government, e.g. COO and CIO, as well as serving as company director. Mick holds computer science and business degrees, is a fellow of the Australian Institute of Company Directors and an industry fellow at the University of technology, Sydney.

Moha Alsouli is a Public Sector Solutions Architect at Amazon Web Services (AWS) in Sydney. He is dedicated to supporting state and local government customers deliver citizen services, through solution design, reviews, optimisation, and architecture guidance. Moha is also specialising in generative artificial intelligence (AI) on AWS.

How Stifel built a modern data platform using AWS Glue and an event-driven domain architecture

Post Syndicated from Amit Maindola original https://aws.amazon.com/blogs/big-data/how-stifel-built-a-modern-data-platform-using-aws-glue-and-an-event-driven-domain-architecture/

Stifel Financial Corp. is an American multinational independent investment bank and financial services company, founded in 1890 and headquartered in downtown St. Louis, Missouri. Stifel offers securities-related financial services in the United States and Europe through several wholly owned subsidiaries. Stifel provides both equity and fixed income research and is the largest provider of US equity research.

In this post, we show you how Stifel implemented a modern data platform using AWS services and open data standards, building an event-driven architecture for domain data products while centralizing the metadata to facilitate discovery and sharing of data products.

Stifel’s modern data platform use case

Stifel envisioned a data platform that delivers accurate, timely, and properly governed data, providing consistency throughout the organization whenever users access the information. This approach showed limitations as the data complexity increased, data volumes grew, and demand for quick, business-driven insights rose. These challenges are encountered by financial institutions worldwide, leading to a reassessment of traditional data management practices. Under the federated governance model, Stifel developed a modern data strategy based on the following objectives:

  • Managing ingestion and metadata
  • Creating source-aligned data products complying with Stifel business streams
  • Integrating source-aligned data products from other domains (Stifel business units)
  • Producing consumer-aligned data products for specific business purposes
  • Publishing data products to a centralized data catalog

Some of the Stifel challenges highlighted in the preceding list required building a data platform that can:

  • Boost agility by democratizing data, thus reducing time to market and enhancing the customer experience
  • Improve data quality and trust in the data
  • Standardize tools and eliminate the shadow information technology (IT) culture to increase scalability, reduce risk, and minimize operational inefficiencies

Following the federated governance model, Stifel has organized its domain structure to provide autonomy to various functional teams while preserving the core values of data mesh. The following diagram depicts a high-level architecture of the data mesh implementation at Stifel.

Each data domain has the flexibility to create data products that can be published to the centralized catalog, while maintaining the autonomy for teams to develop data products that are exclusively accessible to teams within the domain. These products aren’t available to others until they are deemed ready for broader enterprise use. Domains have the freedom to decide which data they want to share. They can either:

  • Make their data products visible to everyone through the central catalog
  • Keep their data products visible only within their own domain

By implementing an event-driven domain architecture, organizations can achieve significant business advantages while positioning themselves for future growth and innovation. Stifel data products refreshes were dependent on data assets with variable cadence. Event-driven architecture enables real-time or near real-time updates by allowing data products to automatically respond to changes in underlying data assets as they occur, rather than relying on fixed batch schedules that might miss critical updates or waste resources on unnecessary refreshes. The key is to carefully plan the implementation and make sure of alignment with business objectives while considering both technical and organizational factors. This architecture style particularly suits organizations that:

  • Need real-time processing capabilities
  • Have complex domain interactions
  • Require high scalability
  • Want to improve business agility
  • Need better system integration
  • Are pursuing digital transformation

The following are some of the key AWS Services that helped Stifel to build their modern data platform.

  • AWS Glue is a serverless data integration service that’s used for data processing to build data assets and data products in the domains. Data is also cataloged in AWS Glue Catalog, making it straightforward to discover and query with supported engines.
  • Amazon EventBridge provides a scalable and flexible serverless event bus that facilitates seamless communication between different domains and services. By using EventBridge, Stifel was able to implement a publish-subscribe model where domain events can be emitted, filtered, and routed to appropriate consumers based on configurable rules. EventBridge supports custom event buses for domain-specific events, enabling clear separation of concerns and improved manageability.
  • AWS Lake Formation helped in providing centralized security, governance, and catalog capabilities while preserving domain autonomy in data product creation and management. With Lake Formation, data domains were able to maintain their independent data products within a federated structure while enforcing consistent access controls, data quality standards, and metadata management across the organization.
  • Apache Hudi on Amazon Simple Storage Service (Amazon S3) offers an optimized way to store data assets and products and promotes interoperability across other services.

Stifel’s solution architecture

The following diagram illustrates the data mesh architecture that Stifel uses to build a domain-driven architecture. In this system, various domains create data products and share them with other domains through a central governance account that uses Lake Formation.

Let’s look at some of the key design components that are being used to enable and implement data mesh and event driven design

Data ingestion framework

The data ingestion framework consists of several processor modules that are built using several AWS services and metadata driven architecture. The following diagram shows the architecture of the raw data ingestion framework.

The framework gets raw data files from both internal Stifel systems and third-party data sources. These files are processed and stored in a raw data ingestion account on Amazon S3 in open table format Apache Hudi. This stored data is then shared with different parts of the organization, called data domains. Each domain can use this shared data to create their own data products.

As a file (in CSV, XML, JSON and custom formats) lands into the landing bucket, an Amazon S3 event notification is created and placed in an Amazon Simple Queue Service (Amazon SQS)queue. The Amazon SQS queue triggers an AWS Lambda function and saves the metadata (such as the name of the file, date and time the file was received, and the file size) to a file audit data store (Amazon Aurora PostgreSQL-Compatible Edition).

An EventBridge time scheduler invokes an AWS Step Functions workflow at pre-determined intervals. The Step Functions workflow orchestrates the batch ingestion from raw to staging layer.

  1. The Step Functions workflow orchestrates a set of Lambda functions to get the list of unprocessed raw files from the audit data store and create batches of raw files to process them in parallel. The Step Functions workflow then triggers parallel AWS Glue jobs that process each batch of raw files.
  2. Each raw file is validated for any data quality checks and the data is saved to staging tables in Hudi format. Any errors encountered are logged into an audit table and a notification is generated for support team. For all successfully processed raw files, the file status is updated to PROCESSED and logged into an audit table.
  3. After the Hudi table is updated, a data refresh event is sent to EventBridge and then passed to the Central Mesh Account. The Central Mesh Account forwards these events to the data domains to notify them that the raw tables are refreshed, allowing the data domains to use this data for creating their own data products.

Event driven data product refresh

The Stifel data lake is based on a data mesh architecture where several data producers share data across data domains. A mechanism is needed to alert consumers who depend on other data producers’ data products when those source data products are refreshed, so that the consumers can update their own data products accordingly. The following diagram describes the technical architecture of event-based data processing. The central governance account acts as the central event bus, which receives all data refresh events from all data producers. The central event bus forwards the events to consumer accounts. The consumer accounts filter the events consumers are interested in from data producers for their data processing needs.

Orchestration design

Stifel designed and implemented an event-based data pipeline orchestration system that triggers data pipelines when specific events occur. This system processes data immediately after receiving all required dependency events, enabling efficient workflow management.

The following diagram describes the logical architecture of the domain data pipeline orchestration framework.

The orchestration framework includes the components described in the following list. The data dependencies and data pipeline state management metadata are hosted in an Aurora PostgreSQL database.

  1. Data refresh processor: Receives data refresh events from central mesh and local data domain and evaluates if the domain data products data dependencies are met
  2. Data product dependency processor: Retrieves metadata for the product, kicks off a corresponding data domain AWS Glue job, and updates metadata with the job information
  3. Data pipeline state change processor: Monitors the domain data jobs and takes actions based on the job’s final status (SUCCEED or FAILED) and then creates incident tickets for failed jobs

Conclusion

Stifel has improved its data management and reduced data silos by adopting a data product approach. This strategy has positioned Stifel to become a data-driven, customer-centric organization. The company combines federated platform practices with AWS and open standards. As a result, Stifel is achieving its decentralization objectives through a scalable data platform. This platform empowers domain teams to make informed decisions, drive innovation, and maintain a competitive edge. Here are the some of the advantages Stifel got from an event-driven domain architecture (EDDA):

  • Business agility: Rapid market response, new business capability integration, scalable domains, quicker feature deployment, and flexible process modification
  • Customer experience: Real-time processing, responsive interactions, personalized services, consistent omnichannel presence, and enhanced service availability
  • Operational efficiency: Reduced system coupling, optimal resource use, scalable systems, lower maintenance overhead, and efficient data processing
  • Cost benefits: Lower development costs, reduced infrastructure expenses, decreased maintenance costs, efficient resource usage, and a better ROI on technology investments

In this post, we demonstrated how Stifel is building a modern data platform by recognizing the critical importance of data in today’s financial landscape. This strategic approach not only enhances operational efficiency but also positions Stifel at the forefront of technological innovation in the financial services industry. To learn more and get started, see the following resources:


About the authors

Amit Maindola is a Senior Data Architect focused on data engineering, analytics, and AI/ML at Amazon Web Services. He helps customers in their digital transformation journey and enables them to build highly scalable, robust, and secure cloud-based analytical solutions on AWS to gain timely insights and make critical business decisions.

Srinivas Kandi is a Senior Architect at Stifel focusing on delivering the next generation of cloud data platform on AWS. Prior to joining Stifel, Srini was a delivery specialist in cloud data analytics at AWS helping several customers in their transformational journey into AWS cloud. In his free time, Srini likes to explore cooking, travel and learn new trends and innovations in AI and cloud computing.

Hossein Johari is a seasoned data and analytics leader with over 25 years of experience architecting enterprise-scale platforms. As Lead and Senior Architect at Stifel Financial Corp. in St. Louis, Missouri, he spearheads initiatives in Data Platforms and Strategic Solutions, driving the design and implementation of innovative frameworks that support enterprise-wide analytics, strategic decision-making, and digital transformation. Known for aligning technical vision with business objectives, he works closely with cross-functional teams to deliver scalable, forward-looking solutions that advance organizational agility and performance.

Ahmad Rawashdeh is a Senior Architect at Stifel Financial. He supports Stifel and its clients in designing, implementing, and building scalable and reliable data architectures on Amazon Web Services (AWS), with a strong focus on data lake strategies, database services, and efficient data ingestion and transformation pipelines.

Lei Meng is a data architect at Stifel. His focus is working in designing and implementing scalable and secure data solutions on the AWS and helping Stifel’s cloud migration from on-premises systems.

Amazon OpenSearch Service 101: Create your first search application with OpenSearch

Post Syndicated from Sriharsha Subramanya Begolli original https://aws.amazon.com/blogs/big-data/amazon-opensearch-service-101-create-your-first-search-application-with-opensearch/

Organizations today face the challenge of managing and deriving insights from an ever-expanding universe of data in real time. Industrial Internet of Things (IoT) sensors stream millions of temperature, pressure, and performance metrics from field equipment every second. Ecommerce platforms need to surface relevant products from vast catalogs instantly. Security teams must analyze system logs in real time to detect threats. As data volumes grow, organizations increasingly struggle with fragmented monitoring tools that create critical visibility gaps and slow incident response times. The cost of commercial observability solutions becomes prohibitive, forcing teams to manage multiple separate tools and increasing both operational overhead and troubleshooting complexity. Across these diverse scenarios, the ability to efficiently search, analyze, and visualize data in real time has become crucial for business success.

Amazon OpenSearch Service addresses these challenges by providing a fully managed search and analytics service. This managed service configures, manages, and scales OpenSearch clusters so you can focus on your search workloads and end customers. Amazon OpenSearch Serverless further makes it straightforward to run search and log analytics workloads by automatically scaling compute and storage resources up and down to match your application’s demands—with no infrastructure to manage. Whether you’re processing continuous streams of IoT telemetry, enabling product discovery, or performing security analytics, OpenSearch Service scales to meet your needs.

In this post, we walk you through a search application building process using Amazon OpenSearch Service. Whether you’re a developer new to search or looking to understand OpenSearch fundamentals, this hands-on post shows you how to build a search application from scratch—starting with the initial setup; diving into core components such as indexing, querying, result presentation; and culminating in the execution of your first search query.

Components of OpenSearch Service

Before building your first search application, it’s important to understand some key architectural components in OpenSearch. The fundamental unit of information in OpenSearch is a document stored in JSON format. These documents are organized into indices—collections of related documents that function similar to database tables. When you search for information, OpenSearch queries these indices to find matching documents.

OpenSearch operates on a distributed architecture where multiple servers, called nodes, work together in a cluster or domain. Each cluster can utilize dedicated master nodes that focus solely on cluster management tasks, such as maintaining cluster state, managing indices, and orchestrating shard allocation. These specialized nodes enhance cluster stability by offloading cluster management duties from data nodes. Data nodes, on the other hand, handle the storage, indexing, and querying of data—essentially performing the heavy lifting of data operations. Together, they provide scalability, availability, and efficient data processing in the cluster. Configure dedicated coordinator nodes that specialize in routing and distributing search and indexing requests across the cluster. These nodes reduce the load on data nodes, which allows them to focus on data storage, indexing, and search operations.

Coordinator nodes in OpenSearch are most beneficial in the following scenarios:

  1. Large cluster deployments – When managing substantial data volumes across many nodes.
  2. Query-intensive workloads – For environments handling frequent search queries or aggregations, especially those with complex date histograms or multiple aggregations, benefit from faster query processing.
  3. Heavy dashboard utilizationOpenSearch Dashboards can be resource-intensive. Offloading this responsibility to dedicated coordinator nodes reduces the strain on data nodes.

To manage large datasets efficiently, OpenSearch splits indices into smaller pieces called shards. Each shard is distributed across the cluster, with a recommended size of 10–50 GB for optimal performance. For reliability and high availability, OpenSearch maintains replica copies of these shards on different nodes, which means that your data remains accessible even if some nodes fail.

Search operations in OpenSearch are powered by inverted indices, a data structure that maps terms to the documents containing them. The BM25 ranking algorithm helps make sure that search results are relevant to users’ queries. Although searches happen in near real time, with configurable refresh intervals, individual document retrievals are immediate.

This architecture provides the foundation for handling high-volume IoT data streams, complex full-text search operations, and real-time analytics, all while maintaining fault tolerance. Understanding these components will help you make informed decisions as you build your search application.OpenSearch Dashboards is a visualization and analytics tool for exploring, analyzing, and visualizing data in real time. It provides an intuitive interface for querying, monitoring, and reporting on OpenSearch data using visualizations such as charts, graphs, and maps. Key features include interactive dashboards, alerting, anomaly detection, security monitoring, and trace analytics.

Sample Amazon OpenSearch Service tutorial application overview

The following architecture diagram demonstrates how to build and deploy a scalable, fully managed search application on Amazon Web Services (AWS). The architecture uses Amazon OpenSearch Service for indexing and searching data. The UI application is deployed on AWS App Runner and interacts with Amazon OpenSearch Service through secure serverless Amazon API Gateway and AWS Lambda.

Scope of Solution

Here is the end-to-end workflow for our application detailing how user requests are handled from initial access through to data retrieval or indexing:

  1. Users access the application through AWS App Runner, which hosts the frontend interface.
  2. Amazon Cognito handles user authentication and authorization for secure access to the application.
  3. When users interact with the application, their requests are sent to API Gateway. API Gateway communicates with Amazon Cognito to verify user authentication status. It serves as the primary entry point for all API operations and routes the requests appropriately. It forwards requests to Lambda functions within the virtual private cloud (VPC).
  4. Lambda functions process the requests, performing either:
  5. Data indexing operations into OpenSearch Service
  6. Search queries against the OpenSearch Service cluster
  7. The OpenSearch Service cluster resides within a private subnet in a VPC for enhanced security.

Prerequisites

Before you deploy the solution, review the prerequisites.

Install the sample app

The entire infrastructure is deployed using AWS Cloud Development Kit (AWS CDK), with cluster configurations customizable through the cdk.json file on GitHub. This deployment approach provides consistent and repeatable infrastructure creation while maintaining security best practices. The steps to deploy this infrastructure are available in this README file. After deployment, you’ll access a comprehensive search application built with Cloudscape React components that includes:

  1. Interactive search functionality – Test various OpenSearch query methods including prefix match keyword searches, phrase matching, fuzzy searches, and field-specific queries against the sample product dataset
  2. Document management tools – Bulk index the product catalog with a single click or delete and recreate the index as needed for testing purposes
  3. Educational resources – Access embedded guides explaining OpenSearch concepts, query syntax, and best practices

Index the documents

After you’ve deployed this search application, the first step is to index some documents into OpenSearch Service. Sign in to the search application UI and follow these steps:

  1. To trigger a bulk index process, under Index Documents in the navigation pane, choose Bulk Index Product Catalog.
  2. Choose Index Product catalog, as shown in the following screenshot.

The Lambda function indexes a comprehensive ecommerce product catalog into your newly created OpenSearch Service cluster. This sample dataset includes detailed fashion and lifestyle products spanning multiple categories. Each product record contains rich metadata, including title, detailed description, category, color, and price.

Bulk Index Process

Keyword searches

OpenSearch Service offers multiple search features. For an exhaustive list, refer to Search features. We focus on a few keyword search types to help you get started with OpenSearch.

With the product catalog in OpenSearch, you can perform prefix searches through the search application’s intuitive interface. To better understand the search functionality, expand the Guide section at the top of the interface. This interactive guide explains how various kinds of searches work, complete with a practical example in context of the product catalog dataset. The guide includes best practices and a link to the detailed documentation to help you make the most of OpenSearch’s powerful query capabilities.

You can do a prefix search on any of the three key search fields: Title, Description, or Color.

A typical prefix match query looks like this:

{
  "query": {
    "match_phrase_prefix": {
      "attribute_name": {
        "query": "attribute_value",
        "max_expansions": 10,
        "slop": 1
      }
    }
  }
}

You can use this query pattern to find documents where specific fields begin with your search term, offering an intuitive “starts with” search experience.

The following image illustrates a practical example of the Prefix Match search. Entering “Ru” in the title field matches products with titles such as “Running”, “Runners” and “Ruby.” Prefix Match search is particularly useful when users only remember the beginning of a product name or are searching across multiple variations or simply exploring product categories.

Prefix Match example

Multi Match search enables searching across multiple fields simultaneously. For example, you can search for “Coral” across product title, description, and color fields simultaneously. The search query can be customized using field boosting in which matches in certain fields carry more weight than others.

A typical multi match query looks like this:

{
  "query": {
    "multi_match": {
      "query": "Coral",
      "fields": [
        "title^3",
        "description",
        "color"
      ],
      "type": "best_fields"
    }
  }
}

You can explore Wildcard Match, Range Filter, and other search features through the search application. For developers and administrators managing this search infrastructure, OpenSearch Dashboards is a native, developer-friendly interface for indexing, searching, and managing your data. It serves as a comprehensive control center where you can interact directly with your indices, test queries, and monitor performance in real time. The following screenshot shows OpenSearch Dashboards which provides an interactive UI to explore, analyze and visualize search and log data.

OpenSearch Dashboards

While our example demonstrates lexical search functionality on a sample product catalog, OpenSearch Service is equally powerful for observability usecases. When handling time-series data from logs, metrics, or traces, OpenSearch excels at real-time analytics and visualization. For instance, DevOps teams can index application logs and system telemetry data, then use date histograms and statistical aggregations to identify performance bottlenecks or security anomalies as they occur. This real-time search allows IT teams to detect and respond to incidents with minimal delay. Using OpenSearch Dashboards, teams can create live operational dashboards that update automatically as new data streams in. For IoT applications monitoring thousands of sensors, this means temperature anomalies or equipment failures can trigger immediate alerts through OpenSearch’s alerting capabilities. These observability workloads benefit from the same distributed architecture that powers our product search example, with the added advantage of time-series optimized indices and retention policies for managing high-volume streaming data efficiently.

Beyond search management, you can configure alerts for specific conditions, set up notification channels for operational events, and enable data discovery features. If you want to experiment with the same search queries we implemented in our application, you can launch OpenSearch Dashboards and use relevant index and search APIs from the Dev Tools section, which is an ideal environment for developing and testing before implementing in your production application. Because our OpenSearch Service cluster resides within a private subnet, you need to create a Secure Shell (SSH) tunnel to access the dashboard. For more information and steps to do this, refer to How do I use an SSH tunnel to access OpenSearch Dashboards with Amazon Cognito authentication from outside a VPC? in the Knowledge Center. So far, we’ve explored OpenSearch’s query domain-specific language (DSL). However, for those coming in from a traditional database background, OpenSearch also offers SQL and Piped Processing Language (PPL) functionality, making the transition smoother. You can explore more on this at SQL and PPL in the OpenSearch documentation.

In this post, we introduced you to different types of keyword searches. You can also store documents as vector embeddings in OpenSearch and use it for semantic search, hybrid search, multimodal search, or to implement Retrieval Augmented Generation (RAG) pattern.

Conclusion

You can now build sample search applications by following the steps outlined in this post and the implementation details available at sample-for-amazon-opensearch-service-tutorials-101 on GitHub. By using the distributed architecture of Amazon OpenSearch Service, an AWS managed service, you get fast, scalable search capabilities that grow with your business, built-in security and compliance controls, and automated cluster management—all with pay-only-for-what-you-use pricing flexibility.

Ready to learn more? Check out the Amazon OpenSearch Service Developer Guide. For more insights, best practices and architectures, and industry trends, refer to Amazon OpenSearch Service blog posts and hands-on workshops at AWS Workshops. Please also visit the OpenSearch Service Migration Hub if you are ready to migrate legacy or self-managed workloads to OpenSearch Service.

We hope this detailed guide and accompanying code will help you get started. Try it out, let us know your thoughts in the comments section, and feel free to reach out to us for questions!


About the authors

SriharshaSriharsha Subramanya Begolli works as a Senior Solutions Architect with Amazon Web Services (AWS), based in Bengaluru, India. His primary focus is assisting large enterprise customers in modernizing their applications and developing cloud-based systems to meet their business objectives. His expertise lies in the domains of data and analytics.

Fraser SequeiraFraser Sequeira is a Startups Solutions Architect with Amazon Web Services (AWS) based in Melbourne, Australia. In his role at AWS, Fraser works closely with startups to design and build cloud-native solutions on AWS, with a focus on analytics and streaming workloads. With over 10 years of experience in cloud computing, Fraser has deep expertise in big data, real-time analytics, and building event-driven architecture on AWS. He enjoys staying on top of the latest technology innovations from AWS and sharing his learnings with customers. He spends his free time tinkering with new open source technologies.

Amazon Bedrock baseline architecture in an AWS landing zone

Post Syndicated from Abdel-Rahman Awad original https://aws.amazon.com/blogs/architecture/amazon-bedrock-baseline-architecture-in-an-aws-landing-zone/

As organizations increasingly adopt Amazon Bedrock to build and deploy large-scale AI applications, it’s important that they understand and adopt critical network access controls to protect their data and workloads. These generative AI-enabled applications might have access to sensitive or confidential information within their knowledge bases, Retrieval Augmented Generation (RAG) data sources, or models themselves, which could pose a risk if exposed to unauthorized parties. Additionally, organizations might want to limit access to certain AI models to specific teams or services, making sure only authorized users can use the most powerful capabilities. Another important consideration is cost optimization, because organizations need to be able to monitor and control access to manage various aspects of their cloud spending.

In this post, we explore the Amazon Bedrock baseline architecture and how you can secure and control network access to your various Amazon Bedrock capabilities within AWS network services and tools. We discuss key design considerations, such as using Amazon VPC Lattice auth policies, Amazon Virtual Private Cloud (Amazon VPC) endpoints, and AWS Identity and Access Management (IAM) to restrict and monitor access to your Amazon Bedrock capabilities.

By the end of this post, you will have a better understanding of how to configure your AWS landing zone to establish secure and controlled network connectivity to Amazon Bedrock across your organization using VPC Lattice.

Solution overview

Addressing the aforementioned challenges requires a well-designed network architecture and security controls. For this, we use the standard AWS Landing Zone Accelerator networking configuration. It provides a good starting point for managing network communication across multiple accounts. On top of the AWS Landing Zone Accelerator network design, we add two shared accounts.

In this solution design, we create a centralized architecture for managing organization AI capabilities across different accounts. The architecture consists of three main parts that work together to provide secure and controlled access to AI services:

  • Service network account – This account serves as the central networking hub for the organization, managing network connectivity and access policies. Through this account, network administrators can centrally manage and control access to AI services across the organization. The account follows AWS Landing Zone Accelerator networking practices that scale with enterprise organizational needs.
  • Generative AI account – This account hosts the organization’s Amazon Bedrock capabilities and serves as the central point for AI/ML management. The organization’s AI/ML scientists and prompt engineers will centrally build and manage Amazon Bedrock capabilities. The account provides access to various large language models (LLMs) through Amazon Bedrock by using VPC interface endpoints, while also enabling centralized monitoring of cost consumption and access patterns.
  • Workload accounts (dev, test, prod) – These accounts represent different environments where teams develop and deploy applications that consume AI services. Through secure network connections established through the service network account, these workload accounts can access the AI capabilities hosted in the generative AI account. This separation enforces proper isolation between development, testing, and production workloads while maintaining secure access to AI services.
Amazon Bedrock baseline architecture in an AWS landing zone

Amazon Bedrock baseline architecture in an AWS landing zone

The following diagram illustrates the solution architecture.

The service network account has its own VPC Lattice service network—a centralized networking construct that enables service-to-service communication across your organization, which is shared with workload accounts using AWS Resource Access Manager (AWS RAM) to enable VPC Lattice Service network sharing.

Workload accounts (dev, test, prod) establish VPC associations with the shared VPC Lattice service network by creating a service network association in their VPC. When an application in these accounts makes a request, it first queries the VPC resolver for DNS resolution. The resolver routes the traffic to the VPC Lattice service network.

Access control is implemented through an VPC Lattice auth policy. The service network policies determine which accounts can access the VPC Lattice service network, and service-level policies control access to specific AI services and define what actions each account can perform.

In the central AI services account, we find the proxy layer, we create a VPC Lattice service that points to a proxy layer, which acts as a single entry point, providing workload accounts access to Amazon Bedrock. This proxy layer then connects to Amazon Bedrock through VPC endpoints. Through this setup, the AI team can configure which foundation models (FMs) are available and manage access permissions for different workload accounts. After the necessary policies and connections are in place, workload accounts can access Amazon Bedrock capabilities through the established secure pathway. This setup enables secure, cross-account access to AI services while maintaining centralized control and monitoring.

Network components

We use VPC Lattice, which is a fully managed application networking service that helps you simplify network connectivity, security, and monitoring for service-to-service communication needs. With VPC Lattice, organizations can achieve a centralized connectivity pattern to control and monitor access to the services required for building generative AI applications.

For details about VPC Lattice, refer to the Amazon VPC Lattice User Guide. The following is an overview of the constructs you can use in setting up the centralized pattern in this solution:

  • VPC Lattice service network – You can use the VPC Lattice service network to provide central connectivity and security to the central AI services account. The service network is a logical grouping mechanism that simplifies how you can enable connectivity across VPCs or accounts, and apply common security policies for application communication patterns. You can create a service network in an account and share it with other accounts within or outside AWS Organizations using AWS RAM.
  • VPC Lattice service – In a service network, you can associate a VPC Lattice service, which consists of a listener (protocol and port number), routing rules that allow for control of the application flow (for example, path, method, header-based, or weighted routing), and target group, which defines the application infrastructure. A service can have multiple listeners to meet various client capabilities. Supported protocols include HTTP, HTTPS, gRPC, and TLS. The path-based routing allows control to various high-performing FMs and other capabilities you would need to build a generative AI application.
  • Proxy layer – You use a proxy layer for the VPC Lattice service target group. The proxy layer can be built based on your organization’s preference of AWS services, such as AWS Lambda, AWS Fargate, or Amazon Elastic Kubernetes Service (Amazon EKS). The purpose of the proxy layer is to provide a single entry point to access LLMs, knowledge bases, and other capabilities that are tested and approved according to your organization’s compliance requirements.
  • VPC Lattice auth policies – For security, you use VPC Lattice auth policies. VPC Lattice auth policies are specified using the same syntax as IAM policies. You can apply an auth policy to VPC Lattice service network as well as to the VPC Lattice service.
  • Fully Qualified Domain Names –To facilitate service discovery, VPC Lattice supports custom domain names for your services and resources, and maintains a Fully Qualified Domain Name (FQDN) for each VPC Lattice service and resource you define. You can use these FQDNs in your Amazon Route 53 private hosted zone configurations, and empower business units or teams to discover and access services and resources.
  • Service network VPC – Business units or teams can access generative AI services in a service network using service network VPC associations or a service network VPC endpoint.
  • Monitoring – You can choose to enable monitoring at the VPC Lattice service network level and VPC Lattice service level. VPC Lattice generates metrics and logs for requests and responses, making it more efficient to monitor and troubleshoot applications

The preceding guidance takes a “secure by default” approach—you must be explicit about which features, models, and so on should be accessed by which business unit. The setup also enables you to implement a defense-in-depth strategy at multiple layers of the network:

  • The first level of defense is that business team needs to connect to the service network in order to get access to the generative AI service through the central AI service account.
  • The second level includes network-level security protections in the business team’s VPC for the service network, such as security groups and network access control lists (ACLs). By using these, you can allow access to specific workloads or teams in a VPC.
  • The third level is through the VPC Lattice auth policy, which you can apply at two layers: at the service network level to allow authenticated requests within the organization, and at the service level to allow access to specific models and features.

VPC Lattice auth policy

This solution makes it possible to centrally manage access to Amazon Bedrock resources across your organization. This approach uses an VPC Lattice auth policy to centrally control Amazon Bedrock resources and manage it from one location across all the organization accounts.

Typically, the auth policy on the service network is operated by the network or cloud administrator. For example, allowing only authenticated requests from specific workloads or teams in your AWS organization. In the following example, access is granted to invoke the generated AI service for authenticated requests and to principals that are part of the o-123456example organization:

{
   "Version": "2012-10-17",
   "Statement": [
      {
         "Effect": "Allow",
         "Principal": "*",
         "Action": "vpc-lattice-svcs:Invoke",
         "Resource": "*",
         "Condition": {
            "StringEquals": {
               "aws:PrincipalOrgID": [ 
                  "o-123456example"
               ]
            }
         }
      }
   ]
}

The auth policy at the service level is managed by the central AI service team to set fine-grained controls, which can be more restrictive than the coarse-grained authorization applied at the service network level. For example, the following policy restricts access to claude-3-haiku for only business-team1:

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect": "Allow",
         "Principal": {
            "AWS": [
               "arn:aws:iam::<account-number>:role/businss-team1"
            ]
         },
         "Action": "vpc-lattice-svcs:Invoke",
         "Resource": [
            "arn:aws:vpc-lattice:<aws-region>:<account-number>:service/svc-0123456789abcdef0/*"
         ],
         "Condition": {
            "StringEquals": {
               "vpc-lattice-svcs:RequestQueryString/modelid": "claude-3-haiku"            }
         }
      }
   ]
}

Monitoring and tracking

This design employs three monitoring approaches, using Amazon CloudWatch, AWS CloudTrail, and VPC Lattice access logs. This strategy provides a view of service usage, security, and performance.

CloudWatch metrics offer real-time monitoring of VPC Lattice service performance and usage. CloudWatch tracks metrics such as request counts and response times for Amazon Bedrock related endpoints, allowing for the setup of alarms for proactive management of service health and capacity. This enables monitoring of overall usage patterns of Amazon Bedrock models across different business units, facilitating capacity planning and resource allocation. CloudTrail provides detailed API-level auditing of Amazon Bedrock related actions. It logs cross-account access attempts and interactions with Amazon Bedrock services, providing a compliance and security audit trail. This tracking of who is accessing which Amazon Bedrock models, when, and from which accounts helps organizations adhere to their organizational policies.VPC Lattice access logs provide detailed insights into HTTP/HTTPS requests to Amazon Bedrock services, capturing specific usage patterns of AI models by different business teams. These logs contain client-specific information, which for example can be used to enable organizations to implement capabilities such as charge-back models. This allows for accurate attribution of AI service usage to specific teams or departments, facilitating fair cost allocation and responsible resource utilization across the organization. These services work together to enhance security, optimize performance, and provide valuable insights for managing cross-account Amazon Bedrock access.

Conclusion

In this post, we explored the importance of securing and controlling network access to Amazon Bedrock capabilities within an organization’s AWS landing zone. We discussed the key business challenges, such as the need to protect sensitive information in Amazon Bedrock knowledge bases, limit access to AI models, and optimize cloud costs by monitoring and controlling Amazon Bedrock capabilities. To address these challenges, we outlined a multi-layered network solution that uses AWS networking services, including a VPC Lattice auth policy to restrict and monitor access to Amazon Bedrock capabilities. Try out this solution for your own use case, and share your feedback in the comments.


About the authors

Stream data from Amazon MSK to Apache Iceberg tables in Amazon S3 and Amazon S3 Tables using Amazon Data Firehose

Post Syndicated from Pratik Patel original https://aws.amazon.com/blogs/big-data/stream-data-from-amazon-msk-to-apache-iceberg-tables-in-amazon-s3-and-amazon-s3-tables-using-amazon-data-firehose/

In today’s data-driven/fast-paced landscape/environment real-time streaming analytics has become critical for business success. From detecting fraudulent transactions in financial services to monitoring Internet of Things (IoT) sensor data in manufacturing, or tracking user behavior in ecommerce platforms, streaming analytics enables organizations to make split-second decisions and respond to opportunities and threats as they emerge.

Increasingly, organizations are adopting Apache Iceberg, an open source table format that simplifies data processing on large datasets stored in data lakes. Iceberg brings SQL-like familiarity to big data, offering capabilities such as ACID transactions, row-level operations, partition evolution, data versioning, incremental processing, and advanced query scanning. It seamlessly integrates with popular open source big data processing frameworks Apache Spark, Apache Hive, Apache Flink, Presto, and Trino. Amazon Simple Storage Service (Amazon S3) supports Iceberg tables both directly using the Iceberg table format and in Amazon S3 Tables.

Although Amazon Managed Streaming for Apache Kafka (Amazon MSK) provides robust, scalable streaming capabilities for real-time data needs, many customers need to efficiently and seamlessly deliver their streaming data from Amazon MSK to Iceberg tables in Amazon S3 and S3 Tables. This is where Amazon Data Firehose (Firehose) comes in. With its built-in support for Iceberg tables in Amazon S3 and S3 Tables, Firehose makes it possible to seamlessly deliver streaming data from provisioned MSK clusters to Iceberg tables in Amazon S3 and S3 Tables.

As a fully managed extract, transform, and load (ETL) service, Firehose reads data from your Apache Kafka topics, transforms the records, and writes them directly to Iceberg tables in your data lake in Amazon S3. This new capability requires no code or infrastructure management on your part, allowing for continuous, efficient data loading from Amazon MSK to Iceberg in Amazon S3.In this post, we walk through two solutions that demonstrate how to stream data from your Amazon MSK provisioned cluster to Iceberg-based data lakes in Amazon S3 using Firehose.

Solution 1 overview: Amazon MSK to Iceberg tables in Amazon S3

The following diagram illustrates the high-level architecture to deliver streaming messages from Amazon MSK to Iceberg tables in Amazon S3.

bdb-4769-image-1

Prerequisites

To follow the tutorial in this post, you need the following prerequisites:

Verify permission

Before configuring the Firehose delivery stream, you must verify the destination table available in the Data Catalog.

  1. On the AWS Glue console, go to Glue Data Catalog and verify the Iceberg table is available with the required attributes.

bdb-4769-image-2

  1. Verify your Amazon MSK provisioned cluster is up and running with IAM authentication, and multi-VPC connectivity is enabled for it.

bdb-4769-image-3

  1. Grant Firehose access to your private MSK cluster:
    1. On the Amazon MSK console, go to the cluster and choose Properties and Security settings.
    2. Edit the cluster policy and define a policy similar to the following example:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Principal": {
        "Service": [
          "firehose.amazonaws.com"
        ]
    },
    "Effect": "Allow",
    "Action": [
      "kafka:CreateVpcConnection"
    ],
    "Resource": "<Amazon MSK cluster-arn>"
    }
  ]
}

This ensures Firehose has the necessary permissions on the source Amazon MSK provisioned cluster.

Create a Firehose role

This section describes the permissions that grant Firehose access to ingest, process, and deliver data from source to destination. You must specify an IAM role that grants Firehose permissions to ingest source data from the specified Amazon MSK provisioned cluster. Make sure that the following trust policies are attached to that role so that Firehose can assume it:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Principal": {
        "Service": [
          "firehose.amazonaws.com"
        ]
      },
      "Effect": "Allow",
      "Action": "sts:AssumeRole"
    }
  ]
}

Make sure that this role grants Firehose the following permissions to ingest source data from the specified Amazon MSK provisioned cluster:

{
   "Version": "2012-10-17",      
   "Statement": [{
        "Effect":"Allow",
        "Action": [
           "kafka:GetBootstrapBrokers",
           "kafka:DescribeCluster",
           "kafka:DescribeClusterV2",
           "kafka-cluster:Connect"
         ],
         "Resource": "<CLUSTER-ARN>"
       },
       {
         "Effect":"Allow",
         "Action": [
           "kafka-cluster:DescribeTopic",
           "kafka-cluster:DescribeTopicDynamicConfiguration",
           "kafka-cluster:ReadData"
         ],
         "Resource": "<TOPIC-ARN>"
       }]
}

Make sure the Firehose role has permissions to the Glue Data Catalog and S3 bucket:

{
    "Version": "2012-10-17",  
    "Statement":
    [    
        {      
            "Effect": "Allow",      
            "Action": [
                "glue:GetTable",
                "glue:GetDatabase",
                "glue:UpdateTable"
            ],      
            "Resource": [   
                "arn:aws:glue:<region>:<aws-account-id>:catalog",
                "arn:aws:glue:<region>:<aws-account-id>:database/*",
                "arn:aws:glue:<region>:<aws-account-id>:table/*/*"             
            ]    
        },        
        {      
            "Effect": "Allow",      
            "Action": [
                "s3:AbortMultipartUpload",
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads",
                "s3:PutObject",
                "s3:DeleteObject"
            ],      
            "Resource": [   
                "arn:aws:s3:::<S3 bucket name>",
                "arn:aws:s3:::<S3 bucket name>/*"              
            ]    
        } 
    ]
}    

For detailed policies, refer to the following resources:

Now you have verified that your source MSK cluster and destination Iceberg table are available, you’re ready to set up Firehose to deliver streaming data to the Iceberg tables in Amazon S3.

Create a Firehose stream

Complete the following steps to create a Firehose stream:

  1. On the Firehose console, choose Create Firehose stream.
  2. Choose Amazon MSK for Source and Apache Iceberg Tables for Destination.

bdb-4769-image-4

  1. Provide a Firehose stream name and specify the cluster configurations.

bdb-4769-image-5

  1. You can choose an MSK cluster in the current account or another account.
  2. To choose the cluster, it must be in active state with IAM as one of its access control methods and multi-VPC connectivity should be enabled.

bdb-4769-image-6

  1. Provide the MSK topic name from which Firehose will read the data.

bdb-4769-image-7

  1. Enter the Firehose stream name.

bdb-4769-image-8

  1. Enter the destination settings where you can opt to send data in the current account or across accounts.
  2. Select the account location as Current account, choose an appropriate AWS Region, and for Catalog, choose the current account ID.

bdb-4769-image-9

To route streaming data to different Iceberg tables and perform operations such as insert, update, and delete, you can use Firehose JQ expressions. You can find the required information here.

  1. Provide the unique key configuration, which makes it possible to perform update and delete actions on your data.

bdb-4769-image-10

  1. Go to Buffer hints and configure Buffer size to 1 MiB and Buffer interval to 60 seconds. You can tune these settings according to your use case needs.
  2. Configure your backup settings by providing an S3 backup bucket.

With Firehose, you can configure backup settings by specifying an S3 backup bucket with custom prefixes like error, so failed records are automatically preserved and accessible for troubleshooting and reprocessing.

bdb-4769-image-11

  1. Under Advanced settings, enable Amazon CloudWatch error logging.

bdb-4769-image-12

  1. Under Service access, choose the IAM role you created earlier for Firehose.
  2. Verify your configurations and choose Create Firehose stream.

bdb-4769-image-14

The Firehose stream will be available and it will stream data from the MSK topic to the Iceberg table in Amazon S3.

bdb-4769-image-15

You can query the table with Amazon Athena to validate the streaming data.

  1. On the Athena console, open the query editor.
  2. Choose the Iceberg table and run a table preview.

You will be able to access the streaming data in the table.

bdb-4769-image-16

Solution 2 overview: Amazon MSK to S3 Tables

S3 Tables is built on Iceberg’s open table format, providing table-like capabilities directly to Amazon S3. You can organize and query data using familiar table semantics while using Iceberg’s features for schema evolution, partition evolution, and time travel capabilities. The feature performs ACID-compliant transactions and supports INSERT, UPDATE, and DELETE operations in Amazon S3 data, making data lake management more efficient and reliable.

You can use Firehose to deliver streaming data from an Amazon MSK provisioned cluster to Iceberg tables in Amazon S3. You can create an S3 table bucket using the Amazon S3 console, and it registers the bucket to AWS Lake Formation, which helps you manage fine-grained access control for your Iceberg-based data lake on S3 Tables. The following diagram illustrates the solution architecture.

Prerequisites

You should have the following prerequisites:

  • An AWS account
  • An active Amazon MSK provisioned cluster with IAM access control authentication enabled and multi-VPC connectivity
  • The Firehose role mentioned earlier with the additional IAM policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Statement1",
            "Effect": "Allow",
            "Action": [
                "lakeformation:GetDataAccess"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}

Further, in your Firehose role, add s3tablescatalog as a resource to provide access to S3 Table as shown below.

Create an S3 table bucket

To create an S3 table bucket on the Amazon S3 console, refer to Creating a table bucket.

When you create your first table bucket with the Enable integration option, Amazon S3 attempts to automatically integrate your table bucket with AWS analytics services. This integration makes it possible to use AWS analytics services to query all tables in the current Region. This is an important step for the further set up. If this integration is already in place, you can use the AWS Command Line Interface (AWS CLI) as follows:

aws s3tables create-table-bucket --region <region id> --name <bucket name>

bdb-4769-image-18

Create a namespace

An S3 table namespace is a logical construct within an S3 table bucket. Each table belongs to a single namespace. Before creating a table, you must create a namespace to group tables under. You can create a namespace by using the Amazon S3 REST API, AWS SDK, AWS CLI, or integrated query engines.

You can use the following AWS CLI to create a table namespace:

aws s3tables create-namespace --table-bucket-arn arn:aws:s3tables:us-east-1:111122223333:bucket/amzn-s3-demo-bucket --namespace example_namespace

Create a table

An S3 table is a sub-resource of a table bucket. This resource stores S3 tables in Iceberg format so you can work with them using query engines and other applications that support Iceberg. You can create a table with the following AWS CLI command:

aws s3tables create-table --cli-input-json file://mytabledefinition.json

The following code is for mytabledefinition.json:

{
    "tableBucketARN": "arn:aws:s3tables:us-east-1:111122223333:bucket/amzn-s3-demo-table-bucket",
    "namespace": "example_namespace ",
    "name": "example_table",
    "format": "ICEBERG",
    "metadata": {
        "iceberg": {
            "schema": {
                "fields": [
                     {"name": "id", "type": "int", "required": true},
                     {"name": "name", "type": "string"},
                     {"name": "value", "type": "int"}
                ]
            }
        }
    }
}

Now you have the required table with the relevant attributes available in Lake Formation.

Grant Lake Formation permissions on your table resources

After integration, Lake Formation manages access to your table resources. It uses its own permissions model (Lake Formation permissions) that enables fine-grained access control for Glue Data Catalog resources. To allow Firehose to write data to S3 Tables, you can grant a principal Lake Formation permission on a table in the S3 table bucket, either through the Lake Formation console or AWS CLI. Complete the following steps:

  1. Make sure you’re running AWS CLI commands as a data lake administrator. For more information, see Create a data lake administrator.
  2. Run the following command to grant Lake Formation permissions on the table in the S3 table bucket to an IAM principal (Firehose role) to access the table:
aws lakeformation grant-permissions \
--region <region e.g. us-east-1> \
--cli-input-json \
'{
    "Principal": {
        "DataLakePrincipalIdentifier": "<Amazon Data Firehose role ARN e.g. arn:aws:iam::<accound-id>:role/ExampleRole>"
    },
    "Resource": {
        "Table": {
            "CatalogId": "<account-id>:<s3tablescatalog>/<S3 table bucket name>",
            "DatabaseName": "<S3 table bucket namespace e.g. test_namespace>",
            "Name": "<S3 table bucket table name e.g. test_table>"
        }
    },
    "Permissions": [
        "ALL"
    ]
}'

Set up a Firehose stream to S3 Tables

To set up a Firehose stream to S3 Tables using the Firehose console, complete the following steps:

  1. On the Firehose console, choose Create Firehose stream.
  2. For Source, choose Amazon MSK.
  3. For Destination, choose Apache Iceberg Tables.
  4. Enter a Firehose stream name.
  5. Configure your source settings.
  6. For Destination settings, select Current Account, choose your Region, and enter the name of the table bucket you want to stream in.
  7. Configure the database and table names using Unique Key configuration settings, JSONQuery expressions, or in an AWS Lambda function.

For more information, refer to Route incoming records to a single Iceberg table and Route incoming records to different Iceberg tables.

  1. Under Backup settings, specify a S3 backup bucket.
  2. For Existing IAM roles under Advanced settings, choose the IAM role you created for Firehose.
  3. Choose Create Firehose stream.

The Firehose stream will be available and it will stream data from the Amazon MSK topic to the Iceberg table. You can verify it by querying the Iceberg table using an Athena query.

bdb-4769-image-19

Clean up

It’s always a good practice to clean up the resources created as part of this post to avoid additional costs. To clean up your resources, delete the MSK cluster, Firehose stream, Iceberg S3 table bucket, S3 general purpose bucket, and CloudWatch logs.

Conclusion

In this post, we demonstrated two approaches for data streaming from Amazon MSK to data lakes using Firehose: direct streaming to Iceberg tables in Amazon S3, and streaming to S3 Tables. Firehose alleviates the complexity of traditional data pipeline management by offering a fully managed, no-code approach that handles data transformation, compression, and error handling automatically. The seamless integration between Amazon MSK, Firehose, and Iceberg format in Amazon S3 demonstrates AWS’s commitment to simplifying big data architectures while maintaining the robust features of ACID compliance and advanced query capabilities that modern data lakes demand. We hope you found this post helpful and encourage you to try out this solution and simplify your streaming data pipelines to Iceberg tables.


About the authors

bdb-4769-image-21Pratik Patel is Sr. Technical Account Manager and streaming analytics specialist. He works with AWS customers and provides ongoing support and technical guidance to help plan and build solutions using best practices and proactively keep customers’ AWS environments operationally healthy.

Amar is a seasoned Data Analytics specialist at AWS UK, who helps AWS customers to deliver large-scale data solutions. With deep expertise in AWS analytics and machine learning services, he enables organizations to drive data-driven transformation and innovation. He is passionate about building high-impact solutions and actively engages with the tech community to share knowledge and best practices in data analytics.

bdb-4769-image-22Priyanka Chaudhary is a Senior Solutions Architect and data analytics specialist. She works with AWS customers as their trusted advisor, providing technical guidance and support in building Well-Architected, innovative industry solutions.

How Stellantis streamlines floating license management with serverless orchestration on AWS

Post Syndicated from Göksel SARIKAYA original https://aws.amazon.com/blogs/architecture/how-stellantis-streamlines-floating-license-management-with-serverless-orchestration-on-aws/

This post is written by Goeksel Sarikaya, Senior Delivery Consultant at AWS, and Milosz Stawarski, Senior Software Architect at Stellantis.

Software licensing is a critical aspect of many organizations’ operations, with various models available to suit different needs. Two common types are named user licenses, which are assigned to specific individuals, and floating licenses, which can be shared among a pool of users. Some independent software vendors (ISVs) offer both options, whereas others might have limitations, particularly in cloud environments.

In this post, we explore a unique scenario where an ISV, unable to provide a floating license option for cloud usage, worked with Stellantis to develop an alternative solution. This approach, implemented with the ISV’s permission, treats named user licenses as if they were floating, automatically assigning and removing them based on the state of user workbench instances.

This solution is not intended to circumvent licensing terms or reduce costs at the expense of ISVs. Rather, it’s a collaborative approach to address specific customer needs when traditional floating licenses aren’t available. We will demonstrate how the solution uses serverless AWS services like Amazon EventBridge, AWS Lambda, Amazon DynamoDB, and AWS Systems Manager, keeping in mind that any similar implementation should only be pursued with explicit permission from the software vendor.

Overview of Stellantis

Stellantis N.V., born from the merger of FCA and PSA Group, leads the change towards software defined vehicles (SDV). As part of this transformation, AWS and Stellantis created the Virtual Engineering Workbench (VEW), a modular framework to develop, integrate, and test vehicle software in the cloud, ultimately connecting their vehicles to the cloud.

The VEW provides predefined environments tailored to specific use cases. These environments come fully equipped with the tools, integrated development environments (IDEs), and licensing necessary for developers to jumpstart their projects.

For more details on VEW, refer to Stellantis’ SDV transformation with the Virtual Engineering Workbench on AWS.

Overview of solution

As the number of developers and projects grew, Stellantis faced a challenge in managing the limited number of named user licenses for their software tools. The manual process of assigning and revoking licenses became increasingly time-consuming and inefficient, potentially hindering the agility and productivity of their development teams.

Stellantis and AWS tackled this challenge head-on by collaborating on an innovative, dynamic license management solution using AWS serverless services. This solution transforms the traditional named user license model into a more flexible floating license system, automatically assigning and revoking licenses based on the state of user workbench instances. The licenses and solution discussed in this post pertain solely to the use of standalone software tools such as those used in automotive domains. These do not involve sharing of user data or content when licenses are reused.

Before we dive into the detailed workflow of the solution, let’s examine the high-level architecture. The following diagram illustrates how various AWS services work together to create this efficient license management system.

Multi-region AWS license management architecture showing event-driven workflows between toolchain and user accounts with VEW workbench integration

Architecture

This architecture uses key AWS services such as EventBridge, Lambda, DynamoDB, and Systems Manager to create a scalable, serverless solution that significantly reduces administrative overhead and optimizes license utilization.

In the following sections, we explore each component of this architecture in detail, explaining how they interact to provide a seamless license management experience for Stellantis’ VEW.

In workbench accounts (user accounts)

The design is serverless and based on an event-driven approach. The workflow in the user accounts is as follows:

  1. Workbench instances are Amazon Elastic Compute Cloud (Amazon EC2). Their start and stop automatically sends AWS events.
  2. An EventBridge rule invokes a Lambda function when such an event occurs. This function checks the tags on the EC2 instance to distinguish workbenches from other EC2 instances. Two tags are important for identifying workbench instances: vew:workbench:ownerId and vew:workbench:type.
  3. The Lambda function creates a custom event with the following data: user-id, workbench-type, workbench-state, and instance-id, and sends this event to the default event bus.
  4. An EventBridge rule forwards the custom event to a custom event bus in the license server account.

In license server account

The following steps take place in the license server account:

  1. An EventBridge rule invokes a Lambda
  2. This function interacts with a DynamoDB table that stores a mapping of licensed products to users. The function does the following:
    1. Deduces the licensed products present in the workbench from the workbench type.
    2. For each licensed product, it verifies if the combination of product and user is already present in the DynamoDB
    3. If the workbench is starting:
      1. If the combination is already present, it increases the count of workbenches in the table for this item by 1.
      2. If the combination is not present, it creates a new item in the table (product, user-id, workbench-count, timestamp).
    4. If the workbench is stopping, it decreases the count of workbenches in the table for this item by 1. If the count becomes 0, the item is deleted.
  3. Any update to the DynamoDB table triggers another Lambda
  4. If the change in the table is a creation of a new entry or deletion of an entry, this function writes the current timestamp to a Systems Manager parameter in both cases. This is so that if no changes are detected in the database, we don’t unnecessarily run the xLC (License Client for related product) caller function.
  5. Another Lambda function is invoked every minute. It compares the timestamp written in the Systems Manager parameter indicating a DynamoDB item creation or deletion with the last time the function called the xLC CLI to assign users to a license.
  6. If the DynamoDB timestamp is earlier, the function stops. If the DynamoDB timestamp is later, the function queries the table for obtaining the user-id for each product.
  7. To maintain a comprehensive record of license assignment operations, you can enable data plane events for DynamoDB in AWS CloudTrail.
  8. For each licensed product, the function uses Run Command, a capability of Systems Manager, to invoke the xLC CLI API on the license server to assign named users to a license for a product. The function provides the list of users assigned to the product to the API. This updates the named user list on the license server—the list is completely overwritten, which includes adding new user IDs and removing ones that are no longer needed.

Benefits and key features

The solution offers the following benefits:

  • Automated license assignment and removal – Users are automatically assigned licenses when their workbench instances start, and licenses are returned to the pool when instances stop, providing efficient license utilization.
  • Scalable and serverless architecture – The solution is built on serverless AWS services, allowing it to scale seamlessly as the number of users and workbench instances grows, without the need for provisioning or managing servers.
  • Centralized license management – The license server account acts as a central hub for managing licenses across multiple workbench accounts, simplifying administration and providing a unified view of license usage.
  • Reduced administrative overhead – By automating the license assignment and removal process, the solution can significantly reduce the administrative burden associated with manual license management.
  • Optimized license utilization – Licenses are assigned only when needed and returned to the pool when no longer required, maximizing license availability and minimizing idle licenses.
  • Monitoring and metrics – The solution provides monitoring capabilities and license usage metrics, enabling better visibility and informed decision-making regarding license procurement and allocation.

Conclusion

By implementing this serverless solution, it is possible to transform a manual named user license management systems to an automated floating license system for software tools. The event-driven architecture and serverless components provide efficient and scalable license assignment and removal based on the workbench instance state.

This solution has streamlined the license management process, reducing administrative overhead and optimizing license utilization. It is now possible to provision software tools more efficiently, improving productivity and resource allocation across the organization. Additionally, the centralized license management and monitoring capabilities provide better visibility and control over license usage, enabling informed decision-making and cost optimization.

Overall, this AWS based floating license solution has empowered organizations to use software tools more effectively, while minimizing the operational burden associated with license management. For more serverless learning resources, visit Serverless Land.


About the authors