Tag Archives: serverless

Enhancing multi-account activity monitoring with event-driven architectures

Post Syndicated from Anton Aleksandrov original https://aws.amazon.com/blogs/compute/enhancing-multi-account-activity-monitoring-with-event-driven-architectures/

Enterprise cloud environments are growing increasingly complex as they scale, with organizations managing hundreds to thousands of Amazon Web Services (AWS) accounts across multiple business units and AWS Regions. Organizations need efficient ways to collect, transport, and analyze activity data for threat detection and compliance monitoring. This presents unique challenges for enterprise Application Security (AppSec) teams, cloud security vendors, and DevSecOps professionals, because traditional polling-based monitoring approaches struggle to provide real-time activity insights needed for modern cloud operations.

In this post, you will learn to use AWS CloudTrail and Amazon EventBridge for real-time cloud activity monitoring and automated response.

Overview

As organizations expand their cloud footprint, account activity monitoring that comprehensively tracks user actions and successfully identifies security threats becomes crucial for threat detection and compliance. Although AWS provides native tools—such as CloudTrail for API activity capture, EventBridge for real-time event routing, AWS Organizations for multi-account management, and AWS Config for resource evaluation—many enterprises struggle with the volume of activities while maintaining efficiency and controlling costs. Organizations need to carefully architect solutions to effectively use these tools as their environments scale.

Traditional polling-based techniques, which worked well for smaller environments, can become unsustainable when scaled to enterprise deployments, where the volume of activity data grows exponentially with each new account and service. API polling limitations, growing data volumes, and increasing demand for real-time analysis are pushing teams to rethink their architectural approach.

Figure 1. Poll model, periodically retrieving the latest state.

Adopting push-based event-driven architectures offers a compelling solution for AppSec teams and cloud security vendors facing these challenges. Using AWS services, such as CloudTrail and EventBridge, allows these teams and vendors to build scalable activity monitoring solutions that overcome the limitations of traditional polling-based approaches and provide real-time notifications across thousands of AWS accounts. This approach not only enables security use cases but also supports broader real-time operational monitoring, compliance reporting, and automation requirements.

“By integrating AWS CloudTrail and Amazon EventBridge, we’ve built a scalable architecture to monitor activity across thousands of AWS accounts. This provides the visibility needed to detect threats in real time and protect large, distributed AWS environments.” — Rob Solomon, Senior Cloud Solution Architect, CrowdStrike

Solution components

Enterprise AppSec teams and cloud security vendors share common requirements when building multi-account monitoring solutions. They need to efficiently collect activity data across thousands of accounts, transport it to a centralized location for analysis, and process it in real-time to detect threats and compliance violations. The solution must scale seamlessly from dozens to thousands of accounts while remaining highly-performant and cost-efficient. At its core, a scalable multi-account activity monitoring solution consists of three components: activity data collection, cross-account transport to a centralized location, and processing. In the following sections, you will learn how AppSec teams and cloud security vendors can implement each step efficiently while avoiding common pitfalls.

Figure 2. Push model. Account activity is collected at the source, and pushed to the AppSec or cloud security vendor account for further processing.

Data collection strategies

Many teams begin their cloud activity monitoring journey by polling the resource status through service management APIs. Although this approach works good for retrieving the latest resource state on-demand, its fundamental limitation is inability to detect state changes efficiently, necessitating continuous querying of all resources at fixed intervals. Consider a scenario where you’re monitoring 1,000 accounts, with 100 resources in each account. A single polling cycle would necessitate 100,000 API requests, consuming over 28 million API calls daily if running at five-minute intervals. This inefficiency compounds as environments grow, leading to throttling issues, increased costs, and scaling challenges.

AWS Config improved upon this by offering continuous resource configuration tracking without manual polling. Although this works excellent for configuration compliance and a history of changes for auditing, AWS Config reports changes on a best-effort basis and is not optimal for real-time threat detection.

To overcome this constraint, your solution can use services such as CloudTrail and EventBridge as primary data sources, complemented by intelligent on-demand targeted API polling. CloudTrail records API activity across AWS services, providing a detailed history of actions taken by users, roles, and AWS services in your accounts. Over 250 AWS services automatically report their activity and API calls to CloudTrail and EventBridge in real-time. This allows you to capture this information, providing a detailed history of actions taken in your accounts, and enabling security analysis, resource state change tracking, and compliance audit.

Figure 3. Over 250 AWS services are automatically emitting activity events to CloudTrail.

When a resource state changes, commonly as a result of a management API call, the affected service sends an event to CloudTrail and EventBridge. Your monitoring solution can examine the event payload to determine if polling for supplementary data is necessary, particularly when the initial payload lacks complete information. This provides you with comprehensive service coverage with reduced maintenance effort. This hybrid approach guarantees delivery of activity data to eliminate monitoring blind spots, while significantly reducing AWS management API quota consumption.

Cross-account data transport

Your solution should transport activity data from thousands of tenant accounts into a small number of centralized accounts, such as a regional AppSec account, for further processing and analysis. The solution must be secure, scalable, resilient, and cost-efficient while maintaining real-time delivery.

The most direct way to achieve it is to enable Amazon S3 event notifications for new objects that are created in the CloudTrail trails S3 bucket. When you receive the notification, you can retrieve and process new activities.

Figure 4. Exporting CloudTrail events into an S3 bucket and retrieving after receiving a notification.

This direct way to consume CloudTrail events has one important consideration: typically it can take an average of five minutes to deliver events to Amazon S3. Teams and vendors looking for lower mean-time-to-detect (MTTD) and mean-time-to-respond (MTTR) should evaluate transporting CloudTrail events across accounts with EventBridge, which provides close-to-real-time delivery.

Transporting events with EventBridge

EventBridge is a serverless event router that connects applications. It receives events from various sources, such as CloudTrail, and routes them to multiple targets based on defined rules.

Using EventBridge for cross-account data transport comes with several major benefits:

There are two approaches you can take for delivering cross-account events with EventBridge: direct service-to-service or service-to-API-endpoint.

The first approach uses the EventBridge direct bus-to-bus and bus-to-service delivery capabilities. This method is most suitable when you want AWS to handle data ingestion on the receiving end. The delivery target is always either an EventBridge bus, or another AWS service, such as an Amazon Simple Queue Service (Amazon SQS) queue, Amazon Kinesis Data Streams stream, or an AWS Lambda function. With support for up to 18,750 target invocations per second and native AWS Identity and Access Management (IAM) integration, this method is particularly suitable for large multi-account deployments.

The second approach uses the EventBridge API destinations feature. This method is most suitable when you have existing HTTP-based ingestion endpoints in place. Although it offers lower throughput, it provides greater flexibility for ingestion endpoint and authentication methods implementation, making it attractive for AppSec teams and cloud security vendors integrating with existing ingestion infrastructure.

Figure 5. Emitting CloudTrail events in real-time through EventBridge.

The following table summarizes two approaches for transporting events across accounts with EventBridge.

Direct bus-to-bus or bus-to-service API destinations
Data ingestion implementation effort Minimal Needed
Default target invocations per second (TPS) quotas Up to 18,750 (region dependent) Up to 300 (region dependent)
Can the TPS quota be increased Yes Yes
Authorization support Native AWS IAM, fully handled by AWS Basic, OAuth2, API Key. You’re responsible for implementing credentials validation during ingestion.
Cross-account delivery costs $1 per million events $0.20 per million events

Go to the EventBridge quotas and pricing pages for more details.

Processing architecture

Processing would commonly be done by existing products and services the AppSec team or cloud security vendor provides for activity analysis. The architecture for event processing pipeline operating at enterprise scale must consider design decisions to handle large and potentially irregular event volumes while maintaining high performance, as shown in the following figure.

Figure 6. An activity event processing pipeline, with priority-based processing.

Use the following best practices for a robust processing architecture:

  • Buffer ingested events Use services such as Amazon SQS, Amazon Kinesis Data Streams, or Amazon Managed Streaming for Apache Kafka to buffer incoming events, handle traffic surges, and make sure of reliable processing.
  • Use serverless services that scale automatically, or invest in automated scaling mechanisms that adjust processing capacity based on event volume
  • Minimize polling: Resort to intelligent on-demand polling, only poll when you need additional data that is not available in the CloudTrail event payload.
  • Routing and classification: Rather than processing all events equally, implement intelligent classification and routing early in your pipeline. Security-related events such as IAM changes or security group modifications should take priority over routine activities or data events. This approach helps to control costs while maintaining rapid detection of important security events.
  • Cost optimization: At the enterprise scale, cost optimization becomes crucial. Use EventBridge rules in source accounts to filter out irrelevant events before they enter your processing pipeline. Consider implementing regional collection points to optimize data transfer costs. When using Lambda functions for data processing, use batch processing to reduce invocation costs. Evaluate which event types must be delivered in real-time through EventBridge, which event types can be delayed and collected through S3 bucket export, and which events should be discarded.
  • Observability: Monitor the ingestion and processing throughput to react to potential slowdowns early. Detect when source accounts are approaching EventBridge quotas. Consider using AWS Service Quotas to request quota increases automatically through APIs.
  • Cross-Region considerations: Design your architecture to support efficient cross-Region event collection while respecting data sovereignty requirements. Consider implementing regional processing nodes with centralized aggregation for global security analysis.
  • Integration patterns: Modern security solutions must integrate with existing security tools and workflows. Implement standardized output formats that allow seamless integration with SIEM systems, ticketing platforms, and automation frameworks. Consider publishing security findings back to EventBridge buses to enable automated remediation workflows. If you’re a cloud security vendor, then consider integrating with EventBridge as an SaaS partner.

Conclusion

Event-driven architectures present a powerful opportunity for building scalable multi-account activity monitoring solutions. Using services such as AWS CloudTrail and Amazon EventBridge allows teams to overcome the limitations of traditional polling-based approaches while achieving close to real-time delivery.The shift to event-driven security monitoring isn’t just an architectural choice—it’s becoming a necessity for teams operating at enterprise scale. This approach enables security teams to achieve the real-time threat detection capabilities needed in today’s cloud environments while maintaining operational efficiency and cost control.

Powering hybrid workloads with Amazon API Gateway

Post Syndicated from Mankaran Singh original https://aws.amazon.com/blogs/compute/powering-hybrid-workloads-with-amazon-api-gateway/

Amazon API Gateway can provide a single-entry point for all incoming API requests for Hybrid Workloads. You can use API Gateway to expose your resources in Amazon Virtual Private Cloud (VPC) and on-premises as REST APIs to external consumers. It provides a layer of abstraction between the API consumers and the backend services, allowing for centralized control. Routing all traffic through the API Gateway lets builders centrally enforce authentication, authorization, rate limiting, and other security features. This blog post describes how to configure API Gateway as an entry point to your on-premises resources.

Hybrid workloads can take advantage of API Gateway acting as single-entry point and provide a consistent interface for cloud and on-premises private API’s. You can connect API Gateway to resources within your private network through VPC link.

Figure 1 – private connectivity through VPC link

When private resources are located in different VPCs or AWS accounts, you can use AWS Transit Gateway or VPC peering to connect them.

Figure 2 – private connectivity through AWS Transit Gateway

You can also connect API Gateway to private resources hosted in your on-premises network.

Prerequisite

This blog assumes that you have an on-premises server hosting an API. Private connectivity between your AWS VPC and on-premises is needed, follow implementation step 1 for establishing private connectivity.

Solution overview

Figure 3 illustrates how to connect API Gateway’s REST API to on-premises application. The following steps detail the setup process.

Figure 3 – REST API architecture diagram for On-Premise applications

Implementation

The proposed solution can be implemented in six major steps:

Step 1. Enable VPC communication with on-premises network
Step 2. Setup Network Load Balancer for private integration with API gateway
Step 3. Create the VPC link
Step 4. Configure the API Gateway
Step 5. Create integration with VPC link
Step 6. Deploy the API

Step 1. Enable VPC communication with on-premises network

In this step we setup connectivity between Amazon VPC and on-premises network

  1. Create a VPC if one isn’t already configured.
  2. If no private connection between the VPC and your on-premises network exists, use either Virtual Private Gateway or AWS Transit Gateway to setup AWS Site-to-Site VPN or AWS Direct Connect.

Step 2. Setup Network Load Balancer for private integration with API gateway

In this step we setup Network Load Balancer required for private integration with API Gateway

  1. Sign in to the AWS Management Console and open the Amazon EC2 console at Amazon EC2 console
  2. Configure target group for your Network Load Balancer.
    Target group is used for request routing to your application. You will register on-premises server IPs in the target group. The load balancer checks the health of targets in this target group using the health check settings defined for the target group.

    1. Open the Amazon EC2 console at Amazon EC2 console.
    2. In the navigation pane, under Load Balancing, choose Target Groups.
    3. Choose Create target group.
    4. Keep the target type as IP addresses
    5. For Target group name, enter a name for the new target group.
    6. For Protocol, choose TCP, and for Port, choose the port where your application is running.
    7. For VPC, select the VPC created in PART A.
    8. For Health checks, keep the default settings.
    9. Choose Next.
    10. On the Register targets page, complete the following steps:
      1. Select the network as Other private IP address and Availability Zone as All
      2. Enter the IP addresses and port of the on-premises application, and then choose Include as pending below.
    11. Choose Create target group.

      Figure 4 – Amazon EC2 console create target group

  3. Configure your load balancer and listener
    To create a Network Load Balancer, you must first provide basic configuration information for your load balancer, such as a name, scheme, and IP address type. Then provide information about your network, and one or more listeners. A listener is a process that checks for connection requests. It is configured with a protocol and a port for connections from clients to the load balancer.

    1. For Load balancer name, enter a name for your load balancer.
    2. For Scheme and IP address type, keep the default values.
    3. For Network mapping, select the VPC that was previously created. Select one subnet each in at least two availability zones for high availability. By default, AWS assigns an IPv4 address to each load balancer node from the subnet for its Availability Zone.
    4. For Security groups, you will have a default security group associated for your VPC. Remove the default security group as it is not required for this setup.Review your configuration, and choose Create load balancer.
    5. For Listeners and routing, select the protocol as TCP and port of your application, and select the target group from the list. This configures a listener that accepts TCP traffic on port that you specify and forwards traffic to the selected target group by default.
    6. Review your configuration, and choose Create load balancer.

      Figure 5 – Amazon EC2 console create load balancer

  4. Turn off security group evaluation for PrivateLink for your Network Load Balancer.
    1. Go to your Network Load Balancer.
    2. Select the Security tab.
    3. Choose Edit.
    4. Clear “Enforce inbound rules on PrivateLink traffic”.
    5. Save changes

      Figure 6 – Amazon EC2 console -> Load Balancers -> Security; turn off security group evaluation

Step 3. Create the VPC Link

In this step we create a VPC link to connect your API and your Network Load Balancer. After you create a VPC link, you create private integrations to route traffic from your API to resources in your VPC through your VPC link and Network Load Balancer. To create VPC link, you need to do the following:

  1. Open the API Gateway console at Amazon API Gateway console
  2. Click on VPC Links in the navigation pane.
  3. Click on Create VPC Link and provide a name for the VPC link.
  4. Select the VPC link for REST APIs and provide the VPC link details following:
    1. For Name, enter the name for your VPC link
    2. For Description(optional), provide a description for your VPC link.
    3. For Target NLB, select the NLB created in the previous step from the dropdown.
  5. Choose Create

    Figure 7 – Amazon API Gateway console create a VPC link

Step 4. Create the API Gateway

In this step we create API Gateway that will have private integration with the Network Load Balancer

  1. Go back to the API Gateway Console Amazon API Gateway console
  2. Choose Create API. Under REST API, choose Build.
  3. Create Regional REST API.
    1. For API details, select New API
    2. For API name, provide a name for your API
    3. For Description(optional), provide a description for your API.
    4. For API endpoint type, select regional from the drop-down option.
  4. Choose Create API.

    Figure 8 – Amazon API Gateway console create REST API

Step 5. Create integration with VPC link

In this step we integrate the VPC link with the API created in the previous step.

  1. Create Resource
    1. From API Gateway console select Create resource
    2. Under Resource details, specify the resource path and resource name
    3. Choose Create resource

      Figure 9 – Amazon API Gateway console create resource for VPC link integration

  2. Create Method
    1. From API Gateway console select Create method.
    2. For Method type, select the desired method.
    3. For Integration type, select VPC link.
    4. Turn on VPC proxy integration.
    5. For HTTP method, select desired method.
    6. For VPC link, select the VPC link from the dropdown menu that was created in the previous steps.
    7. For Endpoint URL, enter a URL for the NLB created in the previous steps along with the port number. For eg: http://nlb-api-integration-xxxxxxxxxxxxxxxx.elb.us-east-1.amazonaws.com:80/on-prem. Assuming the endpoint is going to retrieve /on-prem resource.
    8. Choose Create method.
With the proxy integration, the API is ready for deployment. Otherwise, you need to proceed to set up appropriate method responses and integration responses.

      Figure 10 – Amazon API Gateway console create method and provide method details

      Figure 11 – Amazon API Gateway console create method and provide method details

Step 6. Deploy the API

Final step is to deploy the API. You can do that by using the following steps:

  1. Choose Deploy API
  2. For Stage, select New stage.
  3. For Stage name, enter a stage name.
  4. For Description(optional), enter a description.
  5. Choose Deploy

    Figure 12 – Amazon API Gateway console deploy the created API

Security

Security is the top priority at AWS and operates on a shared responsibility model between AWS and its customers. When managing hybrid APIs, implementing robust security measures is essential since these APIs serve as critical gateways to sensitive data and services. For detailed guidance on securing your REST APIs using API Gateway, please consult our documentation

Cleanup

To prevent incurring additional charges, remove the resources that were created during this walkthrough

  1. Open the API Gateway console.
  2. Select the APIs you created and select delete.
  3. Go to the VPC links in the navigation pane and select the VPC link created. Delete the VPC link.
  4. Within the EC2 console, go to load balancers in the navigation pane and delete the target group and NLB.

Conclusion

This post demonstrates how to configure API Gateway as an entry point for your on-premises resources, providing a unified API interface for your clients.

You can read more about working with API Gateway in AWS documentation and use these capabilities to create architectures to suit your specific requirements. For more serverless learning resources, visit Serverless Land.

Securing Amazon S3 presigned URLs for serverless applications

Post Syndicated from Raaga N.G original https://aws.amazon.com/blogs/compute/securing-amazon-s3-presigned-urls-for-serverless-applications/

Modern serverless applications must be capable of seamlessly handling large file uploads. This blog demonstrates how to leverage Amazon Simple Storage Service (Amazon S3) presigned URLs to allow your users to securely upload files to S3 without requiring explicit permissions in the AWS Account. This blog post specifically focuses on the security ramifications of using S3 presigned URLs, and explains mitigation steps that serverless developers can take to improve the security of their systems using S3 presigned URLs. Additionally, the blog post also walks through an AWS Lambda function that adheres to the provided recommendations, ensuring a robust and secure approach to handling S3 presigned URLs. For more information on S3 presigned URLs, see Working with presigned URLs.

Presigned URL Workflow for Serverless Applications

The following architecture diagram illustrates a serverless application that generates an S3 presigned URL. By using S3 presigned URLs, serverless applications can offload to S3 the computation required to receive files. The diagram captures a seven-step process between the client, Amazon API Gateway, the Lambda function, and S3.

A typical workflow to upload a file to a serverless application hosted on S3 includes the following steps:

  1. Client submits a request to upload a file.
  2. API Gateway receives the client request and invokes a Lambda function that then generates the S3 presigned URL.
  3. The Lambda function makes a getSignedUrl API call to S3.
  4. S3 returns the presigned URL for the object to be uploaded.
  5. The Lambda function returns a presigned URL to the API.
  6. Client receives the S3 presigned URL to upload the file.
  7. Client uploads the file directly to S3 using the presigned URL.

How to Secure Presigned URLs

When designing a serverless application that utilizes S3 presigned URLs to store data in S3, a developer must consider several primary security aspects. S3 presigned URLs are public resources that do not authenticate users, and anyone in possession of a valid S3 presigned URL can access the associated resource. Consequently, it is important to implement additional security measures to ensure that these URLs are not misused or accessed by unauthorized parties. The following blog post contains techniques you can use to make your presigned URLs more secure.

1. Add a Content-MD5 checksum using the X-Amz-Signed header

When you upload an object to S3, you can include a precalculated checksum of the object as part of your request. S3 will perform an integrity check and verify if the object sent is the same as the object received. S3 supports the use of MD5 checksums to verify the integrity of objects uploaded. You provide the MD5 digest by including a Content-MD5 header in the initial PUT request. Upon receiving the object, S3 will calculate the MD5 digest and compare it with the one you originally provided. The upload operation succeeds only if both MD5 digests match, ensuring end-to-end data integrity. If an unintended party gets their hands on the S3 presigned URL, then they will not be able to use it without possessing the same object. This provides protection against arbitrary file uploads.

The key element for a developer to remember is that when the client uploads the file to the S3 presigned URL, it must supply the correct MD5 in Base64 using the Content-MD5 header. Developers can see a sample serverless application with client-side code to extract the MD5 digest, request a S3 presigned URL, and upload a file in this GitHub repositoryThis sample application uses NodeJS v20 in the Lambda function.

2. Expire the S3 presigned URLs 

An S3 presigned URL remains valid for the period of time specified when the URL is generated. It is important to ensure that the S3 presigned URL does not remain accessible for longer than required as it can be reused when still valid. You can define the expiration time of the S3 presigned URL by either passing X-Amz-Expires as a query parameter or by setting the expiresIn parameter when using the AWS SDK for JavaScript.

S3 validates the expiration time and date at the time of initial HTTP Request. However, to support situations where the connection drops and the client needs to restart uploading a file, you may want your S3 presigned URL to remain valid for the entire anticipated time needed to upload the file to S3. The challenge is to generate an S3 presigned URL that is valid long enough to accommodate the file’s upload, yet still short enough that you prevent reuse.

A solution we propose to overcome these challenges is to dynamically set the S3 presigned URL’s expiration time by using the browser Network Information API. Using this new API, when the client browser places the initial request for an S3 presigned URL, the client also transmits the file’s size and the network type, so the Lambda function can calculate the anticipated transfer time.

Within the Lambda function, we can now estimate the transfer time for this size of file on this type of network, using sample code as featured in this GitHub repository.

With the estimated transfer time calculated, the Lambda function can now request the S3 presigned URL and set the expiresIn parameter to the transfer time, resulting in an S3 presigned URL that is only available for the time needed to upload that size of file on this type of network.

If you are using the AWS SDK, you may also be using AWS Signature Version 4 (SigV4) to sign your requests. To create a defense in depth approach, which will place a ceiling on total expiration time, you can utilize condition keys in bucket policies. For an example policy, see Limiting presigned URL capabilities.

3. Generating a UUID to replace the uploaded filename

When an application allows a user to upload files, the application exposes itself to various security threats, such as path traversal attacks. Path traversal vulnerabilities allow attackers to access files that are not meant to be accessed or to overwrite files outside the intended directory structure. In order to secure your applications against such vulnerabilities, the most effective approach is to incorporate user input validation and sanitization. You can sanitize the filename by replacing it with a generated UUID (Universally Unique Identifier).

You can see an example function in the server-side code for Lambda in this GitHub repository.

4. Applying the Principle of Least Privilege and using a separate Lambda function to create S3 presigned URLs

The capabilities of an S3 presigned URL are constrained by the permissions of the principal that created it. To offer fine-grained access, the very first step in limiting use of an S3 presigned URL should be building a specific Lambda function that generates these URLs. By having a Lambda function dedicated to this purpose, you do not risk an overly permissive Lambda function. The second step is to limit your specific Lambda function’s access to S3.

Adhering to the Principle of Least Privilege, it’s important to restrict the Lambda function’s permissions to only the required prefixes in the bucket and allow it to perform only the required actions on the bucket, instead of granting full bucket access. This minimizes the potential attack surface and mitigates the risk of unintended data exposure or modification. It is important to limit the permissions to the minimum required set of actions and resources.

This example AWS Identity and Access Management (IAM) policy demonstrates how to grant the Lambda function read access (GET) to objects within the "Example-Prefix" prefix of a specific S3 bucket. The IAM policy is attached to the Lambda function via an execution role, which together establish what actions the Lambda function can perform.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadStatement",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::EXAMPLE-BUCKET/Example-Prefix/",
        "arn:aws:s3:::EXAMPLE-BUCKET/Example-Prefix/*"
      ],
      "Effect": "Allow"
    }
  ]
}

This example IAM policy demonstrates how to grant the Lambda function permissions to upload (PUT) objects within the "Example-Prefix" prefix of a specific S3 bucket.

{   
    "Version": "2012-10-17",
    "Statement": [
        {   
            "Sid": "UploadStatement",
            "Action": [
                "s3:PutObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::EXAMPLE-BUCKET/Example-Prefix/",
                "arn:aws:s3:::EXAMPLE-BUCKET/Example-Prefix/*"
            ],
            "Effect": "Allow"
        }
    ]
}

This approach will ensure that your Lambda function possesses the minimum required permissions to perform its intended tasks and reduces the risk of unintended data access or modification.

If you want to restrict the use of S3 presigned URLs and all S3 access to a particular network path, you can also define a network-path restriction policy on the S3 Bucket. This restriction on the bucket requires that all requests to the bucket originate from a specified network. AWS Prescriptive Guidance says, an extension of least privilege is to maintain a data perimeter that’s consistent with your organization’s needs. The goal of an AWS perimeter is to ensure that the access is allowed only if the request is coming from a trusted entity, for trusted resources from a trusted network. These data perimeters are applicable to S3 presigned URLs as well.

5.Creating one-time use S3 presigned URLs

Serverless applications developers may want each S3 presigned URL to only be used once. Developers can incorporate a token-based mechanism to facilitate secure one-time use of an S3 presigned URL. This involves generating unique tokens for each authorized user or client and associating these tokens with the S3 presigned URLs. When a client attempts to access the resource using the S3 presigned URL, they must provide the corresponding token for validation. This additional layer of security ensures that only authorized entities can access the S3 presigned URLs and the associated resources. Furthermore, you can leverage a database to track the issued tokens and expire them after each use. A solution to implement such a mechanism has been discussed in detail in How to securely transfer files with presigned URLs.

Cleaning up

You may clean up the sample application by deleting the API Gateway, Lambda function, and S3 bucket. In addition, please do not forget to delete any IAM execution roles you created for the Lambda function.

Conclusion

In this blog we have discussed various considerations that a developer must make when designing an application that leverages S3 presigned URLs. By incorporating robust security measures, such as proper access control, input sanitization, expiration handling and integrity checks, developers can mitigate potential risks when using S3 presigned URLs.

Petabyte-scale data migration made simple: AppsFlyer’s best practice journey with Amazon EMR Serverless

Post Syndicated from Roy Ninio original https://aws.amazon.com/blogs/big-data/petabyte-scale-data-migration-made-simple-appsflyers-best-practice-journey-with-amazon-emr-serverless/

This post is co-written with Roy Ninio from Appsflyer.

Organizations worldwide aim to harness the power of data to drive smarter, more informed decision-making by embedding data at the core of their processes. Using data-driven insights enables you to respond more effectively to unexpected challenges, foster innovation, and deliver enhanced experiences to your customers. In fact, data has transformed how organizations drive decision-making, but historically, managing the infrastructure to support it posed significant challenges and required specific skill sets and dedicated personnel. The complexity of setting up, scaling, and maintaining large-scale data systems impacted agility and pace of innovation. This reliance on experts and intricate setups often diverted resources from innovation, slowed time-to-market, and hindered the ability to respond to changes in industry demands.

AppsFlyer is a leading analytics and attribution company designed to help businesses measure and optimize their marketing efforts across mobile, web, and connected devices. With a focus on privacy-first innovation, AppsFlyer empowers organizations to make data-driven decisions while respecting user privacy and compliance regulations. AppsFlyer provides tools for tracking user acquisition, engagement, and retention, delivering actionable insights to enhance ROI and streamline marketing strategies.

In this post, we share how AppsFlyer successfully migrated their massive data infrastructure from self-managed Hadoop clusters to Amazon EMR Serverless, detailing their best practices, challenges to overcome, and lessons learned that can help guide other organizations in similar transformations.

Why AppsFlyer embraced a serverless approach for big data

AppsFlyer manages one of the largest-scale data infrastructures in the industry, processing 100 PB of data daily, handling millions of events per second, and running thousands of jobs across nearly 100 self-managed Hadoop clusters. The AppsFlyer architecture is comprised of many data engineering open source technologies, including but not limited to Apache Spark, Apache Kafka, Apache Iceberg, and Apache Airflow. Although this setup has powered operations for years, the growing complexity of scaling resources to meet fluctuating demands, coupled with the operational overhead of maintaining clusters, prompted AppsFlyer to rethink their big data processing strategy.

EMR Serverless is a modern, scalable solution that alleviates the need for manual cluster management while dynamically adjusting resources to match real-time workload requirements. With EMR Serverless, scaling up or down happens within seconds, minimizing idle time and interruptions like spot terminations.

This shift has freed engineering teams to focus on innovation, improved resilience and high availability, and future-proofed the architecture to support their ever-increasing demands. By only paying for compute and memory resources used during runtime, AppsFlyer also optimized costs and minimized charges for idle resources, marking a significant step forward in efficiency and scalability.

Solution overview

AppsFlyer’s previous architecture was built around self-managed Hadoop clusters running on Amazon Elastic Compute Cloud (Amazon EC2) and handled the scale and complexity of the data workflows. Although this setup supported operational needs, it required substantial manual effort to maintain, scale, and optimize.

AppsFlyer orchestrated over 100,000 daily workflows with Airflow, managing both streaming and batch operations. Streaming pipelines used Spark Streaming to ingest real-time data from Kafka, writing raw datasets to an Amazon Simple Storage Service (Amazon S3) data lake while simultaneously loading them into BigQuery and Google Cloud Storage to build logical data layers. Batch jobs then processed this raw data, transforming it into actionable datasets for internal teams, dashboards, and analytics workflows. Additionally, some processed outputs were ingested into external data sources, enabling seamless delivery of AppsFlyer insights to customers across the web.

For analytics and fast queries, real-time data streams were ingested into ClickHouse and Druid to power dashboards. Additionally, Iceberg tables were created from Delta Lake raw data and made accessible through Amazon Athena for further data exploration and analytics.

With the migration to EMR Serverless, AppsFlyer replaced its self-managed Hadoop clusters, bringing significant improvements to scalability, cost-efficiency, and operational simplicity.

Spark-based workflows, including streaming and batch jobs, were migrated to run on EMR Serverless and take advantage of the elasticity of EMR Serverless, dynamically scaling to meet workload demands.

This transition has significantly reduced operational overhead, alleviating the need for manual cluster management, so teams can focus more on data processing and less on infrastructure.

The following diagram illustrates the solution architecture.

This post reviews the main challenges and lessons learned by the team at AppsFlyer from this migration.

Challenges and lessons learned

Migrating a large-scale organization like AppsFlyer, with dozens of teams, from Hadoop to EMR Serverless was a significant challenge—especially because many R&D teams had limited or no prior experience managing infrastructure. To provide a smooth transition, AppsFlyer’s Data Infrastructure (DataInfra) team developed a comprehensive migration strategy that empowered the R&D teams to seamlessly migrate their pipelines.

In this section, we discuss how AppsFlyer approached the challenge and achieved success for the entire organization.

Centralized preparation by the DataInfra team

To provide a seamless transition to EMR Serverless, the DataInfra team took the lead in centralizing preparation efforts:

  • Clear ownership – Taking full responsibility for the migration, the team planned, guided, and supported R&D teams throughout the process.
  • Structured migration guide – A detailed, step-by-step guide was created to streamline the transition from Hadoop, breaking down the complexities and making it accessible to teams with limited infrastructure experience.

Building a strong support network

To make sure the R&D teams had the resources they needed, AppsFlyer established a robust support environment:

  • Data community – The primary resource for answering technical questions. It encouraged knowledge sharing across teams and was spearheaded by the DataInfra team.
  • Slack support channel – A dedicated channel where the DataInfra team actively responded to questions and guided teams through the migration process. This real-time support significantly reduced bottlenecks and helped teams resolve issues quickly.

Infrastructure templates with best practices

Recognizing the complexity of the team’s migration, the DataInfra team had standardized templates to help teams start quickly and efficiently:

  • Infrastructure as code (IaC) templates – They developed Terraform templates with best practices for building applications on EMR Serverless. These templates included code examples and real production workflows already migrated to EMR Serverless. Teams could quickly bootstrap their projects by using these ready-made templates.
  • Cross-account access solutions – Operating across multiple AWS accounts required managing secure access between EMR Serverless accounts (where jobs run) and data storage accounts (where datasets reside). To streamline this, a step-by-step module was developed for setting up cross-account access using Assume Role permissions. Additionally, a dedicated repository was created, so teams can define and automate role and policy creation, providing seamless and scalable access management.

Airflow integration

As AppsFlyer’s primary workflow scheduler, Airflow plays a critical role, making it essential to provide a seamless transition for its users.

AppsFlyer developed a dedicated Airflow operator for executing Spark jobs on EMR Serverless, carefully designed to replicate the functionality of the existing Hadoop-based Spark operator. In addition, a Python package was made available across all Airflow clusters with the relevant operators. This approach minimized code changes, allowing teams to transition seamlessly with minimal modifications.

Solving common permission challenges

To streamline permissions management, AppsFlyer developed targeted solutions for frequent use cases:

  • Comprehensive documentation – Provided detailed instructions for handling permissions for services like Athena, BigQuery, Vault, GIT, Kafka, and many more.
  • Standardized Spark defaults configuration for teams to apply to their applications – Included built-in solutions for collecting lineage from Spark jobs running on EMR Serverless, providing accountability and traceability.

Continuous engagement with R&D teams

To promote progress and maintain alignment across teams, AppsFlyer introduced the following measures:

  • Weekly meetings – Weekly status meetings to review the status of each team’s migration efforts. Teams shared updates, challenges, and commitments, fostering transparency and collaboration.
  • Assistance – Proactive assistance was provided for issues raised during meetings to minimize delays. This made sure that the teams were on track and had the support they needed to meet their commitments.

By implementing these strategies, AppsFlyer transformed the migration process from a daunting challenge into a structured and well-supported journey. Key outcomes included:

  • Empowered teams – R&D teams with minimal infrastructure experience were able to confidently migrate their pipelines.
  • Standardized practices – Infrastructure templates and predefined solutions provided consistency and best practices across the organization.
  • Reduced downtime – The custom Airflow operator and detailed documentation minimized disruptions to existing workflows.
  • Cross-account compatibility – With seamless cross-account access, teams could run jobs and access data efficiently.
  • Improved collaboration – The data community and Slack support channel fostered a sense of collaboration and shared responsibility across teams.

Migrating an entire organization’s data workflows to EMR Serverless is a complex task, but by investing in preparation, templates, and support, AppsFlyer successfully streamlined the process for all R&D teams in the company.

This approach can serve as a model for organizations undertaking similar migrations.

Spark application code management and deployment

For AppsFlyer data engineers, developing and deploying Spark applications is a core daily responsibility. The Data Platform team focuses on identifying and implementing the right set of tools and safeguards that would not only simplify the migration to EMR Serverless, but also streamline ongoing operations.

There are two different approaches available for running Spark code on EMR Serverless: custom container images and JARs or Python files. At the beginning of the exploration, custom images looked promising because it allows greater customization than JARs, which should allow the DataInfra team smoother migration for existing workloads. After deeper research, it was realized that custom images have great power, but come with a cost that in large scale would need to be evaluated. Custom images presented the following challenges:

  • Custom images are supported as of version 6.9.0, but some of AppsFlyer’s workloads used earlier versions.
  • EMR Serverless resources run from the moment EMR Serverless begins downloading the image until workers are stopped. This means a payment is done for aggregate vCPU, memory, and storage resources during the image download phase.
  • They required a different continuous integration and delivery (CI/CD) approach than compiling a JAR or Python file, leading to operational work that should be minimized as much as possible.

AppsFlyer decided to go all in with JARs and allow only in unique cases, where the customization required the use of custom images. Eventually, it was realized that using non-custom images was suitable for AppsFlyer use cases.

CI/CD perspective

From a CI/CD perspective, AppsFlyer’s DataInfra team decided to align with AppsFlyer’s GitOps vision, making sure that both infrastructure and application code are version-controlled, built, and deployed using Git operations.

The following diagram illustrates the GitOps approach AppsFlyer adopted.

JARs continuous integration

For CI, the process in charge of building the application artifacts, several options have been explored. The following key considerations drove the exploration process:

  • Use Amazon S3 as the native JAR source for EMR Serverless
  • Support different versions for the same job
  • Support staging and production environments
  • Allow hotfixes, patches, and rollbacks

Using AppsFlyer’s current external package repository led to challenges, because it required them to build a custom delivery into Amazon S3 or a complex runtime ability to fetch the code externally.

Using Amazon S3 directly also had several alternative approaches:

  • Buckets – Use single vs. separated buckets for staging and production
  • Versions – Use Amazon S3 native object versioning vs. uploading a new file
  • Hotfix – Override the same job’s JAR file vs. uploading a new one

Finally, the decision was to go with immutable builds for consistent deployment across the environments.

Each Spark job git repository pushes to the main branch, triggers a CI process to validate the semantic versioning (semver) assignment, compiles the JAR artifact, and uploads it to Amazon S3. Each artifact is uploaded to three different paths according to the version of the JAR, and also include a version tag for the S3 object:

  • <BucketName>/<SparkJobName>/<major>"."<minor>"."<patch>/app.jar
  • <BucketName>/<SparkJobName>/<major>"."<minor>"/app.jar
  • <BucketName>/<SparkJobName>/<major>/app.jar

AppsFlyer can now have deep granularity and assign each EMR Serverless job to a pinpointed version. Some jobs can run with the latest major version, and other stability and SLA sensitive jobs require a lock to a specific patch version.

EMR Serverless continuous deployment

Uploading the files to Amazon S3 was the final step in the CI process, which then leads to a different CD process.

CD is done by changing the infrastructure code, which is Terraform based, to point to the new JAR that was uploaded to Amazon S3. Then the staging or production application can start using the newly uploaded code and the process can be considered deployed.

Spark application rollbacks

If they need an application rollback, AppsFlyer points the EMR Serverless job IaC configuration from the current impaired JAR version to the previous stable JAR version in the relevant Amazon S3 path.

AppsFlyer believes that every automation impacting production, like CD, requires a breaking glass mechanism for an emergency situation. In such cases, AppsFlyer can manually override the needed S3 object (JAR file) while still using Amazon S3 versions in order to have better visibility and manual version control.

Single-job vs. multi-job applications

When using EMR Serverless, one important architectural decision is whether to create a separate application for each Spark job or use an automatic scaling application shared across multiple Spark jobs. The following table summarizes these considerations.

Aspect Single-Job Application Multi-Job Application
Logical Nature Dedicated application for each job. Shared application for multiple jobs.
Shared Configurations Limited shared configurations; each application is independently configured. Allows shared configurations through spark-defaults, including executors, memory settings, and JARs.
Isolation Maximum isolation; each job runs independently. Maintains job-level isolation through distinct IAM roles despite sharing the application.
Flexibility Flexible for unique configurations or resource requirements. Reduces overhead by reusing configurations and using automatic scaling.
Overhead Higher setup and management overhead due to multiple applications. Lower administrative overhead but requires careful resource contention management.
Use Cases Suitable for jobs with unique requirements or strict isolation needs. Ideal for related workloads that benefit from shared settings and dynamic scaling.

By balancing these considerations, AppsFlyer tailored its EMR Serverless usage to efficiently meet the demands of diverse Spark workloads across their teams.

Airflow operator: Simplifying the transition to EMR Serverless

Before the migration to EMR Serverless, AppsFlyer’s teams relied on a custom Airflow Spark operator created by the DataInfra team.

This operator, packaged as a Python library, was integrated into the Airflow environment and became a key component of the data workflows.

It provided essential capabilities, including:

  • Retries and alerts – Built-in retry logic and PagerDuty alert integration
  • AWS role-based access – Automatic fetching of AWS permissions based on role names
  • Custom defaults – Setting Spark configurations and package defaults tailored for each job
  • State management – Job state tracking

This operator streamlined running Spark jobs on Hadoop and was highly tailored to AppsFlyer’s requirements.

When moving to EMR Serverless, the team chose to build a custom Airflow operator to align with their existing Spark-based workflows. They already had dozens of Directed Acyclic Graphs (DAGs) in production, so with this approach, they could maintain their familiar interface, including custom handling for retries, alerting, and configurations—all without requiring broad changes across the board.

This abstraction provided a smoother migration by preserving the same development patterns and minimizing the migration efforts of adapting to the native operator semantics.

The DataInfra team developed a dedicated, custom, EMR Serverless operator to support the following goals:

  • Seamless migration – The operator was designed to closely mimic the interface of the existing Spark operator on Hadoop. This made sure that teams could migrate with minimal code changes.
  • Feature parity – They added the features missing from the native operator:
    • Built-in retry logic.
    • PagerDuty integration for alerts.
    • Automatic role-based permission fetching.
    • Default Spark configurations and package support for each job.
  • Simplified integration – It’s packaged as a Python library available in Airflow clusters. Teams could use the operator just like they did with the previous Spark operator.

The custom operator abstracts some of the underlying configurations required to submit jobs to EMR Serverless, aligning with AppsFlyer’s internal best practices and adding essential features.

The following is from an example DAG using the operator:

return SparkBatchJobEmrServerlessOperator(
    task_id=task_id,  # Unique task identifier in the DAG

    jar_file=jar_file,  # Path to the Spark job JAR file on S3
    main_class="<main class path>",

    spark_conf=spark_conf,

    app_id=default_args["<emr_serverless_application_id>"],  # EMR Serverless app ID
    execution_role=default_args["<job_execution_role_arn>"],  # IAM role for job execution

    polling_interval_sec=120,  # How often to poll for job status
    execution_timeout=timedelta(hours=1),  # Max allowed runtime

    retries=5,  # Retry attempts for failed jobs
    app_args=[],  # Arguments to pass to the Spark job

    depends_on_past=True,  # Ensure sequential task execution

    tags={'owner': '<team_tag>'},  # Metadata for ownership
    aws_assume_role="<my_aws_role>",  # Role for cross-account access

    alerting_policy=ALERT_POLICY_CRITICAL.with_slack_channel(sc),  # Alerting integration
    owner="<team_owner>",

    dag=dag  # DAG this task belongs to
)

Cross-account permissions on AWS: Simplifying EMRs workflows

AppsFlyer operates across multiple AWS accounts, creating a need for secure and efficient cross-account access. EMR Serverless jobs are executed in the production account, and the data they process resides in a separate data account. To enable seamless operation, Assume Role permissions are used to verify that EMR Serverless jobs running in the production account can access the data and services in the data account. The following diagram illustrates this architecture.

Below is a diagram demonstrating the cross-account permissions AppsFlyer adopted:

Role management strategy

To manage cross-account access efficiently, three distinct roles were created and maintained:

  • EMR role – Used for executing and managing EMR Serverless applications in the production account. Integrated directly into Airflow workers to make it available for the DAGs on the dedicated team Airflow cluster.
  • Execution role – Assigned to the Spark job running on EMR Serverless. Passed by the EMR role in the DAG code to provide seamless integration.
  • Data role – Resides in the data account and is assumed by the execution role to access data stored in Amazon S3 and other AWS services.

To enforce access boundaries, each role and policy is tagged with team-specific identifiers.
This makes sure that teams can only access their own data and roles, minimizing unauthorized access to other teams’ resources.

Simplifying Airflow migration

A streamlined process to make cross-account permissions transparent for teams migrating their workloads to EMR Serverless was developed:

  1. The EMR role is embedded into Airflow workers, making it available for DAGs in the dedicated Airflow cluster for each team:
{
   "Version":"2012-10-17",
   "Statement":[
      "..."{
         "Effect":"Allow",
         "Action":"iam:PassRole",
         "Resource":"arn:aws:iam::account-id:role/execution-role",
         "Condition":{
            "StringEquals":{
               "iam:ResourceTag/Team":"team-tag"
            }
         }
      }
   ]
}
  1. The EMR role automatically passes the execution role to the job within the DAG code:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": "arn:aws:iam::data-account-id:role/data-role",
      "Condition": {
        "StringEquals": {
          "iam:ResourceTag/Team": "team-tag"
        }
      }
    }
  ]
}
  1. The execution role assumes the data role dynamically during job execution to access the required data and services in the data account:

Allows the Execution Role in the Production account to assume the Data Role.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::production-account-id:role/execution-role"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
  1. Policies, trust relationships, and role definitions are managed in a dedicated GitLab repository. GitLab CI/CD pipelines automate the creation and integration of roles and policies, providing consistency and reducing manual overhead.

Benefits of AppsFlyer’s approach

This approach offered the following benefits:

  • Seamless access – Teams no longer need to handle cross-account permissions manually because these are automated through preconfigured roles and policies, providing seamless and secure access to resources across accounts.
  • Scalable and secure – Role-based and tag-based permissions provide security and scalability across multiple teams and accounts. By using roles and tags, it alleviates the need to create separate hardcoded policies for each team or account. Instead, they can define generalized policies that scale automatically as new resources, accounts, or teams are added.
  • Automated management – GitLab CI/CD streamlines the deployment and integration of policies and roles, reducing manual effort while enhancing consistency. It also minimizes human errors, improves change transparency, and simplifies version management.
  • Flexibility for teams – Teams have the flexibility to use their own or native EMR Serverless operators while maintaining secure access to data.

By implementing a robust, automated cross-account permissions system, AppsFlyer has enabled secure and efficient access to data and services across multiple AWS accounts. This makes sure that teams can focus on their workloads without worrying about infrastructure complexities, accelerating their migration to EMR Serverless.

Integrating lineage into EMR Serverless

AppsFlyer developed a robust solution for column-level lineage collection to provide comprehensive visibility into data transformations across pipelines. Lineage data is stored in Amazon S3 and subsequently ingested into DataHub, AppsFlyer’s lineage and metadata management environment.

Currently, AppsFlyer collects column-level lineage from a variety of sources, including Amazon Athena, BigQuery, Spark, and more.

This section focuses on how AppsFlyer collects Spark column-level lineage specifically within the EMR Serverless infrastructure.

Collecting Spark lineage with Spline

To capture lineage from Spark jobs, AppsFlyer uses Spline, an open source tool designed for automated tracking of data lineage and pipeline structures.

AppsFlyer modified Spline’s default behavior to output a customized Spline object that aligns with AppsFlyer’s specific requirements. AppsFlyer adapted the Spline integration into both legacy and modern environments. In the pre-migration phase, they injected the Spline agent into Spark jobs through their customized Airflow Spark operator. In the post-migration phase, they integrated Spline directly into EMR Serverless applications.

The lineage workflow consists of the following steps:

  1. As Spark jobs execute, Spline captures detailed metadata about the queries and transformations performed.
  2. The captured metadata is exported as Spline object files to a dedicated S3 bucket.
  3. These Spline objects are processed into column-level lineage objects customized to fit AppsFlyer’s data architecture and requirements.
  4. The processed lineage data is ingested into DataHub, providing a centralized and interactive view of data dependencies.

The following figure is an example of a lineage diagram from DataHub.

Challenges and how AppsFlyer addressed them

AppsFlyer encountered the following challenges:

  • Supporting different EMR Serverless applications – Each EMR Serverless application has its own Spark and Scala version requirements.
  • Diverse operator usage – Teams often use custom or native EMR Serverless operators, making uniform Spline integration challenging.
  • Confirming universal adoption – They need to make sure Spark jobs across multiple accounts use the Spline agent for lineage tracking.

AppsFlyer addressed these challenges with the following solutions:

  • Version-specific Spline agents – AppsFlyer created a dedicated Spline agent for each EMR Serverless application version to match its Spark and Scala versions. For example, EMR Serverless application version 7.0.1 and Spline.7.0.1.
  • Spark defaults integration – They integrated the Spline agent into EMR Serverless application Spark defaults to verify lineage collection for jobs executed on the application—no job-specific modifications needed.
  • Automation for compliance – This process consists of the following steps:
    • Detect a newly created EMR Serverless application across accounts.
    • Verify that Spline is properly defined in the application’s Spark defaults.
    • Send a PagerDuty alert to the dedicated team if misconfigurations are detected.

Example integration with Terraform

To automate Spline integration, AppsFlyer used Terraform and local-exec to define Spark defaults for EMR Serverless applications. With Amazon EMR, you can set unified Spark configuration properties through spark-defaults, which are then applied to Spark jobs.

This configuration makes sure the Spline agent is automatically applied to every Spark job without requiring modifications to the Airflow operator or the job itself.

This robust lineage integration provides the following benefits:

  • Full visibility – Automatic lineage tracking provides detailed insights into data transformations
  • Seamless scalability – Version-specific Spline agents provide compatibility with EMR Serverless applications
  • Proactive monitoring – Automated compliance checks verify that lineage tracking is consistently enabled across accounts
  • Enhanced governance – Ingesting lineage data into DataHub provides traceability, supports audits, and fosters a deeper understanding of data dependencies

By integrating Spline with EMR Serverless applications, AppsFlyer has provided comprehensive and automated lineage tracking, so teams can understand their data pipelines better while meeting compliance requirements. This scalable approach aligns with AppsFlyer’s commitment to maintaining transparency and reliability throughout their data landscape.

Monitoring and observability

When embarking on a large migration, and as a day-to-day best-practice process, monitoring and observability are key parts of being able to run workloads successfully for stability, debugging, and cost.

AppsFlyer’s DataInfra team set several KPIs for monitoring and observability in EMR Serverless:

  • Monitor infrastructure-level metrics and logs:
    • EMR Serverless resource usage, including cost
    • EMR Serverless API usage
  • Monitor Spark application-level metrics and logs:
    • stdout and stderr logs
    • Spark engine metrics
  • Centralized observability over the existing environments, Datadog

Metrics

Using EMR Serverless native metrics, AppsFlyer’s DataInfra team set up several dashboards to support tracking both the migration and the day-to-day usage of EMR Serverless across the company. The following are the main metrics that were monitored:

  • Service quota usage metrics:
    • vCPU usage tracking (ResourceCount with vCPU dimension)
    • API usage tracking (API actual usage vs. API limits)
  • Application status metrics:
    • RunningJobs, SuccessJobs, FailedJobs, PendingJobs, CancelledJobs
  • Resource limits tracking:
    • MaxCPUAllowed vs. CPUAllocated
    • MaxMemoryAllowed vs. MemoryAllocated
    • MaxStorageAllowed vs. StorageAllocated
  • Worker-level metrics:
    • WorkerCpuAllocated vs. WorkerCpuUsed
    • WorkerMemoryAllocated vs. WorkerMemoryUsed
    • WorkerEphemeralStorageAllocated vs. WorkerEphemeralStorageUsed
  • Capacity allocation tracking:
    • Metrics filtered by CapacityAllocationType (PreInitCapacity vs. OnDemandCapacity)
    • ResourceCount
  • Worker type distribution:
    • Metrics filtered by WorkerType (SPARK_DRIVER vs. SPARK_EXECUTORS)
  • Job success rates over time:
    • SuccessJobs vs. FailedJobs ratio
    • SubmitedJobs vs. PendingJobs

The following screenshot shows an example of the tracked metrics.

Logs

For logs management, AppsFlyer’s DataInfra team explored several options:

Streamlining EMR Serverless log shipping to Datadog

Because AppsFlyer decided to keep their logs in an external logging environment, the DataInfra team aimed to reduce the number of components involved in the shipping process and minimize maintenance overhead. Instead of managing a Lambda based log shipper, they developed a custom Spark plugin that seamlessly exports logs from EMR Serverless to Datadog.

Companies already storing logs in Amazon S3 or CloudWatch Logs can take advantage of EMR Serverless native support for those environments. However, for teams needing a direct, real-time integration with Datadog, this approach alleviates the need for extra infrastructure, providing a more efficient and maintainable logging solution.

The custom Spark plugin offers the following capabilities:

  • Automated log export – Streams logs from EMR Serverless to Datadog
  • Fewer extra components – Alleviates the need for Lambda based log shippers
  • Secure API key management – Uses Vault instead of hardcoding credentials
  • Customizable logging – Supports custom Log4j settings and log levels
  • Full integration with Spark – Works on both driver and executor nodes

How the plugin works

In this section, we walk through the components of how the plugin works and provide a pseudocode overview:

  • Driver pluginLoggerDriverPlugin runs on the Spark driver to configure logging. The plugin fetches EMR job metadata, calls Vault to retrieve the Datadog API key, and configures logging settings.
initialize() {
  if (user provided log4j.xml) {
     Use custom log configuration
  } else {
     Fetch EMR job metadata (application name, job ID, tags)
     Retrieve Datadog API key from Vault
     Apply default logging settings
  }
}
  • Executor plugin – LoggerExecutorPlugin provides consistent logging across executor nodes. It inherits the driver’s log configuration and makes sure the executors use consistent logging
initialize() {
   fetch logging config from Driver
   apply log settings (log4j, log levels)
}
  • Main plugin – LoggerSparkPlugin registers the driver and executor plugins in Spark. It serves as the entry point for Spark and applies custom logging settings dynamically.
function registerPlugin() {
  return (driverPlugin, executorPlugin);
}
loginToVault(role, vaultAddress) {
    create AWS signed request
    authenticate with Vault
    return vault token
}

getDatadogApiKey(vaultToken, secretPath) {
    fetch API key from Vault
    return key
}

Set up the plugin

To set up the plugin, complete the following steps:

  1. Add the following dependencies to your project:
<dependency>
  <groupId>com.AppsFlyer.datacom</groupId>
  <artifactId>emr-serverless-logger-plugin</artifactId>
  <version><!-- insert version here --></version>
</dependency>
  1. Configure the Spark plugin. The following code enables the custom Spark plugin and assigns the Vault role to access the Datadog API key:

--conf "spark.plugins=com.AppsFlyer.datacom.emr.plugin.LoggerSparkPlugin"

--conf "spark.datacom.emr.plugin.vaultAuthRole=your_vault_role"

  1. Use a custom or default Log4j configuration:

--conf "spark.datacom.emr.plugin.location=classpath:my_custom_log4j.xml"

  1. Set the environment variables for different log levels. This adjusts the logging for specific packages.

--conf "spark.emr-serverless.driverEnv.ROOT_LOG_LEVEL=WARN"

--conf "spark.executorEnv.ROOT_LOG_LEVEL=WARN"

--conf "spark.emr-serverless.driverEnv.LOG_LEVEL=DEBUG"

--conf "spark.executorEnv.LOG_LEVEL=DEBUG"

  1. Configure the Vault and Datadog API key and verify secure Datadog API key retrieval.

By adopting this plugin, AppsFlyer was able to significantly simplify log shipping, reducing the number of moving parts while maintaining real-time log visibility in Datadog. This approach provides reliability, security, and ease of maintenance, making it an ideal solution for teams using EMR Serverless with Datadog.

Summary

Through their migration to EMR Serverless, AppsFlyer achieved a significant transformation in team autonomy and operational efficiency. Individual teams now have greater freedom to choose and build their own resources without depending on a central infrastructure team, and can work more independently and innovatively. The minimization of spot interruptions, which were common in their previous self-managed Hadoop clusters, has substantially improved stability and agility in their operations. Thanks to this autonomy and reliability, combined with the automatic scaling capabilities of EMR Serverless, the AppsFlyer teams can focus more on data processing and innovation rather than infrastructure management. The result is a more efficient, flexible, and self-sufficient development environment where teams can better respond to their specific needs while maintaining high performance standards.

Ruli Weisbach, AppsFlyer EVP of R&D, says,

“EMR-Serverless is a game changer for AppsFlyer; we are able to save significantly our cost with remarkably lower management overhead and maximal elasticity.”

If the AppsFlyer approach sparked your interest and you are thinking about implementing a similar solution in your organization, refer to the following resources:

Migrating to EMR Serverless can transform your organization’s data processing capabilities, offering a fully managed, cloud-based experience that automatically scales resources and eases the operational complexity of traditional cluster management, while enabling advanced analytics and machine learning workloads with greater cost-efficiency.


About the authors

Roy Ninio is an AI Platform Lead with deep expertise in scalable data platform and cloud-native architectures. At AppsFlyer, Roy led the design of a high-performance Data Lake handling PB of daily events, driven the adoption of EMR Serverless for dynamic big data processing, and architected lineage and governance systems across platforms.

Avichay Marciano is a Sr. Analytics Solutions Architect at Amazon Web Services. He has over a decade of experience in building large-scale data platforms using Apache Spark, modern data lake architectures, and OpenSearch. He is passionate about data-intensive systems, analytics at scale, and it’s intersection with machine learning.

Eitav Arditti is AWS Senior Solutions Architect with 15 years in AdTech industry, specializing in Serverless, Containers, Platform engineering, and Edge technologies. Designs cost-efficient, large-scale AWS architectures that leverage the cloud-native and edge computing to deliver scalable, reliable solutions for business growth.

Yonatan Dolan is a Principal Analytics Specialist at Amazon Web Services. Yonatan is an Apache Iceberg evangelist, helping customers design scalable, open data lakehouse architectures and adopt modern analytics solutions across industries.

Monitoring network traffic in AWS Lambda functions

Post Syndicated from Anton Aleksandrov original https://aws.amazon.com/blogs/compute/monitoring-network-traffic-in-aws-lambda-functions/

Network monitoring provides essential visibility into cloud application traffic patterns across large organizations. It enables security and compliance teams to detect anomalies and maintain compliance, while allowing development teams to troubleshoot issues, optimize performance, and track costs in multi-tenant software as a service (SaaS) environments. Implementing robust network monitoring allows organizations to effectively manage their security, compliance, and operational requirements while continuously enhancing their applications.

In this post, you will learn methods for network monitoring in AWS Lambda functions and how to apply them to your scenarios.

Overview

Lambda is a secure and highly scalable serverless compute service where each function operates in an isolated execution environment with strict security boundaries. This architecture delivers key advantages, such as enhanced security, automatic compute capacity scaling, and minimal operational overhead. Minimizing infrastructure management allows Lambda to enable organizations to redirect their focus from managing servers to other critical aspects, such as performance optimization and network traffic analysis. In turn, these enable organizations to build more secure and efficient applications.

Lambda network monitoring addresses diverse organizational needs, such as compliance requirements for audit logs and anomaly detection, business needs for traffic metering and customer billing, and development needs for troubleshooting network issues. Traditional agent-based or host-based monitoring methods often aren’t compatible with the strongly isolated, ephemeral execution environment of Lambda, which necessitates alternative approaches to meet these critical requirements.

You can use AWS-native, integrated network monitoring solutions, such as Amazon Virtual Private Cloud (Amazon VPC) Flow Logs, or build your own custom solution, as detailed in this post. Each solution offers distinct capabilities with varying levels of granularity and real-time visibility. When choosing an approach, you must evaluate key factors such as the desired data granularity, operational complexity, latency tolerance, and cost implications.

Using VPC Flow Logs

VPC Flow Logs is the AWS-recommended tool for network activity monitoring. If your scenario necessitates monitoring of the network activity of Lambda functions, then you can attach these functions to a VPC and enable Flow Logs. This captures detailed network traffic data, such as source and destination IPs, ports, protocols, and traffic volume for all traffic flowing through the network interfaces used by your functions.

When you attach your functions to a VPC, the Lambda service automatically creates an Elastic Network Interface (ENI) for functions to communicate with VPC-based resources. By default, VPC-attached functions can only access private resources within the VPC. If you need your functions to communicate with other AWS services, then you should use VPC Endpoints. If your function needs to communicate with the public internet, then you should use an NAT Gateway for egress traffic. The following diagram shows how you can use VPC Flow Logs for Lambda functions.

Flow Logs provide detailed insights into the IP traffic flowing to and from the network interfaces within a VPC, offering valuable data for network audits and activity monitoring. This approach promotes a clear separation of concerns between application and networking layers, with VPC constructs typically managed by a dedicated operations or infrastructure team.

VPC Flow Logs provides a robust network monitoring solution. However, when using it with Lambda functions, you should evaluate the following considerations:

  • It captures ENI-level information. ENIs can be reused across multiple functions, thus it won’t provide per-function or per-invoke granularity.
  • It only logs IP addresses, not DNS names (if capturing DNS names is a requirement for you).
  • It introduces infrastructure management into your serverless applications. You must learn VPC constructs or involve your infrastructure team.
  • Potential added data transfer costs. Go to the pricing for NAT Gateway, VPC Endpoints, and Flow Logs for more details.

The following sections explore Lambda network monitoring methods that can either be augmented with VPC Flow Logs for better granularity or used without attaching your functions to a VPC.

Proxying network traffic

You can configure the Lambda runtime to route egress network traffic through a side-car proxy that runs as a Lambda layer within the Lambda execution environment and logs network activity. The proxy layer should be agnostic to the language runtime. AWS recommends that you use compiled languages such as Rust or Golang for maximum reusability and minimal added latency. The following diagram shows emitting logs from a network proxy layer.

Applying proxy configuration differs across language runtimes. In Python you set proxy_http and proxy_https environment variables. Java uses JVM flags. Node.js doesn’t provide a native way to configure proxy using command line flags or environment variables. Therefore, you need to make code changes, such as configuring a proxy for your AWS SDK or using a third-party open source libraries such as global-agent or Interceptors.

The proxy approach is most suitable if you’re okay with making function code or configuration changes that might vary across runtimes. Furthermore, adding a network proxy server process inside the execution environment consumes resources shared with the function code, which can add network latency.

Refer to the post Enhancing runtime security and governance with the AWS Lambda Runtime API proxy extension for ways to intercept inbound requests/responses between the Lambda Runtime API and Runtime process.

Runtime-agnostic techniques

Following techniques use the fact that the Lambda execution environment is a Linux-based micro-VM. Lambda runtimes operate within a restricted user space that prevents the use of traditional OS-level monitoring tools needing elevated privileges, such as tcpdump, iptables, ptrace, or eBPF. The following techniques are specifically designed to work under these user space constraints, allowing their use without needing elevated privileges.

Reading OS networking layer information from procfs

Use this method when you need to obtain the OS-level information, such as metering transferred bytes, or see all open connections. You can use it to implement tenant chargebacks or detect network traffic pattern changes. The method is based on the proc pseudo-filesystem (also known as procfs) available in Linux OS, which provides an interface to kernel data and allows you to read OS-level information. For example, /proc/cpuinfo and /proc/meminfo pseudo-files provide information about current CPU and memory usage, while /proc/net/* provides you with network layer information. Reading /proc/net/tcp and /proc/net/udp gives you a list of active TCP/UDP connections, such as remote IP addresses and ports. Reading /proc/net/dev provides the list of network devices with detailed usage statistics, such as bytes transferred and received.

“The procfs method provides a simple but powerful way to collect critical network telemetry data from Lambda functions, such as up-to-date network statistics and file descriptor counts, which enables us to monitor outbound connections from Lambda functions. Better yet, it enables us to support multiple Lambda runtimes with a single implementation in our Rust-based, next-generation Lambda Extension”—AJ Stuyvenberg, Staff Engineer at Datadog.

The sample project provides the LambdaNetworkMonitor-Procfs stack to show this technique. For every invocation, the function reads /proc/net/dev, and sends network statistics to log and Amazon CloudWatch Metrics, as shown in the following figure.

Reading the /proc/net/dev pseudo-file is a simple and effective way to monitor Lambda functions networking without adding latency. However, it doesn’t capture DNS names and the IP addresses to which they resolve.

Intercepting network-related libc calls

Low-level network operations in Linux, such as DNS lookup and connection creation, are managed by the C Standard Library (libc). You can intercept libc function calls made by Lambda runtimes to monitor network traffic at the OS level. This is a significantly more advanced and complex mechanism, enabling visibility into OS-level activities, as shown in the following figure.

Intercepting libc function calls, such as getaddrinfo (DNS lookup) and connect, allows you to log details such as DNS name, IP addresses, ports, and protocols at a granular level, as shown in the following figure. This method allows you to capture comprehensive information about DNS queries and initiated network connections. It can provide precise per-function and per-invoke network monitoring, such as hostnames and IP addresses.

The following is a simplified flow:

  1. A function sends a request to example.com.
  2. The runtime calls libc getaddrinfo to resolve the DNS name.
  3. You intercept this call, log the DNS name, and forward the call to the original libc getaddrinfo.
  4. The original libc getaddrinfo returns resolved IP addresses. You log them and return to the runtime.
  5. The runtime calls libc connect method to create a new connection.
  6. You intercept this call, log the IP address, forward the call to the original libc connect, and so on.

To implement this technique, you need to use a language that compiles to a shared object (.so) file. To implement libc method signatures you should use a language such as C, C++, or Rust. The following sample code uses Rust for its strong safety guarantees and implements overriding the getaddrinfo libc function (DNS lookup).

pub extern "C" fn getaddrinfo(
    node: *const c_char,
    service: *const c_char,
    hints: *const addrinfo,
    res: *mut *mut addrinfo,
) -> c_int {
    let printable_node = format!("{}", PrintableCString::from(node));
    let printable_service = format!("{}", PrintableCString::from(service));

    log::debug!("> getaddrinfo node={printable_node} service={printable_service}");

    LIBC_GETADDRINFO(node, service, hints, res)
}

The following should be considered:

  • The method signature fully matches the libc function signature of the same name.
  • The node and service arguments would commonly be DNS name and port.
  • At the end, the function invokes the real libc getaddrinfo and returns the result.

When compiled to an .so file, you must package it as a Lambda layer, attach the layer to your functions, and configure the execution environment to use it through the Linux dynamic linker capability called preloading. Set the LD_PRELOAD environment variable to point to your .so file to instruct the OS to preload it before it loads any other library, such as libc. You can configure this either as a function environment variable or through a wrapper script, as shown in the following figure.

#!/bin/sh
echo "running wrapper..."
args=("$@")
export LD_PRELOAD=/opt/liblambda_network_monitor.so
exec "${args[@]}"

This technique allows you to get detailed connection-level monitoring such as DNS lookups, resolved IP addresses, ports, protocols, and count bytes transferred. Depending on your requirements, it can be adapted to track further network-related information as needed.

The sample project provides the LambdaNetworkMonitor-LdPreload stack to show this technique, as shown in the following figure. For every invocation, the function prints intercepted libc functions, DNS names, and connection IP addresses.

Considerations

  • OS-level techniques necessitate strong understanding of Linux fundamentals and careful implementation. AWS recommends that you closely evaluate which methods to use and keep your solution minimally invasive.
  • LD_PRELOAD is an advanced low-level technique that allows you to override libc methods and OS behavior. Incorrectly implemented hooks could lead to undefined behavior and crashes. Make sure your code is robust to recursion and thread-safe. Test it thoroughly in a controlled environment before using it in production.
  • The LD_PRELOAD technique relies on dynamic linking. It works with dynamically linked runtimes such as Node.js, Python, and Java. It doesn’t work with runtimes that use static linking, such as Golang.
  • When using runtime-dependent functionality, consider using Lambda runtime update controls to make sure that runtimes are only updated when the next function update occurs.
  • Always install layers from trusted sources only. Use infrastructure as code (IaC) tools to attach and audit layer configurations, such as AWS Identity and Access Management (IAM) permissions.

Conclusion

Monitoring network traffic in Lambda functions is a common requirement for many organizations. In case you need to audit IP-level network logs, AWS recommends that you to attach your functions to a VPC and use Flow Logs. If you need per-function or per-invoke granularity, then you can augment it with techniques described in this post.

These techniques provide valuable insights for debugging, auditing, and monitoring, but they also necessitate a solid understanding of Linux fundamentals and careful implementation. They offer a practical solution for organizations that need Lambda network monitoring, allowing them to troubleshoot issues and maintain compliance.

To learn more about Serverless architectures and asynchronous Lambda invocation patterns, go to Serverless Land.

AWS Lambda introduces tiered pricing for Amazon CloudWatch logs and additional logging destinations

Post Syndicated from Shridhar Pandey original https://aws.amazon.com/blogs/compute/aws-lambda-introduces-tiered-pricing-for-amazon-cloudwatch-logs-and-additional-logging-destinations/

Effective logging is an important part of an observability strategy when building serverless applications using AWS Lambda.

Lambda automatically captures and sends logs to Amazon CloudWatch Logs. This allows you to focus on building application logic rather than setting up logging infrastructure and allows operators to troubleshoot failures and performance issues more easily.

On May 1st, 2025, AWS announced changes to Lambda logging, which can reduce Lambda CloudWatch logging costs and make it easier and more cost-effective to use a wider range of monitoring tools. Lambda logs are now available at volume-based tiered pricing when using CloudWatch Logs Standard and Infrequent Access log classes. When generating Lambda logs at scale, you can expect an immediate cost reduction under this new pricing model. Lambda also now supports Amazon S3 and Amazon Data Firehose as additional destinations for Lambda logs, in addition to CloudWatch Logs. Lambda logs sent to S3 and Firehose are also available at volume-based tiered pricing.

This blog post covers some recent Lambda logging enhancements and describes how this change delivers a simpler, more cost-effective logging experience for Lambda.

Overview

Logging provides developers and operators with valuable data for debugging and troubleshooting application behavior, performance issues, and potential failures. It becomes even more important for serverless applications built using Lambda because of the ephemeral and stateless nature of the Lambda execution environment. Lambda’s built-in integration with CloudWatch Logs ensures that logs for every function invocation are readily available for analysis. The captured log data includes application logs generated by your Lambda function code and system logs generated by the Lambda service while running your function code. CloudWatch Logs allows you to search, filter, and analyze log data to troubleshoot issues, track metrics, and set up alerts.

Logging requirements evolve as serverless applications grow in complexity and scale, sometimes spanning hundreds or thousands of Lambda functions which generate substantial log volumes. Organizations need sophisticated logging solutions that can handle this scale while remaining cost-effective. Some scenarios—such as monitoring critical business transactions—demand real-time log analysis, while others focus on after-the-fact forensic analysis. Debug logs from development and staging environments often need high granularity, whereas you may want lower verbosity in production logs to improve the signal-to-noise ratio.

Recent Lambda logging enhancements

In recent years, Lambda and CloudWatch Logs have expanded Lambda’s logging capabilities to meet the evolving needs of serverless applications. These capabilities provide deeper insights, greater control, and more cost-effective solutions to capture, process, and consume logs to enhancing the serverless observability experience. Lambda advanced logging controls gives developers control over log generation and content. These controls allow you to capture Lambda logs in JSON structured format. You don’t have to use logging libraries and customize log levels (INFO, DEBUG, WARN, ERROR) separately for application and system logs. This helps reduce logging costs by ensuring only necessary logs are generated while maintaining appropriate visibility across different environments. For example, you can set verbose DEBUG level logging in development environments while limiting production logging to ERROR level to improve the signal-to-noise ratio and control costs.

The Infrequent Access log class for CloudWatch Logs introduced a cost-effective solution for logs that need retention but are accessed less frequently. Infrequent Access is 50% lower per GB ingestion price than the Standard log class This tailored set of capabilities allows you to reduce your logging costs while maintaining access to historical data for compliance, audit purposes, or forensic analysis.

CloudWatch Logs Live Tail is an interactive, real-time log streaming and analytics capability. Live Tail streamlines debugging and monitoring workflows; it allows you to observe log output as functions execute without navigating away from the Lambda console. This makes it easier to identify and diagnose issues during development and troubleshooting. Logs Live Tail is also available in Visual Code IDE.

Tiered pricing for Lambda logs in CloudWatch Logs

Starting today, Lambda logs sent to CloudWatch Logs are classed as Vended Logs, which are logs from specific AWS services that are available at volume tiered pricing. This replaces the previous flat rate model when using CloudWatch Logs Standard log class. For example, in the US East (N. Virginia) AWS Region, you were charged at $0.50 per GB when using Standard log class for your Lambda logs. Under the new pricing model, you are charged for sending your Lambda logs to CloudWatch Logs starting at $0.50 per GB for initial usage. As log volume increases, the price per GB automatically decreases through multiple tiers, reaching rates as low as $0.05 per GB in the lowest tier. This pricing change applies automatically to all Lambda logs sent to CloudWatch Logs, requiring no code or configuration changes from you.

Data Ingested CloudWatch Logs Standard CloudWatch Logs Infrequent Access
First 10 TB per month $0.50 per GB $0.25 per GB
Next 20 TB per month $0.25 per GB $0.15 per GB
Next 20 TB per month $0.10 per GB $0.075 per GB
Over 50 TB per month $0.05 per GB $0.05 per GB

Table 1: Tiered pricing for Lambda logs in CloudWatch Logs in US East (N. Virginia) Region

When generating Lambda logs at scale, you will see an immediate cost reduction under this new pricing model. For example, if you generate 60 TB of Lambda logs monthly in CloudWatch Logs, costs would decrease by 58% (from $30,000 to $12,500). The pricing tiers scale with your logging volume, ensuring that cost benefits increase as your application grows. This allows you to maintain comprehensive logging practices that previously may have been cost-prohibitive. Vended logs tiered pricing is applied on all vended logs ingested to CloudWatch and not tiered per service.

When ingesting other vended logs, such as Amazon Virtual Private Cloud flow logs and Amazon Route 53 resolver query logs, you will see larger discounts as the tiering is applied at a consolidated log ingestion volume.

New Lambda logging destinations: Amazon S3 and Amazon Data Firehose

Starting today, Lambda also supports Amazon S3 and Amazon Data Firehose as destinations for Lambda logs, in addition to CloudWatch Logs. When using S3 or Firehose as a destination, logging costs start at $0.25 per GB. The tiered pricing also applies, with rates reducing to as low as $0.05 per GB in the lowest tier. This tiering is also applied at a consolidated log ingestion volume.

Data Ingested Delivery Cost to Amazon S3 Delivery Cost to Amazon Data Firehose
First 10TB per month $0.25 per GB $0.25 per GB
Next 20TB per month $0.15 per GB $0.15 per GB
Next 20TB per month $0.075 per GB $0.075 per GB
Over 50TB per month $0.05 per GB $0.05 per GB

Table 2:Tiered pricing for Lambda logs delivery to Amazon S3 and Amazon Data Firehose in US East (N. Virginia) Region

Direct delivery of Lambda logs to S3 provides enhanced flexibility in log management. Support for Firehose streamlines Lambda log delivery to additional destinations such as Amazon OpenSearch Service, HTTP endpoints, and third-party observability providers. This matches the established log delivery pattern used with other AWS compute services such as Amazon Elastic Container Service (Amazon ECS) and Amazon Elastic Compute Cloud (Amazon EC2).

This new capability provides significant cost benefits and streamlines log delivery to additional logging destinations, making it easier to use a wider range of monitoring tools (including CloudWatch) when building serverless applications using Lambda.

New Lambda logging destinations in action

All new and existing Lambda functions have CloudWatch Logs as the default logging destination, with S3 and Firehose as alternative choices. When you select S3 or Firehose as your logging destination, Lambda sends logs to the selected destination via a new CloudWatch Logs Delivery log class. This log class enables efficient routing but doesn’t support CloudWatch Logs Standard log class features, such as Logs Insights and Live Tail.

To set up S3 or Firehose as the destination for your Lambda logs in the Lambda console:

  1. Navigate to the Lambda console, and select or create a function to set up an S3 or Firehose logging destination.
  2. In the Configuration tab, select Monitoring and operations tools on the left pane.
  3. Select Edit in the Logging configuration. This opens the Edit logging configuration page.

    Figure 1. Edit logging configuration in Lambda console

    Figure 1. Edit logging configuration in Lambda console

  4. In the Log destination section, select Amazon S3 or Amazon Data Firehose. Amazon CloudWatch Logs is the default selection.

    Figure 2. Select log destination in the Edit logging configuration page

    Figure 2. Select log destination in the Edit logging configuration page

  5. Under CloudWatch delivery log group, choose Create new log group or Existing log group.
  6. To create a new delivery log group to send logs to S3, enter a log group name and specify the destination S3 bucket. Provide an AWS Identity and Access Management (IAM) role for CloudWatch Logs to deliver logs to S3.
    Follow similar steps to send logs to a Firehose stream.

    Figure 3. Create new CloudWatch delivery log group for S3

    Figure 3. Create new CloudWatch delivery log group for S3

  7. To use an existing delivery log group, select one from the Delivery log group. The selected delivery log group must have a configured destination (S3 or Firehose) and match the destination you selected.

    Figure 4. Select existing CloudWatch delivery log group for Firehose

    Figure 4. Select existing CloudWatch delivery log group for Firehose

Advanced logging controls are also available for S3 and Firehose destinations. These controls include JSON structured format selection and log level filters for both application and system logs. This gives you enhanced log management controls for easier search, filter, and analysis. You can also use AWS Command Line Interface (AWS CLI) and infrastructure as code (IaC) tools such as AWS CloudFormation and AWS Cloud Development Kit (AWS CDK) to set up Lambda logs delivery to S3 and Firehose.

Best practices

To get the most out of the changes announced today, ensure that your logging strategy is closely aligned with the requirements of your workload. For example, consider sending critical production logs to CloudWatch Logs to take advantage of its advanced real-time analytics and alerting features. You now automatically benefit from volume-based discounts through tiered pricing in CloudWatch Logs for high-volume logging scenarios. For logs that need long-term retention for historical analysis, you can use S3’s storage classes to further reduce costs. When using your existing or third-party monitoring tools, direct integration through Firehose eliminates the need for custom forwarding solutions and associated costs.

Logging cost optimization extends beyond destination selection. Monitor log volumes regularly to understand the impact of pricing tiers. Implement appropriate retention policies to prevent unnecessary storage of old logs and log sampling for high-volume debug logs. Consider using different logging strategies across development, staging, and production environments to balance observability needs with cost efficiency.

Conclusion

Tiered pricing for Lambda logs in CloudWatch Logs and support for S3 and Firehose as additional logging destinations improves Lambda application observability. You can now manage logging costs at scale and expand Lambda monitoring solutions through cost-effective, easy-to-configure integrations. Whether you’re building new serverless applications or optimizing existing ones, these enhancements help you implement comprehensive logging strategies that scale cost-effectively with your workload.

The new features announced today are available in all commercial AWS Regions where Lambda and CloudWatch Logs are available. Support for configuring log delivery to S3 and Firehose in the Lambda console is available in US East (Ohio), US East (N. Virginia), US West (Oregon), and Europe (Ireland) Regions, with additional Regions coming soon. Review the Lambda documentation and CloudWatch Logs documentation to learn more about these features and how to use them. Review the CloudWatch pricing page to learn more about how these features are priced.

For more serverless learning resources, visit Serverless Land.

Integrating aggregators and Quick Service Restaurants with AWS serverless architectures

Post Syndicated from Mike Gomez original https://aws.amazon.com/blogs/compute/integrating-aggregators-and-quick-service-restaurants-with-aws-serverless-architectures/

In this post, you learn how to use AWS serverless technologies, such as Amazon EventBridge and AWS Lambda, to build an integration between Quick Service Restaurants (QSRs) and online ordering and food delivery aggregators. These aggregators have taken off as an option to QSRs to expand their consumer base, enabling them with delivery options to help grow their businesses.

QSR overview

QSRs prioritize speedy and convenient service, offering a streamlined menu. To meet evolving consumer expectations, QSRs can use API integrations with third-party aggregators. This technological synergy enables QSRs to expand their capabilities, introducing diverse payment methods and incorporating delivery services. These features have become standard in this restaurant segment.

Behind the scenes, the APIs are used to orchestrate the interaction between the aggregator and the QSR while having a consistent ordering and delivery experience.

QSR business objectives are:

  • Providing consistent ordering and delivery experiences
  • Offering personalized menu items
  • Retaining repeat customers
  • Reducing third-party delivery cancellation due to lack of delivery personalization options

This post starts with a simple architecture and adds components to solve architectural challenges.

Architecture

As a solutions architect, you’ve been approached by a thriving local restaurant business seeking technological solutions to fuel their expansion. Your task is to design an optimal integration architecture that aligns with their technical requirements, streamlines operations, and enhances customer experience.

At the core of this integration is Amazon API Gateway, which accepts the incoming orders from various delivery aggregators. The API Gateway becomes the front door, connecting the QSRs with the end customers for a streamlined and dynamic order processing system.

Driving the backend of this integration are Lambda functions. These functions validate orders and securely communicate with delivery aggregators. Lambda functions can scale dynamically based on-demand, and make sure of optimal resource usage and cost-effectiveness.

Order placement workflow

The following steps outline the serverless integration between API Gateway and Lambda functions, as shown in the following figure:

  • Customers can place orders either through food delivery aggregators or the business’s own ordering system.
  • The order request is sent to API Gateway.

This architecture works for small and simple integrations. To scale this architecture for high traffic, use asynchronous integration to reduce the coupling between API and Lambda function.

Order routing workflow

The following steps outline a serverless integration where API Gateway connects to Lambda functions through Amazon EventBridge as the event routing service, as shown in the following figure:

  1. API Gateway receives the order request.
  2. The API Gateway routes the customer’s order request to an EventBridge bus for processing.

EventBridge routes events (for example order status changes) to Lambda functions, making sure of resiliency during service disruptions. This eliminates manual error handling and keeps QSRs and aggregators synchronized.

EventBridge delivers the following essential capabilities:

  • EventBridge receives events triggered by various actions, such as new orders or menu updates.
  • It routes events to the relevant Lambda functions, initiating the appropriate actions.
  • EventBridge supports event replay, allowing recovery from Lambda deployment issues or function failures. This feature enables business continuity by storing events during service disruptions and automatically resuming processing when the system stabilizes.

To maintain order history and enable fast data retrieval, the system needs a highly performant database. Amazon DynamoDB, a serverless NoSQL database service, meets these requirements by efficiently storing and managing order information and metadata. The order processing Lambda function interacts with DynamoDB to persist order details. This approach enables asynchronous processing of the stored data by other backend processes. The database solution provides the scalability and responsiveness needed to handle growing order volumes while maintaining consistent performance, separating order intake from subsequent processing steps.

Order processing workflow

The following steps outline the order processing workflow, as shown in the following figure:

  • The order processing Lambda function validates the order and updates the DynamoDB database with the new order details.
  • The function publishes error events to EventBridge, enabling downstream processing for error handling and retry logic. These events can trigger more Lambda functions designed to manage specific error scenarios and recovery processes.

EventBridge implementation patterns: single or dual bus approaches

EventBridge offers multiple approaches for event bus topology. Architects can choose to either use a single event bus with distinct event patterns based on order status or implement a multi-bus strategy.

The single-bus approach uses one event bus for all events with routing rule patterns based on order status. For example, rules would match specific statuses (for example “new” or “processed”) to trigger appropriate Lambda functions. Although it is architecturally simple, it needs careful management of the event schema to avoid potential errors. However, a single-bus approach requires careful handling to prevent recursive processing, where messages trigger additional messages in an endless loop.

Alternatively, the multi-bus method, separating order placement and processing across different buses, effectively prevents loops and recursion issues. This approach provides better separation of transactions, albeit with a slightly more complex setup.

EventBridge can directly target external services using the API destination option, eliminating the need for Lambda functions for third party integrations.

Orchestrating order processing

In complex order processing systems for QSRs, managing multiple interdependent Lambda functions can become challenging, potentially leading to intricate code and difficult-to-maintain architectures. To address this, AWS Step Functions can be introduced as an orchestration layer.

Step Functions acts as a central coordinator for the business logic needed in QSR order flows. This service manages the progression of activities in the order processing workflow, thereby efficiently coordinating tasks such as kitchen preparation and delivery logistics. Defining and managing complex workflows allows Step Functions to optimize the overall efficiency of QSR operations, providing a structured and adaptable solution. This orchestration enhances the restaurant’s ability to handle dynamic processing, achieving a smooth and responsive integration with delivery services while streamlining the underlying architecture.

The following steps outline the orchestration of order processing, as shown in the following figure:

  • Order processing trigger respective Lambda function, which updates the order data in the DynamoDB database.
  • The updated order is made available for subsequent Lambda functions that process more business logic being performed by further Lambda functions.

In a multi-bus EventBridge architecture, the process flows are as follows:

  1. The first EventBridge bus receives the initial order event and routes it to a Step Functions workflow.
  2. The Step Functions workflow orchestrates the order processing, coordinating various tasks and checks.
  3. Upon completion, the Step Functions workflow emits an event with the processing results to the second EventBridge bus.
  4. Based on the output from the Step Function workflow, this second bus contains a rule that triggers the Aggregator API as an API destination.

User engagement workflow

When a customer places an order, there must be a way to confirm or notify them when the order is ready. For this purpose, you can use AWS End User Messaging services to push notifications for order completion and new offers to customers.

Analyzing customer data and individual preferences allows Amazon Personalize to be used to present personalized recommendations and promotions.

Amazon Personalize can analyze historical order data to enhance the user experience through personalized recommendations, such as optimal delivery times, preferred menu items, and tailored promotions based on individual ordering patterns.

Conclusion

This post showed how to use AWS serverless services to build a platform for your order processing without worrying about managing underlying infrastructure. The serverless services included were Amazon API Gateway, AWS Lambda, Amazon EventBridge, AWS Step Functions, AWS End User Messaging, and Amazon Personalize.

This post is a brief introduction to event-driven architectures focused on integrations of internal ordering systems with delivery aggregators and third-party ordering platforms. This can help expand the user base, and it has been a key factor in the growth of many QSRs. Making the ordering, take-out, and delivery experience more efficient translates to revenue growth, reduction of order abandonment, as well as increased recurrent customer retention and brand loyalty.

For more serverless learning resources, visit Serverless Land. To find more patterns, go directly to the Serverless Patterns Collection.

Accelerate data pipeline creation with the new visual interface in Amazon OpenSearch Ingestion

Post Syndicated from Samuel Selvan original https://aws.amazon.com/blogs/big-data/accelerate-data-pipeline-creation-with-the-new-visual-interface-in-amazon-opensearch-ingestion/

Amazon OpenSearch Ingestion is a fully managed serverless pipeline that allows you to ingest, filter, transform, enrich, and route data to an Amazon OpenSearch Service domain or Amazon OpenSearch Serverless collection. OpenSearch Ingestion is capable of ingesting data from a wide variety of sources and has a rich ecosystem of built-in processors to take care of your most complex data transformation needs.

Today, we’re launching a new visual interface for OpenSearch Ingestion that makes it simple to create and manage your data pipelines from the AWS Management Console. With this new feature, you can build pipelines in minutes without writing complex configurations manually.

The new visual interface brings three key improvements to help streamline your workflow:

  • A guided visual workflow that walks you through pipeline creation
  • Automatic permission setup that eliminates manual AWS Identity and Access Management (IAM) policy management
  • Real-time validation checks that help catch issues early

These enhancements make it straightforward to ingest, transform, enrich, and route your data, whether you’re setting up your first pipeline or architecting sophisticated data workflows with multiple transformations and sinks.

In this post, we walk through how these new features work and how you can use them to accelerate your data ingestion projects.

Automatic discovery

Before the visual interface, creating an OpenSearch Ingestion pipeline started with selecting a blueprint that provided a template with placeholders for sources and sinks. You would then need to manually modify this template to match your specific requirements.

The new visual interface improves this process by automatically discovering your sources and sinks as you build. Instead of modifying template code, you can simply select from available resources on the dropdown menus and watch your pipeline configuration build in real time.

This automatic discovery feature eliminates the need to switch between different service consoles to find your source and sink details. Previously, you had to navigate to services like Amazon Simple Storage Service (Amazon S3) or Amazon DynamoDB to copy resource details and Amazon Resource Name (ARN) values, then switch back to enter them into your template. This keeps you focused on your pipeline design, streamlining the entire creation process.

Automated IAM role management

With automatic permission creation, you no longer need to manually create IAM policies for your pipelines and the components involved. With the new UI, you can now create a unified IAM role automatically, granting the necessary permissions for all the components in your pipeline. This significantly reduces the complexity of security management and minimizes the risk of permission-related errors. You can also still use your existing roles if you have them defined already.

Real-time validation

The new interface introduces real-time validation capabilities that go far beyond basic syntax checking. Whereas previous versions only validated keyword syntax, the new interface executes your processor chain in real time, catching both configuration and runtime errors as you build. As you construct your pipeline, the interface continuously validates your entire configuration, helping you identify and resolve potential issues like processor misconfigurations, data type mismatches, or transformation errors before deployment. This proactive, execution-based validation approach helps make sure your pipelines work as intended from the start, alleviating the need to wait until runtime to discover processing chain issues.

Now that we’ve covered the key features, let’s walk through the process of creating a pipeline using the new interface.

Create a pipeline in OpenSearch Ingestion

Getting started with the visual interface is straightforward — you can choose a blueprint as your pipeline foundation or start with a clean slate from a blank template. The interface then guides you through each step, using intelligent resource discovery and automatic population features to simplify the entire creation process. For this post, we use the “Zero-ETL with DynamoDB” blueprint.

The visual interface streamlines source configuration by presenting your DynamoDB tables on an easy-to-navigate dropdown menu. After you select a table, the interface handles all the technical details, including automatically retrieving and configuring the ARN. This same functionality extends to Amazon S3 export configuration, where you can choose Browse S3 to select your bucket and folders directly within the pipeline creation workflow.

After your source is configured, you can enhance your pipeline with processors to transform your data. The processor configuration panel starts with a search field where you can find and select the processor you need. You can choose Add to include processors also then arrange them in the desired order. This flexibility allows you to build complex data transformation workflows by combining different processors in the sequence you need.

If there are any issues, such as missing required fields, the interface displays clear error messages, allowing you to address problems before moving forward. This validation at each step makes sure your pipeline is properly configured before deployment.

The following screen capture shows an example of the visual interface.

The interface’s real-time validation capabilities extend to processor configuration, helping you identify and resolve potential issues before they impact your pipeline. Each processor’s configuration is validated as you build your pipeline, with clear error messages guiding you toward proper setup. This proactive validation approach makes sure your data transformation logic is sound before moving to the next stage of pipeline creation.

The sink configuration panel offers flexibility in choosing your OpenSearch destination. You can select between a managed cluster or serverless option, depending on your specific needs. For added convenience, we’ve integrated the ability to create a new OpenSearch domain directly from this interface, streamlining the end-to-end pipeline setup process.

The sink configuration provides options for both dynamic and custom mapping. Dynamic mapping automatically handles data type detection and mapping creation, whereas custom mapping gives you precise control over your data structure. To maintain data reliability, you can enable a dead-letter queue (DLQ)—a holding area for messages that couldn’t be processed successfully—to capture and manage any failed events.

As you make choices in the visual interface, the corresponding YAML/JSON configuration updates in real time. This immediate feedback helps you understand how your selections translate into technical configurations, from index naming to mapping options and advanced settings like flush timeout and document versioning.

Security configuration is now seamless with automated IAM role management. The interface intelligently handles the creation and management of permissions across all pipeline components. You can either create a new service role or use an existing one, and the interface automatically generates a unified IAM role that provides the precise permissions needed across pipeline components—from your source to Amazon S3 components needed for the DLQ and OpenSearch/Amazon S3 sinks. This automation not only saves time but also reduces the risk of permission-related errors that could occur when managing access controls across multiple resources. The following screen capture shows an example.

By consolidating resource selection into a single interface, we’ve eliminated the need to navigate between multiple AWS services. This saves time and reduces the potential for errors that could occur when manually copying resource identifiers. Once a pipeline is created using the visual interface, you can also edit a pipeline using the same visual interface to quickly alter pipeline configuration.

Conclusion

The new visual interface for OpenSearch Ingestion introduces guided visual workflows that simplify pipeline creation, automatic discovery of resources, automated IAM role management, real-time validation, and dynamic configuration previews. These enhancements collectively streamline the pipeline creation process, reduce the potential for errors, and provide a more intuitive experience for users of all skill levels.

Ready to get started? Visit the OpenSearch Service console today and begin building your first visual pipeline. With this new interface, you can transform your data ingestion workflows and unlock new insights from your data more quickly and efficiently than ever before.


About the authors

Sam Selvan is a Principal Specialist Solution Architect with Amazon OpenSearch Service.

Jagadish Kumar (Jag) is a Senior Specialist Solutions Architect at AWS focused on Amazon OpenSearch Service. He is deeply passionate about Data Architecture and helps customers build analytics solutions at scale on AWS.

How Smartsheet reduced latency and optimized costs in their serverless architecture

Post Syndicated from Anton Aleksandrov original https://aws.amazon.com/blogs/architecture/how-smartsheet-reduced-latency-and-optimized-costs-in-their-serverless-architecture/

Cloud software as a service (SaaS) companies are often looking for ways to enhance their architectures for performance and cost-efficiency. Serverless technologies offload infrastructure management, allowing development teams to focus on innovation and delivering business value. As application architectures grow and face more demanding requirements, continued optimization helps maximize both the technical and financial advantages of the serverless approach.

In this post, we discuss Smartsheet’s journey optimizing its serverless architecture. We explore the solution, the stringent requirements Smartsheet faced, and how they’ve achieved an over 80% latency reduction. This technical journey offers valuable insights for organizations looking to enhance their serverless architectures with proven enterprise-grade optimization techniques.

Solution overview

Smartsheet is a leading cloud-based enterprise work management platform, enabling millions of users worldwide to plan, manage, track, automate, and report on work at scale. At the core of the platform lies an event-driven architecture that processes real-time user activity across various document types. Given the collaborative nature of the platform, multiple users can work on these documents concurrently. Every document interaction triggers a series of events that must be processed with minimal latency to maintain data consistency and provide immediate feedback. Processing delays can impact user experience and productivity, making consistently low latency a fundamental business requirement.

Smartsheet’s traffic pattern is spiky during business hours and mostly dormant during nights and weekends. Within peak periods, traffic can fluctuate as users collaborate in real time. To efficiently manage dynamic workloads, which can surge from hundreds to tens of thousands of events per second within minutes, Smartsheet implements a serverless event processing architecture using services such as Amazon Simple Queue Service (Amazon SQS) and AWS Lambda. This architecture uses the elasticity of serverless services and the ability to automatically scale dynamically based on the traffic volume. It makes sure Smartsheet can efficiently handle sudden traffic surges while automatically scaling down during off-peak hours, optimizing for both performance and cost-efficiency.

The following diagram illustrates the high-level architecture of the Smartsheet event processing pipeline.

high-level architecture of the Smartsheet event processing pipeline

Optimization opportunity

Smartsheet uses Lambda functions to serve both batch jobs and API requests. The primary runtime used for building those functions is Java. Lambda automatically scales the number of execution environments allocated to your function on demand to accommodate traffic volume. When Lambda receives an incoming request, it attempts to serve it with an existing execution environment first. If no execution environments are available, the service initializes a new one. During initialization, the Smartsheet’s function code commonly sends several requests to external dependencies, such as databases and REST APIs, which might take time to reply.

The following diagram illustrates how Lambda functions reach out to external dependencies during initialization.

Lambda functions reach out to external dependencies during initialization

These tasks introduced execution environment initialization latency, commonly referred to as a cold start. Although cold starts typically affect less than 1% of requests, Smartsheet had stringent low latency requirements for their architecture to further prioritize the best possible end-user experience.

“To reduce customer request latency while keeping costs low, our engineering team utilized Lambda provisioned concurrency with auto scaling and Graviton, which resulted in an 83% reduction in P95 latency while providing a high quality of service as we continue to scale our platform and its limits,” says Abhishek Gurunathan, Sr Director of Engineering at Smartsheet.

Addressing the cold start with provisioned concurrency

To reduce cold start latency, the Smartsheet team adopted provisioned concurrency in their architecture, a capability that allows developers to specify the number of execution environments that Lambda should keep warm to instantly handle invocations. The following diagram illustrates the difference. Without provisioned concurrency, execution environments are created on demand, which means some invocations (typically less than 1%) need to wait for the execution environment to be created and initialization code to be run. With provisioned concurrency, Lambda creates execution environments and runs initialization code preemptively, making sure invocations are served by warm execution environments.

invocations are served by warm execution environments

Provisioned concurrency includes a dynamic spillover mechanism, making your serverless architecture highly resilient to traffic spikes. When incoming traffic exceeds the preconfigured provisioned concurrency, additional requests are automatically served by on-demand concurrency rather than being throttled. This provides seamless scalability and maintains service availability even during traffic surges, while still providing the performance benefits of pre-warmed execution environments for the majority of requests.

The Smartsheet team configured provisioned concurrency to match their historical P95 concurrency needs. This resulted in immediate improvements—the number of cold starts dropped dramatically and P95 invocation latency dropped by 83%. As the team monitored system performance, they quickly identified another architecture optimization opportunity—the Lambda functions were heavily used during work hours but had significantly fewer invocations at night and on weekends, as illustrated in the following graph.

Lambda functions were heavily used during work hours but had significantly fewer invocations at night and on weekends

Setting a static provisioned concurrency configuration worked great for busy periods, but was underutilized during off-times. The Smartsheet team wanted to further fine-tune their architecture and increase provisioned concurrency utilization rates to achieve higher cost-efficiency. This led them to look into provisioned concurrency auto scaling to match traffic patterns as well as adopting an AWS Graviton architecture.

Auto scaling provisioned concurrency and Graviton architecture

Two common approaches to enable provisioned concurrency are setting a static value and using auto scaling. With static configuration, you specify a fixed number of pre-initialized execution environments that remain continuously warm to serve invocations. This approach is highly effective for architectures that handle predictable traffic patterns. Unpredictable traffic patterns, however, can lead to under-provisioning during peak periods (with spillover to on-demand concurrency resulting in more cold starts) or underutilization during low-usage periods. To address that, provisioned concurrency with auto scaling dynamically adjusts the configuration based on utilization metrics, automatically scaling the number of execution environments up or down to match the actual demand. This dynamic approach optimizes for cost-efficiency and is particularly recommended for architectures with fluctuating traffic patterns.

The following figure compares static and dynamic provisioned concurrency.

static and dynamic provisioned concurrency

To further optimize the architecture for cost-efficiency, the Smartsheet team has implemented provisioned concurrency auto scaling based on utilization metrics. Smartsheet used an infrastructure as code (IaC) approach with Terraform to define auto scaling policies for maximum reusability across hundreds of functions. The policies track the LambdaProvisionedConcurrencyUtilization metric and define the scaling threshold according to the function purpose. For functions implementing interactive APIs, the auto scale threshold is 60% utilization to pre-provision execution environments early, keeping latency extra-low, and making functions more resilient towards traffic surges. For functions that implement asynchronous data processing, Smartsheet’s goal was to achieve the highest utilization rate and cost-efficiency, so they’ve defined the auto scale threshold at 90%.

The following diagram illustrates the architecture of auto scaling policies based on provisioned concurrency utilization rate and workload type.auto scaling policies based on provisioned concurrency utilization rate and workload type

Another optimization technique Smartsheet employed was switching the CPU architecture used by their Lambda functions from x86_64 to arm64 Graviton. To achieve this, Smartsheet adopted the ARM versions of Lambda layers they’ve used, such as Datadog and Lambda Insights extensions. This was required because binaries built using one architecture might be incompatible with a different one. Because Smartsheet functions were implemented with Java and packaged as JAR files, they didn’t have any compatibility issues when moving to Graviton. With Terraform used for codifying the infrastructure, this architecture switch was a simple property change in aws_lambda_function resources, as illustrated in the following code:

property change in aws_lambda_function resources

By switching to a Graviton architecture, Smartsheet saved 20% on function GB-second costs. See AWS Lambda pricing for details.

Best practices

Use the following techniques and best practices to optimize your serverless architectures, reduce cold starts, and increase cost-efficiency:

  • Fine-tune your Lambda functions to find the optimal balance between cost and performance. Increasing memory allocation also adds CPU capacity, which often means faster execution and can lead to reduced overall costs.
  • Use a Graviton2 architecture for compatible workloads to benefit from a better price-performance ratio. Depending on the workload type, switching to Graviton can yield up to 34% improvement.
  • Use provisioned concurrency and Lambda SnapStart to reduce cold starts in your serverless architectures. Start with static provisioned concurrency based on your historical concurrency requirements, monitor utilization, and introduce auto scaling into your architecture to achieve the optimal cost-performance profile.

Conclusion

Serverless architectures using services like Lambda and Amazon SQS offload the infrastructure management and scaling concerns to AWS, allowing teams to focus on innovation and delivering business value. As Smartsheet’s journey demonstrates, using provisioned concurrency and Graviton in your architectures can help significantly improve user experience by reducing latencies while also achieving better cost-efficiency, providing a practical blueprint for optimization across the organization. Whether you’re running large-scale enterprise applications or building new cloud solutions, these proven techniques can help you unlock similar performance gains and cost-efficiencies in your serverless architectures.

To learn more about serverless architectures, see Serverless Land.


About the authors

 

Startup Program update: empowering every stage of the startup journey

Post Syndicated from Christopher Rotas original https://blog.cloudflare.com/expanding-cloudflares-startup-program/

During Cloudflare’s Birthday Week in September 2024, we introduced a revamped Startup Program designed to make it easier for startups to adopt Cloudflare through a new credits system. This update focused on better aligning the program with how startups and developers actually consume Cloudflare, by providing them with clearer insight into their projected usage, especially as they approach graduation from the program.

Today, we’re excited to announce an expansion to that program: new credit tiers that better match startups at every stage of their journey. But before we dive into what’s new, let’s take a quick look at what the Startup Program is and why it exists.

A refresher: what is the Startup Program?

Cloudflare for Startups provides credits to help early-stage companies build the next big idea on our platform. Startups accepted into the program receive credits valid for one year or until they’re fully used, whichever comes first.

Beyond credits, the program includes access to up to three domains with enterprise-level services, giving startups the same advanced tools we provide to large companies to protect and accelerate their most critical applications.

We know that building a startup is expensive, and Cloudflare is uniquely positioned to support the full-stack needs of modern applications. Our goal is simple: ensure that you have access to the best of Cloudflare’s global network, without the barriers of cost or availability.

Since launching the revamped credits system in September, we’ve learned a lot from the startups in our program, including what they’re building, what they need, and where they need more flexibility. One of the most common requests was more credit tier options.

That’s why we’re introducing new tiers that provide even more options to startups as they scale.

Introducing additional credit tiers

The Cloudflare for Startups Program now offers four credit tiers: 

Credit Amount

$5,000

$25,000

$100,000

$250,000

Stage

Bootstrapped, stealth startups

Up-and-coming startups

Seed-funded startups

Tier 1 startups

Description

For startups who are just getting started. This tier is great for building, testing, and iterating your product.

For startups with early adopters and proving product market fit.

For startups that have raised capital, and are experiencing high growth.

For scaling startups that belong to our Tier 1 VC and accelerator network, are building a mission-critical AI application, or are participating in our Workers Launchpad Program.

Criteria

Building a software-based product or service

Founded in the last 5 years

Valid and matching email address

$5,000 criteria plus:

Active LinkedIn

Funded up to $1M

$25,000 criteria plus:

Funded between $1M and $5M

Belong to any of our 250+ approved VC or Accelerator partners

$100,000 criteria plus:

High growth / AI companies, OR

Tier 1 VC & Accelerators

These tiers are designed to offer simplicity and clarity by aligning with where you are in your growth journey. (You can check out eligibility criteria and apply to the Startup Program here). These tiers are still subject to the same Cloudflare for Startups Terms of Service. Credits are valid for up to one year or when all credits are consumed (whichever comes first).

Why are we adding additional credit tiers?

We understand that each startup may have different needs depending on where they’re at in their journey. Some are just getting off the ground, others are scaling rapidly, and each has unique infrastructure needs. With this expansion, we’re reaffirming Cloudflare’s commitment to startups of all sizes, making it easier for you to access the right level of support and resources, exactly when you need them.

Whether you’re launching your MVP or preparing for your next funding round, Cloudflare is here to help you grow.

What can I use the credit tiers for?

The vast majority of Cloudflare products (including all products found on the pay-as-you-go plans) can be used on the Startup Program. Beyond going to the website to see what products are included, below are a few examples of what you can use your credits for:  

Build AI applications

Store your training data in R2, build AI-powered agents (via Agents SDK) that autonomously perform tasks with Durable Objects and Workers, or use one of over 50 models to run inference tasks on Cloudflare’s global network.  


Create immersive realtime experiences

Deliver live audio and video via our Realtime Kit, enhance the experience with an AI-powered chatbot running on Workers AI to transcribe the call, broadcast to large audiences with Stream.  

Build durable multi-step applications

Design and run long-lived, multi-step processes like onboarding flows, document processing, or order fulfillment. Use Workflows to coordinate logic across Workers, Durable Objects, Queues, and AI tasks. Easily handle retries, timeouts, and state management without complex orchestration infrastructure.

What are startups saying about Cloudflare?

Webstudio’s no-code platform is powered by Cloudflare’s Developer Platform

“From a modern design tool, you’d expect real-time collaborative features and would like to have resources as close to users as possible. Since betting on the Developer Platform architecture, Cloudflare has done more for us than any other vendor out there!” – Oleg Isonen (Founder & CEO)

GrackerAI’s cybersecurity research engine runs on Cloudflare’s AI and serverless architecture

“Cloudflare’s fusion of edge computing and AI empowers developers to deploy and utilize AI models with unprecedented efficiency and scale, marking a significant leap forward in how we build and interact with intelligent systems.” – Deepak Gupta (Co-founder & CEO)

Render Better powers faster ecommerce experiences with Cloudflare Workers

“Each month Render Better optimizes billions of monthly requests for ecommerce visitors, delivering faster loading sites that make top brands millions more in revenue. We’re able to scale up with Cloudflare’s serverless workers, handling every request at the network edge within milliseconds, thanks to the rock solid, DX-friendly scope of the Developer Platform.” –  James Koshigoe (Co-founder & CEO)

What will you build on Cloudflare? 

We can’t wait to see what you will build on Cloudflare. Apply here to take advantage of the Cloudflare for Startups Program.


Startup spotlight: building AI agents and accelerating innovation with Cohort #5

Post Syndicated from Christopher Rotas original https://blog.cloudflare.com/ai-agents-and-innovation-with-launchpad-cohort5/

With quick access to flexible infrastructure and innovative AI tools, startups are able to deploy production-ready applications with speed and efficiency. Cloudflare plays a pivotal role for countless applications, empowering founders and engineering teams to build, scale, and accelerate their innovations with ease — and without the burden of technical overhead. And when applicable, initiatives like our Startup Program and Workers Launchpad offer the tooling and resources that further fuel these ambitious projects.

Cloudflare recently announced AI agents, allowing developers to leverage Cloudflare to deploy agents to complete autonomous tasks. We’re already seeing some great examples of startups leveraging Cloudflare as their platform of choice to invest in building their agent infrastructure. Read on to see how a few up-and-coming startups are building their AI agent platforms, powered by Cloudflare.

Lamatic AI built a scalable AI agent platform using Workers for Platform

Founded in 2023, Lamatic.ai empowers SaaS startups to seamlessly integrate intelligent AI agents into their products. Lamatic.ai simplifies the deployment of AI agents by offering a fully managed lifecycle with scalability and security in mind. SaaS providers have been leveraging Lamatic to replatform their AI workflows via a no-code visual builder to reduce technical debt and ship products faster. Designed for high availability, scalability, and low latency, Lamatic’s architecture enables developers to build AI-driven applications that remain performant under heavy load. After acquiring a high amount of users in a short amount of time on Product Hunt, Lamatic identified there was real interest to solve complex problems with AI Agents, and the team knew they needed to build a solution with scalability and performance in mind.

Cloudflare plays a key role in supporting Lamatic’s growth. Powered by Cloudflare Workers, Lamatic ensures requests process closer to end users, minimizing latency while offloading computational strain from centralized servers. In just a few months, Lamatic.ai has efficiently scaled to over three million serverless requests per month, supporting over 1,000 customers — all managed by a lean three-person team. 

Customers design their Agent Flows through a no-code visual builder, which generates an interoperable YAML configuration. Sensitive credentials such as API keys and model access tokens are securely encrypted and stored in Workers KV, ensuring they are only decrypted at runtime for enhanced security. All YAML configurations are then compiled into a Workers-compatible JavaScript bundle. When a project is deployed, Lamatic orchestrates critical components like sync jobs for scheduled data ETL operations and incoming webhooks to handle event-driven workflows via Cloudflare Queues. Once deployed, the project is fully operational as a Cloudflare Worker with an exposed API endpoint, allowing customers to integrate AI-powered automation directly into their applications with minimal friction.


To scale out their platform, Lamatic.ai built their architecture isolating serverless and AI logic on a per-customer basis. Rather than batching requests into a centralized cluster, Lamatic.ai distributes workloads across Cloudflare’s global network, ensuring each customer and endpoint is served by its own Worker executing dedicated logic. This per-customer deployment model — enabled by Workers for Platforms — allows Lamatic.ai to deliver customer-specific serverless functions at scale, and reduces technical overhead as they onboard additional customers. Each customer gets a dedicated Worker whose request and rate limits are enabled based on their level of subscription. 

Beyond request processing, Lamatic uses Cloudflare Workers KV as a distributed config store to ensure high availability and security. All values are encrypted at rest with AES-256-GCM and decrypted only at runtime, keeping operations both secure and low-latency. Tokens and user credentials are encrypted and stored in the database and KV. 

To further enhance performance, Cloudflare Queues plays a key role in orchestrating task completion. Lamatic uses Queues to offload work from Workers requests, and handle tasks such as webhooks and coordinating distributed processes, both essential for maintaining system consistency and reliability at scale. While Workers handle sync requests at point of execution, longer running jobs process via Queues. For example, during a scheduled ETL sync, new data records generated are stored as a message queue on Cloudflare Pub/Sub. A consumer Worker collects these messages and makes an API request to the pod using the Workers Queue. The consumer Worker consumes more messages as each queue is finished processing.

Another example of where this has been optimal is for managing AI workflows. Many AI workflows involve concurrent requests to multiple data sources, Queues streamlines data processing and efficiently feeds information into customers’ Retrieval Augmented Generation (RAG) workflows. This approach smooths out workload spikes, reduces bottlenecks, and ensures that AI agents can reliably aggregate and process data without delays.

Beyond this, Lamatic.ai offers Workers AI as one of the support inference providers that customers can use across their platform. Customers can choose to run one of the many open source models hosted on Workers AI, depending on their use case (chatbot, image generation, voice, etc.). Together, these layers solve the challenges of scaling AI agents by handling high volumes of data, maintaining low-latency responses, and ensuring robust security. With Cloudflare’s infrastructure as its backbone, Lamatic.ai has built a resilient and high-performing platform that meets the rigorous demands of modern AI applications, making it an ideal choice for startups embedding AI-driven features into their products.

Skyward AI automates compliance using AI agents with Durable Objects and agents

Skyward AI is transforming compliance operations by leveraging Cloudflare’s serverless computing capabilities to build AI-driven compliance agents that streamline critical tasks like evidence collection, real-time risk analysis, and policy updates. Compliance teams in fintech, supply chain, and other highly regulated industries use these AI Agents to extract and organize evidence, provide real-time recommendations, and orchestrate policy and procedural updates automatically. By handling document parsing, risk monitoring, and policy enforcement, these AI Agents reduce the risk of human error while allowing compliance professionals to focus on high-value tasks.

Skyward has built an AI agents platform designed with a serverless-first approach, avoiding the constraints of centralized cloud computing. To achieve this, the company leverages Cloudflare’s Developer Platform to create and maintain a highly responsive and scalable infrastructure. Workers handle incoming requests like chat inputs, compliance checks, or authentication, and route them efficiently across multiple geographies. Skyward initially built their AI agents infrastructure using Durable Objects, Workflows and JavaScript-native RPC for AI coordination, but has recently transitioned to Cloudflare’s new AI agents framework. Given that agents provides a framework for building and orchestrating AI agents, the migration has helped Skyward abstract the need to manage Durable Objects manually, significantly reducing time spent on managing these tools. While this release is fairly recent, the transition has helped simplify the way that agents communicate, but it also preserved the benefits of their original design like data privacy, isolation, and concurrency management. This has also made it easier to provide real-time feedback and responses to their end users.

Skyward optimizes real-time compliance automation by achieving sub-100 ms response times for AI agent queries. Workloads are structured to minimize unnecessary network round-trips, and a sync-engine approach proactively preloads and pushes data to clients, delivering a highly responsive user experience. To proxy AI inference, Skyward uses AI Gateway to provide observability into usage, performance, and costs across multiple vendors, improving their AI operational efficiency. Leveraging Cloudflare’s serverless Developer Platform has allowed Skyward to simplify their architecture while supporting global availability, avoiding the need for Kubernetes clusters or complex locking mechanisms. The team also avoids the burden of managing regional deployments, as Cloudflare’s multi-region support ensures consistent performance worldwide without added operational complexity.


State management is a critical component to execute agentic workflows. Each compliance session runs within a dedicated Durable Object, which keeps relevant data close to the execution layer. This setup minimizes database round-trips and ensures that tasks like Anti-Money Laundering (AML) checks, Know Your Business (KYB) validation, and document processing remain efficient. Once a compliance session is complete, the system summarizes and stores the relevant information in Postgres and R2, optimizing memory usage without requiring persistent cloud infrastructure.

To balance low-latency operations with long-term storage, Skyward employs a multi-layered data management strategy. The Skyward team, using Hyperdrive, has been able to reduce query latency by nearly 50%, allowing compliance teams to receive immediate feedback. At the company’s core, Skyward’s goal is to offer a platform that is “streamlined for compliance teams”. The team maintains that a speedy feedback loop ensures end customers get the data and responses needed to act. Whether there’s one agent or hundreds of agents processing tasks in parallel, Hyperdrive ensures that database requests to assets like extensive company documentation (i.e. regulations, policies, procedures, internal documents), complex regulatory knowledge graphs, and on-demand context information for conversational workflows are all as performant as possible.

Durable Objects facilitate real-time session state, ensuring AI agents function smoothly without complex locking mechanisms. For larger compliance-related documents, such as legal PDFs and archived data, Cloudflare R2 provides long-term storage, ensuring only frequently accessed information remains readily available. This approach enhances performance while keeping storage management efficient and cost-effective.

Security and scalability remain priorities for compliance-focused AI applications. Skyward enforces strict access controls, ensuring that only authorized users can access development and production environments. Each AI session maintains an auditable log of key events, user actions, and approvals, supporting the ability to export these insights for compliance and legal requirements. Because each agent is deployed in its own instance and has its own database, Skyward ensures that there is a detailed record of every required user, agent interaction, and auditing requirements. On top of this, the ability to deploy and scale globally with Cloudflare’s network has allowed Skyward to maintain consistent, high-performance operations across multiple regions without extensive infrastructure overhead.

Looking ahead, Skyward plans to further enhance AI agent responsiveness by running select models directly on Cloudflare Workers AI, reducing reliance on external inference providers. The team plans to further integrate Workers for Platforms in an effort to better isolate customer data and workflows, giving end users greater control over their compliance automation. As Cloudflare continues to evolve its AI capabilities, Skyward aims to push the boundaries of distributed AI compliance solutions, making regulatory adherence more automated, scalable, and secure.

Building on Cloudflare

We’re inspired by how startups like Lamatic AI and Skyward AI are building their AI agent platforms on Cloudflare. This kind of innovation is why we’re proud to see so many startups trust Cloudflare for a scalable, reliable, and efficient foundation. 

We’re also thrilled to share that both Lamatic AI and Skyward AI have been invited to join Cloudflare’s upcoming Workers Launchpad Cohort #5. Speaking of Workers Launchpad, it’s been a few months since our last update — let’s take a look at what’s new.

Thank you to Workers Launchpad Cohort #4, and a warm welcome to Cohort #5


The Workers Launchpad team is blown away by what customers are demonstrating on the Developer Platform. Members of Cohort #4 presented at our bi-annual Demo Day. We had customers demonstrate what they’re building across a multitude of industries, including (of course) AI / ML, developer tools, 3D design, cloud infrastructure, adtech, media, and beyond. It’s incredibly encouraging to see what all these amazing companies are building on the Cloudflare network, and we look forward to continuing to partner with them throughout their startup journey.

Following the Demo Day for Workers Launchpad Cohort #4, we’ve seen the largest influx of applications from startups across the globe eager to join Cohort #5. This next wave of founders is pushing the boundaries of what’s possible, building in areas like AI agents, developer tooling, MCP, media, and beyond. With each new cohort, we’re continually inspired by the caliber of founding teams, the bold ideas they bring to life, and the real-world problems they’re tackling with technology.

Help us give some love and a warm welcome to the participants of Cohort #5:


We can’t wait to share more about what Cohort #5 achieves. Be sure to follow @CloudflareDev on X and join our Developer Discord server to hear updates on the cohorts.

If you’re developing your application on our Developer Platform, we’d love to learn how Cloudflare is powering your journey. Please share more about what you’re building, and our team will be sure to review your submission. And if you’re a startup and interested in joining Workers Launchpad, feel free to apply for Cohort 6 — applications are now open!

Company

About

Acemate

AI learning platform for university students and educators

Centillion AI

Building an MCP for identity verification & fraud prevention

Ductize

Productize your service and launch your website with integrated payments within 30 minutes

Firmly

Agentic commerce platform that enables consumers to shop at the moment of inspiration

Heartspace

Deliver quality PR on demand: pay for placements, not retainers

Lamatic

Managed AI middleware and IDE to embed AI features quickly and reliably

Lu.ma

Application that makes it easy to host great events

Manticore

AI-driven penetration testing: identify, remediate, and retest on-demand

MC² Finance

Crowdsource DeFi strategies and list them as ETFs

Muppet

Platform for building and managing MCPs at scale

Navatech

Remove barriers to language, access, and modality for frontline workers

New Era

Craft beautiful emails using natural language

New Harbor

Simple, friendly, all-in-one cybersecurity for organizations

Nexartis

Enable new paths for rights management, protect intellectual property, and properly monetize individual contributions

Notika AI

Add transactional notifications into your app in minutes with no engineers required

Periculum

AI provider offering data analytics software solutions to organizations in underserved markets

Pressbox

AI-powered multi-modal content personalization for sports and media organizations

Prompteus

Guardrails, logging, and cost reduction for AI integrations

Remix Labs

Enable next-gen digital engagement with agentic app experiences

Skyward

The AI-native workspace for compliance teams

Sonora

AI-native customer intelligence platform that synthesizes customer feedback and delivers actionable insights across all communication channels

SSOJet

Intelligent enterprise SSO that just works

Syrenn

AI outbound sales-coaching platform

Testdriver

Increase test coverage with Computer-Use agents

Toddle

Web development engine that lets product teams build beautiful interactive web applications

Toolhouse

Platform that enables any developer to build AI agents and workflows with a great developer experience

Unravo

Agentic AI-powered, end-to-end business research platform

Wittify.ai

Advanced Arabic conversational AI for customer engagement activities

Zero Sum Defense

Digital freedom through tailored security and privacy


Enriching and customizing notifications with Amazon EventBridge Pipes

Post Syndicated from Chris McPeek original https://aws.amazon.com/blogs/compute/enriching-and-customizing-notifications-with-amazon-eventbridge-pipes/

This blog post authored by Elie Elmalem, Associate Scale Solutions Architect

When implementing event-driven architectures, customers frequently need to enrich their incoming events with additional information to make them more valuable for downstream consumers. Traditionally, customers using Amazon EventBridge would accomplish this by writing AWS Lambda functions to augment their events with supplementary data. However, this approach requires writing and maintaining custom code, adding complexity to their event processing pipeline.

Amazon EventBridge Pipes simplifies this process by providing a streamlined, managed service for event enrichment without the need to write and manage custom Lambda functions. This blog post demonstrates how you can use EventBridge Pipes’ built-in data enrichment capabilities to dynamically enhance your events with additional context and customer-specific details, making event processing more efficient and easier to maintain.

Amazon EventBridge Pipes

Amazon EventBridge creates a direct connection between sources and targets. Using an EventBridge bus helps you route and fan-out events to services in a pub/sub pattern. EventBridge Pipes on the other hand help you with point-to-point service integrations patterns. What sets it apart from the traditional event bus/rule pattern is its data transformation and enrichment support.

When defining an EventBridge Pipes, you specify the source and the target of the pipe. Pipes support a variety of sources and targets. Between the source and target, EventBridge Pipes supports filtering and enrichment. The filtering enables you to select and process a targeted subset of events. Enrichment allows you to enhance data by adding missing information before sending it to a target. For instance, if an event lacks necessary information, it ensures the target can properly consume the event. Enriching data can be very powerful, as it makes it possible to enhance a generic event and transform it. EventBridge Pipes support enrichment using Lambda functions, AWS Step Functions, Amazon API Gateway and EventBridge API destinations. More details about these concepts can be found in the Amazon EventBridge Pipes concepts documentation.

Representation of EventBridge Pipes showing filter and enrichment steps.

Figure 1: Representation of EventBridge Pipes showing filter and enrichment steps

This blog post will use the enrichment step of the pipe to create custom notifications.

Overview

To illustrate the functionality, this post uses a use case from a clothing retailer. Businesses such as this retailer want to keep their loyal customers engaged. Often, they rely on bulk promotional emails which lack personalization. In this use case, the retailer wants to send targeted promotion codes. As soon as the 10th order is placed, the code is sent via email or SMS to their customer.

Without EventBridge Pipes, this would be implemented using EventBridge to respond to the order event. All the events are sent to a custom Lambda function to process it. If the order meets the right conditions, the Lambda function sends a notification with the discount code to the customer using Amazon Simple Notification Service (Amazon SNS).

Traditional approach using EventBridge.

Figure 2: Traditional approach using EventBridge

While this architecture works, it requires you to maintain the integration code as well as the data enriching logic within the Lambda function as the function needs to extract the necessary information from the events and manage routing to SNS. As more microservices follow the same pattern, the code becomes more complex. This can lead to longer execution times along with higher cost and greater maintenance effort.

Simplifying using Amazon EventBridge Pipes

Amazon EventBridge Pipes can be used to simplify the previous implementation by handling the enrichment and integration between services. Amazon EventBridge Pipes take care of sending the event to your configured enrichment step and then routes the enriched event to the target. If the chosen method is a Lambda function, it leaves the function code to only focus on enrichment logic. It eliminates the need for code to extract the necessary fields from the event and to send notifications.

Solution architecture using EventBridge Pipes

Figure 3: Solution architecture using EventBridge Pipes

As the event comes into the pipe, the enrichment step triggers a Lambda function, which will check eligibility and returns the message to route to SNS. If the customer is not eligible for a discount code, it returns an order confirmed message with the data retrieved from the original order event. If the customer is eligible for a discount, the message also contains the discount code.

This is the architecture for the updated flow:

  1. A customer orders a new item. The order is sent to a Simple Queue Service (SQS) Orders queue.
  2. The new message on the Orders queue triggers the EventBridge Pipes.
  3. The pipe triggers an AWS Lambda function to enrich the data.
  4. The functions checks if the customer is eligible for a discount code against an Amazon DynamoDB table. The table contains the number of times each customer has ordered.
  5. The Lambda function returns the custom message that will be sent to the customer, either with or without the discount code.
  6. The message is routed to an SNS topic by the EventBridge Pipe
  7. Customer receives the notification via its preferred subscription method.

Building the updated flow

To build the updated flow, I have chosen to use the AWS Cloud Development Kit (CDK) in Python. You can use the code given here to deploy it into your account. The code can also be found on GitHub.

Note: This sample code is for testing purposes only and is not intended to be used in a production account.

For this solution, you need the following prerequisites:

  1. The AWS Command Line Interface (CLI) installed and configured for use.
  2. An Identity and Access Management (IAM) role or user with enough permissions to create an IAM policy, DynamoDB table, SQS queue, SNS topic, Lambda Function and EventBridge Pipes.
  3. AWS CDK
  4. Python version 3.9 or above, with pip and virtual virtualenv.

Once the prerequisites are met, set up a new Python CDK project in an empty directory:

mkdir blog_code
cd blog_code
cdk init app –-language python

Then, activate the virtual environment and install the CDK’s dependencies:

source .venv/bin/activate
python -m pip install -r requirements.txt

The cdk init command creates a blog_code folder. The GitHub repository contains the code for the blog_code_stack.py file inside the blog_code folder.

Then, within the blog_code folder, create a new folder called lambda. Inside this new folder, create a file called index.py. This file will contain the code for the enrichment lambda function. Once again, this code can be found in the GitHub repository. Here is a section of the Lambda code:

def lambda_handler(event, context):

    message = json.loads(event[0]['body'])

    id = message['id']
    order_content = message['order_content']
    
    nmb_orders = get_number_of_orders(id)
    
    # Calculate orders left
    orders_left = MAX_ORDERS - nmb_orders
    
    # Update the DynamoDB table with the new number of orders
    if nmb_orders == MAX_ORDERS:
        update_table(id, 0)
    else:
        update_table(id, nmb_orders)
    
    if orders_left == 0:
        return [f"Thank you for your order of {order_content}. You have earned a 10% discount code on your next order: XA5GT2SF"]
    else:  
        # Return the confirmation message
        return [f"Thank you for your order of {order_content}. This is your confirmation message! Only {orders_left} orders left until a 10% discount!"]

The Lambda function works in the following way:

  1. It receives an event from the EventBridge Pipe which consists of the order and the ID of the user who made the order
  2. It gets the number of orders that the user has already placed by calling a GetItem command on the DynamoDB table.
  3. It calculates how many orders are left before the user gets the discount code.
  4. It updates the DynamoDB table with the new number of orders to account for the one that was just placed.
  5. If the user has placed the right number of orders, it returns a confirmation message with the discount code. Otherwise, it notifies the user of the number of orders that still need to be placed to get the discount.

Now, deploy the CDK stack into your account. Make sure that you are in the root directory of your project:

cdk bootstrap
cdk deploy

Once the stack has finished deploying, you will find an EventBridge pipe visible on the console by going to the EventBridge service page and clicking on Pipes in the left panel.

Testing the solution

To test the solution, you must first set up a subscription to the SNS topic to receive notifications. It is recommended to set up email notifications for simplicity and testing purposes. To do so, follow the instructions on the Amazon SNS documentation for the topic with name TargetTopic. When the subscription is set up, don’t forget to check your email inbox and confirm the subscription.

Once notifications are set up, visit the DynamoDB console page. You need to manually add an entry to the eligibility table to mimic a real environment:

  1. Click on Tables in the left panel
  2. Select the EligibilityTable table.
  3. Click Explore table items then Create item
  4. Enter an id with a value of 01.
  5. Click Add new attribute and select String.
  6. Under attribute name, enter orders, and under value enter 8.
  7. Click Create item.

The Items returned table should look like the following. This assumes that the customer has already place 8 orders.

Items returned table after adding a new item.

Figure 4: Items returned table after adding a new item

Now, visit the SQS console page. You will need to send a message to the queue to mimic new orders being placed.

  1. Click on the queue called SourceQueue.
  2. Click Send and receive messages.
  3. Under message body, paste in the following message and click Send message:
{
    "order_content": "large shirt",
    "id": "01",
    "username": "johndoe01",
    "transaction_time": "10:04:00"
  }

After a few minutes, you should receive an email confirming your order, as your order message is considered to be the 9th order. Send the message again to place a 10th order and you should receive your discount code!

Email received with a discount code.

Figure 5: Email received with a discount code

Cleanup

To delete the resources in your account, run the following command in the root directory of your project:

cdk destroy

Conclusion

This blog post showed how Amazon EventBridge Pipes and its enrichment feature can help you create tailored notifications. First, it discussed how it could be implemented using EventBridge and then presented a simplified implementation using EventBridge Pipes.

For more information on common patterns for EventBridge Pipes, you can check out Implementing architectural patterns with Amazon EventBridge Pipes.

For more serverless learning resources, visit Serverless Land. To find more patterns, go directly to the Serverless Patterns Collection.

Serverless ICYMI 2025 Q1

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/serverless-icymi-2025-q1/

Welcome to the 28th edition of the AWS Serverless ICYMI (in case you missed it) quarterly recap. At the end of a quarter, we share the most recent product launches, feature enhancements, blog posts, videos, live streams, and other interesting things that you might have missed!

In case you missed our last ICYMI, check out what happened in Q4 2024 here.

Serverless calendar Q1 2025

Serverless calendar Q1 2025

AWS Step Functions

The AWS Step Functions team continues to improve developer experience. Workflow Studio is now available within Visual Studio Code (VS Code) through the AWS Toolkit extension.

AWS Step Functions in IDE

AWS Step Functions in IDE

You can now design, test, and deploy your Step Functions workflows without leaving your IDE. The extension provides a drag-and-drop interface with all the familiar Workflow Studio capabilities, making it even easier to build state machines locally.

To get started, install the AWS Toolkit for Visual Studio Code and visit the user guide on Workflow Studio integration.

Step Functions private integrations now allows you to integrate applications seamlessly across private networks, on-premises infrastructure, and cloud platforms. Learn more in a blog post and explanation video.

AWS Step Functions private integrations video

AWS Step Functions private integrations video

Step Functions now integrates with 36 more AWS services that support user messaging capabilities. You can orchestrate notifications through Amazon SNS, Amazon SQS, Amazon EventBridge, Amazon Pinpoint, and more, all using the optimized integrations you’re familiar with.

Step Functions has increased the default quota for state machines and activities from 10,000 to 100,000 per AWS account. This tenfold increase means you can create more workflows to automate your business processes without worrying about hitting quota limits.

Distributed Map is expanding capabilities by adding support for JSON Lines (JSONL) format. JSONL, a highly efficient text-based format, stores structured data as individual JSON objects separated by newlines, making it particularly suitable for processing large datasets.

AWS Step Functions Distributed Map

AWS Step Functions Distributed Map

Distributed Map can also process data from a broader range of delimited file formats stored in Amazon S3 and offers new output transformations for greater control over result formatting.

Developer Tools

Serverless Land patterns are now available directly within VS Code.

You no longer need to switch between your IDE and external resources when building serverless architectures. Browse, search, and implement pre-built serverless patterns directly in VS Code.

Example Serverless Pattern

Example Serverless Pattern

AWS Lambda

Learn how AWS Lambda handles billions of invocations.

AWS Lambda asynchronous invocations

AWS Lambda asynchronous invocations

This blog post provides recommendations and insights for implementing highly distributed applications based on the Lambda service team’s experience building its robust asynchronous event processing system. It dives into challenges you might face, solution techniques, and best practices for handling noisy neighbors.

A new video walks through using the enhanced local IDE experience for Lambda developers.

AWS Lambda new IDE experience

AWS Lambda new IDE experience

The VS Code extension for Lambda now supports live tailing of CloudWatch Logs directly in your IDE following on from previous support for Live Tail in the Lambda console. Watch logs in real-time as your functions execute, making debugging and troubleshooting more efficient than ever.

You can now enable Application Performance Monitoring (APM) for Java and .NET runtimes using Amazon CloudWatch Application Signals.

Amazon CloudWatch Application Signals for Java and .NET AWS Lambda runtimes

Amazon CloudWatch Application Signals for Java and .NET AWS Lambda runtimes

This provides deep visibility into your function’s performance, including method-level tracing, memory profiling, and automated anomaly detection.

Amazon Bedrock features

Multi-agent collaboration is now available in Bedrock as a preview, enabling you to create systems where multiple AI agents work together to solve complex problems. Agents can specialize in different domains, share context, and coordinate their actions to achieve goals that would be difficult for a single agent.

RAG evaluation is now generally available. This provides metrics to assess and improve your retrieval augmented generation pipelines. GraphRAG for Bedrock Knowledge Bases is now generally available, allowing you to enhance retrievals with graph-based context.

Amazon Bedrock Flows now supports multi-turn conversations, allowing you to build dynamic AI applications that maintain context across multiple user interactions. Bedrock data automation is now generally available, streamlining the process of preparing, ingesting, and maintaining data for your GenAI applications. Bedrock now offers LLM-as-a-judge capability for model evaluation, providing automated assessment of model outputs without requiring human reviewers. Compare different models or prompt strategies against your specific criteria at scale.

Bedrock’s capabilities are now integrated into the Amazon SageMaker Unified Studio, creating a seamless experience for machine learning practitioners who want to incorporate foundation models into their workflows. Access Bedrock models, fine-tuning, and evaluation directly from SageMaker.

Amazon Nova is a new generation of state-of-the-art foundation models that deliver frontier intelligence and industry leading price-performance. Nova has expanded its tool use and converse API capabilities, making it easier for developers to build AI assistants that can use external tools to complete tasks.

Amazon Bedrock Guardrails image content filters are now generally available. Define and enforce boundaries for your AI applications with controls for both text and image content, ensuring outputs align with your organization’s policies.

Bedrock Knowledge Bases now supports using your existing OpenSearch clusters as the vector storage backend. This integration allows you to leverage your investments in OpenSearch while benefiting from the managed RAG capabilities of Bedrock.

New Amazon Bedrock models

  • Anthropic’s Claude 3.7 Sonnet hybrid reasoning allows you to toggle between standard and extended thinking modes. In standard mode, it functions as an upgraded version of Claude 3.5 Sonnet. While in extended thinking mode, it employs self-reflection to achieve improved results across a wide range of tasks.
  • DeepSeek R1, an advanced model specialized in research and scientific reasoning excels at complex problem-solving tasks and technical content generation.
  • Cohere Embed 3 models are now available in both multilingual and English-specific versions. These embedding models support text and images, providing more accurate representation for multimodal content and improving retrieval augmented generation (RAG) applications.
  • Ray2, Luma AI’s new visual AI model is capable of creating realistic visuals with fluid, natural movement. You can use it for image understanding, 3D scene reconstruction, and visual content generation, opening new possibilities for immersive and visual applications.
  • Bedrock now supports fine-tuning of Meta’s latest Llama 3.2 models. These upgraded models deliver improved performance across reasoning, coding, and multilingual tasks while being more efficient with computational resources.

Amazon Q Developer

Amazon Q Developer is now available as a CLI agent, bringing AI-assisted development to the command line. Get contextual recommendations, generate shell commands, and solve coding problems without leaving your terminal.

Amazon Q CLI

Amazon Q CLI

Amazon Q Developer transformation now supports upgrading Java applications using Maven to Java 21. It offers enhanced code suggestions, refactoring, and optimization recommendations for applications using the latest Java features, like virtual threads and pattern matching.

AWS AppSync

AWS AppSync Events now supports events publishing for WebSocket APIs, enabling real-time publish-subscribe functionality. This feature makes it easier to build applications requiring instant updates, like chat applications, collaborative tools, and real-time dashboards.

AWS AppSync Events

AWS AppSync Events

There are new AWS Cloud Development Kit (AWS CDK) L2 constructs for AppSync WebSocket APIs. These make it simpler to define and deploy real-time APIs using infrastructure as code. These high-level constructs handle the details of WebSocket connections, authorization, and messaging patterns.

Amazon SNS

Amazon SNS now supports high throughput mode for SNS FIFO topics, with default throughput matching SNS standard topics. When you enable high-throughput mode, SNS FIFO topics will maintain order within message group, while reducing the de-duplication scope to the message-group level.

Amazon EventBridge

Amazon EventBridge now supports direct delivery to targets across AWS accounts, simplifying multi-account architectures. This reduces latency and improves reliability when routing events between accounts in your organization.

Amazon EventBridge cross account

Amazon EventBridge cross account

The EventBridge console now features event source discovery, making it easier to find and visualize available event sources in your AWS environment. This tool helps you identify potential event producers and understand the event schemas they emit.

AWS Amplify

AWS Amplify now offers a TypeScript data client optimized for server-side Lambda functions, providing type-safe access to your data sources. This client reduces code complexity and improves reliability when working with databases and APIs in server environments.

Serverless compute blog posts

January

February

March

Serverless Office Hours weekly livestream

February

March

Still looking for more?

The Serverless landing page has more information. The Lambda resources page contains case studies, webinars, whitepapers, customer stories, reference architectures, and even more Getting Started tutorials.

You can also follow the Developer Advocacy team members who work on Serverless to see the latest news, follow conversations, and interact with the team.

And finally, visit the Serverless Land  for all your serverless needs.

Simplifying private API integrations with Amazon EventBridge and AWS Step Functions

Post Syndicated from Eric Johnson original https://aws.amazon.com/blogs/compute/simplifying-private-api-integrations-with-amazon-eventbridge-and-aws-step-functions-2/

This blog written by Pawan Puthran, Principal Specialist TAM, Serverless and Vamsi Vikash Ankam, Senior Serverless Solutions Architect.

In December 2024, AWS announced that Amazon EventBridge and AWS Step Functions support integration with private APIs using AWS PrivateLink and Amazon VPC Lattice. This feature allows users to integrate applications seamlessly across private networks, on-premises infrastructure, and cloud platforms. It provides operational simplicity, enabling secure and controlled communication between services within a Virtual Private Cloud (VPC). This blog post explores how to leverage this new capability to integrate Step Functions with private APIs, making application interactions across private networks more efficient and secure.

Overview

Private integrations are essential for secure communication between cloud services within a VPC. As organizations modernize their applications in the cloud, they often need to integrate existing systems with private network environments. EventBridge and Step Functions previously needed proxies to send events to HTTPS applications. These proxies, such as AWS Lambda or Amazon Simple Queue Service (Amazon SQS), delivered events to applications running on Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Kubernetes Service (Amazon EKS), or Amazon Elastic Container Service (Amazon ECS). Now, users can directly invoke private HTTPS-based endpoints running within their VPC using EventBridge and Step Functions.

This new capability offers several key benefits:

  1. Enhanced security and compliance: Private API integrations significantly enhance security by keeping APIs within private networks, minimizing exposure to internet threats and making sure of compliance in regulated industries such as finance and healthcare.
  2. Simplified architecture and increased developer productivity: This feature streamlines integration by enabling direct access to private APIs, eliminating complex network setups and proxy solutions. It allows developers to focus on core logic, resulting in cleaner architectures, faster development, and reduced maintenance. By removing the need for custom code and unifying application architecture, the integration process accelerates, leading to faster time to market and enhanced innovation.
  3. Improved performance and reliability: Private API integrations to VPC resources enhance performance by leveraging the AWS backbone network. This direct connectivity improves speed, increases reliability, and minimizes external network dependencies and points of failure.

EventBridge and Step Functions use new capabilities of PrivateLink and VPC Lattice, Resource Gateway and Resource Configuration, to facilitate secure network connectivity to services and resources inside of a VPC. To establish the private connectivity, you need the following components:

  1. Resource Gateway: A Resource Gateway serves as a secure entry point for the inbound traffic to the resource. This acts as an ingress point within the VPC where the resources reside.
  2. Resource Configuration: A Resource Configuration is a logical entity that identifies the resource and specifies how and who can access it. Defining a resource configuration allows you to allow private, secure, and unidirectional network connectivity to resources in your VPC from clients and services in other VPCs and accounts.
  3. EventBridge Connections: EventBridge Connections used in EventBridge API destinations and Step Function workflows, establishes connectivity to your private HTTPS endpoints by using resource configurations.
  4. AWS Resource Access Manager: You can share the resource configuration through AWS Resource Access Manager (AWS RAM), a service that securely shares your VPC resources across your organizations and with other AWS accounts.

Workload overview

To illustrate how Step Functions invoke private HTTPS APIs, consider the following workflow that classifies product reviews as fake or real.

  1. The Step Functions workflow processes an array of product reviews using Distributed Map.
  2. It involves calling the Amazon Nova Micro model through Amazon Bedrock to classify the review text.
  3. If a review is classified as fake, then the workflow publishes an event to an EventBridge bus, providing a flexible integration for potential downstream analysis or notifications.
  4. If a review is classified as real, then Step Functions calls the private HTTPS endpoint, using DNS address to further process the reviews.
  5. This private API is hosted in AWS Fargate behind an internal Application Load Balancer (ALB) within a VPC.
Step Functions workflow calling private HTTPS-based endpoint running in AWS Fargate

Figure 1: Step Functions workflow calling private HTTPS-based endpoint running in AWS Fargate

In real-world scenarios, this includes analyzing text patterns, user behavior, and linguistic cues to determine the authenticity of each review. Suspicious reviews are automatically flagged by building customized workflows to maintain the integrity of the product feedback system.

Deploying the example

Before configuring the private integration, create an Amazon Route53 public hosted zone with a registered domain (such as api.com), and an AWS Certificate Manager (ACM) certificate corresponding to the domain. While Amazon Route53 private hosted zones is currently not supported, utilizing public hosted zones resolves the domain name to a private IP address, accessible only from within the VPC.

This post includes a sample application and deployment instructions. For complete details, refer to the README.

Scenario 1: Single account

In this scenario, the Step Functions, EventBridge connections, and private resources reside in the same account, as shown in the following figure

Overview of a single account setup with Step Functions workflow and private API in the same account

Figure 2: Overview of a single account setup with Step Functions workflow and private API in the same account

  1. VPC Resource Gateway acts like the entry point to access the private resources running within your VPC. As a best-practice, consider creating a resource gateway to span across multiple private subnets (Availability Zones) for high availability. Refer to the AWS Cloud Development Kit (AWS CDK) code snippet in lib/vpclattice-stack.ts for resource gateway implementation.
  2. Resource Configurations establish the connection between the private endpoint and the Resource Gateway and are used to uniquely identify the private resources running within your VPC. Refer to the AWS CDK code snippet in lib/vpclattice-stack.ts to create Resource Configuration, and configure the domain name and port.
  3. To enable Step Functions to communicate with the private VPC resources, you create an EventBridge Connection. This handles the authorization and private connectivity to connect to the private API. Refer to the AWS CDK code snippet in lib/workflow-stack.ts for creating EventBridge Connections.
  4. The Step Functions state machine deployed as part of the sample application uses the HTTPS Invoke task type to call the private API. Calling private APIs from Step Functions allows you to use features such as built-in error handling like retries for transient issues and redrive for errors.

You can use the following payload to test the Step Functions execution:

{
  "items": [
    {
      "asin": "B000FA64PA",
      "helpful": [ 0, 0],
      "overall": 5,
      "reviewText": "Darth Maul working under cloak of darkness committing sabotage now that is a story worth reading many times over. Great story.",
      "reviewTime": "10 11, 2013",
      "unixReviewTime": 1381449600
    },
    {
      "asin": "B000F83SZQ",
      "helpful": [ 1, 1],
      "overall": 4,
      "reviewText": "Never heard of Amy Brewster. But I don't need to like Amy Brewster to like this book. Actually, Amy Brewster is a sidekick in this story, who added mystery to the story not the one resolved it. The story brings back the old times, simple life, simple people, and straight relationships.",
      "reviewTime": "03 22, 2014",
      "unixReviewTime": 1395446400
    }
  ]
}

The following figure shows the Step Functions execution where the review is classified as real and successfully invokes the private HTTPS endpoint.

Step Functions execution classifying the product reviews as real and successfully invoking the private API

Figure 3: Step Functions execution classifying the product reviews as real and successfully invoking the private API

Scenario 2: Cross account

In this scenario, all the private resources reside in Account A. The Step Functions and EventBridge Connections reside in Account B. The cross-account resource sharing is powered by AWS RAM, as shown in the following figure.

Cross-account setup

Figure 4: Cross-account setup

Following the creation of the Resource Gateway and the Resource Configuration, as described in the previous section, configure the resource share using AWS RAM in Account A.

  1. The sample application creates the AWS RAM resource share in Account A. This allows Account B to access private VPC resources in Account A, enabling secure, AWS Identity and Access Management (IAM) authorized access to the VPC resources in Account A. Refer to the CDK code snippet in lib/vpclattice-stack.ts to create cross-account resource share using AWS RAM.
  2. In Account B, AWS RAM receives an invitation from Account A to access the private VPC resources. Upon acceptance, the resource share status changes to Active, granting access to the private VPC resources in Account A.
  3. To enable access from Account B’s Step Function or EventBridge to Account A’s private VPC resources, create an EventBridge Connection as described in Step 3 (Single account scenario). Map this connection to the shared AWS RAM Resource Configuration created from the previous step.

Enterprises with distributed development teams operate across multiple AWS accounts. The setup described above enables secure cross-account access to VPC resources.

New connection state events

EventBridge now publishes change in the state events for new or existing connections. This is useful when taking actions on state changes or for troubleshooting purposes. The following example shows the state change events published for Connection Authorized and Connection Activated.

Figure 5: EventBridge connections state change

Figure 5: EventBridge connections state change

Conclusion

The new integration allows Amazon EventBridge and AWS Step Functions to integrate with private APIs, powered by AWS PrivateLink and Amazon VPC Lattice. Users can integrate legacy on-premises systems with cloud-native applications using event-driven architectures and workflow orchestration. The integration helps enterprises modernize distributed applications across public and private networks, enabling faster innovation, higher performance, and lower costs by eliminating the need for custom networking or integration code.

For more details, refer to the EventBridge and Step Functions documentation. Check out this video on setting up integrations with EventBridge and Step Functions. Get the sample code used in this post from this GitHub repository.

To expand your serverless knowledge, visit Serverless Land.

From virtual machine to Kubernetes to serverless: How dacadoo saved 78% on cloud costs and automated operations

Post Syndicated from Andreas Gehrig original https://aws.amazon.com/blogs/architecture/from-virtual-machine-to-kubernetes-to-serverless-how-dacadoo-saved-78-on-cloud-costs-and-automated-operations/

dacadoo is a global Swiss-based technology company that develops solutions for digital health engagement and health risk quantification. Their products include a software-as-a-service (SaaS)-based digital health engagement platform that uses behavioral science, AI, and gamification to help end users improve their health outcomes.

The company embarked on a journey to modernize an API to quantify health and lifestyle data plus a risk engine to calculate mortality and morbidity probabilities based on years of scientific research data.

To transform a virtual machine–based API service into a globally redundant, scalable health score and risk calculation solution dacadoo chose Amazon Web Services (AWS) technology. The service handles highly sensitive health data from a global customer base and must comply with regional regulations.

The result is a cost reduction of 78% and an infrastructure maintenance effort of less than an hour per year , allowing dacadoo to deliver and operate more AWS infrastructure without scaling its site reliability engineering (SRE) team, thanks to a high level of automation and an agile mindset.

In this post, we walk you step-by-step through dacadoo’s journey of embracing managed services, highlighting their architectural decisions as we go.

Background

The solution architecture went through a three-stage journey:

  1. Incubation – Single virtual machine on premises with disaster recovery (DR) in Switzerland
  2. Global and scalable – Multiple global Kubernetes clusters
  3. Operational excellence – Fully serverless and geo-redundant on AWS

Stage 1: Incubation with a virtual machine

After years of scientific research and development, the service was launched, running on a single on-premises virtual machine that used hypervisor technology to provide disaster recovery (DR). However, it had no high availability (HA) capability and it required manual recovery.

The application serving the API requests and the NoSQL database were both running on the same host. Software deployment and operating system maintenance were performed manually using Secure Shell (SSH)—a typical low-automation setup that also included downtime.

The following architecture diagram shows a virtual machine encompassing the monolithic application and its database.

Monolithic architecture

Challenges

A single virtual machine was quick to set up and inexpensive to operate, but it had considerable shortcomings. The health API was only available in Switzerland, infrastructure maintenance was performed manually, and software deployment was handled manually. Additionally, database backups were done using virtual machine snapshots, uptime monitoring only, and testing was conducted on the developer workstation.

Stage 2: Global and scalable with Kubernetes

At that time, dacadoo made a strategic decision to heavily invest in Kubernetes for managing containerized workloads on a global scale. As part of this technology rollout, the health score and risk service were migrated to Kubernetes.

Due to the geographically distributed customer base and low latency requirements, three Kubernetes clusters were deployed, one on each continent. The NoSQL database was hosted in proximity to the workload to reduce service latency and keep the migration effort low.

To reduce the operational maintenance, the NoSQL database was integrated as a SaaS offering, and monitoring was centralized using Datadog.

All cloud infrastructure was provisioned exclusively with Terraform, covering the Kubernetes cluster, NoSQL database , and integration with GitLab and Datadog.

dacadoo containerized the API service and used Gitlab continuous integration and continuous deployment (CI/CD) pipelines to deploy multiple environments and clusters on a global hyperscaler.

In retrospect, this was a typical replatform modernization project from virtual machine to Kubernetes, with a high level of automation and a SaaS-first approach.

The following diagram is the architecture for the container solution with managed NoSQL database.

Containers architecture

Challenges

The service faced several challenges, including increased costs from deploying three regional Kubernetes clusters across three environments, resulting in 27 cluster nodes and additional expenses from managing NoSQL database SaaS instances for each cluster. The complexity of CI/CD pipelines for multi-environment multi-cluster deployments added to the difficulty. Significant operational effort was required to keep infrastructure and Kubernetes components up to date.

Stage 3: Operational excellence with serverless

The Kubernetes-based architecture met the requirements, but some features in the dacadoo API service backlog needed to fit better with the application architecture at the time.

This was the right moment to take a holistic view of the infrastructure and software architecture and refactor the solution according to the latest AWS technologies and best practices, the next frontier for dacadoo’s engineering team.

Solution requirements

Requirements for the solution refactoring were as follows:

  • Keep the functionality of the API unmodified
  • Constrain data processing to a region of choice for compliance with local data protection laws
  • Avoid weekly patch cycles by exclusively using managed serverless services
  • Reduce costs by choosing services with a pay-as-you-go billing model
  • Delegate authentication to a dedicated service
  • Use an established web framework with an extensive ecosystem

Refactoring the apps

The API service has two components: a developer portal and the health score and risk calculations API. The database is only required for API keys, algorithm parameters, quotas, and usage statistics. Health data is processed regionally by the compute layer but not persisted, opening the door for a distributed database: Amazon DynamoDB global tables is the perfect fit for the solution. Writes are distributed to all connected Regions, whereas reads are local, providing low latency for complying with dacadoo service level agreements (SLAs).

The developer portal is a web UI with API documentation and API key management features. AWS Lambda is a great fit because it scales automatically and has a pay-per-request billing model.

The health and risk API uses algorithms implemented in the C programming language for short bursting, compute-intense simulations. These calls are wrapped by a REST API using the Python FastAPI framework. These characteristics make AWS Lambda a great fit.

Serverless architecture

HTTP requests are routed to the Lambda functions using Amazon API Gateway with AWS WAF for protection from malicious requests and attacks. Static assets are served from an Amazon Simple Storage Service (Amazon S3) bucket through API Gateway. The additional features of Amazon CloudFront aren’t required, and Amazon S3 reduces the complexity.

Amazon Route 53 provides a powerful feature known as latency-based routing, which allows it to direct DNS queries to the endpoint that offers the lowest latency for the requester.

This feature provides Regional high availability for API users without data processing location requirements. Alternatively, the user can call specific Regional endpoints to make sure requests are processed in the desired Region.

API authorization is HTTP header-based and is performed in the application with data stored in Amazon DynamoDB.

The following diagram is the architecture for a geo-redundant fully serverless solution.

Serverless architecture

With a dacadoo SRE team proficient in Python, they opted for Pulumi for its advanced features such as programming language flow control constructs, powerful configuration capabilities, and multi-cloud support.

For continuous integration, GitLab CI compiles the algorithm library, tests the FastAPI applications and packages everything. The application deployment is just an update of the AWS Lambda, a simple and reliable workflow.

Summary

The solution evolved from a managed infrastructure setup, where the customer held most of the responsibility, to an AWS managed service architecture.

Infrastructure provisioning evolved from manual, error-prone processes to powerful code-driven workflows in Pulumi. The SRE needed to enhance their software engineering skills to adopt Pulumi, transitioning from configuration-based approaches to designing and maintaining an infrastructure code base using object-oriented Python. This was part of dacadoo’s investment in the SRE team and broader modernization efforts. The serverless architecture enabled a GitOps engineering culture focused on productivity.

The transformation maximized scalability and availability while reducing costs and operational effort:

Virtual machine

  • Scalability: Low
  • Availability: Best effort
  • Infrastructure costs: Low
  • Maintenance effort: High

Kubernetes

  • Scalability: High
  • Availability: 99.95%
  • Infrastructure costs: High
  • Maintenance effort: Medium

Serverless

  • Scalability: Very high
  • Availability: 99.999% (with failover to another AWS Region)
  • Infrastructure costs: Low
  • Maintenance effort: Very low

The global redundancy elevates availability to an impressive 99.999% while keeping the costs low.

Conclusion

Migrating from a virtual machine to Kubernetes and ultimately to AWS Lambda demonstrates the progression of cloud engineering toward enhanced efficiency and scalability.

Each step in this journey reduced the complexity of managing resources while increasing flexibility and automation. Transitioning dacadoo’s API service to a fully serverless, geo-redundant architecture not only advanced the platform but also upskilled engineers, maintained a lean SRE team, and kept infrastructure costs low. Get started with your own AWS serverless solution.


About the Authors

Optimizing network footprint in serverless applications

Post Syndicated from Chris McPeek original https://aws.amazon.com/blogs/compute/optimizing-network-footprint-in-serverless-applications/

This post is authored by Anton Aleksandrov, Principal Solution Architect, AWS Serverless and Daniel Abib, Senior Specialist Solutions Architect, AWS

Serverless application developers may commonly encounter scenarios where they need to transport large payloads, especially when building modern cloud applications that need rich data. Examples include analytics services with detailed reports, e-commerce platforms with extensive product catalogs, healthcare applications transmitting patient records, or financial services aggregating transactional data.

Many serverless services have a well-defined maximum payload size. For example, AWS Lambda maximum request/response payload size is 6 MB, and Amazon Simple Queue Service (Amazon SQS) and Amazon EventBridge maximum message size is 256 KB. In this post, you will learn how to use data compression techniques to reduce your network footprint and transport larger payloads under existing constraints.

Overview

Cloud applications evolve continuously and need to be adjusted frequently for new requirements, such as new business features or new Service Level Objectives (SLO) for higher throughput and lower latency. As new use cases and data patterns are added, it is common to see request and response payload sizes increase. At some point, you might hit the maximum service payload size limits, such as 6 MB for synchronous Lambda function invokes, 10 MB for Amazon API Gateway, and 256 KB for Amazon SQS, EventBridge, and asynchronous Lambda invokes.

There are several techniques you can apply when dealing with large payloads. If your payloads are tens of MBs or more, or you need to transport large binary objects with API Gateway, you can store the payload on Amazon Simple Storage Service (Amazon S3) and use pre-signed URLs for clients to directly upload and download from S3.

A sample of architecture for handling large payloads.

Figure 1. A sample of architecture for handling large payloads

Lambda function URLs response streaming supports up to 20 MB responses. For handling large messages with services such as SQS or EventBridge, you can store the message in S3 and pass a reference. The downstream consumer will use the reference to download the message directly from S3. One common characteristic of these techniques is that they introduce architectural complexity and may necessitate modifications to your existing solution architecture and data flow patterns.

Furthermore, as your payloads grow in size, you will see increased data transfer costs, especially if your solution is transporting data through Amazon Virtual Private Cloud (VPC) NAT Gateways, VPC endpoints, or sending data across AWS Regions. For example, it is common for VPC-based solutions to have Lambda functions in their architecture. A container running on Amazon Elastic Kubernetes Service (Amazon EKS) might need to invoke a Lambda function, or a VPC-attached Lambda function might need to reach out to the public internet.

Examples of using virtual network appliances with serverless applications.

Figure 2. Examples of using virtual network appliances with serverless applications

Both NAT Gateway and VPC Endpoint are billed per GB of data processed, which makes data compression a valuable optimization technique. Go to NAT Gateway pricing and VPC Endpoint pricing for details.

The following sections explore data compression techniques and demonstrate how to apply them in your serverless applications. You can learn how to send larger payloads within the existing payload size boundaries and reduce your network footprint without significant architectural changes. This post discusses compression techniques in the context of Lambda and API Gateway, but the same principles can be applied to other services, such as SQS, EventBridge, and AWS AppSync. Understanding compression concepts better equips you to optimize your application’s data-handling capabilities.

What is data compression?

Compression is a widely used approach to reduce data size in order to improve cost-effectiveness and performance for data storage and transmission. Many tools and frameworks incorporate data compression techniques, such as gzip or zstd. It is thoroughly documented in the official IANA specification and IETF RFC 9110. Browsers such as Chrome and Firefox, HTTP toolkits such as curl and Postman, and runtimes such as Node.js and Python natively handle compression, often without user involvement.

Consider HTTP protocol. When a client wants to send a compressed payload, it specifies it in the Content-Type header. To receive a compressed response, the client specifies supported compression methods in the Accept-Encoding request header.

Accept-Encoding request header specifying supported compression methods.

Figure 3. Accept-Encoding request header specifying supported compression methods

The server compresses the response payload using one of the supported methods and uses the Content-Encoding response header to indicate the method to the client.

Content-Encoding response header specifying compression method.

Figure 4. Content-Encoding response header specifying compression method

This mechanism can accelerate client-server communications by reducing the number of bytes transmitted over the network. Compression efficiency depends on the data type. Text-based formats like JSON, XML, HTML, and YAML compress well, while binary data such as PDF and JPEG generally compress less effectively.

Data compression with API Gateway

API Gateway provides built-in compression support. Use the minimumCompressionSize configuration to set the smallest payload size to compress automatically. The value can be between 0 bytes to 10 MB. Compressing very small payloads might actually increase the final payload size, and you should always test with your real payload patterns to determine the optimal threshold.

Handling data compression in API Gateway.

Figure 5. Handling data compression in API Gateway

API Gateway enables clients to interact with your API using compressed payloads through supported content encodings. The compression mechanism works bi-directionally. For JSON payloads, API Gateway seamlessly handles compression and decompression, maintaining compatibility with mapping templates. It decompresses incoming payloads before applying request mapping templates and compresses outgoing responses after applying response mapping templates. This automated compression optimizes data transfer:

  • When sending compressed data, clients supply the appropriate Content-Encoding header. API Gateway handles the decompression and applies configured mapping templates before forwarding the request to the integration.
  • When API Gateway receives an integration response and compression is enabled, it compresses the response payload and returns it to the client, provided that the client has included a matching Accept-Encoding header.

A sample test using the compression technique with API Gateway and JSON payload yielded the following results.

  • Compression disabled. Response size = 1 MB, response latency = 660 ms
  • Compression enabled. Response size = 220 KB, response latency = 550 ms

Compressing data resulted in 78% network footprint reduction and improved latency by 110 ms.

This configuration-based technique uses the API Gateway native compression. However, payloads are decompressed before being delivered to downstream integrations, thus they still remain subject to Lambda’s 6 MB max payload size. To address this, you can configure binaryMediaTypes in the API Gateway to pass compressed payloads to Lambda directly, enabling the function to handle decompression.

CDK code to configure API Gateway for data compression and binary data passthrough.

Figure 6. CDK code to configure API Gateway for data compression and binary data passthrough

Handling compressed data in Lambda functions

The Lambda Invoke API supports payloads in plain-text formats, such as JSON. The maximum payload size is 6 MB for synchronous invocations and 256 KB for asynchronous. Although the Invoke API supports uncompressed text-based payloads, you can introduce data compression in your function code and use API Gateway or Function URLs to facilitate content conversion, as illustrated in the following figure.

Transporting compressed payloads in a serverless applications.

Figure 7. Transporting compressed payloads in a serverless applications

Handling data compression in your Lambda function code can be done through libraries commonly embedded in the runtime. The following code snippet shows the compressing response payload using Node.js. Similar techniques can be applied to other runtimes.

Sample code implementing response payload compression in a Lambda function.

Figure 8. Sample code implementing response payload compression in a Lambda function

  • Line 1: Import gzip functionality from the zlib module.
  • Lines 11: Compress and Base64-encode data. Gzip compression, similar to many other compression methods, produces a binary stream. Base64 encoding converts it to the text-based format expected by the Lambda service
  • Lines 13-21: Response object is created with isBase64Encoded=true and response headers telling the client that the response is a gzip-encoded JSON object.

The following screenshot shows the result: 20 MB uncompressed JSON returned from a Lambda function as a 2.5 MB compressed response body. Network footprint reduced by over 80%.

A screenshot from Postman showing the original and compressed payload size.

Figure 9. A screenshot from Postman showing the original and compressed payload size

Using this technique, you can reduce your network footprint and transport payload sizes several times higher than the Lambda maximum payload size.

Using Function URLs with compressed payloads

Transporting compressed payloads through Lambda Function URLs doesn’t necessitate any extra configuration. For handler responses, your code needs to compress and Base64-encode the data as shown in the preceding figure. For invocation requests, the Function URL endpoint recognizes the incoming compressed payload as binary and passes it to your handler as a Base64 encoded string in the event body.

Sample code implementing request payload decompression in a Lambda function.

Figure 10. Sample code implementing request payload decompression in a Lambda function

Trade-offs and testing results

Compressing data in function code is a CPU-intensive activity, potentially increasing invocation duration and, as a result, function cost. This, however, can be balanced by the benefits of data compression. As you’ve seen in previous sections, while compressing data adds compute latency, transporting smaller payloads over the network reduces network latency. The following section summarizes a series of tests performed to estimate the impact of data compression on Lambda function invocation duration, Lambda function invocation cost, and data transfer savings with both NAT Gateway and VPC Endpoint. The tests were performed with several assumptions and randomly generated JSON data. You can see full testing results in the sample GitHub.com repo.

Test results demonstrated that the impact on function latency and cost primarily depends on two key factors: payload size and allocated memory (which determines vCPU capacity). Using a Node.js runtime with ARM architecture as an example, compressing a 1 MB JSON object in a function with 1 GB of allocated memory resulted in 124 ms of added processing time on average. For 10 million invocations, this extra processing time adds approximately $16. At the same time, the compression yielded a 70% reduction in payload size. With the same number of invocations, this translates to approximately $300 in savings when using NAT Gateway and $70 in savings when using VPC Endpoints (depending on the number of Availability Zones (AZs)).

AWS Service pricing is updated regularly, you should always consult the respective pricing pages for the latest information. Moreover, you should conduct your own performance and cost estimates using payloads that represent your workloads. Compression effectiveness varies significantly depending on the data type: payloads with low compression rates might not benefit from this technique.

Sample application

Follow the instructions in this GitHub repository to provision the sample in your AWS account. The project creates two Lambda functions to demonstrate receiving and returning compressed JSON using Function URLs and API Gateway.

The sample shows how to GET and POST JSON payloads using gzip compression to reduce the network footprint by over 80%.

A screenshot from Postman showing the original and compressed payload size.

Figure 11. A screenshot from Postman showing the original and compressed payload size

Conclusion

Data compression enables larger payload transfers and reduces network footprint. It can help to lower network latencies and optimize data transfer costs. When implementing compression within Lambda functions, it is important to consider its CPU-bound nature, which may increase function duration and costs. You should always evaluate the added compute cost against potential data transfer savings to make sure the technique benefits your use case.

Compression is most effective for handling large text-based payloads and when a slight increase in compute latency balanced by reduced network latency is acceptable.

To learn more about Serverless architectures and asynchronous Lambda invocation patterns, see Serverless Land.

Build a data lakehouse in a hybrid Environment using Amazon EMR Serverless, Apache DolphinScheduler, and TiDB

Post Syndicated from Shiyang Wei original https://aws.amazon.com/blogs/big-data/build-a-data-lakehouse-in-a-hybrid-environment-using-amazon-emr-serverless-apache-dolphinscheduler-and-tidb/

While helping our customers build systems on AWS, we found out that a large number of enterprise customers who pay great attention to data security and compliance, such as B2C FinTech enterprises, build data-sensitive applications on premises and use other applications on AWS to take advantage AWS managed services. Using AWS managed services can greatly simplify daily operation and maintenance, as well as help you achieve optimized resource utilization and performance.

This post discusses a decoupled approach of building a serverless data lakehouse using AWS Cloud-centered services, including Amazon EMR Serverless, Amazon Athena, Amazon Simple Storage Service (Amazon S3), Apache DolphinScheduler (an open source data job scheduler) as well as PingCAP TiDB, a third-party data warehouse product that can be deployed either on premises or on the cloud or through a software as a service (SaaS).

Solution overview

For our use case, an enterprise data warehouse with business data is hosted on an on-premises TiDB platform, an AWS Global Partner that is also available on AWS through AWS Marketplace.

The data is then processed by an Amazon EMR Serverless Job to achieve data lakehouse tiering logic. Different tiering data are stored in separate S3 buckets or separate S3 prefixes under the same S3 bucket. Typically, there are four layers in terms of data warehouse design.

  1. Operational data store layer (ODS) – This layer stores raw data of the data warehouse
  2. Data warehouse stage layer (DWS) – This layer is a temporary staging area within the data warehousing architecture where data from various sources is loaded, cleaned, transformed, and prepared before being loaded into the data warehouse database layer;
  3. Data warehouse database layer (DWD) – This layer is the central repository in a data warehousing environment where data from various sources is integrated, transformed, and stored in a structured format for analytical purposes;
  4. Analytical data store (ADS) – This layer is a subset of the data warehousing that is specifically designed and optimized for a particular business function, department, or analytical purpose.

For this post, we only use ODS and ADS layers to demonstrate the technical feasibility.

The schema of this data is managed through the AWS Glue Data Catalog, and can be queried using Athena. The EMR Serverless Jobs are orchestrated using Apache DolphinScheduler deployed in cluster mode on Amazon Elasctic Compute Cloud (Amazon EC2) instances, with meta data stored in an Amazon Relational Database Service (Amazon RDS) for MySQL instance.

Using DolphinScheduler as the data lakehouse job orchestrator offers the following advantages:

  • Its distributed architecture allows for better scalability, and the visual DAG designer makes workflow creation more intuitive for team members with varying technical expertise
  • It provides more granular task-level controls and supports a wider range of task types out-of-the-box, including Spark, Flink, and machine learning (ML) workflows, without requiring additional plugin installations;
  • Its multi-tenancy feature enables better resource isolation and access control across different teams within an organization.

However, DolphinScheduler requires more initial setup and maintenance effort, making it more suitable for organizations with strong DevOps capabilities and a desire for complete control over their workflow infrastructure.

The following diagram illustrates the solution architecture.

Prerequisites

You need to create an AWS account and set up an AWS Identity and Access Management (IAM) user as a prerequisite for the following implementation. Complete the following steps:

For AWS account signing up, please follow up the actions guided per page link.

  1. Create an AWS account.
  2. Sign in to the account using the root user for the first time.
  3. One the IAM console, create an IAM user with AdministratorAccess Policy.
  4. Use this IAM user to log in AWS Management Console rather the root user.
  5. On the IAM console, choose Users in the navigation pane.
  6. Navigate to your user, and on the Security credentials tab, create an access key.
  7. Store the access key and secret key in a secure place and use them for further API access of the resources of this AWS account.

Set up DolphinScheduler, IAM configuration, and the TiDB Cloud table

In this section, we walk through the steps to install DolphinScheduler, complete additional IAM configurations to enable the EMR Serverless job, and provision the TiDB Cloud table.

Install DolphinScheduler on an EC2 instance with an RDS for MySQL instance storing DolphinScheduler metadata. The production deployment mode of DolphinScheduler is cluster mode. In this blog, we use pseudo cluster mode which has the same installation steps as cluster mode, and could achieve resource economy. We name the EC2 instance ds-pseudo.

Make sure the inbound rule of the security group attached to the EC2 instance allows port 12345’s TCP traffic. Then complete the following steps:

  1. Log in to Amazon EC2 as the root user, and install jvm:
    sudo dnf install java-1.8.0-amazon-corretto
    java -version

  2. Switch to dir /usr/local/src:
    cd /usr/local/src
  3. Install Apache Zookeeper:
    wget https://archive.apache.org/dist/zookeeper/zookeeper-3.8.0/apache-zookeeper-3.8.0-bin.tar.gz
    tar -zxvf apache-zookeeper-3.8.0-bin.tar.gz
    cd apache-zookeeper-3.8.0-bin/conf
    cp zoo_sample.cfg zoo.cfg
    cd ..
    nohup bin/zkServer.sh start-foreground &> nohup_zk.out &
    bin/zkServer.sh status

  4. Check the Python version:
    python3 --version

    The version should be 3.9 or above. It is recommended that you use Amazon Linux 2023 or later as the Amazon EC2 operating system (OS); Python version 3.9 meets the requirement. For detail information, refer to Python in AL2023.

  5. Install Dolphinscheduler
    1. Download the dolphinscheduler package:
      cd /usr/local/src
      wget https://dlcdn.apache.org/dolphinscheduler/3.1.9/apache-dolphinscheduler-3.1.9-bin.tar.gz
      tar -zxvf apache-dolphinscheduler-3.1.9-bin.tar.gz
      mv apache-dolphinscheduler-3.1.9-bin apache-dolphinscheduler
    2. Download the mysql connector package:
      wget https://downloads.mysql.com/archives/get/p/3/file/mysql-connector-j-8.0.31.tar.gz
      tar -zxvf mysql-connector-j-8.0.31.tar.gz
    3. Copy specific mysql connector JAR file to the following destinations:
      cp mysql-connector-j-8.0.31/mysql-connector-j-8.0.31.jar ./apache-dolphinscheduler/api-server/libs/
      cp mysql-connector-j-8.0.31/mysql-connector-j-8.0.31.jar ./apache-dolphinscheduler/alert-server/libs/
      cp mysql-connector-j-8.0.31/mysql-connector-j-8.0.31.jar ./apache-dolphinscheduler/master-server/libs/
      cp mysql-connector-j-8.0.31/mysql-connector-j-8.0.31.jar ./apache-dolphinscheduler/worker-server/libs/
      cp mysql-connector-j-8.0.31/mysql-connector-j-8.0.31.jar ./apache-dolphinscheduler/tools/libs/
    4. Add the user dolphinscheduler, and make sure the directory apache-dolphinscheduler and the files under it are owned by the user dolphinscheduler:
      useradd dolphinscheduler
      echo "dolphinscheduler" | passwd --stdin dolphinscheduler
      sed -i '$adolphinscheduler ALL=(ALL) NOPASSWD: NOPASSWD: ALL' /etc/sudoers
      sed -i 's/Defaults   requirett/#Defaults requirett/g' /etc/sudoers
      chown -R dolphinscheduler:dolphinscheduler apache-dolphinscheduler
  6. Install the mysql client:
    sudo dnf update -y 
    sudo dnf install mariadb105
  7. On the Amazon RDS console, provision an RDS for MySQL instance with the following configurations:
    1. For Database Creation Method, select Standard create.
    2. For Engine options, choose MySQL.
    3. For Edition: choose MySQL 8.0.35.
    4. For Templates: select Dev/Test.
    5. For Availability and durability, select Single DB instance.
    6. For Credentials management, select Self-managed.
    7. For Connectivity, select Connect to an EC2 compute resource, and choose the EC2 instance created earlier.
    8. For Database Authentication: choose Password Authentication.
  8. Navigate to the ds- mysql database details page, and under Connectivity & security, copy the RDS for MySQL endpoint.
  9. Configure the intance:
    mysql -h <RDS for mysql Endpoint> -u admin -p
    mysql> CREATE DATABASE dolphinscheduler DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
    mysql> exit;
  10. Configure the dolphinscheduler configuration file:
    cd /usr/local/src/apache-dolphinscheduler/
  11. Revise dolphinscheduler_env.sh:
    vim bin/env/dolphinscheduler_env.sh
    export DATABASE=${DATABASE:-mysql}
    export SPRING_PROFILES_ACTIVE=${DATABASE}
    export SPRING_DATASOURCE_URL="jdbc:mysql://ds-mysql.cq**********.us-east-1.rds.amazonaws.com/dolphinscheduler?useUnicode=true&amp;characterEncoding=UTF-8&amp;useSSL=false"
    export SPRING_DATASOURCE_USERNAME="admin"
    export SPRING_DATASOURCE_PASSWORD="<your password>"
  12. On the Amazon EC2 console, navigate to the instance details page and copy the private IP address.
  13. Revise install_env.sh:
    vim bin/env/install_env.sh
    ips=${ips:-"<private ip address of ds-pseudo EC2 instance>"}
    masters=${masters:-"<private ip address of ds-pseudo EC2 instance>"}
    workers=${workers:-" private ip address of ds-pseudo EC2 instance:default"}
    alertServer=${alertServer:-" private ip address of ds-pseudo EC2 instance "}
    apiServers=${apiServers:-" private ip address of ds-pseudo EC2 instance "}
    installPath=${installPath:-"~/dolphinscheduler"}
    export JAVA_HOME=${JAVA_HOME:-/usr/lib/jvm/jre-1.8.0-openjdk}
    export PYTHON_HOME=${PYTHON_HOME:-/bin/python3}
  14. Configure the dolphinscheduler configuration file:
    cd /usr/local/src/apache-dolphinscheduler/
    bash tools/bin/upgrade-schema.sh
  15. Install DolphinScheduler:
    cd /usr/local/src/apache-dolphinscheduler/
    su dolphinscheduler
    bash ./bin/install.sh
  16. Start DolphinScheduler after installation:
    cd /usr/local/src/apache-dolphinscheduler/
    su dolphinscheduler
    bash ./bin/start-all.sh
  17. Open the DolphinScheduler console:
    http://<ec2 ip address>:12345/dolphinscheduler/ui/login

After input the initial username and password, press Login button to enter into the dashboard shown as below.

initial user/password admin/dolphinscheduler123

Configure IAM role to enable the EMR serverless job

The EMR serverless job role needs to have permission to access a specific S3 bucket to read job scripts and potentially write results, and also have permission to access AWS Glue to read the Data Catalog which stores the tables’ meta data. For detailed guidance, please refer to Grant permission to use EMR Serverless or EMR Serverless Samples.

The following screenshot shows the IAM role configured with the trust policy attached.


The IAM role should have the following permissions policies attached, as shown in the following screenshot.

Provision the TiDB Cloud table

  1. To provision the TiDB Cloud table, complete the following steps:
    1. Register for TiDB Cloud.
    2. Create a serverless cluster, as shown in the following screenshot. For this post, we name the cluster Cluster0.
  2. Choose Cluster0, then choose SQL Editor to create a database named test:
    create table testtable (id varchar(255));
    insert into testtable values (1);
    insert into testtable values (2);
    insert into testtable values (3);

Synchronize data between on-premises TiDB and AWS

In this section, we discuss how to synchronize historical data as well as incremental data between TiDB and AWS.

Use TiDB Dumpling to sync historical data from TiDB to Amazon S3

Use the commands in this section to dump data stored in TiDB as CSV files into a S3 bucket. For full details on how to achieve a data sync from on-premises TiDB to Amazon S3, see Export data to Amazon S3 cloud storage. For this post, we use TiDB tool Dumpling. Complete the following steps:

  1. Log in to the EC2 instance created earlier as root.
  2. Run the following command to install TiUP:
    curl --proto '=https' --tlsv1.2 -sSf https://tiup-mirrors.pingcap.com/install.sh | sh
    
    cd /root
    source .bash_profile
    
    tiup --version

  3. Run the following command to install Dumpling:
    tiup install dumpling
  4. Run the following command to achieve target database table dumpling to the specific S3 bucket.
    tiup dumpling -u <prefix.root> -P 4000 -h <tidb serverless endpoint/host> -r 200000 -o "s3://<specific s3 bucket>" --sql "select * from <target database>.<target table>" --ca "/etc/pki/tls/certs/ca-bundle.crt" --password <tidb serverless password>
  5. To acquire the TiDB serverless connection information, navigate to the TiDB Cloud console and choose Connect.

You can collect the specific connection information of test database from the following screenshot.

Yan can view the data stored in the S3 bucket on the Amazon S3 console.

You can use Amazon S3 Select to query the data and get results similar to the following screenshot, confirming that the data has been ingested into testtable.

Use TiDB Dumpling with a self-managed checkpoint to sync incremental data from TiDB to Amazon S3

To achieve incremental data synchronization using TiDB Dumpling, it’s essential to self-manage the check point of the target synchronized data. One recommended way is to store the ID of the final ingested record into a certain media (such as Amazon ElastiCache for Redis, Amazon DynamoDB) to achieve a self-managing checkpoint when running the shell/Python job that trigges TiDB Dumpling. The prerequisite for implementing this is that the target table has a monotonically increasing id field as its primary key.

You can use the following TiDB Dumpling command to filter the exported data:

tiup dumpling -u <prefix.root> -P 4000 -h <tidb serverless endpoint/host> -r 200000 -o "s3://<specific s3 bucket>" --sql "select * from <target database>.<target table> where id > 2" --ca "/etc/pki/tls/certs/ca-bundle.crt" --password <tidb serverless password>

Use the TiDB CDC connector to sync incremental data from TiDB to Amazon S3

The advantage of using TiDB CDC connector to achieve incremental data synchronization from TiDB to Amazon S3 is that there is built-in change data capture (CDC) mechanism, and because the backend engine is Flink, the performance is fast. However, there is one trade-off: you need to create several Flink tables to map the ODS tables on AWS.

For instructions to implement the TiDB CDC connector, refer to TiDB CDC.

Use an EMR serverless job to sync historical and incremental data from a Data Catalog table to the TiDB table

Data usually flows from on premises to the AWS Cloud. However, in some cases, the data might flow from the AWS Cloud to your on-premises database.

After landing on AWS, the data will be wrapped up and managed by the Data Catalog by created Athena tables with the specific tables’ schema. The table DDL script is as follows:

CREATE EXTERNAL TABLE IF NOT EXISTS `testtable`(
  `id` string
) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://<bucket_name>/<prefix_name>/';  

The screenshot below showcases the DDL running result using Athena console.

The data stored in testtable table is queried using select * from testable SQL. The query result is shown as follows:

In this case, an EMR serverless spark job can accomplish the work of synchronizing data from an AWS Glue table to your on premises table.

If the Spark job is written in Scala, the sample code is as below:

package com.example
import org.apache.spark.sql.{DataFrame, SparkSession}

object Main  {

  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder()
      .appName("<specific app name>")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("show databases").show()
    spark.sql("use default")
    var df=spark.sql("select * from testtable")

    df.write
      .format("jdbc")
      .option("driver","com.mysql.cj.jdbc.Driver")
      .option("url", "jdbc:mysql://<tidbcloud_endpoint>:4000/namespace")
      .option("dbtable", "<table_name>")
      .option("user", "<user_name>")
      .option("password", "<password_string>")
      .save()

    spark.close()
  }
}

You can acquire the TiDB serverless endpoint connection information on the TiDB console by choosing Connect, as shown earlier in this post.

After you have wrapped the Scala code as JAR file using SBT, you can submit the job to EMR Serverless with the following AWS Command Line Interface (AWS CLI) command:

export applicationId=00fev6mdk***

export job_role_arn=arn:aws:iam::<aws account id>:role/emr-serverless-job-role

aws emr-serverless start-job-run \
    --application-id $applicationId \
    --execution-role-arn $job_role_arn \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "<s3 object url for the wrapped jar file>",
            "sparkSubmitParameters": "--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory --conf spark.driver.cores=1 --conf spark.driver.memory=3g --conf spark.executor.cores=4 --conf spark.executor.memory=3g --jars s3://spark-sql-test-nov23rd/mysql-connector-j-8.2.0.jar"
        }
    }'

If the Spark job is written in PySpark, the sample code is as follows:

import os
import sys
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

if __name__ == "__main__":

    spark = SparkSession\
        .builder\
        .appName("app1")\
        .enableHiveSupport()\
        .getOrCreate()

    df=spark.sql(f"select * from {str(sys.argv[1])}")

    df.write.format("jdbc").options(
        driver="com.mysql.cj.jdbc.Driver",
        url="jdbc:mysql://tidbcloud_endpoint:4000/namespace ",
        dbtable="table_name",
        user="use_name",
        password="password_string").save()

    spark.stop()

You can submit the job to EMR Serverless using the following AWS CLI command:

export applicationId=00fev6mdk***

export job_role_arn=arn:aws:iam::<aws account id>:role/emr-serverless-job-role

aws emr-serverless start-job-run \
    --application-id $applicationId \
    --execution-role-arn $job_role_arn \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "<s3 object url for the python script file>",
            "entryPointArguments": ["testspark"],
            "sparkSubmitParameters": "--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory --conf spark.driver.cores=1 --conf spark.driver.memory=3g --conf spark.executor.cores=4 --conf spark.executor.memory=3g --jars s3://spark-sql-test-nov23rd/mysql-connector-j-8.2.0.jar"
        }
    }'

The preceding PySpark code and AWS CLI command achieves outbound parameter input as well: the table name (specifically testspark) is ingested into the SQL sentence when submitting the job.

EMR Serverless job pperation essentials

An EMR Serverless application is a resource pool concept. An application holds a certain capacity of compute, memory, and storage resources for jobs running on it to use. You can configure the resource capacity using AWS CLI or the console. Because it’s a resource pool, EMR Serverless application creation is usually a one-time action with the initial capacity and maximum capacity being configured.

An EMR Serverless job is a working unit that actually processes the compute task. In order for a job to work, you need to set the EMR Serverless application ID, the execution IAM role (discussed previously), and the specific application configuration (the resources the job is planning to use). Although you can create the EMR Serverless job on the console, it’s recommended to create the EMR Serverless job using the AWS CLI for further integration with the scheduler and scripts.

For more details on EMR Serverless application creation and EMR Serverless job provisioning, refer to EMR Serverless Hive query or EMR Serverless PySpark job

DolphinScheduler integration and job orchestration

DolphinScheduler is a modern data orchestration platform. It’s agile to create high- performance workflows with low code. It also provides a powerful UI, dedicated to solving complex task dependencies in the data pipeline and providing various types of jobs out of the box.

DolphinScheduler is developed and maintained by WhaleOps, and available in AWS Marketplace as WhaleStudio.

DolphinScheduler has been natively integrated with Hadoop: DolphinScheduler cluster mode is by default recommended to be deployed on a Hadoop cluster (usually on HDFS data nodes), and the HQL scripts uploaded to DolphinScheduler Resource Manager are stored by default on HDFS, and can be orchestrated using the following native Hive shell command:

Hive -f example.sql

Moreover, for specific case in which the orchestration DAGs are quite complicated, each DAG consists of several jobs (for example, more than 300), and almost all the jobs are HQL scripts stored in DolphinScheduler Resource Manager.

Complete the steps listed in this section to achieve a seamless integration between DolphinScheduler and EMR Serverless.

Switch the storage layer of DolphinScheduler Resource Center from HDFS to Amazon S3

Edit the common.properties files under directories /usr/local/src/apache-dolphinscheduler/api-server/ and directory /usr/local/src/apache-dolphinscheduler/worker-server/conf. The following code snippet shows the part of the file that needs to be revised:

# resource storage type: HDFS, S3, OSS, NONE
#resource.storage.type=NONE
resource.storage.type=S3
# resource store on HDFS/S3 path, resource file will store to this base path, self configuration, please make sure the directory exists on hdfs and have read write permissions. "/dolphinscheduler" is recommended
resource.storage.upload.base.path=/dolphinscheduler

# The AWS access key. if resource.storage.type=S3 or use EMR-Task, This configuration is required
resource.aws.access.key.id=AKIA************
# The AWS secret access key. if resource.storage.type=S3 or use EMR-Task, This configuration is required
resource.aws.secret.access.key=lAm8R2TQzt*************
# The AWS Region to use. if resource.storage.type=S3 or use EMR-Task, This configuration is required
resource.aws.region=us-east-1
# The name of the bucket. You need to create them by yourself. Otherwise, the system cannot start. All buckets in Amazon S3 share a single namespace; ensure the bucket is given a unique name.
resource.aws.s3.bucket.name=dolphinscheduler-shiyang
# You need to set this parameter when private cloud s3. If S3 uses public cloud, you only need to set resource.aws.region or set to the endpoint of a public cloud such as S3.cn-north-1.amazonaws.com.cn
resource.aws.s3.endpoint=s3.us-east-1.amazonaws.com

After editing and saving the two files, restart the api-server and worker-server by running the following commands, under folder path /usr/local/src/apache-dolphinscheduler/

bash ./bin/stop-all.sh
bash ./bin/start-all.sh
bash ./bin/status-all.sh

You can validate whether switching the storage layer to Amazon S3 was successful by uploading a script using DolphinScheduler Resource Center Console, check if the file appears in relevant S3 bucket folder.

Before verifying that Amazon S3 is now the storage location of DolphinScheduler, you need to create a tenant on the DolphinScheduler console and bundle the admin user with the tenant, as illustrated in the following screenshots:

After that, you can create a folder on the DolphinScheduler console, and check whether the folder is visible on the Amazon S3 console.

Make sure the job scripts uploaded from Amazon S3 are available in the DolphinScheduler Resource Center

After accomplishing the first task, you can upload the scripts from the DolphinScheduler Resource Center console, and confirm that the scripts are stored in Amazon S3. However, in practice, you need to migrate all scripts directly to Amazon S3. You can find and modify the scripts stored in Amazon S3 using DolphinScheduler Resource Center console. To do so, you can revise the metadata table t_ds_resources by inserting all the scripts’ metadata. The table schema of table t_ds_resources is shown in the following screenshot.

The insert command is as follows:

insert into t_ds_resources values(6, 'count.java', ' count.java','',1,1,0,'2024-11-09 04:46:44', '2024-11-09 04:46:44', -1, 'count.java',0);

Now there are two records in the table t_ds_resoruces.

You can access relevant records on the DolphinScheduler console.

The following screenshot shows the files on the Amazon S3 console.

Make the DolphinScheduler DAG orchestrator aware of the jobs’ status so the DAG can move forward or take relevant actions

As mentioned earlier, DolphinScheduler is natively integrated with the Hadoop ecosystem, and the HQL scripts can be orchestrated by the DolphinScheduler DAG orchestrator via Hive -f xxx.sql command. As a result, when the scripts changed to shell scripts or Python scripts (EMR Severless jobs needs to be orchestrated via shell scripts or Python scripts rather than the simple Hive command), the DAG orchestrator can start the job, but can’t get the real time status of the job, and therefore can’t continue the workflow to further steps. Because the DAGs in this case are very complicated, it’s not feasible to amend the DAGs; instead we follow a lift-and-shift strategy.

We use the following scripts to capture jobs’ status and take appropriate actions.

Persist the application ID list with the following code:

var=$(cat applicationlist.txt|grep appid1)
applicationId=${var#* }
echo $applicationId

Enable the DolphinScheduler step status auto-check using a Linux shell:

app_state
{
  response2=$(aws emr-serverless get-application --application-id $applicationId)
  application=$(echo $response1 | jq -r '.application')
  state=$(echo $application | jq -r '.state')
  echo $state
}

job_state
{
  response4=$(aws emr-serverless get-job-run --application-id $applicationId --job-run-id $JOB_RUN_ID)
  jobRun=$(echo $response4 | jq -r '.jobRun')
  JOB_RUN_ID=$(echo $jobRun | jq -r '.jobRunId')
  JOB_STATE=$(echo $jobRun | jq -r '.state')
  echo $JOB_STATE
}

state=$(job_state)

while [ $state != "SUCCESS" ]; do
  case $state in
    RUNNING)
         state=$(job_state)
         ;;
    SCHEDULED)
         state=$(job_state)
         ;;
    PENDING)
         state=$(job_state)
         ;;
    FAILED)
         break
         ;;
   esac
done

if [ $state == "FAILED" ]
then
  false
else
  true
fi

Clean up

To clean up your resources, we recommend using APIs through the following steps:

  1. Delete the EC2 instance:
    1. Find the instance using the following command:
      aws ec2 describe-instances 
    2. Delete the instance using the following command:
      aws ec2 terminate-instances –instance-ids <specific instance id>
  2. Delete the RDS instance:
    1. Find the instance using the following command:
      aws rds describe-db-instances
    2. Delete the instance using the following command:
      aws rds delete-db-instances –db-instance-identifier <speficic rds instance id>
  3. Delete the EMR Serverless application
    1. Find the EMR Serverless application using the following command:
      aws emr-serverless list-applications 
    2. Delete the EMR Serverless application using the following command:
       aws emr-serverless delete-application –application-id <specific application id>

Conclusion

In this post, we discussed how EMR Serverless, as AWS managed serverless big data compute engine, integrates with popular OSS products like TiDB and DolphinScheduler. We discussed how to achieve data synchronization between TiDB and the AWS Cloud, and how to use DolphineScheduler to orchestrate EMR Serverless jobs.

Try out the solution with your own use case, and share your feedback in the comments.


About the Author

Shiyang Wei is Senior Solutions Architect at Amazon Web Services. He is specializing in cloud system architecture and solution design for the financial industry. Particularly, he focused on big data and machine learning applications in finance, as well as the impact of regulatory compliance on cloud architecture design in the financial sector. He has over 10 years of experience in data domain development and architectural design.

Handling billions of invocations – best practices from AWS Lambda

Post Syndicated from Chris McPeek original https://aws.amazon.com/blogs/compute/handling-billions-of-invocations-best-practices-from-aws-lambda/

This post is written by Anton Aleksandrov, Principal Solution Architect, AWS Serverless and Rajesh Kumar Pandey, Principal Engineer, AWS Lambda

AWS Lambda is a highly scalable and resilient serverless compute service. With over 1.5 million monthly active customers and tens of trillions of invocations processed, scalability and reliability are two of the most important service tenets. This post provides recommendations and insights for implementing highly distributed applications based on the Lambda service team’s experience building its robust asynchronous event processing system. It dives into challenges you might face, solution techniques, and best practices for handling noisy neighbors.

Overview

Developers building serverless applications create Lambda functions to run their code in the cloud. After uploading the code, the functions are invoked using synchronous or asynchronous mode.

Synchronous invocations are commonly used for interactive applications that expect immediate responses, such as web APIs. The Lambda service receives the invocation request, invokes the function handler, waits for the handler response, and returns it in response to the original request. With synchronous invocations, the client waits for the function handler to return, and is responsible for managing timeouts and retries for failed invocations.

Synchronous invocation sequence diagram.

Figure 1. Synchronous invocation sequence diagram

Asynchronous invocations enable decoupled function executions. Clients submit payloads for processing without expecting immediate responses. This is used for scenarios like asynchronous data processing or order/job submissions. The Lambda service immediately returns a confirmation for accepted invocation and proceeds to manage further handler invocation, timeouts, and retries asynchronously.

Asynchronous invocation sequence diagram.

Figure 2. Asynchronous invocation sequence diagram

Asynchronous invocations under-the-hood

To accommodate asynchronous invocations, the Lambda service places requests into its internal queue and immediately returns HTTP 202 back to the client. After that, a separate internal poller component reads messages from the queue and synchronously invokes the function.

Asynchronous invocations workflow high-level topology.

Figure 3. Asynchronous invocations workflow high-level topology

The same system also takes care of timeouts and retries in case of handler exceptions. When code execution completes, the system sends handler response to either onSuccess or onFailure destination, if configured.

Asynchronous invocations workflow detailed sequence diagram.

Figure 4. Asynchronous invocations workflow detailed sequence diagram

Scaling highly distributed systems for billions of asynchronous requests presents unique challenges, such as managing noisy neighbors and potential traffic spikes to prevent system overload. Solutions vary by scale – what works for millions of requests may not suite billions. As workload size increases, solutions typically become more complex and costly, so right-sizing the approach is critical and should evolve with changing needs.

Simple queueing

A simple implementation of an asynchronous architecture can start with a single shared queue. This is a common approach for many asynchronous systems, particularly in early stages. It is effective when you’re not concerned about tenant isolation and when capacity planning indicates that a single queue can handle estimated incoming traffic efficiently.

Asynchronous workflow with a single queue.

Figure 5. Asynchronous workflow with a single queue

Even with this simple setup, it is critical to instrument your solution for observability to detect potential issues as soon as possible. You should monitor key metrics like queue backlog size, processing time, and errors, to indicate insufficient processing capacity early. Periods of unexpected traffic spikes and degraded performance may be a signal you have noisy neighbors impacting other tenants.Top of FormBottom of Form

To address this, you can scale your solution horizontally. You can implement random request placement across multiple queues to spread the load. Using a serverless service like Amazon SQS allows you to easily add and remove queues on-demand. One notable benefit of this approach is its simplicity – you do not need to introduce any complex routing mechanisms; requests are evenly spread across the queues. The downside is that you still do not have tenant boundaries. As your system grows, high-volume tenants and noisy neighbors can potentially affect all queues, thus impacting all tenants.

Asynchronous workflow with multiple queues and random request placement.

Figure 6. Asynchronous workflow with multiple queues and random request placement

Intelligent partitioning with consistent hashing

In order to further reduce potential impact, you can partition your tenants using sticky tenant-to-partition assignment with a hashing technique such as consistent hashing. This method uses a hash function to assign each tenant to a queue on a consistent hash ring.

Asynchronous workflow with multiple queues and consistent hashing placement.

Figure 7. Asynchronous workflow with multiple queues and consistent hashing placement

This technique ensures individual tenants stay in their queue partitions without the risk of disturbing the whole system. It helps to solve the problem where a few noisy neighbors have the potential to overflow all queues and as such impact all other tenants.

The consistent hashing approach proved to be efficient and enabled Lambda to offer robust asynchronous invocation performance to customers. As the volume of traffic and number of customers continued to grow, the Lambda service team came up with an innovative shuffle-sharding technique to further optimize the experience, and proactively eliminate any potential noisy-neighbor issues.

Shuffle-sharding

Drawing inspiration from the “The Power of Two Random Choices” paper, the Lambda team explored the shuffle-sharding technique for its asynchronous invocations processing. Using this technique, you shuffle-shard tenants into several randomly assigned queues. Upon receiving an asynchronous invocation, you place the message in the queue with the smallest backlog to optimize load distribution. This approach helps to minimize the likelihood of assigning tenants to a busy queue.

Asynchronous workflow with multiple queues and shuffle-sharding placement.

Figure 8. Asynchronous workflow with multiple queues and shuffle-sharding placement

To illustrate the benefit of this approach, consider a scenario where you’re using а 100 queues. The following formula helps to calculate the number of unique queue shards (combinations), where n is the total number of queues and r is the shard size (the number of queues you’re assigning per tenant).

Formula to show queue shard calculation.

With n=100, r=2 (each tenant is assigned randomly to 2 out of 100 queues), you get 4,950 unique combinations (shards). The probability of two tenants assigned to exactly the same shard is 0.02%. In case of r=3, the number of combinations spikes to 161,700. The probability of two tenants assigned to exactly the same shard drops to 0.0006%.

The shuffle-sharding technique proved remarkably effective. By distributing tenants across shards, the approach ensures that only a very small subset of tenants could be affected by a noisy neighbor. The potential impact is also minimized since each affected tenant maintains access to unaffected queues. As your workloads grow, increasing the number of queues enhances resilience and further reduces the probability of multiple tenants being assigned to the same shard. This significantly lowers the risk of a single point of failure, making shuffle sharding a robust strategy for workload isolation and fault tolerance.

Proactive detection, automated isolation, sidelining

Many distributed services will have a cohort of tenants with legitimate spiky asynchronous invocation traffic. This can be driven by seasonal factors, such as holiday shopping, or periodical batch processing. Recognizing these as real business needs, not malicious actions, you want to improve service quality for these tenants as well, while maintaining the overall system stability. For example, you can further improve solution performance by continuously monitoring queue depth to detect traffic spikes and route traffic to dynamically allocated dedicated queues. When you use Lambda asynchronous invocations, this internal complexity is managed for you by the service, ensuring seamless consumption experience.

Tenant D is automatically reallocated to a dedicated queue.

Figure 9. Tenant D is automatically reallocated to a dedicated queue

Resilience and failure handling

“Everything fails, all the time” is a famous quote from Amazon’s Chief Technology Officer Werner Vogels. Lambda’s distributed and resilient architecture is built to withstand potential outages of its dependencies and internal components to limit the fallout for customers. Specifically for asynchronous invocation processing, the frontend service builds a processing backlog during an outage, allowing the backend to gradually recover without losing any in-flight messages.

Lambda service maintains resilience during component outage.

Figure 10. Lambda service maintains resilience during component outage

Upon recovery, the service gradually ramps up the traffic to process the accumulated backlog. During this time, automated mechanisms are in place to coordinate between system components, preventing inadvertently DDoSing itself.

To further improve the recovery ramp-up process and provide a smooth restoration of normal operations, the Lambda service uses load-shedding technique to ensure fair resource allocation during recovery. While trying to drain the backlog as fast as possible, the service ensures that no single customer ends up consuming an outsized share of the available resources. Adopting such techniques can help you to improve your mean-time-to-recovery (MTTR).

Observability for asynchronous invocations processing

When using the Lambda service for asynchronous processing, you want to monitor your invocations for situational awareness and potential slowdowns. Use metrics such as AsyncEventReceived, AsyncEventAge, and AsyncEventDropped to get insights about internal processing.

AsyncEventReceived tracks the number of async invocations the Lambda service was able to successfully queue for processing. A drop in this metric indicates that invocations are not being delivered to the Lambda service and you should check your invocation source. Potential issues include misconfigurations, invalid access permissions, or throttling. Check your invocation source configuration, logs, and the function resource policy for further analysis.

AsyncEventAge tracks how long has a message spent in the internal queue before being processed by a function. This metric increases when async invocations processing is delayed due to insufficient concurrency, execution failures, or throttles. Increase your function concurrency to process more asynchronous invocations at a time and optimize function performance for better throughput, i.e. by increasing memory allocation to add more vCPU capacity. Experiment with adjusting batch size to enable functions to process more messages at a time. Use invocation logs to identify whether the problem is caused by function code throwing exceptions. Check Throttles and Errors metrics for further analysis.

AsyncEventDropped tracks the number of messages in the internal queue that were dropped because Lambda could not process them. This can be due to throttling, exceeding number of retries, exceeding maximum message age, or function code throwing an exception. Configure OnFailure destination or a dead-letter queue to avoid losing data and save dropped messages for re-processing. Use function logs and metrics described above to investigate whether you can address the issue by increasing function concurrency or allocating more memory.

By monitoring these metrics and addressing the underlying issues, you can ensure that your Lambda functions run smoothly, with minimal event processing delays and failures. You can also enable AWS X-Ray tracing to capture Lambda service traces. The AWS::Lambda trace segment captures the breakdown of the time that Lambda service spends routing requests to internal queues, the time a message spends in a queue, and the time before a function is invoked. This is a powerful tool to get insights into Lambda’s internal processing.

Conclusion

AWS Lambda processes tens of trillions of monthly invocations across more than 1.5 million active customers, demonstrating its exceptional scalability and resilience. Gaining an understanding of the underlying mechanisms of AWS services like Lambda enables you to proactively address potential challenges in your own applications. By learning how these services handle traffic, manage resources, and recover from failures, you can incorporate similar capabilities into your own solutions. For instance, leveraging Lambda’s asynchronous invocation metrics allows you to optimize workflow performance. This knowledge empowers you to implement strategies such as automated scaling, proactive monitoring, and graceful recovery during outages.

See below resources to learn about using queues and shuffle sharding at scale at Amazon

To learn more about Serverless architectures and asynchronous Lambda invocation patterns, see Serverless Land.

Build an enterprise API management solution using Amazon API Gateway

Post Syndicated from Roger Zhang original https://aws.amazon.com/blogs/architecture/build-an-enterprise-api-management-solution-using-amazon-api-gateway/

Enterprises face many challenges when they build and manage application programming interfaces (APIs). These challenges include security controls, version management, traffic control, and usage analytics. As digital businesses expand, a mature API management (APIM) solution is crucial for ensuring scalability, security, and operational efficiency.

This blog post shows how you can use Amazon API Gateway—along with AWS Lambda, Amazon DynamoDB, and other AWS services—to create a comprehensive and customizable APIM solution. This solution addresses the complex requirements of large enterprises managing APIs at scale.

Core features of APIM

API Management (APIM) centralizes the management and publishing of APIs for the entire enterprise, acting as a hub between clients, applications, and administrators on one side, and internal services, external systems, and large language models (LLMs) on the other, as shown in the following figure.

APIM capabilities

The key features of APIM include:

  • Security and governance
    • Authentication, authorization, rate limiting, and security policy enforcement.
    • Helps ensure APIs meet organizational or industry standards.
  • Monitoring and logging
    • Provides monitoring, alarms, and logging to track API performance and troubleshoot issues quickly.
  • Customization and transformation
    • Offers protocol and field transformations, plus orchestration and aggregation.
    • Makes it easier to integrate with different systems and meet various client needs.
  • API lifecycle management
    • Publishing, rollback, version control, and documentation.
    • Streamlines development and maintenance throughout the API lifecycle.
  • Developer and business tools
    • Portals for developers, business owners, and administrators to manage documentation, billing, and analytics.
  • Integration with LLMs
    • Specialized adapters, proxy configurations, and switching to integrate AI models seamlessly.
  • Flexible deployment options
    • Canary releases, pipeline automation, and other advanced release strategies.
    • Helps ensure stable, controlled API updates.

Unified management of multiple API gateways

API Gateway enforces resource limits of 300 resources per gateway, with a hard limit of 600. For enterprises that require more resources, managing multiple gateways individually can be time-consuming and error prone. APIM simplifies this by integrating API Gateway, Lambda, and DynamoDB; creating a centralized platform for managing APIs across multiple gateways. This integration streamlines the process, making it easier to scale and maintain APIs.

API lifecycle management

Managing API versions, publishing updates, and maintaining documentation often requires separate tools and manual processes, leading to inefficiencies. APIM centralizes these tasks in one portal, offering version control, publishing workflows, and rollback options. This streamlines the API lifecycle, ensuring consistency and reducing the chances for errors.

Enhanced security

Enterprises often need to implement different authentication strategies for various clients. These configurations typically require custom Lambda logic and database lookups, adding complexity and cost. APIM introduces configurable security policies that allow client-specific authentication without the need for additional custom code, reducing both complexity and operational overhead.

Customization and transformation

Enterprises frequently handle diverse client requests that involve different formats and protocols. Traditional API management approaches might struggle to support such variations. APIM allows for seamless protocol and field transformations, enabling integrations that meet a wide range of client requirements without additional development effort.

Developer portal

Developers need clear documentation, easy testing environments, and efficient API key management to work effectively. Traditional systems often lack these features, slowing down adoption. APIM provides a developer portal that consolidates API documentation, offers sandbox environments for testing, and simplifies API key management, reducing onboarding time and improving the developer experience.

Logging and monitoring

Log management is key to maintaining API performance, diagnosing issues, and gaining insights into usage. APIM uses API Gateway custom access logging, allowing teams to define logs based on business needs; whether creating separate CloudWatch metrics for each API path or exporting data to external platforms like ELK or Grafana.

Architecture overview

The APIM architecture, shown in the following figure, includes a management state (represented by numbers) and a runtime state (represented by letters). Both parts use a serverless paradigm.

APIM Architecture

Management state

The management state includes the following elements:

  1. Administrator portal access: Administrators access the APIM solution through a secured web portal.
  2. API Requests to APIM Lambda: Requests from the administrator’s API go through API Gateway, which then invokes the APIM Lambda function. This function handles logic related to configuration changes and other administrative actions.

In the following example, we show you how the APIM Lambda function dynamically applies different middleware based on the route configuration. This approach allows for flexible handling of authentication, client access restrictions, and request/response transformations. Here’s a quick breakdown of the key elements:

// If the route requires OIDC (OpenID Connect) authentication,
// add the OIDC authentication middleware to the route.
if route.Auth == "OIDC" {
    r.Use(middleware.OidcAuthenticator)
}
// If the route configuration specifies a list of allowed clients
// and the list is not empty, add a middleware to restrict access
// to only the specified clients.
if route.Allow.Clients != nil && len(route.Allow.Clients) != 0 {
    r.Use(middleware.AllowClients(route.Allow.Clients, cfg.Clients))
}
// Remove specific headers injected by the API Gateway
// to reduce exposure of internal details to downstream systems.
r.Use(middleware.RemoveGatewayHeaders)

// Add additional middleware for handling outbound logic.
// This could include retries, logging, or other outbound-specific functionality.
r.Use(outboundMiddlewares)
// Dynamically constructs and applies a chain of middlewares 
// based on the outbound configuration associated with the current request.
func outboundMiddlewares(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        // Retrieve the outbound configuration from the request context.
        outbound, _ := r.Context().Value(selectedOutboundContext).(config.Outbound)

        // Initialize a slice to store the middlewares to be applied.
        middlewares := []func(http.Handler) http.Handler{}

        // Middleware to rewrite the HTTP request based on the outbound configuration.
        middlewares = append(middlewares, middleware.ProxyRequestRewrite(&outbound))

        // Add a middleware for mapping request data if specified in the outbound configuration.
        if len(outbound.Convert.Request) != 0 {
            middlewares = append(middlewares, middleware.RequestDataMapping(outbound.Convert.Request))
        }

        // Middleware to log the outbound response for monitoring or debugging purposes.
        middlewares = append(middlewares, middleware.OutboundResponseLog)

        // Add a middleware for mapping response data if specified in the outbound configuration.
        if len(outbound.Convert.Response) != 0 {
            middlewares = append(middlewares, middleware.ResponseDataMapping(outbound.Convert.Response))
        }

        // Add a middleware for modifying the response if a modification function is defined.
        if outbound.ModifyResponse != "" {
            f, ok := system.MODIFY[outbound.ModifyResponse]
            if ok {
                middlewares = append(middlewares, f())
            }
        }

        // Chain the constructed middlewares together and apply them to the request.
        chain := chi.Chain(middlewares...)
        chain.Handler(next).ServeHTTP(w, r)
    })
}

By using a middleware chain, you can customize how each request and response is processed on a per-route basis. This architecture not only keeps your code organized but also makes the API Gateway-integrated Lambda function far more adaptable to changing requirements. You can add or remove configurations from APIM portal as new use cases emerge—such as data transformations, custom logging, or additional security checks—without rewriting core logic.

  1. Configuration management: Administrators set up server-side and client-side settings, such as API Gateway parameters, authentication requirements, transformations, and more.
  2. Persistence:  DynamoDB stores these configurations, providing persistent data storage and auditing capabilities.
  3. Asynchronous resource provisioning: After administrators save configurations and release them from the APIM portal, APIM creates or updates AWS resources—such as API Gateway, Lambda functions, and AWS Identity and Access Management (IAM). Lambda runs these updates in the background, so administrators can continue working uninterrupted.

Runtime state

The runtime state includes the following elements:

A. Client request: Clients send requests to the APIM endpoint.

B. Routing to the correct gateway: APIM uses the URI prefix in the API mappings associated with custom domain names to route requests to the appropriate API gateway, as shown in the following figure. Each mapping defines a specific API, stage, and an optional path. When a request arrives, APIM checks the path and directs the request to the correct stage and API if it matches. Unmatched requests default to the mapping with no path defined.

C. APIM core processing: A Lambda function (APIM CORE) uses DynamoDB configurations to handle authentication, authorization, protocol conversion, field transformation, and routing.

D. Downstream service call: APIM forwards each request to the configured internal or external endpoint.

E. Logging and monitoring: API Gateway access logs and custom logs track requests in detail.

F. Alarm: Metrics and alarms detect anomalies and notify stakeholders. Use Amazon CloudWatch or self-hosted solutions such as ELK to enable real-time monitoring and alerting.

api-mapping

Conclusion

In this post, we’ve demonstrated how to build an enterprise API management (APIM) solution using Amazon API Gateway, AWS Lambda, Amazon DynamoDB, and other AWS services. We’ve also shown how APIM centralizes critical features—such as version management, security policies, and request/response transformations—to accommodate large-scale enterprise requirements.

You can use the APIM portal to store and manage configurations in DynamoDB, dynamically applying these settings to multiple API gateways without rewriting code. This approach ensures consistent governance across diverse client types and business scenarios, helping to keep APIs both secure and flexible.

Finally, you’ve seen how the APIM architecture unifies the management state and runtime state, streamlines administrative tasks, and provides end-to-end monitoring and alerting. By adopting these best practices, your enterprise can establish a robust, scalable, and secure API management foundation, all within a serverless paradigm.


About the Authors

Introducing an enhanced local IDE experience for AWS Step Functions

Post Syndicated from Chris McPeek original https://aws.amazon.com/blogs/compute/introducing-an-enhanced-local-ide-experience-for-aws-step-functions/

This post written by Ben Freiberg, Senior Solutions Architect.

AWS Step Functions introduces an enhanced local IDE experience to simplify building state machines. Workflow Studio is now available within Visual Studio Code (VS Code) through the AWS Toolkit extension. With this integration, developers can author and edit state machines in their local IDE using the same powerful visual authoring experience found in the AWS Console.

Step Functions is a visual workflow service that helps developers use AWS services to build distributed applications, automate processes, orchestrate microservices, and create data and machine learning (ML) pipelines.

Customers choose Step Functions to build workflows that involve multiple services such as AWS Lambda, AWS Fargate, Amazon Bedrock, and HTTP API integrations. Developers create these workflows as state machines through the AWS Console using Workflow Studio or as code using Amazon States Language (ASL), a JSON-based domain specific language. Developers maintain their workflows definitions alongside the application and Infrastructure as code (IaC) code. Now, builders have even more capabilities to build and test their workflow in VS Code that matches the same experience as in AWS console.

Simplifying local workflow development

The integrated Workflow Studio provides developers with a seamless experience for building Step Functions workflows within their local IDE. You’ll use the same canvas used in the AWS Console to drag and drop states to build your workflows. As you modify the workflow visually, the ASL definition updates automatically, so you can focus on business logic rather than syntax. The Workflow Studio integration offers the same intuitive and visual approach to designing state machines as the AWS Console, without switching context.

Getting started

To use the updated IDE experience, verify that you have the AWS Toolkit with at least version 3.49.0 installed as a VS Code Extension.

AWS Toolkit extension in VS Code which can be updated.

Figure 1: AWS Toolkit update available

After installing the AWS Toolkit extension, you can start building with Workflow Studio by opening a state machine definition. You can use a definition file from your local workspace or use AWS Explorer to download an existing state machine definition from the cloud. VS Code integration supports ASL definitions in JSON and YAML formats. (Note: Files must end in .asl.json, asl.yml or .asl.yaml for Workflow Studio to automatically open the file.) While working with YAML files, Workflow Studio converts the definition to JSON for editing, then converts back to YAML before saving.

A sample state machine in Workflow Studio with Design mode open.

Figure 2: Design mode in Workflow Studio

Workflow Studio in VS Code supports both the Design and Code mode. Design mode provides a graphical interface to build and inspect your workflows. In Code mode, you can use an integrated code editor to view and edit the Amazon States Language (ASL) definition of your workflows. You can always switch back to text-based editing by selecting the Return to Default Editor link at the top right of Workflow Studio, as shown in the following screen.

A sample state machine in Workflow Studio with Code mode open.

Figure 3: Code mode in Workflow Studio

To open Workflow Studio in VS Code manually, you can use the “Open with Workflow Studio” action at the top of a workflow definition file or the icon in the top right of the editor pane. Both options are highlighted in the following screen. Additionally, you can use the file context-menu to open Workflow Studio from the file explorer pane.

A asl file in the default editor showing the different ways to open Workflow Studio.

Figure 4: Integrations of Workflow Studio into the editor

Edits you make in Workflow Studio are automatically synced to the underlying file as unsaved changes. To persist your changes, you must either save the changes from Workflow Studio or the file editor. Similarly, any changes you make to the local file are synced to Workflow Studio on save.

Workflow Studio is aware of Definition Substitutions, so you can even edit workflows which have been integrated with your IaC tooling like AWS CloudFormation or the AWS Cloud Development Kit (CDK). Definition Substitutions is a feature of CloudFormation that lets you add dynamic references in your workflow definition to a value that you provide in your IaC template.

AWSTemplateFormatVersion: "2010-09-09"
Description: "State machine with Definition Substitutions"
Resources:
  MyStateMachine:
    Type: AWS::StepFunctions::StateMachine
    Properties:
      StateMachineName: HelloWorld-StateMachine
      DefinitionS3Location:
        Bucket: amzn-s3-demo-bucket
        Key: state-machine-definition.json
      DefinitionSubstitutions:
        TableName: DemoTable

You can then use the Definition Substitutions in the definition of your state machine.

"Write message to DynamoDB": {
  "Type": "Task",
  "Resource": "arn:aws:states:::dynamodb:putItem",
  "Next": "Remove message from SQS queue",
  "Arguments": {
    "TableName": "${TableName}",
    "Item": {
      ... omitted for brevity ...
     }
  },
  "Output": "{% $states.input %}"
}

The code defines a Step Functions task state that writes a message to DynamoDB using the putItem operation. The ${TableName} substitution syntax allows for a dynamic DynamoDB table name that can be passed as a parameter when the state machine is executed.

Testing and Deployment

Workflow Studio integration supports testing a single state through Step Functions TestState API. With the TestState API, you can test a state in the cloud from your local IDE without creating a state machine or updating an existing state machine. With the power of localized granular testing, you can build and debug changes for individual states without needing to invoking the entire state machine. For example, you can refine the input or output processing, or update the conditional logic in a choice state without ever leaving your IDE.

Testing a state

  1. Open any state machine definition file in Workflow Studio
  2. Select a state from the canvas or the code tab
  3. Open the Inspector panel on the right side if not already openA DynamoDB PutItem opened in the Workflow Studio inspector panel with the Arguments showing a Definition Substitution..
    Figure 5: Arguments of an Individual state
  4. Select Test state button at the top
  5. Select your IAM role and add the input. Make sure that the role has the necessary permissions for using TestState API
  6. If your state contains any Definition Substitutions, you’ll see an additional section where you can replace them with your specific values.
  7. Select Start Test

Modal popup of the TestState with a role selected and showing the entered value of a Definition Substitution.

Figure 6: TestState configuration with a Definition Substitution

After the test succeeds, you can publish your workflow from the IDE using the AWS Toolkit. You can also use IaC tools such as AWS Serverless Application Model, AWS CDK, or CloudFormation to deploy your state machine.

Conclusion

Step Functions is introducing an enhanced local IDE experience to simplify the development of workflows using the VS Code IDE and AWS Toolkit. This streamlines the code-test-deploy-debug cycle and offers developers a seamless integration of Workflow Studio. By combining visual workflow design with the power of a full-featured IDE, developers can now build Step Functions workflows more efficiently.

To get started, install the AWS Toolkit for Visual Studio Code and visit the user guide on Workflow Studio integration. Find hands-on examples, best practices, and useful resources for AWS serverless at Serverless Land.