WhatsApp Case Against NSO Group Progressing

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2025/04/whatsapp-case-against-nso-group-progressing.html

Meta is suing NSO Group, basically claiming that the latter hacks WhatsApp and not just WhatsApp users. We have a procedural ruling:

Under the order, NSO Group is prohibited from presenting evidence about its customers’ identities, implying the targeted WhatsApp users are suspected or actual criminals, or alleging that WhatsApp had insufficient security protections.

[…]

In making her ruling, Northern District of California Judge Phyllis Hamilton said NSO Group undercut its arguments to use evidence about its customers with contradictory statements.

“Defendants cannot claim, on the one hand, that its intent is to help its clients fight terrorism and child exploitation, and on the other hand say that it has nothing to do with what its client does with the technology, other than advice and support,” she wrote. “Additionally, there is no evidence as to the specific kinds of crimes or security threats that its clients actually investigate and none with respect to the attacks at issue.”

I have written about the issues at play in this case.

How Pebble Supports ACME Client Developers

Post Syndicated from Let's Encrypt original https://letsencrypt.org/2025/04/30/pebbleacmeimplementation.html

Together with the IETF community, we created the ACME standard to support completely automated certificate issuance. This open standard is now supported by dozens of clients. On the server side, did you know that we have not one but two open-source ACME server implementations?

The big implementation, which we use ourselves in production, is called Boulder. Boulder handles all of the facets and details needed for a production certificate authority, including policy compliance, database interfaces, challenge verifications, and logging. You can adapt and use Boulder yourself if you need to run a real certificate authority, including an internal, non-publicly-trusted ACME certificate authority within an organization.

The small implementation is called Pebble. It’s meant entirely for testing, not for use as a real certificate authority, and we and ACME client developers use it for various automated and manual testing purposes. For example, Certbot has used Pebble in its development process for years in order to perform a series of basic but realistic checks of the ability to request and obtain certificates from an ACME server.

Pebble is Easy to Use for ACME Client Testing

For any developer or team creating an ACME client application, Pebble solves a range of problems along the lines of “how do I check whether I’ve implemented ACME correctly, so that I could actually get certificates from a CA, without necessarily using a real domain name, and without running into CA rate limits during my routine testing?” Pebble is quick and easy to set up if you need to test an ACME client’s functionality.

It runs in RAM without dependencies or persistence; you won’t need to set up a database or a configuration for it. You can get Pebble running with a single golang command in just a few seconds, and immediately start making local ACME requests. That’s suitable for inclusion in a client’s integration test suite, making much more realistic integration tests possible without needing to worry about real domains, CA rate limits, or network outages.

We see Pebble getting used in the official test suites for ACME clients including getssl, Lego, Certbot, simp_le, and others. In many cases, every change committed to the ACME client’s code base is automatically tested against Pebble.

Pebble is Intentionally Different From Boulder

Pebble is also deliberately different from Boulder in some places in order to provide clients with an opportunity to interoperate with slightly different ACME implementations. The Pebble code explains that

[I]n places where the ACME specification allows customization/CA choice Pebble aims to make choices different from Boulder. For instance, Pebble changes the path structures for its resources and directory endpoints to differ from Boulder. The goal is to emphasize client specification compatibility and to avoid “over-fitting” on Boulder and the Let’s Encrypt production service.

For instance, the Let’s Encrypt service currently offers its newAccount resource at the path /acme/new-acct, whereas Pebble uses a different name /sign-me-up, so clients will be reminded to check the directory rather than assuming a specific path. Other substantive differences include:

  • Pebble rejects 5% of all requests as having an invalid nonce, even if the nonce was otherwise valid, so clients can test how they respond to this error condition
  • Pebble only reuses valid authorizations 50% of the time, so clients can check their ability to perform validations when they might not have expected to
  • Pebble truncates timestamps to a different degree of precision than Boulder
  • Unlike Boulder, Pebble respects the notBefore and notAfter fields of new-order requests
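
A client can cope with the random nonce rejections in the list above by retrying with a fresh nonce. The sketch below is illustrative only: `post_with_nonce_retry`, `send`, and `MAX_RETRIES` are hypothetical names, and `send` stands in for whatever function your client uses to sign and POST a request. The only details taken from the ACME specification (RFC 8555) are the `urn:ietf:params:acme:error:badNonce` error type, the 400 status code, and the `Replay-Nonce` header that carries a fresh nonce on the error response.

```python
BAD_NONCE = "urn:ietf:params:acme:error:badNonce"
MAX_RETRIES = 3  # hypothetical limit; pick what suits your client


def post_with_nonce_retry(send, nonce):
    """Call send(nonce) and retry with a fresh nonce on badNonce errors.

    `send` is a hypothetical callable that signs and POSTs one request and
    returns (status_code, headers, body), where body is the parsed JSON.
    """
    for _ in range(MAX_RETRIES + 1):
        status, headers, body = send(nonce)
        if status == 400 and body.get("type") == BAD_NONCE:
            # The error response carries a fresh nonce to use for the retry.
            nonce = headers["Replay-Nonce"]
            continue
        return status, headers, body
    raise RuntimeError("server kept rejecting nonces")
```

Because Pebble injects these rejections at random, a test suite that runs against it will exercise this retry path without any special setup.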

The ability of ACME clients to work with both versions is a good test of their conformance to the ACME specification, rather than making assumptions about the current behavior of the Let’s Encrypt service in particular. This helps ensure that clients will work properly with other ACME CAs, and also with future versions of Let’s Encrypt’s own API.
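
As a minimal sketch of "check the directory rather than assuming a specific path", a client can look up each endpoint by its directory field name. `newAccount` is the field name defined by the ACME specification. The helper name `endpoint_from_directory` and the two sample directory documents are made up for illustration; they reuse the paths mentioned above, and `ca.example` is a placeholder host.

```python
import json


def endpoint_from_directory(directory_json, resource):
    """Return the URL the CA advertises for `resource` (e.g. "newAccount")."""
    directory = json.loads(directory_json)
    try:
        return directory[resource]
    except KeyError:
        raise ValueError(f"directory does not advertise {resource!r}") from None


# Two made-up directory documents using the paths mentioned above.
pebble_like = '{"newAccount": "https://localhost:14000/sign-me-up"}'
boulder_like = '{"newAccount": "https://ca.example/acme/new-acct"}'

print(endpoint_from_directory(pebble_like, "newAccount"))
print(endpoint_from_directory(boulder_like, "newAccount"))
```

A client written this way works against either server without changes, which is the property that testing against both Pebble and Boulder is meant to verify.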

Pebble is Useful to Both Let’s Encrypt and Client Developers as ACME Evolves

We often test out new ACME features by implementing them, at least in a simplified form, in Pebble before Boulder. This lets us and client developers experiment with support for those features even before they get rolled out in our staging service. We can do this quickly because a Pebble feature implementation doesn’t have to work with a full-scale CA backend.

We continue to encourage ACME client developers to use a copy of Pebble to test their clients’ functionality and ACME interoperability. It’s convenient and it’s likely to increase the correctness and robustness of their client applications.

Try Out Pebble Yourself

Want to try Pebble with your ACME client right now? On a Unix-like system, you can run

git clone https://github.com/letsencrypt/pebble/
cd pebble
go run ./cmd/pebble

Wait a few seconds; now you have a working ACME CA directory available at https://localhost:14000/dir! Your local ACME Server can immediately receive requests and issue certificates, though not publicly-trusted ones, of course. (If you prefer, we also offer other options for installing Pebble, like a Docker image.)

We welcome code contributions to Pebble. For example, ACME client developers may want to add simple versions of an ACME feature that’s not currently tested in Pebble in order to make their test suites more comprehensive. Also, if you notice a possibly unintended divergence between Pebble and Boulder or Pebble and the ACME specification, we’d love for you to let us know.

Defending Against SMS Pumping: New AWS Features to Help Combat Artificially Inflated Traffic

Post Syndicated from Tyler Holmes original https://aws.amazon.com/blogs/messaging-and-targeting/defending-against-sms-pumping-new-aws-features-to-help-combat-artificially-inflated-traffic/

As businesses increasingly rely on SMS messaging to engage customers, AWS End User Messaging is enhancing its SMS Protect feature to include automated message filtering based on the risk of Artificially Inflated Traffic (AIT) in each message request. This new capability helps protect against AIT, also known as SMS pumping. AIT occurs when malicious actors use bots and other measures to generate fake SMS traffic, targeting businesses’ customer communication workflows like one-time password triggers, app downloads, and promotional signups. A recent report co-authored by Enea found that AIT accounted for 19.8 billion to 35.7 billion fraudulent SMS messages in 2023, costing over $1 billion. All workflows with user-generated messages are susceptible to AIT, but insecure public web forms are the most commonly exploited vector. The attacker’s goal is to artificially inflate the number of SMS messages a business sends, resulting in increased costs and a negative impact on the sender’s reputation.

We launched AWS End User Messaging Protect to help our customers combat this growing threat. Initially launched with Country Level Blocking, we’ve now launched two new features, called Monitor and Filter, within AWS End User Messaging’s Protect capabilities. Updating your current security posture for SMS with Monitor and Filter, along with adhering to some other best practice security measures we will cover later, will make it harder for bad actors to target and inflate your SMS costs with bots or other measures.

What is SMS Protect Filter and Monitor?

Filter and Monitor are the next layers of defense in our Protect Feature Set. These features are designed to provide enhanced protection against AIT for countries in which you need to send messages by analyzing and proactively blocking messages that are suspected to be fraudulent. The Filter setting blocks suspected AIT messages. The Monitor mode allows you to evaluate how Filter would affect your sending, without blocking. Monitor could also be used for the events it emits, which could be leveraged in your own custom AIT solutions, but again, does not automatically block messages.

Filter Mode: Automated Blocking of Suspected Artificial Traffic

The Filter mode in Protect takes your AIT mitigation efforts to the next level by automatically blocking messages that exhibit patterns of artificial inflation. When you set your configuration to “Filter” the model will automatically filter any messages being sent that match patterns indicative of AIT.

Filter mode provides automated defense against AIT by analyzing and proactively blocking AIT messages before they leave AWS, reducing your exposure to the financial and reputational impacts of SMS pumping. Turning on Filter at the Account level is the quickest way to protect yourself. The tutorial below will walk you through configuration.

Importantly, when a message is blocked in Filter mode, you do not incur the normal per-message fees. Instead, you pay only the lower cost associated with the Protect Filter capability, which makes this a more cost-effective approach to message security.

Monitor Mode: Gain Visibility and Insights into Potentially Suspicious Traffic

The Monitor mode in Protect works identically to Filter: it uses the same AIT prediction models behind the scenes, but rather than blocking suspected AIT it simply emits recommendations for blocking based on the patterns of data. The recommendations are delivered in a new field attached to the Delivery Receipts (DLRs) that are already streamed via Event Destinations. The recommendations are also logged in summary to CloudWatch and the End User Messaging Console Dashboards. This provides you with valuable data and insights to help inform your AIT mitigation strategy.

Messages sent while in monitor mode will not be blocked and will be charged the country per message cost as well as the Protect Monitor per message cost.
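
The two billing rules described above can be compared with a little arithmetic. The sketch below is illustrative only: all three prices are hypothetical placeholders rather than AWS pricing, and the function names are made up for this example. It covers only the two cases stated in this post, namely a message sent in Monitor mode and a message blocked in Filter mode.

```python
from decimal import Decimal

# Hypothetical prices, for illustration only; check current AWS pricing.
MESSAGE_PRICE = Decimal("0.05")   # per-message cost for the destination country
MONITOR_FEE = Decimal("0.001")    # Protect Monitor per-message cost
FILTER_FEE = Decimal("0.002")     # Protect Filter cost for a blocked message


def monitor_mode_cost(messages):
    """Monitor never blocks: each message pays the country price plus the Monitor fee."""
    return messages * (MESSAGE_PRICE + MONITOR_FEE)


def filter_blocked_cost(blocked):
    """A message blocked by Filter pays only the Filter fee, not the country price."""
    return blocked * FILTER_FEE


print(monitor_mode_cost(1000))    # 1000 suspected-AIT messages sent in Monitor mode
print(filter_blocked_cost(1000))  # the same 1000 messages blocked in Filter mode
```

Under these assumed prices the gap between the two results is the country message cost you avoid by blocking, which grows with the per-message price of the destination country.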

If you want to see what our AIT prediction models recommend without AWS actually blocking messages, you can start in Monitor Mode and change to Filter when you are more comfortable. This allows you to understand how your traffic is analyzed by our AIT prediction models without immediately blocking messages, offering a cautious and informed approach to how Filter will affect your Account.

The Monitor mode reports include detailed analytics on blocked message volumes, geographic distribution, carrier patterns, and more. By analyzing this data, you can identify specific countries, number ranges, or sending behaviors that may be indicative of artificially inflated traffic. This helps you make informed decisions about where to apply more stringent controls.

Importantly, during the monitoring phase, Protect also provides recommendations on whether a particular message would have been blocked and whether certain numbers should be blocked in the future. This gives you the ability to fine-tune your configurations and better understand your traffic before taking enforcement actions.

How do you get started with Protect Monitor and Filter?

Every customer’s needs are unique, but for most customers, we suggest the following steps:

  1. Block all countries to which you do not send messages
    1. Your first line of defense should be to block all traffic to countries where you don’t conduct business or need to send messages. Preventing unwanted messages from being sent is the simplest way to help prevent SMS pumping in the first place. You can use Protect Country Blocking rules to do this, and they can be applied to SMS, MMS, and voice messages sent from your AWS account. For a tutorial on how to do this you can read this earlier blog on Protect.
  2. Create an account level “Filter” configuration
    1. When considering the risk of AIT in a specific country we recommend aligning risk level with the SMS per message cost. The higher the cost the higher the risk.
  3. Make sure that your forms and other vulnerable public facing messaging workflows are protected with best practice security measures that we will review further on in this post.

How to create a protect configuration

You can use a Protect configuration at different levels of granularity:

  1. As the default for your entire AWS account (good for customers with a single use case)
  2. Associated with a specific Configuration Set
  3. Directly specified when calling the SendMediaMessage, SendTextMessage, or SendVoiceMessage APIs
    NOTE: You can only change your MMS country rules list through the AWS End User Messaging SMS and voice v2 API or AWS CLI. The Voice rules can be changed in the console but only after creating an SMS Protect Configuration. Once you have created your first Configuration you can edit it and select the “Voice Rules” tab.

The main benefit of Protect configurations is the ability to control where you send messages and avoid unexpected costs or compliance issues. By creating multiple configurations you can apply specific rules that control how messages are processed and delivered based on your unique business needs. Let’s walk through how to set them up.

Creating a Protect Configuration

  1. To create a Protect Configuration, log into the AWS Management Console and navigate to End User Messaging.
  2. From there, go to the “SMS” section and select “Protect configurations”.
  3. Click the “Create protect configuration” button and give your new configuration a name.
    1. Define the specific allow and block rules for SMS, MMS, and voice messages.
      1. Checking a box next to a country blocks that country and checking the box for a region will block all countries associated with that region.

Once you’ve configured the country rules, you can choose how to associate this Protect configuration:

  1. Set it as the default for your entire AWS account
    1. For many customers this should be the default. Having an account-level configuration as a fallback helps protect you in case you forget to specify a Protect configuration in your request.
    2. Note: To use a protect configuration with other AWS services to send messages, like Amazon SNS, Amazon Connect, or Amazon Pinpoint, you need to set your protect configuration as the account default
  2. Associate it with one or more Configuration Sets
    1. This setting will be applied anytime you send SMS with the config set associated with this Protect Configuration
  3. Leave it unassociated to use it explicitly in API calls
    1. This setting allows you to apply it whenever you want. This will override any previous associations when you reference the “ProtectConfigurationId” in your SendMediaMessage, SendTextMessage, or SendVoiceMessage calls

You can also add optional tags to help organize your resources.

  1. Click “Create protect configuration”
  2. To add Filter or Monitor to the Protect Configuration you just created:
    1. Click into the Protect Configuration you just created.
      1. Note that the “SMS Rules” tab and the “Voice Rules” tab can have different rule settings. Make sure you are editing the right channel.
    2. Select the country or region you wish to set to Filter (recommended) or Monitor.
    3. Confirm the changes; the updated rules appear on the next screen.
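
If you prefer to script the same country rules, one option is to build the per-country mapping in code before sending it to the API. The helper below is a sketch under assumptions: `country_rule_updates` is a hypothetical name, the country choices are arbitrary, and the `{"ProtectStatus": ...}` shape with upper-case values is modeled on the UpdateProtectConfigurationCountryRuleSet operation of the SMS and voice v2 API, so check it against the current API reference before relying on it.

```python
VALID_STATUSES = {"ALLOW", "BLOCK", "MONITOR", "FILTER"}


def country_rule_updates(rules):
    """Turn {"gb": "filter", ...} into a per-country rule mapping.

    The {"ProtectStatus": ...} shape is an assumption modeled on the
    UpdateProtectConfigurationCountryRuleSet API; verify before use.
    """
    updates = {}
    for country, status in rules.items():
        code, value = country.upper(), status.upper()
        if len(code) != 2 or not code.isalpha():
            raise ValueError(f"expected a two-letter country code, got {country!r}")
        if value not in VALID_STATUSES:
            raise ValueError(f"unknown status {status!r}")
        updates[code] = {"ProtectStatus": value}
    return updates


# Arbitrary example choices, not recommendations for any particular country.
print(country_rule_updates({"us": "allow", "ca": "monitor", "gb": "filter", "fr": "block"}))
```

Validating the mapping locally catches a mistyped country code or mode before any request is made.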

Getting more granular with Protect Configurations

In most cases you should be using “Filter” account-wide for the countries where you are concerned about AIT, but if you have different public and/or private messaging workflows you may benefit from a more precise, or granular, approach to your messaging and security practices. If you want more control, the first step is to identify the traffic that is at high risk for SMS pumping. Any public-facing form or workflow that triggers an SMS is a prime target for attackers and should be treated as high risk, such as:

  • One-time passwords or 2FA flows
  • Password/User resets
  • New user registrations
  • Other

Creating a separate Protect Configuration for each of these workflows will help the models in Protect identify anomalies more effectively and tailor their detection to your specific messaging patterns. Service-initiated messages, such as appointment reminders or marketing campaigns that are not user-generated, are at much lower risk of SMS pumping attacks, so you may decide not to include them in the same Protect configuration as a public-facing workflow in order to reduce overall costs.

You can follow the directions above to create a Protect Configuration for each of the workflows you identify. For example, you might configure “OTP New Sign Up” and “Password Reset” with Filter enabled for the countries of concern, while the “Marketing Newsletter” configuration has neither Filter nor Monitor enabled, since that use case does not involve a publicly available form that triggers an SMS. Creating a Protect Configuration for each use case gives you more granular control over your messaging and your messaging budget, and helps ensure the integrity of your communications.

Updating an Existing Protect Configuration

After creating a Protect configuration, you may need to modify the country rules, change the association, or, as we saw above, add Filter or Monitor to certain countries. To do this, navigate back to the “Protect configurations” section and select the one you want to update.

From here you can edit the allow/block country lists, change the association, or even delete the configuration if needed. Just be careful with the account default – you’ll want to be sure you have another default in place before removing the existing one.

Using Protect Configurations

Once you have your Protect configurations set up, you can start putting them to use. If you’ve associated one with a Configuration Set, any messages sent using that Configuration Set will automatically have the Protect rules applied.

Alternatively, you can specify the ProtectConfigurationId parameter when calling the SendMediaMessage, SendTextMessage, or SendVoiceMessage APIs. This allows you to override the account default or Configuration Set association on a per-message basis.
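
The order of precedence described in this post (a per-message `ProtectConfigurationId` first, then a Configuration Set association, then the account default as the fallback) can be sketched as a small resolver. This is a model of the behavior as described here, not AWS code; the function name, parameter names, and the IDs in the usage note are hypothetical.

```python
from typing import Optional


def effective_protect_configuration(
    request_id: Optional[str] = None,
    configuration_set_id: Optional[str] = None,
    account_default_id: Optional[str] = None,
) -> Optional[str]:
    """Pick the Protect configuration that applies to one message."""
    # The most specific setting wins; the account default is the fallback.
    for candidate in (request_id, configuration_set_id, account_default_id):
        if candidate is not None:
            return candidate
    return None  # no Protect configuration applies
```

For instance, with a made-up account default of `"protect-default"`, a message sent with `request_id="protect-otp"` would resolve to `"protect-otp"`, while a message sent with neither a request ID nor a Configuration Set association would fall back to `"protect-default"`.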

Reporting on Protect Configurations

There are two places within the console that you can see metrics for your Protect Configurations. The Monitoring tab on a protect configuration provides an overview of message delivery metrics for the protect configuration. To view all metrics for your account in the AWS End User Messaging SMS console choose Dashboard in the left hand navigation. You can also use CloudWatch to view and create alarms. For more information on CloudWatch metrics, see Dashboard metrics, and Create CloudWatch Alarms.

Monitoring tab on a specific Protect Configuration

End User Messaging provides multiple charts that help you understand how your country rule configurations (Allow, Block, Monitor, or Filter), along with phone number rule overrides, are controlling SMS sending overall and to specific countries.

The included charts are:

  • Number and Percentage of Blocked Messages: Shows the count and percentage of SMS and MMS messages that were blocked during the selected time period. This includes messages blocked by country rules set to ‘block’ or ‘filter’ mode, as well as messages blocked by phone number override rules.
  • Number of Blocked Messages by Country: Shows the count of SMS and MMS messages that were blocked during the selected time period, broken down by destination country.
  • Number and Percentage of Messages Recommended to Block: Shows the count and percentage of SMS and MMS messages that were identified as risky by the AIT risk prediction model. This includes messages in both ‘monitor’ and ‘filter’ modes. In monitor mode, these messages are delivered but flagged; in filter mode, these messages are blocked.
  • Number of Messages Recommended to Block by Country: Shows the count of SMS and MMS messages identified as risky by the AIT prediction model, broken down by destination country.
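
To make the relationship between the "blocked" and "recommended to block" charts concrete, the sketch below tallies both from a list of messages. The record shape is entirely hypothetical (it is not the Delivery Receipt format), the function name is made up, and phone number override rules are left out for brevity.

```python
def summarize(messages):
    """Count blocked and recommended-to-block messages.

    Each message is a hypothetical dict: {"rule": "allow" | "block" |
    "monitor" | "filter", "risky": bool}, where "risky" is the model's verdict.
    """
    total = len(messages)
    # Recommendations are counted for messages under a monitor or filter rule.
    recommended = sum(
        1 for m in messages if m["rule"] in ("monitor", "filter") and m["risky"]
    )
    # Blocked: every message under a block rule, plus risky ones under filter.
    blocked = sum(
        1 for m in messages
        if m["rule"] == "block" or (m["rule"] == "filter" and m["risky"])
    )

    def pct(n):
        return 100.0 * n / total if total else 0.0

    return {
        "blocked": blocked,
        "blocked_pct": pct(blocked),
        "recommended": recommended,
        "recommended_pct": pct(recommended),
    }
```

A risky message under a monitor rule adds to the recommended count only, while a risky message under a filter rule adds to both counts.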

Implementing a Layered Approach to SMS Security

While Filter and Monitor are new tools in the fight against AIT, they should be implemented as part of a broader, layered security strategy for your SMS messaging infrastructure. Here are some best practices to consider:

Identify and Compartmentalize Your Traffic

You can create multiple Protect Configurations based on different use cases, such as one-time passwords, marketing campaigns, and appointment reminders. This granular approach allows Protect’s prediction models to better understand your expected traffic patterns and identify anomalies more accurately. Once you have identified your traffic types you can assign different configurations to them. For example, you might leave a marketing configuration unfiltered and unmonitored because its messages are not user generated, while setting an OTP configuration that is fed by a publicly available form to Filter. In this way you save money by protecting only the messages that are more likely to be susceptible to AIT. These configurations may block the same countries but differ in how they identify and block potentially fraudulent traffic.

Leverage Geographic Controls

Always start by blocking countries where you have no business presence, then allow-list the regions where you actively engage customers and have not seen AIT issues. For countries where you suspect potential abuse, utilize the Monitor mode to gather data before deciding on a blocking strategy.

Allow-list Legitimate phone numbers in countries you are blocking

To avoid impacting your critical messaging workflows, implement phone number rule overrides for specific countries where you are blocking traffic. As an example, if you have engineers in Colombia to whom you need to send SMS, but you have no other legitimate reason to send to Colombian handsets, you can block Colombia but allow-list those engineers’ phone numbers. You can also give your front-end support teams the ability to add numbers to allow-lists in case a number is mistakenly blocked by Filter recommendations.

To create a phone number override rule using the console, follow these steps:

  1. Open the AWS End User Messaging SMS console at https://console.aws.amazon.com/sms-voice/.
  2. In the navigation pane, under Protect, choose the Protect configuration to which you want to add allow-listed numbers.
  3. Choose the Rule overrides tab and in the Rules override section choose Add override.
  4. In the Rule override details section, enter the following:
    1. For Destination phone number, enter the phone number to create the rule for. The phone number must start with a ‘+’ and can’t contain any spaces, hyphens, or parentheses. For example, +1 (206) 555-0142 is not in the correct format, but +12065550142 is.
    2. For Override type, choose either Always allow or Always block.
    3. For Expiration date – optional, choose a date for the rule to expire, or leave it blank for the rule to never expire.
  5. Choose Add rule override.
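
If your support tooling lets staff submit numbers for allow-listing, it can help to check the format described above before creating the rule. The sketch below is one possible check: `is_valid_override_number` is a hypothetical helper name, and the 15-digit upper bound comes from the E.164 numbering plan rather than from this post.

```python
import re

# "+" followed by digits only: no spaces, hyphens, or parentheses.
# The 15-digit upper bound follows E.164 and is an added assumption.
_PHONE = re.compile(r"\+[1-9]\d{1,14}")


def is_valid_override_number(number: str) -> bool:
    """Return True if `number` looks like +12065550142."""
    return _PHONE.fullmatch(number) is not None
```

With this check, `+12065550142` is accepted, while `+1 (206) 555-0142` is rejected because of the space, parentheses, and hyphen.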

Integrate with Complementary Security Services

Enhance your SMS security posture by integrating Protect with other AWS services, such as AWS Web Application Firewall (WAF) for web-based attack protection and Amazon Cognito for robust user authentication. See this post on Cognito Security for more detailed information on how to add self-service sign-up, sign-in, and control access features to your web and mobile applications while benefitting from SMS authentication and fraud protection with End User Messaging Protect Block, Monitor, and Filter.

WAF has out of the box support for complementary security protections such as CAPTCHA, IP blocking, and JA3 fingerprint matching which are all best practice features to help protect your public forms that may be at risk for SMS pumping.

Review and Iterate

Regularly review your Protect configurations, analyze false positive rates, and update your allow-lists and rules as your messaging patterns evolve. If you are satisfied with your blocking, leave it alone. If you want to get more precise and remove false positives, look for which protect configurations have identified suspected AIT, and try to make them more granular. For example, if you have a sign-up form that is currently being triggered from two separate web pages, you could have a config set for each of those pages and trigger a different config set with Filter mode activated for each. Maintaining an agile, data-driven approach is key to ensuring optimal balance between security and service availability for your legitimate customers.

Conclusion

Take a proactive, multilayered approach to combating the growing threat of SMS fraud by leveraging the new Filter and Monitor capabilities within AWS End User Messaging Protect. These features empower you to gain visibility into potentially malicious traffic, automate the blocking of suspected AIT, and protect your messaging infrastructure while preserving the seamless experience your customers expect.

To get started with Protect and explore these new features, visit the AWS End User Messaging documentation or reach out to your AWS account team. We’re here to help you strengthen the security and integrity of your SMS communications.

Optimizing cold start performance of AWS Lambda using advanced priming strategies with SnapStart

Post Syndicated from Shan Kandaswamy original https://aws.amazon.com/blogs/compute/optimizing-cold-start-performance-of-aws-lambda-using-advanced-priming-strategies-with-snapstart/

Introduced at re:Invent 2022, SnapStart is a performance optimization that makes it easier to build highly responsive and scalable applications using AWS Lambda. The largest contributor to startup latency (often referred to as cold-start time) is the time spent initializing a function. This includes loading the function’s code and initializing dependencies. For latency-sensitive workloads such as APIs and real-time data processing applications, high startup latency can result in a suboptimal end user experience. Lambda SnapStart can reduce startup duration from several seconds to as low as sub-second, with minimal or no code changes. This post discusses ‘Priming’, a technique to further optimize startup times for AWS Lambda functions built using Java and Spring Boot.

Spring Boot applications typically experience high cold start latency during JVM and framework initialization, where significant time is spent loading classes and performing Just-In-Time (JIT) compilation of Java bytecode. This blog post uses a Spring Boot application as an example that retrieves 10 records from a ‘UnicornEmployee’ table in an Amazon RDS for PostgreSQL database, where each employee record includes employee name, location, and hire date.

The sample application uses Amazon API Gateway which triggers an AWS Lambda function that connects to the database through RDS Proxy to return the employee data. While this sample application uses dummy employee data for demonstration, the patterns and optimization techniques discussed in this post are applicable to real-world scenarios with similar data access patterns. Sample code for this implementation can be found in our GitHub repository at lambda-priming-crac-java-cdk.

Background: How SnapStart works

The post assumes familiarity with SnapStart and provides a short background. For additional details, refer to the SnapStart documentation.

To quickly recap, the INIT phase for a Lambda function involves downloading the function’s code, starting the runtime and any external dependencies, and running the function’s initialization code. For functions that don’t use SnapStart, this phase occurs each time your application scales up to create a new execution environment. When SnapStart is activated, the INIT phase happens when you publish a function version.

The following image shows a comparison of a Lambda request lifecycle with and without SnapStart.

Figure 1 – comparison of a Lambda request lifecycle with and without SnapStart

At the end of the INIT phase, Lambda executes your before-checkpoint runtime hooks. Lambda then snapshots the memory and disk state of the initialized execution environment, persists the encrypted snapshot, and caches it for low-latency access. When the function is subsequently invoked, new execution environments are resumed from the cached snapshot (during the RESTORE phase), speeding up function startup.

Figure 2 – new execution environments are resumed from the cached snapshot.

You can validate this speedup by comparing the RESTORE duration with the INIT duration recorded before SnapStart in your Lambda function’s Amazon CloudWatch Logs. As demonstrated in the following table, enabling SnapStart reduces the startup latency of our sample Spring Boot application by 4.3x from 6.1s to 1.4s. The 6.1s cold start latency for ON_DEMAND is primarily due to the combination of (1) initializing the JVM and Spring Boot framework, (2) JIT compilation of lazy loaded application code during initial invocation and (3) the time needed to establish a database connection with RDS through Amazon RDS Proxy. By enabling SnapStart, Lambda initializes the JVM and Spring Boot prior to the function invocation – resulting in the significantly reduced latency of 1.4s.

Method | Cold Start Invocations | p50 | p90 | p99 | p99.9
PrimingLogGroup-1_ON_DEMAND | 128 | 5047.94 ms | 5386.78 ms | 6158.80 ms | 6195.84 ms
PrimingLogGroup-2_SnapStart_NO_PRIMING | 111 | 1177.87 ms | 1288.73 ms | 1419.94 ms | 1425.63 ms

You can reduce cold starts even further for your latency-sensitive Spring Boot applications by using priming techniques on Lambda functions. Let’s explore how to implement priming techniques.

Priming explained

Priming is the process of preloading dependencies and initializing resources during the INIT phase, rather than during the INVOKE phase, to further optimize startup performance with SnapStart. This is required because Java frameworks that use dependency injection load classes into memory only when these classes are explicitly invoked, which typically happens during Lambda’s INVOKE phase. You can proactively load classes using Java runtime hooks that are part of the open source CRaC (Coordinated Restore at Checkpoint) project. This post demonstrates how to use one of these hooks, beforeCheckpoint(), to prime SnapStart-enabled Java functions in two ways:

  1. Invoke Priming: This approach involves directly invoking application endpoints or methods in your pre-snapshotting hook so that they are JIT compiled during the INIT phase and included in the snapshot. This can include operations such as invoking API Gateway endpoints or fetching data from an S3 bucket or RDS database to proactively execute the code paths, ensuring that the underlying classes are included in the snapshot.
  2. Class Priming: This approach involves proactive initialization of classes during the INIT phase, ensuring that they are included in the function’s snapshot without risking unwanted changes to application state or data. This can be achieved by leveraging Java’s forName() method, which loads, links, and initializes the specified class. Initialization refers to the JVM process of loading the class definition into memory, verifying the bytecode, preparing static fields with default values, and executing static initializers. This is different from instantiation, which creates objects of the class using constructors. To generate a list of the classes required for pre-loading, you can use the following VM option, writing the list to a file called classes-loaded.txt:
    -Xlog:class+load=info:classes-loaded.txt
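
The difference between initialization and instantiation described above can be observed with plain JDK code. The following is a minimal sketch rather than code from the sample project: ClassInitDemo and its nested Unicorn class are hypothetical stand-ins for an application class, and the event list exists only to make the behavior visible.

```java
import java.util.ArrayList;
import java.util.List;

final class ClassInitDemo {
    // Records lifecycle events so the effect of Class.forName() is observable.
    static final List<String> EVENTS = new ArrayList<>();

    // Hypothetical stand-in for an application class that would be primed.
    static final class Unicorn {
        static {
            EVENTS.add("static-init"); // runs once, when the class is initialized
        }

        Unicorn() {
            EVENTS.add("constructor"); // runs only when an object is created
        }
    }

    // Loads, links, and initializes Unicorn without creating an instance.
    static List<String> prime() {
        String name = ClassInitDemo.class.getName() + "$Unicorn";
        try {
            // initialize = true asks the JVM to run static initializers
            Class.forName(name, true, ClassInitDemo.class.getClassLoader());
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Class to prime not found: " + name, e);
        }
        return List.copyOf(EVENTS);
    }

    public static void main(String[] args) {
        System.out.println(prime()); // prints [static-init]
    }
}
```

Only the static initializer runs and the constructor never does. This is why class priming is safe only when static initializers do not modify application state.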

While invoke priming can offer better performance, it requires additional effort to ensure that the actions performed are idempotent and do not have unintended side effects, for instance processing financial transactions in a banking application. For this reason, invoke priming should only be used when code executed during priming is either idempotent or does not modify state. For scenarios where this is not possible, class priming provides a safer alternative by only initializing classes without executing their methods. Note that this assumes your application does not execute state-modifying code during class initialization.

With this context, let’s look at how to implement Invoke and Class priming for a Spring Boot sample application.

Example priming Implementation using CRaC runtime hooks before taking a Lambda snapshot

This post demonstrates both Invoke priming and Class priming using the sample Spring Boot application. The choice between the two approaches depends on the specific requirements and complexities of your application.

Step 1: Set up your Spring Boot application using the aws-serverless-springboot3-archetype as explained in our Quick Start Spring Boot3 guide and add database connectivity code, or simply clone the sample project from the GitHub repository.

  1. Create a Spring Boot Application.
    // src/main/java/software/amazon/awscdk/examples/unicorn/UnicornApplication.java
    package software.amazon.awscdk.examples.unicorn;
    …
    @Import({ UnicornConfig.class })
    @SpringBootApplication
    public class UnicornApplication {
    
        private static final Logger log = LoggerFactory.getLogger(UnicornApplication.class);
    
        public static void main(String... arguments) {
            SpringApplication.run(UnicornApplication.class, arguments);
        }
    
    }

  2. Add all the necessary Maven dependencies for Spring Boot, AWS Lambda, and the database connection to your pom.xml file. The following dependency contains the classes required to use the CRaC runtime hooks.
    ...
            <dependency>
                <groupId>org.crac</groupId>
                <artifactId>crac</artifactId>
            </dependency>
    ...

  3. Configure Database Connection – Set up the database connection details in application.properties.
    spring.datasource.password=${SPRING_DATASOURCE_PASSWORD} 
    spring.datasource.url=${SPRING_DATASOURCE_URL} 
    spring.datasource.username=postgres 
    spring.datasource.hikari.maximumPoolSize=1 

Step 2: Implement Lambda Function Handler with CRaC runtime hooks and Invoke Priming Approach:

Create the Lambda function handler and integrate the CRaC runtime hooks so that beforeCheckpoint() runs before Lambda takes the snapshot and afterRestore() runs after the snapshot is restored.

  1. Implement the RequestHandler interface in the Lambda function handler class (RequestHandler<APIGatewayV2HTTPEvent, APIGatewayV2HTTPResponse> in the listing below).
  2. Implement the CRaC Resource interface and its two methods, beforeCheckpoint() and afterRestore(), which define the actions performed before Lambda creates the snapshot and after the snapshot is restored.
  3. Add invoke priming by creating a request event for the endpoint you want to prime (such as /unicorn) and calling handleRequest(event, null) from beforeCheckpoint().

This ensures that the code paths associated with the specified endpoint are JIT compiled and optimized for faster execution during the first invocation after the snapshot is restored.

// src/main/java/software/amazon/awscdk/examples/unicorn/handler/InvokePriming.java
package software.amazon.awscdk.examples.unicorn.handler;

import org.crac.Core;
import org.crac.Resource;
...
public class InvokePriming implements RequestHandler<APIGatewayV2HTTPEvent, APIGatewayV2HTTPResponse>, Resource {
    ...

    @Override
    public APIGatewayV2HTTPResponse handleRequest(APIGatewayV2HTTPEvent event, Context context) {
        var awsLambdaInitializationType = System.getenv("AWS_LAMBDA_INITIALIZATION_TYPE");
        var unicorns = getUnicorns();
        var body = gson.toJson(unicorns);
        return APIGatewayV2HTTPResponse.builder().withStatusCode(200).withBody(body).build();
    }

    @Override
    public void beforeCheckpoint(org.crac.Context<? extends Resource> context)
            throws Exception {
        var event = APIGatewayV2HTTPEvent.builder().build();
        handleRequest(event, null);
    }
    ...
}

Step 3: Implement Class priming Approach:

The class priming approach focuses on pre-loading required classes to achieve optimal performance. To implement class priming, generate the list of classes that are loaded during the application startup and function execution by running the application locally using the following JVM argument: -Xlog:class+load=info:classes-loaded.txt

  1. Ensure that your application classes included in the generated classes-loaded.txt file are not mutating state during static initialization.
    Note: the generated classes-loaded.txt contains class entries in the following format:

    [0.068s][info][class,load] software.amazon.awscdk.examples.unicorn.handler.ClassPriming source: file:/var/task/

  2. Extract only the fully qualified class names from each line and remove the additional logging information. For example:
    software.amazon.awscdk.examples.unicorn.handler.ClassPriming

  3. Use the ClassLoaderUtil.loadClassesFromFile() utility method to extract the generated class entries.
    //src/main/java/software/amazon/awscdk/examples/unicorn/service/ClassLoaderUtil.java
    package software.amazon.awscdk.examples.unicorn;
    ...
    public class ClassLoaderUtil {
    ...
        public static void loadClassesFromFile() {
            log.info("loadClassesFromFile->started");
            Path path = Paths.get("classes-loaded.txt");
    
            try (BufferedReader bufferedReader = Files.newBufferedReader(path)) {
                Stream<String> lines = bufferedReader.lines();
                lines.forEach(line -> {
                    var index1 = line.indexOf("[class,load] ");
                    var index2 = line.indexOf(" source: ");
    
                    if (index1 < 0 || index2 < 0) {
                        return;
                    }
    
                    var className = line.substring(index1 + 13, index2);
                    try {
                        Class.forName(className, true,
                                ClassPriming.class.getClassLoader());
                    } catch (Throwable ignored) {
                    }
                });
    
                log.info("loadClassesFromFile->finished");
            } catch (IOException exception) {
                log.error("Error on newBufferedReader", exception);
            }
        }
    ...
    }

  4. In the beforeCheckpoint() method, read the file (such as /classes-loaded.txt) that lists the classes loaded during the application’s execution.
  5. Use the Class.forName() method to load and initialize each class so that it is included in the snapshot.
    Note: by systematically pre-loading these classes, the Class priming approach simplifies the optimization process and avoids the complexities associated with Invoke priming.

    //src/main/java/software/amazon/awscdk/examples/unicorn/handler/ClassPriming.java
    package software.amazon.awscdk.examples.unicorn.handler;
    
    ...
    import org.crac.Core;
    import org.crac.Resource;
    
    public class ClassPriming implements RequestHandler<APIGatewayV2HTTPEvent, APIGatewayV2HTTPResponse>, Resource {
    
    ...
            ConfigurableApplicationContext configurableApplicationContext =
                    SpringApplication.run(UnicornApplication.class);
    
            this.unicornService = configurableApplicationContext.getBean(UnicornService.class);
            this.gson = configurableApplicationContext.getBean(Gson.class);
    
            Core.getGlobalContext().register(this);
        }
    
        @Override
        public APIGatewayV2HTTPResponse handleRequest(APIGatewayV2HTTPEvent event, Context context) {
            var unicorns = getUnicorns();
            var body = gson.toJson(unicorns);
    
            return APIGatewayV2HTTPResponse.builder().withStatusCode(200).withBody(body).build();
        }
    
        @Override
        public void beforeCheckpoint(org.crac.Context<? extends Resource> context)
                throws Exception {
    
            ClassLoaderUtil.loadClassesFromFile();
    
        }
    ...
    }
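
The extraction in step 2 can also be isolated into a small, testable helper. The following is a minimal sketch, not the sample project's implementation: ClassLoadLogParser is a hypothetical name, and it assumes every relevant line has the "[class,load] <name> source: <location>" shape shown earlier. It uses the prefix length instead of a hard-coded offset and drops duplicate entries.

```java
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

final class ClassLoadLogParser {
    private static final String PREFIX = "[class,load] ";
    private static final String SUFFIX = " source: ";

    // Returns the fully qualified class name from one -Xlog:class+load line, if present.
    static Optional<String> className(String line) {
        int start = line.indexOf(PREFIX);
        int end = line.indexOf(SUFFIX);
        if (start < 0 || end < start + PREFIX.length()) {
            return Optional.empty(); // not a class-load entry in the expected shape
        }
        return Optional.of(line.substring(start + PREFIX.length(), end));
    }

    // Converts raw log lines into a de-duplicated list of class names, in log order.
    static List<String> classNames(List<String> lines) {
        return lines.stream()
                .map(ClassLoadLogParser::className)
                .flatMap(Optional::stream)
                .distinct()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String line = "[0.068s][info][class,load] "
                + "software.amazon.awscdk.examples.unicorn.handler.ClassPriming source: file:/var/task/";
        System.out.println(className(line).orElse("<none>"));
    }
}
```

Returning an Optional keeps malformed lines from reaching Class.forName(), so the priming loop only has to handle classes that genuinely fail to load.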

Step 4: AWS CDK Infrastructure Setup

Before proceeding, review the prerequisites in the project README file.

The CDK stack deploys a serverless application and the infrastructure required for testing different Lambda optimization strategies. It creates a VPC with private subnets, an RDS for PostgreSQL instance with a database proxy, and five Lambda functions: four that implement the different optimization approaches (ON_DEMAND without SnapStart, SnapStart without priming, SnapStart with invoke priming, and SnapStart with class priming) and one that sets up the database. Each Lambda function is integrated with API Gateway for HTTP access, is configured with the Java 21 runtime on ARM64 architecture, and includes a CloudWatch log group for monitoring.

Follow these steps to deploy the infrastructure:

  1. Clone the sample repository:
    git clone https://github.com/aws-samples/lambda-priming-crac-java-cdk.git

  2. Deploy the CDK stack:
    cd lambda-priming-crac-java-cdk/infrastructure
    cdk synth
    cdk deploy --require-approval never --all 2>&1 | tee cdk_output.txt

  3. Save the API Gateway URLs:
    The deployment will output five URLs in this format:

    # ON_DEMAND endpoint (without SnapStart)
    LambdaPrimingCracJavaCdkStack.PrimingJavaRestApi1ONDEMANDEndpoint = https://[id].execute-api.us-east-1.amazonaws.com/prod/
    
    # SnapStart without priming endpoint
    LambdaPrimingCracJavaCdkStack.PrimingJavaRestApi2SnapStartNOPRIMINGEndpoint = https://[id].execute-api.us-east-1.amazonaws.com/prod/
    
    # SnapStart with invoke priming endpoint
    LambdaPrimingCracJavaCdkStack.PrimingJavaRestApi3SnapStartINVOKEPRIMINGEndpoint = https://[id].execute-api.us-east-1.amazonaws.com/prod/
    
    # SnapStart with class priming endpoint
    LambdaPrimingCracJavaCdkStack.PrimingJavaRestApi4SnapStartCLASSPRIMINGEndpoint = https://[id].execute-api.us-east-1.amazonaws.com/prod/
    
    # Database setup endpoint
    LambdaPrimingCracJavaCdkStack.PrimingJavaRestApi5DBLOADEREndpoint = https://[id].execute-api.us-east-1.amazonaws.com/prod/

  4. Extract the URLs into variables for testing:
    ONDEMAND_URL=$(grep -oE 'https://[a-zA-Z0-9.-]+\.execute-api\.[a-zA-Z0-9-]+\.amazonaws\.com/prod/' "cdk_output.txt" | head -n 1)
    NOPRIMING_URL=$(grep -oE 'https://[a-zA-Z0-9.-]+\.execute-api\.[a-zA-Z0-9-]+\.amazonaws\.com/prod/' "cdk_output.txt" | head -n 2 | tail -n 1)
    INVOKEPRIMING_URL=$(grep -oE 'https://[a-zA-Z0-9.-]+\.execute-api\.[a-zA-Z0-9-]+\.amazonaws\.com/prod/' "cdk_output.txt" | head -n 3 | tail -n 1)
    CLASSPRIMING_URL=$(grep -oE 'https://[a-zA-Z0-9.-]+\.execute-api\.[a-zA-Z0-9-]+\.amazonaws\.com/prod/' "cdk_output.txt" | head -n 4 | tail -n 1)
    SETUP_URL=$(grep -oE 'https://[a-zA-Z0-9.-]+\.execute-api\.[a-zA-Z0-9-]+\.amazonaws\.com/prod/' "cdk_output.txt" | head -n 5 | tail -n 1)
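
The shell commands above depend on the order in which the URLs appear in cdk_output.txt. If you prefer to look up each endpoint by its output name, the following sketch shows one way to do it in Java. It is an illustration only: CdkOutputParser is a hypothetical helper, and it assumes each output line has the "Name = https://..." shape shown in step 3.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CdkOutputParser {
    // Matches lines such as "Stack.OutputName = https://host/prod/".
    private static final Pattern OUTPUT_LINE =
            Pattern.compile("^\\s*(\\S+)\\s*=\\s*(https://\\S+)\\s*$", Pattern.MULTILINE);

    // Returns output name -> URL, preserving the order found in the text.
    static Map<String, String> endpoints(String cdkOutput) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher matcher = OUTPUT_LINE.matcher(cdkOutput);
        while (matcher.find()) {
            result.put(matcher.group(1), matcher.group(2));
        }
        return result;
    }

    // Finds the URL whose output name contains the marker, for example "CLASSPRIMING".
    static String urlFor(Map<String, String> endpoints, String marker) {
        return endpoints.entrySet().stream()
                .filter(entry -> entry.getKey().contains(marker))
                .map(Map.Entry::getValue)
                .findFirst()
                .orElseThrow(() -> new IllegalArgumentException(
                        "No endpoint output containing " + marker));
    }
}
```

For example, urlFor(endpoints(text), "DBLOADER") would return the database setup URL regardless of where it appears in the file.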

Step 5: Load the database and run performance tests using Artillery:

  1. Initialize the database with sample data.
    curl -X GET "$SETUP_URL"
    
    #Expected output: {"message":"Database schema initialized and data loaded"}

  2. Run performance tests for all endpoints.
    artillery run -t "$ONDEMAND_URL" -v '{ "url": "/unicorn" }' ./loadtest.yaml && \
    artillery run -t "$NOPRIMING_URL" -v '{ "url": "/unicorn" }' ./loadtest.yaml && \
    artillery run -t "$INVOKEPRIMING_URL" -v '{ "url": "/unicorn" }' ./loadtest.yaml && \
    artillery run -t "$CLASSPRIMING_URL" -v '{ "url": "/unicorn" }' ./loadtest.yaml

Step 6: Compare the load test results for On-demand (non-SnapStart), SnapStart, Invoke priming, and Class priming

The performance test results in the table below are sorted from slowest to fastest startup latency. The function without SnapStart performs the slowest due to JVM initialization, class loading and JIT compilation that occurs when the function is invoked. Notice a 4.3x improvement with SnapStart, which resumes invocations from a pre-initialized snapshot thereby avoiding JVM initialization and initial JIT compilation. SnapStart with class priming achieves a 1.4x speed-up over SnapStart, by proactively loading/initializing classes during INIT so that they are included in your function’s snapshot. Finally, SnapStart with invoke priming achieves the fastest performance – with a 781.68ms p99.9 cold-start latency that is 1.8x faster than SnapStart. This is because in addition to initializing classes, it also executes methods on the instances of those classes, resulting in even more components being included in the function’s snapshot.

Note that with invoke priming, any application code you execute must either be idempotent or modify stub data only. For instance, consider application code that triggers a financial transaction. If this code is executed during invoke priming with real user data, it may drive unintended effects with potentially serious consequences. Class priming avoids this, since application classes are initialized rather than being instantiated and their methods executed. This assumes that application code does not execute state modifying logic during class initialization. We recommend that you keep these considerations in mind when using invoke and/or class priming, and choose the appropriate approach for your use case.

Method | Cold Start Invocations | p50 | p90 | p99 | p99.9
PrimingLogGroup-1_ON_DEMAND | 128 | 5047.94 ms | 5386.78 ms | 6158.80 ms | 6195.84 ms
PrimingLogGroup-2_SnapStart_NO_PRIMING | 111 | 1177.87 ms | 1288.73 ms | 1419.94 ms | 1425.63 ms
PrimingLogGroup-4_SnapStart_CLASS_PRIMING | 82 | 857.81 ms | 997.49 ms | 1085.94 ms | 1085.94 ms
PrimingLogGroup-3_SnapStart_INVOKE_PRIMING | 66 | 608.42 ms | 688.88 ms | 781.68 ms | 781.68 ms
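
The speed-up factors quoted above are ratios of a baseline latency to an optimized latency. The following sketch recomputes them from the table values. SpeedupCalc is a hypothetical helper, and rounding to one decimal place is an assumption about how the figures were presented. Note that the class priming ratio rounds to 1.4x at p50 and to 1.3x at p99.9.

```java
final class SpeedupCalc {
    // Ratio of baseline to optimized latency; values above 1 mean the optimized path is faster.
    static double speedup(double baselineMs, double optimizedMs) {
        if (optimizedMs <= 0) {
            throw new IllegalArgumentException("optimizedMs must be positive");
        }
        return baselineMs / optimizedMs;
    }

    // Rounds to one decimal place for reporting.
    static double roundTo1(double value) {
        return Math.round(value * 10.0) / 10.0;
    }

    public static void main(String[] args) {
        // p99.9 values from the table above, in milliseconds
        double onDemand = 6195.84;
        double snapStart = 1425.63;
        double classPriming = 1085.94;
        double invokePriming = 781.68;

        System.out.println(roundTo1(speedup(onDemand, snapStart)));      // 4.3
        System.out.println(roundTo1(speedup(snapStart, classPriming)));  // 1.3
        System.out.println(roundTo1(speedup(snapStart, invokePriming))); // 1.8
        // p50 values for SnapStart without priming and with class priming
        System.out.println(roundTo1(speedup(1177.87, 857.81)));          // 1.4
    }
}
```

Comparing the same percentile on both sides of each ratio keeps the comparison consistent when you rerun the load test with your own data.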

Conclusion

This post showed how AWS Lambda SnapStart, enhanced by CRaC runtime hooks, unlocks granular control over cold-start optimization for Java applications through two distinct priming strategies:

  • Invoke Priming: improves performance by executing critical endpoints during snapshot creation, ideal for idempotent workflows.
  • Class Priming: preloads classes without triggering business logic, mitigating side-effect risks.

To implement these optimization techniques in your applications, evaluate your use case and choose the priming approach that fits it best. Track the latency reductions and resource utilization of your application with Amazon CloudWatch metrics to quantify performance improvements. By integrating these strategies, developers can achieve sub-second cold starts while maintaining the scalability and cost-efficiency of serverless architectures using Java.

To dive deeper, check out the GitHub repository with the full example code, including setup instructions and reusable patterns you can adapt to your own projects. For more examples of Java applications running on AWS Lambda, visit serverlessland.com and explore a wide range of resources, tutorials, and real-world use cases.

Announcing second-generation AWS Outposts racks with breakthrough performance and scalability on-premises

Post Syndicated from Micah Walter original https://aws.amazon.com/blogs/aws/announcing-second-generation-aws-outposts-racks-with-breakthrough-performance-and-scalability-on-premises/

Today we’re announcing the general availability of second-generation AWS Outposts racks, which marks the latest innovation from AWS for edge computing. This new generation includes support for the latest x86-powered Amazon Elastic Compute Cloud (Amazon EC2) instances, new simplified network scaling and configuration, and accelerated networking instances designed specifically for ultra-low latency and high-throughput workloads. These enhancements deliver greater performance for a broad range of on-premises workloads, such as core trading systems of financial services and telecom 5G Core workloads.

Customers like athenahealth, FanDuel, First Abu Dhabi Bank, Mercado Libre, Liberty Latin America, Riot Games, Vector Limited, and Wiwynn are already using Outposts racks for workloads that need to stay on-premises. The second-generation Outposts rack can meet low-latency, local data processing, and data residency needs for workloads such as game servers for multi-player online games, customer transaction data, medical records, industrial and manufacturing control systems, telecom Business Support Systems (BSS), and edge inference of a variety of machine learning (ML) models. Customers can now take advantage of the latest generation of processors and more advanced configurations of Outposts racks to support faster processing, higher memory capacity, and increased network bandwidth.

Latest generation EC2 instances

We’re excited to announce local support for the latest generation (7th generation) of x86-powered Amazon EC2 instances on AWS Outposts racks, starting with C7i compute-optimized instances, M7i general-purpose instances, and R7i memory-optimized instances. These new instances deliver twice the vCPU, memory, and network bandwidth while providing up to 40% better performance compared to C5, M5, and R5 instances on previous generation Outposts racks. They are powered by 4th Gen Intel Xeon Scalable processors and are ideal for a broad range of on-premises workloads requiring enhanced performance such as larger databases, more memory-intensive applications, advanced real-time big data analytics, high-performance video encoding and streaming, and CPU-based edge inference with more sophisticated ML models. Support for more latest generation EC2 instances, including GPU-enabled instances, is coming soon.

Simplified network scaling and configuration

We’ve completely reimagined networking in our latest Outposts generation, making it simpler and more scalable than ever. At the heart of this upgrade is our new Outposts network rack, which acts as a central hub for all your compute and storage traffic.

This new design brings three major benefits to the table. First, you can now scale your compute resources independently from your networking infrastructure, giving you more flexibility and cost efficiency as your workloads grow. Second, we’ve built in network resilience from the ground up, with the network rack automatically handling device failures to keep your systems running smoothly. Third, connecting to your on-premises environment and AWS Regions is now a breeze – you can configure everything from IP addresses to VLAN and BGP settings through straightforward APIs or our updated console interface.

Image of an AWS Outposts rack device

Specialized Amazon EC2 instances with accelerated networking

We’re introducing a new category of specialized Amazon EC2 instances on Outposts racks with accelerated networking. These instances are purpose built for the most latency-sensitive, compute-intensive, and throughput-intensive mission-critical workloads on-premises. To deliver the best possible performance, in addition to the Outpost logical network, these instances feature a secondary physical network with network accelerator cards connected to top-of-rack (TOR) switches.

First in this category are bmn-sf2e instances, designed for ultra-low latency with deterministic performance. The new instances run on Intel’s latest Sapphire Rapids processors (4th Gen Xeon Scalable), delivering 3.9 GHz sustained performance across all cores with generous memory allocation – 8GB of RAM for every CPU core. We’ve equipped bmn-sf2e instances with AMD Solarflare X2522 network cards that connect directly to top-of-rack switches.

For financial services customers, especially capital market firms, these instances offer deterministic networking through native Layer 2 (L2) multicast, precision time protocol (PTP), and equal cable lengths. This enables customers to meet regulatory requirements around fair trading and equal access while easily connecting to their existing trading infrastructure.

Instance Name | vCPUs | Memory (DDR5) | Network Bandwidth | NVMe SSD Storage | Accelerated Network Cards | Accelerated Bandwidth (Gbps)
bmn-sf2e.metal-16xl | 64 | 512 GiB | 25 Gbps | 2 x 8 TB (16 TB) | 2 | 100
bmn-sf2e.metal-32xl | 128 | 1024 GiB | 50 Gbps | 4 x 8 TB (32 TB) | 4 | 200

The second instance type, bmn-cx2, is optimized for high throughput and low latency. This instance features NVIDIA ConnectX-7 400G NICs physically connected to high-speed top-of-rack switches, delivering up to 800 Gbps bare metal network bandwidth operating at near line rate. With native Layer 2 (L2) multicast and hardware PTP support, this instance is ideal for high-throughput workloads like real-time market data distribution, risk analytics, and telecom 5G core network applications.

Instance Name | vCPUs | Memory (DDR5) | Network Bandwidth | NVMe SSD Storage | Accelerated Network Cards | Accelerated Bandwidth (Gbps)
bmn-cx2.metal-48xl | 192 | 1536 GiB | 50 Gbps | 4 x 4 TB (16 TB) | 2 | 800

Bottom line, the new generation of Outposts racks delivers enhanced performance, scalability, and resiliency for a broad range of on-premises workloads, even for mission-critical workloads with the most stringent latency and throughput requirements. You can make your selection and initiate your order from the AWS Management Console. The new instances maintain consistency with regional deployments by supporting the same APIs, AWS Management Console, automation, governance policies, and security controls in the cloud and on-premises, improving developer productivity and IT efficiency.

Things to know

At launch, second-generation Outposts racks can be shipped to the US and Canada and parented back to six AWS Regions: US East (N. Virginia and Ohio), US West (Oregon), EU West (London and France), and Asia Pacific (Singapore). Support for more countries, territories, and AWS Regions is coming soon. At launch, second-generation Outposts racks locally support a subset of the AWS services found in previous generation Outposts racks. Support for more EC2 instance types and more AWS services is coming soon.

To learn more, visit the AWS Outposts racks product page and user guide. You can also talk to an Outposts expert if you are ready to discuss your on-premises needs.

— Micah;


AWS Lambda standardizes billing for INIT Phase

Post Syndicated from Shubham Gupta original https://aws.amazon.com/blogs/compute/aws-lambda-standardizes-billing-for-init-phase/

Effective August 1, 2025, AWS will standardize billing for the initialization (INIT) phase across all AWS Lambda function configurations. This change specifically affects on-demand invocations of Lambda functions packaged as ZIP files that use managed runtimes, for which the INIT phase duration was previously unbilled. This update standardizes billing of the INIT phase across all runtime types, deployment packages, and invocation modes. Most users will see minimal impact on their overall Lambda bill from this change, as the INIT phase typically occurs for a very small fraction of function invocations. In this post, we discuss the Lambda Function Lifecycle and upcoming changes to INIT phase billing. You will learn what happens in the INIT phase and when it occurs, how to monitor your INIT phase duration, and strategies to optimize this phase and minimize costs.

Understanding the Lambda function execution lifecycle

The Lambda function execution lifecycle consists of three distinct phases: INIT, INVOKE, and SHUTDOWN. The INIT phase is triggered during a “cold start” when Lambda creates a new execution environment for a function in response to an invocation. This is followed by the INVOKE phase where the request is processed, and finally, the SHUTDOWN phase where the execution environment is terminated. For a summary of the execution lifecycle, watch AWS Lambda execution environment lifecycle.

During the INIT phase, Lambda performs a series of preparatory steps within a maximum duration of 10 seconds. The service retrieves the function code from an internal Amazon S3 bucket, or from Amazon Elastic Container Registry (Amazon ECR) for functions using container packaging. Then, it configures an environment with the specified memory, runtime, and other settings. When the execution environment is prepared, Lambda executes four key tasks in sequence:

  1. Initiate any extensions configured (Extension INIT)
  2. Bootstrap the runtime (Runtime INIT)
  3. Execute the function’s static code (Function INIT)
  4. Run any before-checkpoint runtime hooks (applicable only for Lambda SnapStart)

Understanding the billing changes

Lambda charges are based on the number of requests and the duration it takes for the code to run. The duration is calculated from the moment the function code begins running until it completes or terminates, rounded up to the nearest millisecond. Duration cost depends on the amount of memory that you allocate to your function.
Previously, the INIT phase duration wasn’t included in the Billed Duration for functions using managed runtimes with ZIP archive packaging, as evidenced in Amazon CloudWatch logs:

REPORT RequestId: xxxxx   Duration: 250.06 ms  Billed Duration: 251 ms  Memory Size: 1024 MB
Max Memory Used: 350 MB   Init Duration: 100.77 ms

However, functions configured with custom runtimes, Provisioned Concurrency (PC), or OCI packaging already included the INIT phase duration in their Billed Duration. Effective August 1, 2025, the INIT phase will be billed across all configuration types, and the INIT phase duration will also be included in the Billed Duration for on-demand invocations of functions using managed runtimes with ZIP archive packaging. After this change, the REPORT RequestId log line will show the following:

REPORT RequestId: xxxxx   Duration: 250.06 ms  Billed Duration: 351 ms  Memory Size: 1024 MB
Max Memory Used: 350 MB   Init Duration: 100.77 ms 

The additional INIT phase duration charges will follow the standard on-demand duration pricing that is specific to each AWS Region, which can be found on the Lambda pricing page. For Lambda@Edge functions, the INIT phase duration will be billed according to Lambda@Edge duration rates.
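To make the arithmetic behind the two REPORT lines concrete, here is a minimal sketch of how the Billed Duration changes once the INIT phase is included. The helper names and the per-GB-second rate are assumptions for illustration only; the actual rate depends on your Region and architecture and should be taken from the Lambda pricing page.

```python
import math


def billed_duration_ms(duration_ms: float, init_ms: float, include_init: bool) -> int:
    """Billed duration in whole milliseconds, rounded up."""
    total_ms = duration_ms + init_ms if include_init else duration_ms
    return math.ceil(total_ms)


def duration_charge(billed_ms: float, memory_mb: int, price_per_gb_second: float) -> float:
    """Duration charge for one invocation: GB allocated x seconds billed x rate."""
    return (memory_mb / 1024) * (billed_ms / 1000) * price_per_gb_second


# Values taken from the sample REPORT lines above.
before_ms = billed_duration_ms(250.06, 100.77, include_init=False)  # 251
after_ms = billed_duration_ms(250.06, 100.77, include_init=True)    # 351

# Assumed, illustrative rate; look up the real one for your Region.
ASSUMED_PRICE_PER_GB_SECOND = 0.0000166667

extra_charge = duration_charge(after_ms - before_ms, 1024, ASSUMED_PRICE_PER_GB_SECOND)
print(before_ms, after_ms, f"{extra_charge:.10f}")
```

The extra charge applies only to invocations that include an INIT phase, so the effect on a monthly bill depends on how often cold starts occur.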

Finding the INIT phase duration and impact to Lambda billing

You can already monitor the time spent in the INIT phase of your function invocations using the “init_duration” CloudWatch metric. This metric is also reported as “Init Duration” in the “REPORT RequestId” log line within CloudWatch Logs. These tools offer valuable insights into the INIT time of Lambda functions, which will now be factored into billing calculations.

For a more comprehensive analysis, you can use the following CloudWatch Log Insights query to generate a detailed report estimating the previously unbilled duration of the INIT phase. The query helps you understand the proportion of the unbilled INIT phase time relative to your overall Lambda usage, enabling more accurate cost projections following this billing change.

filter @type = "REPORT" and @billedDuration < (@duration + @initDuration) 
| stats sum((@memorySize/1000000/1024) * (@billedDuration/1000)) as BilledGBs, 
sum((@memorySize/1000000/1024) * ((ceil(@duration + @initDuration) - @billedDuration)/1000)) as UnbilledInitGBs, 
(UnbilledInitGBs/ (UnbilledInitGBs+BilledGBs)) as Ratio

The CloudWatch Log Insights query provides three essential metrics:

  1. BilledGBs: Represents the total GB-s (gigabyte-seconds) currently being billed for the chosen log groups.
  2. UnbilledInitGBs: Shows the total GB-s consumed during INIT phase that was previously not included in billing.
  3. Ratio: Indicates the fraction of total GB-s attributed to previously unbilled INIT phase duration (multiply by 100 to express it as a percentage).

Using these existing monitoring capabilities allows you to proactively assess and optimize your Lambda function INIT times, potentially minimizing the impact of the new billing structure on your overall costs.
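If you have exported REPORT lines and want to check the numbers locally, the same arithmetic can be sketched in a few lines of Python. This is only an approximation of the query above, not a replacement for it: the function name is hypothetical, each REPORT entry is assumed to be a single string, and billed GB-s are summed over every REPORT line passed in, which is an assumption about how you would scope the totals.

```python
import math
import re


def _field(pattern: str, line: str):
    """Return the first captured number in the line, or None if absent."""
    match = re.search(pattern, line)
    return float(match.group(1)) if match else None


def summarize_reports(report_lines):
    """Aggregate billed and previously unbilled INIT GB-seconds."""
    billed_gbs = 0.0
    unbilled_init_gbs = 0.0
    for line in report_lines:
        # Plain "Duration", not "Billed Duration" or "Init Duration".
        duration = _field(r"(?<!Billed )(?<!Init )Duration: ([\d.]+) ms", line)
        billed = _field(r"Billed Duration: ([\d.]+) ms", line)
        memory_mb = _field(r"Memory Size: ([\d.]+) MB", line)
        init = _field(r"Init Duration: ([\d.]+) ms", line)
        if duration is None or billed is None or memory_mb is None:
            continue  # not a complete REPORT line
        memory_gb = memory_mb / 1024
        billed_gbs += memory_gb * billed / 1000
        if init is not None:
            unbilled_ms = max(0, math.ceil(duration + init) - billed)
            unbilled_init_gbs += memory_gb * unbilled_ms / 1000
    total = billed_gbs + unbilled_init_gbs
    return {
        "BilledGBs": billed_gbs,
        "UnbilledInitGBs": unbilled_init_gbs,
        "Ratio": unbilled_init_gbs / total if total else 0.0,
    }


sample = [
    "REPORT RequestId: xxxxx  Duration: 250.06 ms  Billed Duration: 251 ms  "
    "Memory Size: 1024 MB  Max Memory Used: 350 MB  Init Duration: 100.77 ms",
    "REPORT RequestId: yyyyy  Duration: 100.00 ms  Billed Duration: 100 ms  "
    "Memory Size: 1024 MB  Max Memory Used: 350 MB",
]
summary = summarize_reports(sample)
print(summary)
```

The second sample line is a made-up warm invocation, included to show that invocations without an Init Duration contribute only to the billed total.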

Understanding and optimizing Lambda INIT phase

The Lambda INIT phase is triggered in two specific scenarios: during the creation of a new execution environment and when a function scales up to meet demand. This INIT code runs only during these “cold starts” and is bypassed during subsequent invocations that use existing warm environments. After the INIT phase, Lambda runs the function handler code to process the invocation.

Following the handler execution, Lambda freezes the execution environment. To improve resource management and performance, the Lambda service retains the execution environment for a non-deterministic period of time. During this time, if another request arrives for the same function, then the service may reuse the environment. This second request typically finishes faster, because the execution environment already exists and it isn’t necessary to download the code and run the INIT code. This is called a “warm start.”

Developers can use the INIT phase to create, initialize, and configure objects that are expected to be reused across multiple invocations, instead of doing this work in the handler. Initializing dependencies and shared objects up front reduces the latency of subsequent invocations. For example:

  • Download additional libraries or dependencies
  • Establish client connections to other AWS services such as Amazon S3 or Amazon DynamoDB
  • Create database connections to be shared across invocations
  • Retrieve application parameters or secrets from AWS Systems Manager Parameter Store or AWS Secrets Manager

When developing Lambda functions, it’s important to strategically decide what code runs during the INIT phase as opposed to the handler phase, because it affects both performance and costs.
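The split between INIT code and handler code can be sketched as follows. This is a minimal illustration, not a recommended template: load_settings is a hypothetical stand-in for slow one-time work such as fetching parameters or opening connections, and INIT_COUNT exists only to make the behavior observable.

```python
import time

INIT_COUNT = 0  # hypothetical counter, used only to demonstrate the behavior


def load_settings() -> dict:
    """Stand-in for slow one-time work (fetching parameters, opening connections)."""
    global INIT_COUNT
    INIT_COUNT += 1
    return {"table_name": "example-table", "loaded_at": time.time()}


# Module scope: runs once per execution environment, during Function INIT.
SETTINGS = load_settings()


def handler(event, context):
    # Handler scope: runs on every invocation and reuses SETTINGS.
    return {"table": SETTINGS["table_name"], "echo": event.get("message")}


# Simulate two invocations landing on the same warm environment.
first = handler({"message": "a"}, None)
second = handler({"message": "b"}, None)
print(INIT_COUNT, first["echo"], second["echo"])
```

Anything placed at module scope adds to the INIT duration of every cold start, so work that only some invocations need may be better loaded lazily inside the handler.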

Optimizing package/library size

The INIT phase includes creating an execution environment, downloading the function code and initializing it. Three main factors influence its performance:

  1. The size of the function package, in terms of imported libraries and dependencies, and Lambda layers.
  2. The amount of code and INIT work.
  3. The performance of libraries and other services in setting up connections and other resources.

Larger function packages increase code download times. You can decrease INIT phase duration by reducing package size, resulting in faster cold starts and lower INIT costs. Furthermore, optimizing loading of libraries can also significantly impact package size. For example, in Node.js functions, you should use specific path imports (for example import DynamoDB from "aws-sdk/clients/dynamodb") rather than wildcard imports (for example import * as AWS from "aws-sdk") to speed up the INIT phase. Tools such as esbuild can further optimize performance by minifying and bundling packages. For details, read Optimizing Node.js dependencies in AWS Lambda.

Optimizing INIT phase execution and cost efficiency

The frequency of INIT phase executions (or cold starts) directly impacts both performance and cost efficiency. According to an analysis of production Lambda workloads, INITs (cold starts) typically occur in under 1% of invocations, meaning code in the INIT phase may run less than once per hundred invocations.

You can use the INIT phase to perform one-time operations that benefit subsequent invocations. Common optimization patterns include pre-calculating lookup tables or transforming static datasets. For example, you can download static data from Amazon S3 or DynamoDB during INIT, which makes it available to all subsequent function invocations without repeated downloads.

Lambda SnapStart

Lambda SnapStart provides an effective solution for reducing cold start latency and INIT phase costs. When it’s enabled, SnapStart creates a snapshot during the first function INIT and reuses it for subsequent cold starts, eliminating the need for repeated INIT phase executions. This approach is particularly valuable for functions with longer INIT times due to loading module dependencies/frameworks, initializing the runtime, or executing one-time INIT code. SnapStart is supported for Java, .NET, and Python runtimes. You can implement SnapStart through the Lambda console or AWS Command Line Interface (AWS CLI), making sure that your code adheres to the AWS serialization guidelines for snapshot restoration compatibility. Using SnapStart allows you to significantly improve function startup times and optimize costs across multiple popular programming languages.

Provisioned Concurrency

Provisioned Concurrency is a Lambda feature that pre-initializes execution environments before any invocations occur. This proactive approach effectively eliminates the performance impact of the INIT phase on individual function calls, because the INIT is completed in advance.

Although all functions using Provisioned Concurrency benefit from reduced startup times compared to on-demand execution, the impact is particularly pronounced for certain runtimes. For example, C# and Java functions, which typically have slower INIT but faster execution times than Node.js or Python functions, can achieve significant performance gains through this feature. Implementing Provisioned Concurrency allows you to manage both consistent traffic patterns and expected usage spikes, minimizing cold start latency across your serverless applications. This optimization strategy is particularly valuable for functions with complex INIT requirements or those serving latency-sensitive workloads. From a cost optimization perspective, Provisioned Concurrency is most suitable for workloads that keep the provisioned environments busy more than about 60% of the time, because above that level it is typically more cost efficient than on-demand execution.
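One way to see where a threshold of roughly 60% can come from is a simple break-even calculation. The sketch below is an assumption-laden illustration: the three per-GB-second rates are illustrative values that should be checked against the Lambda pricing page for your Region, and the model ignores request charges, INIT billing, and any discounts.

```python
def break_even_utilization(on_demand_rate: float,
                           pc_duration_rate: float,
                           pc_provisioned_rate: float) -> float:
    """Fraction of time provisioned environments must be busy to match on-demand cost.

    Per provisioned GB over T seconds with utilization u:
      Provisioned Concurrency cost = T * pc_provisioned_rate + u * T * pc_duration_rate
      On-demand cost               = u * T * on_demand_rate
    Setting the two equal and solving for u gives the expression below.
    """
    return pc_provisioned_rate / (on_demand_rate - pc_duration_rate)


# Assumed, illustrative rates in USD per GB-second.
ON_DEMAND_RATE = 0.0000166667
PC_DURATION_RATE = 0.0000097222
PC_PROVISIONED_RATE = 0.0000041667

utilization = break_even_utilization(ON_DEMAND_RATE, PC_DURATION_RATE, PC_PROVISIONED_RATE)
print(f"Break-even utilization: {utilization:.0%}")
```

With these assumed rates the break-even point lands at about 60%. Different rates, for example on another architecture or in another Region, would move the threshold.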

Conclusion

Effective August 1, 2025, AWS is standardizing the INIT phase billing for AWS Lambda. AWS provides multiple ways for you to optimize both the performance and costs of your Lambda functions. Whether you’re using SnapStart, implementing Provisioned Concurrency, or optimizing INIT code, we recommend working closely with AWS support teams to identify the most suitable optimization approach for your specific workload requirements.

For more support and guidance, consider participating in AWS Cost Optimization workshops or consulting the Lambda documentation.

Extend the Amazon Q Developer CLI with Model Context Protocol (MCP) for Richer Context

Post Syndicated from Brian Beach original https://aws.amazon.com/blogs/devops/extend-the-amazon-q-developer-cli-with-mcp/

Earlier today, Amazon Q Developer announced Model Context Protocol (MCP) support in the command line interface (CLI). Developers can connect external data sources to Amazon Q Developer CLI with MCP support for more context-aware responses. By integrating MCP tools and prompts into Q Developer CLI, you get access to an expansive list of pre-built integrations or any MCP Servers that support stdio. This extra context helps Q Developer write more accurate code, understand your data structures, generate appropriate unit tests, create database documentation, and execute precise queries, all without needing to develop custom integration code. By extending Q Developer with MCP tools and prompts, developers can execute development tasks faster, streamlining the developer experience. At AWS, we’re committed to supporting popular open source protocols for agents like Model Context Protocol (MCP) proposed by Anthropic. We’ll continue to support this effort by extending this functionality within the Amazon Q Developer IDE plugins in the coming weeks.

Introduction

I’m always on the lookout for tools and technologies that can streamline my workflow and unlock new capabilities. That’s why I was excited about the recent addition of Model Context Protocol (MCP) support in the Amazon Q Developer command line interface (CLI). MCP is an open protocol that standardizes how applications can seamlessly integrate with LLMs, providing a common way to share context, access data sources, and enable powerful AI-driven functionality. You can read more about MCP in this introduction.

Q Developer has had the ability to use tools for a while. I previously discussed the ability to run CLI commands and describe AWS resources. With the Q Developer CLI’s support for MCP tools and prompts, I now have the ability to add additional tools. For example, while I have had the ability to describe my AWS resources, I also need to describe database schemas, message formats, etc. to build an application. Let’s see how I can configure MCP to provide this additional context.

In this post, I will configure an MCP server to provide Q Developer with my database schema for a simple Learning Management System (LMS) that I am working on. While Q Developer is great at writing SQL, it does not know the schema of my database. The table structure and relationships are stored in the database and are not part of the source code of my project. Therefore, I am going to use an MCP server that can query the database schema. Specifically, I am using the official PostgreSQL reference implementation to connect to my Amazon Relational Database Service (RDS). Let’s get started.

Before Model Context Protocol

Prior to the introduction of MCP support, the Q Developer CLI provided a set of native tools, including the ability to execute bash commands, interact with files and the file system, and even make calls to AWS services. However, when it came to querying a database, the CLI was limited in its capabilities.

For example, prior to configuring the MCP server, I asked Q Developer to “Write a query that lists the students and the number of credits each student is taking.” In the following image you can see that Q Developer could only provide a generic SQL query, as it lacked the specific knowledge of the database schema for my LMS.

Screenshot of Amazon Q Developer CLI showing a response to a query request. The response includes explanatory text acknowledging the lack of schema information, followed by a generic SQL query written in green text. The query joins students, student_courses, and courses tables to calculate total credit hours per student, demonstrating Q's limited ability without MCP configuration.

While this is a great start, I know that Q Developer could do so much more if it knew the database schema.

Configuring Model Context Protocol

The introduction of MCP support in the Q Developer CLI allows me to easily configure MCP servers. I configure one or more MCP servers in a file called mcp.json. I can store the configuration in my home directory (e.g. ~/.aws/amazonq/mcp.json) and it is applied to all projects on my machine. Alternatively, I can store the configuration in the workspace root (e.g. .amazonq/mcp.json) so it is shared among project members. Here is an example of the configuration for the PostgreSQL MCP server.

{
  "mcpServers": {
    "postgres": {
      "command": "npx",
      "args": [
        "-y",
        "@modelcontextprotocol/server-postgres",
        "postgresql://USERNAME:PASSWORD@HOST:5432/DBNAME"
      ]
    }
  }
}
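Because the connection string embeds a user name and password, special characters in either one can break the URL. As a hedged sketch, the following Python script builds the same configuration from environment variables and percent-encodes the credentials. The build_mcp_config helper, the fallback values, and the choice of the PGUSER, PGPASSWORD, PGHOST, and PGDATABASE variables are assumptions for illustration and are not part of the Q Developer CLI.

```python
import json
import os
from urllib.parse import quote


def build_mcp_config(user: str, password: str, host: str, dbname: str, port: int = 5432) -> dict:
    """Build an mcp.json structure for the PostgreSQL MCP server."""
    # Percent-encode credentials so characters such as "@" or "/" stay valid in the URL.
    url = (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{dbname}"
    )
    return {
        "mcpServers": {
            "postgres": {
                "command": "npx",
                "args": ["-y", "@modelcontextprotocol/server-postgres", url],
            }
        }
    }


if __name__ == "__main__":
    config = build_mcp_config(
        user=os.environ.get("PGUSER", "USERNAME"),
        password=os.environ.get("PGPASSWORD", "PASSWORD"),
        host=os.environ.get("PGHOST", "HOST"),
        dbname=os.environ.get("PGDATABASE", "DBNAME"),
    )
    # Print the JSON; redirect it to mcp.json in the location you prefer.
    print(json.dumps(config, indent=2))
```

Keep in mind that the generated file still contains the password in plain text, so a configuration stored in the workspace root should not be committed with real credentials.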

With the MCP server configured, let’s see how Amazon Q Developer enhances my experience.

After Model Context Protocol

First, I start a new Q Developer session and immediately see the benefits. In addition to the existing tools, Q Developer now has access to PostgreSQL as shown in the following image. This means I can easily explore the schema of my database, understand the structure of the tables, and even execute complex SQL queries, all without having to write any additional integration code.

Screenshot of Amazon Q Developer CLI displaying a list of available tools. The tools are categorized into file system tools, bash execution, AWS tools, PostgreSQL database tools, and issue reporting. The PostgreSQL category is highlighted, showing the integration of MCP for database access.

Let’s test the MCP server by asking Q Developer to “List the database tables.” As you can see in the following example, Q Developer now understands that I am asking about the PostgreSQL database, and uses the MCP server to list my three tables: students, courses, and enrollment.

Screenshot of Amazon Q Developer CLI showing a database table listing request and response. The response shows a tool request using list_objects command with JSON parameters, followed by execution status and a list of three tables in the public schema: courses, enrollment, and students.

Let’s go back to the example from earlier in this post. Now, when I ask Q Developer to “Write a query that lists the students and the number of credits each student is taking,” it no longer responds with a generic query. Instead, Q Developer first describes the relevant tables in my database, generates the appropriate SQL query, and then executes it, providing me with the desired results.

Screenshot of Amazon Q Developer CLI showing a complete SQL query workflow. The image displays a precise SQL query in green syntax highlighting, followed by a results table showing student credit information, and an explanation of how the query works through five numbered steps. This demonstrates Q's ability to generate, execute, and explain database queries with schema knowledge.

Of course, Q Developer can do a lot more than just write queries. Q Developer can use the MCP server to write Java code that accesses the database, create unit tests for the data layer, document the database, and much more. For example, I asked Q Developer to “Create an entity-relationship (ER) diagram using Mermaid syntax.” Q Developer was able to generate a visual representation of the database schema, helping me better understand the relationships between the various entities.

Entity-Relationship (ER) diagram generated by Amazon Q Developer. The diagram shows three tables: STUDENTS, COURSES, and ENROLLMENT. Each table is represented by a box containing column names and data types. The ENROLLMENT table links STUDENTS and COURSES with 'enrolls in' and 'has enrolled' relationships. Primary and foreign keys are indicated. This visualizes the database schema structure for the Learning Management System.

The integration of MCP into the Q Developer CLI has significantly streamlined my workflow by allowing me to add additional tools as needed.

Conclusion

The addition of MCP support in the Amazon Q Developer CLI provides a standardized way to share context and access data sources. In this post, I’ve demonstrated how I can use the Q Developer CLI’s MCP integration to quickly set up a connection to a PostgreSQL database, explore the schema, and generate complex SQL queries without having to write any additional integration code. Moving forward, I’m excited to see how you can leverage MCP to further enhance your development workflow. I encourage you to explore the MCP capabilities and the AWS MCP Servers repository on GitHub.

How Flutter UKI optimizes data pipelines with AWS Managed Workflows for Apache Airflow

Post Syndicated from Monica Cujerean, Ionut Hedesiu original https://aws.amazon.com/blogs/big-data/how-flutter-uki-optimizes-data-pipelines-with-aws-managed-workflows-for-apache-airflow/

This post is co-written with Monica Cujerean and Ionut Hedesiu from Flutter UKI.

In this post, we share how Flutter UKI transitioned from a monolithic Amazon Elastic Compute Cloud (Amazon EC2)-based Airflow setup to a scalable and optimized Amazon Managed Workflows for Apache Airflow (Amazon MWAA) architecture using features like Kubernetes Pod Operator, continuous integration and delivery (CI/CD) integration, and performance optimization techniques.

About Flutter UKI

As a division of Flutter Entertainment, Flutter UKI stands at the forefront of the sports betting and gaming industry. Flutter UKI offers a diverse portfolio of entertainment options, encompassing sports wagering, casino games, bingo, and poker experiences. Flutter UKI’s digital presence is robust, operating through an array of renowned online brands. These include the iconic Paddy Power, Sky Betting and Gaming, and Tombola. While Flutter UKI has established a strong online foothold, it maintains a significant physical presence with a network of 576 Paddy Power betting shops strategically located across the United Kingdom and Ireland.

The Data team at Flutter UKI is integral to the company’s mission of using data to drive business success and innovation. Specializing in data, their teams are dedicated to ensuring the seamless integration, management, and accessibility of data across multiple facets of the organization. By developing robust data pipelines and maintaining high data quality standards, Flutter UKI empowers stakeholders with reliable insights, optimizes operational efficiencies, and enhances the user experience. Its commitment to data excellence underpins its efforts to remain at the forefront of the online gaming and entertainment industry, delivering value and strategic advantage to the business.

The journey from self managing Airflow on Amazon EC2 to operating Airflow workloads at scale using Amazon MWAA

Flutter UKI’s data orchestration story began in 2017 with a modest Apache Airflow deployment on EC2 instances. As the company’s digital footprint expanded, so did their data pipeline requirements, leading to an increasingly complex monolithic cluster that demanded constant attention and resource scaling. The operational overhead of managing these EC2 instances became a significant challenge for their engineering teams. In 2022, Flutter UKI reached a crossroads. They needed to choose between re-architecting their service on Amazon Elastic Kubernetes Service (Amazon EKS) or embracing Amazon Managed Workflows for Apache Airflow (MWAA).

Flutter UKI was looking to transform their data orchestration service from a resource-intensive, self-managed system to a more efficient, managed service that would allow them to focus on their core business objectives rather than infrastructure management. Through extensive proof-of-concept (POC) testing and close collaboration with AWS Enterprise Support, Flutter UKI gained confidence in the ability of Amazon MWAA to handle their sophisticated workloads at scale. Their choice of MWAA over a self-managed solution on Amazon EKS reflected Flutter UKI’s strategic focus on using managed services to reduce operational complexity and accelerate innovation.

The migration to Amazon MWAA followed a methodical approach, with extensive testing across multiple POCs. During the POCs, the engineering team found MWAA easy to use, which shortened the learning curve. Learning from each POC, they iterated on the final architecture by making data-driven decisions. Starting with a small subset of directed acyclic graphs (DAGs), the Flutter UKI team expanded their deployment over time, gradually moving hundreds and eventually thousands of workflows to the managed service. This careful, phased transition allowed them to validate the performance and reliability of MWAA while minimizing operational risk.

High-level architecture design

During the service re-architecture, the data team strategically managed over 3,500 dynamically generated DAGs by implementing a sophisticated distribution approach across multiple Amazon MWAA environments to create workload-isolated environments. Another reason for having multiple environments was to make sure that no single MWAA environment gets overloaded with too many DAGs. By placing DAG files across diverse Amazon Simple Storage Service (Amazon S3) locations and configuring unique DAG_FOLDER paths for each environment, the data team created an intelligent load balancing mechanism that allocates workflows based on complex criteria including environment type, task volume, and environment-specific DAG affinity. A round-robin distribution strategy was designed to minimize single environment load, ensuring scalable infrastructure with zero performance degradation. This approach allowed the team to optimize workflow orchestration, maintaining high performance while efficiently managing an extensive collection of dynamically generated DAGs across multiple MWAA environments. To provide more compute to individual tasks and to keep the MWAA environments efficient, Flutter UKI delegated the DAG execution to an external compute environment using Amazon Elastic Kubernetes Service (Amazon EKS). The resulting high-level architecture is shown in the following figure.
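The round-robin part of this distribution can be sketched in a few lines. This is a simplified illustration rather than Flutter UKI's implementation: the function name, DAG IDs, and environment names are hypothetical, and the additional criteria mentioned above (environment type, task volume, and DAG affinity) are not modeled.

```python
from collections import defaultdict
from itertools import cycle


def distribute_round_robin(dag_ids, environments):
    """Assign DAGs to environments in turn so that counts differ by at most one."""
    assignment = defaultdict(list)
    # Sorting keeps the assignment stable between runs for the same input.
    for dag_id, environment in zip(sorted(dag_ids), cycle(environments)):
        assignment[environment].append(dag_id)
    return dict(assignment)


dag_ids = [f"dag_{number:03d}" for number in range(7)]
environments = ["mwaa-env-a", "mwaa-env-b", "mwaa-env-c"]  # hypothetical names

placement = distribute_round_robin(dag_ids, environments)
for environment, dags in placement.items():
    print(environment, dags)
```

In a setup like the one described, each environment's list would then decide which S3 location, and therefore which DAG_FOLDER, a generated DAG file is written to.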

  1. Kubernetes Pod Operator (KPO) for tasks: Flutter UKI transitioned from using custom operators and many native Airflow operators to exclusively utilizing the Kubernetes Pod Operator (KPO). This decision simplified their architecture by eliminating unnecessary complexity, reducing maintenance overhead, and mitigating potential bugs. Additionally, this approach enabled them to allocate compute resources on a per-task basis, optimizing overall service performance. It also enabled the use of different container images for different tasks, thereby avoiding library dependency conflicts.
  2. Kubernetes Pod Operator wrapper (KPOw): Instead of using KPO directly, they developed a wrapper (KPOw) around it. This wrapper abstracts the underlying complexity and minimizes the impact of signature changes in Airflow, Amazon MWAA, Amazon EKS, or operator versions. By centralizing these changes, they only need to update the wrapper rather than thousands of individual DAGs. The wrapper also simplifies DAGs by hiding repetitive parameters, such as node affinity, pod resources, and EKS cluster configurations. Furthermore, it enforces company-specific naming conventions and allows for parameter validation at task execution time rather than during DagBag refresh. They also introduced profiles and image files, where profile files contain necessary KPO parameters, and the corresponding image files link to the repository for the task’s container image. This setup ensures consistency across tasks using the same profile and facilitates simultaneous updates across tasks.
  3. Monthly image updates in Kubernetes: Enforcing a policy of monthly image updates made sure that their code remained current, preventing security vulnerabilities and avoiding extensive code changes due to deprecated libraries.
  4. Continuous Airflow updates: Flutter UKI maintains a cutting-edge infrastructure by implementing new Airflow versions shortly after release, while following a carefully orchestrated deployment strategy. Their approach uses standard Amazon MWAA configurations and employs a systematic testing protocol. New versions are first deployed to development and test environments for thorough validation before reaching production systems. This methodical progression significantly reduces the risk of disruptions to business-critical workflows.
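The wrapper and profile idea from the list above can be illustrated without Airflow installed. The following is a hypothetical sketch, not Flutter UKI's code: the profile contents, image references, naming rule, and the build_pod_task_kwargs helper are all assumptions. It returns a plain dictionary that a real wrapper might translate into Kubernetes Pod Operator arguments.

```python
import re

# Hypothetical profile and image definitions; a real setup might load these from files.
PROFILES = {
    "small": {"cpu": "500m", "memory": "1Gi", "node_group": "general"},
    "large": {"cpu": "4", "memory": "16Gi", "node_group": "memory-optimized"},
}
IMAGES = {
    "small": "registry.example.com/etl-base:2025-04",
    "large": "registry.example.com/etl-heavy:2025-04",
}
# Assumed naming convention: lowercase letters, digits, and underscores.
TASK_NAME_PATTERN = re.compile(r"^[a-z][a-z0-9_]{2,62}$")


def build_pod_task_kwargs(task_id: str, profile: str, cmds: list) -> dict:
    """Validate inputs and expand a profile into pod settings for one task."""
    if not TASK_NAME_PATTERN.match(task_id):
        raise ValueError(f"task_id {task_id!r} violates the naming convention")
    if profile not in PROFILES:
        raise ValueError(f"unknown profile {profile!r}")
    settings = PROFILES[profile]
    return {
        "task_id": task_id,
        "pod_name": task_id.replace("_", "-"),
        "image": IMAGES[profile],
        "cmds": list(cmds),
        "resources": {"cpu": settings["cpu"], "memory": settings["memory"]},
        "node_selector": {"node-group": settings["node_group"]},
    }


kwargs = build_pod_task_kwargs("load_orders", "small", ["python", "run.py"])
print(kwargs)
```

Because every DAG goes through one function, a change in operator arguments or cluster configuration can be absorbed in a single place, and updating a profile's image updates all tasks that use that profile.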

To achieve operational excellence, Flutter UKI has implemented a comprehensive monitoring framework centered on Amazon CloudWatch metrics. Their monitoring solution includes strategically configured alarms that provide early warning signals for potential issues. This proactive monitoring approach enables their teams to quickly identify and investigate anomalies in production workload executions, ensuring high availability and performance of their data pipelines. The combination of careful version management and robust monitoring exemplifies Flutter UKI’s commitment to operational excellence in their cloud infrastructure.

  5. CI/CD integration: By managing their code in GitLab with mandatory code reviews, and using Argo Events and Argo Workflows for image updates in Amazon ECR, they streamlined their development processes.
  6. Performance optimization: A significant portion of the DAGs are dynamically generated based on database metadata. This generation process runs outside Amazon MWAA, with its own CI/CD pipeline, and the resulting DAG files are stored in the S3 DAG folder. Code outside of tasks, including parameter evaluation, was avoided. Parameters and secrets are stored in AWS Secrets Manager and retrieved at task runtime. Engineers aim to minimize or eliminate inter-service dependencies within MWAA.

DAGs are scheduled to distribute execution times as evenly as possible. Task code and common modules are hosted on Amazon S3 and retrieved at runtime. For larger codebases, Amazon Elastic File System (Amazon EFS) volumes mounted to task pods are used.
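Spreading execution times evenly can be approximated by giving each DAG a different start offset. The sketch below assumes hourly DAGs and a hypothetical staggered_hourly_schedules helper; how Flutter UKI actually derives its schedules is not described in this post.

```python
def staggered_hourly_schedules(dag_ids):
    """Map each DAG to an hourly cron expression with an evenly spaced minute offset."""
    ordered = sorted(dag_ids)  # stable order so offsets do not shuffle between runs
    count = len(ordered)
    return {
        dag_id: f"{(index * 60) // count} * * * *"
        for index, dag_id in enumerate(ordered)
    }


schedules = staggered_hourly_schedules(["orders", "customers", "payments", "sessions"])
for dag_id, cron in schedules.items():
    print(dag_id, cron)
```

With four DAGs the offsets fall at minutes 0, 15, 30, and 45. With more than 60 DAGs some will share a minute, so a real generator would likely also spread load using task volume or expected runtime.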

Results

Today, Flutter UKI’s infrastructure comprises four Amazon MWAA clusters, each executing tasks on dedicated Amazon EKS node groups. They manage approximately 5,500 DAGs encompassing over 30,000 tasks, handling more than 60,000 DAG runs daily with a concurrency exceeding 450 tasks running simultaneously across clusters. They anticipate a 10% monthly increase in this workload in the short to medium term. During major events like Cheltenham and Grand National, where data load increases by 30%, their MWAA service has demonstrated stability and scalability, achieving a 100% success rate for critical processes in 2025, a significant improvement over previous years.

Conclusion

Flutter UKI’s journey with Amazon Managed Workflows for Apache Airflow (Amazon MWAA) has resulted in a stable, scalable, and resilient production environment. The careful re-architecting of Flutter UKI’s service, combined with strategic decisions around task execution and infrastructure management, has not only simplified their operations, but also enhanced performance and reliability. The team also gained security and compliance benefits, because MWAA provides managed security updates, built-in encryption, and integration with AWS security services. Perhaps most importantly, the shift to MWAA has allowed Flutter UKI’s engineering teams to redirect their efforts from infrastructure maintenance to business-critical tasks, focusing on DAG development and improving data pipeline efficiency, ultimately accelerating innovation in their core business operations.

If you’re looking to reduce operational overhead and migrate to a fully managed Airflow solution on AWS, consider using Amazon MWAA. Get in touch with your Technical Account Manager or your Solutions Architect to discuss a solution specific to your use case. You can also reach out to AWS Support by creating a case if you’re facing issues setting up the service.

Ready to see what Amazon MWAA is like? Visit the AWS Management Console for Amazon MWAA. For more information, see What Is Amazon Managed Workflows for Apache Airflow. Additionally, Using Amazon MWAA with Amazon EKS shows you how to integrate Amazon MWAA with Amazon EKS.


About the authors

Monica Cujerean is a Principal Data Engineer at Flutter UKI, focusing on service-related initiatives that cover performance optimization, cost effectiveness, and new feature adoption on most AWS services in our stack: Amazon MWAA, Amazon Redshift, Amazon Aurora, and Amazon SageMaker.

Ionut Hedesiu is a Senior Data Architect at Flutter UKI, responsible for designing strategic solutions to cover complex and varied business needs. His main expertise is in Amazon MWAA, Kubernetes, Amazon SageMaker, and ETL solutions.

Nidhi Agrawal is a Technical Account Manager at AWS who works with large enterprise customers, providing technical guidance, best practices, and strategic support to help them optimize their environments in the AWS Cloud.

John Kellett is a Senior Customer Solutions Manager with 25 years of experience across private and public sectors. John helps drive end-to-end customer engagement through program management excellence. By understanding and representing customers’ strategic visions, John aligns to develop the people, organizational readiness, and technology competencies to meet the desired outcomes.

Sidhanth Muralidhar is a Principal Technical Account Manager at AWS. He works with large enterprise customers who run their workloads on AWS. He is passionate about working with customers and helping them architect workloads for cost, reliability, performance, and operational excellence at scale in their cloud journey. He has a keen interest in data analytics as well.

How BMW Group built a serverless terabyte-scale data transformation architecture with dbt and Amazon Athena

Post Syndicated from Philipp Karg original https://aws.amazon.com/blogs/big-data/how-bmw-group-built-a-serverless-terabyte-scale-data-transformation-architecture-with-dbt-and-amazon-athena/

Businesses increasingly require scalable, cost-efficient architectures to process and transform massive datasets. At the BMW Group, our Cloud Efficiency Analytics (CLEA) team has developed a FinOps solution to optimize costs across over 10,000 cloud accounts. While enabling organization-wide efficiency, the team also applied these principles to the data architecture, making sure that CLEA itself operates frugally. After evaluating various tools, we built a serverless data transformation pipeline using Amazon Athena and dbt.

This post explores our journey, from the initial challenges to our current architecture, and details the steps we took to achieve a highly efficient, serverless data transformation setup.

Challenges: Starting from a rigid and costly setup

In our early stages, we encountered several inefficiencies that made scaling difficult. We were managing complex schemas with wide tables that required significant maintenance effort. Initially, we used Terraform to create tables and views in Athena, allowing us to manage our data infrastructure as code (IaC) and automate deployments through continuous integration and delivery (CI/CD) pipelines. However, this method slowed us down whenever we changed data models or handled schema changes, and it required considerable development effort.

As our solution grew, we faced challenges with query performance and costs. Each query scanned large amounts of raw data, resulting in increased processing time and higher Athena costs. We used views to provide a clean abstraction layer, but this masked underlying complexity because seemingly simple queries against these views scanned large volumes of raw data, and our partitioning strategy wasn’t optimized for these access patterns. As our datasets grew, the lack of modularity in our data design increased complexity, making scalability and maintenance increasingly difficult. We needed a solution for pre-aggregating, computing, and storing query results of computationally intensive transformations. The absence of robust testing and lineage solutions made it challenging to identify the root causes of data inconsistencies when they occurred.

As part of our business intelligence (BI) solution, we used Amazon QuickSight to build our dashboards, providing visual insights into our cloud cost data. However, our initial data architecture led to challenges. We were building dashboards on top of large, wide datasets, with some hitting the QuickSight per-dataset SPICE limit of 1 TB. Additionally, during SPICE ingest, our largest datasets required 4–5 hours of processing time because each ingest performed a full scan, often of more than a terabyte of data. This architecture limited our agility as we scaled up. The long processing times and storage limitations hindered our ability to provide timely insights and expand our analytics capabilities.

To address these issues, we enhanced the data architecture with AWS Lambda, AWS Step Functions, AWS Glue, and dbt. This tool stack significantly improved our development agility, empowering us to quickly modify and introduce new data models. At the same time, we increased our overall data processing efficiency with incremental loads and better schema management.

Solution overview

Our current architecture consists of a serverless and modular pipeline coordinated by GitHub Actions workflows. We chose Athena as our primary query engine for several strategic reasons: it aligns perfectly with our team’s SQL expertise, excels at querying Parquet data directly in our data lake, and alleviates the need for dedicated compute resources. This makes Athena an ideal fit for CLEA’s architecture, where we process around 300 GB daily from a data lake of 15 TB, with our largest dataset containing 50 billion rows across up to 400 columns. The capability of Athena to efficiently query large-scale Parquet data, combined with its serverless nature, enables us to focus on writing efficient transformations rather than managing infrastructure.

The following diagram illustrates the solution architecture.

Using this architecture, we’ve streamlined our data transformation process using dbt. In dbt, a data model represents a single SQL transformation that creates either a table or a view—essentially a building block of our data transformation pipeline. Our implementation includes around 400 such models, 50 data sources, and around 100 data tests. This setup enables seamless updates—whether creating new models, updating schemas, or modifying views—triggered simply by creating a pull request in our source code repository, with the rest handled automatically.

Our workflow automation includes the following features:

  • Pull request – When we create a pull request, the change is deployed to our testing environment first. After it passes validation and is approved and merged, it’s deployed to production using GitHub workflows.
  • Cron scheduler – For nightly runs or multiple daily runs to reduce data latency, we use scheduled GitHub workflows. This setup allows us to configure specific models with different update strategies based on data needs. We can set models to update incrementally (processing only new or changed data), as views (querying without materializing data), or as full loads (completely refreshing the data). This flexibility optimizes processing time and resource usage. We can target only specific folders (such as the source, prepared, or semantic layers) and run dbt tests afterward to validate model quality.
  • On demand – When adding new columns or changing business logic, we need to update historical data to maintain consistency. For this, we use a backfill process, which is a custom GitHub workflow created by our team. The workflow allows us to select specific models, include their upstream dependencies, and set parameters like start and end dates. This makes sure that changes are applied accurately across the entire historical dataset, maintaining data consistency and integrity.
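
To make the on-demand backfill more concrete, the following Python sketch assembles a `dbt run` command for a date-bounded backfill. It is an illustration under stated assumptions, not the CLEA team's actual workflow: the model name `cost_daily` and the variable names `start_date` and `end_date` are hypothetical, and the models would need to read those variables through dbt's `var()` function. The `--select` and `--vars` flags and the leading `+` graph operator (which also selects upstream dependencies) are standard dbt CLI features.

```python
import json
from datetime import date


def build_backfill_command(models, start, end, include_upstream=True):
    """Build a `dbt run` argument list for a date-bounded backfill.

    The variable names `start_date` and `end_date` are hypothetical;
    the selected models would have to read them with dbt's var() function.
    """
    if start > end:
        raise ValueError("start must not be after end")
    # A leading "+" is dbt's graph operator for "include all upstream models".
    prefix = "+" if include_upstream else ""
    selectors = [f"{prefix}{name}" for name in models]
    # --vars accepts a YAML string, and JSON is valid YAML.
    variables = {"start_date": start.isoformat(), "end_date": end.isoformat()}
    return ["dbt", "run", "--select", *selectors, "--vars", json.dumps(variables)]


if __name__ == "__main__":
    # Hypothetical model name, for illustration only.
    command = build_backfill_command(
        ["cost_daily"], date(2024, 1, 1), date(2024, 1, 31)
    )
    print(" ".join(command))
```

Returning an argument list rather than a single string keeps the command safe to pass to `subprocess.run` without shell quoting issues.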

Our pipeline is organized into three primary stages—Source, Prepared, and Semantic—each serving a specific purpose in our data transformation journey. The Source stage maintains raw data in its original form. The Prepared stage cleanses and standardizes this data, handling tasks like deduplication and data type conversions. The Semantic stage transforms this prepared data into business-ready models aligned with our analytical needs. An additional QuickSight step handles visualization requirements. To achieve low cost and high performance, we use dbt models and SQL code to manage all transformations and schema changes. By implementing incremental processing strategies, our models process only new or changed data rather than reprocessing the entire dataset with each run.

The Semantic stage (not to be confused with dbt’s semantic layer feature) introduces business logic, transforming data into aggregated datasets that are directly consumable by BMW’s Cloud Data Hub, internal CLEA dashboards, data APIs, or In-Console Cloud Assistant (ICCA) chatbot. The QuickSight step further optimizes data by selecting only necessary columns by using a column-level lineage solution and setting a dynamic date filter with a sliding window to ingest only relevant hot data into SPICE, avoiding unused data in dashboards or reports.
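
The sliding-window filter described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a hypothetical `usage_date` field and a window length chosen by the caller; in the real pipeline the equivalent filter would live in the SQL of the view that feeds SPICE.

```python
from datetime import date, timedelta


def sliding_window_start(today, days_back):
    """Return the earliest date that still falls inside the window."""
    if days_back < 0:
        raise ValueError("days_back must be non-negative")
    return today - timedelta(days=days_back)


def filter_hot_rows(rows, today, days_back, date_key="usage_date"):
    """Keep only rows dated within [window start, today].

    `usage_date` is a hypothetical column name used for illustration.
    """
    start = sliding_window_start(today, days_back)
    return [row for row in rows if start <= row[date_key] <= today]


if __name__ == "__main__":
    rows = [
        {"usage_date": date(2024, 12, 1), "cost": 10.0},  # outside the window
        {"usage_date": date(2025, 1, 15), "cost": 12.5},  # inside the window
    ]
    print(filter_hot_rows(rows, today=date(2025, 1, 31), days_back=30))
```

Because the window start is computed from the run date, the filter moves forward on every refresh and older rows drop out without any manual change.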

This approach aligns with BMW Group’s broader data strategy, which includes streamlining data access using AWS Lake Formation for fine-grained access control.

Overall, we’ve fully automated schema changes, data updates, and testing through GitHub pull requests and dbt commands. This approach enables controlled deployment with robust version control and change management. Continuous testing and monitoring workflows uphold data accuracy, reliability, and quality across transformations, supporting efficient, collaborative model iteration.

Key benefits of the dbt-Athena architecture

To design and manage dbt models effectively, we use a multi-layered approach combined with cost and performance optimizations. In this section, we discuss how our approach has yielded significant benefits in five key areas.

SQL-based, developer-friendly environment

Our team already had strong SQL skills, so dbt’s SQL-centric approach was a natural fit. Instead of learning a new language or framework, developers could immediately start writing transformations using familiar SQL syntax with dbt. This familiarity aligns well with the SQL interface of Athena and, combined with dbt’s added functionality, has increased our team’s productivity.

Behind the scenes, dbt automatically handles synchronization between Amazon Simple Storage Service (Amazon S3), the AWS Glue Data Catalog, and our models. When we need to change a model’s materialization type (for example, from a view to a table), it’s as simple as updating a configuration parameter rather than rewriting code. This flexibility has reduced our development time dramatically, allowing us to focus on building better data models rather than managing infrastructure.

Agility in modeling and deployment

Documentation is crucial for any data platform’s success. We use dbt’s built-in documentation capabilities by publishing them to GitHub Pages, which creates an accessible, searchable repository of our data models. This documentation includes table schemas, relationships between models, and usage examples, enabling team members to understand how models interconnect and how to use them effectively.

We use dbt’s built-in testing capabilities to implement comprehensive data quality checks. These include schema tests that verify column uniqueness, referential integrity, and null constraints, as well as custom SQL tests that validate business logic and data consistency. The testing framework runs automatically on every pull request, validating data transformations at each step of our pipeline. Additionally, dbt’s dependency graph provides a visual representation of how our models interconnect, helping us understand the upstream and downstream impacts of any changes before we implement them. When stakeholders need to modify models, they can submit changes through pull requests, which, after they’re approved and merged, automatically trigger the necessary data transformations through our CI/CD pipeline. This streamlined process enabled us to create new data products within days rather than weeks and reduced ongoing maintenance work by catching issues early in the development cycle.

Athena workgroup separation

We use Athena workgroups to isolate different query patterns based on their execution triggers and purposes. Each workgroup has its own configuration and metric reporting, allowing us to monitor and optimize separately. The dbt workgroup handles our scheduled nightly transformations and on-demand updates triggered by pull requests through our Source, Prepared, and Semantic stages. The dbt-test workgroup executes automated data quality checks during pull request validation and nightly builds. The QuickSight workgroup manages SPICE data ingestion queries, and the Ad-hoc workgroup supports interactive data exploration by our team.

Each workgroup can be configured with specific data usage quotas, enabling teams to implement granular governance policies. This separation provides several benefits: it enables clear cost allocation, provides isolated monitoring of query patterns across different use cases, and helps enforce data governance through custom workgroup settings. Amazon CloudWatch monitoring per workgroup helps us track usage patterns, identify query performance issues, and adjust configurations based on actual needs.
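
As a rough illustration of the workgroup separation, the following Python sketch builds request payloads for the Athena `CreateWorkGroup` API, one per workgroup named in this post. The per-query scan limits and the S3 output location are hypothetical values, not BMW Group's actual settings. `BytesScannedCutoffPerQuery` is Athena's per-query data usage control, and `PublishCloudWatchMetricsEnabled` turns on the per-workgroup CloudWatch metrics mentioned above.

```python
GIB = 1024**3


def workgroup_request(name, output_location, scan_limit_bytes):
    """Build keyword arguments for Athena's CreateWorkGroup API."""
    return {
        "Name": name,
        "Configuration": {
            "ResultConfiguration": {"OutputLocation": output_location},
            # Prevent clients from overriding the workgroup settings.
            "EnforceWorkGroupConfiguration": True,
            # Publish query metrics to CloudWatch for this workgroup.
            "PublishCloudWatchMetricsEnabled": True,
            # Cancel any single query that scans more than this many bytes.
            "BytesScannedCutoffPerQuery": scan_limit_bytes,
        },
    }


# Hypothetical limits and bucket name, for illustration only.
SCAN_LIMITS = {
    "dbt": 2048 * GIB,
    "dbt-test": 256 * GIB,
    "QuickSight": 1024 * GIB,
    "Ad-hoc": 100 * GIB,
}
REQUESTS = [
    workgroup_request(name, f"s3://example-athena-results/{name}/", limit)
    for name, limit in SCAN_LIMITS.items()
]


def create_workgroups(requests):
    """Send the requests to Athena (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the sketch loads without boto3 installed

    athena = boto3.client("athena")
    for request in requests:
        athena.create_work_group(**request)
```

Keeping the payload construction separate from the API call makes the configuration easy to review and test without touching an AWS account.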

Using QuickSight SPICE

QuickSight SPICE (Super-fast, Parallel, In-memory Calculation Engine) provides powerful in-memory processing capabilities that we’ve optimized for our specific use cases. Rather than loading entire tables into SPICE, we create specialized views on top of our materialized semantic models. These views are carefully crafted to include only the necessary columns, relevant metadata joins, and appropriate time filtering to have only recent data available in dashboards.

We’ve implemented a hybrid refresh strategy for these SPICE datasets: daily incremental updates keep the data fresh, and weekly full refreshes maintain data consistency. This approach strikes a balance between data freshness and processing efficiency. The result is responsive dashboards that maintain high performance while keeping processing costs under control.
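
One simple way to express such a hybrid schedule is to pick the ingestion type from the run date. The sketch below assumes the weekly full refresh happens on Sundays, which is an illustrative choice and not stated in this post. The returned strings match the `IngestionType` values accepted by the QuickSight `CreateIngestion` API.

```python
from datetime import date


def refresh_type(run_date, full_refresh_weekday=6):
    """Choose the SPICE ingestion type for a given run date.

    `full_refresh_weekday` follows date.weekday(): Monday is 0, Sunday is 6.
    Defaulting to Sunday is an assumption made for this example.
    """
    if not 0 <= full_refresh_weekday <= 6:
        raise ValueError("full_refresh_weekday must be between 0 and 6")
    if run_date.weekday() == full_refresh_weekday:
        return "FULL_REFRESH"
    return "INCREMENTAL_REFRESH"


if __name__ == "__main__":
    for day in (date(2025, 1, 5), date(2025, 1, 6)):
        print(day.isoformat(), refresh_type(day))
```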

Scalability and cost-efficiency

The serverless architecture of Athena eliminates manual infrastructure management, automatically scaling based on query demand. Because costs are based solely on the amount of data scanned by queries, optimizing queries to scan as little data as possible directly reduces our costs. We use the distributed query execution capabilities of Athena through our dbt model structure, enabling parallel processing across data partitions. By implementing effective partitioning strategies and using Parquet file format, we minimize the amount of data scanned while maximizing query performance.
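
Because the bill scales with bytes scanned, the effect of scanning less data is easy to estimate. The following sketch is a back-of-the-envelope calculation under explicit assumptions: a price of 5 USD per TB scanned (the commonly published Athena rate, which varies by Region and should be checked against current pricing), a decimal terabyte, and a hypothetical 50 GB incremental scan compared with a 1 TB full scan.

```python
def athena_scan_cost_usd(bytes_scanned, usd_per_tb=5.0, bytes_per_tb=10**12):
    """Estimate the cost of a query from the bytes it scanned.

    The default price and the decimal TB are assumptions; check the
    current Athena pricing for your Region.
    """
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    return bytes_scanned * usd_per_tb / bytes_per_tb


def yearly_savings_usd(full_scan_bytes, incremental_scan_bytes, runs_per_year=365):
    """Estimate yearly savings from replacing a full scan with an incremental one."""
    per_run = athena_scan_cost_usd(full_scan_bytes) - athena_scan_cost_usd(
        incremental_scan_bytes
    )
    return per_run * runs_per_year


if __name__ == "__main__":
    full = 10**12             # 1 TB full scan
    incremental = 50 * 10**9  # hypothetical 50 GB incremental scan
    print(athena_scan_cost_usd(full))
    print(athena_scan_cost_usd(incremental))
    print(yearly_savings_usd(full, incremental))
```

The same arithmetic applies per workgroup, so the CloudWatch bytes-scanned metrics of each workgroup can be turned into an approximate cost figure.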

Our architecture offers flexibility in how we materialize data through views, full tables, and incremental tables. With dbt’s incremental models and partitioning strategy, we process only new or modified data instead of entire datasets. This approach has proven highly effective—we’ve observed significant reductions in data processing volume as well as data scanning, particularly in our QuickSight workgroup.

The effectiveness of these optimizations implemented at the end of 2023 is visible in the following diagram, showing costs by Athena workgroups.

The workgroups are illustrated as follows:

  • Green (QuickSight): Shows reduced data scanning post-optimization.
  • Light blue (Ad-hoc): Varies based on analysis needs.
  • Dark blue (dbt): Maintains consistent processing patterns.
  • Orange (dbt-test): Shows regular, efficient test execution.

The increased dbt workload costs directly correlate with decreased QuickSight costs, reflecting our architectural shift from using complex views in QuickSight workgroups (which previously masked query complexity but led to repeated computations) to using dbt for materializing these transformations. Although this increased the dbt workload, the overall cost-efficiency improved significantly because materialized tables reduced redundant computations in QuickSight. This demonstrates how our optimization strategies successfully manage growing data volumes while achieving net cost reduction through efficient data materialization patterns.

Conclusion

Our data architecture uses dbt and Athena to provide a scalable, cost-efficient, and flexible framework for building and managing data transformation pipelines. Athena’s ability to query data directly in Amazon S3 alleviates the need to move or copy data into a separate data warehouse, and its serverless model and dbt’s incremental processing minimize both operational overhead and processing costs. Given our team’s strong SQL expertise, expressing these transformations in SQL through dbt and Athena was a natural choice, enabling rapid model development and deployment. With dbt’s automatic documentation and lineage, troubleshooting and identifying data issues is simplified, and the system’s modularity allows for quick adjustments to meet evolving business needs.

Starting with this architecture is quick and straightforward: all that is needed is the dbt-core and dbt-athena libraries, and Athena itself requires no setup, because it’s a fully serverless service with seamless integration with Amazon S3. This architecture is ideal for teams looking to rapidly prototype, test, and deploy data models, optimizing resource usage, accelerating deployment, and providing high-quality, accurate data processing.

For those interested in a managed solution from dbt, see From data lakes to insights: dbt adapter for Amazon Athena now supported in dbt Cloud.


About the Authors

Philipp Karg is a Lead FinOps Engineer at BMW Group and has a strong background in data engineering, AI, and FinOps. He focuses on driving cloud efficiency initiatives and fostering a cost-aware culture within the company to leverage the cloud sustainably.

Selman Ay is a Data Architect specializing in end-to-end data solutions, architecture, and AI on AWS. Outside of work, he enjoys playing tennis and engaging in outdoor activities.

Cizer Pereira is a Senior DevOps Architect at AWS Professional Services. He works closely with AWS customers to accelerate their journey to the cloud. He has a deep passion for cloud-based and DevOps solutions, and in his free time, he also enjoys contributing to open source projects.
