Optimizing cold start performance of AWS Lambda using advanced priming strategies with SnapStart

Post Syndicated from Shan Kandaswamy original https://aws.amazon.com/blogs/compute/optimizing-cold-start-performance-of-aws-lambda-using-advanced-priming-strategies-with-snapstart/

Introduced at re:Invent 2022, SnapStart is a performance optimization that makes it easier to build highly responsive and scalable applications using AWS Lambda. The largest contributor to startup latency (often referred to as cold-start time) is the time spent initializing a function. This includes loading the function’s code and initializing dependencies. For latency-sensitive workloads such as APIs and real-time data processing applications, high startup latency can result in a suboptimal end user experience. Lambda SnapStart can reduce startup duration from several seconds to as low as sub-second, with minimal or no code changes. This post discusses ‘Priming’, a technique to further optimize startup times for AWS Lambda functions built using Java and Spring Boot.

Spring Boot applications typically experience high cold start latency during JVM and framework initialization, where significant time is spent loading classes and performing Just-In-Time (JIT) compilation of Java bytecode. As an example, this post uses a Spring Boot application that retrieves 10 records from a ‘UnicornEmployee’ table in an Amazon RDS for PostgreSQL database, where each employee record includes employee name, location, and hire date.

The sample application uses Amazon API Gateway, which triggers an AWS Lambda function that connects to the database through RDS Proxy to return the employee data. While this sample application uses dummy employee data for demonstration, the patterns and optimization techniques discussed in this post are applicable to real-world scenarios with similar data access patterns. Sample code for this implementation can be found in our GitHub repository at lambda-priming-crac-java-cdk.

Background: How SnapStart works

The post assumes familiarity with SnapStart and provides a short background. For additional details, refer to the SnapStart documentation.

To quickly recap, the INIT phase for a Lambda function involves downloading the function’s code, starting the runtime and any external dependencies, and running the function’s initialization code. For functions that don’t use SnapStart, this phase occurs each time your application scales up to create a new execution environment. When SnapStart is activated, the INIT phase happens when you publish a function version.

The following image shows a comparison of a Lambda request lifecycle with and without SnapStart.

Figure 1 – comparison of a Lambda request lifecycle with and without SnapStart

At the end of the INIT phase, Lambda executes your before-checkpoint runtime hooks. Lambda then snapshots the memory and disk state of the initialized execution environment, persists the encrypted snapshot, and caches it for low-latency access. When the function is subsequently invoked, new execution environments are resumed from the cached snapshot (during the RESTORE phase), speeding up function startup.

Figure 2 – new execution environments are resumed from the cached snapshot.

You can validate this speedup by comparing the RESTORE duration with the INIT duration recorded before SnapStart in your Lambda function’s Amazon CloudWatch Logs. As demonstrated in the following table, enabling SnapStart reduces the startup latency of our sample Spring Boot application by 4.3x, from 6.1s to 1.4s. The 6.1s cold start latency for ON_DEMAND is primarily due to the combination of (1) initializing the JVM and Spring Boot framework, (2) JIT compilation of lazily loaded application code during the initial invocation, and (3) the time needed to establish a database connection with RDS through Amazon RDS Proxy. By enabling SnapStart, Lambda initializes the JVM and Spring Boot prior to the function invocation, resulting in the significantly reduced latency of 1.4s.

Method                                  Cold Start Invocations  p50         p90         p99         p99.9
PrimingLogGroup-1_ON_DEMAND             128                     5047.94 ms  5386.78 ms  6158.80 ms  6195.84 ms
PrimingLogGroup-2_SnapStart_NO_PRIMING  111                     1177.87 ms  1288.73 ms  1419.94 ms  1425.63 ms

You can reduce cold starts even further for your latency-sensitive Spring Boot applications by using priming techniques on Lambda functions. Let’s explore how to implement priming techniques.

Priming explained

Priming is the process of preloading dependencies and initializing resources during the INIT phase, rather than during the INVOKE phase, to further optimize startup performance with SnapStart. This is required because Java frameworks that use dependency injection load classes into memory when these classes are explicitly invoked, which typically happens during Lambda’s INVOKE phase. You can proactively load classes using Java runtime hooks that are part of the open source CRaC (Coordinated Restore at Checkpoint) project. This post demonstrates how to use one of these hooks, beforeCheckpoint(), to prime SnapStart-enabled Java functions in two ways:

  1. Invoke Priming: This approach involves directly invoking application endpoints or methods in your pre-snapshotting hook so that they are JIT compiled during the INIT phase and included in the snapshot. This can include operations such as invoking API Gateway endpoints or fetching data from an S3 bucket or RDS database to proactively execute the code paths, ensuring that the underlying classes are included in the snapshot.
  2. Class Priming: This approach involves proactive initialization of classes during the INIT phase, ensuring that they are included in the function’s snapshot without risking unwanted changes to application state or data. This can be achieved with Java’s Class.forName() method, which loads, links, and initializes the specified class. Initialization refers to the JVM process of loading the class definition into memory, verifying the bytecode, preparing static fields with default values, and executing static initializers. This is different from instantiation, which creates objects of the class using constructors. To generate a list of the classes required for pre-loading, you can use the following VM option, which writes the list to a file called classes-loaded.txt:
    -Xlog:class+load=info:classes-loaded.txt

While invoke priming can offer better performance, it requires additional effort to ensure that the actions performed are idempotent and do not have unintended side effects, for instance processing financial transactions in a banking application. For this reason, invoke priming should only be used when code executed during priming is either idempotent or does not modify state. For scenarios where this is not possible, class priming provides a safer alternative by only initializing classes without executing their methods. Note that this assumes your application does not execute state-modifying code during class initialization.
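
The difference between initialization and instantiation can be checked with a small, self-contained sketch. It is not part of the sample project: ClassInitDemo, Expensive, and the two counters are hypothetical names used only for illustration. The sketch initializes a class through Class.forName() and counts how often the static initializer and the constructor run.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; none of these names come from the sample project.
final class ClassInitDemo {
    static final AtomicInteger STATIC_INIT_RUNS = new AtomicInteger();
    static final AtomicInteger CONSTRUCTOR_RUNS = new AtomicInteger();

    // Stand-in for an application class listed in classes-loaded.txt.
    static final class Expensive {
        static {
            // Runs once, when the JVM initializes the class.
            STATIC_INIT_RUNS.incrementAndGet();
        }

        Expensive() {
            // Runs only when an object is created, which class priming never does.
            CONSTRUCTOR_RUNS.incrementAndGet();
        }
    }

    // Loads, links, and initializes the nested class by its binary name.
    static void prime() {
        String name = ClassInitDemo.class.getName() + "$Expensive";
        try {
            Class.forName(name, true, ClassInitDemo.class.getClassLoader());
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Class to prime was not found: " + name, e);
        }
    }

    public static void main(String[] args) {
        prime();
        prime(); // The class is already initialized, so nothing runs again.
        System.out.println("static initializers run: " + STATIC_INIT_RUNS.get());
        System.out.println("constructors run: " + CONSTRUCTOR_RUNS.get());
    }
}
```

If a static initializer in your own code writes to a database or calls an external service, class priming would trigger that work at snapshot time, which is why the caveat about state-modifying class initialization matters.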

With this context, let’s look at how to implement Invoke and Class priming for a Spring Boot sample application.

Example priming implementation using CRaC runtime hooks before taking a Lambda snapshot

This post demonstrates both Invoke priming and Class priming using the sample Spring Boot application. The choice between the two approaches depends on the specific requirements and complexities of your application.

Step 1: Set up your Spring Boot application using the aws-serverless-springboot3-archetype as explained in our Quick Start Spring Boot3 guide, adding database connectivity code, or simply clone the sample project from the GitHub repository.

  1. Create a Spring Boot Application.
    // src/main/java/software/amazon/awscdk/examples/unicorn/UnicornApplication.java
    package software.amazon.awscdk.examples.unicorn;
    …
    @Import({ UnicornConfig.class })
    @SpringBootApplication
    public class UnicornApplication {
    
        private static final Logger log = LoggerFactory.getLogger(UnicornApplication.class);
    
        public static void main(String... arguments) {
            SpringApplication.run(UnicornApplication.class, arguments);
        }
    
    }

  2. Add all the necessary Maven dependencies for Spring Boot, AWS Lambda, and the database connection in your pom.xml file. The following dependency contains the classes required to use the CRaC runtime hooks.
    ...
            <dependency>
                <groupId>org.crac</groupId>
                <artifactId>crac</artifactId>
            </dependency>
    ...

  3. Configure Database Connection – Set up the database connection details in application.properties.
    spring.datasource.password=${SPRING_DATASOURCE_PASSWORD} 
    spring.datasource.url=${SPRING_DATASOURCE_URL} 
    spring.datasource.username=postgres 
    spring.datasource.hikari.maximumPoolSize=1 

Step 2: Implement Lambda Function Handler with CRaC runtime hooks and Invoke Priming Approach:

Create the Lambda function handler and integrate the CRaC runtime hooks so that beforeCheckpoint() runs before Lambda takes the snapshot and afterRestore() runs after the snapshot is restored.

  1. Implement the RequestHandler<APIGatewayV2HTTPEvent, APIGatewayV2HTTPResponse> interface in the Lambda function handler class.
  2. Implement the CRaC Resource interface with two methods, beforeCheckpoint() and afterRestore(), which define the actions performed before Lambda creates the snapshot and after the snapshot is restored.
  3. Add invoke priming by creating a request event in beforeCheckpoint() (an APIGatewayV2HTTPEvent in the sample, which could target a specific endpoint such as /unicorn) and passing it to handleRequest(event, null).

This ensures that the code paths associated with the specified endpoint are JIT compiled and optimized for faster execution during the first invocation after the snapshot is restored.

// src/main/java/software/amazon/awscdk/examples/unicorn/handler/InvokePriming.java
package software.amazon.awscdk.examples.unicorn.handler;

import org.crac.Core;
import org.crac.Resource;
...
public class InvokePriming implements RequestHandler<APIGatewayV2HTTPEvent, APIGatewayV2HTTPResponse>, Resource {
    ...

    @Override
    public APIGatewayV2HTTPResponse handleRequest(APIGatewayV2HTTPEvent event, Context context) {
        var awsLambdaInitializationType = System.getenv("AWS_LAMBDA_INITIALIZATION_TYPE");
        var unicorns = getUnicorns();
        var body = gson.toJson(unicorns);
        return APIGatewayV2HTTPResponse.builder().withStatusCode(200).withBody(body).build();
    }

    @Override
    public void beforeCheckpoint(org.crac.Context<? extends Resource> context)
            throws Exception {
        // Invoke priming: run the request path once so that it is exercised before the snapshot is taken.
        var event = APIGatewayV2HTTPEvent.builder().build();
        handleRequest(event, null);
    }
    ...
}
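
To make the order of the hooks easier to follow, here is a minimal sketch that runs without the CRaC dependency. It is an illustration under stated assumptions, not the sample project’s code: SnapshotHook is a hypothetical stand-in for org.crac.Resource, UnicornHandler serves stub data instead of querying the database, and simulateLifecycle() only mimics the order in which Lambda would call the hooks.

```java
import java.util.List;

// Sketch only: SnapshotHook is a hypothetical stand-in for org.crac.Resource
// so that the example compiles without the CRaC dependency.
final class InvokePrimingSketch {
    interface SnapshotHook {
        void beforeCheckpoint();
        void afterRestore();
    }

    static final class UnicornHandler implements SnapshotHook {
        // Stub data standing in for the database query in the sample application.
        private final List<String> unicorns = List.of("Alice", "Bob");
        private int primingInvocations = 0;
        private boolean restored = false;

        // Read-only request path: it returns data and changes no external state.
        String handleRequest(String path) {
            return "/unicorn".equals(path) ? String.join(",", unicorns) : "";
        }

        @Override
        public void beforeCheckpoint() {
            // Exercise the same code path that a real request would take.
            handleRequest("/unicorn");
            primingInvocations++;
        }

        @Override
        public void afterRestore() {
            // Place for per-environment work, such as re-establishing connections.
            restored = true;
        }

        int primingInvocations() {
            return primingInvocations;
        }

        boolean isRestored() {
            return restored;
        }
    }

    // Mimics the hook order: once before the snapshot, then once after a restore.
    static UnicornHandler simulateLifecycle() {
        UnicornHandler handler = new UnicornHandler();
        handler.beforeCheckpoint();
        handler.afterRestore();
        return handler;
    }

    public static void main(String[] args) {
        UnicornHandler handler = simulateLifecycle();
        System.out.println("primed " + handler.primingInvocations() + " time(s), restored=" + handler.isRestored());
    }
}
```

The key design point is that the primed path only reads data. A handler that writes during handleRequest() would need a guard or stub input before it could be primed this way.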

Step 3: Implement Class priming Approach:

The class priming approach focuses on pre-loading required classes to achieve optimal performance. To implement class priming, generate the list of classes that are loaded during the application startup and function execution by running the application locally using the following JVM argument: -Xlog:class+load=info:classes-loaded.txt

  1. Ensure that your application classes included in the generated classes-loaded.txt file are not mutating state during static initialization.
    Note: the generated classes-loaded.txt contains class entries in the following format:

    [0.068s][info][class,load] software.amazon.awscdk.examples.unicorn.handler.ClassPriming source: file:/var/task/

  2. Extract only the fully qualified class names from each line and remove the additional logging information. For example:
    software.amazon.awscdk.examples.unicorn.handler.ClassPriming

  3. Use the ClassLoaderUtil.loadClassesFromFile() utility method to extract the generated class entries.
    // src/main/java/software/amazon/awscdk/examples/unicorn/service/ClassLoaderUtil.java
    package software.amazon.awscdk.examples.unicorn;
    ...
    public class ClassLoaderUtil {
    ...
        public static void loadClassesFromFile() {
            log.info("loadClassesFromFile->started");
            Path path = Paths.get("classes-loaded.txt");
    
            try (BufferedReader bufferedReader = Files.newBufferedReader(path)) {
                Stream<String> lines = bufferedReader.lines();
                lines.forEach(line -> {
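                    // Each entry looks like: [0.068s][info][class,load] <class name> source: <location>
                    var prefix = "[class,load] ";
                    var index1 = line.indexOf(prefix);
                    var index2 = line.indexOf(" source: ");
    
                    if (index1 < 0 || index2 < 0) {
                        return; // Not a class-load entry, so skip it.
                    }
    
                    var className = line.substring(index1 + prefix.length(), index2);
                    try {
                        // Load, link, and initialize the class so that it is captured in the snapshot.
                        Class.forName(className, true,
                                ClassPriming.class.getClassLoader());
                    } catch (Throwable ignored) {
                        // Best effort: classes that cannot be loaded in this environment are skipped.
                    }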
                });
    
                log.info("loadClassesFromFile->finished");
            } catch (IOException exception) {
                log.error("Error on newBufferedReader", exception);
            }
        }
    ...
    }

  4. In the beforeCheckpoint() method, read the file (such as /classes-loaded.txt) that lists the classes loaded during the application’s execution.
  5. Use the Class.forName() method to load and initialize the class, ensuring that it is ready during the snapshot.
    Note: by systematically pre-loading these classes, the Class priming approach simplifies the optimization process and reduces the complexities associated with Invoke priming.

    //src/main/java/software/amazon/awscdk/examples/unicorn/handler/ClassPriming.java
    package software.amazon.awscdk.examples.unicorn.handler;
    
    ...
    import org.crac.Core;
    import org.crac.Resource;
    
    public class ClassPriming implements RequestHandler<APIGatewayV2HTTPEvent, APIGatewayV2HTTPResponse>, Resource {
    
    ...
            ConfigurableApplicationContext configurableApplicationContext =
    				SpringApplication.run(UnicornApplication.class);
    
            this.unicornService = configurableApplicationContext.getBean(UnicornService.class);
            this.gson = configurableApplicationContext.getBean(Gson.class);
    
            Core.getGlobalContext().register(this);
        }
    
        @Override
        public APIGatewayV2HTTPResponse handleRequest(APIGatewayV2HTTPEvent event, Context context) {
            var unicorns = getUnicorns();
            var body = gson.toJson(unicorns);
    
            return APIGatewayV2HTTPResponse.builder().withStatusCode(200).withBody(body).build();
        }
    
        @Override
        public void beforeCheckpoint(org.crac.Context<? extends Resource> context)
                throws Exception {
    
            ClassLoaderUtil.loadClassesFromFile();
    
        }
    ...
    }
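
The extraction of fully qualified class names described in this step can also be done ahead of time, so that the file shipped with the function contains class names only. The following is a minimal sketch of that idea, assuming log lines in the format shown above. ClassLoadLogParser and its methods are hypothetical names and are not part of the sample project.

```java
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical helper that turns -Xlog:class+load output into a list of class names.
final class ClassLoadLogParser {
    private static final String PREFIX = "[class,load] ";
    private static final String SUFFIX = " source: ";

    // Returns the class name between the "[class,load] " tag and " source: ", if both are present.
    static Optional<String> extractClassName(String line) {
        int start = line.indexOf(PREFIX);
        if (start < 0) {
            return Optional.empty();
        }
        int nameStart = start + PREFIX.length();
        int end = line.indexOf(SUFFIX, nameStart);
        if (end < 0) {
            return Optional.empty();
        }
        String name = line.substring(nameStart, end).trim();
        return name.isEmpty() ? Optional.empty() : Optional.of(name);
    }

    // Keeps the first occurrence of each class name, in log order, and drops unrelated lines.
    static List<String> extractAll(List<String> lines) {
        return lines.stream()
                .map(ClassLoadLogParser::extractClassName)
                .flatMap(Optional::stream)
                .distinct()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> log = List.of(
                "[0.068s][info][class,load] software.amazon.awscdk.examples.unicorn.handler.ClassPriming source: file:/var/task/",
                "a line that is not a class-load entry");
        extractAll(log).forEach(System.out::println);
    }
}
```

Pre-processing the log this way keeps the parsing work out of the beforeCheckpoint() hook, at the cost of one extra build step.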

Step 4: AWS CDK Infrastructure Setup

Before proceeding, review the prerequisites in the project README file.

The CDK stack deploys a serverless application and required infrastructure for testing different Lambda optimization strategies. It creates a VPC with private subnets, an RDS for PostgreSQL instance with a database proxy, and five Lambda functions: four that implement the different optimization approaches (ON_DEMAND without SnapStart, SnapStart without priming, SnapStart with invoke priming, and SnapStart with class priming) and one that loads the database with sample data. Each Lambda function is integrated with API Gateway for HTTP access, configured with Java 21 runtime on ARM64 architecture, and includes CloudWatch log groups for monitoring.

Follow these steps to deploy the infrastructure:

  1. Clone the sample repository:
    git clone https://github.com/aws-samples/lambda-priming-crac-java-cdk.git

  2. Deploy the CDK stack:
    cd lambda-priming-crac-java-cdk/infrastructure
    cdk synth
    cdk deploy --require-approval never --all 2>&1 | tee cdk_output.txt

  3. Save the API Gateway URLs:
    The deployment will output five URLs in this format:

    # ON_DEMAND endpoint (without SnapStart)
    LambdaPrimingCracJavaCdkStack.PrimingJavaRestApi1ONDEMANDEndpoint = https://[id].execute-api.us-east-1.amazonaws.com/prod/
    
    # SnapStart without priming endpoint
    LambdaPrimingCracJavaCdkStack.PrimingJavaRestApi2SnapStartNOPRIMINGEndpoint = https://[id].execute-api.us-east-1.amazonaws.com/prod/
    
    # SnapStart with invoke priming endpoint
    LambdaPrimingCracJavaCdkStack.PrimingJavaRestApi3SnapStartINVOKEPRIMINGEndpoint = https://[id].execute-api.us-east-1.amazonaws.com/prod/
    
    # SnapStart with class priming endpoint
    LambdaPrimingCracJavaCdkStack.PrimingJavaRestApi4SnapStartCLASSPRIMINGEndpoint = https://[id].execute-api.us-east-1.amazonaws.com/prod/
    
    # Database setup endpoint
    LambdaPrimingCracJavaCdkStack.PrimingJavaRestApi5DBLOADEREndpoint = https://[id].execute-api.us-east-1.amazonaws.com/prod/

  4. Extract the URLs into variables for testing. The commands rely on the order in which the URLs appear in cdk_output.txt:
    ONDEMAND_URL=$(grep -oE 'https://[a-zA-Z0-9.-]+\.execute-api\.[a-zA-Z0-9-]+\.amazonaws\.com/prod/' "cdk_output.txt" | head -n 1)
    NOPRIMING_URL=$(grep -oE 'https://[a-zA-Z0-9.-]+\.execute-api\.[a-zA-Z0-9-]+\.amazonaws\.com/prod/' "cdk_output.txt" | head -n 2 | tail -n 1)
    INVOKEPRIMING_URL=$(grep -oE 'https://[a-zA-Z0-9.-]+\.execute-api\.[a-zA-Z0-9-]+\.amazonaws\.com/prod/' "cdk_output.txt" | head -n 3 | tail -n 1)
    CLASSPRIMING_URL=$(grep -oE 'https://[a-zA-Z0-9.-]+\.execute-api\.[a-zA-Z0-9-]+\.amazonaws\.com/prod/' "cdk_output.txt" | head -n 4 | tail -n 1)
    SETUP_URL=$(grep -oE 'https://[a-zA-Z0-9.-]+\.execute-api\.[a-zA-Z0-9-]+\.amazonaws\.com/prod/' "cdk_output.txt" | head -n 5 | tail -n 1)

Step 5: Load the database and run performance testing using Artillery:

  1. Initialize the database with sample data.
    curl -X GET "$SETUP_URL"
    
    #Expected output: {"message":"Database schema initialized and data loaded"}

  2. Run performance tests for all endpoints.
    artillery run -t "$ONDEMAND_URL" -v '{ "url": "/unicorn" }' ./loadtest.yaml && \
    artillery run -t "$NOPRIMING_URL" -v '{ "url": "/unicorn" }' ./loadtest.yaml && \
    artillery run -t "$INVOKEPRIMING_URL" -v '{ "url": "/unicorn" }' ./loadtest.yaml && \
    artillery run -t "$CLASSPRIMING_URL" -v '{ "url": "/unicorn" }' ./loadtest.yaml

Step 6: Compare the load test results for On-demand (non-SnapStart), SnapStart, Invoke priming, and Class priming

The performance test results in the table below are sorted from slowest to fastest startup latency. The function without SnapStart performs the slowest due to JVM initialization, class loading and JIT compilation that occurs when the function is invoked. Notice a 4.3x improvement with SnapStart, which resumes invocations from a pre-initialized snapshot thereby avoiding JVM initialization and initial JIT compilation. SnapStart with class priming achieves a 1.4x speed-up over SnapStart, by proactively loading/initializing classes during INIT so that they are included in your function’s snapshot. Finally, SnapStart with invoke priming achieves the fastest performance – with a 781.68ms p99.9 cold-start latency that is 1.8x faster than SnapStart. This is because in addition to initializing classes, it also executes methods on the instances of those classes, resulting in even more components being included in the function’s snapshot.

Note that with invoke priming, any application code you execute must either be idempotent or modify stub data only. For instance, consider application code that triggers a financial transaction. If this code is executed during invoke priming with real user data, it may drive unintended effects with potentially serious consequences. Class priming avoids this, since application classes are initialized rather than being instantiated and their methods executed. This assumes that application code does not execute state modifying logic during class initialization. We recommend that you keep these considerations in mind when using invoke and/or class priming, and choose the appropriate approach for your use case.

Method                                      Cold Start Invocations  p50         p90         p99         p99.9
PrimingLogGroup-1_ON_DEMAND                 128                     5047.94 ms  5386.78 ms  6158.80 ms  6195.84 ms
PrimingLogGroup-2_SnapStart_NO_PRIMING      111                     1177.87 ms  1288.73 ms  1419.94 ms  1425.63 ms
PrimingLogGroup-4_SnapStart_CLASS_PRIMING   82                      857.81 ms   997.49 ms   1085.94 ms  1085.94 ms
PrimingLogGroup-3_SnapStart_INVOKE_PRIMING  66                      608.42 ms   688.88 ms   781.68 ms   781.68 ms
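
The speed-up factors quoted for this table are ratios of a baseline latency to an optimized latency at the same percentile. The following sketch shows the arithmetic using values from the table; SpeedupCalculator is a hypothetical helper and not part of the sample project. The ratio depends on which percentile you compare. For class priming, for instance, it works out to roughly 1.4x at p50 and roughly 1.3x at p99.9.

```java
import java.util.Locale;

// Hypothetical helper for comparing cold-start latencies from the results table.
final class SpeedupCalculator {
    // Returns how many times faster the optimized latency is than the baseline.
    static double speedup(double baselineMs, double optimizedMs) {
        if (baselineMs <= 0 || optimizedMs <= 0) {
            throw new IllegalArgumentException("Latencies must be positive");
        }
        return baselineMs / optimizedMs;
    }

    public static void main(String[] args) {
        // p99.9 values copied from the results table.
        double onDemand = 6195.84;
        double snapStart = 1425.63;
        double classPriming = 1085.94;
        double invokePriming = 781.68;

        System.out.printf(Locale.ROOT, "SnapStart vs ON_DEMAND: %.2fx%n", speedup(onDemand, snapStart));
        System.out.printf(Locale.ROOT, "Class priming vs SnapStart: %.2fx%n", speedup(snapStart, classPriming));
        System.out.printf(Locale.ROOT, "Invoke priming vs SnapStart: %.2fx%n", speedup(snapStart, invokePriming));
    }
}
```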

Conclusion

This post showed how AWS Lambda SnapStart, enhanced by CRaC runtime hooks, unlocks granular control over cold-start optimization for Java applications through two distinct priming strategies:

  • Invoke Priming: improves performance by executing critical endpoints during snapshot creation, ideal for idempotent workflows.
  • Class Priming: preloads classes without triggering business logic, mitigating side-effect risks.

To apply these optimization techniques in your applications, evaluate your use case and choose the priming approach that fits it. Track latency reductions and resource utilization of your application via Amazon CloudWatch metrics to quantify performance improvements. By integrating these strategies, developers can achieve sub-second cold starts while maintaining the scalability and cost-efficiency of serverless architecture using Java.

To dive deeper, check out the GitHub repository with the full example code, including setup instructions and reusable patterns you can adapt to your own projects. For more examples of Java applications running on AWS Lambda, visit serverlessland.com and explore a wide range of resources, tutorials, and real-world use cases.

Announcing second-generation AWS Outposts racks with breakthrough performance and scalability on-premises

Post Syndicated from Micah Walter original https://aws.amazon.com/blogs/aws/announcing-second-generation-aws-outposts-racks-with-breakthrough-performance-and-scalability-on-premises/

Today we’re announcing the general availability of second-generation AWS Outposts racks, which marks the latest innovation from AWS for edge computing. This new generation includes support for the latest x86-powered Amazon Elastic Compute Cloud (Amazon EC2) instances, new simplified network scaling and configuration, and accelerated networking instances designed specifically for ultra-low latency and high-throughput workloads. These enhancements deliver greater performance for a broad range of on-premises workloads, such as core trading systems of financial services and telecom 5G Core workloads.

Customers like athenahealth, FanDuel, First Abu Dhabi Bank, Mercado Libre, Liberty Latin America, Riot Games, Vector Limited, and Wiwynn are already using Outposts racks for workloads that need to stay on-premises. The second-generation Outposts rack can address low-latency, local data processing, or data residency needs for workloads such as game servers for multi-player online games, customer transaction data, medical records, industrial and manufacturing control systems, telecom Business Support Systems (BSS), and edge inference of a variety of machine learning (ML) models. Customers can now take advantage of the latest generation of processors and more advanced configurations of Outposts racks to support faster processing, higher memory capacity, and increased network bandwidth.

Latest generation EC2 instances

We’re excited to announce local support for the latest generation (7th generation) of x86-powered Amazon EC2 instances on AWS Outposts racks, starting with C7i compute-optimized instances, M7i general-purpose instances, and R7i memory-optimized instances. These new instances deliver twice the vCPU, memory, and network bandwidth while providing up to 40% better performance compared to C5, M5, and R5 instances on previous generation Outposts racks. They are powered by 4th Gen Intel Xeon Scalable processors and are ideal for a broad range of on-premises workloads requiring enhanced performance such as larger databases, more memory-intensive applications, advanced real-time big data analytics, high-performance video encoding and streaming, and CPU-based edge inference with more sophisticated ML models. Support for more latest generation EC2 instances, including GPU-enabled instances, is coming soon.

Simplified network scaling and configuration

We’ve completely reimagined networking in our latest Outposts generation, making it simpler and more scalable than ever. At the heart of this upgrade is our new Outposts network rack, which acts as a central hub for all your compute and storage traffic.

This new design brings three major benefits to the table. First, you can now scale your compute resources independently from your networking infrastructure, giving you more flexibility and cost efficiency as your workloads grow. Second, we’ve built in network resilience from the ground up, with the network rack automatically handling device failures to keep your systems running smoothly. Third, connecting to your on-premises environment and AWS Regions is now a breeze – you can configure everything from IP addresses to VLAN and BGP settings through straightforward APIs or our updated console interface.

Image of an AWS Outposts rack device

Specialized Amazon EC2 instances with accelerated networking

We’re introducing a new category of specialized Amazon EC2 instances on Outposts racks with accelerated networking. These instances are purpose built for the most latency-sensitive, compute-intensive, and throughput-intensive mission-critical workloads on-premises. To deliver the best possible performance, in addition to the Outpost logical network, these instances feature a secondary physical network with network accelerator cards connected to top-of-rack (TOR) switches.

First in this category are bmn-sf2e instances, designed for ultra-low latency with deterministic performance. The new instances run on Intel’s latest Sapphire Rapids processors (4th Gen Xeon Scalable), delivering 3.9 GHz sustained performance across all cores with generous memory allocation – 8GB of RAM for every CPU core. We’ve equipped bmn-sf2e instances with AMD Solarflare X2522 network cards that connect directly to top-of-rack switches.

For financial services customers, especially capital market firms, these instances offer deterministic networking through native Layer 2 (L2) multicast, precision time protocol (PTP), and equal cable lengths. This enables customers to meet regulatory requirements around fair trading and equal access while easily connecting to their existing trading infrastructure.

Instance Name        vCPUs  Memory (DDR5)  Network Bandwidth  NVMe SSD Storage  Accelerated Network Cards  Accelerated Bandwidth (Gbps)
bmn-sf2e.metal-16xl  64     512 GiB        25 Gbps            2 x 8 TB (16 TB)  2                          100
bmn-sf2e.metal-32xl  128    1024 GiB       50 Gbps            4 x 8 TB (32 TB)  4                          200

The second instance type, bmn-cx2, is optimized for high throughput and low latency. This instance features NVIDIA ConnectX-7 400G NICs physically connected to high-speed top-of-rack switches, delivering up to 800 Gbps bare metal network bandwidth operating at near line rate. With native Layer 2 (L2) multicast and hardware PTP support, this instance is ideal for high-throughput workloads like real-time market data distribution, risk analytics, and telecom 5G core network applications.

Instance Name        vCPUs  Memory (DDR5)  Network Bandwidth  NVMe SSD Storage  Accelerated Network Cards  Accelerated Bandwidth (Gbps)
bmn-cx2.metal-48xl   192    1536 GiB       50 Gbps            4 x 4 TB (16 TB)  2                          800

Bottom line, the new generation of Outposts racks delivers enhanced performance, scalability, and resiliency for a broad range of on-premises workloads, even for mission-critical workloads with the most stringent latency and throughput requirements. You can make your selection and initiate your order from the AWS Management Console. The new instances maintain consistency with regional deployments by supporting the same APIs, AWS Management Console, automation, governance policies, and security controls in the cloud and on-premises, improving developer productivity and IT efficiency.

Things to know

At launch, second-generation Outposts racks can be shipped to the US and Canada and be parented back to six AWS Regions including US East (N. Virginia and Ohio), US West (Oregon), EU West (London and France) and Asia Pacific (Singapore). Support for more countries and territories and AWS Regions is coming soon. At launch, second-generation Outposts racks locally support a subset of AWS services found in previous generation Outposts racks. Support for more EC2 instance types and more AWS services is coming soon.

To learn more, visit the AWS Outposts racks product page and user guide. You can also talk to an Outposts expert if you are ready to discuss your on-premises needs.

— Micah;



AWS Lambda standardizes billing for INIT Phase

Post Syndicated from Shubham Gupta original https://aws.amazon.com/blogs/compute/aws-lambda-standardizes-billing-for-init-phase/

Effective August 1, 2025, AWS will standardize billing for the initialization (INIT) phase across all AWS Lambda function configurations. This change specifically affects on-demand invocations of Lambda functions packaged as ZIP files that use managed runtimes, for which the INIT phase duration was previously unbilled. This update standardizes billing of the INIT phase across all runtime types, deployment packages, and invocation modes. Most users will see minimal impact on their overall Lambda bill from this change, as the INIT phase typically occurs for a very small fraction of function invocations. In this post, we discuss the Lambda function lifecycle and upcoming changes to INIT phase billing. You will learn what happens in the INIT phase and when it occurs, how to monitor your INIT phase duration, and strategies to optimize this phase and minimize costs.

Understanding the Lambda function execution lifecycle

The Lambda function execution lifecycle consists of three distinct phases: INIT, INVOKE, and SHUTDOWN. The INIT phase is triggered during a “cold start” when Lambda creates a new execution environment for a function in response to an invocation. This is followed by the INVOKE phase where the request is processed, and finally, the SHUTDOWN phase where the execution environment is terminated. For a summary of the execution lifecycle, watch AWS Lambda execution environment lifecycle.

During the INIT phase, Lambda performs a series of preparatory steps within a maximum duration of 10 seconds. The service retrieves the function code from an internal Amazon S3 bucket, or from Amazon Elastic Container Registry (Amazon ECR) for functions using container packaging. Then, it configures an environment with the specified memory, runtime, and other settings. When the execution environment is prepared, Lambda executes four key tasks in sequence:

  1. Initiate any extensions configured (Extension INIT)
  2. Bootstrap the runtime (Runtime INIT)
  3. Execute the function’s static code (Function INIT)
  4. Run any before-checkpoint runtime hooks (applicable only for Lambda SnapStart)

Understanding the billing changes

Lambda charges are based on the number of requests and the duration it takes for the code to run. The duration is calculated from the moment the function code begins running until it completes or terminates, rounded up to the nearest millisecond. Duration cost depends on the amount of memory that you allocate to your function.

Previously, the INIT phase duration wasn’t included in the Billed Duration for functions using managed runtimes with ZIP archive packaging, as evidenced in Amazon CloudWatch logs:

REPORT RequestId: xxxxx   Duration: 250.06 ms  Billed Duration: 251 ms  Memory Size: 1024 MB
Max Memory Used: 350 MB   Init Duration: 100.77 ms

However, functions configured with custom runtimes, Provisioned Concurrency (PC), or OCI packaging already included the INIT phase duration in their Billed Duration. Effective August 1, 2025, the INIT phase will be billed across all configuration types, and the INIT phase duration will also be included in the Billed Duration for on-demand invocations of functions using managed runtimes with ZIP archive packaging. After this change, the REPORT RequestId log line will show the following:

REPORT RequestId: xxxxx   Duration: 250.06 ms  Billed Duration: 351 ms  Memory Size: 1024 MB
Max Memory Used: 350 MB   Init Duration: 100.77 ms 

The additional INIT phase duration charges will follow the standard on-demand duration pricing that is specific to each AWS Region, which can be found on the Lambda pricing page. For Lambda@Edge functions, the INIT phase duration will be billed according to Lambda@Edge duration rates.
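To make the change concrete, the following minimal Python sketch reproduces the arithmetic behind the two sample REPORT lines. The helper name billed_duration_ms and its include_init flag are illustrative and not part of any AWS API. The sketch assumes that the billable time is rounded up to the next whole millisecond, as described earlier in this section.

```python
import math


def billed_duration_ms(duration_ms: float, init_ms: float, include_init: bool) -> int:
    """Return the billed duration in whole milliseconds (illustrative helper)."""
    billable = duration_ms + init_ms if include_init else duration_ms
    return math.ceil(billable)


# Values taken from the sample REPORT lines in this post.
before = billed_duration_ms(250.06, 100.77, include_init=False)
after = billed_duration_ms(250.06, 100.77, include_init=True)
print(before, after)  # 251 351
```

The difference between the two results (100 ms in this sample) is the INIT time that becomes billable for a cold start of this function.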

Finding the INIT phase duration and impact to Lambda billing

You can already monitor the time spent in the INIT phase of your function invocations using the “init_duration” CloudWatch metric. This metric is also reported as “Init Duration” in the “REPORT RequestId” log line within CloudWatch Logs. These tools offer valuable insights into the INIT time of Lambda functions, which will now be factored into billing calculations.

For a more comprehensive analysis, you can use the following CloudWatch Logs Insights query to generate a detailed report estimating the previously unbilled duration of the INIT phase. The query helps you understand the proportion of the unbilled INIT phase time relative to your overall Lambda usage, enabling more accurate cost projections following this billing change.

filter @type = "REPORT" and @billedDuration < (@duration + @initDuration) 
| stats sum((@memorySize/1000000/1024) * (@billedDuration/1000)) as BilledGBs, 
sum((@memorySize/1000000/1024) * ((ceil(@duration + @initDuration) - @billedDuration)/1000)) as UnbilledInitGBs, 
(UnbilledInitGBs/ (UnbilledInitGBs+BilledGBs)) as Ratio

The CloudWatch Logs Insights query provides three essential metrics:

  1. BilledGBs: Represents the total GB-s (gigabyte-seconds) currently being billed for the chosen log groups.
  2. UnbilledInitGBs: Shows the total GB-s consumed during INIT phase that was previously not included in billing.
  3. Ratio: Indicates the fraction (a value between 0 and 1) of total GB-s attributed to previously unbilled INIT phase duration.

Using these existing monitoring capabilities allows you to proactively assess and optimize your Lambda function INIT times, potentially minimizing the impact of the new billing structure on your overall costs.
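If you have exported REPORT lines rather than querying them in place, the same three metrics can be approximated offline. The following Python sketch is not a port of the query above: it is a simplified illustration that reads the memory size in MB as printed in the REPORT line, treats a missing Init Duration as zero, and includes every REPORT line in the billed total. The regular expression, the summarize function, and the second (warm) sample line are assumptions made for this example.

```python
import math
import re

# Matches the fields of a REPORT line; Init Duration is optional (warm starts omit it).
REPORT_PATTERN = re.compile(
    r"Duration: (?P<duration>[\d.]+) ms\s+"
    r"Billed Duration: (?P<billed>\d+) ms\s+"
    r"Memory Size: (?P<memory>\d+) MB\s+"
    r"Max Memory Used: \d+ MB"
    r"(?:\s+Init Duration: (?P<init>[\d.]+) ms)?"
)


def summarize(report_lines):
    """Aggregate billed and previously unbilled INIT GB-s over REPORT lines."""
    billed_gbs = 0.0
    unbilled_init_gbs = 0.0
    for line in report_lines:
        match = REPORT_PATTERN.search(line)
        if match is None:
            continue  # not a REPORT line in the expected shape
        memory_gb = int(match.group("memory")) / 1024
        billed_ms = int(match.group("billed"))
        duration_ms = float(match.group("duration"))
        init_ms = float(match.group("init") or 0.0)
        billed_gbs += memory_gb * billed_ms / 1000
        # Only count INIT time that is not already part of the billed duration.
        gap_ms = math.ceil(duration_ms + init_ms) - billed_ms
        if gap_ms > 0:
            unbilled_init_gbs += memory_gb * gap_ms / 1000
    total = billed_gbs + unbilled_init_gbs
    ratio = unbilled_init_gbs / total if total else 0.0
    return {"billed_gbs": billed_gbs, "unbilled_init_gbs": unbilled_init_gbs, "ratio": ratio}


lines = [
    # Cold start from the sample in this post, joined onto one line.
    "REPORT RequestId: xxxxx   Duration: 250.06 ms  Billed Duration: 251 ms  "
    "Memory Size: 1024 MB  Max Memory Used: 350 MB  Init Duration: 100.77 ms",
    # Hypothetical warm invocation without an Init Duration field.
    "REPORT RequestId: yyyyy   Duration: 120.00 ms  Billed Duration: 120 ms  "
    "Memory Size: 1024 MB  Max Memory Used: 350 MB",
]
summary = summarize(lines)
print(summary)
```

Lines whose Billed Duration already covers the INIT time contribute nothing to the unbilled total, so the sketch can be run on logs from before and after the change.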

Understanding and optimizing Lambda INIT phase

The Lambda INIT phase is triggered in two specific scenarios: during the creation of a new execution environment and when a function scales up to meet demand. This INIT code runs only during these “cold starts” and is bypassed during subsequent invocations that use existing warm environments. After the INIT phase, Lambda runs the function handler code to process the invocation.

Following the handler execution, Lambda freezes the execution environment. To improve resource management and performance, the Lambda service retains the execution environment for a non-deterministic period of time. During this time, if another request arrives for the same function, then the service may reuse the environment. This second request typically finishes faster, because the execution environment already exists and it isn’t necessary to download the code and run the INIT code. This is called a “warm start.”

Developers can use the INIT phase to create, initialize, and configure objects expected to be reused across multiple invocations during function INIT instead of doing it in the handler. Initializing the dependencies/shared objects upfront reduces the latency of subsequent invocations. For example:

  • Download additional libraries or dependencies
  • Establish client connections to other AWS services such as Amazon S3 or Amazon DynamoDB
  • Create database connections to be shared across invocations
  • Retrieve application parameters or secrets from AWS Systems Manager Parameter Store or AWS Secrets Manager

When developing Lambda functions, it’s important to strategically decide what code runs during the INIT phase as opposed to the handler phase, because it affects both performance and costs.
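The split between INIT code and handler code can be sketched in Python as follows. Code at module scope runs once per execution environment, and the handler reuses what it created. The _load_configuration function is a hypothetical stand-in for expensive setup such as creating SDK clients or fetching secrets, and the counter exists only to make the behavior visible.

```python
import time

_init_runs = 0  # counts how often the expensive setup executes


def _load_configuration() -> dict:
    """Stand-in for expensive setup such as creating SDK clients or fetching secrets."""
    global _init_runs
    _init_runs += 1
    return {"table_name": "example-table", "loaded_at": time.time()}


# Module scope: executed once per execution environment, during Function INIT.
CONFIG = _load_configuration()


def handler(event, context):
    """Handler scope: executed on every invocation and reuses CONFIG."""
    return {"table": CONFIG["table_name"], "init_runs": _init_runs}


# Simulate two invocations in the same (warm) execution environment.
first = handler({}, None)
second = handler({}, None)
print(first["init_runs"], second["init_runs"])  # 1 1
```

Moving the call to _load_configuration into the handler would repeat the setup on every invocation, which adds latency to warm starts as well as cold starts.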

Optimizing package/library size

The INIT phase includes creating an execution environment, downloading the function code and initializing it. Three main factors influence its performance:

  1. The size of the function package, in terms of imported libraries and dependencies, and Lambda layers.
  2. The amount of code and INIT work.
  3. The performance of libraries and other services in setting up connections and other resources.

Larger function packages increase code download times. You can decrease INIT phase duration by reducing package size, resulting in faster cold starts and lower INIT costs. How you import libraries can also significantly affect package size. For example, in Node.js functions, you should use specific path imports (for example import DynamoDB from "aws-sdk/clients/dynamodb") rather than wildcard imports (for example import * as AWS from "aws-sdk") to speed up the INIT phase. Tools such as esbuild can further optimize performance by minifying and bundling packages. For details, read Optimizing Node.js dependencies in AWS Lambda.

Optimizing INIT phase execution and cost efficiency

The frequency of INIT phase executions (or cold starts) directly impacts both performance and cost efficiency. According to an analysis of production Lambda workloads, INITs (cold starts) typically occur in under 1% of invocations, meaning code in the INIT phase may execute less than once per hundred invocations.

You can use the INIT phase to perform one-time operations that benefit subsequent invocations. Common optimization patterns include pre-calculating lookup tables or transforming static datasets. For example, you can download static data from Amazon S3 or DynamoDB during INIT, which makes it available to all subsequent function invocations without repeated downloads.

Lambda SnapStart

Lambda SnapStart provides an effective solution for reducing cold start latency and INIT phase costs. When it’s enabled, SnapStart creates a snapshot during the first function INIT and reuses it for subsequent cold starts, eliminating the need for repeated INIT phase executions. This approach is particularly valuable for functions with longer INIT times due to loading module dependencies/frameworks, initializing the runtime, or executing one-time INIT code. SnapStart is supported for Java, .NET, and Python runtimes. You can implement SnapStart through the Lambda console or AWS Command Line Interface (AWS CLI), making sure that your code adheres to the AWS serialization guidelines for snapshot restoration compatibility. Using SnapStart allows you to significantly improve function startup times and optimize costs across multiple popular programming languages.

Provisioned Concurrency

Provisioned Concurrency is a Lambda feature that pre-initializes execution environments before any invocations occur. This proactive approach effectively eliminates the performance impact of the INIT phase on individual function calls, because the INIT is completed in advance.

Although all functions using Provisioned Concurrency benefit from reduced startup times as compared to on-demand execution, the impact is particularly pronounced for certain runtime environments. For example, C# and Java functions, which typically experience slower INIT but faster execution times as compared to Node.js or Python, can achieve significant performance gains through this feature. Implementing Provisioned Concurrency allows you to effectively manage both consistent traffic patterns and expected usage spikes, thereby minimizing cold start latency across your serverless applications. This optimization strategy is particularly valuable for functions with complex INIT requirements or those serving latency-sensitive workloads. From a cost optimization perspective, Provisioned Concurrency is most suitable for workloads whose sustained utilization is above 60%, because this typically provides better cost efficiency compared to on-demand execution.

Conclusion

Effective August 1, 2025, AWS is standardizing the INIT phase billing for AWS Lambda. AWS provides multiple ways for you to optimize both the performance and costs of your Lambda functions. Whether you’re using SnapStart, implementing Provisioned Concurrency, or optimizing INIT code, we recommend working closely with AWS support teams to identify the most suitable optimization approach for your specific workload requirements.

For more support and guidance, consider participating in AWS Cost Optimization workshops or consulting the Lambda documentation.

Extend the Amazon Q Developer CLI with Model Context Protocol (MCP) for Richer Context

Post Syndicated from Brian Beach original https://aws.amazon.com/blogs/devops/extend-the-amazon-q-developer-cli-with-mcp/

Earlier today, Amazon Q Developer announced Model Context Protocol (MCP) support in the command line interface (CLI). Developers can connect external data sources to Amazon Q Developer CLI with MCP support for more context-aware responses. By integrating MCP tools and prompts into Q Developer CLI, you get access to an expansive list of pre-built integrations or any MCP Servers that support stdio. This extra context helps Q Developer write more accurate code, understand your data structures, generate appropriate unit tests, create database documentation, and execute precise queries, all without needing to develop custom integration code. By extending Q Developer with MCP tools and prompts, developers can execute development tasks faster, streamlining the developer experience. At AWS, we’re committed to supporting popular open source protocols for agents like Model Context Protocol (MCP) proposed by Anthropic. We’ll continue to support this effort by extending this functionality within the Amazon Q Developer IDE plugins in the coming weeks.

Introduction

I’m always on the lookout for tools and technologies that can streamline my workflow and unlock new capabilities. That’s why I was excited about the recent addition of Model Context Protocol (MCP) support in the Amazon Q Developer command line interface (CLI). MCP is an open protocol that standardizes how applications can seamlessly integrate with LLMs, providing a common way to share context, access data sources, and enable powerful AI-driven functionality. You can read more about MCP in this introduction.

Q Developer has had the ability to use tools for a while. I previously discussed the ability to run CLI commands and describe AWS resources. With the Q Developer CLI’s support for MCP tools and prompts, I now have the ability to add additional tools. For example, while I have had the ability to describe my AWS resources, I also need to describe database schemas, message formats, etc. to build an application. Let’s see how I can configure MCP to provide this additional context.

In this post, I will configure an MCP server to provide Q Developer with my database schema for a simple Learning Management System (LMS) that I am working on. While Q Developer is great at writing SQL, it does not know the schema of my database. The table structure and relationships are stored in the database and are not part of the source code of my project. Therefore, I am going to use an MCP server that can query the database schema. Specifically, I am using the official PostgreSQL reference implementation to connect to my Amazon Relational Database Service (RDS). Let’s get started.

Before Model Context Protocol

Prior to the introduction of MCP support, the Q Developer CLI provided a set of native tools, including the ability to execute bash commands, interact with files and the file system, and even make calls to AWS services. However, when it came to querying a database, the CLI was limited in its capabilities.

For example, prior to configuring the MCP server, I asked Q Developer to “Write a query that lists the students and the number of credits each student is taking.” In the following image you can see that Q Developer could only provide a generic SQL query, as it lacked the specific knowledge of the database schema for my LMS.

Screenshot of Amazon Q Developer CLI showing a response to a query request. The response includes explanatory text acknowledging the lack of schema information, followed by a generic SQL query written in green text. The query joins students, student_courses, and courses tables to calculate total credit hours per student, demonstrating Q's limited ability without MCP configuration.

While this is a great start, I know that Q Developer could do so much more if it knew the database schema.

Configuring Model Context Protocol

The introduction of MCP support in the Q Developer CLI allows me to easily configure MCP servers. I configure one or more MCP servers in a file called mcp.json. I can store the configuration in my home directory (e.g. ~/.aws/amazonq/mcp.json) and it is applied to all projects on my machine. Alternatively, I can store the configuration in the workspace root (e.g. .amazonq/mcp.json) so it is shared among project members. Here is an example of the configuration for the PostgreSQL MCP server.

{
  "mcpServers": {
    "postgres": {
      "command": "npx",
      "args": [
        "-y",
        "@modelcontextprotocol/server-postgres",
        "postgresql://USERNAME:PASSWORD@HOST:5432/DBNAME"
      ]
    }
  }
}
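If you prefer not to type the connection string by hand, a small script can generate the same structure. The following Python sketch builds the configuration shown above and percent-encodes the credentials so that characters such as "@" or "/" in a password do not break the URL. The function names and the environment variable names (PGUSER, PGPASSWORD, PGHOST, PGDATABASE) are assumptions for this example and are not something the Q Developer CLI reads itself.

```python
import json
import os
from pathlib import Path
from urllib.parse import quote


def build_mcp_config(username, password, host, dbname, port=5432):
    """Return the mcp.json structure for the PostgreSQL MCP server."""
    # Percent-encode credentials so reserved characters do not break the URL.
    url = (
        f"postgresql://{quote(username, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{dbname}"
    )
    return {
        "mcpServers": {
            "postgres": {
                "command": "npx",
                "args": ["-y", "@modelcontextprotocol/server-postgres", url],
            }
        }
    }


def write_mcp_config(config, workspace_root="."):
    """Write the configuration to <workspace_root>/.amazonq/mcp.json."""
    path = Path(workspace_root) / ".amazonq" / "mcp.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2) + "\n")
    return path


config = build_mcp_config(
    os.environ.get("PGUSER", "USERNAME"),
    os.environ.get("PGPASSWORD", "PASSWORD"),
    os.environ.get("PGHOST", "HOST"),
    os.environ.get("PGDATABASE", "DBNAME"),
)
print(json.dumps(config, indent=2))
```

Calling write_mcp_config(config) places the file in the workspace location described above. Because the connection string contains a password, a file generated with real credentials is better kept out of version control.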

With the MCP server configured, let’s see how Amazon Q Developer enhances my experience.

After Model Context Protocol

First, I start a new Q Developer session and immediately see the benefits. In addition to the existing tools, Q Developer now has access to PostgreSQL as shown in the following image. This means I can easily explore the schema of my database, understand the structure of the tables, and even execute complex SQL queries, all without having to write any additional integration code.

Screenshot of Amazon Q Developer CLI displaying a list of available tools. The tools are categorized into file system tools, bash execution, AWS tools, PostgreSQL database tools, and issue reporting. The PostgreSQL category is highlighted, showing the integration of MCP for database access.

Let’s test the MCP server by asking Q Developer to “List the database tables.” As you can see in the following example, Q Developer now understands that I am asking about the PostgreSQL database, and uses the MCP server to list my three tables: students, courses, and enrollment.

Screenshot of Amazon Q Developer CLI showing a database table listing request and response. The response shows a tool request using list_objects command with JSON parameters, followed by execution status and a list of three tables in the public schema: courses, enrollment, and students.

Let’s go back to the example from earlier in this post. Now, when I ask Q Developer to “Write a query that lists the students and the number of credits each student is taking,” it no longer responds with a generic query. Instead, Q Developer first describes the relevant tables in my database, generates the appropriate SQL query, and then executes it, providing me with the desired results.

Screenshot of Amazon Q Developer CLI showing a complete SQL query workflow. The image displays a precise SQL query in green syntax highlighting, followed by a results table showing student credit information, and an explanation of how the query works through five numbered steps. This demonstrates Q's ability to generate, execute, and explain database queries with schema knowledge.
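For readers without access to the screenshot, the following Python sketch shows what a schema-aware query of this kind could look like. It is not the query Q Developer generated. It uses an in-memory SQLite database as a stand-in for PostgreSQL, and the column names and sample rows are hypothetical, because the post only names the three tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical columns; only the table names come from the post.
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE courses (id INTEGER PRIMARY KEY, title TEXT NOT NULL, credits INTEGER NOT NULL);
    CREATE TABLE enrollment (
        student_id INTEGER REFERENCES students(id),
        course_id INTEGER REFERENCES courses(id)
    );
    INSERT INTO students VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
    INSERT INTO courses VALUES (1, 'Databases', 4), (2, 'Algorithms', 3);
    INSERT INTO enrollment VALUES (1, 1), (1, 2), (2, 2);
    """
)

# LEFT JOINs keep students without enrollments; COALESCE reports them with 0 credits.
QUERY = """
SELECT s.name, COALESCE(SUM(c.credits), 0) AS total_credits
FROM students s
LEFT JOIN enrollment e ON e.student_id = s.id
LEFT JOIN courses c ON c.id = e.course_id
GROUP BY s.id, s.name
ORDER BY s.name
"""
rows = conn.execute(QUERY).fetchall()
print(rows)  # [('Ana', 7), ('Ben', 3), ('Cy', 0)]
```

The point of the example is that the join path through the enrollment table is only knowable from the schema, which is the context the MCP server supplies.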

Of course, Q Developer can do a lot more than just write queries. Q Developer can use the MCP server to write Java code that accesses the database, create unit tests for the data layer, document the database, and much more. For example, I asked Q Developer to “Create an entity-relationship (ER) diagram using Mermaid syntax.” Q Developer was able to generate a visual representation of the database schema, helping me better understand the relationships between the various entities.

Entity-Relationship (ER) diagram generated by Amazon Q Developer. The diagram shows three tables: STUDENTS, COURSES, and ENROLLMENT. Each table is represented by a box containing column names and data types. The ENROLLMENT table links STUDENTS and COURSES with 'enrolls in' and 'has enrolled' relationships. Primary and foreign keys are indicated. This visualizes the database schema structure for the Learning Management System.

The integration of MCP into the Q Developer CLI has significantly streamlined my workflow by allowing me to add additional tools as needed.

Conclusion

The addition of MCP support in the Amazon Q Developer CLI provides a standardized way to share context and access data sources. In this post, I’ve demonstrated how I can use the Q Developer CLI’s MCP integration to quickly set up a connection to a PostgreSQL database, explore the schema, and generate complex SQL queries without having to write any additional integration code. Moving forward, I’m excited to see how you can leverage MCP to further enhance your development workflow. I encourage you to explore the MCP capabilities and the AWS MCP Servers repository on GitHub.

How Flutter UKI optimizes data pipelines with AWS Managed Workflows for Apache Airflow

Post Syndicated from Monica Cujerean, Ionut Hedesiu original https://aws.amazon.com/blogs/big-data/how-flutter-uki-optimizes-data-pipelines-with-aws-managed-workflows-for-apache-airflow/

This post is co-written with Monica Cujerean and Ionut Hedesiu from Flutter UKI.

In this post, we share how Flutter UKI transitioned from a monolithic Amazon Elastic Compute Cloud (Amazon EC2)-based Airflow setup to a scalable and optimized Amazon Managed Workflows for Apache Airflow (Amazon MWAA) architecture using features like Kubernetes Pod Operator, continuous integration and delivery (CI/CD) integration, and performance optimization techniques.

About Flutter UKI

As a division of Flutter Entertainment, Flutter UKI stands at the forefront of the sports betting and gaming industry. Flutter UKI offers a diverse portfolio of entertainment options, encompassing sports wagering, casino games, bingo, and poker experiences. Flutter UKI’s digital presence is robust, operating through an array of renowned online brands. These include the iconic Paddy Power, Sky Betting and Gaming, and Tombola. While Flutter UKI has established a strong online foothold, it maintains a significant physical presence with a network of 576 Paddy Power betting shops strategically located across the United Kingdom and Ireland.

The Data team at Flutter UKI is integral to the company’s mission of using data to drive business success and innovation. Specializing in data, their teams are dedicated to ensuring the seamless integration, management, and accessibility of data across multiple facets of the organization. By developing robust data pipelines and maintaining high data quality standards, Flutter UKI empowers stakeholders with reliable insights, optimizes operational efficiencies, and enhances the user experience. Its commitment to data excellence underpins its efforts to remain at the forefront of the online gaming and entertainment industry, delivering value and strategic advantage to the business.

The journey from self-managing Airflow on Amazon EC2 to operating Airflow workloads at scale using Amazon MWAA

Flutter UKI’s data orchestration story began in 2017 with a modest Apache Airflow deployment on EC2 instances. As the company’s digital footprint expanded, so did their data pipeline requirements, leading to an increasingly complex monolithic cluster that demanded constant attention and resource scaling. The operational overhead of managing these EC2 instances became a significant challenge for their engineering teams. In 2022, Flutter UKI reached a crossroads. They needed to choose between re-architecting their service on Amazon Elastic Kubernetes Service (Amazon EKS) or embracing Amazon Managed Workflows for Apache Airflow (MWAA).

Flutter UKI was looking to transform their data orchestration service from a resource-intensive, self-managed system to a more efficient, managed service that would allow them to focus on their core business objectives rather than infrastructure management. Through extensive proof-of-concept (POC) testing and close collaboration with AWS Enterprise Support, Flutter UKI gained confidence in the ability of Amazon MWAA to handle their sophisticated workloads at scale. Their choice of MWAA over a self-managed solution on Amazon EKS reflected Flutter UKI’s strategic focus on using managed services to reduce operational complexity and accelerate innovation.

The migration to Amazon MWAA followed a methodical approach that included extensive testing across multiple POCs. During the POCs, the engineering team found MWAA easy to use, which shortened their learning curve. Learning from each POC, they iterated on the final architecture by making data-driven decisions. Starting with a small subset of directed acyclic graphs (DAGs), the Flutter UKI team expanded their deployment over time, gradually moving hundreds and eventually thousands of workflows to the managed service. This careful, phased transition allowed them to validate the performance and reliability of MWAA while minimizing operational risk.

High-level architecture design

During the service re-architecture, the data team strategically managed over 3,500 dynamically generated DAGs by implementing a sophisticated distribution approach across multiple Amazon MWAA environments to create workload-isolated environments. Another reason for having multiple environments was to make sure that no single MWAA environment gets overloaded by too many DAGs. By placing DAG files across diverse Amazon Simple Storage Service (Amazon S3) locations and configuring unique DAG_FOLDER paths for each environment, the data team created an intelligent load balancing mechanism that allocates workflows based on complex criteria including environment type, task volume, and environment-specific DAG affinity. A round-robin distribution strategy was designed to minimize single environment load, ensuring scalable infrastructure with zero performance degradation. This approach allowed the team to optimize workflow orchestration, maintaining high performance while efficiently managing an extensive collection of dynamically generated DAGs across multiple MWAA environments. To provide more compute to individual tasks and to keep the MWAA environments efficient, Flutter UKI delegated the DAG execution to an external compute environment using Amazon Elastic Kubernetes Service (Amazon EKS). The resulting high-level architecture is shown in the following figure.

  1. Kubernetes Pod Operator (KPO) for tasks: Flutter UKI transitioned from using custom operators and many native Airflow operators to exclusively utilizing the Kubernetes Pod Operator (KPO). This decision simplified their architecture by eliminating unnecessary complexity, reducing maintenance overhead, and mitigating potential bugs. Additionally, this approach enabled them to allocate compute resources on a per-task basis, optimizing overall service performance. It also enabled the use of different container images for different tasks, thereby avoiding library dependency conflicts.
  2. Kubernetes Pod Operator wrapper (KPOw): Instead of using KPO directly, they developed a wrapper (KPOw) around it. This wrapper abstracts the underlying complexity and minimizes the impact of signature changes in Airflow, Amazon MWAA, Amazon EKS, or operator versions. By centralizing these changes, they only need to update the wrapper rather than thousands of individual DAGs. The wrapper also simplifies DAGs by hiding repetitive parameters, such as node affinity, pod resources, and EKS cluster configurations. Furthermore, it enforces company-specific naming conventions and allows for parameter validation at task execution time rather than during DagBag refresh. They also introduced profiles and image files, where profile files contain necessary KPO parameters, and the corresponding image files link to the repository for the task’s container image. This setup ensures consistency across tasks using the same profile and facilitates simultaneous updates across tasks.
  3. Monthly image updates in Kubernetes: Enforcing a policy of monthly image updates made sure that their code remained current, preventing security vulnerabilities and avoiding extensive code changes due to deprecated libraries.
  4. Continuous Airflow updates: Flutter UKI maintains a cutting-edge infrastructure by implementing new Airflow versions shortly after release, while following a carefully orchestrated deployment strategy. Their approach uses standard Amazon MWAA configurations and employs a systematic testing protocol. New versions are first deployed to development and test environments for thorough validation before reaching production systems. This methodical progression significantly reduces the risk of disruptions to business-critical workflows.
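The wrapper idea from item 2 can be sketched without importing Airflow. The following Python example is a simplified illustration and not Flutter UKI's implementation: it merges a named profile with per-task overrides and enforces a naming convention, returning the keyword arguments that a wrapper could then pass on to the Kubernetes Pod Operator. The profile contents, the naming rule, and the argument keys are all hypothetical and are not guaranteed to match the operator's signature. For simplicity the sketch validates when the arguments are built, whereas the post describes validation at task execution time.

```python
import copy
import re

# Hypothetical profiles: shared pod settings keyed by profile name.
PROFILES = {
    "small": {
        "namespace": "airflow-tasks",
        "image": "example.registry/etl-base:2025-01",
        "resources": {"cpu": "500m", "memory": "1Gi"},
    },
    "large": {
        "namespace": "airflow-tasks",
        "image": "example.registry/etl-base:2025-01",
        "resources": {"cpu": "4", "memory": "16Gi"},
    },
}

# Assumed naming convention: lowercase snake_case task IDs.
TASK_ID_PATTERN = re.compile(r"^[a-z][a-z0-9_]*$")


def build_pod_task_kwargs(task_id: str, profile: str, **overrides) -> dict:
    """Combine profile defaults with task-specific overrides."""
    if not TASK_ID_PATTERN.match(task_id):
        raise ValueError(f"task_id {task_id!r} violates the naming convention")
    if profile not in PROFILES:
        raise KeyError(f"unknown profile {profile!r}")
    # Deep copy so that one task cannot mutate the shared profile.
    kwargs = copy.deepcopy(PROFILES[profile])
    kwargs.update({"task_id": task_id, "name": task_id.replace("_", "-")})
    kwargs.update(overrides)
    return kwargs


kwargs = build_pod_task_kwargs("load_orders", "small", image="example.registry/orders:2025-02")
print(kwargs)
```

Because every DAG calls the same function, a change in the operator's signature or in a profile would need to be handled in one place rather than in each DAG file.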

To achieve operational excellence, Flutter UKI has implemented a comprehensive monitoring framework centered on Amazon CloudWatch metrics. Their monitoring solution includes strategically configured alarms that provide early warning signals for potential issues. This proactive monitoring approach enables their teams to quickly identify and investigate anomalies in production workload executions, ensuring high availability and performance of their data pipelines. The combination of careful version management and robust monitoring exemplifies Flutter UKI’s commitment to operational excellence in their cloud infrastructure.

  5. CI/CD integration: By managing their code in GitLab with mandatory code reviews, and using Argo Events and Argo Workflows for image updates in Amazon ECR, they streamlined their development processes.
  6. Performance optimization: A significant portion of the DAGs are dynamically generated based on database metadata. This generation process runs outside Amazon MWAA, with its own CI/CD pipeline, and the resulting DAG files are stored in the S3 DAG folder. Code outside of tasks, including parameter evaluation, was avoided. Parameters and secrets are stored in AWS Secrets Manager and retrieved at task runtime. Engineers aim to minimize or eliminate inter-service dependencies within MWAA.

DAGs are scheduled to distribute execution times as evenly as possible. Task code and common modules are hosted on Amazon S3 and retrieved at runtime. For larger codebases, Amazon Elastic File System (Amazon EFS) volumes mounted to task pods are used.
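Generating DAG files outside of Airflow and spreading their start times can be sketched as follows. This Python example is one simple way to do it, not the method described by Flutter UKI. It renders one DAG file per metadata entry from a template and derives the schedule minute from a stable hash of the DAG ID, so start times spread across the hour. The table names, the template, and the hourly schedule are hypothetical, and the generated source assumes Airflow 2.4 or later.

```python
import hashlib
from string import Template

# Template for a generated DAG file; only the DAG ID and the minute vary.
DAG_TEMPLATE = Template('''\
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="$dag_id",
    schedule="$minute * * * *",
    start_date=datetime(2025, 1, 1),
    catchup=False,
) as dag:
    EmptyOperator(task_id="placeholder")
''')


def minute_offset(dag_id: str) -> int:
    """Map a DAG ID to a stable minute in the range 0-59."""
    digest = hashlib.sha256(dag_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 60


def render_dag(table_name: str):
    """Return (filename, source) for the DAG that loads one table."""
    dag_id = f"load_{table_name}"
    source = DAG_TEMPLATE.substitute(dag_id=dag_id, minute=minute_offset(dag_id))
    return f"{dag_id}.py", source


metadata = ["customers", "orders", "payments"]  # hypothetical table names
files = dict(render_dag(name) for name in metadata)
for filename, source in files.items():
    compile(source, filename, "exec")  # syntax check only; does not import Airflow
print(sorted(files))
```

A generator like this would typically run in its own pipeline and upload the resulting files to the DAG location in Amazon S3. The hash keeps each DAG's minute stable between runs of the generator.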

Results

Today, Flutter UKI’s infrastructure comprises four Amazon MWAA clusters, each executing tasks on dedicated Amazon EKS node groups. They manage approximately 5,500 DAGs encompassing over 30,000 tasks, handling more than 60,000 DAG runs daily with a concurrency exceeding 450 tasks running simultaneously across clusters. They anticipate a 10% monthly increase in this workload in the short to medium term. During major events like Cheltenham and Grand National, where data load increases by 30%, their MWAA service has demonstrated stability and scalability, achieving a 100% success rate for critical processes in 2025, a significant improvement over previous years.

Conclusion

Flutter UKI’s journey with Amazon Managed Workflows for Apache Airflow (Amazon MWAA) has resulted in a stable, scalable, and resilient production environment. The careful re-architecting of Flutter UKI’s service, combined with strategic decisions around task execution and infrastructure management, has not only simplified their operations, but also enhanced performance and reliability. Security and compliance benefits were also noticed, because MWAA provides managed security updates, built-in encryption, and integration with AWS security services. Perhaps most importantly, the shift to MWAA has allowed Flutter UKI’s engineering teams to redirect their efforts from infrastructure maintenance to business-critical tasks, focusing on DAG development and improving data pipeline efficiency, ultimately accelerating innovation in their core business operations.

If you’re looking to reduce operational overhead and migrate to a fully managed Airflow solution on AWS, consider using Amazon MWAA. Get in touch with your Technical Account Manager or your Solutions Architect to discuss a solution specific to your use case. You can also reach out to AWS Support by creating a case if you’re facing issues setting up the service.

Ready to see what Amazon MWAA is like? Visit the AWS Management Console for Amazon MWAA. For more information, see What Is Amazon Managed Workflows for Apache Airflow. Additionally, Using Amazon MWAA with Amazon EKS shows you how to integrate Amazon MWAA with Amazon EKS.


About the authors

Monica Cujerean is a Principal Data Engineer at Flutter UKI, focusing on service-related initiatives that cover performance optimization, cost effectiveness, and new feature adoption on most AWS services in our stack: Amazon MWAA, Amazon Redshift, Amazon Aurora, and Amazon SageMaker.

Ionut Hedesiu is a Senior Data Architect at Flutter UKI, responsible for designing strategic solutions to cover complex and varied business needs. His main expertise is in Amazon MWAA, Kubernetes, Amazon SageMaker, and ETL solutions.

Nidhi Agrawal is a Technical Account Manager at AWS who works with large enterprise customers, providing technical guidance, best practices, and strategic support to help them optimize their environments in the AWS Cloud.

John Kellett is a Senior Customer Solutions Manager with 25 years of experience across private and public sectors. John helps drive end-to-end customer engagement through program management excellence. By understanding and representing customers’ strategic visions, John aligns to develop the people, organizational readiness, and technology competencies to meet the desired outcomes.

Sidhanth Muralidhar is a Principal Technical Account Manager at AWS. He works with large enterprise customers who run their workloads on AWS. He is passionate about working with customers and helping them architect workloads for cost, reliability, performance, and operational excellence at scale in their cloud journey. He has a keen interest in data analytics as well.

How BMW Group built a serverless terabyte-scale data transformation architecture with dbt and Amazon Athena

Post Syndicated from Philipp Karg original https://aws.amazon.com/blogs/big-data/how-bmw-group-built-a-serverless-terabyte-scale-data-transformation-architecture-with-dbt-and-amazon-athena/

Businesses increasingly require scalable, cost-efficient architectures to process and transform massive datasets. At the BMW Group, our Cloud Efficiency Analytics (CLEA) team has developed a FinOps solution to optimize costs across over 10,000 cloud accounts. While enabling organization-wide efficiency, the team also applied these principles to the data architecture, making sure that CLEA itself operates frugally. After evaluating various tools, we built a serverless data transformation pipeline using Amazon Athena and dbt.

This post explores our journey, from the initial challenges to our current architecture, and details the steps we took to achieve a highly efficient, serverless data transformation setup.

Challenges: Starting from a rigid and costly setup

In our early stages, we encountered several inefficiencies that made scaling difficult. We were managing complex schemas with wide tables that required significant effort to maintain. Initially, we used Terraform to create tables and views in Athena, allowing us to manage our data infrastructure as code (IaC) and automate deployments through continuous integration and delivery (CI/CD) pipelines. However, this method slowed us down when changing data models or handling schema changes, and it required high development effort.

As our solution grew, we faced challenges with query performance and costs. Each query scanned large amounts of raw data, resulting in increased processing time and higher Athena costs. We used views to provide a clean abstraction layer, but this masked underlying complexity because seemingly simple queries against these views scanned large volumes of raw data, and our partitioning strategy wasn’t optimized for these access patterns. As our datasets grew, the lack of modularity in our data design increased complexity, making scalability and maintenance increasingly difficult. We needed a solution for pre-aggregating, computing, and storing query results of computationally intensive transformations. The absence of robust testing and lineage solutions made it challenging to identify the root causes of data inconsistencies when they occurred.

As part of our business intelligence (BI) solution, we used Amazon QuickSight to build our dashboards, providing visual insights into our cloud cost data. However, our initial data architecture led to challenges. We were building dashboards on top of large, wide datasets, with some hitting the QuickSight per-dataset SPICE limit of 1 TB. Additionally, during SPICE ingestion, our largest datasets required 4–5 hours of processing time because each refresh performed a full scan, often of over a terabyte of data. This architecture limited our agility as we scaled: the long processing times and storage limitations hindered our ability to provide timely insights and expand our analytics capabilities.

To address these issues, we enhanced the data architecture with AWS Lambda, AWS Step Functions, AWS Glue, and dbt. This tool stack significantly enhanced our development agility, empowering us to quickly modify and introduce new data models. At the same time, we improved our overall data processing efficiency with incremental loads and better schema management.

Solution overview

Our current architecture consists of a serverless and modular pipeline coordinated by GitHub Actions workflows. We chose Athena as our primary query engine for several strategic reasons: it aligns perfectly with our team’s SQL expertise, excels at querying Parquet data directly in our data lake, and alleviates the need for dedicated compute resources. This makes Athena an ideal fit for CLEA’s architecture, where we process around 300 GB daily from a data lake of 15 TB, with our largest dataset containing 50 billion rows across up to 400 columns. The capability of Athena to efficiently query large-scale Parquet data, combined with its serverless nature, enables us to focus on writing efficient transformations rather than managing infrastructure.

The following diagram illustrates the solution architecture.

Using this architecture, we’ve streamlined our data transformation process using dbt. In dbt, a data model represents a single SQL transformation that creates either a table or a view—essentially a building block of our data transformation pipeline. Our implementation includes around 400 such models, 50 data sources, and around 100 data tests. This setup enables seamless updates—whether creating new models, updating schemas, or modifying views—triggered simply by creating a pull request in our source code repository, with the rest handled automatically.

Our workflow automation includes the following features:

  • Pull request – When we create a pull request, it’s deployed to our testing environment first. After passing validation and being approved or merged, it’s deployed to production using GitHub workflows.
  • Cron scheduler – For nightly runs or multiple daily runs to reduce data latency, we use scheduled GitHub workflows. This setup allows us to configure specific models with different update strategies based on data needs. We can set models to update incrementally (processing only new or changed data), as views (querying without materializing data), or as full loads (completely refreshing the data). This flexibility optimizes processing time and resource usage. We can target only specific folders—like source, prepared, or semantic layers—and run the dbt test afterward to validate model quality.
  • On demand – When adding new columns or changing business logic, we need to update historical data to maintain consistency. For this, we use a backfill process, which is a custom GitHub workflow created by our team. The workflow allows us to select specific models, include their upstream dependencies, and set parameters like start and end dates. This makes sure that changes are applied accurately across the entire historical dataset, maintaining data consistency and integrity.
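As a rough illustration of the on-demand backfill described above, the following Python sketch assembles a dbt command line for a set of models and a date range. It is not the team's actual workflow: the function name, the model name `cost_daily`, and the variable names `start_date` and `end_date` are hypothetical. The sketch relies only on dbt's `--select` and `--vars` options and on the `+model` selector, which also selects a model's upstream dependencies.

```python
import json
from datetime import date


def build_backfill_command(models, start, end, include_upstream=True):
    """Return a dbt CLI argument list that re-runs the given models for a date range."""
    if start > end:
        raise ValueError("start date must not be after end date")
    # In dbt's node selection syntax, a leading "+" also selects upstream models.
    prefix = "+" if include_upstream else ""
    selectors = [f"{prefix}{name}" for name in models]
    # Hypothetical variable names; a model would read them with var("start_date").
    variables = {"start_date": start.isoformat(), "end_date": end.isoformat()}
    # --vars accepts a YAML string, and JSON is valid YAML.
    return ["dbt", "run", "--select", *selectors, "--vars", json.dumps(variables)]


# Example: backfill one (hypothetical) model and its upstream models for January 2024.
example_command = build_backfill_command(
    ["cost_daily"], date(2024, 1, 1), date(2024, 1, 31)
)
```

A GitHub workflow could pass the selected models and dates as inputs and hand the resulting list to a process runner, which avoids quoting problems with the JSON string.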

Our pipeline is organized into three primary stages—Source, Prepared, and Semantic—each serving a specific purpose in our data transformation journey. The Source stage maintains raw data in its original form. The Prepared stage cleanses and standardizes this data, handling tasks like deduplication and data type conversions. The Semantic stage transforms this prepared data into business-ready models aligned with our analytical needs. An additional QuickSight step handles visualization requirements. To achieve low cost and high performance, we use dbt models and SQL code to manage all transformations and schema changes. By implementing incremental processing strategies, our models process only new or changed data rather than reprocessing the entire dataset with each run.

The Semantic stage (not to be confused with dbt’s semantic layer feature) introduces business logic, transforming data into aggregated datasets that are directly consumable by BMW’s Cloud Data Hub, internal CLEA dashboards, data APIs, or In-Console Cloud Assistant (ICCA) chatbot. The QuickSight step further optimizes data by selecting only necessary columns by using a column-level lineage solution and setting a dynamic date filter with a sliding window to ingest only relevant hot data into SPICE, avoiding unused data in dashboards or reports.
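The sliding-window date filter can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual CLEA implementation: the window length of 13 months, the column name `usage_date`, and both function names are hypothetical.

```python
from datetime import date


def hot_data_window(today, months_back=13):
    """Return (start, end): the first day of the month `months_back` months ago, and today."""
    # Count months since year 0 so that stepping back across a year boundary is simple.
    total_months = today.year * 12 + (today.month - 1) - months_back
    start = date(total_months // 12, total_months % 12 + 1, 1)
    return start, today


def spice_filter(today, months_back=13, column="usage_date"):
    """Render a SQL predicate that keeps only rows inside the sliding window."""
    start, end = hot_data_window(today, months_back)
    return f"{column} BETWEEN DATE '{start.isoformat()}' AND DATE '{end.isoformat()}'"
```

Because the window is derived from the current date at refresh time, the view feeding SPICE keeps a roughly constant size even as the underlying table grows.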

This approach aligns with BMW Group’s broader data strategy, which includes streamlining data access using AWS Lake Formation for fine-grained access control.

Overall, as a high-level structure, we’ve fully automated schema changes, data updates, and testing through GitHub pull requests and dbt commands. This approach enables controlled deployment with robust version control and change management. Continuous testing and monitoring workflows uphold data accuracy, reliability, and quality across transformations, supporting efficient, collaborative model iteration.

Key benefits of the dbt-Athena architecture

To design and manage dbt models effectively, we use a multi-layered approach combined with cost and performance optimizations. In this section, we discuss how our approach has yielded significant benefits in five key areas.

SQL-based, developer-friendly environment

Our team already had strong SQL skills, so dbt’s SQL-centric approach was a natural fit. Instead of learning a new language or framework, developers could immediately start writing transformations using familiar SQL syntax with dbt. This familiarity aligns well with the SQL interface of Athena and, combined with dbt’s added functionality, has increased our team’s productivity.

Behind the scenes, dbt automatically handles synchronization between Amazon Simple Storage Service (Amazon S3), the AWS Glue Data Catalog, and our models. When we need to change a model’s materialization type (for example, from a view to a table), it’s as simple as updating a configuration parameter rather than rewriting code. This flexibility has reduced our development time dramatically, allowing us to focus on building better data models rather than managing infrastructure.

Agility in modeling and deployment

Documentation is crucial for any data platform’s success. We use dbt’s built-in documentation capabilities by publishing them to GitHub Pages, which creates an accessible, searchable repository of our data models. This documentation includes table schemas, relationships between models, and usage examples, enabling team members to understand how models interconnect and how to use them effectively.

We use dbt’s built-in testing capabilities to implement comprehensive data quality checks. These include schema tests that verify column uniqueness, referential integrity, and null constraints, as well as custom SQL tests that validate business logic and data consistency. The testing framework runs automatically on every pull request, validating data transformations at each step of our pipeline. Additionally, dbt’s dependency graph provides a visual representation of how our models interconnect, helping us understand the upstream and downstream impacts of any changes before we implement them. When stakeholders need to modify models, they can submit changes through pull requests, which, after they’re approved and merged, automatically trigger the necessary data transformations through our CI/CD pipeline. This streamlined process enabled us to create new data products within days instead of weeks and reduced ongoing maintenance work by catching issues early in the development cycle.

Athena workgroup separation

We use Athena workgroups to isolate different query patterns based on their execution triggers and purposes. Each workgroup has its own configuration and metric reporting, allowing us to monitor and optimize separately. The dbt workgroup handles our scheduled nightly transformations and on-demand updates triggered by pull requests through our Source, Prepared, and Semantic stages. The dbt-test workgroup executes automated data quality checks during pull request validation and nightly builds. The QuickSight workgroup manages SPICE data ingestion queries, and the Ad-hoc workgroup supports interactive data exploration by our team.

Each workgroup can be configured with specific data usage quotas, enabling teams to implement granular governance policies. This separation provides several benefits: it enables clear cost allocation, provides isolated monitoring of query patterns across different use cases, and helps enforce data governance through custom workgroup settings. Amazon CloudWatch monitoring per workgroup helps us track usage patterns, identify query performance issues, and adjust configurations based on actual needs.
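One way to express a per-workgroup scan limit is through the `Configuration` argument of Athena's CreateWorkGroup API. The sketch below only builds that dictionary; the helper name, bucket path, and limit are hypothetical, and the 10 MB floor reflects the documented minimum for `BytesScannedCutoffPerQuery` as we understand it.

```python
MIN_CUTOFF_BYTES = 10_000_000  # assumed API minimum for BytesScannedCutoffPerQuery


def workgroup_config(output_location, max_bytes_per_query):
    """Build the Configuration argument for Athena's CreateWorkGroup API."""
    if max_bytes_per_query < MIN_CUTOFF_BYTES:
        raise ValueError("per-query cutoff must be at least 10 MB")
    return {
        "ResultConfiguration": {"OutputLocation": output_location},
        # Prevent clients from overriding the workgroup settings.
        "EnforceWorkGroupConfiguration": True,
        # Publish per-workgroup query metrics to CloudWatch.
        "PublishCloudWatchMetricsEnabled": True,
        # Cancel any query that scans more than this many bytes.
        "BytesScannedCutoffPerQuery": max_bytes_per_query,
    }


# Hypothetical settings for the dbt-test workgroup: 1 TB per query at most.
dbt_test_config = workgroup_config("s3://example-bucket/athena/dbt-test/", 10**12)

# With boto3 installed and credentials configured, this could be applied as:
#   boto3.client("athena").create_work_group(Name="dbt-test", Configuration=dbt_test_config)
```

Keeping the limits in code makes it straightforward to give the ad hoc workgroup a tighter cutoff than the scheduled dbt workgroup.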

Using QuickSight SPICE

QuickSight SPICE (Super-fast, Parallel, In-memory Calculation Engine) provides powerful in-memory processing capabilities that we’ve optimized for our specific use cases. Rather than loading entire tables into SPICE, we create specialized views on top of our materialized semantic models. These views are carefully crafted to include only the necessary columns, relevant metadata joins, and appropriate time filtering to have only recent data available in dashboards.

We’ve implemented a hybrid refresh strategy for these SPICE datasets: daily incremental updates keep the data fresh, and weekly full refreshes maintain data consistency. This approach strikes a balance between data freshness and processing efficiency. The result is responsive dashboards that maintain high performance while keeping processing costs under control.

Scalability and cost-efficiency

The serverless architecture of Athena eliminates manual infrastructure management, automatically scaling based on query demand. Because costs are based solely on the amount of data scanned by queries, optimizing queries to scan as little data as possible directly reduces our costs. We use the distributed query execution capabilities of Athena through our dbt model structure, enabling parallel processing across data partitions. By implementing effective partitioning strategies and using Parquet file format, we minimize the amount of data scanned while maximizing query performance.
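To make the link between scanned data and cost concrete, here is a small estimate in Python. The price of USD 5 per TB, the 10 MB per-query minimum, and the binary definition of a terabyte are assumptions for illustration; check current Athena pricing for your Region before relying on the numbers.

```python
TB = 1024 ** 4
MIN_BILLED_BYTES = 10 * 1024 ** 2  # assumed per-query minimum of 10 MB


def estimate_query_cost(bytes_scanned, usd_per_tb=5.0):
    """Estimate the cost of one query from the bytes it scans (assumed pricing)."""
    billed = max(bytes_scanned, MIN_BILLED_BYTES)
    return billed / TB * usd_per_tb


# Hypothetical comparison: a full scan of a 1 TB table versus a query that
# prunes to one of 30 equally sized daily partitions.
full_scan_cost = estimate_query_cost(TB)
pruned_cost = estimate_query_cost(TB // 30)
```

Under these assumptions the pruned query costs about one thirtieth of the full scan, which is why partitioning and incremental models translate directly into lower spend.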

Our architecture offers flexibility in how we materialize data through views, full tables, and incremental tables. With dbt’s incremental models and partitioning strategy, we process only new or modified data instead of entire datasets. This approach has proven highly effective—we’ve observed significant reductions in data processing volume as well as data scanning, particularly in our QuickSight workgroup.

The effectiveness of these optimizations implemented at the end of 2023 is visible in the following diagram, showing costs by Athena workgroups.

The workgroups are illustrated as follows:

  • Green (QuickSight): Shows reduced data scanning post-optimization.
  • Light blue (Ad-hoc): Varies based on analysis needs.
  • Dark blue (dbt): Maintains consistent processing patterns.
  • Orange (dbt-test): Shows regular, efficient test execution.

The increased dbt workload costs directly correlate with decreased QuickSight costs, reflecting our architectural shift from using complex views in QuickSight workgroups (which previously masked query complexity but led to repeated computations) to using dbt for materializing these transformations. Although this increased the dbt workload, the overall cost-efficiency improved significantly because materialized tables reduced redundant computations in QuickSight. This demonstrates how our optimization strategies successfully manage growing data volumes while achieving net cost reduction through efficient data materialization patterns.

Conclusion

Our data architecture uses dbt and Athena to provide a scalable, cost-efficient, and flexible framework for building and managing data transformation pipelines. Athena’s ability to query data directly in Amazon S3 alleviates the need to move or copy data into a separate data warehouse, and its serverless model and dbt’s incremental processing minimize both operational overhead and processing costs. Given our team’s strong SQL expertise, expressing these transformations in SQL through dbt and Athena was a natural choice, enabling rapid model development and deployment. With dbt’s automatic documentation and lineage, troubleshooting and identifying data issues is simplified, and the system’s modularity allows for quick adjustments to meet evolving business needs.

Starting with this architecture is quick and straightforward: all that is needed is the dbt-core and dbt-athena libraries, and Athena itself requires no setup, because it’s a fully serverless service with seamless integration with Amazon S3. This architecture is ideal for teams looking to rapidly prototype, test, and deploy data models, optimizing resource usage, accelerating deployment, and providing high-quality, accurate data processing.

For those interested in a managed solution from dbt, see From data lakes to insights: dbt adapter for Amazon Athena now supported in dbt Cloud.


About the Authors

Philipp Karg is a Lead FinOps Engineer at BMW Group and has a strong background in data engineering, AI, and FinOps. He focuses on driving cloud efficiency initiatives and fostering a cost-aware culture within the company to leverage the cloud sustainably.

Selman Ay is a Data Architect specializing in end-to-end data solutions, architecture, and AI on AWS. Outside of work, he enjoys playing tennis and engaging in outdoor activities.

Cizer Pereira is a Senior DevOps Architect at AWS Professional Services. He works closely with AWS customers to accelerate their journey to the cloud. He has a deep passion for cloud-based and DevOps solutions, and in his free time, he also enjoys contributing to open source projects.

Best practices for least privilege configuration in Amazon MWAA

Post Syndicated from Elizabeth Davis original https://aws.amazon.com/blogs/big-data/best-practices-for-least-privilege-configuration-in-amazon-mwaa/

Amazon Managed Workflows for Apache Airflow (Amazon MWAA) provides a secure and managed environment to run Apache Airflow on AWS. Airflow is often used in highly regulated industries, such as finance and healthcare. Customers in these industries might want to restrict access and traffic beyond what the Amazon MWAA default configurations provide, to strengthen their security posture. This post covers some recommended practices.

The principle of least privilege is a fundamental tenet that should be followed diligently. When it comes to configuring AWS services, it’s essential to grant only the minimum required permissions to resources, avoiding overly broad or permissive policies.

In this post, we explore how to apply the principle of least privilege to your Amazon MWAA environment by tightening network security using security groups, network access control lists (ACLs), and virtual private cloud (VPC) endpoints. We also discuss the Amazon MWAA execution and deployment roles and their respective permissions.

Understanding the Amazon MWAA environment

When an Amazon MWAA environment is created, resources are created in an AWS managed service VPC and your customer managed VPC. In the customer VPC provided at environment creation, the necessary resources to run the Airflow environment are deployed, including schedulers and workers running on Amazon Elastic Container Service (Amazon ECS) clusters. These clusters are deployed in your VPC and they assume Elastic Network Interfaces (ENIs) with private IP addresses in the customer account. These ENIs span private subnets across two Availability Zones to connect to the Airflow database and web server, which reside in the service-owned account (if in private access mode). The following diagram illustrates this architecture.

MWAA Architecture

VPC security groups act as virtual firewalls that control network traffic at the ENI level, or instance level. Security groups are stateful, meaning that response traffic for an allowed inbound request is automatically permitted outbound, and vice versa. By default, a security group in a VPC starts with no inbound rules and an outbound rule allowing all traffic. By definition, a security group with no inbound rules denies all ingress traffic except responses to traffic that was allowed out through the 0.0.0.0/0 outbound rule.

Amazon MWAA offers two web server access modes inside the customer VPC: public and private. Public web server mode must have a way for traffic to access the web servers in the customer-owned VPC through the public internet. This requires routing to the public internet using public subnets and a NAT gateway. A NAT gateway can be used to provide internet access for resources in private subnets. With private access mode, the security group for the Amazon MWAA environment doesn’t need to allow traffic to and from the NAT gateway, only granting access to the Airflow UI to users with appropriate permissions from within the VPC. An Application Load Balancer is only provisioned in public mode to route traffic to the public web servers. The customer must provision the rest of the networking components.

If your Amazon MWAA environment needs to communicate with resources outside your VPC (such as external data sources or APIs), you might need to configure appropriate security group rules and routing to allow the necessary traffic. In such cases, you would typically use a NAT gateway or VPN connection to reach the external resources, and VPC endpoints to reach AWS services.

For tighter security restrictions, an environment with private routing without internet access is possible, and finer-grained security group rules can be applied and VPC endpoint policies can be used. Because this post is focusing on least privilege, we will focus on the minimum security requirements needed for an Amazon MWAA environment.

Security groups: Minimizing permissions

Your Amazon MWAA environment will have a security group associated with your VPC’s environment resources. This security group is also used by the ENIs created by the interface VPC endpoint that is used to communicate with the database and web server. By default, security groups deny all inbound traffic and security group rules need to be explicitly stated, denoting the ports and source that the instance will allow network traffic from. At a minimum, the Amazon MWAA environment must allow for traffic to and from the Amazon Aurora PostgreSQL-Compatible Edition metadata database that is owned and managed by Amazon MWAA. The metadata database is a crucial component of Airflow that acts as a centralized source of truth for task execution, configuration, and monitoring. Both the scheduler and workers require access to this database to perform their respective roles in orchestrating and running tasks. This database listens on TCP port 5432. Additionally, the web server traffic can be restricted to HTTPS through TCP port 443. At a minimum, the Amazon MWAA security group must have the two inbound rules, detailed in the following table.

Type | Protocol | Port Range | Source Type | Source
Custom TCP | TCP | 5432 | Custom | sg-xxxxx / my-mwaa-vpc-security-group
HTTPS | TCP | 443 | Custom | sg-xxxxx / my-mwaa-vpc-security-group

Many customers have other AWS resources residing in VPCs, to which the Amazon MWAA workers need access. These resources can be granted network access in a private routing configuration using security groups as well. If the resource sits in the same security group, add an additional inbound rule with the port needed. For example, if an Amazon Redshift cluster sits in the same security group, add the following rule.

Type | Protocol | Port Range | Source Type | Source
Custom TCP | TCP | 5439 | Custom | sg-xxxxx / my-mwaa-vpc-security-group

If the Redshift cluster is in a different security group, change the source to the Redshift security group.

Type | Protocol | Port Range | Source Type | Source
Custom TCP | TCP | 5439 | Custom | sg-xxxxx / redshift-security-group
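The inbound rules in the preceding tables can also be written in the `IpPermissions` shape that the Amazon EC2 `AuthorizeSecurityGroupIngress` API accepts. The following Python sketch only builds those structures; the helper names and the security group IDs are placeholders, not values from a real environment.

```python
def tcp_ingress_rule(port, source_group_id):
    """One inbound TCP rule whose source is another (or the same) security group."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "UserIdGroupPairs": [{"GroupId": source_group_id}],
    }


def mwaa_minimum_ingress(mwaa_group_id):
    """The two self-referencing rules: metadata database (5432) and HTTPS (443)."""
    return [tcp_ingress_rule(5432, mwaa_group_id), tcp_ingress_rule(443, mwaa_group_id)]


# Placeholder IDs for illustration only.
MWAA_SG = "sg-0123456789abcdef0"
REDSHIFT_SG = "sg-0fedcba9876543210"

rules = mwaa_minimum_ingress(MWAA_SG)
# Optional extra rule for a Redshift cluster in a different security group.
rules.append(tcp_ingress_rule(5439, REDSHIFT_SG))

# With boto3 installed, the rules could be applied to the MWAA security group with:
#   boto3.client("ec2").authorize_security_group_ingress(GroupId=MWAA_SG, IpPermissions=rules)
```

Referencing security groups as sources, rather than CIDR ranges, keeps the rules valid when ENI IP addresses change.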

If the resources are in another VPC, then VPC peering must be enabled before referencing that other VPC’s security group. For resources that don’t reside in a subnet, a VPC endpoint will also provide private routing to and from the Amazon MWAA environment and those resources. For example, a VPC endpoint for Amazon Simple Storage Service (Amazon S3) can provide enhanced security, improved performance, and lower costs.

Network ACLs: Minimizing permissions

Network ACLs can manage (by allow or deny rules) inbound and outbound traffic at the subnet level. An ACL is stateless, which means that inbound and outbound rules must be specified separately and explicitly. It is used to specify the types of network traffic that are allowed in or out from the instances in a VPC network.

Every Amazon VPC has a default ACL that allows all inbound and outbound traffic, with a rule as follows.

Rule number | Type | Protocol | Port Range | Source | Allow/Deny
100 | All IPv4 traffic | All | All | 0.0.0.0/0 | Allow
* | All IPv4 traffic | All | All | 0.0.0.0/0 | Deny

You can edit the default ACL rules or create a custom ACL and attach it to your subnets. A subnet can only have one ACL attached to it at any time, but one ACL can be attached to multiple subnets. To implement least privilege in your Amazon MWAA environment, restrict the inbound ACL to allow traffic only from the metadata database and web server, and restrict the outbound ACL to allow traffic only to the clients in the private subnet. Note that the following examples use sample private IP ranges for the subnets.

Inbound NACL

Rule number | Type | Protocol | Port Range | Source | Allow/Deny | Comments
100 | Custom TCP | TCP | 5432 | 10.192.21.0/16 | Allow | Allow inbound database traffic from private subnet
110 | HTTPS | TCP | 443 | 10.192.21.0/16 | Allow | Allow inbound HTTPS traffic from private subnet
* | All traffic | All | All | 0.0.0.0/0 | Deny | Denies all inbound IPv4 traffic not already handled by a preceding rule (not modifiable)

Outbound NACL

Rule number | Type | Protocol | Port Range | Destination | Allow/Deny | Comments
100 | Custom TCP | TCP | 1024-65535 | 10.192.21.0/24 | Allow | Allows outbound return IPv4 traffic to clients in private subnet
* | All traffic | All | All | 0.0.0.0/0 | Deny | Denies all outbound IPv4 traffic not already handled by a preceding rule (not modifiable)
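Because network ACLs are stateless and evaluated in rule-number order, it can help to model the evaluation explicitly. The following Python sketch is a simplified model for reasoning about the tables above, not an AWS API: the class and function names are hypothetical, only TCP is modeled, and it assumes 10.192.21.0/24 as the sample private subnet in both directions.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class AclRule:
    number: int
    cidr: str
    port_from: int
    port_to: int
    allow: bool


def evaluate(rules, address, port):
    """Return True if traffic is allowed. Lowest rule number that matches wins."""
    ip = ipaddress.ip_address(address)
    for rule in sorted(rules, key=lambda r: r.number):
        network = ipaddress.ip_network(rule.cidr, strict=False)
        if ip in network and rule.port_from <= port <= rule.port_to:
            return rule.allow
    return False  # the implicit "*" rule denies everything else


INBOUND = [
    AclRule(100, "10.192.21.0/24", 5432, 5432, True),  # metadata database
    AclRule(110, "10.192.21.0/24", 443, 443, True),    # HTTPS
]
OUTBOUND = [
    AclRule(100, "10.192.21.0/24", 1024, 65535, True),  # ephemeral return ports
]
```

The outbound rule covers only ephemeral ports because, unlike a security group, the ACL does not remember that an inbound connection was allowed; the return traffic needs its own rule.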

VPC endpoints: Minimizing permissions

When you create an Amazon MWAA environment, it is deployed within a VPC. This allows you to control the network access and security of your Airflow deployment. However, some customer workloads executing in the Amazon MWAA environment might need to orchestrate tasks using other AWS services that reside outside of your VPC, such as Amazon S3 to access files, AWS Glue to start ETL (extract, transform, and load) jobs, or Amazon Redshift for running data warehouse queries. To establish a secure and private connection between your Amazon MWAA environment and these AWS services, you can use VPC endpoints. VPC endpoints are virtual devices that are provisioned within your VPC and act as an entry point for the specified AWS service, allowing your Amazon MWAA environment to communicate with the service using a private IP address, without needing to go through the public internet. The following diagram illustrates this architecture.

VPCEndpointsMWAA

VPC endpoints allow you to keep your Amazon MWAA environment’s network traffic within the AWS network, reducing the exposure to the public internet and enhancing the overall security of your Airflow deployment. Although private VPC endpoints are automatically created for the database and web server, a least-privilege environment without internet access needs additional VPC endpoints for the other services that Amazon MWAA requires: Amazon S3, Amazon Simple Queue Service (Amazon SQS), Amazon CloudWatch, and optionally AWS Key Management Service (AWS KMS). For more details, see Creating the required VPC service endpoints in an Amazon VPC with private routing. Beyond these required services, many customers run Amazon MWAA workflows that orchestrate additional AWS services, such as Amazon Redshift, Amazon EMR, and AWS Glue. Let’s look at an example VPC endpoint for Amazon Redshift, which Airflow DAGs commonly call through the Redshift operators in workflows that use Amazon Redshift as a data warehouse. For more information on creating Amazon VPC interface endpoints, see Access an AWS service using an interface VPC endpoint.

Create a VPC endpoint

Complete the following steps to create a VPC endpoint using Amazon Virtual Private Cloud (Amazon VPC):

  1. On the Amazon VPC console, create a new VPC endpoint for the com.amazonaws.region.redshift service, where region is the AWS Region where your Amazon MWAA environment and Redshift cluster are located. Make sure that private DNS is enabled.
  2. Create a VPC endpoint policy. This can be used to limit access to the Redshift cluster only to the Amazon MWAA environment, preventing unauthorized access from other resources. The following is an example policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::123456789012:role/YourMWAAExecutionRoleName"
        ]
      },
      "Action": [
        "redshift:DescribeClusters",
        "redshift:DescribeClusterParameters",
        "redshift:DescribeClusterSecurityGroups",
        "redshift:DescribeClusterSubnetGroups",
        "redshift:DescribeEventSubscriptions",
        "redshift:DescribeLoggingStatus",
        "redshift:DescribeReservedNodeOfferings",
        "redshift:DescribeReservedNodes",
        "redshift:DescribeTableRestoreStatus",
        "redshift:DescribeTags",
        "redshift:GetClusterCredentials",
        "redshift:ListTagsForResource",
        "redshift:PurchaseReservedNodeOffering",
        "redshift:ResetClusterParameterGroup",
        "redshift:RestoreFromClusterSnapshot",
        "redshift:RevokeClusterSecurityGroupIngress",
        "redshift:RevokeSnapshotAccess",
        "redshift:ViewQueriesInConsole"
      ],
      "Resource": "arn:aws:redshift:us-east-1:123456789012:cluster/my-redshift-cluster"
    }
  ]
}

The policy contains the following parameters:
  • The Version field specifies the policy language version.
  • The Statement section contains a single statement that allows the specified actions on the Redshift cluster.
  • The Effect field is set to Allow, which means the policy grants the specified permissions.
  • The Principal field specifies the AWS Identity and Access Management (IAM) role associated with your Amazon MWAA execution role, which is authorized to access the Redshift cluster.
  • The Action field lists the specific Redshift actions that the Amazon MWAA execution role is allowed to perform, such as describing the cluster, getting cluster credentials, and restoring from a snapshot.
  • The Resource field specifies the Amazon Resource Name (ARN) of the Redshift cluster that the policy applies to.
  3. Associate the VPC endpoint with your Amazon MWAA networking. For a gateway endpoint, associate the endpoint with the route table used by the subnets where your Amazon MWAA environment is deployed. For an interface endpoint (such as the Redshift endpoint in this example), associate the endpoint with the two private subnets and the security group used by Amazon MWAA.
  4. Make sure that the security groups associated with the Amazon MWAA environment and the Redshift cluster allow the necessary inbound and outbound traffic between them. This typically includes allowing access on the Redshift port (typically 5439) from the Amazon MWAA environment’s security group.
  5. In the Airflow UI, under Admin, Connections, update the Redshift connection details to use the VPC endpoint address instead of the public Redshift endpoint. This makes sure that the connection between Amazon MWAA and Amazon Redshift is secure and stays within the VPC.
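When tailoring an endpoint policy like the example above to your own workflows, a quick review of which actions go beyond read access can help keep the policy close to least privilege. The following Python sketch is a naming heuristic only, not an AWS tool: the function name and the prefix list are assumptions, and some actions with a read-style name (for example, `redshift:GetClusterCredentials`) still deserve a manual look.

```python
READ_STYLE_PREFIXES = ("Describe", "Get", "List", "View")


def actions_needing_review(policy):
    """Return allowed actions whose names do not start with a read-style prefix."""
    flagged = []
    for statement in policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # IAM allows a single string or a list
            actions = [actions]
        for action in actions:
            operation = action.split(":", 1)[-1]
            if not operation.startswith(READ_STYLE_PREFIXES):
                flagged.append(action)
    return flagged


# A shortened version of the example policy's action list.
sample_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "redshift:DescribeClusters",
                "redshift:GetClusterCredentials",
                "redshift:PurchaseReservedNodeOffering",
                "redshift:RestoreFromClusterSnapshot",
            ],
            "Resource": "arn:aws:redshift:us-east-1:123456789012:cluster/my-redshift-cluster",
        }
    ],
}

to_review = actions_needing_review(sample_policy)
```

Any action the check flags is a candidate for removal unless a DAG in the environment actually calls it.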

By configuring VPC endpoints for the AWS services your Amazon MWAA environment needs to access, you can provide secure, private, and efficient communication between your Airflow deployment and AWS resources.

Restricting traffic within AWS with customer managed endpoints for Amazon MWAA resources

As mentioned earlier, Amazon MWAA integrates with various AWS services, such as CloudWatch for logging, Amazon S3 for DAGs and requirements, Amazon SQS as a messaging middleware, and optionally AWS KMS for encryption. You can create VPC endpoints for these services to make sure traffic stays within the AWS network. Access to these endpoints can be restricted by allowing only the Amazon MWAA security group as the ingress source. For details on how to create these endpoints and policies, see Introducing shared VPC support on Amazon MWAA. If the Amazon MWAA environment was updated after April 2, 2024, it will be on AWS Fargate v1.4 and will not use Amazon Elastic Container Registry (Amazon ECR) and therefore you will not need to create a VPC endpoint for it.

Managing permissions to deploy an Amazon MWAA environment

To create and deploy an Amazon MWAA environment, you need to have the appropriate permissions granted to your IAM user or role. The required permissions can be granted through an IAM policy attached to your user or role. When you create an Amazon MWAA environment, you can specify an execution role that will be assumed by the Airflow workers to perform tasks. The execution role should have the necessary permissions to access the required AWS services and resources based on your workflow requirements. It’s important to follow the principle of least privilege when granting permissions to IAM roles and users. You should only grant the minimum permissions required for your Amazon MWAA environment and Airflow workflows to function correctly.

Amazon MWAA trust policy

Amazon MWAA needs to be able to assume the execution role in order to perform actions on your behalf. To allow this, create a trust policy that lets the Amazon MWAA service call sts:AssumeRole. To avoid the confused deputy problem, add a condition to the trust policy, replacing the AWS account number and Region as needed. The following is an example policy:

{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {
            "Service": ["airflow.amazonaws.com","airflow-env.amazonaws.com"]
        },
        "Action": "sts:AssumeRole",
        "Condition":{
            "ArnLike":{
               "aws:SourceArn":"arn:aws:airflow:your-region:123456789012:environment/your-environment-name"
            },
            "StringEquals":{
               "aws:SourceAccount":"123456789012"
            }
         }
      }
   ]
}
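
As a minimal sketch of the substitution step, the following Python renders the trust policy above from parameters instead of hand-editing the JSON. The function name `render_trust_policy` and its arguments are illustrative assumptions, not part of any AWS SDK, and the values in the usage line are examples only.

```python
import json


def render_trust_policy(region: str, account_id: str, environment_name: str) -> dict:
    """Build the trust policy shown above for a single Amazon MWAA environment."""
    source_arn = f"arn:aws:airflow:{region}:{account_id}:environment/{environment_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": ["airflow.amazonaws.com", "airflow-env.amazonaws.com"]
                },
                "Action": "sts:AssumeRole",
                # Both conditions scope the role to one environment in one account,
                # which is the confused deputy protection described in the text.
                "Condition": {
                    "ArnLike": {"aws:SourceArn": source_arn},
                    "StringEquals": {"aws:SourceAccount": account_id},
                },
            }
        ],
    }


if __name__ == "__main__":
    # Example values only; replace with your own Region, account, and environment.
    print(json.dumps(render_trust_policy("us-east-1", "123456789012", "my-env"), indent=4))
```

Generating the document this way keeps the account ID in aws:SourceArn and aws:SourceAccount from drifting apart when the policy is copied between environments.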

VPC endpoint permissions for the deployer role

Although the service-linked role creates the VPC endpoints, the deployer role requires permissions to create VPC endpoints and perform a dry run. You can limit these permissions by allowing the ec2:CreateVpcEndpoint action and specifying resource ARNs for VPC endpoints, VPCs, subnets, and security groups. Additionally, you can use the aws:CalledVia condition key to restrict access to the airflow.amazonaws.com service.
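
The restriction described above could look roughly like the following statement, built here in Python so it can be merged into a larger deployer policy. This is a sketch under assumptions: the helper name `vpc_endpoint_deployer_statement` is hypothetical, the resource ARNs use wildcards that you would narrow to your own VPC, subnet, and security group IDs, and the statement should be checked against your account (for example with the IAM policy simulator) before use.

```python
import json


def vpc_endpoint_deployer_statement(region: str, account_id: str) -> dict:
    """Allow ec2:CreateVpcEndpoint on the resource types named in the text,
    and only when the request is made through Amazon MWAA."""
    prefix = f"arn:aws:ec2:{region}:{account_id}"
    return {
        "Effect": "Allow",
        "Action": "ec2:CreateVpcEndpoint",
        "Resource": [
            f"{prefix}:vpc-endpoint/*",
            f"{prefix}:vpc/*",
            f"{prefix}:subnet/*",
            f"{prefix}:security-group/*",
        ],
        "Condition": {
            # aws:CalledVia is a multivalued key, so it is paired with a set operator.
            "ForAnyValue:StringEquals": {"aws:CalledVia": ["airflow.amazonaws.com"]}
        },
    }


if __name__ == "__main__":
    # Example values only.
    print(json.dumps(vpc_endpoint_deployer_statement("us-east-1", "123456789012"), indent=4))
```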

Amazon MWAA execution role: Required permissions

When creating an Amazon MWAA environment, you need to specify an execution role that grants the necessary permissions for Airflow to interact with other AWS services. Instead of using a wildcard policy, you can create a custom policy with the minimum required permissions.

The following is an example of an execution role policy that allows Amazon MWAA to interact with various services using an AWS managed key:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "airflow:PublishMetrics",
            "Resource": "arn:aws:airflow:{your-region}:{your-account-id}:environment/{your-environment-name}"
        },
        { 
            "Effect": "Deny",
            "Action": "s3:ListAllMyBuckets",
            "Resource": [
                "arn:aws:s3:::{your-s3-bucket-name}",
                "arn:aws:s3:::{your-s3-bucket-name}/*"
            ]
        },
        { 
            "Effect": "Allow",
            "Action": [ 
                "s3:GetObject*",
                "s3:GetBucket*",
                "s3:List*"
            ],
            "Resource": [
                "arn:aws:s3:::{your-s3-bucket-name}",
                "arn:aws:s3:::{your-s3-bucket-name}/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:CreateLogGroup",
                "logs:PutLogEvents",
                "logs:GetLogEvents",
                "logs:GetLogRecord",
                "logs:GetLogGroupFields",
                "logs:GetQueryResults"
            ],
            "Resource": [
                "arn:aws:logs:{your-region}:{your-account-id}:log-group:airflow-{your-environment-name}-*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:DescribeLogGroups"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetAccountPublicAccessBlock"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": "cloudwatch:PutMetricData",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "sqs:ChangeMessageVisibility",
                "sqs:DeleteMessage",
                "sqs:GetQueueAttributes",
                "sqs:GetQueueUrl",
                "sqs:ReceiveMessage",
                "sqs:SendMessage"
            ],
            "Resource": "arn:aws:sqs:{your-region}:*:airflow-celery-*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt",
                "kms:DescribeKey",
                "kms:GenerateDataKey*",
                "kms:Encrypt"
            ],
            "Resource": "arn:aws:kms:your-region:your-account-id:key/your-kms-cmk-id",
            "Condition": {
                "StringLike": {
                    "kms:ViaService": [
                        "sqs.{your-region}.amazonaws.com",
                        "s3.{your-region}.amazonaws.com"
                    ]
                }
            }
        }
    ]
}

This policy grants Amazon MWAA the necessary permissions to interact with CloudWatch Logs, Amazon S3, Amazon SQS, and AWS KMS when using the AWS managed key offering, while explicitly specifying the resources it can access. You can further refine this policy based on your specific requirements.
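
When refining a policy like this, it can help to list the statements that still use a bare wildcard resource so each one gets a deliberate review. The following is a small sketch of such a check; `wildcard_resource_statements` is a hypothetical helper, not an AWS tool, and it only inspects the `Resource` element of `Allow` statements.

```python
def wildcard_resource_statements(policy: dict) -> list[dict]:
    """Return the Allow statements whose Resource is, or includes, the bare "*"."""
    flagged = []
    for statement in policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        resources = statement.get("Resource", [])
        # IAM accepts either a single string or a list of strings here.
        if isinstance(resources, str):
            resources = [resources]
        if "*" in resources:
            flagged.append(statement)
    return flagged
```

Run against the example policy above, this would return the logs:DescribeLogGroups, s3:GetAccountPublicAccessBlock, and cloudwatch:PutMetricData statements. A flagged statement is not necessarily wrong; the point is to confirm that each wildcard is intentional.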

The following is an example of an execution policy that allows Amazon MWAA to interact with various services using a KMS customer managed key:

{
    "Version": "2012-10-17",
    "Statement": [
        { 
            "Effect": "Deny",
            "Action": "s3:ListAllMyBuckets",
            "Resource": [
                "arn:aws:s3:::{your-s3-bucket-name}",
                "arn:aws:s3:::{your-s3-bucket-name}/*"
            ]
        }, 
        { 
            "Effect": "Allow",
            "Action": [ 
                "s3:GetObject*",
                "s3:GetBucket*",
                "s3:List*"
            ],
            "Resource": [
                "arn:aws:s3:::{your-s3-bucket-name}",
                "arn:aws:s3:::{your-s3-bucket-name}/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:CreateLogGroup",
                "logs:PutLogEvents",
                "logs:GetLogEvents",
                "logs:GetLogRecord",
                "logs:GetLogGroupFields",
                "logs:GetQueryResults"
            ],
            "Resource": [
                "arn:aws:logs:{your-region}:{your-account-id}:log-group:airflow-{your-environment-name}-*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:DescribeLogGroups"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetAccountPublicAccessBlock"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": "cloudwatch:PutMetricData",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "sqs:ChangeMessageVisibility",
                "sqs:DeleteMessage",
                "sqs:GetQueueAttributes",
                "sqs:GetQueueUrl",
                "sqs:ReceiveMessage",
                "sqs:SendMessage"
            ],
            "Resource": "arn:aws:sqs:{your-region}:*:airflow-celery-*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt",
                "kms:DescribeKey",
                "kms:GenerateDataKey*",
                "kms:Encrypt"
            ],
            "Resource": "arn:aws:kms:{your-region}:{your-account-id}:key/{your-kms-cmk-id}",
            "Condition": {
                "StringLike": {
                    "kms:ViaService": [
                        "sqs.{your-region}.amazonaws.com",
                        "s3.{your-region}.amazonaws.com"
                    ]
                }
            }
        }
    ]
}

For the use case of using the customer managed key, attach the following JSON policy to the key to provide access to the Airflow logs in CloudWatch Logs:

{
    "Sid": "Allow logs access",
    "Effect": "Allow",
    "Principal": {
        "Service": "logs.{your-region}.amazonaws.com"
    },
    "Action": [
        "kms:Encrypt*",
        "kms:Decrypt*",
        "kms:ReEncrypt*",
        "kms:GenerateDataKey*",
        "kms:Describe*"
    ],
    "Resource": "*",
    "Condition": {
        "ArnLike": {
            "kms:EncryptionContext:aws:logs:arn": "arn:aws:logs:{your-region}:{your-account-id}:*"
        }
    }
}

You can attach multiple policies to the execution role as needed to allow your workers to access additional AWS resources. For example, let’s explore how to enable Amazon EMR access. You can create a JSON policy that contains the narrowest permissions you can configure, as in the following example:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "elasticmapreduce:DescribeStep",
                "elasticmapreduce:AddJobFlowSteps",
                "elasticmapreduce:RunJobFlow"
            ],
            "Resource": "arn:aws:elasticmapreduce:*:xxxxxxxxxxxx:cluster/*"
        },
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": [
                "arn:aws:iam::xxxxxxxxxxxx:role/EMR_EC2_DefaultRole",
                "arn:aws:iam::xxxxxxxxxxxx:role/EMR_DefaultRole"
            ]
        }
    ]
}

Conclusion

In this post, we discussed best practices for least privilege configuration in Amazon MWAA. By following these approaches, you can adhere to the principle of least privilege and maintain a secure posture within your Amazon MWAA environment, without compromising functionality or relying on overly permissive policies. Security is always a top priority; to learn more about security in Amazon MWAA, see Security in Amazon Managed Workflows for Apache Airflow and Security best practices on Amazon MWAA.


About the Authors

Elizabeth Davis is a Sr. Solutions Architect at Amazon Web Services (AWS). She currently works with educational technology companies and has a passion for serverless and data orchestration technologies. She has been an Amazon MWAA subject matter expert (SME) for the last 3+ years.

Mark Richman is a Principal Solutions Architect at Amazon Web Services with 30 years of experience building complex web and enterprise software. He contributes to Apache Airflow, bringing his expertise in cloud computing and serverless technologies to the open-source platform. Mark is also an accomplished writer and speaker who has authored commercial publications and AWS courses while regularly presenting at industry events.

Intel Foundry Direct Connect 2025 Make-or-Break Time For Intel

Post Syndicated from Patrick Kennedy original https://www.servethehome.com/intel-foundry-direct-connect-2025-make-or-break-time-for-intel/

We are at the Intel Foundry 2025 event, which is a make-or-break moment for Intel as it looks to bolster its semiconductor foundry business.

The post Intel Foundry Direct Connect 2025 Make-or-Break Time For Intel appeared first on ServeTheHome.

Barnes: Parallel ./configure

Post Syndicated from corbet original https://lwn.net/Articles/1019303/

Tavian Barnes takes on the tedious process of waiting for configure scripts to run.

I paid good money for my 24 CPU cores, but ./configure can only manage to use 69% of one of them. As a result, this random project takes about 13.5× longer to configure the build than it does to actually do the build.

The purpose of a ./configure script is basically to run the compiler a bunch of times and check which runs succeeded. In this way it can test whether particular headers, functions, struct fields, etc. exist, which lets people write portable software. This is an embarrassingly parallel problem, but Autoconf can’t parallelize it, and neither can CMake, neither can Meson, etc., etc.

(Thanks to Paul Wise).
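
To illustrate why the probes described above are embarrassingly parallel, here is a minimal Python sketch that runs independent compile checks concurrently. It is not taken from Barnes's article; the function names `compiles` and `run_probes` are made up for this example, and it assumes a C compiler is reachable as `cc`.

```python
import os
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor


def compiles(source: str, compiler: str = "cc") -> bool:
    """Return True if the compiler accepts this C source (one configure-style probe)."""
    with tempfile.TemporaryDirectory() as workdir:
        src = os.path.join(workdir, "probe.c")
        with open(src, "w") as handle:
            handle.write(source)
        try:
            result = subprocess.run(
                [compiler, "-c", src, "-o", os.path.join(workdir, "probe.o")],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except OSError:
            return False  # the compiler could not be started
        return result.returncode == 0


def run_probes(probes: dict, check=compiles, workers=None) -> dict:
    """Run every probe concurrently and map each probe name to its result."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # No probe depends on another, so they can all be in flight at once.
        results = pool.map(check, probes.values())
        return dict(zip(probes.keys(), results))


if __name__ == "__main__":
    probes = {
        "have_stdint_h": "#include <stdint.h>\nint main(void) { return 0; }\n",
        "have_strlcpy": "#include <string.h>\nint main(void) { char b[4]; strlcpy(b, \"x\", 4); return 0; }\n",
    }
    print(run_probes(probes))
```

Threads are sufficient here because each probe spends its time waiting on a compiler subprocess rather than running Python code.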

Meet B2 Overdrive: Terabit-Speed Throughput for AI/ML and HPC Workloads

Post Syndicated from David Ngo original https://www.backblaze.com/blog/b2-overdrive-announcement/

If you’re wrangling massive datasets for AI, machine learning (ML), high-performance compute (HPC), content delivery networks (CDNs), or analytics, you’re familiar with the trade-off: Pay a premium for the highest speeds, or compromise on performance to keep costs manageable. 

Backblaze B2 Overdrive changes that. You can now move exabyte-scale datasets at up to terabit speeds without the eye-watering price tag. Starting at $15 per terabyte per month, Backblaze B2 Overdrive gives you the power to run data-intensive workloads at peak performance, with unlimited free egress and private networking options that keep things fast, secure, and predictable.

See it in action

Join our upcoming webinar with Pat Patterson, Chief Technical Evangelist, and Dave Ngo, Chief Product Officer, to learn more about how B2 Overdrive supercharges your data.

What makes B2 Overdrive different?

B2 Overdrive offers a specialized cloud object storage solution at a fraction of competitors’ costs. Here’s what you get:

  • Up to 1Tbps throughput: In other words, the kind of speed that lets you move petabytes of data fast without complex architecture. 
  • Unlimited free egress: Move as much data as you want, whenever you want, to wherever you want. Egress is totally free. 
  • Private networking support: Transfer data at maximum speed through secure private networking connections to your infrastructure.

It’s built on the foundation of our always-hot cloud storage infrastructure, with no minimum file size requirements, no deletion fees, and powerful features like Event Notifications so you can build responsive and automated workflows. We’ll be sharing some of the innovations under the hood in the coming months—so, stay tuned to our series on the engineering behind performance. 

Who’s it for?

The simple answer: The status quo isn’t cutting it. Today’s workloads demand both the ability to move massive datasets and predictable economics that don’t penalize success. B2 Overdrive challenges the assumption that mind-bending performance has to come with mind-boggling prices.

We need to store an insane amount of data and, at the same time, download it to different GPU clusters around the world, and for all that to not cost an insane amount of money. That’s why we chose Backblaze.

—Dean Leitersdorf, CEO and Co-Founder, Decart

Ready to go?

Backblaze B2 Overdrive is generally available today for organizations with multi-petabyte storage instances and workloads. 

Want to learn more or see if it’s the right fit for your team? Get in touch with our Sales team—we’d love to talk about how we can help you ditch the trade-offs and go full speed ahead.

The post Meet B2 Overdrive: Terabit-Speed Throughput for AI/ML and HPC Workloads appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Reinforcing resilience with financial assurance: Breach protection matters now more than ever

Post Syndicated from Cindy Stanton original https://blog.rapid7.com/2025/04/29/reinforcing-resilience-with-financial-assurance-breach-protection-matters-now-more-than-ever/

Introducing Rapid7’s value-added Breach Protection Warranty that delivers confidence, clarity, and coverage when it matters most.

Life’s old adage often applies in security: Hope for the best, prepare for the worst. In today’s threat landscape, even the best-prepared organizations can’t guarantee immunity from cyberattacks. The cost of a breach is no longer just a line item; it’s a business risk with board-level visibility. In 2024, the average total cost of a breach soared to $4.88 million, a record high and a 10% increase from the previous year.[1]

As threats grow in complexity and breach response becomes more expensive and unpredictable, security leaders aren’t just investing in detection and response—they’re looking for assurance that they are, in fact, prepared for the worst. That’s why Rapid7 is introducing its Breach Protection Warranty: real-world financial coverage, built directly into our flagship Managed Detection and Response (MDR) offering, Managed Threat Complete Ultimate. This coverage is designed to give customers confidence that if that dreaded day comes, they’re ready.

Rapid7 continues to invest heavily in this industry-leading service—trusted by thousands of customers worldwide—which processes trillions of events and investigates millions of alerts annually. Leveraging this immense scale to ensure robust threat detection and rapid response, the results speak for themselves: 99.6% of our MDR customers remain unaffected by ransomware. But beyond numbers, our commitment is clear—we’re dedicated to continuously enhancing MDR capabilities and standing shoulder-to-shoulder with our customers to safeguard their operations.

Built-in financial protection, when it counts

The Rapid7 Breach Protection Warranty provides up to $1 million in breach-related coverage, based on the size of a customer’s environment. It’s designed to offset the real-world costs organizations face in the wake of a cyberattack, including:

  • Forensic investigation expenses
  • Legal consultation fees
  • Public relations costs
  • Post-security incident expenses

Unlike standalone breach coverage or third-party insurance policies, this warranty comes at no additional cost to eligible customers. There are no upsells or hidden fees, just value-added protection that’s already embedded in the service.

Integrated into a world-class detection and response program

Rapid7’s holistic and outcome-driven approach to detection and response includes unlimited digital forensics and incident response (DFIR), remote containment, and active remediation of commodity malware. All of these capabilities are amplified by the Rapid7 Command Platform, which seamlessly integrates vulnerability findings and threat intelligence to deliver complete coverage across the entire security incident lifecycle.

Together with the Breach Protection Warranty, this cohesive model transforms your cybersecurity program into one not just backed by technology and expertise, but also a financial safety net built into the service.

Simplifying breach response, not complicating it

A warranty is, at its core, a legal agreement—and it’s understandable that many come with conditions or “strings attached.” Some require customers to exhaust other forms of insurance before coverage kicks in. Others recapture the financial benefit through billable incident response services, which can quietly reduce the actual value received.

Rapid7 takes a different approach. We’ve built the Breach Protection Warranty to maximize customer value with transparency. There are no hidden clauses designed to funnel reimbursement back to us. And because unlimited incident response (IR) is already included in the service, customers don’t need to worry about separate IR contracts or an unexpected changing of the guard in the frenetic aftermath of an incident.

Strengthening resilience through readiness

While financial protection is the headline, eligibility for the warranty reinforces the fundamentals of a strong security posture. Customers must meet a set of best-practice requirements that align with Rapid7’s proven approach to resilience, including but not limited to:

  • Hardened endpoint configurations
  • Deployment of core protection modules such as Ransomware Prevention
  • Updated, compliant operating systems and software across covered assets

These aren’t just checkboxes; they’re meaningful controls that improve visibility, reduce risk, and better prepare customers to prevent and respond to threats, with the added benefit of unlocking a meaningful financial backstop.

Ready to learn more?

This is about more than cost coverage. It’s about trust—trust that your security investments are driving true resilience and that your provider is prepared to stand beside you when it matters most.

Rapid7’s Breach Protection Warranty is now available to all Managed Threat Complete Ultimate customers. To learn more about your eligibility or sign up, please reach out to your Rapid7 account team.

[1] IBM

InsightIDR AI Alert Triage Automatically Classifies Alerts with 99.93% Accuracy

Post Syndicated from Chris Wraight original https://blog.rapid7.com/2025/04/29/insightidr-ai-alert-triage-automatically-classifies-alerts-with-99-93-accuracy/

Rapid7 AI Alert Triage helps SOC analysts quickly and accurately triage thousands of daily alerts, improving efficiency and enabling focus.

One universal truth in Security Operations Centers (SOCs) is that analysts are overwhelmed by the high volume of alerts they receive. In a recent survey, SOC teams reported they are inundated with an average of 4,484 alerts daily, with a staggering 67% being ignored due to alert fatigue and the high volume of false positives. Respondents in the same survey also reported that the number of security alerts they received had “significantly increased” in the last three years. The result is missed or ignored alerts, which can expose an organization to legitimate threats and degrade SOC performance.

Introducing AI Alert Triage for InsightIDR

Rapid7’s AI Alert Triage, trained and tested by the Rapid7 global MDR service across trillions of alerts worldwide, will soon be available to users of our next-gen SIEM, InsightIDR, at no additional cost. The AI Alert Triage engine quickly suggests an initial disposition (benign or malicious) for alerts, explaining why that disposition was chosen and providing supporting information from the investigation.

The new AI Suggested Disposition field shows the alert classification, along with detailed information to assist the SOC analyst.

Without access to the Rapid7 AI Alert Triage capability, SOC teams can waste significant time manually evaluating and correctly classifying malicious alerts, increasing their threat exposure and contributing to SOC inefficiency. With AI Alert Triage, SOC analysts can automatically and accurately focus limited security resources on legitimate threats and improve SOC performance.

Built on Decades of AI Expertise in Security

Rapid7 is not new to infusing AI into its security applications. Rapid7 is a pioneer in AI development for security use cases, starting in our earliest days with our VM Expert System in the early 2000s. Since then, Rapid7 has integrated Generative AI into the Command Platform to supercharge SecOps and augment MDR services.

Our AI-powered platform processes 4.8 trillion alerts weekly for our MDR customer base with a 99.93% benign alert closure rate. This has resulted in hundreds of hours of manual effort saved for the SOC analysts.

AI Alert Triage Improves Your SOC’s Effectiveness

Having thousands of daily alerts automatically and correctly classified is a huge productivity boost for overwhelmed SOC analysts.

SOC analysts being overwhelmed by high volumes of daily alerts is hardly a new phenomenon. However, up until now, SOC analysts have been unable to effectively deal with this massive number of alerts, and organizations have become victims of alert fatigue. The same survey referenced above reports that analysts spend nearly 3 hours (2.7) each day manually triaging alerts, a figure rising to more than 4 hours a day for 27% of respondents. And, on average, security analysts are unable to deal with over two-thirds (67%) of the daily alerts they receive. What’s more, they say 83% of these alerts are false positives and not worth their time.

By using AI Alert Triage, Rapid7 customers can leverage decades of proven Rapid7 AI technology to quickly and accurately classify the deluge of alerts and not be forced into a situation where they intentionally ignore alerts. Key capabilities include:

  • Rapid identification and prioritization of genuine threats: AI Alert Triage helps customers quickly distinguish true positives from noise, enabling security teams to prioritize investigations based on validated, high-confidence alerts.
  • Enhanced Threat Detection Speed and Accuracy: Leveraging MDR-validated AI ensures alerts reflect real threats, helping SOC teams respond swiftly and confidently to advanced threats and subtle attack indicators.
  • Human oversight of automatic classification: AI Alert Triage has attained a 99.93% benign alert closure rate with nearly 5 trillion weekly alerts; every alert is documented, with full transparency and opportunity for human intervention.
  • Reduced Alert Fatigue and False Positives: With AI Alert Triage validated by MDR analysts, customers experience dramatically reduced false positives, significantly cutting down time wasted on non-critical alerts.
  • Streamlined Workflows and Focus: AI Alert Triage automates repetitive tasks to streamline initial analysis, enabling security teams to jumpstart investigations and dedicate more time and resources to critical initiatives.

Transform Your SOC Performance Through Proven AI Assistance

Unmatched 99.93% accuracy and speed in Rapid7’s AI models drive trust and confidence in automatic decision-making and reduce alert fatigue for SOC analysts, improving SOC performance and speed. At Rapid7, we are pioneering the infusion of artificial intelligence into the Command Platform, empowering SOCs around the globe and dramatically transforming their effectiveness through GenAI.

To learn more about AI Alert Triage, contact your account team or your Customer Success Advisor.

Deepening the MDR partnership: Rapid7 now delivers Active Remediation with Velociraptor

Post Syndicated from Conner Goldstein original https://blog.rapid7.com/2025/04/29/deepening-the-mdr-partnership-rapid7-now-delivers-active-remediation-with-velociraptor/

Rapid7 is expanding its response capabilities to meet the demands and relentless pace of today’s threat landscape – and the operational needs of our customers.

Partnership means many things to us here at Rapid7. It means showing up with trusted expertise, providing clear guidance in moments of uncertainty, and helping security teams stay ahead of ever-evolving threats. Most of all, we see partnership as foundational to building security resilience – and that requires not only a proactive, risk-aware mindset but also the capability to respond when the inevitable happens.

As attacks grow faster, more complex, and more persistent, the need for decisive, transparent remediation has become more urgent. Some estimates place the average time-to-ransom at just 16.88 hours[1]; with that kind of speed, every moment matters. We pride ourselves on a Managed Detection and Response (MDR) service that delivers best-in-class detection, investigation, and actionable response guidance. Now, we are evolving that partnership – and the strength of your security program – even further.

Introducing Active Remediation with Velociraptor

Powered by our best-in-class, open-source digital forensics and incident response (DFIR) tool, Rapid7 MDR analysts can take direct, approved remediation actions on your behalf – removing malware, terminating rogue processes, and restoring system integrity while minimizing the need to reimage affected endpoints unless it’s truly required. Every action is executed with precision, transparency, and within clearly defined boundaries.

This is more than a new capability. It’s a reflection of our commitment to move in lockstep with you – not just at the point of detection, but all the way through to resolution. From unlimited incident response support to deeply collaborative investigations and tailored recommendations, Rapid7 has always prioritized being hands-on when you need us most. Active Remediation with Velociraptor extends that same principle to the final – and often most difficult – step: taking action on your behalf to eradicate threats.

Delivered with Precision, Transparency, and Trust

Active Remediation with Velociraptor is designed not just to take action, but to take the right action, the right way. Every remediation workflow is executed by Rapid7’s expert analysts using Velociraptor’s purpose-built query language (VQL) – a DFIR language engineered for precision, traceability, and scale. This allows the analyst to target specific artifacts, processes, and configurations – avoiding the blunt-force actions that often lead to full endpoint reimaging.

  • You stay in control – Remediation is performed based on clearly defined and approved scopes and parameters aligned to your security policies.
  • You see what we do – Every action is logged, auditable, and built using readable logic within Velociraptor, with full visibility provided through detailed post-incident reports.
  • You gain precision without disruption – Remove only what’s malicious and reverse unauthorized configurations without pulling systems offline or fully reimaging machines.

Rapid7’s New Response Workflow

  1. Alert detection: Identify malicious activity across customer endpoints and network.
  2. Active Response: Quarantine affected endpoints to stem the spread of the attack.
  3. Rapid7 investigation: SOC validates threat, determines scope, and develops response plan.
  4. Active Remediation with Velociraptor: Rapid7 analysts remove malicious artifacts with precision.
  5. Mitigation guidance: Recommendations to help your team prevent threat reemergence.

Remediating in the Real World

Our approach brings analyst-led, logic-driven remediation into live environments – solving the post-containment challenges security teams face every day. Unlike session-based access that relies on endpoints being on and connected to the internet, Rapid7 is delivering remediation that meets the auditability, practicality, and scalability needs of the real world:

  • Targeted threat removal without reimaging: Identify and remove only malicious artifacts – files, processes, persistence mechanisms, or unauthorized configurations – linked to a confirmed threat.
  • Outcome: Your endpoints stay online and productive, while the threat is neutralized with minimal disruption. Avoiding unnecessary reimaging means faster recovery, reduced IT workload, and less downtime for end users.
  • Controlled execution with transparent logic: Every remediation workflow is written in VQL – visible and reviewable by customers before deployment. There’s no scripting or ‘trust us’ execution.
  • Outcome: Builds trust and accountability into the remediation process. You get full visibility into every action, supporting compliance requirements and reducing uncertainty in regulated environments.
  • Distributed remediation across endpoints: When multiple endpoints are compromised by a single campaign – such as credential theft malware – we will queue high-fidelity remediation workflows across many machines simultaneously – even if some are offline.
  • Outcome: Lays the foundation for consistent threat removal across your environment without manual intervention or system-by-system cleanup. This enables a timely, coordinated response that keeps pace with fast-moving attacks.
  • Reducing friction between security and IT teams: Rather than working through lengthy remediation steps with your IT team, we execute the most critical actions directly – within approved scope – and document every step.
  • Outcome: Fewer delays and less back-and-forth between teams. With Rapid7 handling the complete, end-to-end lifecycle of an alert, internal teams stay focused on business priorities, knowing remediation is being executed safely and effectively.

Setting the stage for remediation with Active Response

Remediation begins with strategic containment and detailed investigation. Rapid7’s Active Response enables rules-based quarantining of affected endpoints in the immediate aftermath of a credible threat detection. This stops lateral movement before it begins and preserves the system state for investigation.

Active Remediation builds directly on this foundation. By first containing the threat, we can then investigate confidently and move quickly to identify and remove malicious artifacts – mitigating the risk of reinfection or spread. The integrated workflow – from containment to investigation to remediation – helps ensure our response is not only fast, but precise.

Together, Active Response and Active Remediation form the cornerstone of a continuous response pipeline that reduces attacker dwell time, limits impact, and restores normal operations faster.

Unlimited incident response – now deeper than ever

Rapid7 MDR customers have long relied on Rapid7’s unlimited DFIR support to guide them through the most critical moments of a threat. That hands-on expertise – delivered without surprise fees, hourly caps, or the need to navigate third-party providers and tools – is a defining part of how we ensure customers receive the fastest, most comprehensive response possible.

Active Remediation builds on that foundation by closing the final gap in the response lifecycle. Where detection, containment, and investigation have long been Rapid7’s strengths, we can now fully execute on the next step: resolving the threat. This combination of expert-led triage and decisive, hands-on remediation delivers a more unified, end-to-end response – minimizing delays, reducing reliance on your internal resources, and accelerating your path to recovery. It’s not just about reacting faster – it’s about responding smarter, start to finish.

More than a capability – it’s a commitment

The Rapid7 MDR service has always been built around standing shoulder to shoulder with our customers, especially when it matters most. As we expand this partnership through the finale of the detection and response lifecycle – taking action to remove threats, reduce disruption, and accelerate recovery – we do it with the same transparency, accountability, and respect for your control that defines every part of the Rapid7 experience. In the name of building true security resilience, partnership doesn’t end with guidance – it means staying with you all the way through to resolution. Active Remediation with Velociraptor is in closed early access and will roll out to MDR customers in mid-May. To learn more, please contact your account team or Cybersecurity Advisor.

1 gbhackers

[$] Cache awareness for the CPU scheduler

Post Syndicated from corbet original https://lwn.net/Articles/1018334/

The kernel’s CPU scheduler has to balance a wide range of objectives. The
tasks in the system must be scheduled fairly, with latency for any given
task kept within bounds. All of the CPUs in the system should be kept busy
if there is enough work to do, but unneeded CPUs should be shut down to
reduce power consumption. A task should also run on the CPU that is most
likely to have cached the memory that task is using. This patch series
from Chen Yu aims to improve how the scheduler handles cache
locality for multi-threaded processes.

Driving down MTTR with Remediation Hub, Available in Rapid7 Exposure Command

Post Syndicated from Rapid7 original https://blog.rapid7.com/2025/04/29/driving-down-mttr-with-remediation-hub-available-in-rapid7-exposure-command/

Co-authored by Peter Whibley, Ed Montgomery, and Joel Alcon

Technology innovation combined with the highly fragmented nature of today’s IT landscape means that vulnerabilities are being exploited faster and at greater scale than ever. Security teams contend with a daily surge of new threat actors and attack vectors. Without a unified view of assets, business context, and compensating controls, they waste weeks identifying which risks are truly critical.

Many organizations try to tackle this challenge by implementing exposure management and risk-based vulnerability management (RBVM) approaches, where vulnerability data from various tools is consolidated into one dashboard. But many of these tools present risk scores without demonstrating a holistic view of the business impact of vulnerabilities, mitigating controls for endpoints, patch management status, and remediation steps.

Without that end-to-end context, security teams are struggling to keep up with the volume of new vulnerabilities. In fact, once the National Vulnerability Database (NVD) announced in February 2024 that it would no longer provide vulnerability scores for all CVEs, the shortcomings of traditional vulnerability management, including RBVM, became more evident.

From chasing vulnerabilities to proactively mitigating risk

Rapid7’s Remediation Hub enables security teams to go beyond simply identifying vulnerabilities and focus more on remediating risk. By augmenting vulnerability findings with business context, threat intelligence, and compensating controls, organizations gain a continuous, all-in-one view of how to detect and respond to risks across their enterprise. These new capabilities empower security teams to:

  1. Assess the impact of remediation steps. Reimagine your attack surface by viewing the number of vulnerabilities addressed by each remediation action.
  2. Prioritize remediation with confidence. Leverage dynamic, threat-aware risk scores to assess the criticality of issues and quickly go from vulnerability to action.
  3. Optimize risk mitigation. Accelerate risk response through streamlined remediation workflows.

Third-party vulnerability findings elevate risk remediation

Security teams leverage multiple vulnerability scanning tools for different parts of their infrastructure, including cloud environments, containers, web applications, and endpoints. Each tool reports findings in its own format and utilizes different scoring methods, making it difficult to get a clear, unified picture of an organization’s risk exposure.

By unifying this data into a centralized platform, security teams reduce unnecessary noise caused by redundant vulnerability findings, streamlining triage efforts, reducing silos, and driving faster, more informed remediation efforts.

Rapid7 Remediation Hub delivers this normalized view of third-party vulnerabilities, so teams can stop wasting time chasing low-impact issues or overlooking high-severity threats. The solution takes this unified lens further via risk scores that combine these vulnerability findings with business context to help security teams quickly identify the most critical vulnerabilities, allocate resources efficiently, and communicate risk more effectively to stakeholders. These capabilities not only boost operational efficiency, but also strengthen an organization’s security posture.

Context-based visibility into endpoint protection and patch management

Context is an essential component of managing risk in today’s increasingly complex technology landscape. By solely relying on vulnerability scores without also understanding business impact or breach likelihood, security teams are left with a hazy, incomplete view of their attack surface.

Rapid7 Exposure Command empowers security teams to prioritize vulnerabilities based on attacker behavior, exploitability, and potential impact – all without the need to export data into separate security tools. Rapid7 delivers deep, multi-layered risk scores calculated from Rapid7 Labs’ threat intelligence, first-party scans, third-party vulnerability findings, and an organization’s unique mitigating controls. Furthermore, Remediation Hub is seamlessly integrated with Rapid7 Surface Command, arming security teams with a continuous view of the key mitigating controls on assets across the enterprise, including the endpoint protection and patch management in place.

  • Endpoint protection – Remediation Hub displays which assets have active endpoint protection, as well as the protection type on each asset. Users can apply intuitive filters to home in on critical findings, such as assets that lack endpoint protection, and prioritize remediation via a risk-based approach that ranks those unprotected assets higher.
  • Patch management – Remediation Hub shows the patch management availability status of each asset, arming security teams with a view of assets that are available for patching by a patch management system. Users can filter on assets with vulnerabilities where no patching is active.

Faster risk response, fewer security silos

Security teams often operate in silos, with one team handling risk identification and another focused on remediation. CISA recommends that critical vulnerabilities be remediated within 15 calendar days of initial detection, but to achieve this, organizations require tight collaboration between these disparate teams.

Unfortunately, because these groups operate with poorly integrated security tools, going from vulnerability finding to risk remediation can take months, with some vulnerabilities going unpatched for years. For instance, the 2024 Verizon Data Breach Investigations Report finds that it takes an estimated 55 days to remediate 50% of critical vulnerabilities once their patches are available.

Remediation Hub tackles this challenge with purpose-built SOAR integrations that help improve collaboration and drive down MTTR (mean time to remediate). The new capabilities automatically trigger remediation workflows, with notifications auto-generated and sent to adjacent teams responsible for implementing the recommended remediations.

For example, users can leverage Remediation Hub to automatically trigger a workflow in Jira or create an incident report in ServiceNow based on the severity or business impact of a vulnerability. Each workflow is fully customizable based on unique security thresholds.

Embracing faster, continuous exposure management

Organizations are rapidly transitioning from traditional vulnerability management to continuous exposure management. Rapid7’s Remediation Hub – an integral component of the Exposure Command platform – empowers security teams to embrace the shift.

With a remediation-based approach to vulnerability management and risk reduction, organizations are taking command of their attack surface and discovering a simpler, more effective approach to managing and truly mitigating risk.

If you are interested in learning more about Remediation Hub and our Exposure Command platform, check out our Exposure Command product tour.
