Tag Archives: Advanced (300)

Run Apache Spark and Iceberg 4.5x faster than open source Spark with Amazon EMR

Post Syndicated from Atul Payapilly original https://aws.amazon.com/blogs/big-data/run-apache-spark-and-iceberg-4-5x-faster-than-open-source-spark-with-amazon-emr/

This post shows how Amazon EMR 7.12 can make your Apache Spark and Iceberg workloads up to 4.5x faster performance.

The Amazon EMR runtime for Apache Spark provides a high-performance runtime environment with full API compatibility with open source Apache Spark and Apache Iceberg. Amazon EMR on EC2, Amazon EMR Serverless, Amazon EMR on Amazon EKS, Amazon EMR on AWS Outposts and AWS Glue use the optimized runtimes.

Our benchmarks show Amazon EMR 7.12 runs TPC-DS 3 TB workloads 4.5x faster than open source Spark 3.5.6 with Iceberg 1.10.0.

Performance improvements include optimizations for metadata caching, parallel I/O, adaptive query planning, data type handling, and fault tolerance. There were also some Iceberg specific regressions around data scans that we identified and fixed.

These optimizations let you match Parquet performance on Amazon EMR while keeping the key features of Iceberg key features: ACID transactions, time travel, and schema evolution.

Benchmark results compared to open source

To assess the performance of the Spark engine with the Iceberg table format, we performed benchmark tests using the 3 TB TPC-DS dataset, version 2.13, a popular industry standard benchmark. Benchmark tests for the Amazon EMR runtime for Apache Spark and Apache Iceberg were conducted on Amazon EMR 7.12 EC2 clusters compared to open source Apache Spark 3.5.6 and Apache Iceberg 1.10.0 on EC2 clusters.

Note: Our results derived from the TPC-DS dataset are not directly comparable to the official TPC-DS results due to setup differences.

The setup instructions and technical details are available in our GitHub repository. To minimize the influence of external catalogs like AWS Glue and Hive, we used the Hadoop catalog for the Iceberg tables. This uses the underlying file system, specifically Amazon S3, as the catalog. We can define this setup by configuring the property spark.sql.catalog.<catalog_name>.type. The fact tables used the default partitioning by the date column, which vary from 200–2,100 partitions. No precalculated statistics were used for these tables.

We ran a total of 104 SparkSQL queries in 3 sequential rounds, and the average runtime of each query across these rounds was taken for comparison. The average runtime for the 3 rounds on Amazon EMR 7.12 with Iceberg enabled was 0.37 hours, demonstrating a 4.5x speed increase compared to open source Spark 3.5.6 and Iceberg 1.10.0. The following figure presents the total runtimes in seconds.

The following table summarizes the metrics.

Metric Amazon EMR 7.12 on EC2 Amazon EMR 7.5 on EC2 Open source Apache Spark 3.5.6 and Apache Iceberg 1.10.0
Average runtime in seconds 1349.62 1535.62 6113.92
Geometric mean over queries in seconds 7.45910 8.30046 22.31854
Cost* $4.81 $5.47 $17.65

*Detailed cost estimates are discussed later in this post.

The following chart demonstrates the per-query performance improvement of Amazon EMR 7.12 relative to open source Spark 3.5.6 and Iceberg 1.10.0. The extent of the speedup varies from one query to another, with the fastest up to 13.6x faster for q23b, with Amazon EMR outperforming open source Spark with Iceberg tables. The horizontal axis arranges the TPC-DS 3TB benchmark queries in descending order based on the performance improvement seen with Amazon EMR, and the vertical axis depicts the magnitude of this speedup as a ratio.

Cost comparison breakdown

Our benchmark provides the total runtime and geometric mean data to assess the performance of Spark and Iceberg in a complex, real-world decision support scenario. For additional insights, we also examine the cost aspect. We calculate cost estimates using formulas that account for EC2 On-Demand instances, Amazon Elastic Block Store (Amazon EBS), and Amazon EMR expenses.

  • Amazon EC2 cost (includes SSD cost) = number of instances * r5d.4xlarge hourly rate * job runtime in hours
    • 4xlarge hourly rate = $1.152 per hour
  • Root Amazon EBS cost = number of instances * Amazon EBS per GB-hourly rate * root EBS volume size * job runtime in hours
  • Amazon EMR cost = number of instances * r5d.4xlarge Amazon EMR cost * job runtime in hours
    • 4xlarge Amazon EMR cost = $0.27 per hour
  • Total cost = Amazon EC2 cost + root Amazon EBS cost + Amazon EMR cost

The calculations reveal that the Amazon EMR 7.12 benchmark yields a 3.6x cost efficiency improvement over open source Spark 3.5.6 and Iceberg 1.10.0 in running the benchmark job.

Metric Amazon EMR 7.12 Amazon EMR 7.5 Open source Apache Spark 3.5.6 and Apache Iceberg 1.10.0
Runtime in seconds 1349.62 1535.62 6113.92

Number of EC2 instances

(Includes primary node)

9 9 9
Amazon EBS Size 20gb 20gb 20gb

Amazon EC2

(Total runtime cost)

$3.89 $4.42 $17.61
Amazon EBS cost $0.01 $0.01 $0.04
Amazon EMR cost $0.91 $1.04 $0
Total cost $4.81 $5.47 $17.65
Cost savings Amazon EMR 7.12 is 3.6x better Amazon EMR 7.5 is 3.2x better Baseline

In addition to the time-based metrics discussed so far, data from Spark event logs show that Amazon EMR scanned approximately 4.3x less data from Amazon S3 and 5.3x fewer records than the open source version in the TPC-DS 3 TB benchmark. This reduction in Amazon S3 data scanning contributes directly to cost savings for Amazon EMR workloads.

Run open source Apache Spark benchmarks on Apache Iceberg tables

We used separate EC2 clusters, each equipped with 9 r5d.4xlarge instances, for testing both open source Spark 3.5.6 and Amazon EMR 7.12 for Iceberg workload. The primary node was equipped with 16 vCPU and 128 GB of memory, and the 8 worker nodes together had 128 vCPU and 1024 GB of memory. We conducted tests using the Amazon EMR default settings to showcase the typical user experience and minimally adjusted the settings of Spark and Iceberg to maintain a balanced comparison.

The following table summarizes the Amazon EC2 configurations for the primary node and 8 worker nodes of type r5d.4xlarge.

EC2 Instance vCPU Memory (GiB) Instance storage (GB) EBS root volume (GB)
r5d.4xlarge 16 128 2 x 300 NVMe SSD 20 GB

Prerequisites

The following prerequisites are required to run the benchmarking:

  1. Using the instructions in the emr-spark-benchmark GitHub repository, set up the TPC-DS source data in your S3 bucket and on your local computer.
  2. Build the benchmark application following the steps provided in Steps to build spark-benchmark-assembly application and copy the benchmark application to your S3 bucket. Alternatively, copy spark-benchmark-assembly-3.5.6.jar to your S3 bucket.
  3. Create Iceberg tables from the TPC-DS source data. Follow the instructions on GitHub to create Iceberg tables using the Hadoop catalog. For example, the following code uses an Amazon EMR 7.12 cluster with Iceberg enabled to create the tables:
aws emr add-steps --cluster-id <cluster-id> --steps Type=Spark,Name="Create Iceberg Tables",
Args=[--class,com.amazonaws.eks.tpcds.CreateIcebergTables,--conf,spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,
--conf,spark.sql.catalog.hadoop_catalog=org.apache.iceberg.spark.SparkCatalog,
--conf,spark.sql.catalog.hadoop_catalog.type=hadoop,
--conf,spark.sql.catalog.hadoop_catalog.warehouse=s3://<bucket>/<warehouse_path>/,
--conf,spark.sql.catalog.hadoop_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO,
s3://<bucket>/<jar_location>/spark-benchmark-assembly-3.5.6.jar,s3://blogpost-sparkoneks-us-east-1/blog/BLOG_TPCDS-TEST-3T-partitioned/,
/home/hadoop/tpcds-kit/tools,parquet,3000,true,<database_name>,true,true],ActionOnFailure=CONTINUE --region <AWS region>

Note: The Hadoop catalog warehouse location and database name from the preceding step. We use the same Iceberg tables to run benchmarks with Amazon EMR 7.12 and open source Spark.

This benchmark application is built from the branch tpcds-v2.13_iceberg. If you’re building a new benchmark application, switch to the correct branch after downloading the source code from the GitHub repository.

Create and configure a YARN cluster on Amazon EC2

To compare Iceberg performance between Amazon EMR on Amazon EC2 and open source Spark on Amazon EC2, follow the instructions in the emr-spark-benchmark GitHub repository to create an open source Spark cluster on Amazon EC2 using Flintrock with 8 worker nodes.

Based on the cluster selection for this test, the following configurations are used:

Make sure to replace the placeholder <private ip of primary node>, in the yarn-site.xml file, with the primary node’s IP address of your Flintrock cluster.

Run the TPC-DS benchmark with Apache Spark 3.5.6 and Apache Iceberg 1.10.0

Complete the following steps to run the TPC-DS benchmark:

  1. Log in to the open source cluster primary node using flintrock login $CLUSTER_NAME.
  2. Submit your Spark job:
    1. Choose the correct Iceberg catalog warehouse location and database that has the created Iceberg tables.
    2. The results are created in s3://<YOUR_S3_BUCKET>/benchmark_run.
    3. You can track progress in /media/ephemeral0/spark_run.log.
spark-submit \
--master yarn \
--deploy-mode client \
--class com.amazonaws.eks.tpcds.BenchmarkSQL \
--conf spark.driver.cores=4 \
--conf spark.driver.memory=10g \
--conf spark.executor.cores=16 \
--conf spark.executor.memory=100g \
--conf spark.executor.instances=8 \
--conf spark.network.timeout=2000 \
--conf spark.executor.heartbeatInterval=300s \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.shuffle.service.enabled=false \
--conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.InstanceProfileCredentialsProvider \
--conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.jars.packages=org.apache.hadoop:hadoop-aws:3.3.4,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.0,org.apache.iceberg:iceberg-aws-bundle:1.10.0 \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions   \
--conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog    \
--conf spark.sql.catalog.local.type=hadoop  \
--conf spark.sql.catalog.local.warehouse=s3a://<YOUR_S3_BUCKET>/<warehouse_path>/ \
--conf spark.sql.defaultCatalog=local   \
--conf spark.sql.catalog.local.io-impl=org.apache.iceberg.aws.s3.S3FileIO   \
spark-benchmark-assembly-3.5.6.jar   \
s3://<YOUR_S3_BUCKET>/benchmark_run 3000 1 false  \
q1-v2.13,q10-v2.13,q11-v2.13,q12-v2.13,q13-v2.13,q14a-v2.13,q14b-v2.13,q15-v2.13,q16-v2.13,\
q17-v2.13,q18-v2.13,q19-v2.13,q2-v2.13,q20-v2.13,q21-v2.13,q22-v2.13,q23a-v2.13,q23b-v2.13,\
q24a-v2.13,q24b-v2.13,q25-v2.13,q26-v2.13,q27-v2.13,q28-v2.13,q29-v2.13,q3-v2.13,q30-v2.13,\
q31-v2.13,q32-v2.13,q33-v2.13,q34-v2.13,q35-v2.13,q36-v2.13,q37-v2.13,q38-v2.13,q39a-v2.13,\
q39b-v2.13,q4-v2.13,q40-v2.13,q41-v2.13,q42-v2.13,q43-v2.13,q44-v2.13,q45-v2.13,q46-v2.13,\
q47-v2.13,q48-v2.13,q49-v2.13,q5-v2.13,q50-v2.13,q51-v2.13,q52-v2.13,q53-v2.13,q54-v2.13,\
q55-v2.13,q56-v2.13,q57-v2.13,q58-v2.13,q59-v2.13,q6-v2.13,q60-v2.13,q61-v2.13,q62-v2.13,\
q63-v2.13,q64-v2.13,q65-v2.13,q66-v2.13,q67-v2.13,q68-v2.13,q69-v2.13,q7-v2.13,q70-v2.13,\
q71-v2.13,q72-v2.13,q73-v2.13,q74-v2.13,q75-v2.13,q76-v2.13,q77-v2.13,q78-v2.13,q79-v2.13,\
q8-v2.13,q80-v2.13,q81-v2.13,q82-v2.13,q83-v2.13,q84-v2.13,q85-v2.13,q86-v2.13,q87-v2.13,\
q88-v2.13,q89-v2.13,q9-v2.13,q90-v2.13,q91-v2.13,q92-v2.13,q93-v2.13,q94-v2.13,q95-v2.13,\
q96-v2.13,q97-v2.13,q98-v2.13,q99-v2.13,ss_max-v2.13    \
true <database> > /media/ephemeral0/spark_run.log 2>&1 &!

Summarize the results

After the Spark job finishes, retrieve the test result file from the output S3 bucket at s3://<YOUR_S3_BUCKET>/benchmark_run/timestamp=xxxx/summary.csv/xxx.csv. This can be done either through the Amazon S3 console by navigating to the specified bucket location or by using the Amazon Command Line Interface (AWS CLI). The Spark benchmark application organizes the data by creating a timestamp folder and placing a summary file within a folder labeled summary.csv. The output CSV files contain 4 columns without headers:

  • Query name
  • Median time
  • Minimum time
  • Maximum time

With the data from 3 separate test runs with 1 iteration each time, we can calculate the average and geometric mean of the benchmark runtimes.

Run the TPC-DS benchmark with Amazon EMR runtime for Apache Spark

Most of the instructions are similar to Steps to run Spark Benchmarking with a few Iceberg-specific details.

Prerequisites

Complete the following prerequisite steps:

  1. Run aws configure to configure the AWS CLI shell to point to the benchmarking AWS account. Refer to Configure the AWS CLI for instructions.
  2. Upload the benchmark application JAR file to Amazon S3.

Deploy Amazon EMR cluster and run the benchmark job

Complete the following steps to run the benchmark job:

  1. Use the AWS CLI command as shown in Deploy EMR on EC2 Cluster and run benchmark job to deploy an Amazon EMR on EC2 cluster. Make sure to enable Iceberg. See Create an Iceberg cluster for more details. Choose the correct Amazon EMR version, root volume size, and same resource configuration as the open source Flintrock setup. Refer to create-cluster for a detailed description of the AWS CLI options.
  2. Store the cluster ID from the response. We need this for the next step.
  3. Submit the benchmark job in Amazon EMR using add-steps from the AWS CLI:
    1. Replace <cluster ID> with the cluster ID from Step 2.
    2. The benchmark application is at s3://<your-bucket>/spark-benchmark-assembly-3.5.6.jar.
    3. Choose the correct Iceberg catalog warehouse location and database that has the created Iceberg tables. This should be the same as the one used for the open source TPC-DS benchmark run.
    4. The results will be in s3://<your-bucket>/benchmark_run.
aws emr add-steps   --cluster-id <cluster-id>
--steps Type=Spark,Name="SPARK Iceberg EMR TPCDS Benchmark Job",
Args=[--class,com.amazonaws.eks.tpcds.BenchmarkSQL,
--conf,spark.driver.cores=4,
--conf,spark.driver.memory=10g,
--conf,spark.executor.cores=16,
--conf,spark.executor.memory=100g,
--conf,spark.executor.instances=8,
--conf,spark.network.timeout=2000,
--conf,spark.executor.heartbeatInterval=300s,
--conf,spark.dynamicAllocation.enabled=false,
--conf,spark.shuffle.service.enabled=false,
--conf,spark.sql.iceberg.data-prefetch.enabled=true,
--conf,spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,
--conf,spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog,
--conf,spark.sql.catalog.local.type=hadoop,
--conf,spark.sql.catalog.local.warehouse=s3://<your-bucket>/<warehouse-path>,
--conf,spark.sql.defaultCatalog=local,
--conf,spark.sql.catalog.local.io-impl=org.apache.iceberg.aws.s3.S3FileIO,
s3://<your-bucket>/spark-benchmark-assembly-3.5.6.jar,
s3://<your-bucket>/benchmark_run,3000,1,false,
'q1-v2.13\,q10-v2.13\,q11-v2.13\,q12-v2.13\,q13-v2.13\,q14a-v2.13\,q14b-v2.13\,q15-v2.13\,q16-v2.13\,q17-v2.13\,q18-v2.13\,q19-v2.13\,q2-v2.13\,q20-v2.13\,q21-v2.13\,q22-v2.13\,q23a-v2.13\,q23b-v2.13\,q24a-v2.13\,q24b-v2.13\,q25-v2.13\,q26-v2.13\,q27-v2.13\,q28-v2.13\,q29-v2.13\,q3-v2.13\,q30-v2.13\,q31-v2.13\,q32-v2.13\,q33-v2.13\,q34-v2.13\,q35-v2.13\,q36-v2.13\,q37-v2.13\,q38-v2.13\,q39a-v2.13\,q39b-v2.13\,q4-v2.13\,q40-v2.13\,q41-v2.13\,q42-v2.13\,q43-v2.13\,q44-v2.13\,q45-v2.13\,q46-v2.13\,q47-v2.13\,q48-v2.13\,q49-v2.13\,q5-v2.13\,q50-v2.13\,q51-v2.13\,q52-v2.13\,q53-v2.13\,q54-v2.13\,q55-v2.13\,q56-v2.13\,q57-v2.13\,q58-v2.13\,q59-v2.13\,q6-v2.13\,q60-v2.13\,q61-v2.13\,q62-v2.13\,q63-v2.13\,q64-v2.13\,q65-v2.13\,q66-v2.13\,q67-v2.13\,q68-v2.13\,q69-v2.13\,q7-v2.13\,q70-v2.13\,q71-v2.13\,q72-v2.13\,q73-v2.13\,q74-v2.13\,q75-v2.13\,q76-v2.13\,q77-v2.13\,q78-v2.13\,q79-v2.13\,q8-v2.13\,q80-v2.13\,q81-v2.13\,q82-v2.13\,q83-v2.13\,q84-v2.13\,q85-v2.13\,q86-v2.13\,q87-v2.13\,q88-v2.13\,q89-v2.13\,q9-v2.13\,q90-v2.13\,q91-v2.13\,q92-v2.13\,q93-v2.13\,q94-v2.13\,q95-v2.13\,q96-v2.13\,q97-v2.13\,q98-v2.13\,q99-v2.13\,ss_max-v2.13',
true,<database>],ActionOnFailure=CONTINUE --region <aws-region>

Summarize the results

After the step is complete, you can see the summarized benchmark result at s3://<YOUR_S3_BUCKET>/benchmark_run/timestamp=xxxx/summary.csv/xxx.csv in the same way as the previous run and compute the average and geometric mean of the query runtimes.

Clean up

To help prevent future charges, delete the resources you created by following the instructions provided in the Cleanup section of the GitHub repository.

Summary

Amazon EMR optimizes the runtime for Spark when used with Iceberg tables, achieving 4.5x faster performance than open source Apache Spark 3.5.6 and Apache Iceberg 1.10.0 with Amazon EMR 7.12 on TPC-DS 3 TB, v2.13. This represents a significant advancement from Amazon EMR 7.5, which delivered 3.6x faster performance and closes the gap to parquet performance on Amazon EMR so customers can use the benefits of Iceberg without a performance penalty.

We encourage you to keep up to date with the latest Amazon EMR releases to fully benefit from ongoing performance improvements.

To stay informed, subscribe to the RSS feed for the AWS Big Data Blog, where you can find updates on the Amazon EMR runtime for Spark and Iceberg, as well as tips on configuration best practices and tuning recommendations.


About the authors

Atul Felix Payapilly is a software development engineer for Amazon EMR at Amazon Web Services.

Akshaya KP is a software development engineer for Amazon EMR at Amazon Web Services.

Hari Kishore Chaparala is a software development engineer for Amazon EMR at Amazon Web Services.

Giovanni Matteo is the Senior Manager for the Amazon EMR Spark and Iceberg group.

Apache Spark encryption performance improvement with Amazon EMR 7.9

Post Syndicated from Sonu Kumar Singh original https://aws.amazon.com/blogs/big-data/apache-spark-encryption-performance-improvement-with-amazon-emr-7-9/

The Amazon EMR runtime for Apache Spark is a performance-optimized runtime for Apache Spark that is 100% API compatible with open source Apache Spark. With Amazon EMR release 7.9.0, the EMR runtime for Apache Spark introduces significant performance improvements for encrypted workloads, supporting Spark version 3.5.5.

For compliance and security requirements, many customers need to enable Apache Spark’s local storage encryption (spark.io.encryption.enabled = true) in addition to Amazon Simple Storage Service (Amazon S3) encryption (such as server-side encryption (SSE) or AWS Key Management Service (AWS KMS)). This feature encrypts shuffle files, cached data, and other intermediate data written to local disk during Spark operations, protecting sensitive data at rest on Amazon EMR cluster instances.

Industries subject to regulations such as the Health Insurance Portability and Accountability Act (HIPAA) for healthcare, Payment Card Industry Data Security Standard (PCI-DSS) for financial services, General Data Protection Regulation (GDPR) for personal data, and Federal Risk and Authorization Management Program (FedRAMP) for government often require encryption of all data at rest, including temporary files on local storage. While Amazon S3 encryption protects data in object storage, Spark’s I/O encryption secures the intermediate shuffle and spill data that Spark writes to local disk during distributed processing—data that never reaches Amazon S3 but might contain sensitive information extracted from source datasets. Generally, encrypted operations require additional computational overhead that can impact overall job performance.

With the built-in encryption optimizations of Amazon EMR 7.9.0, customers might see significant performance improvements in their Apache Spark applications without requiring any application changes. In our performance benchmark tests, derived from TPC-DS performance tests at 3 TB scale, we observed up to 20% faster performance with the EMR 7.9 optimized Spark runtime compared to Spark without these optimizations. Individual results may vary depending on specific workloads and configurations.

In this post, we analyze the results from our benchmark tests comparing the Amazon EMR 7.9 optimized Spark runtime against Spark 3.5.5 without encryption optimizations. We walk through a detailed cost analysis and provide step-by-step instructions to reproduce the benchmark.

Results observed

To evaluate the performance improvements, we used an open source Spark performance test utility derived from the TPC-DS performance test toolkit. We ran the tests on two nine-node (eight core nodes and one primary node) r5d.4xlarge Amazon EMR 7.9.0 clusters, comparing two configurations:

  • Baseline: EMR 7.9.0 cluster with a bootstrap action installing Spark 3.5.5 without encryption optimizations
  • Optimized: EMR 7.9.0 cluster using the EMR Spark 3.5.5 runtime with encryption optimizations

Both tests used data stored in Amazon Simple Storage Service (Amazon S3). All data processing was configured identically except for the Spark runtime version.

To maintain benchmarking consistency and ensure a consistent, equivalent comparison, we disabled Dynamic Resource Allocation (DRA) in both test configurations. This approach eliminates variability from dynamic scaling and so we can measure pure computational performance improvements.

The following table shows the total job runtime for all queries (in seconds) in the 3 TB query dataset between the baseline and Amazon EMR 7.9 optimized configurations:

Configuration Total runtime (seconds) Geometric mean (seconds) Performance improvement
Baseline (Spark 3.5.5 without optimization) 1,485 10.24
EMR 7.9 (with encryption optimization) 1,176 8.15 20% faster

We observed that our TPC-DS tests with the Amazon EMR 7.9 optimized Spark runtime completed about 20% faster based on total runtime and 20% faster based on geometric mean compared to the baseline configuration.

The encryption optimizations in Amazon EMR 7.9 deliver performance benefits through:

  • Improved shuffle and decryption operations reducing overhead during data exchange without compromising security
  • Better memory management for intermediate results

Cost analysis

The performance improvements of the Amazon EMR 7.9 optimized Spark runtime directly translate to lower costs. We realized an approximately 20% cost savings running the benchmark application with encryption optimizations compared to the baseline configuration, because of reduced hours of EMR, Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Elastic Block Store (Amazon EBS) using General Purpose SSD (gp2).

The following table summarizes the cost comparison in the us-east-1 AWS Region:

Configuration Runtime (hours) Estimated cost Total EC2 instances Total vCPU Total memory (GiB) Root device (EBS)
Baseline: Spark 3.5.5 without optimization, 1 primary and 8 core nodes 0.41 $5.28 9 144 1152 64 GiB gp2
Amazon EMR 7.9 with optimization, 1 primary and 8 core nodes 0.33 $4.25 9 144 1152 64 GiB gp2

Cost breakdown

Formulas used:

  • Amazon EMR cost – Number of instances × EMR hourly rate × Runtime hours
  • Amazon EC2 cost – Number of instances × EC2 hourly rate × Runtime hour)
  • Amazon EBS cost(EBS cost per GB per month ÷ hours in a month) × EBS volume size × number of instances × runtime hours

Note: EBS is priced monthly ($0.1 per GB per month), so we divide by 730 hours to convert to an hourly rate. EMR and EC2 are already priced hourly, so no conversion is needed.

Baseline configuration (0.41 hours):

  • Amazon EMR cost – 9 × $0.27 × 0.41 = $1.00
  • Amazon EC2 cost – 9 × $1.152 × 0.41 = $4.25
  • Amazon EBS cost – ($0.1/730 × 64 × 9 × 0.41) = $0.032
  • Total cost – $5.28

EMR 7.9 optimized configuration (0.33 hours):

  • Amazon EMR cost – (9 × $0.27 × 0.33) = $0.80
  • Amazon EC2 cost – (9 × $1.152 × 0.33) = $3.42
  • Amazon EBS cost – ($0.1/730 × 64 × 9 × 0.33) = $0.025
  • Total cost: $4.25

Total cost savings: 20% per benchmark run, which scales linearly with your production workload frequency.

Set up EMR benchmarking

For detailed instructions and scripts, see the companion GitHub repository.

Prerequisites

To set up Amazon EMR benchmarking, start by completing the following prerequisite steps:

  1. Configure your AWS Command Line Interface (AWS CLI) by running aws configure to point to your benchmarking account,
  2. Create an S3 bucket for test data and results.
  3. Copy the TPC-DS 3TB source data from a publicly available dataset to your S3 bucket using the following command:
    aws s3 cp s3://blogpost-sparkoneks-us-east-1/blog/BLOG_TPCDS-TEST-3T-partitioned s3://<YOUR-BUCKET-NAME>/BLOG_TPCDS-TEST-3T-partitioned --recursive

    Replace <YOUR-BUCKET-NAME> with the name of the S3 bucket you created in step 2.

  4. Build or download the benchmark application JAR file (spark-benchmark-assembly-3.3.0.jar)
  5. Ensure you have appropriate AWS Identity Access Management (IAM) roles for EMR cluster creation and Amazon S3 access

Deploy the baseline EMR cluster (without optimization)

Step 1: Launch EMR 7.9.0 cluster with bootstrap action

The baseline configuration uses a bootstrap action to install Spark 3.5.5 without encryption optimizations. We have made the bootstrap script publicly available in an S3 bucket for your convenience.

Create the default Amazon EMR roles:

aws emr create-default-roles

Now create the cluster:

aws emr create-cluster \
  --name "EMR-7.9-Baseline-Spark-3.5.5" \
  --release-label emr-7.9.0 \
  --applications Name=Spark \
  --ec2-attributes SubnetId=<YOUR-SUBNET-ID>,InstanceProfile=EMR_EC2_DefaultRole  \
  --service-role EMR_DefaultRole
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=r5d.4xlarge \
    InstanceGroupType=CORE,InstanceCount=8,InstanceType=r5d.4xlarge \
  --bootstrap-actions \
    Path=s3://spark-ba/install-spark-3-5-5-no-encryption.sh,Name="install spark 3.5.5 without encryption optimization" \
  --use-default-roles \
  --log-uri s3://<YOUR-BUCKET-NAME>/logs/baseline/

Note: The bootstrap script is available in a public S3 bucket at s3://spark-ba/install-spark-3-5-5-no-encryption.sh. This script installs Apache Spark 3.5.5 without the encryption optimizations present in the Amazon EMR runtime.

Step 2: Submit the benchmark job to the baseline cluster

Next submit the Spark job using the following commands:

aws emr add-steps \
  --cluster-id <YOUR-BASELINE-CLUSTER-ID> \  
  --steps 'Type=Spark,Name="EMR-7.9-Baseline-Spark-3.5.5 Step",ActionOnFailure=CONTINUE,Args=["--deploy-mode","client","--conf","spark.io.encryption.enabled=false","--class","com.amazonaws.eks.tpcds.BenchmarkSQL","s3://<YOUR-BUCKET-NAME>/jar/spark-benchmark-assembly-3.3.0.jar","s3:// <YOUR-BUCKET-NAME>/blog/BLOG_TPCDS-TEST-3T-partitioned","s3:// <YOUR-BUCKET-NAME>/blog/BASELINE_TPCDS-TEST-3T-RESULT","/opt/tpcds-kit/tools","parquet","3000","3","false","q1-v2.4,q10-v2.4,q11-v2.4,q12-v2.4,q13-v2.4,q14a-v2.4,q14b-v2.4,q15-v2.4,q16-v2.4,q17-v2.4,q18-v2.4,q19-v2.4,q2-v2.4,q20-v2.4,q21-v2.4,q22-v2.4,q23a-v2.4,q23b-v2.4,q24a-v2.4,q24b-v2.4,q25-v2.4,q26-v2.4,q27-v2.4,q28-v2.4,q29-v2.4,q3-v2.4,q30-v2.4,q31-v2.4,q32-v2.4,q33-v2.4,q34-v2.4,q35-v2.4,q36-v2.4,q37-v2.4,q38-v2.4,q39a-v2.4,q39b-v2.4,q4-v2.4,q40-v2.4,q41-v2.4,q42-v2.4,q43-v2.4,q44-v2.4,q45-v2.4,q46-v2.4,q47-v2.4,q48-v2.4,q49-v2.4,q5-v2.4,q50-v2.4,q51-v2.4,q52-v2.4,q53-v2.4,q54-v2.4,q55-v2.4,q56-v2.4,q57-v2.4,q58-v2.4,q59-v2.4,q6-v2.4,q60-v2.4,q61-v2.4,q62-v2.4,q63-v2.4,q64-v2.4,q65-v2.4,q66-v2.4,q67-v2.4,q68-v2.4,q69-v2.4,q7-v2.4,q70-v2.4,q71-v2.4,q72-v2.4,q73-v2.4,q74-v2.4,q75-v2.4,q76-v2.4,q77-v2.4,q78-v2.4,q79-v2.4,q8-v2.4,q80-v2.4,q81-v2.4,q82-v2.4,q83-v2.4,q84-v2.4,q85-v2.4,q86-v2.4,q87-v2.4,q88-v2.4,q89-v2.4,q9-v2.4,q90-v2.4,q91-v2.4,q92-v2.4,q93-v2.4,q94-v2.4,q95-v2.4,q96-v2.4,q97-v2.4,q98-v2.4,q99-v2.4,ss_max-v2.4","true"]'

Deploy the optimized EMR cluster (with encryption optimization)

Step 1: Launch EMR 7.9.0 cluster with Spark runtime

The optimized configuration uses the EMR 7.9.0 Spark runtime without any bootstrap actions:

aws emr create-cluster \
  --name "EMR-7.9-Optimized-Native-Spark" \
  --release-label emr-7.9.0 \
  --applications Name=Spark \
  --ec2-attributes SubnetId=<YOUR-SUBNET-ID>,InstanceProfile=EMR_EC2_DefaultRole \
  --service-role EMR_DefaultRole
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=r5d.4xlarge \
    InstanceGroupType=CORE,InstanceCount=8,InstanceType=r5d.4xlarge \
  --use-default-roles \
  --log-uri s3://<YOUR-BUCKET-NAME>/logs/optimized/

Example:

aws emr create-cluster \
--name "EMR-7.9-Optimized-Native-Spark" \
--release-label emr-7.9.0 \
--applications Name=Spark \
--ec2-attributes SubnetId=subnet-08a5f71f92bc8a801 \
--instance-groups \
InstanceGroupType=MASTER,InstanceCount=1,InstanceType=r5d.4xlarge \
InstanceGroupType=CORE,InstanceCount=8,InstanceType=r5d.4xlarge \
--bootstrap-actions \
Path=s3://spark-ba/install-spark-3-5-5-no-encryption.sh,Name="install spark 3.5.5 without encryption optimization" \
--use-default-roles \
--log-uri s3://aws-logs-123456789012-us-west-2/elasticmapreduce/

Step 2: Submit the benchmark job to optimized cluster

ext submit the Spark job using the following commands:

aws emr add-steps \
  --cluster-id <YOUR-OPTIMIZED-CLUSTER-ID> \ 
  --steps 'Type=Spark,Name="EMR-7.9-Optimized-Native-Spark Step",ActionOnFailure=CONTINUE,Args=["--deploy-mode","client","--conf","spark.io.encryption.enabled=true","--class","com.amazonaws.eks.tpcds.BenchmarkSQL","s3://<YOUR-BUCKET-NAME>/jar/spark-benchmark-assembly-3.3.0.jar","s3://<YOUR-BUCKET-NAME>/blog/BLOG_TPCDS-TEST-3T-partitioned","s3://<YOUR-BUCKET-NAME>/blog/BASELINE_TPCDS-TEST-3T-RESULT","/opt/tpcds-kit/tools","parquet","3000","3","false","q1-v2.4,q10-v2.4,q11-v2.4,q12-v2.4,q13-v2.4,q14a-v2.4,q14b-v2.4,q15-v2.4,q16-v2.4,q17-v2.4,q18-v2.4,q19-v2.4,q2-v2.4,q20-v2.4,q21-v2.4,q22-v2.4,q23a-v2.4,q23b-v2.4,q24a-v2.4,q24b-v2.4,q25-v2.4,q26-v2.4,q27-v2.4,q28-v2.4,q29-v2.4,q3-v2.4,q30-v2.4,q31-v2.4,q32-v2.4,q33-v2.4,q34-v2.4,q35-v2.4,q36-v2.4,q37-v2.4,q38-v2.4,q39a-v2.4,q39b-v2.4,q4-v2.4,q40-v2.4,q41-v2.4,q42-v2.4,q43-v2.4,q44-v2.4,q45-v2.4,q46-v2.4,q47-v2.4,q48-v2.4,q49-v2.4,q5-v2.4,q50-v2.4,q51-v2.4,q52-v2.4,q53-v2.4,q54-v2.4,q55-v2.4,q56-v2.4,q57-v2.4,q58-v2.4,q59-v2.4,q6-v2.4,q60-v2.4,q61-v2.4,q62-v2.4,q63-v2.4,q64-v2.4,q65-v2.4,q66-v2.4,q67-v2.4,q68-v2.4,q69-v2.4,q7-v2.4,q70-v2.4,q71-v2.4,q72-v2.4,q73-v2.4,q74-v2.4,q75-v2.4,q76-v2.4,q77-v2.4,q78-v2.4,q79-v2.4,q8-v2.4,q80-v2.4,q81-v2.4,q82-v2.4,q83-v2.4,q84-v2.4,q85-v2.4,q86-v2.4,q87-v2.4,q88-v2.4,q89-v2.4,q9-v2.4,q90-v2.4,q91-v2.4,q92-v2.4,q93-v2.4,q94-v2.4,q95-v2.4,q96-v2.4,q97-v2.4,q98-v2.4,q99-v2.4,ss_max-v2.4","true"]'

Benchmark command parameters explained

The Amazon EMR Spark step uses the following parameters:

  • EMR step configuration:
    • Type=Spark: Specifies this is a Spark application step
    • Name=”EMR-7.9-Baseline-Spark-3.5.5″: Human-readable name for the step
    • ActionOnFailure=CONTINUE: Continue with other steps if this one fails
  • Spark submit arguments:
    • –deploy-mode client: Run the driver on the master node (not cluster mode)
    • –class com.amazonaws.eks.tpcds.BenchmarkSQL: Main class for the TPC-DS benchmark
  • Application parameters:
    • JAR file: s3://<YOUR-BUCKET-NAME>/jar/spark-benchmark-assembly-3.3.0.jar
    • Input data: s3://<YOUR-BUCKET-NAME>/blog/BLOG_TPCDS-TEST-3T-partitioned (3 TB TPC-DS dataset)
    • Output location: s3://<YOUR-BUCKET-NAME>/blog/BASELINE_TPCDS-TEST-3T-RESULT (S3 path for results)
    • TPC-DS tools path: /opt/tpcds-kit/tools(local path on EMR nodes)
    • Format: parquet (output format)
    • Scale factor: 3000 (3 TB dataset size)
    • Iterations: 3 (run each query 3 times for averaging)
    • Collect results: false (don’t collect results to driver)
    • Query list: "q1-v2.4,q10-v2.4,...,ss_max-v2.4" (all 104 TPC-DS queries)
    • Final parameter: true (enable detailed logging and metrics)
  • Query coverage:
    • All 104 standard TPC-DS benchmark queries (q1-v2.4 through q99-v2.4)
    • Plus the ss_max-v2.4 query for additional testing
    • Each query runs 3 times to calculate average performance

Summarize the results

  1. Download the test result files from both output S3 locations:
    # Baseline results
    aws s3 cp s3://<YOUR-BUCKET-NAME>/blog/BASELINE_TPCDS-TEST-3T-RESULT/timestamp=xxxx/summary.csv/xxx.csv ./baseline-results.csv
       
    # Optimized results
    aws s3 cp s3://<YOUR-BUCKET-NAME>/blog/OPTIMIZED_TPCDS-TEST-3T-RESULT/timestamp=xxxx/summary.csv/xxx.csv ./optimized-results.csv

  2. The CSV files contain four columns (without headers):
    • Query name
    • Median time (seconds)
    • Minimum time (seconds)
    • Maximum time (seconds)
  3. Calculate performance metrics for comparison:
    • Average time per query: AVERAGE(median, min, max) for each query
    • Total runtime: Sum of all median times
    • Geometric mean: GEOMEAN(average times) across all queries
    • Speedup: Calculate the ratio between baseline and optimized for each query
  4. Create comparison analysis:Speedup = (Baseline Time - Optimized Time) / Baseline Time * 100%

Testing configuration details

The following table summarizes the test environment used for this post:

Parameter Value
EMR release emr-7.9.0 (both configurations)
Baseline Spark version 3.5.5 (installed through bootstrap action)
Baseline bootstrap script s3://spark-ba/install-spark-3-5-5-no-encryption.sh (public)
Optimized spark version Amazon EMR Spark runtime
Cluster size 9 nodes (1 primary and 8 core)
Instance type r5d.4xlarge
vCPUs per node 16
Memory per node 128 GB
Instance storage 600 GB SSD
EBS volume 64 GB gp2 (2 volumes per instance)
Total vCPUs 144 (9 × 16)
Total memory 1152 GB (9 × 128)
Dataset TPC-DS 3TB (Parquet format)
Queries 104 queries (TPC-DS v2.4)
Iterations 3 runs per query
DRA Disabled for consistent benchmarking

Clean up

To avoid incurring future charges, delete the resources you created:

  1. Terminate both EMR clusters:
    aws emr terminate-clusters --cluster-ids <YOUR-BASELINE-CLUSTER-ID> <YOUR-OPTIMIZED-CLUSTER-ID>

  2. Delete S3 test results if no longer needed:
    aws s3 rm s3://<YOUR-BUCKET-NAME>/blog/BASELINE_TPCDS-TEST-3T-RESULT/ --recursive
    aws s3 rm s3://<YOUR-BUCKET-NAME>/blog/OPTIMIZED_TPCDS-TEST-3T-RESULT/ --recursive
    aws s3 rm s3://<YOUR-BUCKET-NAME>/logs/ --recursive

  3. Remove IAM roles if created specifically for testing

Key findings

  • Up to 20% performance improvement using the Amazon EMR 7.9’s Spark runtime with no code changes required
  • 20% cost savings because of reduced runtime
  • Significant gains for shuffle-heavy, join-intensive workloads
  • 100% API compatibility with open source Apache Spark
  • Simple migration from custom Spark builds to EMR runtime
  • Easy benchmarking using publicly available bootstrap scripts

Conclusion

You can run your Apache Spark workloads up to 20% faster and at lower cost without making any changes to your applications by using the Amazon EMR 7.9.0 optimized Spark runtime. This improvement is achieved through numerous optimizations in the EMR Spark runtime, including enhanced encryption handling, improved data serialization, and optimized shuffle operations.

To learn more about Amazon EMR 7.9 and best practices, see the EMR documentation. For configuration guidance and tuning advice, subscribe to the AWS Big Data Blog.

Related resources:

If you’re running Spark workloads on Amazon EMR today, we encourage you to test the EMR 7.9 Spark runtime with your production workloads and measure the improvements specific to your use case.


About the authors

Sonu Kumar Singh

Sonu Kumar Singh

Sonu is a Senior Solutions Architect with more than 13 years of experience, with a specialization in Analytics and Healthcare domain. He has been instrumental in catalyzing transformative shifts in organizations by enabling data-driven decision-making thereby fueling innovation and growth. He enjoys it when something he designed or created brings a positive impact.

Roshin Babu

Roshin Babu

Roshin is a Sr. Specialist Solutions architect at AWS, where he collaborates with the sales team to support public sector clients. His role focuses on developing innovative solutions that solve complex business challenges while driving increased adoption of AWS analytics services. When he’s not working, Roshin is passionate about exploring new destinations, discovering great food, and enjoying soccer both as a player and fan.Polaris Jhandi

Polaris Jhandi

Polaris Jhandi

Polaris is a Cloud Application Architect with AWS Professional Services. He has a background in AI/ML and big data. He is currently working with customers to migrate their legacy mainframe applications to the AWS Cloud.Zheng Yuan

Zheng Yuan

Zheng Yuan

Zheng is a Software Engineer on the Amazon EMR Spark team, where he focuses on improving the performance of the Spark execution engine across various use cases.

Introducing catalog federation for Apache Iceberg tables in the AWS Glue Data Catalog

Post Syndicated from Debika D original https://aws.amazon.com/blogs/big-data/introducing-catalog-federation-for-apache-iceberg-tables-in-the-aws-glue-data-catalog/

Apache Iceberg has become the standard choice of open table format for organizations seeking robust and reliable analytics at scale. However, enterprises increasingly find themselves navigating complex multi-vendor landscapes with disparate catalog systems. Managing data across these has become a major challenge for organizations operating in multi-vendor environments. This fragmentation drives significant operational complexity, particularly around access control and governance. Customers using AWS analytics services such as Amazon Redshift, Amazon EMR, Amazon Athena, Amazon SageMaker, and AWS Glue to analyze Iceberg tables in the AWS Glue Data Catalog want to get the same price-performance for workloads in remote catalogs. Simply migrating or replacing these remote catalogs isn’t practical, leaving teams to implement and maintain synchronization processes that continuously replicate metadata across systems, creating operational overhead, escalating costs, and risking data inconsistencies.

AWS Glue now supports catalog federation for remote Iceberg tables in the Data Catalog. With catalog federation, you can query remote Iceberg tables, stored in Amazon Simple Storage Service (Amazon S3) and cataloged in remote Iceberg catalogs, using AWS analytics engines and without moving or duplicating tables. After a remote catalog is integrated, AWS Glue always fetch the latest metadata in the background, so you always have access to the Iceberg metadata through your preferred AWS analytics services. This capability supports both coarse-grained access control and fine-grained permissions through AWS Lake Formation, giving you the flexibility on how and when remote Iceberg tables are shared with data consumers. With integration for Snowflake Polaris Catalog, Databricks Unity Catalog, and other custom catalogs supporting Iceberg REST specifications, you can federate to remote catalogs, discover databases and tables, configure access permissions, and begin querying remote Iceberg data.

In this post, we discuss how to get started with catalog federation for Iceberg tables in the Data Catalog.

Solution overview

Catalog federation uses the Data Catalog to communicate with remote catalog systems to discover catalog objects and Lake Formation to authorize access to their data in Amazon S3. When you query a remote Iceberg table, the Data Catalog discovers the latest table information in the remote catalog at query runtime, getting the table’s S3 location, current schema, and partition information. Your analytics engine (Athena, Amazon EMR, or Amazon Redshift) Your analytics engine (Athena, EMR, or Redshift) then uses this information to access Iceberg data files directly from Amazon S3. And Lake Formation manages access to the table by vending scoped credentials to the table data stored in Amazon S3, allowing the engines to apply fine-grained permissions to the federated table. This approach avoids metadata and data duplication while providing real-time access to remote Iceberg tables through your preferred AWS analytics engines.

The Data Catalog facilitates connectivity to remote catalog systems that support Apache Iceberg by establishing an AWS Glue connection with the remote catalog endpoint. You can connect the Data Catalog to remote Iceberg REST catalogs using OAuth2 or custom authentication mechanisms using an access token. During integration, administrators configure a principal (service account or identity) with the appropriate permissions to access resources in the remote catalog. The AWS Glue connection object uses this configured principal’s credentials to authenticate and access metadata in the remote catalog server. You can also connect the Data Catalog to remote catalogs that use a private link or proxy for isolating and restricting network access. After it’s connected, this integration uses the standardized Iceberg REST API specification to retrieve the most current table metadata information from these remote catalogs. AWS Glue onboards these remote catalogs as federated catalogs within its own catalog infrastructure, enabling unified metadata access across multiple catalog systems.

Lake Formation serves as the centralized authorization layer for managing user access to federated catalog resources. When users attempt to access tables and databases in federated catalogs, Lake Formation evaluates their permissions and enforces fine-grained access control policies.

Beyond metadata authorization, the catalog federation also manages secure access to the actual underlying data files. It accomplishes this through credential vending mechanisms that issue temporary, scope-limited credentials. AWS Glue federated catalogs work with your preferred AWS analytics engines and query services, enabling consistent metadata access and unified data governance across your analytics workloads.

In the following sections, we walk through the steps to integrate the Data Catalog with your remote catalog server:

  1. Set up an integration principal in the remote catalog and provide required access on catalog resources to this principal. Enable OAuth based authentication for the integration principal.
  2. Create a federated catalog in the Data Catalog using the AWS Glue connection. Create an AWS Glue connection that uses the credentials of the integration principal (in Step1) to connect to the Iceberg REST endpoint of the remote catalog. Configure an AWS Identity and Access Management (IAM) role with permission to S3 locations where the remote table data resides. In a cross-account scenario, make sure the bucket policy grants required access to this IAM role. This federated catalog mirrors the catalog object in your remote catalog server.
  3. Discover Iceberg tables in federated catalogs using Lake Formation or AWS Glue APIs. Query Iceberg tables using AWS analytics engines. During query operations, Lake Formation manages fine-grained permission on federated resources and credential vending to underlying data for the end-users.

Prerequisites

Before you begin, verify you have the following setup in AWS:

  • An AWS account.
  • The AWS Command Line Interface (AWS CLI) version 2.31.38 or later installed and configured.
  • An IAM admin role or user with appropriate permissions to the following services:
    • IAM
    • AWS Glue Data Catalog
    • Amazon S3
    • AWS Lake Formation
    • AWS Secrets manager
    • Amazon Athena
  • Create a data lake admin. For instructions, see Create a data lake administrator.

Set up authentication credentials in remote Iceberg catalog

Catalog federation to a remote Iceberg catalog uses the OAuth2 credentials of the principal configured with metadata access. This authentication mechanism allows the AWS Glue Data Catalog to access the metadata of various objects (such as databases, and tables) within the remote catalogs, based on the privileges associated with the principal. To support proper functionality, you must grant the principal with the necessary permissions to read the metadata of these objects. Generate the CLIENT_ID and CLIENT_SECRET to enable OAuth based authentication for the integration principal.

Create AWS Glue catalog federation using connection to remote Iceberg catalog

Create a federated catalog in the Data Catalog that mirrors a catalog object in the remote Iceberg catalog server and is used by the AWS Glue service to federate metadata queries such as ListDatabases, ListTables, and GetTable to the remote catalog. As data lake administrator, you can create a federated catalog in the Data Catalog using an AWS Glue connection object that is registered with AWS Lake Formation.

Configure data source connection for AWS Glue connection

Catalog federation uses an AWS Glue connection for metadata access when you provide authentication and Iceberg REST API endpoint configurations in the remote catalog. The AWS Glue connection supports OAuth2 or custom as the authentication method.

Connect using OAuth2 authentication

For the OAuth2 authentication method, you can provide a client secret either directly as input or stored in AWS Secrets Manager and used by the AWS Glue connection object during authentication. AWS Glue internally manages the token refresh upon expiration. To store the client secret in Secrets manager, complete the following steps:

  1. On the Secrets Manager console, choose Secrets in the navigation pane.
  2. Choose Store a new secret.
  3. Choose Other type of secret, provide the key name as USER_MANAGED_CLIENT_APPLICATION_CLIENT_SECRET, and enter the client secret value.
  4. Choose Next and provide a name for the secret.
  5. Choose Next and choose Store to save the secret.

Connect using custom authentication

For custom authentication, use Secrets Manager to store and retrieve the access token. This access token is created, refreshed, and managed by the customer’s application or system, providing proper control and management over the authentication process. To store the access token in Secrets Manager, complete the following steps:

  1. On the Secrets Manager console, choose Secrets in the navigation pane.
  2. Choose Store a new secret.
  3. Choose Other type of secret and provide the key name as BEARER_TOKEN with the value noted as the access token of the integration principal.
  4. Choose Next and provide a name for the secret.
  5. Choose Next and choose Store to save the secret.

Register AWS Glue connection with Lake Formation

Create an IAM role that Lake Formation can use to vend credentials and attach permission on S3 bucket prefixes where the Iceberg tables are stored. Optionally, if you’re using Secrets Manager to store the client secret or are using a network configuration, you can add permissions for those services to this role. For instruction, refer to Catalog federation to remote Iceberg catalogs.

Complete the following steps to register the connection:

  1. On the Lake Formation console, choose Catalogs in the navigation pane.
  2. Choose Create catalog and select the data source.
  3. Provide the federated catalog details:
    1. Name of the federated catalog.
    2. Catalog name in the remote catalog server and this needs to match the exact catalog name in remote catalog.
  4. Provide AWS Glue connection details. To reuse an existing connection, choose Select existing connection and choose the connection to reuse. For a first-time setup, choose Input new connection configuration and provide the following information:
    1. Provide the AWS Glue connection name.
    2. Provide the remote catalog Iceberg REST API endpoint.
    3. Specify the catalog object casing type. The connection can support uppercase objects through the object hierarchy or lowercase objects.
    4. Configure authentication parameters:
      1. For OAuth2: Provide the client ID and client secret directly or choose the secret where the client secret is stored, token authorization URL, and scope mapped to the credential.
      2. For custom: Provide the secret managed by Secrets Manager where the access token is stored.
      3. Network configuration: If you have a network and/or proxy setup, you can provide this information. Otherwise, leave this section as default.
  5. Register the connection with Lake Formation using the IAM role with access to the bucket where the remote table metadata and data is stored.
  6. Verify the connection by choosing Run test.
  7. After the test is successful, create the catalog.

You can now discover remote objects under the federated catalog. You can onboard other remote catalogs by reusing the existing connection configured to the same external catalog instance.

Query the federated catalog objects using AWS analytical engines

As the data lake administrator, you can now manage access control on databases and tables in a federated catalog using AWS Lake Formation. You can also use tag-based access control to scale your permission model by tagging the resource based on the access control mechanism.

After permissions are granted, an IAM principal or an IAM user can access the federated tables using AWS analytical services including Athena, Amazon Redshift, Amazon EMR, and Amazon SageMaker. Query the federated Iceberg table using Athena as shown in the following example.

Clean up

To avoid incurring ongoing charges, complete the following steps to clean up the resources created during this walkthrough:

  1. Delete the federated catalog in the Data Catalog:
    aws glue delete-catalog \
        --name <your-federated-catalog-name>

  2. Deregister the AWS Glue connection from Lake Formation:
    aws lakeformation deregister-resource \
        --resource-arn <your-glue-connector-arn>

  3. Revoke Lake Formation permissions (if any were granted):
    # List existing permissions first
    aws lakeformation list-permissions \
        --catalog-id <your-account-id> \
        --resource '{
            "Catalog": {}
        }'
    
    # Revoke permissions as needed
    aws lakeformation revoke-permissions \
        --principal '{
            "DataLakePrincipalIdentifier": "<principal-arn>"
        }' \
        --resource '{
            "Database": {
                "CatalogId": "<catalog-id>",
                "Name": "<database-name>"
            }
        }' \
        --permissions ["SELECT", "DESCRIBE"]

  4. Delete the AWS Glue connection:
    aws glue delete-connection \
        --connection-name <your-glue-connection-to-snowflake-account>

  5. Delete IAM roles and policies associated with Lake Formation and the AWS Glue connection:
    # Detach policies from the role
    aws iam detach-role-policy \
        --role-name <your-lakeformation-role-name> \
        --policy-arn <your-lakeformation-policy-arn>
    
    # Delete the custom policy
    aws iam delete-policy \
        --policy-arn <your-lakeformation-policy-arn>
    
    # Delete the role
    aws iam delete-role \
        --role-name <your-lakeformation-role-name>
    # Detach policies from the role
    aws iam detach-role-policy \
        --role-name <your-glue-connection-role-name> \
        --policy-arn <your-glue-connection-policy-arn>
    
    # Delete the custom policy
    aws iam delete-policy \
        --policy-arn <your-glue-connection-policy-arn>
    
    # Delete the role
    aws iam delete-role \
        --role-name <your-glue-connection-role-name>

  6. Delete the Secrets Manager secret:
    # Schedule secret for deletion (7-30 days)
    aws secretsmanager delete-secret \
        --secret-id <your-snowflake-secret>

This teardown guide doesn’t affect the actual metadata in the remote catalog server nor the data in S3 buckets. It only affects the federation configurations in the Data Catalog and Lake Formation. Any corresponding service principals or configurations in the remote catalog server must be addressed separately.

Make sure you follow the teardown steps in the specified order to avoid dependency conflicts. For example, an AWS Glue connection object can’t be deleted if an AWS Glue catalog object is associated with it.

Additionally, make sure you have the necessary permissions to delete these resources.

Conclusion

In this post, we explored how catalog federation addresses the growing challenge of managing Iceberg tables across multi-vendor catalog environments. We walked through the architecture, demonstrating how the Data Catalog communicates with remote catalog systems, including Snowflake Polaris Catalog, Databricks Unity Catalog, and custom Iceberg REST-compliant catalogs, with centralized authorization and credential vending for secure data access. We covered the setup process, including configuring authentication principals, creating federated catalogs using AWS Glue connections, to implementing fine-grained access controls and querying remote Iceberg tables directly from AWS analytics engines.

Catalog federation offers several advantages:

  • Query your Iceberg data where it lives while maintaining security, governance, and price-performance benefits of AWS analytics services
  • Remove operational overheads and costs to maintain synchronization processes
  • Avoid data duplication and inconsistencies
  • Get real-time access to up-to-date table schemas without migrating or replacing existing catalogs.

To learn more, refer to Catalog federation to remote Iceberg catalogs.


About the authors

Debika D

Debika D

Debika is a Senior Product Marketing Manager with Amazon SageMaker, specializing in messaging and go-to-market strategy for lakehouse architecture. She is passionate about all things data and AI.

Srividya Parthasarathy

Srividya Parthasarathy

Srividya is a Senior Big Data Architect on the AWS Lake Formation team. She works with the product team and customers to build robust features and solutions for their analytical data platform. She enjoys building data mesh solutions and sharing them with the community.

Pratik Das

Pratik Das

Pratik is a Senior Product Manager with AWS Lake Formation. He is passionate about all things data and works with customers to understand their requirements and build delightful experiences. He has a background in building data-driven solutions and machine learning systems.

Getting started with Apache Iceberg write support in Amazon Redshift

Post Syndicated from Sanket Hase original https://aws.amazon.com/blogs/big-data/getting-started-with-apache-iceberg-write-support-in-amazon-redshift/

Many companies store structured data in warehouses for analytics while keeping diverse datasets in data lakes for flexible processing. Until now, maintaining consistency between these systems required complex ETL processes and introduced potential data synchronization challenges.

The new Amazon Redshift Apache Iceberg write support removes these complexities through direct writes to Apache Iceberg tables stored in Amazon S3 and S3 Tables. With this native integration you can write data directly from Redshift queries to your data lake without intermediate ETL steps, facilitate data consistency with ACID-compliant transactions that help optimize query performance with flexible partitioning strategies, and use the familiar Redshift SQL interface when writing to Apache Iceberg tables. For example, you can now run a complex transformation in Redshift and write the results directly to an Apache Iceberg table that other analytics engines like Amazon EMR or Amazon Athena can immediately query. By using this approach you can query the same datasets from both Redshift and other analytics tools without copying data.

In this post, we show how you can use Amazon Redshift to write data directly to Apache Iceberg tables stored in Amazon S3 and S3 Tables for seamless integration between your data warehouse and data lake while maintaining ACID compliance.

“Verisk processes billions of catastrophe risk modeling records using Amazon Redshift and Apache Iceberg, achieving 30% faster query aggregations and significant storage cost reductions”

— Karthick Shanmugam, Associate Vice President, Verisk

Solution overview 

You can now create and write directly to Apache Iceberg tables stored in Amazon S3 and S3 Tables using familiar SQL commands in Amazon Redshift. We’ll guide you through configuring permissions for S3 table buckets using AWS Lake Formation. Finally, we’ll analyze customer and order datasets across both Redshift native and Apache Iceberg data formats to derive insights. The workflow is illustrated in the following diagram:

In this post we will walk you through following steps:

  1. Create an external database named customer_db in AWS Glue Data Catalog using Amazon Redshift SQL.
  2. Create an external table named customer in the Glue database and write customer data using Amazon Redshift SQL.
  3. Create table bucket named orders on Amazon S3 Tables to write orders data.
  4. Grant permissions using AWS Lake Formation to an IAM role for reading and writing to the orders table.
  5. Write orders data to the orders Amazon S3 table bucket.

This solution uses the following AWS services:

Prerequisites 

  • Create Amazon Redshift data warehouse (provisioned or Serverless).
  • Permissions to create database on AWS Glue Data Catalog from Redshift.
  • Create a new AWS Glue database called customer_db or use an existing database of your choice. If you use an existing database or a different name, replace customer_db with your actual database name in the subsequent commands.
  • S3 bucket and S3 Table bucket in the same AWS Region as your Redshift cluster.
  • Have access to an IAM role that is a Lake Formation data lake administrator. For instructions, refer to Create a data lake administrator.
  • Create IAM role RedshifticebergRole with following policy. Add managed permission for AmazonRedshiftQueryEditorV2.
    {
        "Version": "2012-10-17",
        "Statement": [
                        {
                            "Sid": "VisualEditor0",
                            "Effect": "Allow",
                            "Action": "redshift:GetClusterCredentialsWithIAM",
                            "Resource": "arn:aws:redshift:<YOUR-REGION>:<AWS-ACCOUNT-NUMBER>:dbname::<YOUR-REDSHIFT-CLUSTER-NAME>/*"
                        }
                    ]
    }

Setting up your environment

To set up your environment, complete the following steps.

Creating Apache Iceberg tables in Amazon S3 standard buckets

  1. Connect to Redshift using Query Editor V2.
  2. Create user for the Federated role RedshifticebergRole.
    Create user IAMR:RedshifticebergRole

  3. Verify you have an Amazon Redshift External Schema configured. Run following script on Redshift:
    CREATE EXTERNAL SCHEMA demo_iceberg
    FROM DATA CATALOG DATABASE 'customer_db'
    IAM_ROLE 'arn:aws:iam::<AWS-ACCOUNT-NUMBER>:role/RedshiftCustomizedIcebergRole';

  4. Create external table customer in Apache Iceberg table format in the demo_iceberg external schema created above and then insert data.

    Use this two-step approach when you need control over column definitions or plan to append data.

    Replace your S3 bucket name in place of <<your-bucket>>.

    -- Step 1: Define your table structure
    CREATE TABLE demo_iceberg.customer
    (
    customer_id bigint,
    customer_name varchar,
    email varchar,
    city varchar
    )
    USING ICEBERG
    LOCATION 's3://<<your-bucket>>/iceberg-data/customers/';
    
    -- Step 2: Insert data
                  
    (1, 'Customer1 Smith', '[email protected]', 'New York'),
    (2, 'Customer2 Johnson', '[email protected]', 'Los Angeles'),
    (3, 'Customer3 Brown', '[email protected]', 'Chicago'),
    (4, 'Customer4 Davis', '[email protected]', 'Houston'),
    (5, 'Customer5 Wilson', '[email protected]', 'Phoenix'),
    (6, 'Customer6 Miller', '[email protected]', 'Philadelphia'),
    (7, 'Customer7 Garcia', '[email protected]', 'San Antonio'),
    (8, 'Customer8 Rodriguez', '[email protected]', 'San Diego'),
    (9, 'Customer9 Martinez', '[email protected]', 'Dallas'),
    (10, 'Customer10 Anderson', '[email protected]', 'San Jose'),
    (11, 'Customer11 Taylor', '[email protected]', 'Austin'),
    (12, 'Customer12 Thomas', '[email protected]', 'Jacksonville'),
    (13, 'Customer13 Jackson', '[email protected]', 'Fort Worth'),
    (14, 'Customer14 White', '[email protected]', 'Columbus'),
    (15, 'Customer15 Harris', '[email protected]', 'Charlotte');
    
    -- Step 3: Select data
    SELECT * FROM demo_iceberg.customer;
    

    Figure 2: Result from demo_iceberg.customer

  5. Grant access to external schema for user IAMR:RedshifticebergRole:
    Grant usage on schema demo_iceberg to "IAMR:RedshifticebergRole";

Create Apache Iceberg tables in Amazon S3 Table buckets

Amazon S3 table buckets are integrated with AWS Lake Formation, which serves as the central authority for managing data access permissions. When working with Apache Iceberg tables, Lake Formation provides a unified security framework that simplifies access control across your entire data lake. This centralized approach makes sure consistent and efficient permission management, alleviating the need to handle permissions in multiple places.

To create an S3 table bucket:

  1. Go to Amazon S3, choose Table buckets in the left navigation pane.
  2. On the Table buckets page, in the Integration with AWS analytics services section, choose Enable integration.
  3. In the Table buckets list, choose the Create table bucket button and enter a name for your table bucket, for example, iceberg-write-blog, and choose Create table bucket. After creation, the bucket will appear in the S3 tables catalog, s3tablescatalog, in the Lake Formation console.
  4. In the AWS Lake Formation console, choose Catalogs, in the Catalogs table select s3tablescatalog to open the detail page for that table.
  5. On the s3tablescatalog details page, under Catalogs, choose the table bucket iceberg-write-blog.
  6. On the iceberg-write-blog details page, under Databases, choose Create database.
  7. Enter the database name iceberg_write_namespace, select the Catalog from the drop down menu, and choose Create database.
  8. Grant a permission to create a table in the database to the Lake Formation IAM role. On the iceberg-write-blog details page select the radio button for iceberg_write_namespace, choose Actions, Grant.
  9. On the Grant permissions page, under Principal type select Principals, under Principals select IAM users and roles, in the IAM users and roles drop down menu select RedshifticebergRole.
  10. For LF-Tags or catalog resources, choose Named Data Catalog resources, for Catalogs select iceberg-write-blog and for Databases select iceberg_write_namespace.
  11. For Database permissions select the checkbox for Create table, Drop, and Describe, then choose Grant.

Creating Apache Iceberg tables in Amazon Redshift using Amazon S3 table buckets

AWS Lake Formation catalogs are automatically mounted on Amazon Redshift data warehouses in same account. Amazon Redshift writes directly to S3 Tables using the auto mounted S3 table catalog. The SQL syntax for writing to Apache Iceberg tables stored in S3 table buckets is similar to the syntax for Apache Iceberg tables stored in S3 standard buckets. The key difference is the auto mounted S3 Table catalog, which supports three-part notation access. This feature alleviates the need to create an EXTERNAL SCHEMA when referencing data lake Apache Iceberg tables residing in S3 Table buckets.

To create the Apache Iceberg table:

  1. Switch to the RedshifticebergRole. To access S3 tables through the Redshift Query Editor V2, you must use a Federated user account, the RedshifticebergRole has been granted the necessary Lake Formation permissions.
  2. Log in to Redshift using the Query Editor V2 Federated user option.
  3. In Query Editor V2, create the table named orders in Apache Iceberg table format:
    CREATE TABLE "iceberg-write-blog@s3tablescatalog".iceberg_write_namespace.orders
    (
     customer_id BIGINT,
     order_id BIGINT,
     Total_order_amt DECIMAL(10,2),
     Total_order_tax_amt REAL,
     tax_pct DOUBLE PRECISION,
     order_date DATE,
     order_created_at_tz TIMESTAMPTZ,
     is_active_ind BOOLEAN
    )
    USING ICEBERG 
    PARTITIONED BY (DAY(order_date));
    

  4. Insert data into the table using standard SQL:
    INSERT INTO "iceberg-write-blog@s3tablescatalog".iceberg_write_namespace.orders
    (order_date, order_id, customer_id, total_order_amt, total_order_tax_amt, tax_pct, order_created_at_tz, is_active_ind)
    VALUES
    ('2024-01-15', 1001, 1, 125.50, 10.04, 0.08, '2024-01-15 10:30:00-06:00', true),
    ('2024-02-20', 1002, 2, 89.99, 6.75, 0.075, '2024-02-20 14:22:15-06:00', true),
    ('2024-03-10', 1003, 3, 234.75, 20.78, 0.0885, '2024-03-10 09:15:30-06:00', false),
    ('2024-04-05', 1004, 1, 67.25, 5.38, 0.08, '2024-04-05 16:45:00-05:00', true),
    ('2024-05-18', 1005, 4, 156.80, 12.54, 0.08, '2024-05-18 11:20:45-05:00', true),
    ('2024-06-22', 1006, 5, 45.99, 4.14, 0.09, '2024-06-22 13:10:20-05:00', true),
    ('2024-07-14', 1007, 2, 312.40, 24.99, 0.08, '2024-07-14 08:35:10-05:00', false),
    ('2024-08-30', 1008, 6, 78.50, 7.07, 0.09, '2024-08-30 15:25:35-05:00', true),
    ('2024-09-12', 1009, 3, 199.99, 18.00, 0.09, '2024-09-12 12:40:50-05:00', true),
    ('2024-10-08', 1010, 7, 523.75, 41.90, 0.08, '2024-10-08 17:15:25-05:00', true),
    ('2024-10-25', 1011, 4, 92.30, 8.31, 0.09, '2024-10-25 10:05:15-05:00', false),
    ('2024-11-02', 1012, 8, 167.45, 13.40, 0.08, '2024-11-02 14:50:40-06:00', true),
    ('2024-11-08', 1013, 1, 34.99, 2.80, 0.08, '2024-11-08 09:30:20-06:00', true),
    ('2024-11-09', 1014, 9, 445.60, 40.10, 0.09, '2024-11-09 16:20:55-06:00', true),
    ('2024-11-10', 1015, 5, 278.85, 22.31, 0.08, '2024-11-10 11:45:30-06:00', true);
    

  5. Create a Redshift local_orders table and insert sample records:
    CREATE TABLE dev.public.local_orders
    (
    customer_id BIGINT,
    order_id BIGINT,
    Total_order_amt DECIMAL(10,2),
    Total_order_tax_amt REAL,
    tax_pct DOUBLE PRECISION,
    order_date DATE,
    order_created_at_tz TIMESTAMPTZ,
    is_active_ind BOOLEAN
    );
    
    
    INSERT INTO dev.public.local_orders
    (customer_id, order_id, Total_order_amt, Total_order_tax_amt, tax_pct, order_date, order_created_at_tz, is_active_ind)
    VALUES
    (1001, 5001, 299.99, 24.00, 0.08, '2024-01-15', '2024-01-15 14:30:00-05:00', true),
    (1002, 5002, 1250.50, 100.04, 0.08, '2024-01-16', '2024-01-16 09:15:22-05:00', true),
    (1003, 5003, 75.25, 6.02, 0.08, '2024-01-16', '2024-01-16 16:45:33-05:00', true),
    (1004, 5004, 499.99, 40.00, 0.08, '2024-01-17', '2024-01-17 11:20:45-05:00', true),
    (1005, 5005, 149.50, 11.96, 0.08, '2024-01-17', '2024-01-17 13:55:12-05:00', false),
    (1002, 5006, 899.99, 72.00, 0.08, '2024-01-18', '2024-01-18 10:05:30-05:00', true),
    (1006, 5007, 45.75, 3.66, 0.08, '2024-01-18', '2024-01-18 15:40:18-05:00', true),
    (1007, 5008, 1500.00, 120.00, 0.08, '2024-01-19', '2024-01-19 08:25:55-05:00', true),
    (1008, 5009, 250.25, 20.02, 0.08, '2024-01-19', '2024-01-19 12:10:40-05:00', true),
    (1009, 5010, 725.75, 58.06, 0.08, '2024-01-20', '2024-01-20 14:15:28-05:00', true);
    

  6. Using the CREATE TABLE AS (CTAS) format, create a table from the existing table with no compression:
    CREATE TABLE "iceberg-write-blog@s3tablescatalog".iceberg_write_namespace.orders_new 
    using ICEBERG
    TABLE PROPERTIES ('compression_type'='uncompressed')
    AS
    select * from dev.public.local_orders;
    

  7. Select data with standard SQL using the three-part notation:
    select * from "iceberg-write-blog@s3tablescatalog".iceberg_write_namespace.orders;

    You can also use the USE clause to specify the default database (and omit the database name):

    USE "iceberg-write-blog@s3tablescatalog";
    
    select * from iceberg_write_namespace.orders;

    The resulting table will look like the following image:

  8. Set a schema search path to further simplify table access by omitting the schema name from the notation:
    -- Redshift default database is set to 'iceberg-write-blog@s3tablescatalog'
    USE "iceberg-write-blog@s3tablescatalog";
    
    -- Redshift will search 'iceberg_write_namespace' to resolve table orders
    set search_path to iceberg_write_namespace;
    
    select * from orders;

  9. Show table:
    show table "iceberg-write-blog@s3tablescatalog".iceberg_write_namespace.orders 
    
    --Result
    CREATE TABLE "iceberg-write-blog@s3tablescatalog".iceberg_write_namespace.orders 
    (
    customer_id bigint,
    order_id bigint,
    total_order_amt decimal(10, 2),
    total_order_tax_amt float,
    tax_pct double precision,
    order_date date,
    order_created_at_tz timestamptz,
    is_active_ind Boolean
    )
    USING ICEBERG
    PARTITIONED BY (DAY(order_date))
    TABLE PROPERTIES ('compression_type'='snappy');

Bringing it together

Let’s demonstrate how to combine data from two sources and show how they can work together in a single query.

  • Customer data stored in standard S3 buckets
  • Orders data stored in S3 table buckets
  1. Log in to Redshift using Federated user:
    select 
    b.order_date,
    b.order_id,
    b.total_order_amt,
    CONVERT_TIMEZONE('America/Los_Angeles', b.order_created_at_tz) AS order_pacific_time,
    a.customer_name,
    a.email,
    a.city
    from dev.demo_iceberg.customer a join "iceberg-write-blog@s3tablescatalog".iceberg_write_namespace.orders b
    on(a.customer_id = b.customer_id)
    where b.order_date between '2024-01-15' and '2024-10-25'
    and b.is_active_ind=true;

    The result from the consolidated query:

  2. Drop table:
    Drop TABLE "iceberg-write-blog@s3tablescatalog".iceberg_write_namespace.orders_new;

Clean up

To avoid ongoing charges, follow these steps in order:

  1. Drop Apache Iceberg tables:
    DROP TABLE dev.demo_iceberg.customer;
    DROP TABLE "iceberg-write-blog@s3tablescatalog".iceberg_write_namespace.orders;

  2. Remove S3 objects, replace your-bucket with the name of the bucket you created:
    aws s3 rm s3://<your-bucket>/iceberg/ --recursive

  3. Remove Lake Formation permissions, replace your-bucket with the name of the bucket you created:
    aws lakeformation deregister-resource --resource-arn arn:aws:s3:::<your-bucket>

Conclusion

With Apache Iceberg write support in Amazon Redshift you can to build flexible data architectures that combine the performance of a data warehouse with the scalability of a data lake. You can now write data directly to Apache Iceberg tables while maintaining ACID compliance and partitioning for query optimization. You can use Amazon Redshift to create Apache Iceberg tables in your data lake, making them immediately queryable through Amazon EMR or Amazon Athena.

To learn more, review the Amazon Redshift Iceberg integration and Writing to Apache Iceberg tables documentation. Visit the AWS Database Blog for latest updates.


About the authors

Sanket Hase

Sanket Hase

Sanket is an Engineering Manager with the Amazon Redshift team, leading query execution teams in the areas of data lake analytics, hardware-software co-design, and vectorized query execution.

Harshida Patel

Harshida Patel

Harshida is a Principal Solutions Architect, Analytics with AWS.

Ritesh Kumar Sinha

Ritesh Kumar Sinha

Ritesh is an Analytics Specialist Solutions Architect based out of San Francisco. He has helped customers build scalable data warehousing and big data solutions for over 16 years. He loves to design and build efficient end-to-end solutions on AWS. In his spare time, he loves reading, walking, and doing yoga.

Raghu Kuppala

Raghu Kuppala

Raghu is an Analytics Specialist Solutions Architect experienced working in the databases, data warehousing, and analytics space. Outside of work, he enjoys trying different cuisines and spending time with his family and friends.

Xiening Dai

Xiening Dai

Xiening is a Principal Software Engineer working on Redshift Query Processing and Data Lake.

How to use the Secrets Store CSI Driver provider Amazon EKS add-on with Secrets Manager

Post Syndicated from Angad Misra original https://aws.amazon.com/blogs/security/how-to-use-the-secrets-store-csi-driver-provider-amazon-eks-add-on-with-secrets-manager/

In this post, we introduce the AWS provider for the Secrets Store CSI Driver, a new AWS Secrets Manager add-on for Amazon Elastic Kubernetes Service (Amazon EKS) that you can use to fetch secrets from Secrets Manager and parameters from AWS Systems Manager Parameter Store and mount them as files in Kubernetes pods. The add-on is straightforward to install and configure, works on Amazon Elastic Compute Cloud (Amazon EC2) instances and hybrid nodes, and includes the latest security updates and bugfixes. It provides a secure and reliable way to retrieve your secrets in Kubernetes workloads.

The AWS provider for the Secrets Store CSI Driver is an open source Kubernetes DaemonSet.

Amazon EKS add-ons provide installation and management of a curated set of add-ons for EKS clusters. You can use these add-ons to help ensure that your EKS clusters are secure and stable and reduce the number of steps required to install, configure, and update add-ons.

Secrets Manager helps you manage, retrieve, and rotate database credentials, application credentials, OAuth tokens, API keys, and other secrets throughout their lifecycles. By using Secrets Manager to store credentials, you can avoid using hard-coded credentials in application source code, helping to avoid unintended or inadvertent access.

New EKS add-on: AWS provider for the Secrets Store CSI Driver

We recommend installing the provider as an Amazon EKS add-on instead of the legacy installation methods (Helm, kubectl) to reduce the amount of time it takes to install and configure the provider. The add-on can be installed in several ways: using eksctl—which you will use in this post—the AWS Management Console, the Amazon EKS API, AWS CloudFormation, or the AWS Command Line Interface (AWS CLI).

Security considerations

The open-source Secrets Store CSI Driver maintained by the Kubernetes community enables mounting secrets as files in Kubernetes clusters. The AWS provider relies on the CSI driver and mounts secrets as file in your EKS clusters. Security best practice recommends caching secrets in memory where possible. If you prefer to adopt the native Kubernetes experience, please follow the steps in this blog post. If you prefer to cache secrets in memory, we recommend using the AWS Secrets Manager Agent.

IAM principals require Secrets Manager permissions to get and describe secrets. If using Systems Manager Parameter Store, principals also require Parameter Store permissions to get parameters. Resource policies on secrets serve as another access control mechanism, and AWS principals must be explicitly granted permissions to access individual secrets if they’re accessing secrets from a different AWS account (see Access AWS Secrets Manager secrets from a different account). The Amazon EKS add-on provides security features including support for using FIPS endpoints. AWS provides a managed IAM policy, AWSSecretsManagerClientReadOnlyAccess, which we recommend using with the EKS add-on.

Solution walkthrough

In the following sections, you’ll create an EKS cluster, create a test secret in Secrets Manager, install the Amazon EKS add-on, and use it to retrieve the test secret and mount it as a file in your cluster.

Prerequisites

  1. AWS credentials, which must be configured in your environment to allow AWS API calls and are required to allow access to Secrets Manager
  2. AWS CLI v2 or higher
  3. Your preferred AWS Region must be available in your environment. Use the following command to set your preferred region:
    aws configure set default.region <preferred_region>
    

  4. The kubectl and eksctl command-line tools
  5. A Kubernetes deployment file hosted in the GitHub repo for the provider

With the prerequisites in place, you’re ready to run the commands in the following steps in your terminal:

Create an EKS cluster

  1. Create a shell variable in your terminal with the name of your cluster:
    CLUSTER_NAME="my-test-cluster”
    

  2. Create an EKS cluster:
    eksctl create cluster $CLUSTER_NAME 
    

eksctl will automatically use a recent version of Kubernetes and create the resources needed for the cluster to function. This command typically takes about 15 minutes to finish setting up the cluster.

Create a test secret

Create a secret named addon_secret in Secrets Manager:

aws secretsmanager create-secret \
  --name addon_secret \
  --secret-string "super secret!"

Set up the Secrets Store CSI Driver provider EKS add-on

Install the Amazon EKS add-on:

eksctl create addon \
  --cluster $CLUSTER_NAME \
  --name aws-secrets-store-csi-driver-provider

Create an IAM role

Create an AWS Identity and Access Management (IAM) role that the EKS Pod Identity service principal can assume and save it in a shell variable (replace <region> with the AWS Region configured in your environment):

ROLE_ARN=$(aws --region <region> --query Role.Arn --output text iam create-role --role-name nginx-deployment-role --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "pods.eks.amazonaws.com"
            },
            "Action": [
                "sts:AssumeRole",
                "sts:TagSession"
            ]
        }
    ]
}')

Attach a managed policy to the IAM role

Note: AWS provides a managed policy for client-side consumption of secrets through Secrets Manager: AWSSecretsManagerClientReadOnlyAccess. This policy grants access to get and describe secrets for the secrets in your account. If you want to further follow the principle of least privilege, create a custom policy scoped down to only the secrets you want to retrieve.

Attach the managed policy to the IAM role that you just created:

aws iam attach-role-policy \
  --role-name nginx-deployment-role \
  --policy-arn arn:aws:iam::aws:policy/AWSSecretsManagerClientReadOnlyAccess

Set up the EKS Pod Identity Agent

Note: The add-on provides two methods of authentication: IAM roles for service accounts (IRSA) and EKS Pod Identity. In this solution, you’ll use EKS Pod Identity.

  1. After you’ve installed the add-on in your cluster, install the EKS Pod Identity Agent add-on for authentication:
    eksctl create addon \
      --cluster $CLUSTER_NAME \
      --name eks-pod-identity-agent
    

  2. Create an EKS Pod Identity association for the cluster:
    eksctl create podidentityassociation \
        --cluster $CLUSTER_NAME \
        --namespace default \
        --region <region> \
        --service-account-name nginx-pod-identity-deployment-sa \
        --role-arn $ROLE_ARN \
        --create-service-account true
    

Set up your SecretProviderClass

The SecretProviderClass is a YAML file that defines which secrets and parameters to mount as files in your cluster.

  1. Create a minimal SecretProviderClass called spc.yaml for the test secret with the following content:
    apiVersion: secrets-store.csi.x-k8s.io/v1
    kind: SecretProviderClass
    metadata:
      name: nginx-pod-identity-deployment-aws-secrets
    spec:
      provider: aws
      parameters:
        objects: |
          - objectName: "addon_secret"
            objectType: "secretsmanager"
        usePodIdentity: "true"
    

  2. Deploy your SecretProviderClass (make sure you’re in the same directory as the spc.yaml you just created):
    kubectl apply -f spc.yaml
    

To learn more about the SecretProviderClass, see the GitHub readme for the provider.

Deploy your pod to your EKS cluster

For brevity, we’ve omitted the content of the Kubernetes deployment file. The following is an example deployment file for Pod Identity in the GitHub repository for the provider—use this file to deploy your pod:

kubectl apply -f https://raw.githubusercontent.com/aws/secrets-store-csi-driver-provider-aws/main/examples/ExampleDeployment-PodIdentity.yaml

This will mount addon_secret at /mnt/secrets-store in your cluster.

Retrieve your secret

  1. Print the value of addon_secret to confirm that the secret was mounted successfully:
    kubectl exec -it $(kubectl get pods | awk '/nginx-pod-identity-deployment/{print $1}' | head -1) -- cat /mnt/secrets-store/addon_secret
    

  2. You should see the following output:
    super secret!
    

You’ve successfully fetched your test secret from Secrets Manager using the new Amazon EKS add-on and mounted it as a file in your Kubernetes cluster.

Clean up

Run the following commands to clean up the resources that you created in this tutorial:

aws secretsmanager delete-secret \
  --secret-id addon_secret \
  --force-delete-without-recovery

aws iam delete-role --role-name nginx-deployment-role

eksctl delete cluster $CLUSTER_NAME

Conclusion

In this post, you learned how to use the new Amazon EKS add-on for the AWS Secrets Store CSI Driver provider to securely retrieve your secrets and parameters and mount them as files in your Kubernetes clusters. The new EKS add-on provides benefits such as the latest security patches and bug fixes, tighter integration with Amazon EKS, and reduces the time it takes to install and configure the AWS Secrets Store CSI Driver provider. The add-on is validated by EKS to work with EC2 instances and hybrid nodes.

Further reading

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Angad Misra

Angad Misra

Angad is a Software Engineer on the AWS Secrets Manager team. When he isn’t building secure, reliable, and scalable software from first principles, he enjoys a good latte, live music, playing guitar, exploring the great outdoors, cooking, and lazing around with his cat, Freyja.

Secure Amazon Elastic VMware Service (Amazon EVS) with AWS Network Firewall

Post Syndicated from Sheng Chen original https://aws.amazon.com/blogs/architecture/secure-amazon-elastic-vmware-service-amazon-evs-with-aws-network-firewall/

Amazon Elastic VMware Service (Amazon EVS) helps organizations migrate, run, and scale VMware workloads natively on AWS. It delivers a VMware Cloud Foundation (VCF) environment that operates directly within your Amazon Virtual Private Cloud (Amazon VPC) on Amazon EC2 bare-metal instances. The solution helps customers accelerate cloud migrations and data center exits without needing to refactor existing applications.

For customers considering a hybrid cloud architecture, a unified network security solution is required to protect application traffic across Amazon EVS environments, Amazon VPCs, on-premises data centers and the internet. It also needs to provide a single point of control for firewall policy management, centralized logging, and monitoring to streamline network security operations.

AWS Network Firewall is a managed firewall and intrusion detection and prevention service (IDS/IPS) that can help address these requirements. Built on AWS managed infrastructure, it automatically scales with traffic demands while maintaining high availability and consistent performance. The service provides centralized policy management and traffic inspection across multiple VPCs and AWS accounts. Additionally, it provides comprehensive visibility and reporting through firewall log collections to Amazon Simple Storage Service (Amazon S3), Amazon CloudWatch Logs, or Amazon Data Firehose.

In this post, we demonstrate how to utilize AWS Network Firewall to secure an Amazon EVS environment, using a centralized inspection architecture across an EVS cluster, VPCs, on-premises data centers and the internet. We walk through the implementation steps to deploy this architecture using AWS Network Firewall and AWS Transit Gateway.

Architecture overview

AWS Network Firewall operates as a “bump-in-the-wire” solution, which transparently inspects and filters network traffic across Amazon VPCs. It is inserted directly into the traffic path by updating VPC or Transit Gateway route tables, allowing it to examine all packets without requiring any changes to the existing application flow patterns.

The following diagram depicts the architecture overview of our centralized inspection model using AWS Network Firewall.

Figure 1: Secure Amazon EVS with AWS Network Firewall using centralized inspection architecture

Figure 1: Secure Amazon EVS with AWS Network Firewall using centralized inspection architecture

The Amazon EVS environment is deployed directly within a customer VPC (i.e. EVS VPC), which consists of EVS VLAN subnets that form the underlay networks for VCF deployment. This infrastructure provides connectivity for NSX overlay networks, host management, vMotion, and vSANAmazon VPC Route Server enables dynamic routing between the underlay networks and overlay networks. For more information, see Concepts and components of Amazon EVS.

The architecture also includes a standard workload VPC (i.e. VPC01), and a Direct Connect Gateway connects to the on-premises data center via an AWS Direct Connect connection. We use a dedicated egress VPC with NAT gateways for centralized internet egress, and a separate ingress VPC with Application Load Balancers to terminate ingress web traffic and steer flows back to the target services.

With this architecture, the following traffic flow patterns can be inspected:

East-West Traffic:

  • Between EVS VPCs and Workload VPCs
  • Between Workload VPCs

North-South Traffic:

  • Between EVS/Workload VPCs and on-premises
  • Between EVS/Workload VPCs and internet
  • Between on-premises and internet

The centralized inspection architecture provides several benefits:

  • Single point of control for network security inspection across multiple VPCs
  • Enhanced rule enforcement across AWS infrastructure, on-premises resources, and the internet
  • Centralized logging and monitoring

For this demo we use the AWS Network Firewall native integration with AWS Transit Gateway capability to streamline firewall deployment and management. With a native firewall attachment, AWS automatically provisions and manages all the necessary VPC resources, reducing the operational overhead of managing subnets, route tables, and firewall endpoints within the inspection VPC.

Prerequisites

This post assumes familiarity with: AWS Command Line Interface (AWS CLI), Amazon VPC, Amazon EC2, NAT gateway, Application Load Balancer, Internet gateway, AWS Direct Connect, AWS Transit Gateway and the VMware VCF platform.

The following prerequisites are necessary to complete this solution.

  • An EVS VPC includes:
    • An Amazon EVS cluster (minimum 4x i4i nodes)
    • VPC CIDR: 10.0.0.0/16
    • NSX Segments CIDR: 192.168.0.0/19 (summarized)
    • A VPC Route Server deployed in the EVS VPC to receive NSX segment routes via BGP dynamic routing. Refer to the EVS User Guide for more details.
  • A Workload VPC (VPC01):
    • CIDR: 172.21.0.0/16
  • An Egress VPC:
    • CIDR: 172.23.0.0/16
    • 1x Internet Gateway
    • 1x NAT Gateway
  • An Ingress VPC:
    • CIDR: 172.24.0.0/16
    • 1x Internet Gateway
    • 1x Application Load Balancer
  • Optional: a Direct Connect Gateway:
    •  connecting to the on-premises environment (10.0.0.0/8)

Note: The CIDR blocks used in this example are for demo purposes only; change the address spaces to match your own networking environment. The design can also be scaled to include additional EVS environments and/or other VPCs based on workload needs.

Walkthrough

In this section, we walk through the implementation steps to deploy the centralized inspection architecture with AWS Network Firewall and AWS Transit Gateway. We focus on the overall network integration of the architecture without diving into the detailed configurations of AWS Network Firewall or Transit Gateway.

1. Create an AWS Transit Gateway

In the VPC console, create a Transit Gateway. Make sure to deselect the following options:

  • Default route table association
  • Default route table propagation

Create two empty transit gateway route tables and associate them with the Transit Gateway.

  • Pre-inspection route table: steers traffic into the AWS Network Firewall for centralized inspection
  • Post-inspection route table: returns traffic back to its original destination after inspection and is permitted by the AWS Network Firewall

2. Attach VPCs to the Transit Gateway

Attach all four VPCs (EVS, VPC01, Ingress, Egress) to the same Transit Gateway. The Direct Connect Gateway can also be attached to the Transit Gateway if AWS Network Firewall is needed to inspect traffic between the on-premises environment and AWS or the internet.

Figure 2: Attach VPCs to the Transit Gateway

Figure 2: Attach VPCs to the Transit Gateway

Associate all attachments to the pre-inspection Transit Gateway route table.

Figure 3: Associate VPC attachments to the pre-inspection route table

Figure 3: Associate VPC attachments to the pre-inspection route table

3. Create an AWS Network Firewall with Transit Gateway native integration

In the Network Firewall section of the VPC console, choose Create firewall.

At the Attachment type section, select Transit Gateway to enable native integration with the existing Transit Gateway.

Figure 4: Enable AWS Network Firewall native integration with Transit Gateway

Figure 4: Enable AWS Network Firewall native integration with Transit Gateway

At the Logging configuration, enable the following log types with CloudWatch log group as the log destination. Create a log group for each log type in the CloudWatch Console.

  • Alert: /anfw-centralized/anfw01/alert
  • Flow: /anfw-centralized/anfw01/flow

Create and associate an empty firewall policy to deploy the AWS Network Firewall instance. The firewall policy contains a list of rule groups that define how the firewall inspects and manages traffic. This empty firewall policy can be configured later.

With the Transit Gateway native integration enabled, a Transit Gateway attachment is automatically created for the AWS Network Firewall, with the resource type shown as Network Function. In addition, the Appliance Mode is automatically enabled for the firewall attachment to make sure the Transit Gateway continues to use the same Availability Zone (AZ) for the attachment over the lifetime of a flow.

Associate the firewall attachment to the post-inspection Transit Gateway route table.

Figure 5: AWS Network Firewall native attachment

Figure 5: AWS Network Firewall native attachment

4. Update Transit Gateway route tables

Update the pre-inspection Transit Gateway route table with a default route that points to the AWS Network Firewall attachment. This makes sure traffic that arrives to the Transit Gateway from all VPC attachments and the Direct Connect Gateway attachment is sent to the firewall for centralized inspection.

Figure 6: Transit Gateway pre-inspection route table

Figure 6: Transit Gateway pre-inspection route table

Add the following static routes to the post-inspection route table to direct return traffic back to each VPC and the Direct Connect Gateway accordingly.

Figure 7: Transit Gateway post-inspection route table

Figure 7: Transit Gateway post-inspection route table

5. Update VPC route tables

Finally, update route tables at each VPC as per the following table.

Make sure to add the following routes at the relevant VPC route tables:

  • EVS VPC and VPC01 have a default route (marked in blue) to steer all egress flows into AWS Network Firewall for centralized inspection.
  • Ingress VPC and Egress VPC have RFC-1918 routes (marked in green) to direct return traffic to the Transit Gateway.

Within the EVS VPC, notice the NSX segment routes are automatically propagated to the NSX uplink subnet route table and the private subnet route table via the VPC Route Server.

Figure 8: NSX uplink subnet route table within EVS VPC

Figure 8: NSX uplink subnet route table within EVS VPC

A centralized security inspection architecture has now been deployed for the EVS environment, using AWS Network Firewall with Transit Gateway native integration.

6. Testing

Egress inspection (FQDN filtering)

To test egress inspection from EVS VPC or VPC01 to the internet, create a stateful rule group for the firewall instance using FQDN filtering:

  • Rule group format: Domain list
  • Domain names: .google.com
  • Source IPs: 192.168.0.0/19, 172.21.0.0/16
  • Protocols: HTTP & HTTPS
  • Action: Allow

As expected, testing web access from a virtual machine (192.168.12.10) within the EVS environment to the allowed domain (i.e. google.com) is permitted by the AWS Network Firewall. However, access to unauthorized domain (i.e. facebook.com) is blocked at the firewall with an alert trigged, which can be verified at the CloudWatch log group at /aws/network-firewall/alert/.

Figure 9: Egress inspection from EVS to internet with FQDN filtering

Figure 9: Egress inspection from EVS to internet with FQDN filtering

Ingress inspection

Create another stateful rule group to allow Application Load Balancers deployed within the Ingress VPC to access a web server running in the EVS environment via HTTP protocol:

  • Rule group format: Standard stateful rule
  • Geographic IP Filtering: Disable Geographic IP filtering
  • Protocol: HTTP
  • Source: 172.24.0.0/16
  • Source Port: ANY
  • Destination: 192.168.12.10/32
  • Destination Port ANY
  • Traffic direction: Forward
  • Action: Alert

The CloudWatch firewall logs show an Application Load Balancer (172.24.6.45) from the Ingress VPC can establish HTTP connection to the EVS web server (192.168.12.10). Additionally, the Application Load Balancer has successfully registered the EVS web server as a remote IP target.

Figure 10: Ingress inspection from Ingress VPC to EVS

Figure 10: Ingress inspection from Ingress VPC to EVS

East-West inspection

For East-West inspection testing, update the previous stateful rule group to add a new rule to block ICMP traffic from VPC01 to the EVS VPC.

  • Rule group format: Standard stateful rule
  • Geographic IP Filtering: Disable Geographic IP filtering
  • Protocol: ICMP
  • Source: 172.21.0.0/16
  • Source Port: ANY
  • Destination: 192.168.0.0/19
  • Destination Port: ANY
  • Action: Drop

As a result, pings from an EC2 instance (172.21.128.4) from VPC01 to the EVS web server (192.168.12.10) are being dropped.

Figure 11: East-West Inspection from VPC01 to EVS

Figure 11: East-West Inspection from VPC01 to EVS

Conclusion

In this post, we demonstrated how to utilize AWS Network Firewall to secure Amazon EVS workloads and to provide centralized traffic inspection between Amazon EVS environments, Amazon VPCs, on-premises data centers, and the internet. We walked through the implementation steps for deploying the centralized inspection architecture using AWS Network Firewall and AWS Transit Gateway.

To learn more, review these resources:


About the authors

Orchestrating data processing tasks with a serverless visual workflow in Amazon SageMaker Unified Studio

Post Syndicated from Suba Palanisamy original https://aws.amazon.com/blogs/big-data/orchestrating-data-processing-tasks-with-a-serverless-visual-workflow-in-amazon-sagemaker-unified-studio/

Automation of data processing and data integration tasks is essential for data engineers and analysts to maintain up-to-date data pipelines and reports. Amazon SageMaker Unified Studio is a single data and AI development environment where you can find and access the data in your organization and act on it using the ideal tools for your use case. SageMaker Unified Studio offers multiple ways to integrate with data through its editorial tools, including Visual ETL, Query Editor, and JupyterLab builders.

Recently, AWS launched the visual workflow experience in SageMaker Unified Studio IAM-based domains. With visual workflows, you don’t need to code Python DAGs manually or have deep expertise in Apache Airflow. Instead, you can visually define orchestration workflows through an intuitive drag-and-drop interface in SageMaker Unified Studio. The visual definition is automatically converted to workflow definitions that leverage Amazon Managed Workflows for Apache Airflow (Amazon MWAA) Serverless, providing enterprise-grade orchestration capabilities with a simplified user experience.

In this post, we show how to use the new visual workflow experience in SageMaker Unified Studio IAM-based domains to orchestrate an end-to-end machine learning workflow. The workflow ingests weather data, applies transformations, and generates predictions—all through a single, intuitive interface, without writing any orchestration code.

For more details on Amazon MWAA Serverless, see Introducing Amazon MWAA Serverless.

Example use case

To demonstrate how SageMaker Unified Studio simplifies end-to-end workflow orchestration, let’s walk through a real-world scenario from agricultural analytics. The following diagram shows a weather data processing workflow that we will orchestrate using the visual workflow experience in SageMaker Unified Studio.

A regional agricultural extension office collects hourly weather data from multiple stations across farming communities. Their goal is to analyze this data and provide farmers with actionable insights into weather patterns and their impact on crop conditions. To achieve this, the team built a ML–powered analytics workflow using SageMaker Unified Studio to automate the processing of incoming weather data and predict irrigation needs.

In this walkthrough, we demonstrate how the visual workflow experience in Unified Studio can orchestrate an end-to-end data pipeline that:

  • Monitors and ingests hourly weather data from Amazon Simple Storage Service (Amazon S3)
  • Transforms raw weather measurements using Visual ETL jobs (type casting, SQL operations, and data cleansing)
  • Generates seasonal irrigation predictions and crop impact insights using JupyterLab notebooks

Whenever new weather data arrives, the workflow automatically routes it through a series of transformation steps, and produces ready-to-use insights—all visually orchestrated in SageMaker Unified Studio with no custom orchestration code required.

Prerequisites

Before you begin, complete the following steps:

  1. Signup for an AWS account and create a user with administrative access using the setup guide.
  2. Setup your SageMaker Unified Studio IAM-based domain:
    1. Navigate to the Amazon SageMaker console and use the Region selector in the top navigation bar to choose the appropriate AWS Region.
    2. On the Amazon SageMaker home page, choose Get started.
    3. For Project data access, choose to Auto-create a new role with admin permissions.
    4. Select the checkbox for S3 table integration with AWS Analytics services, for Data encryption choose Use AWS owned key, and then Set up.
  3. Go back to the Amazon SageMaker home page and choose Open to access the SageMaker Unified Studio experience.
  4. From the SageMaker Studio UI you can access the project in the SageMaker Unified Studio IAM-based domain. This project curates all assets accessible through the designated Execution IAM role.

Workflow implementation steps

In this section, we use Amazon SageMaker Studio to create an end-to-end visual workflow in IAM-based domain.

Step 1: Set up data storage and import weather dataset

First, we’ll prepare the Amazon S3 storage locations for raw and processed data:

  1. Download this weather dataset file to your local environment.
  2. From the left menu of the project, choose Files. Under Shared, create two new folders raw_data and processed_data.
  3. Upload the weather dataset file downloaded locally into raw_data folder.

Step 2: Create the weather data transformation job using Visual ETL

Next, create a Visual ETL job to transform the raw weather data through type casting, SQL transformations, and data cleansing:

  1. From the left menu, under Data Analytics, choose Visual ETL and Create Visual Job.
  2. Choose the + sign, and under Data sources, choose Amazon S3.
  3. For the Amazon S3 node settings, choose the following:
    • S3 URI: Choose Browse S3 and Select
    • Delimiter: ,
    • Multiline: Disabled
    • Header: Enabled
    • Infer schema: Disabled
    • Recursive file lookup: Disabled

  4. Choose the + sign next to the Amazon S3 box to add another node, under Transforms select Change columns.
  5. Connect the Amazon S3 node to the change columns node.
  6. Select the Change columns node to open the configuration window.
    • Choose Add type cast. Select temperature_2m (¬∞C) as the source column and add temperature_celsius as the target column. Select float as the Type.
    • Select precipitation (mm) as the source column and add Precipitation_mm as the target column. Select float as the Type.
    • Select rain (mm) as the source column and add Rain_mm as the target column. Select float as the Type.
    • Select windspeed_10m (km/h) as the source column and add windspeed as the target column. Select float as the Type.
    • Close the configuration window.

  7. Choose the + sign to add another node, under Transforms select SQL query . In the configuration window, paste in the following SQL statement:
    SELECT 
        MAKE_TIMESTAMP(2016, 1, day_of_year, hour, 0, 0) as timestamp,
        (temperature_celsius * 9 / 5) + 32 as temp_f,
        rain_mm * 25.4 as rain_inches,
        windspeed
    FROM {myDataSource}

  8. Choose the + sign to add another node, under Data targets, choose Amazon S3 and provide the following options:
    • S3 URI: Choose Browse S3 and select the processed_data folder created in Step 1.
    • Format: CSV
    • Update catalog: true
    • Database: sagemaker_sample_db
    • Table: weather_data
    • Include header: true
    • Ouput to a single file: false

  9. Connect the nodes to create a complete job.
  10. Save the Visual ETL and name it DataProcessing.

Step 3: Create the analysis and prediction notebook using JupyterLab

Now, we’ll set up the JupyterLab notebook that performs seasonal irrigation analysis and crop impact predictions based on temperature, rainfall, and wind speed patterns.Complete the following steps:

  1. Download the Crop Irrigation Prediction Python notebook to your local environment.
  2. In the SageMaker Unified Studio, from the left menu, choose JupyterLab. Wait for a few seconds for JupyterLab to be set up if you are trying for the first time.
  3. Upload CropIrrigationPrediction.ipynb using the upload files option.
  4. Review the notebook code to understand how it processes the weather data and generates irrigation predictions.

Step 4: Orchestrate the workflow

Finally, we will use the visual workflow to orchestrate tasks. With visual workflows, you can define a collection of tasks organized as a directed acyclic graph (DAG) that can run on a user-defined schedule.

  1. Choose Workflows from the left menu.
  2. Choose Create new Workflow.
  3. Rename the workflow to WeatherDataProcessingOrchestration.
  4. Create S3 task for monitoring and ingesting raw weather data:
    1. Choose the + sign, then choose S3 Key Sensor.
    2. Select S3-task to open the configuration window.
    3. For Bucket key choose Browse S3 and choose the synthetic_weather_hourly_data.csv file from the shared/raw_data S3 folder.



  5. Create a Glue task to transform the weather data:
    1. Choose the + Sign and add Data Processing Job / Glue Job Operator.
    2. Select Glue-task node to open the configuration window. For Operation type select Choose an existing Glue job.
    3. For Job name, choose Browse Jobs and select DataProcessing (this is the visual ETL job we created in the previous step.


  6. Choose the + sign and add SageMaker Unified Studio Jupyter Notebook Operator.
  7. Select the Notebook-task to open the configuration window. For Source, choose Browse Files and choose CropIrrigationPrediction.ipynb.

  8. Connect the tasks to create the complete workflow.
  9. Review the Workflow settings and choose Save.
    1. Provide a workflow description, “Workflow for Weather Data Processing”
    2. For Trigger, choose Manual only, because in this example you will trigger the workflow manually. You can also configure the workflow to trigger automatically on a schedule or disable it from running

Step 5: Execute and monitor the workflow

To run your workflow, complete the following steps:

  1. Choose Run to trigger workflow execution.
  2. Choose View runs to see the running workflow.
  3. Choose the Run ID for detailed logs on the execution.
  4. When the run is complete, you can review the task logs by choosing the Task ID.

The model’s output is written to the S3 processed data output folder. You can review the crop irrigation prediction results to verify they reflect realistic weather patterns and field conditions. If any results appear unexpected or unclear, examine the upstream transformation steps or adjust the notebook logic to refine the outputs.

Clean up

To avoid incurring future charges, clean up the resources you created during this walkthrough. Leaving these resources running may result in ongoing costs for storage and compute.To clean up your resources:

  1. On the workflows page, select your workflow, and under Actions, choose Delete workflow.
  2. In Visual ETL, select your weather data transformation flow, and under Actions, choose Delete job.
  3. In Query Editor, use the three dots next to the name of the table weather_data and choose Drop table.
  4. In JupyterLab, in the File Browser sidebar, choose (right-click) your notebook and choose Delete.
  5. In Files, choose the folder raw_data and under Actions, choose Delete. Repeat the steps for the folders processed_data and output.

Conclusion

In this post, you learned how you can use the visual workflow experience in Amazon SageMaker Unified Studio to build end-to-end data processing pipelines through an intuitive, no-code interface. This experience removes the need to write orchestration logic manually while still offering production-grade reliability and scalability powered by Amazon MWAA Serverless. Whether you’re processing weather data for agricultural insights or building more complex machine learning pipelines, the visual workflow experience accelerates development and makes workflow automation accessible to data engineers, analysts, and data scientists alike.As organizations increasingly rely on automated data pipelines to drive business decisions, the visual workflow experience provides the perfect balance of simplicity and power. We encourage you to explore this new capability in Amazon SageMaker Unified Studio and discover how it can transform your data processing workflows.

To learn more, visit the Amazon SageMaker Unified Studio page.


About the authors

Suba Palanisamy

Suba Palanisamy

Suba is an Enterprise Support Lead, helping customers achieve operational excellence on AWS. Suba is passionate about all things data and analytics. She enjoys traveling with her family and playing board games.

Vinod Jayendra

Vinod Jayendra

Vinod is an Enterprise Support Lead in ISV accounts at Amazon Web Services, where he helps customers in solving their architectural, operational, and cost optimization challenges. With a particular focus on Serverless & Analytics technologies, he draws from his extensive background in application development to deliver top-tier solutions. Beyond work, he finds joy in quality family time, embarking on biking adventures, and coaching youth sports team.

Kamen Sharlandjiev

Kamen Sharlandjiev

Kamen is a Senior Worldwide Specialist SA, Big Data expert. He’s on a mission to make life easier for customers who are facing complex data integration and orchestration challenges. His secret weapon? Fully managed AWS services that can get the job done with minimal effort. Follow Kamen on LinkedIn to keep up to date with the latest MWAA and AWS Glue features and news!

Yuhang Huang

Yuhang Huang

Yuhang is a Software Development Manager on the Amazon SageMaker Unified Studio team. He leads the engineering team to design, build, and operate scheduling and orchestration capabilities in SageMaker Unified Studio. In his free time, he enjoys playing tennis.

Vasudevan Venkataramanan

Vasudevan Venkataramanan

Vasudevan is a Senior Software Engineer on the Amazon SageMaker Unified Studio team. He is responsible for technical direction of scheduling and orchestration within SageMaker Unified Studio. Outside of his professional work, he enjoys spending time with his kid and playing pickleball and cricket.

Gal Heyne

Gal Heyne

Gal is a Senior Technical Product Manager for AWS Analytics services with a strong focus on AI/ML and data engineering. She is passionate about developing a deep understanding of customers’ business needs and collaborating with engineers to design simple-to-use data products.

Introducing guidelines for network scanning

Post Syndicated from Stephen Goodman original https://aws.amazon.com/blogs/security/introducing-guidelines-for-network-scanning/

Amazon Web Services (AWS) is introducing guidelines for network scanning of customer workloads. By following these guidelines, conforming scanners will collect more accurate data, minimize abuse reports, and help improve the security of the internet for everyone.

Network scanning is a practice in modern IT environments that can be used for either legitimate security needs or abused for malicious activity. On the legitimate side, organizations conduct network scans to maintain accurate inventories of their assets, verify security configurations, and identify potential vulnerabilities or outdated software versions that require attention. Security teams, system administrators, and authorized third-party security researchers use scanning in their standard toolkit for collecting security posture data. However, scanning is also performed by threat actors attempting to enumerate systems, discover weaknesses, or gather intelligence for attacks. Distinguishing between legitimate scanning activity and potentially harmful reconnaissance is a constant challenge for security operations.

When software vulnerabilities are found through scanning a given system, it’s particularly important that the scanner is well-intentioned. If a software vulnerability is discovered and attacked by a threat actor, it could allow unauthorized access to an organization’s IT systems. Organizations must effectively manage their software vulnerabilities to protect themselves from ransomware, data theft, operational issues, and regulatory penalties. At the same time, the scale of known vulnerabilities is growing rapidly, at a rate of 21% per year for the past 10 years as reported in the NIST National Vulnerability Database.

With these factors at play, network scanners need to scan and manage the collected security data with care. There are a variety of parties interested in security data, and each group uses the data differently. If security data is discovered and abused by threat actors, then system compromises, ransomware, and denial of service can create disruption and costs for system owners. With the exponential growth of data centers and connected software workloads providing critical services across energy, manufacturing, healthcare, government, education, finance, and transportation sectors, the impact of security data in the wrong hands can have significant real-world consequences.

Multiple parties

Multiple parties have vested interests in security data, including at least the following groups:

  • Organizations want to understand their asset inventories and patch vulnerabilities quickly to protect their assets.
  • Program auditors want evidence that organizations have robust controls in place to manage their infrastructure.
  • Cyber insurance providers want risk evaluations of organizational security posture.
  • Investors performing due diligence want to understand the cyber risk profile of an organization.
  • Security researchers want to identify risks and notify organizations to take action.
  • Threat actors want to exploit unpatched vulnerabilities and weaknesses for unauthorized access.

The sensitive nature of security data creates a complex ecosystem of competing interests, where an organization must maintain different levels of data access for different parties.

Motivation for the guidelines

We’ve described both the legitimate and malicious uses of network scanning, and the different parties that have an interest in the resulting data. We’re introducing these guidelines because we need to protect our networks and our customers; and telling the difference between these parties is challenging. There’s no single standard for the identification of network scanners on the internet. As such, system owners and defenders often don’t know who is scanning their systems. Each system owner is independently responsible for managing identification of these different parties. Network scanners might use unique methods to identify themselves, such as reverse DNS, custom user agents, or dedicated network ranges. In the case of malicious actors, they might attempt to evade identification altogether. This degree of identity variance makes it difficult for system owners to know the motivation of parties performing network scanning.

To address this challenge, we’re introducing behavioral guidelines for network scanning. AWS seeks to provide network security for every customer; our goal is to screen out abusive scanning that doesn’t meet these guidelines. Parties that broadly network scan can follow these guidelines to receive more reliable data from AWS IP space. Organizations running on AWS receive a higher degree of assurance in their risk management.

When network scanning is managed according to these guidelines, it helps system owners strengthen their defenses and improve visibility across their digital ecosystem. For example, Amazon Inspector can detect software vulnerabilities and prioritize remediation efforts while conforming to these guidelines. Similarly, partners in AWS Marketplace use these guidelines to collect internet-wide signals and help organizations understand and manage cyber risk.

“When organizations have clear, data-driven visibility into their own security posture and that of their third parties, they can make faster, smarter decisions to reduce cyber risk across the ecosystem.” – Dave Casion, CTO, Bitsight

Of course, security works better together, so AWS customers can report abusive scanning to our Trust & Safety Center as type Network Activity > Port Scanning and Intrusion Attempts. Each report helps improve the collective protection against malicious use of security data.

The guidelines

To help ensure that legitimate network scanners can clearly differentiate themselves from threat actors, AWS offers the following guidance for scanning customer workloads. This guidance on network scanning complements the policies on penetration testing and vulnerability reporting. AWS reserves the right to limit or block traffic that appears non-compliant with these guidelines. A conforming scanner adheres to the following practices:

Observational

  • Perform no actions that attempt to create, modify, or delete resources or data on discovered endpoints.
  • Respect the integrity of targeted systems. Scans cause no degradation to system function and cause no change in system configuration.
  • Examples of non-mutating scanning include:
    • Initiating and completing a TCP handshake
    • Retrieving the banner from an SSH service

Identifiable

  • Provide transparency by publishing sources of scanning activity.
  • Implement a verifiable process for confirming the authenticity of scanning activities.
  • Examples of identifiable scanning include:
    • Supporting reverse DNS lookups to one of your organization’s public DNS zones for scanning Ips.
    • Publishing scanning IP ranges, organized by types of requests (such as service existence, vulnerability checks).
    • If HTTP scanning, have meaningful content in user agent strings (such as names from your public DNS zones, URL for opt-out)

Cooperative

  • Limit scan rates to minimize impact on target systems.
  • Provide an opt-out mechanism for verified resource owners to request cessation of scanning activity.
  • Honor opt-out requests within a reasonable response period.
  • Examples of cooperative scanning include:
    • Limit scanning to one service transaction per second per destination service.
    • Respect site settings as expressed in robots.txt and security.txt and other such industry standards for expressing site owner intent.

Confidential

  • Maintain secure infrastructure and data handling practices as reflected by industry-standard certifications such as SOC2.
  • Ensure no unauthenticated or unauthorized access to collected scan data.
  • Implement user identification and verification processes.

See the full guidance on AWS.

What’s next?

As more network scanners follow this guidance, system owners will benefit from reduced risk to their confidentiality, integrity, and availability. Legitimate network scanners will send a clear signal of their intention and improve their visibility quality. With the constantly changing state of networking, we expect that this guidance will evolve along with technical controls over time. We look forward to input from customers, system owners, network scanners and others to continue improving security posture across AWS and the internet.

If you have feedback about this post, submit comments in the Comments section below or contact AWS Support.

Stephen Goodman

Stephen Goodman

As a senior manager for Amazon active defense, Stephen leads data-driven programs to protect AWS customers and the internet from threat actors.

Save up to 24% on Amazon Redshift Serverless compute costs with Reservations

Post Syndicated from Satesh Sonti original https://aws.amazon.com/blogs/big-data/save-up-to-24-on-amazon-redshift-serverless-compute-costs-with-reservations/

 Amazon Redshift Serverless makes it convenient to run and scale analytics without managing clusters, offering a flexible pay-as-you-go model. With Redshift Serverless Reservations, you can optimize compute costs and improve cost predictability for your Redshift Serverless workloads.

In this post, you learn how Amazon Redshift Serverless Reservations can help you lower your data warehouse costs. We explore ways to determine the optimal number of RPUs to reserve, review example scenarios, and discuss important considerations when purchasing these reservations.

How Amazon Redshift Serverless Reservations work

Amazon Redshift measures data warehouse capacity in Redshift Processing Units (RPUs). You pay for the workloads you run in RPU-hours on a per-second basis (with a 60-second minimum charge). 1 RPU provides 16 GB of memory. You can commit to a specific number of Redshift Processing Units (RPUs) for a one-year term. Two payment options are available: a no-upfront option with a 20% discount off on-demand rates, or an all-upfront option with a 24% discount. The reserved amount of RPUs is billed 24 hours a day, seven days a week

Amazon Redshift Serverless Reservations are managed at the AWS payer account level and can be shared across multiple AWS accounts. Any usage beyond your committed RPU level is charged at standard on-demand rates. You can purchase Serverless Reservations through either the Amazon Redshift console or the Amazon Redshift Serverless Reservations API using the create-reservation command.

Key benefits of Amazon Redshift Serverless Reservations

The following are some of the key benefits of subscribing to Redshift Serverless Reservations.

  • Cost savings through commitment: Redshift Serverless Reservations help you reduce your overall Redshift Serverless spend compared to on-demand (non-reserved) usage.
  • Centralized management: Supports reservation administration at the AWS payer account level for simplified governance and visibility across your organization.
  • Per-second metering with hourly billing: Offers per-second metering with hourly billing, so that you only pay for what you use. This cost-effective pricing model eliminates wasted resources and unnecessary charges, lowering your Amazon Redshift Serverless spend.
  • Predictable costs: The 24 hours a day, 7 days a week billing model offers stable monthly costs that simplify forecasting and budgeting.
  • Sharing capabilities between multiple AWS accounts: Enhances collaboration across different teams and departments, enabling improved resource utilization throughout your organization.

Determining optimal RPU reservation

You can determine your RPU reservation level through your serverless usage history and the AWS Billing and Cost Management recommendations.

Serverless usage history

You can use the Redshift Serverless Dashboard, which provides a detailed view of your workgroup and namespace activities. The dashboard helps you to analyze trends and patterns in your data warehouse usage. You can easily monitor your RPU capacity usage and total compute usage, helping you make informed decisions about resource allocation. For more granular analysis, you have the option to query the SYS_SERVERLESS_USAGE system table, which provides detailed historical usage data. To optimize costs while ensuring performance, you can reserve the minimum consistent RPUs used per hour by analyzing the usage patterns across all your workgroups.

AWS Billing and Cost Management recommendations

You can use AWS Billing and Cost Management to help you estimate your capacity needs:

  1. Navigate to your Billing and Cost Management dashboard.
  2. On the left navigation pane, expand Reservations.
  3. Choose Recommendations.
  4. Select Redshift in the Service drop down menu.
  5. Choose required Term, Payment option, and Based on the past to select the history to determine reservation recommendations.
  6. You will find the recommendations in the Recommendations section. The following is an example screen:

Recommendations Section

The following example shows a Redshift Serverless purchase recommendation from AWS Cost Management. The interface displays a specific recommendation to buy Reserved Instances with key details including the term length, AWS Region, payment option, and expected utilization rate. The recommendation includes upfront and recurring cost information, with a direct link to the Amazon Redshift console for implementation.

RS Serverless RPU instances

If reservations are not recommended based on your usage, then you will see “Based on your selections, no purchase recommendations are available for you at this time. Adjust your selections to view recommendation” message under the Recommended actions section.

Cost Explorer generates your reservation recommendations by identifying your On-demand usage during a specific period and identifying the best number of reservations to purchase to maximize your estimated savings.

Disclaimer: The approaches described above provide an estimate of your optimal RPU reservation level. Actual results may vary depending on workload patterns, peak usage, and utilization variability. Your RPU commitment may not always yield the maximum available discount percentage, as savings depend on how closely your Redshift Serverless Reserved RPUs aligns with real usage over time. This recommendation does not guarantee the cost for your actual use of AWS services.

For more details visit Accessing reservation recommendations.

Real-world scenarios

Let’s examine two different scenarios to understand how reservations can help you optimize costs, we’ll walk through the scenario of a single Redshift Serverless workgroup and a scenario with multiple Redshift serverless workgroups.

Scenario 1: Single Redshift Serverless workgroup

Let’s consider you have only one Redshift Serverless workgroup in your environment and the workload is spread as described in the following table.

In the table, hourly RPU consumption metrics for workgroup1 across different time intervals. The data shows a reservation of 64 RPUs with no upfront payment option, which provides a 20% discount. The table breaks down the compute usage into two categories: Reserved compute, consistently showing 64 RPUs across all hours, and On-demand compute, which varies based on actual consumption above the reserved capacity. The bottom row displays the Total charged RPUs, which reflects the final billing after applying the reserved instance discount. This helps visualize how the workload utilizes the reserved capacity and any additional on-demand usage throughout the specified time period.The total actual RPU consumption is 1,664 and the total charged consumption is 1,484.8. This configuration results in a 10.7% net discount.

Scenario 2: Multiple Redshift Serverless workgroups

In this scenario, you have multiple Redshift serverless workgroup in your environment and the workload is spread as described in the following table.

Similar to the previous single workgroup scenario, you can see hourly RPU consumption metrics for workgroups across different time intervals. In this scenario, you have also opted for 64 RPUs reserved with no upfront option, which applies a 20% discount to the workload. However, you can notice that the total consumption across workgroups matches the total reserved RPUs. This maximizes your total savings even though individual workgroups consumed less than the total RPUs reserved at the payer account level.

The total actual RPU consumption is 1,536 (768+512+256) across workgroups and the total charged consumption is 1,228.8. This configuration results in a 20% net discount.

You can use the following query to find the average RPUs consumed in each hour in a workgroup.

SELECT
date_trunc('hour',end_time) AS run_hr,
avg(compute_capacity)
FROM SYS_SERVERLESS_USAGE
GROUP BY 1
ORDER BY 1

You can use the output of this query to populate a spreadsheet with a similar structure as the ones used in the previous scenarios.

Considerations

We recommend you consider the following when using Redshift Serverless Reservations:

  • Start conservatively: Avoid over-purchasing Serverless Reservations RPUs. It’s best to begin with a minimum base RPU level or align your commitment to the average RPU usage across all Redshift Serverless workgroups under your AWS payer and linked accounts.
  • Reservations are immutable: Once purchased, Redshift Serverless Reservations can’t be changed or deleted. However, you can add additional reservations later to increase your coverage as your workloads grow.
  • Discount sharing control: The management account in an AWS Organization can disable Reserved Instance or Savings Plan discount sharing for any linked accounts, including itself. See the AWS documentation for details.
  • Automatic discount application: Redshift Serverless Reservations billing model automatically applies all the reserved RPU discount to your workloads before using on-demand cost, helping you save on costs.
  • Reservations are Regional: They apply only within the AWS Region where they are purchased and cannot be shared across Regions.
  • Handling excess usage: If your workload exceeds the number of reserved RPUs, the additional usage is billed at the standard on-demand rate.
  • Use a 30 to 60-day window for recommendations: To receive the most accurate reservation recommendations, we suggest using a 30- to 60-day usage window in the Billing and Cost Management console, under Reservations, in the Recommendations section. This approach assumes that your typical production workloads have been running during that period so that the recommendations reflect real-world usage.

Conclusion

In this post, we described how Amazon Redshift Serverless Reservations provide a way to reduce your data warehouse costs while maintaining the flexibility of Redshift serverless pricing. By carefully planning your Amazon Redshift Serverless Reservation strategy and monitoring usage patterns, you can achieve up to 24% cost savings for your Redshift Serverless analytics workloads. For detailed documentation, see Billing for serverless reservations.


About the authors

Satesh Sonti

Satesh Sonti

Satesh is a Principal Specialist Solutions Architect based out of Atlanta, specializing in building enterprise data platforms, data warehousing, and analytics solutions. He has over 20 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe.

Ashish Agrawal

Ashish Agrawal

Ashish is a Principal Product Manager with Amazon Redshift, building cloud-based data warehouses and analytics cloud services. Ashish has over 25 years of experience in IT. Ashish has expertise in data warehouses, data lakes, and platform as a service. Ashish has been a speaker at worldwide technical conferences.

Improving throughput of serverless streaming workloads for Kafka

Post Syndicated from Anton Aleksandrov original https://aws.amazon.com/blogs/compute/improving-throughput-of-serverless-streaming-workloads-for-kafka/

Event-driven applications often need to process data in real-time. When you use AWS Lambda to process records from Apache Kafka topics, you frequently encounter two typical requirements: you need to process very high volumes of records in close to real-time, and you want your consumers to have the ability to scale rapidly to handle traffic spikes. Achieving both necessitates understanding how Lambda consumes Kafka streams, where the potential bottlenecks are, and how to optimize configurations for high throughput and best performance.

In this post, we discuss how to optimize Kafka processing with Lambda for both high throughput and predictable scaling. We explore the Lambda’s Kafka Event Source Mappings (ESMs) scaling, optimization techniques available during record consumption, how to use ESM Provisioned Mode for bursty workloads, and which observability metrics you need to use for performance optimization.

Overview

To start processing records from a Kafka topic with a Lambda function, whether using Amazon Managed Streaming for Apache Kafka (Amazon MSK) or a self-managed Kafka cluster, you create an ESM: a lightweight serverless resource that consumes records from Kafka topics and invokes your function.

The scaling behavior of Kafka ESMs is based on the offset lag. This is a metric indicating the number of records in the topic that have not yet been consumed by the Lambda function. This metric typically grows when producers publish new records faster than consumers process them. As the lag grows, the Lambda service gradually adds more Kafka consumers (also known as pollers) to your ESM. To preserve ordering guarantees, the maximum number of pollers is capped by the number of partitions in the topic. Lambda also scales pollers down automatically when lag decreases.

Each ESM follows a consistent polling workflow: poll -> filter -> batch -> invoke, as shown in the following diagram. Every stage has configurable options that directly affect performance, latency, and cost.


Figure 1. ESM processing workflow.

Polling: Increasing predictability with Provisioned Mode

By default, Kafka ESM uses the on-demand polling mode. In this mode, ESM starts with one poller, automatically adds more pollers when the offset lag grows, and scales the number of pollers down as lag decreases. On-demand mode does not need upfront scaling configuration and is the lowest-cost option for steady workloads. For many applications, this behavior is sufficient: scaling up can take several minutes, but the throughput eventually catches up, and you only pay for the resources you use, such as number of invocations.

However, if your workloads are bursty and latency-sensitive, then on-demand scaling may not be fast enough and can result in a rapidly growing lag. This can be addressed by switching to Provisioned Mode, which gives you more fine-grained control to configure a minimum and maximum number of always-on pollers for your Kafka ESM. These pollers remain connected even when traffic is low, so consumption begins immediately when a spike occurs, and scaling within the configured range is faster and more predictable.

The following diagram shows the performance improvements of using the ESM in Provisioned Mode for bursty workloads. You can see that in on-demand mode it took ESM over 15 minutes to eventually catch up to the new traffic volume, while in Provisioned Mode the ESM handled the traffic increase instantly.


Figure 2. Comparing Kafka ESM on-demand and Provisioned Mode.

Best practices for using Provisioned Mode:

  • Start small: Provisioned Mode is a paid capability. AWS recommends that for smaller topics (less than 10 partitions) you start with a single provisioned poller to evaluate throughput and observe workload behavior. For larger topics, you can start with a higher number of provisioned pollers to accommodate the baseline consumption. You can adjust this configuration at any time as you learn traffic patterns and refine your performance targets.
  • Estimate throughput: A single provisioned poller can process up to 5 MB/s of Kafka data. Monitor your average record size and per-record processing time to establish a baseline for minimum and maximum pollers, then validate with real workload metrics.
  • Set a low floor and flexible ceiling: Choose a minimum number of pollers that makes sure that latency targets are met when a traffic burst occurs, then allow the ESM to scale toward a higher maximum as needed.

See Low latency processing for Kafka event sources for more information.

To summarize:

  • Use Provisioned Mode for bursty traffic, strict SLOs, or when backlogs pose downstream risk.
  • Use on-demand polling mode for steady traffic, flexible latency requirements, or when minimizing cost is the primary objective.

Filtering: Drop irrelevant records early

By default, all records from Kafka are delivered to your Lambda function. This approach is direct and flexible. Your handler code decides which records to process and which to ignore. This default behavior is highly efficient for workloads where nearly all records are valuable.

When you find yourself discarding a large portion of records in your handler code, you can use native ESM filtering capabilities to drop irrelevant records before they reach your function. You can filter early to reduce cost, free up concurrency, increase throughput, and make sure that your Lambda function spends cycles on valuable work only.

The following diagram shows the application of an ESM filter to only process telemetry that meets a specified condition.


Figure 3. ESM filtering configuration.

Batching: Processing more records per invocation

You can batch multiple Kafka records together to process more data per invocation and increase the efficiency of your Lambda functions. Larger batches help you achieve higher throughput and reduce costs by making better use of each invocation run. To get the best results, you should balance batch size and latency targets and adjust the configuration based on your workload’s specific traffic patterns and SLOs.

Lambda gives you two primary controls for configuring ESM batching behavior:

  • Batch window: This is how long the ESM waits to accumulate records before invoking your function. A shorter window produces smaller batches and more frequent invocations. A longer window (up to 5 minutes) produces larger batches and less frequent invocations.
  • Batch size: This is the maximum number of records that the ESM can accumulate before invoking your function, up to 10,000.

There’s no single setting that universally works for all workloads. Your optimal configuration depends on workload characteristics such as latency tolerance and record size. AWS recommends starting with the default values and then gradually adjusting the configuration based on your requirements. For example, you can increase the batch size while monitoring function duration, error rates, and end-to-end latency.

The following diagram shows how to configure batch window and size using Terraform:


Figure 4. ESM batch window and batch size configuration with Terraform.

The ESM invokes your function when one of the following three conditions is met:

  1. The batch window elapses.
  2. The accumulated batch reaches the configured maximum batch size.
  3. The accumulated payload approaches the 6 MB maximum invocation payload limit of Lambda.

When using higher batch window values during traffic spikes, you typically see more records-per-batch and longer function invocation durations. This is normal: larger batches can take longer to process. Always interpret the Duration metric in the context of the batch size being processed.

Invoke: Process each batch faster and more efficiently

You control how quickly each batch completes through two main factors: the efficiency of your function code and the compute resources you allocate to your functions. You can improve both to process more records per second, reduce the necessary concurrency, and lower cost.

Optimize your code: Review your function handler code to identify where you can reduce work per record. For example, eliminate redundant serialization, initialize dependencies once during function startup, and consider parallel processing within the handler (where applicable). For performance-critical workloads, you can also choose languages that compile to binary, such as Go or Rust, which typically deliver high performance with lower resource usage.

Tune compute resources: Increasing the memory function allocation proportionally increases vCPU. Use the Lambda PowerTuning tool to find the memory configuration that best balances performance and cost for your workload.

Correlate metrics: As you optimize, monitor Duration and Concurrency. You should see the concurrency drop as duration improves. That correlation confirms that your changes are improving the system throughput and efficiency.

When you combine handler optimizations with early filtering and efficient batching, even small improvements can make your pipeline noticeably faster to operate under load.

Observability drives good decisions

You can’t optimize what you can’t see. To tune your data processing pipeline, use a combination of OffsetLag, function invocation metrics, and Kafka broker metrics to understand your data processing performance. OffsetLag tells you whether your function is keeping up with incoming records, as shown in the following figure. Function metrics such as Duration, Concurrency, Errors, and Throttles show how efficiently your code is processing record batches. If you use Provisioned Mode, then you can use the Provisioned Pollers metric to track the poller capacity.


Figure 5. Kafka consumption observability with Amazon CloudWatch.

Always interpret function duration in the context of batch size. During traffic spikes, you can typically observe both duration and actual batch size increase, which is expected amortization, not a regression. For alerting, monitor lag growth, unexpected drops in invocation rate, and error spikes. With these signals in place, you can detect issues early and tune your configuration with confidence.

A sample step-by-step optimization loop

  1. Establish a clean baseline: Make your handler idempotent and batch-aware, start with a short batch window and moderate batch size. Monitor your ESM and confirm offset lag stays near zero at steady state.
  2. Filter early: Move static checks (record type, version, other custom properties) into ESM filtering and verify invoked counts drop relative to polled counts, proving the filter saves cost and concurrency.
  3. Increase batch size gradually while monitoring the duration, error rates, and latency metrics. Extend the batch window slightly if spikes cause too many invocations.
  4. Speed up the handler: Increase memory for more CPU, reduce per-record I/O, remove redundant serialization, and parallelize safely inside the batch while tracking duration and concurrency metrics together.
  5. Prove spike readiness: Replay realistic surges, monitor offset lag and drain time, and enable Provisioned Mode with a small minimum if recovery takes too long, adjusting with MB/s-per-poller estimates.
  6. Implement alerting: Watch for sustained lag growth, unexpected gaps between polled and invoked, and error spikes tied to partitions or large batches. Always read metrics in context with batch size.
  7. Re-evaluate periodically: Re-measure system throughput, confirm filter effectiveness, and retune batch and memory settings regularly as workloads evolve.

Conclusion

Optimizing Kafka streams processing with AWS Lambda necessitates understanding how ESMs work and tuning consumption components: polling, filtering, batching, and invoking. Filtering redundant records early removes unnecessary work, batching helps you process more records per invocation, and handler optimizations make sure that you make the most of the compute that you allocate. Together, these adjustments let you scale efficiently and keep offset lag under control.

When your workload is bursty, use Provisioned Mode to absorb spikes without long recovery times. With the right alerts on lag, errors, and unexpected polled versus invoked behavior, you can spot problems early and adjust before they impact users. Following this optimization guide gives you a practical way to measure, tune, and revisit your setup as traffic patterns change.

To learn more about optimizing Kafka consumption, see the AWS re:Invent 2024 session about Improving throughput and monitoring of serverless streaming workloads.

To learn more about building Serverless architectures see Serverless Land.

Serverless strategies for streaming LLM responses

Post Syndicated from KyungYong Shim original https://aws.amazon.com/blogs/compute/serverless-strategies-for-streaming-llm-responses/

Modern generative AI applications often need to stream large language model (LLM) outputs to users in real-time. Instead of waiting for a complete response, streaming delivers partial results as they become available, which significantly improves the user experience for chat interfaces and long-running AI tasks. This post compares three serverless approaches to handle Amazon Bedrock LLM streaming on Amazon Web Services (AWS), which helps you choose the best fit for your application.

  1. AWS Lambda function URLs with response streaming
  2. Amazon API Gateway WebSocket APIs
  3. AWS AppSync GraphQL subscriptions

We cover how each option works, the implementation details, authentication with Amazon Cognito, and when to choose one over the others.

Lambda function URLs with response streaming

AWS Lambda function URLs provide a direct HTTP(S) endpoint to invoke your Lambda function. Response streaming allows your function to send incremental chunks of data back to the caller without buffering the entire response. This approach is ideal for forwarding the Amazon Bedrock streamed output, providing a faster user experience. Streaming is supported in Node.js 18+. In Node.js, you wrap your handler with awslambda.streamifyResponse(), which provides a stream to write data to, and which sends it immediately to the HTTP response.

Architecture

The following figure shows the architecture.

Lambda function URLs with Amazon Bedrock architecture

  1. The client makes a fetch() call to the Lambda function URL.
  2. Lambda invokes InvokeModelWithResponseStream using the AWS SDK for JavaScript.
  3. As tokens arrive from Amazon Bedrock, they are written to the response stream.

Implementation steps

  1. Create a streaming Lambda function: Use a Node.js 18+ or later runtime (necessary for native streaming). Install the AWS SDK to call Amazon Bedrock. In the handler code, wrap the function with awslambda.streamifyResponse and stream the model output. For example, in Node.js you might do the following:
    const bedrock = new BedrockRuntimeClient({region: “us-east-1”});
    
    // Please consider adding more details when you use it for your application.
    exports.handler = awslambda.streamifyResponse(async (event, responseStream) => 
    {
        // 1. Parse input (e.g., prompt from event)
        const prompt = event.body?.prompt;
        // 2. Call Amazon Bedrock with streaming (using AWS SDK for Amazon Bedrock)
        const command = new InvokeModelWithResponseStreamCommand({ modelId: "YOUR_MODEL_ID", body: { prompt }});
        const response = await bedrock.send(command);
        // 3. Stream Bedrock tokens to client
        for await (const event of response.body) {
            if (event.content) {
                responseStream.write(event.content); // write partial output
            }
        }
        // 4. End stream when done
        responseStream.end();
    });
    

  2. This code snippet uses the Amazon Bedrock SDK’s async iterable to read the event stream of tokens and writes each to the responseStream.
  3. Configure the Lambda role: the execution role must allow the Amazon Bedrock invocation (such as bedrock:InvokeModelWithResponseStream on the LLM model Amazon Resource Name (ARN)).

Authentication with Amazon Cognito

Lambda function URLs can be set to “None” (public) or “AWS_IAM”. Native Cognito User Pool token authentication isn’t supported, thus you need to implement a solution.

  1. JWT verification in Lambda: Allow public access and verify a valid JWT from Amazon Cognito in the request header within your Lambda code. This necessitates development effort.
    // Initialize Cognito JWT Verifier
    const { CognitoJwtVerifier } = require('aws-jwt-verify');
    
    const jwtVerifier = CognitoJwtVerifier.create({
      userPoolId: USER_POOL_ID,
      tokenUse: 'id',
      clientId: USER_POOL_CLIENT_ID
    });
    
    // Verify JWT token from Cognito
    async function verifyToken(token) {
      try {
        if (!token) throw new Error('No authorization token provided');
        
        // Remove 'Bearer ' prefix if present
        if (token.startsWith('Bearer ')) {
          token = token.slice(7);
        }
    
        // Verify the token using Cognito JWT Verifier
        const payload = await jwtVerifier.verify(token);
        logger.info(`Verified token for user: ${payload.sub}`);
        
        return payload;
      } catch (error) {
        logger.error(`Token verification failed: ${error.message}`);
        throw new Error(`Invalid token: ${error.message}`);
      }
    }
    
    //...
    
        // Verify authentication
        let userId;
        try {
          const authHeader = event.headers?.Authorization;
          const payload = await verifyToken(authHeader);
          userId = payload.sub;
          logger.info(`Authenticated user: ${userId}`);
        } catch (error) {
          responseStream.write(`data: ${JSON.stringify({ type: 'error', error: 'Unauthorized', message: error.message })}\n\n`);
          return;
        }
    

  2. IAM authorization with Amazon Cognito identity: Use AWS credentials obtained from Amazon Cognito. A more complex setup, especially for web apps, is potentially overkill for a single function.

Pros and cons of Lambda function URLs

Pros:

  • Clarity: No API Gateway or other services are needed, which minimizes operational overhead.
  • Low latency, high throughput: The response is delivered directly from Lambda to the client. This yields excellent Time To First Byte (TTFB) performance, with no intermediate buffering.
  • Direct implementation: For Node.js developers, enabling streaming is as direct as a wrapper and writing to a stream. This is ideal for quick prototypes or single function microservices.
  • Lower cost for low concurrent usage: You pay only for Lambda execution time. There’s no persistent connection cost, which is the same as with WebSocket or AWS AppSync. If invocations are infrequent or short, then this could be cost-efficient.

Cons:

  • Limited runtime support: Native streaming is only supported in Node.js.
  • No built-in user pool auth: Unlike API Gateway or AWS AppSync, Lambda URLs don’t directly support Amazon Cognito user pool authorizers. You must handle auth either through AWS Identity and Access Management (IAM) or manual token validation, adding some development effort and potential security pitfalls if done incorrectly.
  • Error handling complexity: Streaming makes error propagation trickier. If an error occurs mid-stream, then you need to decide how to inform the client.

API Gateway WebSocket for streaming

API Gateway WebSocket APIs establish persistent, stateful connections between clients and your backend. This is ideal for real-time applications needing server-initiated messages. The client connects once, sends a prompt to Amazon Bedrock through the WebSocket, and the server pushes the streamed response back over the same connection.

Architecture

The following figure shows the architecture.

API Gateway WebSocket with Amazon Bedrock architecture

  1. Client connects through the WebSocket URL and store connectionId.
  2. Client sends a prompt through a custom route to the LLMHandler.
  3. Lambda as LLMHandler invokes Amazon Bedrock and streams back through WebSocket.
  4. Client disconnects through the DisconnectHandler and removes connectionId.

Implementation steps

  1. Create a WebSocket API in API Gateway with routes
    1. $connect: Connected to ConnectHandler Lambda.
    2. $disconnect: Connected to DisconnectHandler Lambda.
    3. $stream: All messages go to StreamHandler Lambda.
  2. Create Lambda Authorizer
    1. Receives the connection request with token in query string.
    2. Validates the JWT token against Amazon Cognito.
    3. Returns Allow/Deny policy for the connection.
      def lambda_handler(event, context):
          # Extract token from querystring
          token = event.get("queryStringParameters", {}).get("token", "")
          
          # Validate JWT token against Cognito
          if validate_token(token):
              return {
                  "isAuthorized": True,
                  # Optionally include context that other handlers can access
                  "context": {
                      "userId": extracted_user_id
                  }
              }
          else:
              return {"isAuthorized": False}
      

  3. Create Connection Handler
    1. Connection Lambda runs after successful authorization.
    2. Receives the new connection’s unique connectionId.
    3. Store connection info in Amazon DynamoDB (optional).
    4. Returns 200 status to complete the connection.
      def lambda_handler(event, context):
          # Extract connectionId
          connection_id = event.get("requestContext", {}).get("connectionId")
          
          # Optionally store in DynamoDB
          # dynamodb.put_item(...)
          
          # Connection established successfully
          return {"statusCode": 200}
      

  4. Create Disconnect Handler
    1. Disconnect Lambda is triggered automatically when clients disconnect.
    2. Receives the terminated connection’s connectionId.
    3. Cleans up any stored connection data.
    4. Returns 200 status
      def lambda_handler(event, context):
          # Extract connectionId
          connection_id = event.get("requestContext", {}).get("connectionId")
          
          # Optionally remove from DynamoDB
          # dynamodb.delete_item(...)
          
          # Disconnection handled successfully
          return {"statusCode": 200}
      

  5. Create LLM Handler
      1. Receives messages sent to the stream route.
      2. Extracts prompt from the message body.
      3. Calls Amazon Bedrock model with streaming response.
      4. Streams tokens back to the client using the connection ID.
        def lambda_handler(event, context):
            # Extract connectionId and domain details for sending responses
            connection_id = event["requestContext"]["connectionId"]
            domain = event["requestContext"]["domainName"]
            stage = event["requestContext"]["stage"]
            
            # Parse message body to get the prompt
            body = json.loads(event.get("body", "{}"))
            prompt = body.get("prompt", "")
            
            # Create API Gateway management client for sending responses
            api_client = boto3.client(
                'apigatewaymanagementapi',
                endpoint_url=f'https://{domain}/{stage}'
            )
            
            # Call Amazon Bedrock with streaming response
            response = bedrock_client.invoke_model_with_response_stream(...)
            
            # Stream tokens back to client
            for chunk in response["body"]:
                # Extract token from chunk
                token = process_chunk(chunk)
                
                # Send token directly back through the WebSocket
                api_client.post_to_connection(
                    ConnectionId=connection_id,
                    Data=json.dumps({"token": token, "isComplete": False})
                )
            
            # Send completion message
            api_client.post_to_connection(
                ConnectionId=connection_id,
                Data=json.dumps({"token": "", "isComplete": True})
            )
            
            return {"statusCode": 200}
        

Authentication with Amazon Cognito

Securing a WebSocket API with Amazon Cognito needs a bit more work. API Gateway WebSocket doesn’t have a built-in Amazon Cognito User Pool authorizer:

  1. Lambda authorizer with JWT authentication: API Gateway invokes your Lambda authorizer upon connection, validating the Amazon Cognito JWT (passed as a query parameter). The Lambda generates an IAM policy granting access and returns it.
  2. IAM authentication for WebSockets: Clients sign requests with SigV4 using AWS credentials from an Amazon Cognito Identity Pool. API Gateway evaluates the request against IAM policies.

Pros and cons of API Gateway WebSocket APIs

Pros:

  • Bidirectional real-time communication: WebSockets are ideal for applications where the server needs to push data such as the LLM’s response without explicit requests.
  • Persistent connection for multi-turn conversations: After the initial handshake, the same connection can be reused for subsequent prompts and responses, avoiding repeated setup latency. This is great for a chat UI where the user asks multiple questions in one session.
  • Scalability: API Gateway is a managed service that can handle 500 connections/second and 10,000 requests/second across APIs, which can be increased by request.

Cons:

  • Higher development complexity: When compared to the clarity of a direct Lambda URL, a WebSocket API involves multiple Lambdas and coordination to manage the connection state.
  • Custom auth implementation: There is no built-in Amazon Cognito user pool integration, thus you must implement a Lambda authorizer.
  • Timeout management: The API Gateway integration timeout is 29 s, thus your Lambda function should return the response promptly.

AWS AppSync GraphQL subscription

AWS AppSync is a fully managed GraphQL service that streamlines building real-time APIs. It handles WebSocket connections and client fan-out automatically. Clients subscribe to a GraphQL subscription, and a Lambda resolver pushes the Amazon Bedrock streamed tokens back.

Architecture

The following figure shows the architecture.

AWS AppSync GraphQL subscription with Amazon Bedrock architecture

  1. Client calls a startStream mutation. AppSync invokes the Request Lambda.
  2. The Request Lambda immediately returns a unique sessionId and sends the processing task to an Amazon Simple Queue Service (Amazon SQS) queue.
  3. Client uses the sessionId to subscribe to an onTokenReceived GraphQL subscription.
  4. The Processing Lambda (triggered by Amazon SQS) invokes Amazon Bedrock and, for each token, calls a publishToken mutation in AWS AppSync.
  5. AWS AppSync automatically pushes the token to all clients subscribed with the matching sessionId.

Implementation steps

  1. Design the GraphQL Schema: define types and operations.
    type StreamResponse {
      sessionId: String!
      status: String!
      message: String
      timestamp: String!
      error: String
    }
    
    type TokenEvent {
      sessionId: String!
      token: String!
      isComplete: Boolean!
      timestamp: String!
    }
    
    type Mutation {
      startStream(prompt: String!): StreamResponse!
      publishToken(sessionId: String!, token: String!, isComplete: Boolean!): TokenEvent!
    }
    
    type Subscription {
      onTokenReceived(sessionId: String!): TokenEvent
    

  2. Create the Request Handler (Request Lambda)
    1. Receives the GraphQL mutation with the prompt.
    2. Generates a unique session ID.
    3. Sends the prompt and session ID to the SQS queue.
    4. Returns the session ID to the client immediately.
      def lambda_handler(event, context):
          # Extract prompt from GraphQL event
          prompt = event["arguments"]["prompt"]
          
          # Generate unique session ID
          session_id = str(uuid.uuid4())
          
          # Send message to SQS queue
          sqs_client.send_message(
              QueueUrl="your-sqs-queue-url",
              MessageBody=json.dumps({
                  "prompt": prompt,
                  "sessionId": session_id
              })
          )
          
          # Return session ID to client
          return {
              "sessionId": session_id,
              "status": "streaming_started",
              "timestamp": datetime.datetime.utcnow().isoformat()
          }
      

  3. Create the Processing Handler (Processing Lambda)
    1. It is triggered by Amazon SQS messages.
    2. It calls Amazon Bedrock with streaming enabled.
    3. For each token generated, it calls the AppSync publishToken mutation.
      def lambda_handler(event, context):
          # Process SQS event records
          for record in event["Records"]:
              body = json.loads(record["body"])
              prompt = body["prompt"]
              session_id = body["sessionId"]
              
              # Call Amazon Bedrock with streaming
              response = bedrock_client.invoke_model_with_response_stream(...)
              
              # Process streaming response
              for chunk in response["body"]:
                  # Extract token from chunk
                  token = process_chunk(chunk)
                  
                  # Publish token to AppSync
                  publish_token_to_appsync(
                      session_id=session_id,
                      token=token,
                      is_complete=False
                  )
              
              # Send completion token
              publish_token_to_appsync(
                  session_id=session_id,
                  token="",
                  is_complete=True
              )
      

  4. Configure GraphQL Resolvers
    1. StartStream resolver: Connect to the Request Lambda.
    2. PublishToken resolver: Trigger subscription with a NONE data source.
  5. Client subscription setup
    1. Make a startStream mutation.
      const { sessionId } = await client.mutate({
        mutation: START_STREAM,
        variables: { prompt }
      });
      

    2. Subscribe to receive tokens.
      client.subscribe({
        query: ON_TOKEN_RECEIVED,
        variables: { sessionId }
      }).subscribe({
        next: ({ data }) => {
          if (data.onTokenReceived.isComplete) {
            // Handle completion
          } else {
            // Append token to UI
            appendToken(data.onTokenReceived.token);
          }
        }
      });
      

Authentication with Amazon Cognito

AWS AppSync integrates seamlessly with Amazon Cognito User Pools. Setting the API’s auth mode to Amazon Cognito User Pool needs a valid JWT for every GraphQL operation. This is the most developer-friendly option for authentication. AWS AppSync handles the handshake and token refresh.

Pros and cons of AWS AppSync subscriptions

Pros:

  • Fully managed real-time protocol: You don’t deal with raw WebSockets or connection IDs at all. AWS AppSync automatically establishes and maintains a secure WebSocket for subscriptions (no need for a connect or disconnect Lambda).
  • Streamlined authentication: Built-in support for Amazon Cognito User Pool tokens means that you can secure the API without writing custom authorizers.

Cons:

  • Potential overhead and complexity: For a direct case (one prompt—one stream), introducing GraphQL and AWS AppSync might be seen as over-engineering if your app doesn’t use GraphQL for other use cases.
  • 30-second resolver limit: AWS AppSync has a 30-second limit for mutation resolvers, thus you need to design the initial request to start the process and immediately return, relying on a subscription to stream the results progressively to avoid blocking the user.

Conclusion

The Amazon Bedrock streaming interface unlocks fluid, low-latency LLM experiences. You can use the right AWS serverless architecture to deliver streamed responses in a secure, scalable, and cost-effective way.

  • Lambda function URLs with streaming: Direct, single-user applications and prototypes.
  • API Gateway WebSocket: Multi-turn conversations, collaborative applications.
  • AppSync: Complex applications already using GraphQL.

Each method is serverless, production-ready, and fully integrated with Amazon Cognito for secure access control. AWS provides the flexibility to design high-quality AI user experiences at scale.

Refer to GitHub sample source code for more details.

Comparative table

Feature LAMBDA FUNCTION URLS API GATEWAY WEBSOCKET APIs APPSYNC GRAPHQL SUBSCRIPTIONS
Complexity Lowest Medium High
Real-time focus Limited Strong Strong
Authentication Needs custom logic Needs custom logic Built-in Amazon Cognito support
Scalability Good Good Excellent
GraphQL support None None Native
Use cases Q&A Chatbots, real-time apps Complex apps, multi-user scenarios
Cost Pay per invocation Connection time and Lambda execution Request/connection-based pricing

 

Enhanced data discovery in Amazon SageMaker Catalog with custom metadata forms and rich text documentation

Post Syndicated from Ramesh H Singh original https://aws.amazon.com/blogs/big-data/enhanced-data-discovery-in-amazon-sagemaker-catalog-with-custom-metadata-forms-and-rich-text-documentation/

Amazon SageMaker Catalog now supports custom metadata forms and rich text descriptions at the column level, extending existing curation capabilities for business names, descriptions, and glossary term classifications.

With these new features, data stewards can define and capture business-specific metadata directly in individual columns, and authors can use markdown-enabled rich text to provide detailed documentation and business context. Both form fields and formatted descriptions are indexed in real time, making them immediately discoverable through catalog search.

Column-level context is essential for understanding and trusting data. This release helps organizations improve data discoverability, collaboration, and governance by letting metadata stewards document columns using structured and formatted information that aligns with internal standards.

In this post, we show how to enhance data discovery in SageMaker Catalog with custom metadata forms and rich text documentation at the schema level.

Key capabilities

SageMaker Catalog now offers the following key capabilities:

  • Custom metadata forms – Data stewards can now use custom metadata forms to capture organization-specific metadata fields for columns such as Business Owner, Regulatory Classification, Units of Measure, or Approved Use Case. Each field is stored as a key-value pair and indexed for search, enabling business-level queries like “find columns where sensitivity = confidential.”
  • Rich text (markdown) descriptions – Each column supports a markdown-enabled description field. Authors can format text with headings, bullet lists, and hyperlinks to add deeper business or operational context—for example, logic definitions, sample values, or data lineage references.
  • Real-time indexing for search – Custom form values and rich text content are indexed as soon as they are saved. Users can search using a metadata value, keyword, or glossary term across columns.

Solution overview

For this post, we explore a financial services use case. Our example financial services organization defines a column metadata form that includes several fields, as illustrated in the following table.

Field Example Value
Approved Use Case Financial revenue modeling
Business Owner Finance Office
Domain RF

For a dataset column named revenue, the author adds the following markdown description:

# Business Revenue

- Use for Financial Modeling
- Use only for batch use cases

When analysts search for Domain = RF, this column appears in results with complete business context.

In the following sections, we demonstrate how to use to use metadata forms for columns and add rich text descriptions that is searchable.

Prerequisites

To test this solution, you should have an Amazon SageMaker Unified Studio domain set up with a domain owner or domain unit owner privileges. You should also have an existing project to publish assets and catalog assets. For instructions to create these assets, see the Getting started guide.

In this example, we created a project named financial_analysis and a test table. To create a similar table, see Get started with Amazon S3 Tables in Amazon SageMaker Unified Studio. To ingest the sample data to SageMaker Catalog and generate business metadata, see Create an Amazon SageMaker Unified Studio data source for Amazon Redshift in the project catalog.

Create new metadata form

Complete the following steps to create a new metadata form:

  1. In SageMaker Unified Studio, go to your project.
  2. Under Project catalog in the navigation pane, choose Metadata entities.
  3. Choose Create metadata form.
  4. Provide an optional display name, a technical name, and an optional description, then choose Create metadata form.
  5. Define the form fields. In this example, we add the fields Domain, Business Owner, and Approved Use Case.
  6. For Requirement Options, select the configuration for each field. For our use case, we select Always required.
  7. Choose Create field.
  8. Turn on Enabled so the form is visible and can be used for assets.

Attach metadata form to column

Complete the following steps to attach the metadata form to a column:

  1. Under Project catalog in the navigation pane, choose Assets.
  2. Search for and select your asset (for this example, we use the asset business_finance).
  3. On the Schema tab, choose View/Edit next to the revenue field.
  4. Choose Add metadata form.
  5. Choose the form you created and choose Add.
  6. Add details for the metadata form fields

Add additional context as formatted text

Next, we enter a rich text description for each column using the markdown editor, including headings, bullet lists, links, and sample values. Complete the following steps:

  1. Choose Edit next to README for the revenue field where you added the metadata form.
  2. Enter details and choose Save.
  3. Choose Preview to view the formatted README at the column level.

Publish and verify search

Now you’re ready to publish the asset. The metadata form values and markdown descriptions become part of the catalog record and are indexed for search. You can also see the history of revisions on the History tab. Other project users can see the metadata form and rich text description for the published assets and subscribe to the data asset. You can create more data products with these assets, and they will also have the column metadata form and README.

In the catalog search UI, data users can now filter on custom form fields (for example, “Domain = RF”) or search in natural language for text that matches the column description.

Best practices

Consider the following best practices when using this feature:

  • Define metadata forms aligned with your business vocabulary (domains, owners, sensitivity levels) proactively before publishing assets at scale.
  • Make column descriptions actionable—include business definitions, value ranges, logic, update cadence, and dependencies.
  • Verify the catalog indexing is timely; publish changes proactively so search results reflect new metadata.
  • Use governance controls. You can combine column-level metadata with existing asset-level templates and approval workflows to enforce publishing standards.
  • Monitor search usage and metadata completeness; target high-value datasets for complete column-level documentation first.
  • Do not store confidential or sensitive information in your metadata forms.

Conclusion

With column-level metadata forms and rich text descriptions, SageMaker Catalog helps organizations deliver higher-quality metadata, stronger governance, and better data discovery. These features make it straightforward for teams to capture complete business context and for analysts to quickly locate and understand the data they need.

Custom metadata forms and rich text descriptions at the column level are now available in AWS Regions where SageMaker is supported.

To learn more about SageMaker, see the Amazon SageMaker User Guide. Get started with this capability, refer to the user guide.


About the Authors

Ramesh Singh

Ramesh Singh

Ramesh is a Senior Product Manager Technical (External Services) at AWS in Seattle, Washington, currently with the Amazon SageMaker team. He is passionate about building high-performance ML/AI and analytics products that enable enterprise customers to achieve their critical goals using cutting-edge technology.

Pradeep Misra

Pradeep Misra

Pradeep is a Principal Analytics and Applied AI Solutions Architect at AWS. He is passionate about solving customer challenges using data, analytics, and AI/ML. Outside of work, he likes exploring new places, trying new cuisines, and playing badminton with his family. He also likes doing science experiments, building LEGOs, and watching anime with his daughters.

Abbas Makhdum

Abbas Makhdum

Abbas is Head of Product Marketing for Amazon SageMaker Catalog at AWS, where he leads go-to-market strategy and launches for data and AI governance solutions. With deep expertise across data, AI, and analytics, Abbas has also authored a book on data and AI governance with O’Reilly. He is passionate about helping organizations unlock business value by making data and AI more accessible, transparent, and governed.

Harish Panwar

Harish Panwar

Harish is a Software Development Manager at AWS in Bangalore, India. He is leading the Catalog engineering team, which is building data and AI governance solutions. Harish is a veteran in Amazon SageMaker, with deep expertise across SageMaker AI and SageMaker Catalog. He is passionate about creating simple and intuitive AI solutions making AI accessible to everyone.

Building multi-tenant SaaS applications with AWS Lambda’s new tenant isolation mode

Post Syndicated from Anton Aleksandrov original https://aws.amazon.com/blogs/compute/building-multi-tenant-saas-applications-with-aws-lambdas-new-tenant-isolation-mode/

Today, AWS announced a new tenant isolation mode for AWS Lambda, that allows you to process function invocations in separate execution environments for each application end-user or tenant invoking your Lambda function. This capability simplifies building secure multi-tenant SaaS applications by managing tenant-level compute environment isolation and request routing for you. As a result, you can focus on your core business logic rather than implementing your own tenant-aware compute environment isolation.

Overview

Lambda runs your function code in secure execution environments that leverage Firecracker virtualization to provide isolation. These execution environments never share or reuse virtual resources (such as vCPU, disk, or memory) across functions, or even across different versions of the same function. However, Lambda can reuse execution environments for multiple invocations of the same function version, as these execution environments are fully set-up and can therefore deliver faster request processing for your functions.

Figure 1. Incoming invocations processed by a collection of execution environments that belong to a single function.

Figure 1. Incoming invocations processed by a collection of execution environments that belong to a single function.

Multi-tenant SaaS applications that handle sensitive tenant-specific data or execute code supplied dynamically by tenants may need a higher degree of isolation—at the individual application tenant level rather than at the function level—for secure code execution and to reduce the risk of cross-tenant data access.

Prior to today’s launch, developers would implement custom solutions, such as SDKs or application logic to manage isolation within function code. This approach was bug-prone, required more work from application development teams, and didn’t ensure isolation at the compute environment level.

Alternatively, developers adopted the approach of creating separate functions per application tenant, replicating the same code across hundreds or thousands of tenants. This approach provided stronger compute environment isolation than sharing compute environments across multiple tenants of the same function, but increased implementation overhead and operational complexity as workloads grew to support a larger number of tenants over time.

Figure 2. Using function-per-tenant model, each tenant’s requests are processed by a separate function.

Figure 2. Using function-per-tenant model, each tenant’s requests are processed by a separate function.

Starting today, AWS Lambda offers a new tenant isolation mode that lets you isolate execution environments used across different tenants of your multi-tenant SaaS applications, even when all of the tenants invoke the same function. When you enable the new tenant isolation mode, you include a tenant identifier with each function invocation. Lambda uses this identifier to route the request to the correct execution environment. As a result, each execution environment is reused only for invocations from the same tenant. This means you still get the performance benefits of warm execution environments, while ensuring that each tenant’s workloads remain isolated.

Figure 3. With the new tenant isolation capability, Lambda creates separate execution environments per tenant for a single function.

Figure 3. With the new tenant isolation capability, Lambda creates separate execution environments per tenant for a single function.

For organizations handling sensitive tenant-specific data or running untrusted code supplied dynamically by end-users, Lambda’s new tenant isolation mode provides the security benefits of per-tenant compute environment separation without the operational complexity of managing individual functions or infrastructure for each tenant.

Example scenario

Consider building a multi-tenant serverless SaaS application. To optimize performance, your function handler can retrieve tenant-specific configuration and data, cache it in memory, and reuse it for subsequent invocations from the same tenant. For example, you might cache tenant-specific database location, feature flags, or business rules that are frequently accessed during request processing. You may store this information within the application runtime process as global variables or as files in the /tmp directory. However, if the underlying execution environment is used to serve multiple tenants, this approach can potentially expose data across tenants.

With tenant isolation mode you can address this risk with much simpler architecture and configuration. This built-in capability makes Lambda an excellent choice for multi-tenant SaaS applications needing isolated compute environments for individual tenants.

Getting Started with Lambda Tenant Isolation Mode

Use the new tenancy-config parameter to configure tenant isolation mode when you create your function. You can only apply this configuration at function creation time; it cannot be updated for existing functions. The following snippet creates a function with tenancy config using the AWS CLI.

aws lambda create-function \
   --function-name my-function1 \
   --runtime nodejs22.x \
   --zip-file fileb://my-function1.zip \
   --handler index.handler \
   --role arn:aws:iam:1234567890:role/my-function-role \
   --tenancy-config '{"TenantIsolationMode": "PER_TENANT"}'

After the function is created, you must provide the tenant ID parameter with each invocation. Lambda uses this identifier to ensure that the execution environment used for a particular tenant is never reused for other tenants. For subsequent invocations from the same tenant, Lambda may reuse the execution environment to optimize performance. Specify this tenant-id parameter as illustrated below:

aws lambda invoke \
   --function-name my-function \
   --tenant-id BlueTenant \
   response.json

The new tenant-id parameter is required for functions using the tenant isolation mode. Function invocations omitting this parameter will fail with an invocation error, as shown below:

aws lambda invoke --function-name multitenant-function out.json

An error occurred (InvalidParameterValueException) when calling the Invoke operation:
The invoked function is enabled with tenancy configuration. 
Add a valid tenant ID in your request and try again.

Lambda makes the tenant ID parameter available through your function handler’s context object. This allows you to access tenant-specific information in your code, for example if you wish to implement custom logic based on the tenant identity, as shown below:

exports.handler = async function (event, context) {
   const tenantId = context.tenantId;

   // Process tenant-specific logic

   return {
      statusCode: 200,
      body: `OK for tenantId=${tenantId}`
   };
};

The following table outlines differences between Lambda functions with and without tenant isolation mode enabled:

Feature Without the new
tenant isolation mode
With the new
tenant isolation mode
Execution environment isolation Isolated per function version. Isolated per end-user or tenant invoking a function version.
Execution environment reuse Can be reused to process all invocations of a function version. Can only be reused to process invocations from the same tenant invoking a function version.
Data stored on local disk and in-memory Potentially accessible across all invocations of a function version. Potentially accessible across invocations from the same tenant. Not accessible for invocations from other tenants.
Cold starts Occur when there are no warm execution environments available to process incoming invocation. Occur when there are no tenant-specific warm execution environments available to process incoming invocation. More cold starts expected due to tenant-specific execution environments.

Integrating with Amazon API Gateway

Amazon API Gateway uses Lambda’s Invoke API to invoke Lambda functions. When using the Invoke API, Lambda expects the tenant ID parameter to be passed using the X-Amz-Tenant-Id HTTP header. You can configure API Gateway to inject this HTTP header into the Lambda invocation request with a value obtained from client request properties such as HTTP header, query parameter, or path parameter. When using Lambda Authorizers, you can obtain the value from authorization context information returned by the authorizer, such as principal ID or JWT claim. See API Gateway documentation to learn how you can return authorization information from Lambda authorizers to be used for the X-Amz-Tenant-Id header value.

Figure 4. Obtaining X-Amz-Tenant-Id header value from authentication sources.

Figure 4. Obtaining X-Amz-Tenant-Id header value from authentication sources.

The following screenshot illustrates API Gateway Lambda integration configuration, where the incoming request to API Gateway includes an x-tenant-id header that is mapped to the X-Amz-Tenant-Id request header to invoke a Lambda function using tenant isolation mode.

Figure 5. Mapping client request header to Lambda tenant-id header.

Figure 5. Mapping client request header to Lambda tenant-id header.

The following code snippet illustrates this configuration implemented with the AWS CDK.

const lambdaIntegration = new ApiGw.LambdaIntegration(fn, {
   requestParameters: {
      // This configures API Gateway to inject X-Amz-Tenant-Id header
      // into downstream requests. The header value is obtained from 
      // x-tenant-id header in the client request.
      'integration.request.header.X-Amz-Tenant-Id': 'method.request.header.x-tenant-id'
   }
});

resource.addMethod('GET', lambdaIntegration, {
   requestParameters: {
      // This enables API Gateway to use the x-tenant-id header value 
      // obtained from the client request. The header name is arbitrary.
      // you can use any other header name. 
      'method.request.header.x-tenant-id': true
   }
});

Tenant-aware observability

For functions using tenant isolation, Lambda automatically includes the tenant ID in function logs when you have JSON logging enabled, making it easier to monitor and debug tenant-specific issues. Note that the tenantId property is available during function invocation, rather than during function initialization. The tenantId property is included for both platform events (like platform.start and platform.report) and custom logs you print in your function code, as shown in the following screenshot:

Figure 6. Lambda function logs with tenantId.

Figure 6. Lambda function logs with tenantId.

Lambda creates a separate CloudWatch log stream for each execution environment. You can use CloudWatch Log Insights to find log streams that belong to a particular tenant by filtering by tenant Id:

fields @logStream, @message
| filter tenantId=='BlueTenant' or record.tenantId=='BlueTenant'
| stats count() as logCount by @logStream
| sort @timestamp desc

You can also retrieve tenant-specific logs across all log streams:

fields @message
| filter tenantId=='BlueTenant' or record.tenantId=='BlueTenant'
| limit 1000

Each log stream starts with function initialization logs followed by the invocation logs. This structure helps you to debug tenant-specific issues and understand the lifecycle of each tenant’s execution environments.

Considerations

When using the new tenant isolation for Lambda functions, consider the following:

  • Each tenant’s execution environments are isolated from other tenants so that tenant-specific data stored on disk or in memory remain separated from other tenants invoking the same Lambda function.
  • All tenants share the function’s execution role. For more fine-grained permissions for individual tenants, consider propagating tenant-scoped credentials from the upstream application components invoking your Lambda function.
  • Your application may experience higher percentage of cold starts, as Lambda processes requests in separate execution environments for each tenant invoking your functions.
  • You pay a fee for each new tenant-specific execution environment created, depending on the memory configured for your function. See Lambda pricing page for details.

Best practices

When using the new tenant isolation mode for Lambda functions, AWS recommends the following best practices:

  • Implement robust tenant ID validation at the application layer to prevent unauthorized access through tenant ID manipulation. Consider using a dedicated service or database to maintain valid tenant IDs.
  • Monitor and audit tenant access patterns regularly to detect potential security anomalies or unauthorized cross-tenant access attempts.
  • Be aware of Lambda concurrency quotas when building multi-tenant applications. You might need to request quota increases based on your tenant count and usage patterns.

Sample code

Follow the instructions in this GitHub repository to provision a sample project in your own account and see the new Lambda tenant isolation mode in action. The sample project illustrates how to integrate a function using the new tenant isolation mode with Amazon API Gateway and propagate tenant identity from client requests.

Conclusion

The new tenant isolation mode for Lambda simplifies building serverless multi-tenant SaaS applications on AWS. By automatically managing application tenant-level compute environment isolation, this capability eliminates the need for custom isolation logic or separate tenant functions, allowing you to focus on the core business logic while AWS handles the complexities of tenant-aware compute environment isolation.

Combined with the existing security features in Lambda, rapid scaling, and pay-per-use pricing, tenant isolation mode makes Lambda an even more compelling choice for modern SaaS applications, whether you’re building new solutions or enhancing existing ones.

To learn more, refer to the documentation for tenant isolation. For details on pricing, refer to Lambda’s pricing page.

How to update CRLs without public access using AWS Private CA

Post Syndicated from Rochak Karki original https://aws.amazon.com/blogs/security/how-to-update-crls-without-public-access-using-aws-private-ca/

Certificates and the hierarchy of trust they create are the backbone of a secure infrastructure. AWS Private Certificate Authority is a highly available certificate authority (CA) that you can use to create private CA hierarchies, secure your applications and devices with private certificates, and manage certificate lifecycles.

A certificate revocation list (CRL) is a file that contains a signed list of certificates revoked before their scheduled expiration date. Certificates can be revoked for a variety of reasons, including unintended key exposure, or because of discontinued use.

AWS Private CA writes CRLs to an Amazon Simple Storage Service (Amazon S3) bucket that you specify. CRLs are public, fully qualified domain names (FQDNs), but you might have requirements for a CRL that is only accessible internally to your organization, or you might have security standards that require all S3 buckets to have Amazon S3 block public access enabled.

The recommended practice for S3 buckets is to enable Block Public Access, which enables only authorized and authenticated AWS accounts to have access to a bucket and its contents. However, because some public key infrastructure (PKI) clients retrieve CRLs across the public internet, a workaround might be necessary to serve CRLs without requiring authenticated client access to an S3 bucket. One recommended solution is to use Amazon CloudFront to provide access to the CRL. This will likely be the best solution for most customers. Our documentation specifically highlights CloudFront as the recommended implementation path. However, you might not be able to use CloudFront or might need another option.

You might need a solution where the CRL lookups don’t traverse the public internet. In this post, we go over two different approaches to achieve this.

Option 1: Relocate CRLs to an internally accessible location

By default, AWS Private CA writes CRLs to an S3 bucket that you specify. This solution consists of moving the CRL to a separate location that is internally accessible to your TLS clients, but not accessible via the public internet such as an on-premises server. A CRL distribution point (CDP) is a link that points to the location of the CRL where revoked certificates appear. However, when private certificates are generated by AWS Certificate Manager (ACM), the CDP universal resource identifiers (URI) in the certificates point by default to the S3 bucket initially specified.

This solution uses a custom CNAME in the CDP to indicate, during certificate generation, the location where the CRL will ultimately be located.

The steps in the solution are as follows:

  1. Select the S3 bucket where the CRL will be stored.
  2. Issue a certificate through the CA with a custom CNAME.
  3. Create an AWS Lambda function that moves the CRL file from the S3 bucket to another specified location.
  4. Create an Amazon Simple Notification Service (Amazon SNS) notification that alerts a user to the success metric of the CRL generation event.

Prerequisites:

For this walkthrough, you must have the following resources ready to use:

  1. An AWS account with:
    • An AWS Identity and Access Management (IAM) role with permissions for Amazon S3, ACM Private CA, Amazon EventBridge, and Lambda
    • An ACM private CA root and subordinate CA configured in the same AWS Region
    • An S3 bucket for the CRL with permissions that allow the AWS Private CA service principal to PutObject, PutObjectACL, GetBucketACL and GetBucketLocation (see the following example bucket policy)
{     
    "Version": "2012-10-17",     
    "Statement": [         
        {             
            "Effect": "Allow",             
            "Principal": {                 
                "Service": "acm-pca.amazonaws.com"             
            },             
            "Action": [                 
                "s3:PutObject",                 
                "s3:PutObjectAcl",                 
                "s3:GetBucketAcl",                 
                "s3:GetBucketLocation"             
            ],             
            "Resource": [                 
                "arn:aws:s3:::<name-of-bucket>/*",                 
                "arn:aws:s3:::<name-of-bucket>"             
            ],             
            "Condition": {                 
                "StringEquals": {                     
                    "aws:SourceAccount": "<account-num-here>",                     
                    "aws:SourceArn": "<subordinate-ca-arn-here>"                 
                }             
            }         
        }     
    ] 
}

2. AWS Command Line Interface (AWS CLI) configured

Deploy:

With the prerequisites in place, you’re ready to deploy the first solution.

To enable CRL distribution:

  1. Use your account to sign in to the AWS Management Console for AWS Private Certificate Authority.
  2. Select the name of your subordinate CA. This should take you to another page with more details.
  3. Scroll down and choose the Revocation configuration tab.
  4. Choose Edit on the top right.
  5. Figure 1: Edit the revocation configuration

    Figure 1: Edit the revocation configuration

  6. Select Activate CRL distribution. Select the CRL S3 bucket you created prior to the walkthrough.
  7. Figure 2: Enter a name for your CRL

    Figure 2: Enter a name for your CRL

  8. Modify the CDP by expanding the CRL settings dropdown. In the Custom CRL Name field, enter the URL where you will eventually move the CRL. This should be a place that is accessible by your internal organization, but not accessible externally. If you use partitioned CRLs, select the Enable partitioning checkbox. To learn more about CRL partitioning, see Plan your AWS Private CA certificate revocation method.
  9. Choose Save changes.

To create an SNS topic and Lambda function:

  1. Go to the Amazon SNS console.
  2. Create a standard SNS topic. Leave all options as default and subscribe an appropriate email to the topic.
  3. Figure 3: Create an SNS topic

    Figure 3: Create an SNS topic

  4. Go to the Lambda console.
  5. Choose Create Function.
  6. Enter a name for your function. Under Runtime, select Python 3.12 from the dropdown.
  7. Figure 4: Create a Lambda function

    Figure 4: Create a Lambda function

  8. Verify that the role associated with your Lambda function has permissions to get objects from the S3 bucket where AWS Private CA places the CRL (set when you configured the revocation details for the CA), copy objects in Amazon S3, then put objects in an S3 bucket (or wherever the new CRL distribution point specified in the certificate custom CNAME will be—for example, an internal-only accessible location), and publish to an Amazon SNS topic. The Lambda function also checks the success metric of a CRL generation event. If the event fails, an SNS topic will notify an admin. If the event is successful, a copy of the CRL in the original S3 bucket is created in the new specified location and an SNS topic will notify an admin.

Example code (Python 3.13):

import boto3 
import json 

def lambda_handler(event, context):     
	#create a s3 client     
	s3 = boto3.client('s3')          

	#create a sns client     
	sns = boto3.client('sns')     
    topicArn = "<sns-topic-arn-here>”     
    
    #get name of the CA from the CW event     
    caID = event['resources'][0].split('/')[-1]          
    status = event['detail']['result']     
    if status == 'success':              
    	
        source = '<ORIGINS3BUCKET>'         
        destination = '<DESTINATION-S3BUCKET>'         
        #See below note for more clarification on S3 CRL paths         
        folder = 'crl/'         
        file = caID + '.crl'         
        key = folder + file              
        
        try:             
        	copySource = {                 
            	'Bucket': source,                 
                'Key': key             
           	}                      
            
            s3.copy_object(                 
            	CopySource=copySource,                 
                Bucket=destination,                 
                Key=file             
          	)             
            response = sns.publish(                 
            	TopicArn=<sns-topicArn>,                 
                Message=f'Successfully moved {key} from {source} to {destination} in {caID}',                 
                Subject="CRL Upload Success"             
          	)                      
            
            return {                 
            	'statusCode': 200,                 
                'body': json.dumps(f'Successfully moved {key} from {source} to {destination} in {caID}')             
          	}                  
    	
        except s3.exceptions.NoSuchKey:             
        	response = sns.publish(                 
            	TopicArn=<sns-topicArn>,                 
                Message=f"Object {key} not found in {source}",                 
                Subject='CRL Upload Failure'             
          	)             
            return {                 
            	'statusCode': 404,                 
                'body': json.dumps(f'Object {key} not found in {source}')             
          	}                  
   		except Exception as e:             
    		print(e)             
        	response = sns.publish(                 
        		TopicArn=<sns-topicArn>,                 
            	Message=f'Error moving object: {str(e)}',                 
            	Subject='Failure Uploading CRL'             
     		)             
			return {                 
    			'statusCode': 500,                 
        		'body': json.dumps(f'Error moving object: {str(e)}')             
  			}     
    else:         
    	response = sns.publish(                 
        		TopicArn=<sns-topicArn>,                 
            	Message=f'Certificate Authority {caID} CRL creation {status}',                 
            	Subject='CRL Upload Failure'             
     		)         
        return {             
        	'statusCode': 200,             
            'body': json.dumps(f'Certificate Authority {caID} CRL creation {status}')         
      	}

Note: By default, the non-partitioned CRL path in S3 is <s3-bucket-name>/crl/<CA-ID>.crl. If you used a custom path, modify the path name to the CRL accordingly. Alternatively, if using partitioned CRLs, the path changes to <s3-bucket-name>/crl/<CA-ID>/<partition_GUID>.crl; in that case, you can loop over each file in the <CA-ID> path to achieve the same effect.

To create an EventBridge that deploys your Lambda function:

  1. Go to the EventBridge console. Under Buses, select Rules.
  2. Choose Create Rule.
  3. Enter a name for your rule. Under Rule Type, select Rule with an Event Pattern and choose Next.
  4. Under Events, select AWS events or EventBridge partner events as the Event Source.
  5. For the Event pattern, select Use pattern form. For the Event source, select AWS services. For Event Type, select ACM Private CA CRL Generation.
Figure 5: Configure the event pattern

Figure 5: Configure the event pattern

  1. Choose Next.
  2. Under Target types, choose AWS Service, and then select Lambda function from the Select a target dropdown and select the function that you created earlier.
  3. Figure 6: Select the Lambda function as the target

    Figure 6: Select the Lambda function as the target

  4. Choose Next. Review your topic, then choose Update rule.
  5. To test the success of the Lambda function:

    1. To test the EventBridge topic, create and revoke a certificate. You can do this using the AWS CLI by getting the serial number of a certificate using openSSL:
      openssl x509 -in cert.pem -noout -serial
    2. Use the following command to revoke the certificate:
      aws acm-pca revoke-certificate —certificate-authority-arn <CA ARN> \ —certificate-serial <SERIAL NUMBER RETURNED IN STEP 1> --revocation-reason “UNSPECIFIED”
    3. To make sure that the Lambda function is triggered, wait 5–30 minutes. Check CloudTrail to make sure that RevokeCertificate was called, then monitor the CloudWatch log of the Lambda function. You should also get a notification message from your SNS topic.
    4. You have now successfully moved your CRL to a new location.

    Option 2: Implement Private CRL Access Through AWS Private CA

    This solution provides private Certificate CRL access within AWS Private CA, avoiding the need for public internet exposure. The design centers on establishing root and subordinate CAs with CRL functionality enabled within a dedicated S3 bucket, combined with a private network infrastructure using Gateway VPC endpoints and private subnets. Security is enforced through an S3 bucket policy that accomplishes three critical objectives:

    • Authorizing essential AWS Private CA permissions
    • Constraining CRL access to a designated Gateway VPC endpoint
    • Explicitly blocking access attempts from other sources.

    The solution includes private DNS zone configuration for proper resolution and can be verified through access testing confirming successful CRL retrieval from private VPC instances while making sure that requests from public instances are denied, maintaining a strictly private PKI.

    1. Create a root CA and subordinate CA with CRL enabled
    2. Configure a dedicated S3 bucket for CRL storage
    3. Issue private certificates through ACM
    4. Set up a VPC with private subnets
    5. Configure a Gateway VPC endpoint for Amazon S3
    6. Set up route tables for local traffic only
    7. Implement an S3 bucket policy with specific permissions
    8. Configure private DNS resolution
    9. Set up access controls through VPC endpoints
    10. Test private access from within the VPC
    11. Verify that public access is blocked

    Prerequisites for CRL solution 2

    For this walkthrough, you must have the following resources available:

    Deploy CRL solution 2

    With the prerequisites in place, you’re ready to use the console and AWS CLI to deploy the solution.

    To deploy the solution:

    1. Go to the AWS Private Certificate Authority console.
    2. In the navigation pane, choose Create a Private CA.
      1. Under Mode options, select General-purpose.
      2. For CA type options, select root.
      3. For the Subject distinguished name options: Fill in at least one of the subject distinguished name options: Organization(O), Organization unit (OU), Country(C), State, Locality name, and Common name (CN).
        Figure 7: Create a private CA (root)

        Figure 7: Create a private CA (root)

      4. Select Key algorithm options, for example, RSA 2046.
      5. Under Certificate revocation options, select Activate CRL Distribution, and select or create an S3 bucket for CRL storage.
      6. Under Pricing, select the checkbox to acknowledge pricing and then select Create CA.
    Figure 8: Configure a private CA (root)

    Figure 8: Configure a private CA (root)

    3. After creating a root CA, repeat all of step 2 to create a subordinate CA, selecting
    Subordinate CA under
    CA options (step 2-b). When completed, both the root CA and subordinate CA will be visible on the Private certificate authority page.

    Figure 9: View of root CA and subordinate CA

    Figure 9: View of root CA and subordinate CA

    With the root CA and subordinate CA in place, the next step is to create a VPC gateway endpoint for S3 access to enable private network communication.

    To create a VPC gateway endpoint:

    1. Go to the Amazon VPC console
    2. In the left navigation pane, select Endpoints, and choose Create Endpoint.
    3. Configure the Gateway VPC endpoint settings:
      1. Enter a descriptive name for your endpoint (optional).
      2. Type: Select AWS services.
      3. Services: Select the service name com.amazonaws.[region].s3 from the list.
      4. Type: Verify that Gateway is selected (automatically chosen for Amazon S3).
      5. VPC: Choose the VPC where you want to create the endpoint.
      6. Route tables: Select the route tables associated with the subnets that need Amazon S3 access.
      7. Policy: Select Full Access or create a custom policy to restrict access to specific S3 buckets or actions.
      8. Review your configuration and choose Create endpoint.
    Figure 10: Gateway VPC endpoint configuration

    Figure 10: Gateway VPC endpoint configuration

    1. Create two private subnets:
      1. In the Amazon VPC console, choose Subnets and then Create subnet.
      2. Select your VPC and enter the subnet details (name, Availability Zone, and CIDR block).
      3. Repeat for the second subnet in a different Availability Zone.
    2. Configure route tables:
      1. Navigate to Route Tables and choose Create route table.
      2. Create and name two route tables for your private subnets.
      3. Associate each route table with its corresponding private subnet.
      4. Make sure that each route table contains only local routes (VPC CIDR).
      5. Remove any routes for internet access (0.0.0.0/0).
    Figure 11: Private route table configuration

    Figure 11: Private route table configuration

    1. You can see now see under Resource Map that the Gateway VPC endpoint provides secure access to Amazon S3 resources within the private network.
    Figure 12: VPC private instance configuration

    Figure 12: VPC private instance configuration

    1. Use the following example code to implement a bucket policy that enforces the following key security controls:
      • Grant AWS Private CA the necessary permissions for certificate management.
      • Restrict CRL access exclusively through the specified VPC endpoint.
      • Explicitly deny GetObject requests not originating from the designated Gateway VPC endpoint.
    Figure 13: S3 bucket policy

    Figure 13: S3 bucket policy

    The following is an example S3 bucket policy for private CA CRL access with VPC endpoint restrictions:

    {     
        "Version": "2012-10-17",     
        "Statement": [         
            {             
                "Effect": "Allow",             
                "Principal": {                 
                    "Service": "acm-pca.amazonaws.com"             
                    },             
                "Action": [                 
                    "s3:PutObject",                 
                    "s3:PutObjectAcl",                 
                    "s3:GetBucketAcl",                 
                    "s3:GetBucketLocation"             
                ],             
                "Resource": [                 
                    "<arn:aws:s3:::BUCKET_NAME>",                 
                    "<arn:aws:s3:::BUCKET_NAME>/"           
                ],           
                "Condition": {               
                    "StringEquals": {                   
                        "aws:SourceArn": "<arn:aws:acm-pca:REGION:ACCOUNT_ID:certificate-authority/CA_ID>",                   
                        "aws:SourceAccount": "<ACCOUNT_ID>"               
                        }           
                }       
            },       
            {           
                "Sid": "Allow Access to CRL",           
                "Effect": "Allow",            
                "Principal": "",             
                "Action": "s3:GetObject",             
                "Resource": "<arn:aws:s3:::BUCKET_NAME/crl/CA_ID.crl>",             
                "Condition": {                 
                    "StringEquals": {                     
                        "aws:SourceVpce": "<VPCE_ID>"                 
                        }             
                }         
            },         
            {             
                "Sid": "Access-to-specific-VPCE-only",             
                "Effect": "Deny",             
                "Principal": "",            
                "Action": "s3:GetObject",           
                "Resource": [               
                    "<arn:aws:s3:::BUCKET_NAME>",               
                    "<arn:aws:s3:::BUCKET_NAME>/"             
                ],             
                "Condition": {                 
                    "StringNotEquals": {                     
                        "aws:SourceVpce": "<VPCE_ID>"                 
                        }             
                }         
            }     
        ] 
    }

    Figure 14: S3 bucket CRL properties

    Figure 14: S3 bucket CRL properties

    Create a private hosted zone:

    1. Go to the Route 53 console.
    2. In the left navigation pane, choose Hosted zones.
    3. Choose Create hosted zone.
    4. Configure the following:
      1. Domain name: Enter s3.amazonaws.com
      2. Description: (optional) enter Private hosted zone for S3 CRL endpoint
      3. Type: Select Private hosted zone.
      4. VPC: For Region, select your VPC’s Region; for VPC ID, select your VPC from the dropdown list.
    5. Choose Create hosted zone.

    Create a record set:

    1. Inside your new private hosted zone:
      1. Choose Create record.
      2. Select Simple routing policy.
      3. Choose Next.
    2. Configure record:
      1. Record name: Enter your S3 bucket name.
      2. Record type: Select A – Routes traffic to an IPv4 address.
      3. Alias: Toggle Yes.
      4. Route traffic to: Select Alias to S3 website endpoint.
      5. Region: Select your Region.
      6. S3 endpoint: Select from dropdown list.
      7. TTL: Leave as default (300 seconds).
    3. Choose Create record.
    Figure 15: Hosted zone details

    Figure 15: Hosted zone details

    Verify configuration:

    1. Go to the Amazon EC2 console and choose Launch instance.
    2. Select Amazon Linux 2.
    3. Choose Instance Type.
    4. Select you VPC and subnet.
    5. Under Network settings, select Create security group, then choose Allow SSH traffic from and enter your IP address.
    6. Choose Launch instance.
    7. After the instance is launched, select the instance and choose Connect.
    8. Select EC2 Instance Connect and choose Connect.

    Test the solution

    To test private access from an EC2 instance within your private VPC, verify CRL access using:
    curl -s https://<bucket-name>.s3.<region>.amazonaws.com/crl/<certificate-id>.crl | openssl crl -text -noout

    If successful, the command completes the following steps, as shown in Figure 16:

    1. Retrieves the CRL from Amazon S3
    2. Decodes it using OpenSSL
    3. Displays comprehensive CRL information including issuer details, update timestamps, revoked certificate list, signature algorithm, and other metadata
    Figure 16: Public access verification

    Figure 16: Public access verification

    To validate your security controls, attempt access from a public EC2 instance using the following command:
    curl https://<bucket-name>.s3.<region>.amazonaws.com/crl/<certificate-id>.crl

    This should fail, receiving an access denied error confirming that the CRL cannot be accessed from the public internet, as shown in Figure 17.

    Figure 17: Access denied error confirming that the CRL cannot be accessed from the public internet

    Figure 17: Access denied error confirming that the CRL cannot be accessed from the public internet

    Conclusion

    In this post, we walked you through two solutions that you can use to make your CRLs accessible to your internal organization, but not publicly available. First, we showed you how to configure a custom CNAME in your CRL distribution point and deploy Lambda functions to automatically copy each newly generated CRL from the default S3 bucket into a private S3 store.

    Next, we showed you a VPC architecture that uses an Amazon S3 VPC gateway endpoint, tightly scoped bucket policies, and private Route 53 DNS zones to make sure that CRL retrieval is confined to your VPC. We also covered the essential IAM and bucket policies that your clients need to access those CRLs securely. You can get started with setting up this solution on AWS Private CA today.

    If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Certificate Manager forum or contact AWS Support.

Rochak Karki

Rochak Karki

Rochak is a Security Specialist Solutions Architect at AWS, focusing on threat detection, incident response, and data protection helping customers build secure environments. Rochak is a US Army veteran and holds a Bachelor of Science in Engineering from the University of Wyoming. Outside of work, he enjoys spending time with family and friends, hiking, and traveling.

Cheryl Wang

Cheryl is an Associate Security Solutions Architect at AWS based in the SF Bay Area. Cheryl is passionate about cybersecurity and helping customers improve their security infrastructure. She holds a B.A. in Computer Science from Wellesley College. Outside of work, she enjoys writing and playing guzheng.

Getting started with Amazon S3 Tables in Amazon SageMaker Unified Studio

Post Syndicated from David Pasha original https://aws.amazon.com/blogs/big-data/getting-started-with-amazon-s3-tables-in-amazon-sagemaker-unified-studio/

Modern data teams face a critical challenge: their analytical datasets are scattered across multiple storage systems and formats, creating operational complexity that slows down insights and hampers collaboration. Data scientists waste valuable time navigating between different tools to access data stored in various locations, while data engineers struggle to maintain consistent performance and governance across disparate storage solutions. Teams often find themselves locked into specific query engines or analytics tools based on where their data resides, limiting their ability to choose the best tool for each analytical task.

Amazon SageMaker Unified Studio addresses this fragmentation by providing a single environment where teams can access and analyze organizational data using AWS analytics and AI/ML services. The new Amazon S3 Tables integration solves a fundamental problem: it enables teams to store their data in a unified, high-performance table format while maintaining the flexibility to query that same data seamlessly across multiple analytics engines—whether through JupyterLab notebooks, Amazon Redshift, Amazon Athena, or other integrated services. This eliminates the need to duplicate data or compromise on tool choice, allowing teams to focus on generating insights rather than managing data infrastructure complexity.

Table buckets are the third type of S3 bucket, taking place alongside the existing general purpose buckets, directory buckets, and now the fourth type – vector buckets. You can think of a table bucket as an analytics warehouse that can store Apache Iceberg tables with various schemas. Additionally, S3 Tables deliver the same durability, availability, scalability, and performance characteristics as S3 itself, and automatically optimize your storage to maximize query performance and to minimize cost.

In this post, you learn how to integrate SageMaker Unified Studio with S3 tables and query your data using Athena, Redshift, or Apache Spark in EMR and Glue.

Integrating S3 Tables with AWS analytics services

S3 table buckets integrate with AWS Glue Data Catalog and AWS Lake Formation to allow AWS analytics services to automatically discover and access your table data. For more information, see creating an S3 Tables catalog.

Before you get started with SageMaker Unified Studio, your administrator must first create a domain in the SageMaker Unified Studio and provide you with the URL. For more information, see the SageMaker Unified Studio Administrator Guide.

If you’ve never used S3 Tables in SageMaker Studio, you can allow it to enable the S3 Tables analytics integration when you create a new S3 Tables catalog in SageMaker Unified Studio.

Note: This integration needs to be configured individually in each AWS Region.

When you integrate using SageMaker Unified Studio, it takes the following actions in your account:

  • Creates a new AWS Identity and Access Management (IAM) service role that gives AWS Lake Formation access to all your tables and table buckets in the same AWS Region where you are going to provision the resources. This allows Lake Formation to manage access, permissions, and governance for all current and future table buckets.
  • Creates a catalog from an S3 table bucket in the AWS Glue Data Catalog.
  • Add the Redshift service role (AWSServiceRoleForRedshift) as a Lake Formation Read-only administrator permissions.

Prerequisites

Creating catalogs from S3 table buckets in SageMaker Unified Studio

To get started using S3 Tables in SageMaker Unified Studio you create a new Lakehouse catalog with S3 table bucket source using the following steps.

  1. Open the SageMaker console and use the region selector in the top navigation bar to choose the appropriate AWS Region.
  2. Select your SageMaker domain.
  3. Select or create a new project you want to create a table bucket in.
  4. In the navigation menu select Data, then select + to add a new data source.
  5. Choose Create Lakehouse catalog.
  6. In the add catalog menu, choose S3 Tables as the source.
  7. Enter a name for the catalog blogcatalog.
  8. Enter database name taxidata.
  9. Choose Create catalog.
  10. The following steps will help you create these resources in your AWS account:
    1. A new S3 table bucket and the corresponding Glue child catalog under the parent Catalog s3tablescatalog.
    2. Go to Glue console, expand Data Catalog, Click databases, a new database within that Glue child catalog. The database name will match the database name you provided.
    3. Wait for the catalog provisioning to finish.
  11. Create tables in your database, then use the Query Editor or a Jupyter notebook to run queries against them.

Creating and querying S3 table buckets

After adding an S3 Tables catalog, it can be queried using the format s3tablescatalog/blogcatalog. You can begin creating tables within the catalog and query them in SageMaker Studio using the Query Editor or JupyterLab. For more information, see Querying S3 Tables in SageMaker Studio.

Note: In SageMaker Unified Studio, you can create S3 tables only using the Athena engine. However, once the tables are created, they can be queried using Athena, Redshift, or through Spark in EMR and Glue.

Using the query editor

Creating a table in the query editor

  1. Navigate to the project you created in the top center menu of the SageMaker Unified Studio home page.
  2. Expand the Build menu in the top navigation bar, then choose Query editor.
  3. Launch a new Query Editor tab. This tool functions as a SQL notebook, enabling you to query across multiple engines and build visual data analytics solutions.
  4. Select a data source for your queries by using the menu in the upper-right corner of the Query Editor.
    1. Under Connections, choose Lakehouse (Athena) to connect to your Lakehouse resources.
    2. Under Catalogs, choose S3tablescatalog/blogcatalog.
    3. Under Databases, choose the name of the database for your S3 tables.
  5. Select Choose to connect to the database and query engine.
  6. Run the following SQL query to create a new table in the catalog.
    CREATE TABLE taxidata.taxi_trip_data_iceberg (
    pickup_datetime timestamp,
    dropoff_datetime timestamp,
    pickup_longitude double,
    pickup_latitude double,
    dropoff_longitude double,
    dropoff_latitude double,
    passenger_count bigint,
    fare_amount double
    )
    PARTITIONED BY
    (day(pickup_datetime))
    TBLPROPERTIES (
    'table_type' = 'iceberg'
    );

    After you create the table, you can browse to it in the Data explorer by choosing S3tablescatalog →s3tableCatalog →taxidata→taxi_trip_data_iceberg.

  7. Insert data into a table with the following DML statement.
    INSERT INTO taxidata.taxi_trip_data_iceberg VALUES (
    TIMESTAMP '2025-07-20 10:00:00',
    TIMESTAMP '2025-07-20 10:45:00',
    -73.985,
    40.758,
    -73.982,
    40.761,
    2, 23.75
    );

  8. Select data from a table with the following query.
    SELECT * FROM taxidata.taxi_trip_data_iceberg
    WHERE pickup_datetime >= TIMESTAMP '2025-07-20'
    AND pickup_datetime < TIMESTAMP '2025-07-21';

You can learn more about the Query Editor and explore additional SQL examples in the SageMaker Unified Studio documentation.

Before proceeding with JupyterLab setup:

To create tables using the Spark engine via a Spark connection, you must grant the S3TableFullAccess permission to the Project Role ARN.

  1. Locate the Project Role ARN in SageMaker Unified Studio Project Overview.
  2. Go to the IAM console then select Roles.
  3. Search for and select the Project Role.
  4. Attach the S3TableFullAccess policy to the role, so that the project has full access to interact with S3 Tables.

Using JupyterLab

  1. Navigate to the project you created in the top center menu of the SageMaker Unified Studio home page.
  2. Expand the Build menu in the top navigation bar, then choose JupyterLab.
  3. Create a new notebook.
  4. Select Python3 Kernel.
  5. Choose PySpark as the connection type.
  6. Select your table bucket and namespace as the data source for your queries:
    1. For Spark engine, execute query USE s3tablescatalog_blogdata

Querying data using Redshift:

In this section, we walk through how to query the data using Redshift within SageMaker Unified Studio.

  1. From the SageMaker Studio home page, choose your project name in the top center navigation bar.
  2. In the navigation panel, expand the Redshift project folder.
  3. Open the blogdata@s3tablescatalog database.
  4. Expand the taxidata schema.
  5. Under the Tables section, locate and expand taxi_trip_data_iceberg.
  6. Review the table metadata to view all columns and their corresponding data types.
  7. Open the Sample data tab to preview a small, representative subset of records.
  8. Choose Actions.
  9. Select Preview data from the dropdown to open and view the full dataset in the data viewer.

When you select your table, the Query Editor automatically opens with a pre-populated SQL query. This default query retrieves the top 10 records from the table, giving you an instant preview of your data. It uses standard SQL naming conventions, referencing the table by its fully qualified name in the format database_schema.table_name. This approach ensures the query accurately targets the intended table, even in environments with multiple databases or schemas.

Best practices and considerations

The following are some considerations you should take note of.

  • When you create an S3 table bucket using the S3 console, integration with AWS analytics services is enabled automatically by default. You can also choose to set up the integration manually through a guided process in the console. Also, when you create S3 Table bucket programmatically using the AWS SDK, or AWS CLI, or REST APIs, the integration with AWS analytics services is not automatically configured. You need to manually perform the steps required to integrate the S3 Table bucket with AWS Glue Data Catalog and Lake Formation, allowing these services to discover and access the table data.
  • When creating an S3 table bucket for use with AWS analytics services like Athena, we recommend using all lowercase letters for the table bucket name. This requirement ensures proper integration and visibility within the AWS analytics ecosystem. Learn more about it from getting started with S3 tables.
  • S3 Tables offer automatic table maintenance features like compaction, snapshot management, and unreferenced file removal to optimize data for analytics workloads. However, there are some limitations to consider. Please read more on it from considerations and limitations for maintenance jobs.

Conclusion

In this post, we discussed how to use SageMaker Unified Studio’s integration with S3 Tables to enhance your data analytics workflows. The post explained the setup process, including creating a Lakehouse catalog with S3 table bucket source, configuring necessary IAM roles, and establishing integration with AWS Glue Data Catalog and Lake Formation. We walked you through practical implementation steps, from creating and managing Apache Iceberg based S3 tables to executing queries through both the Query Editor and JupyterLab with PySpark, as well as accessing and analyzing data using Redshift.

To get started with SageMaker Unified Studio and S3 Tables integration, visit Access Amazon SageMaker Unified Studio documentation.


About authors

Sakti Mishra

Sakti Mishra

Sakti is a Principal Data and AI Solutions Architect at AWS, where he helps customers modernize their data architecture and define end-to end-data strategies, including data security, accessibility, governance, and more. He is also the author of Simplify Big Data Analytics with Amazon EMR and AWS Certified Data Engineer Study Guide. Outside of work, Sakti enjoys learning new technologies, watching movies, and visiting places with family.

Vivek Shrivastava

Vivek Shrivastava

Vivek is a Principal Data Architect, Data Lake in AWS Professional Services. He is a big data enthusiast and holds 14 AWS Certifications. He is passionate about helping customers build scalable and high-performance data analytics solutions in the cloud. In his spare time, he loves reading and finds areas for home automation.

David Pasha

David Pasha

David is a Senior Healthcare and Life Sciences (HCLS) Technical Account Manager with 16 years of expertise in analytics. As an active member of the Analytics Technical Field Community (TFC), he specializes in designing and implementing scalable data warehouse solutions for customers in the cloud.

Debu Panda

Debu Panda

Debu is a Senior Manager, Product Management at AWS. He is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world.

Building responsive APIs with Amazon API Gateway response streaming

Post Syndicated from Anton Aleksandrov original https://aws.amazon.com/blogs/compute/building-responsive-apis-with-amazon-api-gateway-response-streaming/

Today, AWS announced support for response streaming in Amazon API Gateway to significantly improve the responsiveness of your REST APIs by progressively streaming response payloads back to the client. With this new capability, you can use streamed responses to enhance user experience when building LLM-driven applications (such as AI agents and chatbots), improve time-to-first-byte (TTFB) performance for web and mobile applications, stream large files, and perform long-running operations while reporting incremental progress using protocols such as server-sent events (SSE).

In this post you will learn about this new capability, the challenges it addresses, and how to use response streaming to improve the responsiveness of your applications.

Overview

Consider this scenario – you’re running an AI-powered agentic application that uses an Amazon Bedrock foundation model. Your users interact with the application through an API, asking complex questions that require detailed responses. Before response streaming, users would send their prompts and wait to eventually receive the application response, sometimes for tens of seconds. This awkward pause between questions and responses created a disconnected, unnatural experience.

With the new API Gateway response streaming capability, the interaction through the API becomes much more fluid and natural. As soon as your application starts processing the model response, you can stream it back to your users using the API Gateway.

The following animation illustrates this significant user experience improvement. The prompt on the left is processed using a non-streaming response with user having to wait for several seconds to receive the result. The prompt on the right is using the new API Gateway response streaming, significantly reducing TTFB and improving user experience.

Figure 1. Comparing user experience before (left) and after (right) enabling API Gateway response streaming when returning a response from a Bedrock foundational model.

Your users can now see AI responses appear in real-time, word by word, just like watching someone type. This immediate feedback makes your applications feel more responsive and engaging, keeping users connected throughout the interaction. In addition, you don’t have to worry about response size limits or implement complex workarounds – the streaming happens automatically and efficiently, letting you focus on building great user experiences rather than managing infrastructure constraints.

Understanding response steaming

In the traditional request-response model, responses must be fully computed before being sent to the client. This can negatively impact user experience – the client must wait for the complete response to be generated on the server-side and transmitted over-the-wire. This is especially pronounced in interactive, latency-sensitive cloud applications such as AI agents, chatbots, virtual assistants, or music generators.

Figure 2. Response is returned to the client only after it’s been fully generated, increasing time-to-first-byte latency.

Another important scenario is returning larger response payloads, such as images, large documents, or datasets. In some cases, these payloads may exceed the 10 MB response size limit or default integration timeout limit of 29 seconds of API Gateway. Before the launch of response streaming, developers worked around these limitations by using pre-signed Amazon S3 URLs to download large responses or accepting lower RPS for an increase in timeout. While functional, these workarounds introduced additional latency and architectural complexity.

With response streaming support you can address these challenges. You can now update your REST APIs to return streamed responses, significantly enhancing user experience, improving TTFB performance, supporting response payload sizes to exceed 10 MB, and serving requests that can take up to 15 minutes.

Figure 3. Response streaming reduces time-to-first-byte and improves user experience.

The response streaming capability is already delivering significant performance for organizations:

“Working closely with the AWS teams to enable response streaming was instrumental in advancing our roadmap to deliver the most performant storefront experiences for our largest customers at Salesforce Commerce Cloud. Our collaboration exceeded our Core Web Vital goals; we saw our Total Blocking Time metrics drop by over 98%, which will enable our customers to drive higher revenue and conversion rates.”, says Drew Lau, Senior Director of Product Management at Salesforce.

Response streaming is supported for any HTTP-proxy integration, AWS Lambda functions (using proxy integration mode), and private integrations. To get started, configure your API integration to stream the response from your backend, as described in the following sections, and redeploy your API for changes to take effect.

Getting started with response streaming

To enable response streaming for your REST APIs, update your integration configuration to set the response transfer mode to STREAM. This enables API Gateway to start streaming the response to the client as soon as response bytes become available. When using response streaming, you can configure request timeout up to 15 minutes. For best time to first byte user experience, AWS strongly recommends your backend integration also implements response streaming.

You can enable response streaming in several different ways, as illustrated in the following snippets:

Using the API Gateway console, when creating method integrations, select Stream for the Response transfer mode.

Figure 4. Enabling response streaming in API Gateway Console.

Setting response transfer mode using the Open API spec:

paths:
  /products:
    get:
      x-amazon-apigateway-integration:
        httpMethod: "GET"
        uri: "https://example.com"
        type: "http_proxy"
        timeoutInMillis: 300000
        responseTransferMode: "STREAM"

Setting response transfer mode using infrastructure-as-code (IaC) frameworks, such as AWS CloudFormation. Note the /response-streaming-invocations Uri fragment, it tells API Gateway to use the Lambda InvokeWithResponseStreaming endpoint:

MyProxyResourceMethod:
  Type: 'AWS::ApiGateway::Method'
  Properties:
    RestApiId: !Ref LambdaSimpleProxy
    ResourceId: !Ref ProxyResource
    HttpMethod: ANY
    Integration:
      Type: AWS_PROXY
      IntegrationHttpMethod: POST
      ResponseTransferMode: STREAM
      Uri: !Sub arn:aws:apigateway:${APIGW_REGION}:lambda:path/2021-11-
           15/functions/${FN_ARN}/response-streaming-invocations

Updating response transfer mode using the AWS CLI:

aws apigw update-integration \
   --rest-api-id a1b2c2 \
   --resource-id aaa111 \
   --http-method GET \
   --patch-operations "op='replace',path='/responseTransferMode',value=STREAM" \
   --region us-west-2

Using response streaming with Lambda functions

When using Lambda functions as a downstream integration endpoint, your Lambda functions must be streaming-enabled. The API Gateway uses the InvokeWithResponseStreaming API to invoke functions, as illustrated in the following diagram, and requires Lambda proxy integration. See the API Gateway documentation for additional guidance.

Figure 5. Using API Gateway response streaming with Lambda functions for interactive AI applications.

When you use response streaming with Lambda functions, API Gateway expects the handler response stream to contain the following components (in order):

  • JSON response metadata – Must be a valid JSON object and can only contain statusCode, headers, multiValueHeaders, and cookies fields (all optional). Metadata cannot be an empty string; at a minimum it must be an empty JSON object.
  • The 8-null-byte delimiter – Lambda adds this delimiter automatically when you use the built-in awslambda.HttpResponseStream.from() method, as illustrated below. When not using this method, you’re responsible for adding the delimiter yourself.
  • Response payload – Can be empty.

The following code snippet illustrates how you can return a streamed response from your Lambda functions so it will be compatible with API Gateway response streaming:

export const handler = awslambda.streamifyResponse(
   async (event, responseStream, context) => {

      const httpResponseMetadata = {
         statusCode: 200,
         headers: {
            'Content-Type': 'text/plain',
            'X-Custom-Header': 'some-value'
         }
      };

      responseStream = awslambda.HttpResponseStream.from(
         responseStream,
         httpResponseMetadata
      );

      responseStream.write('hello');
      await new Promise(r => setTimeout(r, 1000));
      responseStream.write(' world');
      await new Promise(r => setTimeout(r, 1000));
      responseStream.write('!!!');
      responseStream.end();
   }
);

Refer to the API Gateway documentation for further implementation guidelines.

Using response streaming with HTTP Proxy integrations

You can stream HTTP responses from your applications used as downstream integration endpoints, for example web servers running on Amazon Elastic Container Service (Amazon ECS) or Amazon Elastic Kubernetes Service (Amazon EKS). In this case, you must use HTTP_PROXY integration and specify the response transfer mode as STREAM (using the console, AWS CLI, or IaC). Redeploy your API after modifying it.

Figure 6. Using API Gateway response streaming with HTTP server applications.

Once API Gateway receives a streaming response from your application, it will wait until the HTTP headers block transfer is complete. Then, it will send to the client an HTTP response status code and headers, followed by the content from your application as it gets received by the API Gateway service. It will continue streaming response from your application to the client until the stream ends (up to 15 minutes).

Many popular API and web application development frameworks provide response streaming abstractions. The following code snippet illustrates how you can implement HTTP response streaming using FastAPI:

app = FastAPI()

async def stream_response():
   yield b"Hello "
   await asyncio.sleep(1)
   yield b"World "
   await asyncio.sleep(1)
   yield b"!"

@app.get("/")
async def main():
   return StreamingResponse(stream_response(), media_type="text/plain")

Adding real-time response streaming to your HTTP clients

Different HTTP clients have different ways to process streamed response fragments as they arrive. The following code snippet illustrates how to process a streamed response with a Node.js application:

const request = http.request(options, (response)=>{
   response.on('data', (chunk) => {
      console.log(chunk);
   });

   response.on('end', () => {
      console.log('Response complete’);
   });
});

request.end();

When using CURL, you can use the –no-buffer argument to print response fragments as they arrive.

curl --no-buffer {URL}

Sample code

Clone this sample project from GitHub to see API Gateway response streaming in action. Follow instructions in the README.md to provision the sample project in your AWS account.

Considerations

Before you enable response streaming, consider:

  • Response streaming is available for REST APIs and can be used with HTTP_PROXY integrations, Lambda integrations (in proxy mode), and private integrations.
  • You can use API Gateway response streaming with any endpoint type, such as Regional, Private, and Edge-optimized, with or without custom domain names.
  • When using response streaming, you can configure response timeouts up to 15 minutes, according to your scenario requirements.
  • All streaming responses from Regional or Private endpoints are subject to a 5-minute idle timeout. All streaming responses from edge-optimized endpoints are subject to a 30-second idle timeout.
  • Within each streaming response, the first 10MB of response payload is not subject to any bandwidth restrictions. Response payload data exceeding 10MB is restricted to 2MB/s.
  • Response streaming is compatible with API Gateway security capabilities such as authorizers, WAF, access controls, TLS/mTLS, request throttling, and access logging.
  • When processing streamed responses, the following features are not supported: response transformation with VTL, integration response caching, and content encoding.
  • Always protect your APIs against unauthorized access and other potential security threats by implementing proper authorization with Lambda Authorizers or Amazon Cognito User Pools. Read REST API protection documentation and API Gateway security documentation for additional details.

Observability

You can continue using existing observability capabilities, such as execution logs, access logs, AWS X-Ray integration, and Amazon CloudWatch metrics with API Gateway response streaming.

In addition to the existing access logs variables, the following new variables are available:

  • $content.integration.responseTransferMode – the response transfer mode of your integration. This can be either BUFFERED or STREAMED.
  • $context.integration.timeToAllHeaders – the time between when API Gateway establishes the integration connection to when it receives all integration response headers from the client.
  • $context.integration.timeToFirstContent – the time between when API Gateway establishes the integration connection to when it receives the first content bytes.

See API Gateway documentation for more information.

Pricing

With this new capability, you continue to pay the same API Invoke rates for streamed responses. Each 10MB of response data, rounded up to the nearest 10MB, is billed as a single request. See API Gateway pricing page for additional details.

Conclusion

The new response streaming capability for Amazon API Gateway enhances how you can build and deliver responsive APIs in the cloud. With immediate streaming of response data as it becomes available, you can significantly improve time-to-first-byte performance and overcome traditional payload size and timeout limitations. This is particularly valuable for AI-powered applications, file transfers, and interactive web experiences that demand real-time responsiveness.

To learn more about API Gateway response streaming see the service documentation.

To learn more about building Serverless architectures see Serverless Land.

Analyze AWS Network Firewall logs using Amazon OpenSearch dashboard

Post Syndicated from Hoorang Broujerdi original https://aws.amazon.com/blogs/security/analyze-aws-network-firewall-logs-using-amazon-opensearch-dashboard/

Amazon CloudWatch and Amazon OpenSearch Service have launched a new dashboard that simplifies the analysis of AWS Network Firewall logs. Previously, in our blog post How to analyze AWS Network Firewall logs using Amazon OpenSearch Service we demonstrated the required services and steps to create an OpenSearch dashboard. The new dashboard removes these extra steps and streamlines the entire process. In this post, I show you how to build and use the new OpenSearch Service dashboards to analyze Network Firewall logs more efficiently.

Network Firewall is a managed security service that protects Amazon Virtual Private Cloud (Amazon VPC) VPCs by monitoring and filtering network traffic. Network Firewall provides stateful inspection, which gives you information that you can use to create custom rules to control incoming and outgoing traffic. It automatically scales, offers high availability, and integrates with other AWS security services, in addition to helping to block unexpected traffic, prevent unauthorized access, and filter traffic based on domains and IP addresses.

Analyzing Network Firewall logs provides you with insight into the traffic entering or leaving your VPC and helps you troubleshoot issues and understand your security posture over time. This analysis is crucial for maintaining effective security controls.

Network Firewall generates three types of logs from its stateful engine:

  • Flow logs: These capture standard network traffic flow information based on your stateless rules
  • Alert logs: These show traffic that matches stateful rules configured with DROP, ALERT, or REJECT actions
  • TLS logs: These provide details about TLS inspection events (requires TLS inspection configuration)

Prerequisites

This post assumes that you’re familiar with the fundamentals of AWS networking concepts and services such as Amazon VPC, subnets, routing tables, and other services such as Network Firewall, Amazon CloudWatch, and OpenSearch Service.

To analyze Network Firewall logs using OpenSearch Service, you must have:

  1. An active Network Firewall in your VPC
  2. CloudWatch log groups configured for:
    1. Flow logs, for example /inspection-nwfw-flow-logs
    2. Alert logs, for example /inspection-nwfw-alert-logs

If you haven’t deployed Network Firewall in your VPC, you can use one of the available Network Firewall deployment architecture templates to create a firewall. After creating a firewall, configure CloudWatch log groups for the firewall flow and alert logs and configure stateful logging. Fine-tune your firewall policy and rule configuration and make sure that you’re routing traffic symmetrically through the firewall. Verify that your CloudWatch log groups are receiving firewall logs. You can do this by navigating to the AWS Management Console for CloudWatch, selecting your log group, and viewing the log streams under the Log streams tab.

With the firewall in the routed path and publishing metrics and log events, you can proceed with creating a Network Firewall OpenSearch dashboard.

Scenario

In this post, I show you how to set up a centralized architecture, single Availability Zone deployment as shown in Figure 1. Then, you will create an OpenSearch dashboard for your firewall to monitor and analyze traffic.

Figure 1: Network Firewall centralized architecture, single Availability Zone deployment
Figure 1: Network Firewall centralized architecture, single Availability Zone deployment

Solution deployment

To analyze Network Firewall logs in OpenSearch Service, you first need to create an OpenSearch integration.

To create an OpenSearch Service integration:

  1. Open the Amazon CloudWatch console.
  2. Choose Settings in the navigation pane.
  3. Choose the Logs tab.
  4. Scroll down to find OpenSearch integration and choose Create integration.

Figure 2: Create an OpenSearch integration
Figure 2: Create an OpenSearch integration

  1. There are three items to be configured under OpenSearch collection:
    1. Enter a name for Integration name. For example, CW-AOS-Integration01.
    2. KMS key ARN – optional is optional. If you leave that empty, your data will be encrypted by default with a key that AWS owns and manages. You also have an option to create and use an AWS Key Management Service (AWS KMS) key.
    3. For Data retention, select a number between 1 and 30 depending on your retention policy. For example, select 10 to retain logs for 10 days.

Figure 3: Configure an OpenSearch collection
Figure 3: Configure an OpenSearch collection

  1. Next, you need to configure AWS Identity and Access Management (IAM) permissions.
    1. For the IAM role for writing to OpenSearch collection, you can either create a new role or use an existing role. If you choose Create new role, then you need to provide an IAM role name. For example, CWLogQueryOS. This role must have permissions to read from all log groups in the account. See Permissions that the integration needs for an example of the permission that the integration needs.
    2. IAM roles and users who can view dashboards defines who can view the dashboards. Select either:
      • Allow all roles and users in this account to view dashboards.
      • Specify roles and users who can view dashboards. By choosing Specify roles…, you can select the IAM roles and users who can view the dashboard.
    3. Choose Confirm integration setup to create the integration. It might take 1–5 minutes for the integration to be created.

Figure 4: Configure IAM permissions
Figure 4: Configure IAM permissions

After you receive notification of successful creation of the OpenSearch integration, you can create an OpenSearch dashboard.

To create an OpenSearch dashboard:

  1. Navigate to Amazon CloudWatch console and choose Logs insights in the navigation pane.
  2. In Logs Insights, choose the Analyze with OpenSearch tab.
  3. Choose Create dashboard.
  4. Under Select dashboard type, select AWS Network Firewall.
  5. Enter a name for the dashboard, such as InspectionFirewall.

Figure 5: Select the dashboard type and enter a name
Figure 5: Select the dashboard type and enter a name

  1. Under Dashboard data configuration, select Every 5 minutes.
  2. Under Select log groups, select Inspection-nwfw-alert-logs and Inspection-nwfw-flow-logs.

Figure 6: Select data synchronization frequency and log groups
Figure 6: Select data synchronization frequency and log groups

  1. Choose Create dashboard. If you have multiple firewalls in your environment, repeat steps 1–8 to create a dashboard for each Firewall.
  2. Choose Select a dashboard and select and select a dashboard to view.

Figure 7: View a list of existing firewalls in OpenSearch dashboards
Figure 7: View a list of existing firewalls in OpenSearch dashboards

Dashboard overview

Your new OpenSearch dashboard, similar to Figure 8, provides you with visual insight into some of your firewall events such as:

  • Top talkers
  • Top protocols
  • Alert log analysis
  • Firewall engines

Figure 8: Network Firewall OpenSearch dashboard
Figure 8: Network Firewall OpenSearch dashboard

As shown in Figure 9, you can refine your analysis to focus on a specific traffic pattern or security event by using the filters at the top of the dashboard to focus on traffic based on:

  • Source or destination addresses
  • Protocols
  • Actions
  • Firewall names

Figure 9: Network Firewall OpenSearch dashboard filters
Figure 9: Network Firewall OpenSearch dashboard filters

To dive deep into a widget:

  • Hover your cursor over a widget in the dashboard to reveal the options menu icon (…) in the top right corner of the widget.
  • Choose the options menu icon (…) to maximize the widget or open the Inspect view, as shown in Figure 10.

Figure 10: Top Source IP by Packets widget showing the options menu icon (…)
Figure 10: Top Source IP by Packets widget showing the options menu icon (…)

Figure 11 shows the Inspect window for the Top Source IP by Packets widget. In this window, you can get information by selecting Statistics, Request, or Response.

Figure 11: Inspect window for Top Source IP by Packets widget
Figure 11: Inspect window for Top Source IP by Packets widget

This window might look different depending on the widget you choose. Some widget options menus provide more information than others and include an option to download the information in CSV format. For example, you can use the Top Source IPs by Packets and Bytes widget to view data and download it in CSV format, as shown in Figure 12.

Figure 12: Inspect window for Top Source IPs by Packets and Bytes widget
Figure 12: Inspect window for Top Source IPs by Packets and Bytes widget

When using the Top Source IPs by Packets and Bytes, you can use the View menu to switch the view from Data to Requests to access more information, as shown in Figure 13.

Figure 13: Switch the Inspect window view for Top Source IPs by Packets and Bytes widgets between Data and Requests
Figure 13: Switch the Inspect window view for Top Source IPs by Packets and Bytes widgets between Data and Requests

Example use cases

The following are some examples of how you can use the Network Firewall OpenSearch dashboard to facilitate monitoring and troubleshooting:

  • Identify unusual traffic patterns:
    • Use the Top Source IPs by Packets and Bytes widget
    • Look for unexpected spikes or outliers
  • Monitor security rule effectiveness:
    • Analyze the Alert Log Analysis section
    • Track which rules are triggering most frequently
  • Troubleshoot connectivity issues:
    • Use filters to isolate traffic for specific IP ranges
    • Examine flow logs for blocked connections
  • Verify compliance:
    • Review TLS logs to verify encryption standards
    • Use filters to focus on traffic to and from sensitive resources

Cost considerations

You will incur charges for AWS Network Firewall and the OpenSearch services used. For more information, see AWS Network Firewall Pricing and Amazon CloudWatch Pricing.

Conclusion

By building Amazon OpenSearch Service dashboards for AWS Network Firewall logs to transform complex security data into actionable insights, you can monitor and analyze your network security posture more effectively. By combining the robust security features of Network Firewall with the powerful visualization capabilities offered by OpenSearch Service, you gain real-time visibility into network traffic patterns, can quickly identify potential security threats, and streamline your troubleshooting workflows. This solution reduces the mean time to detect security incidents and improves operational efficiency through visual analytics to support data-driven decision making. Whether you’re focusing on threat detection, compliance monitoring, or security optimization, these dashboards can provide the visibility and insights needed to strengthen your overall security posture.


If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Hoorang Broujerdi

Hoorang Broujerdi

Hoorang is a Senior Technical Account Manager at AWS Enterprise Support with more than two decades of experience. He helps organizations architect resilient, secure, and efficient cloud environments, guiding them through complex networking challenges and large-scale infrastructure transformations. He has helped numerous organizations enhance their cloud operations through targeted optimizations, robust architectures, and best-practice implementations.

Post-quantum (ML-DSA) code signing with AWS Private CA and AWS KMS

Post Syndicated from Panos Kampanakis original https://aws.amazon.com/blogs/security/post-quantum-ml-dsa-code-signing-with-aws-private-ca-and-aws-kms/

Following our recent announcement of ML-DSA support in AWS Key Management Service (AWS KMS), we just introduced post-quantum ML-DSA signature support in AWS Private Certificate Authority (AWS Private CA). Customers can use AWS Private CA to create and manage their own private public key infrastructure (PKI) hierarchies. Through this integration, you can establish and use customer-managed quantum-resistant roots of trust for code signing, device authentication, outside (of AWS) workload authentication with AWS IAM Roles Anywhere, or communication tunnels such as IKEv2/IPsec or Mutual TLS (mTLS) using private PKI.

As outlined in the AWS post-quantum cryptography migration plan, establishing quantum-resistant roots of trust is critical for systems that need to maintain security for extended periods of time. ML-DSA, a signature scheme standardized in FIPS 204, provides quantum resistance while maintaining the performance characteristics needed for deployments at scale.

Previously, we shared how to use AWS Private CA and AWS KMS for code signing. In this post, we show you how to combine the post-quantum signing capability provided by AWS KMS with post-quantum code-signing PKI from AWS Private CA. Consumers of signed code that have been pre-provisioned with the post-quantum PKI roots can rest assured that the software could not have been forged by an adversary with a cryptographically relevant quantum computer (CRQC). For demonstration purposes, we use the diy-code-signing-kms-private-ca sample program, which uses the AWS SDK for Java. This code creates a PKI infrastructure, generates a code-signing certificate, signs binary code, and verifies the signature. Although we break down the steps to demonstrate the functionality in this post, you can run the Runner as-is to see it in action with commands found in the README file.

This post uses the Cryptographic Message Syntax (CMS) standard for encapsulating the signatures generated for input binary data. It stores the signature, X.509 certificate, and chain used to establish trust. The signature, known as a detached signature, doesn’t contain the original data. The detached signature can be used together with the original file, which was signed with standard tools such as OpenSSL natively to validate the authenticity of the file.

Create a post-quantum PKI hierarchy

For this post, we will use AWS Private CA to introduce a code-signing PKI. It will consist of a root CA to sign a subordinate CA, and a code-signing certificate signed by the subordinate CA. The whole chain will consist of quantum-resistant ML-DSA certificates.

CA hierarchy creation

First, the post-quantum CA hierarchy must be created with ML-DSA. In this example, we use the ML-DSA-65 variant of the post-quantum signature algorithm. You can do this with the AWS Java SDK as shown in the Runner.java file:

PrivateCA rootPrivateCA = PrivateCA.builder()
	.withCommonName(ROOT_COMMON_NAME)
	.withType(CertificateAuthorityType.ROOT)
	.withAlgorithmFamily(ML_DSA_65_ALGORITHM_FAMILY)
	.getOrCreate();

PrivateCA subordinatePrivateCA = PrivateCA.builder()
    .withIssuer(rootPrivateCA).withCommonName(SUBORDINATE_COMMON_NAME)
    .withType(CertificateAuthorityType.SUBORDINATE)
	.withAlgorithmFamily(ML_DSA_65_ALGORITHM_FAMILY)
    .getOrCreate();

Code-signer creation

For code signing, you need an asymmetric key pair and a code-signing certificate. The asymmetric ML-DSA key pair is generated in AWS KMS and the code-signing certificate is issued by AWS Private CA.

Create an ML-DSA key pair in AWS KMS

First, you must create an asymmetric key pair for code signing operations. Similar to the creation of the hierarchy, the AWS Java SDK can be used to create that AWS KMS key (key pair). Signing will be taking place with the key pair’s private key in AWS KMS. The corresponding public key will be in the code-signing leaf certificate signed by the subordinate CA. These calls are performed as part of the main method within the Runner.java file:

AsymmetricCMK codeSigningCMK = AsymmetricCMK
    .builder().withAlias(CMK_ALIAS)
	.withAlgorithmFamily(ML_DSA_65_ALGORITHM_FAMILY)
    .getOrCreate();

Alternatively, you can generate the key pair in AWS KMS with the AWS Management Console or the AWS Command Line Interface (AWS CLI) as shown in the ML-DSA KMS security blog.

Issue a code-signing certificate

Creating a certificate signing request (CSR) using AWS Private CA is a two-step process. First, you must create a CSR that contains both an identity (Subject) and the previously created AWS KMS public key. The following code snippet in Runner.java accomplishes this:

String codeSigningCSR = codeSigningCMK
	.generateCSR(END_ENTITY_COMMON_NAME);

OpenSSL 3.5 or later can parse this CSR to view its content with the following command if the CSR contents have been written to disk at csr.pem:

openssl req -in csr.pem -inform pem -text -noout
Certificate Request:
	Data:
		Version: 1 (0x0)
		Subject: CN=CodeSigningCertificate
		Subject Public Key Info:
			Public Key Algorithm: ML-DSA-65
				ML-DSA-65 Public-Key:
				pub:
					<Public Key Data>   
		Attributes:
			Requested Extensions:
				X509v3 Basic Constraints:
					CA:FALSE
	Signature Algorithm: ML-DSA-65
	Signature Value:
		<Signature Data>

You can see that the CSR contains an ML-DSA-65 public key. Its corresponding private key will be used to sign code.

The CSR is used by the subordinate CA to issue the code-signing certificate. Note that the code-signing template is used in the templateArn of the IssueCertificate request in the relevant PrivateCA.java file. The inclusion of this template helps ensure that AWS Private CA will issue a certificate with the correct Key Usage (KU) and Extended Key Usage (EKU) extension values, regardless of the values presented in the CSR.

IssueCertificateRequest issueCertificateRequest = IssueCertificateRequest.builder()
	.idempotencyToken(UUID.randomUUID().toString())
	.certificateAuthorityArn(subordinatePrivateCA.arn())
	.csr(SdkBytes.fromUtf8String(csr))
	.signingAlgorithm(algorithmFamily.getPcaSigningAlgorithm())
	.templateArn("arn:aws:acm-pca:::template/CodeSigningCertificate/V1")
	.validity(validity)
	.build();

IssueCertificateResponse issueCertificateResponse = client
	.issueCertificate(issueCertificateRequest);

String certificateArn = issueCertificateResponse.certificateArn();

GetCertificateRequest getCertificateRequest = GetCertificateRequest.builder()
	.certificateAuthorityArn(ca.arn())
	.certificateArn(certificateArn)
	.build();

The response includes the ML-DSA-65 code-signing certificate. You can use OpenSSL 3.5 or later to inspect the contents of the certificate after you save it to a file named code-signing-cert.pem:

openssl x509 -in code-signing-cert.pem -inform pem -text -noout
Certificate:
	Data:
		Version: 3 (0x2)
		Serial Number:
			1a:15:af:1e:64:8d:cd:29:b4:dc:66:2a:8b:1e:ee:b0
		Signature Algorithm: ML-DSA-65
		Issuer: CN=CodeSigningSubordinate-MLDSA65
		Validity
			Not Before: Sep 24 13:10:38 2025 GMT
			Not After : Sep 24 14:10:38 2026 GMT
		Subject: CN=CodeSigningCertificate
		Subject Public Key Info:
			Public Key Algorithm: ML-DSA-65
				ML-DSA-65 Public-Key:
				pub:
					<Public Key Data>
		X509v3 extensions:
			X509v3 Basic Constraints:
				CA:FALSE
			X509v3 Authority Key Identifier:
B7:EF:2E:C9:7A:A8:7E:B5:D6:2D:9A:3F:C7:A7:F8:9D:74:01:6A:EF
			X509v3 Subject Key Identifier:

7F:63:35:0C:56:F8:ED:F1:2A:DF:B5:2E:7C:F1:2C:D9:A0:0E:63:B6
			X509v3 Key Usage: critical
				Digital Signature
			X509v3 Extended Key Usage: critical
				Code Signing
	Signature Algorithm: ML-DSA-65
	Signature Value:
		<Signature Data>

You can see that the certificate includes the ML-DSA-65 public key of the code-signing key pair and the ML-DSA-65 signature from the subordinate CA. You also see the KU and the EKU values, which represent a code-signing certificate from the AWS Private CA template.

Sign code

At this point, you have set up the code-signing PKI, have a code-signing certificate issued by AWS Private CA and a corresponding ML-DSA key pair residing in KMS.

The Java SDK can be used to generate a CMS signature for a code binary file. In the background, this is accomplished by calling the AWS KMS Sign API with the ML-DSA key pair as shown in Runner.java. The following is part of the Java code. This first snippet involves building a certificate chain and then using it along with the code-signing AWS KMS key, the signer’s certificate, and <DATA_TO_SIGN>, the byte array representation of the code file, to generate the detached signature in a CMS structure.

	// Parse code-signing certificate from PEM
	X509CertificateHolder signerCert = CertificateUtils
		.fromPEM(codeSigningCertificate.certificate());

	Collection<X509CertificateHolder> chainCerts = CertificateUtils
		.toCertificateHolders(codeSigningCertificate.certificateChain());

	// Build certificate chain including code-signing cert and intermediate certs
	Collection<X509CertificateHolder> certChain = new ArrayList<> ();
	certChain.add(signerCert);

	// Parse certificate chain
	for (X509CertificateHolder chainCert : chainCerts) {
		if (!chainCert.equals(signerCert)) {
			certChain.add(chainCert);
		}
	}

	// Create detached CMS signature
	CMSCodeSigningObject cmsCodeSigningObject = CMSCodeSigningObject
		.createDetachedSignature(
			codeSigningCMK,
			ML_DSA_65_ALGORITHM_FAMILY,
			<DATA_TO_SIGN>,
			signerCert,
			certChain);

The code-signing object is written to disk in signature-MLDSA65.p7s. You can inspect it with OpenSSL 3.5 or later:

openssl cms -cmsout -in signature-MLDSA65.p7s -inform DER -print
CMS_ContentInfo:
	contentType: pkcs7-signedData (1.2.840.113549.1.7.2)
	d.signedData:
		version: 1
		digestAlgorithms:
			algorithm: shake256 (2.16.840.1.101.3.4.2.12)
			parameter: <ABSENT>
		encapContentInfo:
			eContentType: pkcs7-data (1.2.840.113549.1.7.1)
			eContent: <ABSENT>
		certificates:
			d.certificate:
				cert_info:
					version: 2
					serialNumber: 0xD0B2937F5BABC80AD55C0A90E1DE7057
					signature:
						algorithm: ML-DSA-65 (2.16.840.1.101.3.4.3.18)
						parameter: <ABSENT>
					issuer:			CN=CodeSigningSubordinate-MLDSA65
					validity:
						notBefore: Oct 28 15:05:27 2025 GMT
						notAfter: Oct 28 16:05:26 2026 GMT
					subject:		CN=CodeSigningCertificate
					key:		X509_PUBKEY:
						algor:
							algorithm: ML-DSA-65 (2.16.840.1.101.3.4.3.18)
							parameter: <ABSENT>
						public_key:(0 unused bits)
							...
						issuerUID: <ABSENT>
						subjectUID: <ABSENT>
						extensions:
							object: X509v3 Basic Constraints (2.5.29.19)
							critical: FALSE
							value:
								0000 - 30 00 0.
                                    
							object: X509v3 Authority Key Identifier (2.5.29.35)
							critical: FALSE
							value:
								0000 - 30 16 80 14 b7 ef 2e c9-7a a8 7e b5 d60.......z.~..
								000d - 2d 9a 3f c7 a7 f8 9d 74-01 6a ef-.?....t.j.

                        	object: X509v3 Subject Key Identifier (2.5.29.14)
							critical: FALSE
							value:
								0000 - 04 14 7f 63 35 0c 56 f8-ed f1 2a df b5...c5.V...*..
								000d - 2e 7c f1 2c d9 a0 0e 63-b6.|.,...c.

                         	object: X509v3 Key Usage (2.5.29.15)
							critical: TRUE
							value:
								0000 - 03 02 07 80....
                                    
							object: X509v3 Extended Key Usage (2.5.29.37)
							critical: TRUE
							value:
								0000 - 30 0a 06 08 2b 06 01 05-05 07 03 030...+.......
					sig_alg:
						algorithm: ML-DSA-65 (2.16.840.1.101.3.4.3.18)
						parameter: <ABSENT>
					signature:(0 unused bits)
						...
		d.certificate:
			cert_info:
			version: 2
			serialNumber: 29577999257397559174219641462943780786
			signature:
				algorithm: ML-DSA-65 (2.16.840.1.101.3.4.3.18)
				parameter: <ABSENT>
				issuer:			CN=CodeSigningRoot-MLDSA65
				[...]
                
		d.certificate:
			cert_info:
			version: 2
			serialNumber: 0xB9419A2C5D2422B3A58A5B449546D74B
			signature:
				algorithm: ML-DSA-65 (2.16.840.1.101.3.4.3.18)
				parameter: <ABSENT>
				issuer:			CN=CodeSigningRoot-MLDSA65
				[...]
	crls:
		<ABSENT>
	signerInfos:
		version: 1
		d.issuerAndSerialNumber:
			issuer:				CN=CodeSigningSubordinate-MLDSA65
			serialNumber: 0xD0B2937F5BABC80AD55C0A90E1DE7057
		digestAlgorithm:
			algorithm: shake256 (2.16.840.1.101.3.4.2.12)
			parameter: <ABSENT>
		signedAttrs:
			object: contentType (1.2.840.113549.1.9.3)
			set:
				OBJECT:pkcs7-data (1.2.840.113549.1.7.1)

			object: signingTime (1.2.840.113549.1.9.5)
			set:
				UTCTIME:Oct 28 16:05:27 2025 GMT

			object: id-aa-CMSAlgorithmProtection (1.2.840.113549.1.9.52)
			set:
				SEQUENCE:
	0:d=0hl=2 l=26 cons: SEQUENCE
	2:d=1hl=2 l=11 cons:SEQUENCE
	4:d=2hl=2 l=9 prim:OBJECT:shake256
	15:d=1hl=2 l=11 cons:cont [ 1 ]
	17:d=2hl=2 l=9 prim:OBJECT:ML-DSA-65

        	object: messageDigest (1.2.840.113549.1.9.4)
			set:
				OCTET STRING:
					...
		signatureAlgorithm:
			algorithm: ML-DSA-65 (2.16.840.1.101.3.4.3.18)
			parameter: <ABSENT>
		signature:
			[...]

The CMS signature object directly encapsulates both the code-signing certificate and the subordinate CA certificate. It’s expected that the root certificate will reside in a customer-managed trust store. In addition to these certificates, the CMS object also contains the digest of the input data within the signedAttrs of the signerInfos in the ASN.1 structure. The digest algorithm is SHAKE256 and the OCTET STRING represents the binary digest itself. The use of ML-DSA in CMS is specified in RFC9882.

Note: Although this example uses one ML-DSA signature, some use cases might include dual signatures, a traditional and a quantum-resistant one. Such signed artifacts can be backwards compatible with legacy verifiers that support and can only verify the traditional signature. Upgraded verifiers can verify both signatures.

Verify signed code

Before loading a signed code artifact, its signature needs to be verified. That includes verifying the signature of the code and validating the certificate chain to the trusted root CA. The following code snippet from the main method within the file Runner.java is used for the certificate chain validation and the signature in the code object:

X509CertificateHolder rootCACertificate = CertificateUtils.fromPEM(rootCACertificatePEM); 
cmsCodeSigningObject.verifyDetachedSignature(<DATA_TO_SIGN>, rootCACertificate);

The preceding code retrieves the ML-DSA public key from the code-signing certificate; AWS access or credentials aren’t needed to validate the signature. Entities that have the root CA certificate loaded in their trust store can verify it without needing access to the AWS KMS verify API.

Note: The Runner.java implementation doesn’t use a certificate trust store that’s either part of a browser or part of a file system within the resident operating system of a device or a server. The trust store is placed in an instance of a Java class object for the purpose of this post. If you’re planning to use this code-signing example in a production system, you must change the implementation to use a trust store on the host. To do so, you can build and distribute a secure trust store that includes the root CA certificate.

Alternatively, OpenSSL 3.5 or later can be used to validate the detached signature of the provided input data file with root-ca-MLDSA65.pem, the provided root CA certificate from AWS Private CA.

openssl cms -verify -in signature-MLDSA65.p7s -content <input-data-file> \
            -CAfile root-ca-MLDSA65.pem -inform DER -purpose any \
            -binary -out /dev/null
CMS Verification successful

Note: Although this post focused on code-signing, AWS Private CA can enable post-quantum ML-DSA authentication for other private PKI use cases. In one scenario, applications outside of AWS can access AWS resources by temporarily using certificate-based authentication and swapping it with AWS credentials with AWS IAM Roles Anywhere. AWS IAM Roles Anywhere now supports ML-DSA PKIs like the one created in this post. In another scenario, an mTLS client or IKEv2/IPsec peer could use an ML-DSA certificate issued by AWS Private CA to be authenticated by a server or peer respectively who has been pre-provisioned with the post-quantum PKI root certificate.

Conclusion

This announcement marks an important milestone for post-quantum authentication. With the introduction of ML-DSA X.509 certificates in AWS Private CA, customers can bring quantum resistance to their private PKI use cases. These use cases include client authentication for mTLS or IKEv2/IPsec tunnels, IAM Roles Anywhere, or applications that use private PKI authentication. ML-DSA certificates with AWS Private CA and signing with AWS KMS also enable post-quantum code-singing and establishing post-quantum long-lived roots of trust for devices designed to operate for a long time even after CRQCs became available. Learn more about post-quantum cryptography in general and the overall AWS plan to migrate to post-quantum cryptography.


If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Security, Identity, & Compliance re:Post or contact AWS Support. For more details regarding AWS PQC efforts, refer to our PQC page.
 

Panos Kampanakis

Panos Kampanakis

Panos is a Principal Security Engineer at AWS. He has experience with cybersecurity, applied cryptography, security automation, and vulnerability management. He has coauthored publications on cybersecurity and participated in various security standards bodies to provide common interoperable protocols and languages for security information sharing, cryptography, and public-key infrastructure. Currently, he works with engineers and industry standards partners to provide cryptographically secure tools, protocols, and standards.

Jake Massimo

Jake Massimo

Jake is a Senior Applied Scientist on the AWS Cryptography team. His work interfaces Amazon with the global cryptographic community through international conferences, academic literature, and standards organizations and influences the adoption of post-quantum cloud-scale cryptographic technology. Recently, his focus has been architecting AWS post-quantum cryptographic capabilities, including core libraries and infrastructure so that AWS and customers can seamlessly transition to quantum-safe cryptography.

Author

Kyle Schultheiss

Kyle is a Senior Software Engineer on the AWS Cryptography team. He has been working on the AWS Private Certificate Authority service since its inception in 2018. In prior roles, he contributed to other AWS services such as Amazon Virtual Private Cloud, Amazon EC2, and Amazon Route 53.

Handle unpredictable processing times with operational consistency when integrating asynchronous AWS services with an AWS Step Functions state machine

Post Syndicated from Philip Whiteside original https://aws.amazon.com/blogs/compute/handle-unpredictable-processing-times-with-operational-consistency-when-integrating-asynchronous-aws-services-with-an-aws-step-functions-state-machine/

Integrating asynchronous AWS services with an AWS Step Functions state machine, presents a challenge when building serverless applications on Amazon Web Services (AWS). Services such as Amazon Translate, Amazon Macie, and Amazon Bedrock Data Automation (BDA) excel at handling long-running operations that can take more than 10 minutes to complete because of their asynchronous nature. Asynchronous services return an immediate 200 OK response, indicating that the request has succeeded, upon job submission (see the API response syntax of StartTextTranslationJob in Amazon Translate, CreateClassificationJob in Macie, and InvokeDataAutomationAsync in BDA), rather than waiting for the actual task completion and results.

In this post, we explore using AWS Step Function state machine with asynchronous AWS services, look at some scenarios where the processing time can be unpredictable, explain when traditional solutions such as polling (periodically check) fall short, and demonstrate how to implement a generalized callback pattern to handle asynchronous operations into a more manageable synchronous flow. We cover the related architecture, technical implementation, and best practices, and we provide a real-world examples that uses the AWS Cloud Development Kit (AWS CDK). Services used in this generalized callback pattern include Amazon DynamoDB, Amazon EventBridge and AWS Step Functions.

Understanding the issue this solution addresses

Asynchronous operations are designed to handle long-running operations without blocking resources, a design followed by many AWS services. However, these services create challenges in Step Functions workflows by returning immediate 200 OK responses rather than confirming task completion. This breaks the Step Functions execution model, which expects each step to be complete before advancing. Developers often attempt to address this issue through polling loops to repeatedly check the status of operations, an approach that works for containerized applications and Amazon Elastic Compute Cloud (Amazon EC2). For these services, compute resources are already provisioned, but compute resources become problematic in serverless architectures when AWS Lambda functions have a 15-minute execution limit, making them unsuitable for long-running polls.

Step Functions supports Run a Job (.sync) to call a service and have Step Functions wait for a job to complete, but this works only for selected optimized integrations. However, this functionality is limited to specific AWS services such as AWS Glue. Amazon Translate, Macie, and other services are not optimized integrations. If your operation is not listed as working with .sync, it can benefit from the generalized callback pattern covered in this post.

For these non-optimized integrations, an option is to use polling (periodically check). However, polling can lead to additional latency in response because polling times are unlikely to align with job completion. This is shown in the following figure.

Timeline diagram showing alternating 'Job' blocks and 'Delay' blocks, with 'Poll' markers indicated at regular intervals along the time axis. The diagram illustrates a sequential process of job execution and delay periods.

Figure 1: A job processing and delay timeline diagram

The Step Functions generalized callback pattern can solve this latency issue by pausing execution for up to one year while waiting for task completion (this does not incur additional cost). When such an asynchronous operation finishes, a callback mechanism resumes the workflow where it left off. This generalized callback pattern transforms asynchronous operations into synchronous ones, and it maintains cost efficiency and operational agility.

Scenarios

To help us see where this generalized callback pattern could be applied, let’s look at a few scenarios. Each of these scenarios makes use of AWS Step Functions state machines to run the applications’ workflows.

Scenario 1: Document translation with personally identifiable information compliance

Organizations must manage personally identifiable information (PII) when translating documents because PII can be duplicated across language outputs. For example, when translating a document containing “Jane Doe,” that name appears in both the original and translated versions, creating multiple instances of sensitive data that need compliance measures. Amazon Translate batch translation has a default concurrency of 10, meaning that translations could take more than 10 minutes or be queued for longer periods. Additionally, the Amazon Translate batch translation operation is asynchronous, holding the translation request in a queue until completed. The generalized callback pattern in this post makes sure that Step Functions state machine workflows resume appropriately to apply consistent PII handling across all outputs. In this scenario the design makes use of tagging Amazon Simple Storage Service (Amazon S3) files as containing PII or not, which in turn associates S3 lifecycle policies for specific retention periods to those S3 objects.

Workflow diagram showing five connected steps: 1) Start, 2) StartTextTranslationJob, 3) Wait for Translate result, 4) Tag all files, 5) ending with End state.

Figure 2: A text translation workflow diagram

Scenario 2: Using concurrent execution to pause the state machine until processes have completed

Continuing from scenario 1, Macie and Amazon Translate can run in parallel (each approximately 10 minutes) rather than sequentially (approximately 20 minutes) for a better user experience. Similarly to Amazon Translate batch translation operations being asynchronous, the Macie create classification operation is also asynchronous. Step Functions state machines enable concurrent execution of both service requests. The generalized callback pattern enables the state machine to pause each parallel workflow and resume only when the asynchronous services have completed their jobs. Without this pattern, both services would immediately return 200 OK responses, causing the workflow to continue prematurely before translations or classification results are available. If the classification results are not available later in the workflow, then the appropriate PII tags will not be applied and therefore the appropriate lifecycle retention policy will also not be applied, resulting in not adhering to PII handling practices.

Figure 3: A parallel classification and translation workflow diagram

Scenario 3: Intelligent document processing

Organizations that use Bedrock Data Automation for intelligent document processing must take into consideration Regional concurrency limits. BDA has Regional concurrency limits “Max number of concurrent jobs” of 25 jobs in the us-east-1 and us-west-2 Regions. Also, BDA has a concurrency limit of only five jobs in other supported Regions, so large document batches could be queued for extended periods resulting in long processing wait times for the user. This service functionality is handled asynchronously as the duration of the request could be many minutes. The generalized callback pattern makes sure that workflows resume appropriately as soon as a job finishes rather than waiting an arbitrary time to check if the job has been completed. For example, the generalized callback pattern for BDA can be used to enhance the solution outlined in the blog post, Scalable intelligent document processing using Amazon Bedrock Data Automation.

Figure 4: A data automation workflow diagram

Solution architecture

The following architecture diagram shows the generalized callback pattern (the blue section on the right side) integrated with your existing application (the grey section on the left side).

Figure 5: The Step Functions generalized callback architecture

Key components of this post’s solution architecture

This generalized callback pattern architecture consists of four essential components working together. Each component plays a specific role while maintaining cost efficiency and operational reliability. The following components form the foundation of this pattern:

  • Step Functions task: Implements the “Wait for Callback” task state generating unique task tokens for workflow resumption.
  • EventBridge rule: Monitors asynchronous service completion events and is customizable for different service patterns. AWS services make use of an event bus to route service event notifications to other services, such as job completions.
  • DynamoDB: Provides persistent storage correlating job IDs with task tokens for quick lookup.
  • Step Functions state machine: Manages the resume process and makes sure of proper cleanup of stored tokens.

Solution process

This generalized callback pattern operates through a coordinated sequence of four key steps. Each step builds upon the previous one. The following process demonstrates how the pattern manages workflow execution. The diagram above shows more detailed steps following these key steps.

  1. Start the asynchronous operation for which you want to wait for completion. The asynchronous service responds with success (200 OK) and the state machine continues. Initiating an Amazon Translate batch translation operation is one example of such an asynchronous operation.
  2. Trigger the generalized callback pattern with the “Wait for Callback” capability. Pair the task token with the jobId in DynamoDB using the unique jobId as the primary key. Example:
    {
        id    = translationJobId,
        token = stepFunctionTaskToken
    }
  3. Monitor for completion: When the asynchronous service completes the requested job, such as translation of documents, an event is created in EventBridge that contains the jobId and status. Example:
    {
        jobId  = translationJobId,
        status = complete
    }
    
  4. Resume workflow: The EventBridge rule triggers the workflow to resume, which looks up the task token using the jobId, resumes the paused Step Functions execution, and cleans up the database entry.

Not every service creates events for every action, so validate that your service operation generates the expected events. For example, Macie does not create events when no findings are discovered. In these cases, implement more event generation mechanisms through Amazon CloudWatch Logs subscriptions that trigger Lambda functions to create custom events.

Technical implementation of the solution

For rapid deployment of this post’s solution, AWS CDK users can use this sample CDK pattern with all key components. Alternatively, you can implement the individual components yourself by using the following steps, with each component customizable to your requirements.

Some of the JSON-based snippets below are Amazon States Language (ASL) snippets, which is the language that defines an AWS Step Functions state machine. State machines can be built in the AWS Console using the drag and drop visual builder, or with ASL. The visual builder generates this ASL and you can toggle to view/edit the workflow code (ASL).

Use a Step Functions task that supports “WaitForCallback” to store task token in DynamoDB

Use a Step Functions task that supports ”WaitForCallback” to store the task token in DynamoDB alongside the job ID from the asynchronous service.

AWS services generate a unique ID for that service which refers to that job/request/action. DynamoDB holds the mappings between job IDs and task tokens, supporting multiple state machines paused in parallel with concurrent execution. To prevent clashes when different asynchronous services generate overlapping IDs (for example, if Service A and Service B both generate ID “12345”), use separate DynamoDB tables for each service to maintain ID uniqueness. The sample AWS CDK pattern demonstrates this approach by providing dedicated DynamoDB tables and Step Functions state machines for each service integration. This ID-token structure allows for quick lookups for workflow resumption and cleanup.

The following ASL accomplishes this by using a DynamoDB PutItem task:

"DynamoDB PutItem": {
    "Type": "Task",
    "Resource": "arn:aws:states:::dynamodb:putItem",
    "Parameters": {
        "TableName": "resumeTokenSessionTable",
        "Item": {
            "id":    { "S.$": "$.JobId" },
            "token": { "S.$": "$$.Task.Token" },
            "ttl":   { "S.$": "$.ttl" }
        },
        "ConditionExpression": "attribute_not_exists(id)"
    },
    "Next": "XXXX"
}

In this example, the Item object stores three values: the job ID ($.JobId), the task token ($$.Task.Token), and a TTL value ($.ttl). The ttl field configures Time to Live for automatic cleanup based on your service’s expected completion time. Since this stores only three small string values, data usage per entry is minimal. The primary consideration is the number of concurrent operations, as each active asynchronous job requires one DynamoDB entry until completion or TTL expiration.

The DynamoDB table uses “id” as the primary key and includes a “token” attribute. These fields are essential for the “WaitForCallback” pattern: the “id” (job ID) allows your asynchronous service to look up the correct entry, while the “token” (Step Functions task token) is what your service sends back to Step Functions to resume the paused workflow. The following JSON shows an example of these values:

{
    "id":    { "S": "xxxxxxxx-yyyy-zzzz-aaaa-bbbbbbbbbbbb" },
    "token": { "S": "11111111-2222-3333-4444-555555555555" },
    "ttl":   { "S": "1480550400" }
}

When your asynchronous service completes its work, it retrieves the task token using the job ID, then calls Step Functions with that token to resume execution from where it paused.

The task token acts as a unique identifier for resuming execution at the exact pause point. To prevent overriding an existing record when a duplicate id is used, you can specify a “ConditionExpression”. This ASL shows just the ConditionExpression.

“ConditionExpression”: “attribute_not_exists(id)”

Create an EventBridge rule to monitor event patterns from your asynchronous service

EventBridge integration forms the heart of the event-driven resumption mechanism. You can create EventBridge rules to monitor specific event patterns from asynchronous AWS services. Most AWS services automatically publish completion events to default EventBridge at no cost, and you can use the EventBridge rule wizard to identify correct event patterns. For services that do not publish events—such as Macie that creates no events when no findings are discovered—implement shims by using Amazon CloudWatch Logs to trigger Lambda functions that generate custom events. This JSON shows the EventBridge Rule pattern definition.

"EventPattern": {
    "source": [
        "aws.translate"
    ],
    "detail-type": [
        "Translate TextTranslationJob State Change"
    ],
    "detail": {
        "jobStatus": [
            "COMPLETED"
        ],
    }
}

Resume the workflow

At this point, you know the operation has completed, so you can safely resume the workflow. Using the job ID, call the DynamoDB GetItem operation to receive the task token. This ASL shows the task definition to get the task token for a given job ID retrieved from the event notification.

"getResumeToken": {
    "Next": "sendTaskSuccess",
    "Type": "Task",
    "ResultPath": "$.getResumeToken",
    "Resource": "arn:aws:states:::dynamodb:getItem",
    "Parameters": {
        "Key": {
            "id": { "S.$": "$.id" }
        },
        "TableName": "resumeTokenSessionTable"
    }
}

Use the task token to resume the workflow and then delete the DynamoDB entry for cleanup. This ASL shows the task definition to use the task token to resume the state machine at the point where it was paused at.

"sendTaskSuccess": {
    "Next": "deleteResumeToken",
    "Type": "Task",
    "ResultPath": "$.sendTaskSuccess",
    "Resource": "arn:aws:states:::aws-sdk:sfn:sendTaskSuccess",
    "Parameters": {
        "TaskToken.$": "$.getResumeToken.Item.token.S",
        "Output": {
            "status": "resume"
        }
    }
}

This ASL shows the task definition to clean up the DynamoDB to remove the used task token.

"deleteResumeToken": {
    "End": true,
    "Type": "Task",
    "Resource": "arn:aws:states:::dynamodb:deleteItem",
    "Parameters": {
        "Key": {
            "id": { "S.$": "$.id" }
        },
        "TableName": "resumeTokenSessionTable"
    }
}

This completes the technical implementation of our solution. With all components in place—the WaitForCallback task, EventBridge rules, workflow resumption logic, and DynamoDB storage—you now have a fully functional generalized callback pattern implementation that eliminates polling and efficiently manages asynchronous operations.

Now that we’ve established how to implement the generalized callback pattern technically, let’s explore the best practices and important considerations that will help you optimize and secure your implementation.

Best practices and considerations

When implementing the generalized callback pattern in AWS Step Functions, it’s essential to understand and apply best practices that optimize costs, enhance security, and ensure efficient operation. This section outlines key considerations and recommendations for implementing the pattern effectively, focusing on cost optimization strategies and security measures that help maintain a robust and secure serverless workflow. By following these guidelines, you can maximise the benefits of the generalized callback pattern while minimising potential risks and unnecessary expenses.

Optimize costs by using this post’s generalized callback pattern

Managing costs for long-running asynchronous operations can present challenges. Traditional polling accumulates unnecessary expenses through repeated state transitions and execution time, but this post’s generalized callback pattern is an event-driven approach that significantly reduces operational costs.

Eliminate polling costs and minimize execution time

The generalized callback pattern reduces costs by eliminating polling transitions and pausing execution during wait periods. For standard workflows billed at $0.000025 per state transition, using just two transitions instead of continuous polling achieves approximately an 87% cost reduction. A 15-minute translation job polling every minute would need 15 transitions as opposed to two with the generalized callback pattern. For express workflows billed at $0.000001 per request and $0.00001667 per GB-second, the pattern delivers significant savings through reduced request count and minimal execution time. Traditional polling keeps workflows active during the entire operation, accumulating execution time charges. By contrast, the generalized callback pattern eliminates execution time charges during the wait period. In the translation job example mentioned previously in this paragraph, this could reduce the execution time from more than 15 minutes to just the seconds needed to start jobs and complete processes.

Increase resource efficiency

The callback pattern increases resource efficiency by removing constant polling, resulting in substantial reduction in CloudWatch logging and associated monitoring costs. This creates a more cost-effective solution with a reduced AWS resource footprint.

Further cost-optimize the callback pattern

Enhance cost efficiency through DynamoDB optimizations. Choose on-demand mode for unpredictable workloads or provisioned mode with auto scaling for consistent patterns, configure auto scaling settings based on usage, and implement TTL to automatically remove expired items without consuming write capacity.

Security considerations for the callback pattern

The callback pattern involves storing task tokens, processing events, and managing workflow resumption across multiple AWS services. Implementing proper access controls is essential to protect the integrity of your workflows and prevent unauthorized access or manipulation of the pattern’s components.

This section outlines the security considerations for the callback pattern, focusing on access controls for data storage and event processing.

Data storage security

Enable DynamoDB encryption at rest by using AWS owned or user managed AWS Key Management Service (AWS KMS) keys. Implement identity-based policies by defining the Step Functions AWS Identity and Access Management (IAM) role actions (such as PutItem, GetItem, and DeleteItem) and resource-based policies that specify which IAM principals can access the table. Together, these help ensure that only authorized state machines access token storage and operations are limited to minimum permissions. Also, configure TTL to automatically remove expired tokens so that these tokens do not accidentally get reused, which can result in errors with resuming the relevant AWS Step Function workflows.

Event processing security

Scope EventBridge rules precisely to match only specific necessary events. For Amazon Translate job completion, rules should explicitly match only translation job completion events, thus preventing unauthorized triggers. IAM roles should follow least-privilege principles so that only specific actions can cause workflows to resume.

Conclusion

The callback pattern presented in this post provides a solution for managing long-running asynchronous operations in serverless architectures. You can use the Step Functions “Wait for Callback” task state with EventBridge and DynamoDB to transform asynchronous services into synchronous workflows without the overhead of polling. This pattern reduces costs, improves efficiency through event-driven architecture, and maintains security through proper access controls. You can use the provided CDK implementation to implement this pattern and adapt it to your specific needs while following recommended security and cost optimization practices. 


About the authors

Maria John is a Senior Solutions Architect at Amazon Web Services, helping customers build solutions on AWS.

Philip Whiteside is a Senior Solutions Architect at Amazon Web Services. Philip is passionate about overcoming barriers by utilizing technology.

Analyzing Amazon EC2 Spot instance interruptions by using event-driven architecture

Post Syndicated from Shekhar Shrinivasan original https://aws.amazon.com/blogs/big-data/analyzing-amazon-ec2-spot-instance-interruptions-by-using-event-driven-architecture/

Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances offer significant cost savings of up to 90% compared to On-Demand pricing, making them attractive for cost-conscious workloads. However, when using Spot Instances within AWS Auto Scaling Groups (ASGs), their unpredictable interruptions create operational challenges. Without proper visibility into interruption patterns, teams struggle to optimize capacity planning, implement effective fallback mechanisms, and make informed decisions about workload placement across availability zones and instance types.

This challenge can be addressed through a custom event-driven monitoring and analytics dashboard that provides near real-time visibility into Spot Instance interruptions specifically for ASG-managed instances. For the remainder of this document, we’ll refer to this custom solution as “Spot Interruption Insights” for Auto Scaling Groups.

In this post, you’ll learn how to build this comprehensive monitoring solution step-by-step. You’ll gain practical experience designing an event-driven pipeline, implementing data processing workflows, and creating insightful dashboards that help you track interruption trends, optimize ASG configurations, and improve the resilience of your Spot Instance workloads.

Solution overview

The architecture uses an event-driven approach utilizing AWS native services for robust spot instance interruption monitoring.

The solution uses Amazon EventBridge to capture interruption events, Amazon Simple Queue Service (Amazon SQS) for reliable message queuing, AWS Lambda for data processing, and Amazon OpenSearch Service for storage and visualization of interruption patterns.

  1. EC2 Spot interruption notices are captured via an Amazon EventBridge rule.
  2. The notices are routed to an SQS queue for reliable message handling.
  3. A Lambda function processes the events, fetching EC2 instance metadata and AWS Auto Scaling Group (ASG) details by making optimized batch calls to the EC2 and Auto Scaling APIs. This design minimizes throttling risks on the control plane APIs, ensuring scalability. The Lambda function is configured with batching and concurrency limits to prevent overwhelming the API endpoints and the OpenSearch Service bulk indexing process.
  4. After processing, events are bulk-indexed into Amazon OpenSearch Service, enabling near real-time visibility and analytics.

A Dead Letter Queue (DLQ) ensures no data is lost in case of failures, while AWS Identity and Access Management (IAM) roles enforce least-privilege access between all components.

The OpenSearch Service domain is deployed within the private subnets of an Amazon VPC, ensuring it is not publicly accessible.

  1. Access to OpenSearch Dashboards is routed through an Application Load Balancer (ALB) configured with an HTTPS listener,
  2. ALB forwards traffic to an NGINX proxy running on EC2 instances in an Auto Scaling group. This setup provides secure and scalable access.
  3. Authentication and authorization are enforced using OpenSearch Service’s internal user database, ensuring that only authorized users can access the dashboards.

OpenSearch Dashboards visualize interruption metrics, delivering actionable insights to support effective capacity planning and workload placement.

Extensibility and alternative analytics tools

While this solution uses Amazon OpenSearch Service for storing and visualizing Spot Interruption data, the architecture is flexible and can be extended to support other analytics and observability platforms. You can modify the Lambda function to forward data to tools such as Amazon Quick Sight, Amazon Timestream, Amazon Redshift, or external services depending on your analytics and compliance needs. This enables teams to use their preferred tooling for building visualizations, setting alerts, or integrating with existing dashboards.

What you’ll build

By the end of this post, you’ll have a complete Spot Interruption monitoring system as seen in the following screenshot that automatically captures EC2 Spot Instance interruption events from your Auto Scaling Groups and presents them through interactive dashboards. Your solution will include real-time visualizations showing interruption patterns by availability zone, instance types, and time periods, along with ASG-specific metrics that help you identify optimization opportunities.

The sections of this post walk you through the step-by-step implementation of this solution, from deployment to setting up the event-driven architecture to configuring the analytics dashboards. Remember that you can deploy and customize this solution for your environment.

Prerequisites

You must have access to an AWS account with enough privileges to create and manage the AWS resources discussed in this blog post.You must also have the following software/components installed on your device:

Note: This application utilizes multiple AWS services, and there are associated costs beyond the Free Tier usage. Refer to the AWS Pricing page for specific details. You are accountable for any incurred AWS costs. This example solution does not imply any warranty.

Deployment instructions

Create a new directory, navigate to that directory in a terminal and clone the GitHub repository:

git clone https://github.com/aws-samples/sample-spot-interruption-insights

Change directory to the solution directory:

cd sample-spot-interruption-insights

Checklist for deployment

This section lists the setup and configurations that are required before you deploy the solution stack by using AWS SAM.

If you don’t have a VPC, Subnets, NAT Gateway already created and configured you can follow the steps mentioned in the Amazon VPC documentation to create the necessary resources.

  1. VPC Created – Ensure a VPC exists with DNS hostnames and DNS resolution enabled. You will need the VPC ID during deployment
  2. Public Subnets (2 or more) – Configure two or more public subnet IDs from different Availability Zones.
  3. Private Subnets (2 or more) – Configure two or more private subnet IDs from different Availability Zones.
  4. Outbound Internet Access for Private Subnets – Ensure NAT Gateway access as nginx proxy will be installed on EC2 instance in private subnet. Refer to Example: VPC with servers in private subnets and NAT for more information on setting up NAT for instances in private subnets.
  5. ALB Access – CIDR IP range allowed to access ALB (such as, `1.2.3.4/32`). This is for accessing the dashboard.
  6. Certificate ARN for ALB HTTPS Listener – To configure HTTPS listener. Certificate (can be self-signed) for HTTPS port of the load balancer. Refer to Prerequisites for importing ACM certificates for more information on importing self-signed certificate into AWS Certificate Manager (ACM)
  7. OpenSearch Service-Linked Role – Before deploying this template, ensure the AWS OpenSearch service-linked role exists in your account by running:
    aws iam create-service-linked-role --aws-service-name es.amazonaws.com

    Note:

    • This command only needs to be run once per AWS account.
    • If the role already exists, you’ll see an error message that can be safely ignored.
    • This role allows Amazon OpenSearch Service to manage network interfaces in your VPC.
    • Without this role, deployments that place OpenSearch Service domains in a VPC will fail with the error: “Before you can proceed, you must enable a service-linked role to give Amazon OpenSearch Service permissions to access your VPC.”
    • The service-linked role is named "AWSServiceRoleForAmazonOpenSearchService" and is managed by AWS.
  8. AMIId – Valid EC2 AMI ID for the region. Note:- This solution is designed to work exclusively with AMIs that use the DNF package manager. Use the latest Amazon Linux 2023 AMI for optimal compatibility and security.

    The following AMIs are confirmed compatible with this solution:

    • Amazon Linux 2023
    • Fedora (35 and newer)
    • RHEL 8 and newer
    • CentOS Stream 8 and newer
    • Oracle Linux 8 and newer

Build and deploy the solution – From the command line, use AWS SAM to build and deploy the AWS resources as specified in the template.yml file.

sam build
sam deploy --guided

During the prompts: Fill-out the following parameters:

  • Stack Name: {Enter your preferred stack name}
  • AWS Region: {Enter your preferred region code}
  • Parameter DomainName: {Enter the name for your new OpenSearch Service domain where the index will be created and data will be pushed for analytics. This will create a new OpenSearch domain with the name you specify – Preferably keep short domain name}
  • MasterUsername: {Admin username to login to the OpenSearch dashboard}
  • MasterUserPassword: { Must contain lowercase, uppercase, numbers, and special characters (!@#$%^&*). Minimum 12 characters recommended. Avoid common passwords (Password123!, Admin@2024 and more) as these may cause deployment failures due to security validation checks.}
  • IndexName: {OpenSearch Index name where Spot interrupted instance related data will be pushed}
  • EventRuleName: {Amazon EventBridge rule name to capture EC2 Spot interruption notices}
  • CustomEventRuleName: {Amazon EventBridge custom rule name to capture EC2 Spot interruption notices. This will be used for verifying the solution}
  • TargetQueueName: {EventBridge Rule target SQS name}
  • SQSDLQQueueName: {Target SQS Dead Letter Queue name}
  • LambdaDLQQueueName: {Lambda Dead Letter Queue name}
  • VPCId: {Enter the VPCId where the resources will be deployed}
  • PublicSubnetIds: {Enter 2 or more Public SubnetIDs separated by comma}
  • PrivateSubnetIds: {Enter 2 or more Private SubnetIDs separated by comma}
  • RestrictedIPCidr: {IP address/CIDR for restricting ALB access in CIDR format (such as 10.2.3.4/32)}
  • CertificateArn: {Certificate ARN for configuring ALB HTTPS Listener}
  • AMIId: {Valid EC2 AMI ID for the region}
  • Confirm changes before deploy: Y
  • Allow SAM CLI IAM role creation: Y
  • Disable rollback: N
  • Save arguments to configuration file: Y
  • SAM configuration file: {Press enter to use default name}
  • SAM configuration environment: {Press enter to use default name}

Note: The complete solution may take approximately 15-20 minutes to deploy. After the deployment is complete, there are a few manual steps that need to be performed to ensure the solution functions as expected.

Post deployment instructions

The following steps need to be performed in OpenSearch Dashboards after logging in. Get the DNS Name of the Application Load Balancer endpoint from the deployment output section of the CloudFormation stack or the ALB console. Access the OpenSearch dashboards using the ALB DNS name as follows –

https://[ALB-DNS-NAME]/_dashboards

You will be redirected to the OpenSearch Dashboards login page. Log in using the MasterUsername and MasterUserPassword you specified during deployment.

If this is the first time you are logging in then you may see a Welcome screen.

  1. Choose ‘Explore on my own’ on the Welcome screen.
  2. Choose ‘Dismiss’ on the next screen.
  3. If the ‘Select your tenant’ dialog appears with ‘Global’ preselected, Choose ‘Confirm’. Otherwise, select ‘Global’ first and then and choose ‘Confirm’.

Create index and attribute mapping

This section lists the required steps to create the index and attribute mapping.

  1. On the Home screen select the Hamburger Menu icon () on the top left
  2. Select ‘Dev Tools’ at the bottom of the menu.
  3. On the dev tools console, paste the following PUT command and execute the request by choosing ‘Click to send request’.

    Note The index name should match what you entered during the deployment. Change the index name accordingly before creating the index.

    PUT /<YOUR-INDEX-NAME-SPECIFIED-DURING-DEPLOYMENT>
            {
                "mappings": {
                    "properties": {
                    "instance_id": {
                        "type": "keyword"
                    },
                    "instance_name": {
                        "type": "keyword"
                    },
                    "instance_type": {
                        "type": "keyword"
                    },
                    "asg_name": {
                        "type": "keyword"
                    },
                    "timestamp": {
                        "type": "date"
                    },
                    "region": {
                        "type": "keyword"
                    },
                    "availability_zone": {
                        "type": "keyword"
                    },
                    "private_ip": {
                        "type": "ip"
                    },
                    "public_ip": {
                        "type": "ip"
                    }
                    }
                }
            }

    The following is a screenshot of this command in Dev Tools.

  4. Confirm that the index was created successfully.

Create index pattern

This section lists the required steps to create the index pattern

  1. Access the Hamburger Menu icon on the top left.
  2. Select ‘Dashboard Management’ from the bottom of the menu.
  3. Choose ‘Index Patterns’
  4. Choose “Create Index Pattern”
  5. Enter the Index pattern name and choose “Next step”.
    The index pattern name should be the index name you entered during the deployment followed by an asterisk. See the following screenshot for reference.

  6. Select ‘timestamp’ in primary Time field and choose ‘Create index pattern’
  7. Choose the star icon to make the index pattern default

Configure Lambda with required access for new index

In this section you will create a role in OpenSearch Service dashboards and will map Lambda execution role to the same to perform operations on the new index.

  1. Navigate to the Lambda console
  2. Search for the function beginning with your OpenSearch Service domain name.
  3. In the function details, go to Configuration > Permissions
  4. Choose the Role Name in the Execution Role section.
  5. Copy the Lambda execution role ARN from this function which handles Spot interruption events.
  6. Access the Hamburger Menu icon on the top left and select ‘Security’ from the bottom of the menu.
  7. Now select the ‘Roles’ menu option under ‘Security’ menu and then select ‘Create Role’
    • Enter a role name and set Cluster Permissions to “cluster_composite_ops_ro“.
    • For Index Permissions, select the index pattern name created during deployment.

    See the following screenshot for reference.

  8. Set the Tenant Permissions to “global_tenant” as seen in the image and Choose “Create”.

  9. After the role is created, on the same screen, select the ‘Mapped Users’ tab and choose ‘Manage Mapping’

  10. Choose ‘Manage Mapping’
  11. In the ‘Backend roles’ add the Lambda execution role ARN copied earlier and Choose ‘Map’

You can create more users in the internal database and grant appropriate access to the visualisations and dashboards. The following steps show how to create a read only role and to create an internal user and grant read only access.

Manage users and roles

In this section you will create a new user and a role with read-only access, then assign the role to the user to grant them read-only access to the Spot Interruption dashboard and visualizations.

  1. Access the Hamburger Menu icon on the top left
  2. Select ‘Security’ from the bottom of the menu
  3. Select ‘Internal Users’ and then select ‘Create Internal user’
  4. Enter username and set a Password, then choose “Create”.

  5. Now select the ‘Roles’ menu option under ‘Security’ menu and then select ‘Create Role’
    • Enter the role name and set Cluster Permissions to “cluster_composite_ops_ro“.
    • For Index Permissions, select the index pattern name created during deployment.

    See the following screenshot for reference.

  6. Set the Tenant Permissions to “global_tenant” as seen in the image and Choose “Create”.

  7. After the role is created, on the same screen, select the ‘Mapped Users’ tab and choose ‘Manage Mapping’

  8. Select the user created above in ‘Users’ and choose ‘Map’

Configure and deploy sample visualisations and dashboard

Sample visualizations and a starter dashboard are provided under the data folder of the git repo you cloned earlier. Look for the file named spot-interruption-dashboard-visualisations.ndjson.To import the visualizations:

  1. Navigate to Saved Objects under Dashboard Management in OpenSearch Dashboards.
  2. Import the spot-interruption-dashboard-visualisations.ndjson file.
  3. During the import, you may encounter index pattern conflicts. Select the index pattern you created from the dropdown and choose “Confirm all changes”.

Once imported, the sample visualizations and dashboard linked to your index pattern will be available under Dashboards in the left-side hamburger menu. You can view the Spot Interruption Dashboard, which includes visualizations based on Availability Zones, Regions, Instance Types, Auto Scaling Groups (ASGs), and Interruptions over time. You can further customize by creating your own visualizations using the attributes available in the index or by editing/creating new dashboards. The dashboard will display empty views until Spot interruption data is available to visualize.

Test the solution

A temporary event rule was created during deployment to simulate matching Amazon EC2 Spot interruption notices. The rule name is the name you specified during deployment for the CustomEventRuleName parameter.

To verify the solution, you can send sample events from the EventBridge console as depicted below. In the AWS console,

  • Open the Amazon EventBridge console
  • In the left menu under ‘Buses’ section choose ‘Event buses’
  • Choose the ‘default’ event bus
  • Choose the ‘Send events’ button
  • In the Send events page enter the following details:
    • Event bus: default
    • Event source: custom.spot.interruption.simulator
    • Detail type: EC2 Spot Instance Interruption Warning
    • Event detail: {"instance-id": "<instance-id>", "Instance-action": "terminate"}

    Replace the instance-id with an actual instance id that is associated with an Amazon EC2 Auto Scaling group. Refer to the following screenshot.

After the event is sent successfully, you can log in to OpenSearch Dashboards and view the Spot Interruption Dashboard, which has been prebuilt with the indexed event data. This dashboard provides insights across key dimensions such as Availability Zones, Regions, instance types, Auto Scaling groups, and interruption trends over time. Use the dashboard as a starting point to understand the kinds of insights possible and customize or create new visualizations based on your needs and the fields available in the index.

Alternatively, you can navigate to the Discover section in the menu to view the raw event details. Ensure that you select the index pattern you created earlier in this demonstration, and adjust the time range if necessary (such as the last 15 minutes) to view the latest data.

Security and cost optimizations

This solution is designed to be secure and cost-efficient by default, but there are some more optimizations you can apply to further reduce cost and enhance security:

Security best practices

  1. Amazon Cognito Authentication : Integrate Amazon Cognito with OpenSearch Dashboards to manage user authentication, enable Multi Factor Authentication, and avoid hardcoding admin credentials. More information Configuring Amazon Cognito authentication for OpenSearch Dashboards
  2. Lambda Layer Versioning: Ensure pinned versions of Lambda Layers are used to avoid unexpected changes. More information Managing Lambda dependencies with layers
  3. Logging and Threat Detection: Enable AWS CloudTrail and Amazon GuardDuty to monitor for unauthorized activity or anomalies. More information Monitoring Amazon OpenSearch Service API calls with AWS CloudTrail

Cost optimizations

  1. Bulk Indexing with Throttling Controls: Lambda processes batches and respects throttling limits to avoid excessive OpenSearch usage.
  2. Short Retention for CloudWatch Logs: Tune log retention periods to avoid unnecessary storage costs.
  3. Optimize Visualizations: Design saved visualizations to avoid expensive queries (like wide time ranges and large aggregations). More information Optimizing query performance for Amazon OpenSearch Service data sources
  4. Index State Management (ISM) : Configure ISM policies in OpenSearch to delete or archive older interruption data. More information Index State Management in Amazon OpenSearch Service

Cleanup

Run the following command to delete the resources deployed earlier.

sam delete

After deleting the stack, make sure to also remove any post-deployment configurations you may have created within the OpenSearch Service dashboards console. While these configurations won’t incur additional costs, it’s considered a best practice to clean up your environment by deleting any resources that are no longer needed. Take some time to review the OpenSearch Service dashboards and identify any custom settings, dashboards, or visualizations you set up during the deployment process. Then, delete these individual configurations to ensure your environment is fully cleaned up.

Conclusion

In this post, you learned how to build and deploy a comprehensive Spot Instance interruption monitoring solution for Auto Scaling groups by using EventBridge, Amazon SQS, Lambda, and OpenSearch Service. You implemented an event-driven pipeline to capture and process Amazon EC2 Spot Instance interruption events, created secure analytics dashboards, and established real-time visibility into interruption patterns across your Auto Scaling group–managed workloads.

This post’s solution empowers your teams with the visibility and agility needed to operate confidently with Amazon EC2 Spot Instances. By combining event-driven architecture with secure, scalable analytics, you can now proactively monitor interruption events, identify interruption trends, and optimize workload strategies for resilience and cost-efficiency.

With real-time data at your fingertips, you’re equipped to make smarter infrastructure decisions and maximize the benefits of Spot Instance capacity while minimizing disruption risks.


About the author

Shekhar Shrinivasan

Shekhar Shrinivasan

Shekhar is a Senior Technical consultant who specializes in cloud architecture design, migration strategies, and AWS workload optimization. He helps enterprise customers accelerate their digital transformation through best practices implementation, scalable infrastructure solutions, and strategic technical guidance to maximize their cloud return on investment.