Tag Archives: Technical How-to

Run Apache Spark and Iceberg 4.5x faster than open source Spark with Amazon EMR

Post Syndicated from Atul Payapilly original https://aws.amazon.com/blogs/big-data/run-apache-spark-and-iceberg-4-5x-faster-than-open-source-spark-with-amazon-emr/

This post shows how Amazon EMR 7.12 can make your Apache Spark and Iceberg workloads up to 4.5x faster performance.

The Amazon EMR runtime for Apache Spark provides a high-performance runtime environment with full API compatibility with open source Apache Spark and Apache Iceberg. Amazon EMR on EC2, Amazon EMR Serverless, Amazon EMR on Amazon EKS, Amazon EMR on AWS Outposts and AWS Glue use the optimized runtimes.

Our benchmarks show Amazon EMR 7.12 runs TPC-DS 3 TB workloads 4.5x faster than open source Spark 3.5.6 with Iceberg 1.10.0.

Performance improvements include optimizations for metadata caching, parallel I/O, adaptive query planning, data type handling, and fault tolerance. There were also some Iceberg specific regressions around data scans that we identified and fixed.

These optimizations let you match Parquet performance on Amazon EMR while keeping the key features of Iceberg key features: ACID transactions, time travel, and schema evolution.

Benchmark results compared to open source

To assess the performance of the Spark engine with the Iceberg table format, we performed benchmark tests using the 3 TB TPC-DS dataset, version 2.13, a popular industry standard benchmark. Benchmark tests for the Amazon EMR runtime for Apache Spark and Apache Iceberg were conducted on Amazon EMR 7.12 EC2 clusters compared to open source Apache Spark 3.5.6 and Apache Iceberg 1.10.0 on EC2 clusters.

Note: Our results derived from the TPC-DS dataset are not directly comparable to the official TPC-DS results due to setup differences.

The setup instructions and technical details are available in our GitHub repository. To minimize the influence of external catalogs like AWS Glue and Hive, we used the Hadoop catalog for the Iceberg tables. This uses the underlying file system, specifically Amazon S3, as the catalog. We can define this setup by configuring the property spark.sql.catalog.<catalog_name>.type. The fact tables used the default partitioning by the date column, which vary from 200–2,100 partitions. No precalculated statistics were used for these tables.

We ran a total of 104 SparkSQL queries in 3 sequential rounds, and the average runtime of each query across these rounds was taken for comparison. The average runtime for the 3 rounds on Amazon EMR 7.12 with Iceberg enabled was 0.37 hours, demonstrating a 4.5x speed increase compared to open source Spark 3.5.6 and Iceberg 1.10.0. The following figure presents the total runtimes in seconds.

The following table summarizes the metrics.

Metric Amazon EMR 7.12 on EC2 Amazon EMR 7.5 on EC2 Open source Apache Spark 3.5.6 and Apache Iceberg 1.10.0
Average runtime in seconds 1349.62 1535.62 6113.92
Geometric mean over queries in seconds 7.45910 8.30046 22.31854
Cost* $4.81 $5.47 $17.65

*Detailed cost estimates are discussed later in this post.

The following chart demonstrates the per-query performance improvement of Amazon EMR 7.12 relative to open source Spark 3.5.6 and Iceberg 1.10.0. The extent of the speedup varies from one query to another, with the fastest up to 13.6x faster for q23b, with Amazon EMR outperforming open source Spark with Iceberg tables. The horizontal axis arranges the TPC-DS 3TB benchmark queries in descending order based on the performance improvement seen with Amazon EMR, and the vertical axis depicts the magnitude of this speedup as a ratio.

Cost comparison breakdown

Our benchmark provides the total runtime and geometric mean data to assess the performance of Spark and Iceberg in a complex, real-world decision support scenario. For additional insights, we also examine the cost aspect. We calculate cost estimates using formulas that account for EC2 On-Demand instances, Amazon Elastic Block Store (Amazon EBS), and Amazon EMR expenses.

  • Amazon EC2 cost (includes SSD cost) = number of instances * r5d.4xlarge hourly rate * job runtime in hours
    • 4xlarge hourly rate = $1.152 per hour
  • Root Amazon EBS cost = number of instances * Amazon EBS per GB-hourly rate * root EBS volume size * job runtime in hours
  • Amazon EMR cost = number of instances * r5d.4xlarge Amazon EMR cost * job runtime in hours
    • 4xlarge Amazon EMR cost = $0.27 per hour
  • Total cost = Amazon EC2 cost + root Amazon EBS cost + Amazon EMR cost

The calculations reveal that the Amazon EMR 7.12 benchmark yields a 3.6x cost efficiency improvement over open source Spark 3.5.6 and Iceberg 1.10.0 in running the benchmark job.

Metric Amazon EMR 7.12 Amazon EMR 7.5 Open source Apache Spark 3.5.6 and Apache Iceberg 1.10.0
Runtime in seconds 1349.62 1535.62 6113.92

Number of EC2 instances

(Includes primary node)

9 9 9
Amazon EBS Size 20gb 20gb 20gb

Amazon EC2

(Total runtime cost)

$3.89 $4.42 $17.61
Amazon EBS cost $0.01 $0.01 $0.04
Amazon EMR cost $0.91 $1.04 $0
Total cost $4.81 $5.47 $17.65
Cost savings Amazon EMR 7.12 is 3.6x better Amazon EMR 7.5 is 3.2x better Baseline

In addition to the time-based metrics discussed so far, data from Spark event logs show that Amazon EMR scanned approximately 4.3x less data from Amazon S3 and 5.3x fewer records than the open source version in the TPC-DS 3 TB benchmark. This reduction in Amazon S3 data scanning contributes directly to cost savings for Amazon EMR workloads.

Run open source Apache Spark benchmarks on Apache Iceberg tables

We used separate EC2 clusters, each equipped with 9 r5d.4xlarge instances, for testing both open source Spark 3.5.6 and Amazon EMR 7.12 for Iceberg workload. The primary node was equipped with 16 vCPU and 128 GB of memory, and the 8 worker nodes together had 128 vCPU and 1024 GB of memory. We conducted tests using the Amazon EMR default settings to showcase the typical user experience and minimally adjusted the settings of Spark and Iceberg to maintain a balanced comparison.

The following table summarizes the Amazon EC2 configurations for the primary node and 8 worker nodes of type r5d.4xlarge.

EC2 Instance vCPU Memory (GiB) Instance storage (GB) EBS root volume (GB)
r5d.4xlarge 16 128 2 x 300 NVMe SSD 20 GB

Prerequisites

The following prerequisites are required to run the benchmarking:

  1. Using the instructions in the emr-spark-benchmark GitHub repository, set up the TPC-DS source data in your S3 bucket and on your local computer.
  2. Build the benchmark application following the steps provided in Steps to build spark-benchmark-assembly application and copy the benchmark application to your S3 bucket. Alternatively, copy spark-benchmark-assembly-3.5.6.jar to your S3 bucket.
  3. Create Iceberg tables from the TPC-DS source data. Follow the instructions on GitHub to create Iceberg tables using the Hadoop catalog. For example, the following code uses an Amazon EMR 7.12 cluster with Iceberg enabled to create the tables:
aws emr add-steps --cluster-id <cluster-id> --steps Type=Spark,Name="Create Iceberg Tables",
Args=[--class,com.amazonaws.eks.tpcds.CreateIcebergTables,--conf,spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,
--conf,spark.sql.catalog.hadoop_catalog=org.apache.iceberg.spark.SparkCatalog,
--conf,spark.sql.catalog.hadoop_catalog.type=hadoop,
--conf,spark.sql.catalog.hadoop_catalog.warehouse=s3://<bucket>/<warehouse_path>/,
--conf,spark.sql.catalog.hadoop_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO,
s3://<bucket>/<jar_location>/spark-benchmark-assembly-3.5.6.jar,s3://blogpost-sparkoneks-us-east-1/blog/BLOG_TPCDS-TEST-3T-partitioned/,
/home/hadoop/tpcds-kit/tools,parquet,3000,true,<database_name>,true,true],ActionOnFailure=CONTINUE --region <AWS region>

Note: The Hadoop catalog warehouse location and database name from the preceding step. We use the same Iceberg tables to run benchmarks with Amazon EMR 7.12 and open source Spark.

This benchmark application is built from the branch tpcds-v2.13_iceberg. If you’re building a new benchmark application, switch to the correct branch after downloading the source code from the GitHub repository.

Create and configure a YARN cluster on Amazon EC2

To compare Iceberg performance between Amazon EMR on Amazon EC2 and open source Spark on Amazon EC2, follow the instructions in the emr-spark-benchmark GitHub repository to create an open source Spark cluster on Amazon EC2 using Flintrock with 8 worker nodes.

Based on the cluster selection for this test, the following configurations are used:

Make sure to replace the placeholder <private ip of primary node>, in the yarn-site.xml file, with the primary node’s IP address of your Flintrock cluster.

Run the TPC-DS benchmark with Apache Spark 3.5.6 and Apache Iceberg 1.10.0

Complete the following steps to run the TPC-DS benchmark:

  1. Log in to the open source cluster primary node using flintrock login $CLUSTER_NAME.
  2. Submit your Spark job:
    1. Choose the correct Iceberg catalog warehouse location and database that has the created Iceberg tables.
    2. The results are created in s3://<YOUR_S3_BUCKET>/benchmark_run.
    3. You can track progress in /media/ephemeral0/spark_run.log.
spark-submit \
--master yarn \
--deploy-mode client \
--class com.amazonaws.eks.tpcds.BenchmarkSQL \
--conf spark.driver.cores=4 \
--conf spark.driver.memory=10g \
--conf spark.executor.cores=16 \
--conf spark.executor.memory=100g \
--conf spark.executor.instances=8 \
--conf spark.network.timeout=2000 \
--conf spark.executor.heartbeatInterval=300s \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.shuffle.service.enabled=false \
--conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.InstanceProfileCredentialsProvider \
--conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.jars.packages=org.apache.hadoop:hadoop-aws:3.3.4,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.0,org.apache.iceberg:iceberg-aws-bundle:1.10.0 \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions   \
--conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog    \
--conf spark.sql.catalog.local.type=hadoop  \
--conf spark.sql.catalog.local.warehouse=s3a://<YOUR_S3_BUCKET>/<warehouse_path>/ \
--conf spark.sql.defaultCatalog=local   \
--conf spark.sql.catalog.local.io-impl=org.apache.iceberg.aws.s3.S3FileIO   \
spark-benchmark-assembly-3.5.6.jar   \
s3://<YOUR_S3_BUCKET>/benchmark_run 3000 1 false  \
q1-v2.13,q10-v2.13,q11-v2.13,q12-v2.13,q13-v2.13,q14a-v2.13,q14b-v2.13,q15-v2.13,q16-v2.13,\
q17-v2.13,q18-v2.13,q19-v2.13,q2-v2.13,q20-v2.13,q21-v2.13,q22-v2.13,q23a-v2.13,q23b-v2.13,\
q24a-v2.13,q24b-v2.13,q25-v2.13,q26-v2.13,q27-v2.13,q28-v2.13,q29-v2.13,q3-v2.13,q30-v2.13,\
q31-v2.13,q32-v2.13,q33-v2.13,q34-v2.13,q35-v2.13,q36-v2.13,q37-v2.13,q38-v2.13,q39a-v2.13,\
q39b-v2.13,q4-v2.13,q40-v2.13,q41-v2.13,q42-v2.13,q43-v2.13,q44-v2.13,q45-v2.13,q46-v2.13,\
q47-v2.13,q48-v2.13,q49-v2.13,q5-v2.13,q50-v2.13,q51-v2.13,q52-v2.13,q53-v2.13,q54-v2.13,\
q55-v2.13,q56-v2.13,q57-v2.13,q58-v2.13,q59-v2.13,q6-v2.13,q60-v2.13,q61-v2.13,q62-v2.13,\
q63-v2.13,q64-v2.13,q65-v2.13,q66-v2.13,q67-v2.13,q68-v2.13,q69-v2.13,q7-v2.13,q70-v2.13,\
q71-v2.13,q72-v2.13,q73-v2.13,q74-v2.13,q75-v2.13,q76-v2.13,q77-v2.13,q78-v2.13,q79-v2.13,\
q8-v2.13,q80-v2.13,q81-v2.13,q82-v2.13,q83-v2.13,q84-v2.13,q85-v2.13,q86-v2.13,q87-v2.13,\
q88-v2.13,q89-v2.13,q9-v2.13,q90-v2.13,q91-v2.13,q92-v2.13,q93-v2.13,q94-v2.13,q95-v2.13,\
q96-v2.13,q97-v2.13,q98-v2.13,q99-v2.13,ss_max-v2.13    \
true <database> > /media/ephemeral0/spark_run.log 2>&1 &!

Summarize the results

After the Spark job finishes, retrieve the test result file from the output S3 bucket at s3://<YOUR_S3_BUCKET>/benchmark_run/timestamp=xxxx/summary.csv/xxx.csv. This can be done either through the Amazon S3 console by navigating to the specified bucket location or by using the Amazon Command Line Interface (AWS CLI). The Spark benchmark application organizes the data by creating a timestamp folder and placing a summary file within a folder labeled summary.csv. The output CSV files contain 4 columns without headers:

  • Query name
  • Median time
  • Minimum time
  • Maximum time

With the data from 3 separate test runs with 1 iteration each time, we can calculate the average and geometric mean of the benchmark runtimes.

Run the TPC-DS benchmark with Amazon EMR runtime for Apache Spark

Most of the instructions are similar to Steps to run Spark Benchmarking with a few Iceberg-specific details.

Prerequisites

Complete the following prerequisite steps:

  1. Run aws configure to configure the AWS CLI shell to point to the benchmarking AWS account. Refer to Configure the AWS CLI for instructions.
  2. Upload the benchmark application JAR file to Amazon S3.

Deploy Amazon EMR cluster and run the benchmark job

Complete the following steps to run the benchmark job:

  1. Use the AWS CLI command as shown in Deploy EMR on EC2 Cluster and run benchmark job to deploy an Amazon EMR on EC2 cluster. Make sure to enable Iceberg. See Create an Iceberg cluster for more details. Choose the correct Amazon EMR version, root volume size, and same resource configuration as the open source Flintrock setup. Refer to create-cluster for a detailed description of the AWS CLI options.
  2. Store the cluster ID from the response. We need this for the next step.
  3. Submit the benchmark job in Amazon EMR using add-steps from the AWS CLI:
    1. Replace <cluster ID> with the cluster ID from Step 2.
    2. The benchmark application is at s3://<your-bucket>/spark-benchmark-assembly-3.5.6.jar.
    3. Choose the correct Iceberg catalog warehouse location and database that has the created Iceberg tables. This should be the same as the one used for the open source TPC-DS benchmark run.
    4. The results will be in s3://<your-bucket>/benchmark_run.
aws emr add-steps   --cluster-id <cluster-id>
--steps Type=Spark,Name="SPARK Iceberg EMR TPCDS Benchmark Job",
Args=[--class,com.amazonaws.eks.tpcds.BenchmarkSQL,
--conf,spark.driver.cores=4,
--conf,spark.driver.memory=10g,
--conf,spark.executor.cores=16,
--conf,spark.executor.memory=100g,
--conf,spark.executor.instances=8,
--conf,spark.network.timeout=2000,
--conf,spark.executor.heartbeatInterval=300s,
--conf,spark.dynamicAllocation.enabled=false,
--conf,spark.shuffle.service.enabled=false,
--conf,spark.sql.iceberg.data-prefetch.enabled=true,
--conf,spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,
--conf,spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog,
--conf,spark.sql.catalog.local.type=hadoop,
--conf,spark.sql.catalog.local.warehouse=s3://<your-bucket>/<warehouse-path>,
--conf,spark.sql.defaultCatalog=local,
--conf,spark.sql.catalog.local.io-impl=org.apache.iceberg.aws.s3.S3FileIO,
s3://<your-bucket>/spark-benchmark-assembly-3.5.6.jar,
s3://<your-bucket>/benchmark_run,3000,1,false,
'q1-v2.13\,q10-v2.13\,q11-v2.13\,q12-v2.13\,q13-v2.13\,q14a-v2.13\,q14b-v2.13\,q15-v2.13\,q16-v2.13\,q17-v2.13\,q18-v2.13\,q19-v2.13\,q2-v2.13\,q20-v2.13\,q21-v2.13\,q22-v2.13\,q23a-v2.13\,q23b-v2.13\,q24a-v2.13\,q24b-v2.13\,q25-v2.13\,q26-v2.13\,q27-v2.13\,q28-v2.13\,q29-v2.13\,q3-v2.13\,q30-v2.13\,q31-v2.13\,q32-v2.13\,q33-v2.13\,q34-v2.13\,q35-v2.13\,q36-v2.13\,q37-v2.13\,q38-v2.13\,q39a-v2.13\,q39b-v2.13\,q4-v2.13\,q40-v2.13\,q41-v2.13\,q42-v2.13\,q43-v2.13\,q44-v2.13\,q45-v2.13\,q46-v2.13\,q47-v2.13\,q48-v2.13\,q49-v2.13\,q5-v2.13\,q50-v2.13\,q51-v2.13\,q52-v2.13\,q53-v2.13\,q54-v2.13\,q55-v2.13\,q56-v2.13\,q57-v2.13\,q58-v2.13\,q59-v2.13\,q6-v2.13\,q60-v2.13\,q61-v2.13\,q62-v2.13\,q63-v2.13\,q64-v2.13\,q65-v2.13\,q66-v2.13\,q67-v2.13\,q68-v2.13\,q69-v2.13\,q7-v2.13\,q70-v2.13\,q71-v2.13\,q72-v2.13\,q73-v2.13\,q74-v2.13\,q75-v2.13\,q76-v2.13\,q77-v2.13\,q78-v2.13\,q79-v2.13\,q8-v2.13\,q80-v2.13\,q81-v2.13\,q82-v2.13\,q83-v2.13\,q84-v2.13\,q85-v2.13\,q86-v2.13\,q87-v2.13\,q88-v2.13\,q89-v2.13\,q9-v2.13\,q90-v2.13\,q91-v2.13\,q92-v2.13\,q93-v2.13\,q94-v2.13\,q95-v2.13\,q96-v2.13\,q97-v2.13\,q98-v2.13\,q99-v2.13\,ss_max-v2.13',
true,<database>],ActionOnFailure=CONTINUE --region <aws-region>

Summarize the results

After the step is complete, you can see the summarized benchmark result at s3://<YOUR_S3_BUCKET>/benchmark_run/timestamp=xxxx/summary.csv/xxx.csv in the same way as the previous run and compute the average and geometric mean of the query runtimes.

Clean up

To help prevent future charges, delete the resources you created by following the instructions provided in the Cleanup section of the GitHub repository.

Summary

Amazon EMR optimizes the runtime for Spark when used with Iceberg tables, achieving 4.5x faster performance than open source Apache Spark 3.5.6 and Apache Iceberg 1.10.0 with Amazon EMR 7.12 on TPC-DS 3 TB, v2.13. This represents a significant advancement from Amazon EMR 7.5, which delivered 3.6x faster performance and closes the gap to parquet performance on Amazon EMR so customers can use the benefits of Iceberg without a performance penalty.

We encourage you to keep up to date with the latest Amazon EMR releases to fully benefit from ongoing performance improvements.

To stay informed, subscribe to the RSS feed for the AWS Big Data Blog, where you can find updates on the Amazon EMR runtime for Spark and Iceberg, as well as tips on configuration best practices and tuning recommendations.


About the authors

Atul Felix Payapilly is a software development engineer for Amazon EMR at Amazon Web Services.

Akshaya KP is a software development engineer for Amazon EMR at Amazon Web Services.

Hari Kishore Chaparala is a software development engineer for Amazon EMR at Amazon Web Services.

Giovanni Matteo is the Senior Manager for the Amazon EMR Spark and Iceberg group.

Accelerate data lake operations with Apache Iceberg V3 deletion vectors and row lineage

Post Syndicated from Ron Ortloff original https://aws.amazon.com/blogs/big-data/accelerate-data-lake-operations-with-apache-iceberg-v3-deletion-vectors-and-row-lineage/

Organizations building petabyte-scale data lakes face increasing challenges as their data grows. Batch updates and compliance deletes create a proliferation of positional delete files, slowing downstream data pipelines and driving up storage costs. Tracking data changes for audit trails and incremental processing requires custom, engine-specific implementations that add complexity and maintenance burden. As data volumes scale, these challenges compound, leaving data teams juggling custom solutions and increasing operational costs just to maintain data freshness and compliance.

Apache Iceberg V3 addresses these challenges with two new capabilities: deletion vectors and row lineage. AWS now delivers these capabilities across Apache Spark on Amazon EMR 7.12, AWS Glue, Amazon SageMaker notebooks, Amazon S3 Tables, and the AWS Glue Data Catalog, giving you a complete, integrated V3 experience without custom implementation. This means faster writes, lower storage costs, comprehensive audit trails, and efficient incremental processing, all working seamlessly across your entire data lake architecture.

In this post, we walk you through the new capabilities in Iceberg V3, explain how deletion vectors and row lineage address these challenges, explore real-world use cases across industries, and provide practical guidance on implementing Iceberg V3 features across AWS analytics, catalog, and storage services.

What’s new in Iceberg V3

Iceberg V3 introduces new capabilities and data types. Two key capabilities that address the challenges discussed earlier are deletion vectors and row lineage.

Deletion vectors replace positional delete files with an efficient binary format stored as Puffin files. Instead of creating separate delete files for each delete operation, the deletion vector consolidates these delete references to a single delete vector per data file, rather than a delete reference file per deleted row. During query execution, engines efficiently filter out deleted rows using these compact vectors, maintaining query performance while removing the need to merge multiple delete files.

This avoids write amplification from random batch updates and GDPR compliance deletes, significantly reducing the overhead of maintaining fresh data. High-frequency update workloads can see immediate improvements in write performance and reduced storage costs from fewer small delete files. Additionally, having fewer small delete files reduces table maintenance costs for compaction operations.

Row lineage enables precise change tracking at the row level with full auditability. Row lineage adds metadata fields to each data file that track when rows were created and last modified. The _row_id field uniquely identifies each row, and the _last_updated_sequence_number field tracks the snapshot when the row was last modified. These fields enable efficient change tracking queries without scanning entire tables, and they’re automatically maintained by the Iceberg specification without requiring custom code.

Before row lineage, change tracking in Iceberg provided only the net changes between snapshots, making it difficult to track individual record modifications. Row lineage metadata fields can now be queried to return all incremental changes, giving you full fidelity for auditing data modifications and regulatory compliance. For data transformations, your downstream systems can process changes incrementally, speeding up data pipelines and reducing compute costs for change data capture (CDC) workflows. Row lineage is engine agnostic, interoperable, and built into the Iceberg V3 specification, alleviating the need for custom, engine-specific change tracking implementations.

Real-world use cases

The new Iceberg V3 capabilities address critical challenges across multiple industries:

  • Marketing and advertising services organizations – You can now efficiently handle GDPR right-to-be-forgotten requests and regulatory compliance deletes without the write amplification that previously degraded pipeline performance. Row lineage provides complete audit trails for data modifications, meeting strict regulatory requirements for data governance.
  • Ecommerce platforms processing millions of product updates and inventory changes daily – You can maintain data freshness while reducing storage costs. Deletion vectors enable faster upsert operations, helping teams meet aggressive SLA requirements during peak shopping periods.
  • Healthcare and life sciences companies – You can track patient data modifications with precision for compliance purposes while efficiently processing large-scale genomic datasets. Row lineage provides the detailed change history required for clinical trial audits and regulatory submissions.
  • Media and entertainment providers managing large catalogs of user viewing data – You can efficiently process incremental changes for recommendation engines. Row lineage enables downstream analytics systems to process only changed records, reducing compute costs in incremental processing scenarios.

Get started with Iceberg V3

To take advantage of deletion vectors for optimized writes and row lineage for built-in change tracking in Iceberg V3, set the table property format-version = 3 during table creation. Alternatively, setting this property on an existing Iceberg V2 table atomically upgrades the table without data rewrites. Before creating or upgrading V3 tables, make sure the Iceberg engines in your solution are V3-compatible. Refer to Apache Iceberg V3 on AWS for more details.

Create a new V3 table with Apache Spark on Amazon EMR 7.12

The following code creates a new table named customer_data. Setting the table property format-version = 3 creates a V3 table. If the format-version table property is not explicitly set, a V2 table is created. V2 is currently the Iceberg default table version. Setting write.delete.mode, write.update.mode, and write.merge.mode to merge-on-read configures Spark to write deletion vectors for delete, update, or merge statements performed on the table.

CREATE TABLE customer_data (
customer_id bigint,
name string,
email string,
last_purchase timestamp,
total_spent decimal(10,2)
)
USING iceberg
TBLPROPERTIES (
'format-version' = '3',
'write.delete.mode' = 'merge-on-read',
'write.update.mode' = 'merge-on-read',
'write.merge.mode' = 'merge-on-read'
)

Run the following code to insert records into the customer_data table:

INSERT INTO customer_data VALUES
 (1, 'Alejandro Rosalez', '[email protected]', TIMESTAMP '2025-11-24 18:55:27', 42.97)
,(2, 'Akua Mansa', '[email protected]', TIMESTAMP '2025-11-24 17:55:27', 25.02)
,(3, 'Ana Carolina Silva','[email protected]', TIMESTAMP '2025-11-24 16:55:27', 43.67)
,(4, 'Arnav Desai','[email protected]', TIMESTAMP '2025-11-24 15:55:27', 98.32)
,(5, 'Carlos Salazar','[email protected]', TIMESTAMP '2025-11-24 12:55:27', 76.45)

Delete a record where customer_id = 5 to generate a delete file:

DELETE 
  FROM customer_data 
  WHERE customer_id = 5

Updating a record with the following update statement also generates a delete file:

UPDATE customer_data
  SET name = 'Mansa Akua' 
  WHERE customer_id = 2

The last part of this example queries the manifest’s metadata table to verify delete files were produced:

SELECT added_snapshot_id
      ,sum(added_delete_files_count) as added_delete_files_count 
FROM customer_data.manifests 
GROUP BY added_snapshot_id 
ORDER BY added_snapshot_id

This query will result in three records returned, as shown in the following screenshot. The added_delete_files_count for the first snapshot that inserts records should be 0. The next two snapshots for the corresponding delete and update statements should have 1 each for added_delete_files_count value.

Query row lineage for change tracking

Row lineage is automatically enabled on V3 tables. The following example includes row lineage metadata fields and an example of how to query table changes after a row lineage sequence number:

SELECT
customer_id,
name,
email,
_row_id,
_last_updated_sequence_number
FROM customer_data
WHERE _last_updated_sequence_number > 0
ORDER BY _last_updated_sequence_number, _row_id

Running this query after the previous insert, update, and delete statements returns four records, as shown in the following screenshot. The deleted record is removed. The _last_updated_sequence_number is 3 for the update to customer_id = 2.

Upgrade an existing V2 table

You can upgrade your existing V2 tables to V3 with the following command:

ALTER TABLE existing_customer_data
SET TBLPROPERTIES ('format-version' = '3')

When you upgrade a table from V2 to V3, several important operations occur atomically:

  • A new metadata snapshot is created atomically, resulting in no data loss.
  • Existing Parquet data files are reused without modification.
  • Row-lineage fields (_row_id and _last_updated_sequence_number) are added to the table metadata.
  • The next compaction operation will remove old V2 positional delete files. If new deletion vector files are generated before compaction runs, they will merge existing V2 positional delete files.
  • New modifications will automatically use V3’s deletion vector files.
  • The upgrade does not perform a historical backfill of row-lineage change tracking records.

The upgrade process is synchronous and completes in seconds for most tables. If the upgrade fails, an error message is returned immediately, and the table remains in its V2 state.

Getting the most from Iceberg V3

In this section, we share the key things we’ve learned from customers already using these features.

Know your workload pattern

Deletion vectors work best when you’re doing lots of writes, such as high-frequency updates, batch deletes, or CDC workloads making random non-append-only updates. If you’re writing more than you’re reading, deletion vectors will deliver immediate performance gains. To unlock these benefits, set your table to merge-on-read mode for delete, update, and merge operations.

Let AWS handle compaction

Enable automatic compaction through the Data Catalog or use S3 Tables (on by default). You will get hands-free optimization without building custom maintenance jobs. Deletion vectors produce fewer delete files than positional deletes in Iceberg V2. Given a similar pattern and amount of modified records, V3 compaction should be quicker and cost less than V2.

Understand the importance of row lineage when using the V2 changelog

With the Spark changelog procedure in Iceberg V2, if a row gets inserted and then deleted between snapshots, it disappears from your change feed—you never see it. Iceberg V3 row lineage captures both operations because _last_updated_sequence_number updates on each modification. This full fidelity is important for audit trails and regulatory compliance where you need to prove what happened to every record. Performance-wise, the V2 changelog requires scanning and merging delete files to compute changes—that’s compute you’re paying for on every read. V3 row lineage stores metadata fields directly on each row, so filtering by _last_updated_sequence_number is a simple metadata scan.

Test before you upgrade

Iceberg V3 upgrades are atomic and fast, but test in dev first. Make sure all your query engines support Iceberg V3 before upgrading shared tables—mixing V2 and V3 engines causes headaches. After upgrading, keep a few V2 snapshots around temporarily for time-travel queries while you validate performance.

Conclusion

Iceberg V3 support across AWS analytics, catalog, and storage services marks a significant advancement in data lake capabilities. By combining deletion vectors’ write optimization with row lineage’s comprehensive change tracking, you can build more efficient, auditable, and cost-effective data lakes at scale. The seamless interoperability across AWS services makes sure your data lake architecture remains flexible and future-proof.

To learn more about AWS support for Iceberg V3, refer to Using Apache Iceberg on AWS.

To learn more about building modern data lakes with Iceberg on AWS, refer to Analytics on AWS.


About the authors

Ron Ortloff

Ron Ortloff

Ron is a Principal Product Manager at AWS.

Getting started with Apache Iceberg write support in Amazon Redshift

Post Syndicated from Sanket Hase original https://aws.amazon.com/blogs/big-data/getting-started-with-apache-iceberg-write-support-in-amazon-redshift/

Many companies store structured data in warehouses for analytics while keeping diverse datasets in data lakes for flexible processing. Until now, maintaining consistency between these systems required complex ETL processes and introduced potential data synchronization challenges.

The new Amazon Redshift Apache Iceberg write support removes these complexities through direct writes to Apache Iceberg tables stored in Amazon S3 and S3 Tables. With this native integration you can write data directly from Redshift queries to your data lake without intermediate ETL steps, facilitate data consistency with ACID-compliant transactions that help optimize query performance with flexible partitioning strategies, and use the familiar Redshift SQL interface when writing to Apache Iceberg tables. For example, you can now run a complex transformation in Redshift and write the results directly to an Apache Iceberg table that other analytics engines like Amazon EMR or Amazon Athena can immediately query. By using this approach you can query the same datasets from both Redshift and other analytics tools without copying data.

In this post, we show how you can use Amazon Redshift to write data directly to Apache Iceberg tables stored in Amazon S3 and S3 Tables for seamless integration between your data warehouse and data lake while maintaining ACID compliance.

“Verisk processes billions of catastrophe risk modeling records using Amazon Redshift and Apache Iceberg, achieving 30% faster query aggregations and significant storage cost reductions”

— Karthick Shanmugam, Associate Vice President, Verisk

Solution overview 

You can now create and write directly to Apache Iceberg tables stored in Amazon S3 and S3 Tables using familiar SQL commands in Amazon Redshift. We’ll guide you through configuring permissions for S3 table buckets using AWS Lake Formation. Finally, we’ll analyze customer and order datasets across both Redshift native and Apache Iceberg data formats to derive insights. The workflow is illustrated in the following diagram:

In this post we will walk you through following steps:

  1. Create an external database named customer_db in AWS Glue Data Catalog using Amazon Redshift SQL.
  2. Create an external table named customer in the Glue database and write customer data using Amazon Redshift SQL.
  3. Create table bucket named orders on Amazon S3 Tables to write orders data.
  4. Grant permissions using AWS Lake Formation to an IAM role for reading and writing to the orders table.
  5. Write orders data to the orders Amazon S3 table bucket.

This solution uses the following AWS services:

Prerequisites 

  • Create Amazon Redshift data warehouse (provisioned or Serverless).
  • Permissions to create database on AWS Glue Data Catalog from Redshift.
  • Create a new AWS Glue database called customer_db or use an existing database of your choice. If you use an existing database or a different name, replace customer_db with your actual database name in the subsequent commands.
  • S3 bucket and S3 Table bucket in the same AWS Region as your Redshift cluster.
  • Have access to an IAM role that is a Lake Formation data lake administrator. For instructions, refer to Create a data lake administrator.
  • Create IAM role RedshifticebergRole with following policy. Add managed permission for AmazonRedshiftQueryEditorV2.
    {
        "Version": "2012-10-17",
        "Statement": [
                        {
                            "Sid": "VisualEditor0",
                            "Effect": "Allow",
                            "Action": "redshift:GetClusterCredentialsWithIAM",
                            "Resource": "arn:aws:redshift:<YOUR-REGION>:<AWS-ACCOUNT-NUMBER>:dbname::<YOUR-REDSHIFT-CLUSTER-NAME>/*"
                        }
                    ]
    }

Setting up your environment

To set up your environment, complete the following steps.

Creating Apache Iceberg tables in Amazon S3 standard buckets

  1. Connect to Redshift using Query Editor V2.
  2. Create user for the Federated role RedshifticebergRole.
    Create user IAMR:RedshifticebergRole

  3. Verify you have an Amazon Redshift External Schema configured. Run following script on Redshift:
    CREATE EXTERNAL SCHEMA demo_iceberg
    FROM DATA CATALOG DATABASE 'customer_db'
    IAM_ROLE 'arn:aws:iam::<AWS-ACCOUNT-NUMBER>:role/RedshiftCustomizedIcebergRole';

  4. Create external table customer in Apache Iceberg table format in the demo_iceberg external schema created above and then insert data.

    Use this two-step approach when you need control over column definitions or plan to append data.

    Replace your S3 bucket name in place of <<your-bucket>>.

    -- Step 1: Define your table structure
    CREATE TABLE demo_iceberg.customer
    (
    customer_id bigint,
    customer_name varchar,
    email varchar,
    city varchar
    )
    USING ICEBERG
    LOCATION 's3://<<your-bucket>>/iceberg-data/customers/';
    
    -- Step 2: Insert data
                  
    (1, 'Customer1 Smith', '[email protected]', 'New York'),
    (2, 'Customer2 Johnson', '[email protected]', 'Los Angeles'),
    (3, 'Customer3 Brown', '[email protected]', 'Chicago'),
    (4, 'Customer4 Davis', '[email protected]', 'Houston'),
    (5, 'Customer5 Wilson', '[email protected]', 'Phoenix'),
    (6, 'Customer6 Miller', '[email protected]', 'Philadelphia'),
    (7, 'Customer7 Garcia', '[email protected]', 'San Antonio'),
    (8, 'Customer8 Rodriguez', '[email protected]', 'San Diego'),
    (9, 'Customer9 Martinez', '[email protected]', 'Dallas'),
    (10, 'Customer10 Anderson', '[email protected]', 'San Jose'),
    (11, 'Customer11 Taylor', '[email protected]', 'Austin'),
    (12, 'Customer12 Thomas', '[email protected]', 'Jacksonville'),
    (13, 'Customer13 Jackson', '[email protected]', 'Fort Worth'),
    (14, 'Customer14 White', '[email protected]', 'Columbus'),
    (15, 'Customer15 Harris', '[email protected]', 'Charlotte');
    
    -- Step 3: Select data
    SELECT * FROM demo_iceberg.customer;
    

    Figure 2: Result from demo_iceberg.customer

  5. Grant access to external schema for user IAMR:RedshifticebergRole:
    Grant usage on schema demo_iceberg to "IAMR:RedshifticebergRole";

Create Apache Iceberg tables in Amazon S3 Table buckets

Amazon S3 table buckets are integrated with AWS Lake Formation, which serves as the central authority for managing data access permissions. When working with Apache Iceberg tables, Lake Formation provides a unified security framework that simplifies access control across your entire data lake. This centralized approach makes sure consistent and efficient permission management, alleviating the need to handle permissions in multiple places.

To create an S3 table bucket:

  1. Go to Amazon S3, choose Table buckets in the left navigation pane.
  2. On the Table buckets page, in the Integration with AWS analytics services section, choose Enable integration.
  3. In the Table buckets list, choose the Create table bucket button and enter a name for your table bucket, for example, iceberg-write-blog, and choose Create table bucket. After creation, the bucket will appear in the S3 tables catalog, s3tablescatalog, in the Lake Formation console.
  4. In the AWS Lake Formation console, choose Catalogs, in the Catalogs table select s3tablescatalog to open the detail page for that table.
  5. On the s3tablescatalog details page, under Catalogs, choose the table bucket iceberg-write-blog.
  6. On the iceberg-write-blog details page, under Databases, choose Create database.
  7. Enter the database name iceberg_write_namespace, select the Catalog from the drop down menu, and choose Create database.
  8. Grant a permission to create a table in the database to the Lake Formation IAM role. On the iceberg-write-blog details page select the radio button for iceberg_write_namespace, choose Actions, Grant.
  9. On the Grant permissions page, under Principal type select Principals, under Principals select IAM users and roles, in the IAM users and roles drop down menu select RedshifticebergRole.
  10. For LF-Tags or catalog resources, choose Named Data Catalog resources, for Catalogs select iceberg-write-blog and for Databases select iceberg_write_namespace.
  11. For Database permissions select the checkbox for Create table, Drop, and Describe, then choose Grant.

Creating Apache Iceberg tables in Amazon Redshift using Amazon S3 table buckets

AWS Lake Formation catalogs are automatically mounted on Amazon Redshift data warehouses in same account. Amazon Redshift writes directly to S3 Tables using the auto mounted S3 table catalog. The SQL syntax for writing to Apache Iceberg tables stored in S3 table buckets is similar to the syntax for Apache Iceberg tables stored in S3 standard buckets. The key difference is the auto mounted S3 Table catalog, which supports three-part notation access. This feature alleviates the need to create an EXTERNAL SCHEMA when referencing data lake Apache Iceberg tables residing in S3 Table buckets.

To create the Apache Iceberg table:

  1. Switch to the RedshifticebergRole. To access S3 tables through the Redshift Query Editor V2, you must use a Federated user account, the RedshifticebergRole has been granted the necessary Lake Formation permissions.
  2. Log in to Redshift using the Query Editor V2 Federated user option.
  3. In Query Editor V2, create the table named orders in Apache Iceberg table format:
    CREATE TABLE "iceberg-write-blog@s3tablescatalog".iceberg_write_namespace.orders
    (
     customer_id BIGINT,
     order_id BIGINT,
     Total_order_amt DECIMAL(10,2),
     Total_order_tax_amt REAL,
     tax_pct DOUBLE PRECISION,
     order_date DATE,
     order_created_at_tz TIMESTAMPTZ,
     is_active_ind BOOLEAN
    )
    USING ICEBERG 
    PARTITIONED BY (DAY(order_date));
    

  4. Insert data into the table using standard SQL:
    INSERT INTO "iceberg-write-blog@s3tablescatalog".iceberg_write_namespace.orders
    (order_date, order_id, customer_id, total_order_amt, total_order_tax_amt, tax_pct, order_created_at_tz, is_active_ind)
    VALUES
    ('2024-01-15', 1001, 1, 125.50, 10.04, 0.08, '2024-01-15 10:30:00-06:00', true),
    ('2024-02-20', 1002, 2, 89.99, 6.75, 0.075, '2024-02-20 14:22:15-06:00', true),
    ('2024-03-10', 1003, 3, 234.75, 20.78, 0.0885, '2024-03-10 09:15:30-06:00', false),
    ('2024-04-05', 1004, 1, 67.25, 5.38, 0.08, '2024-04-05 16:45:00-05:00', true),
    ('2024-05-18', 1005, 4, 156.80, 12.54, 0.08, '2024-05-18 11:20:45-05:00', true),
    ('2024-06-22', 1006, 5, 45.99, 4.14, 0.09, '2024-06-22 13:10:20-05:00', true),
    ('2024-07-14', 1007, 2, 312.40, 24.99, 0.08, '2024-07-14 08:35:10-05:00', false),
    ('2024-08-30', 1008, 6, 78.50, 7.07, 0.09, '2024-08-30 15:25:35-05:00', true),
    ('2024-09-12', 1009, 3, 199.99, 18.00, 0.09, '2024-09-12 12:40:50-05:00', true),
    ('2024-10-08', 1010, 7, 523.75, 41.90, 0.08, '2024-10-08 17:15:25-05:00', true),
    ('2024-10-25', 1011, 4, 92.30, 8.31, 0.09, '2024-10-25 10:05:15-05:00', false),
    ('2024-11-02', 1012, 8, 167.45, 13.40, 0.08, '2024-11-02 14:50:40-06:00', true),
    ('2024-11-08', 1013, 1, 34.99, 2.80, 0.08, '2024-11-08 09:30:20-06:00', true),
    ('2024-11-09', 1014, 9, 445.60, 40.10, 0.09, '2024-11-09 16:20:55-06:00', true),
    ('2024-11-10', 1015, 5, 278.85, 22.31, 0.08, '2024-11-10 11:45:30-06:00', true);
    

  5. Create a Redshift local_orders table and insert sample records:
    CREATE TABLE dev.public.local_orders
    (
    customer_id BIGINT,
    order_id BIGINT,
    Total_order_amt DECIMAL(10,2),
    Total_order_tax_amt REAL,
    tax_pct DOUBLE PRECISION,
    order_date DATE,
    order_created_at_tz TIMESTAMPTZ,
    is_active_ind BOOLEAN
    );
    
    
    INSERT INTO dev.public.local_orders
    (customer_id, order_id, Total_order_amt, Total_order_tax_amt, tax_pct, order_date, order_created_at_tz, is_active_ind)
    VALUES
    (1001, 5001, 299.99, 24.00, 0.08, '2024-01-15', '2024-01-15 14:30:00-05:00', true),
    (1002, 5002, 1250.50, 100.04, 0.08, '2024-01-16', '2024-01-16 09:15:22-05:00', true),
    (1003, 5003, 75.25, 6.02, 0.08, '2024-01-16', '2024-01-16 16:45:33-05:00', true),
    (1004, 5004, 499.99, 40.00, 0.08, '2024-01-17', '2024-01-17 11:20:45-05:00', true),
    (1005, 5005, 149.50, 11.96, 0.08, '2024-01-17', '2024-01-17 13:55:12-05:00', false),
    (1002, 5006, 899.99, 72.00, 0.08, '2024-01-18', '2024-01-18 10:05:30-05:00', true),
    (1006, 5007, 45.75, 3.66, 0.08, '2024-01-18', '2024-01-18 15:40:18-05:00', true),
    (1007, 5008, 1500.00, 120.00, 0.08, '2024-01-19', '2024-01-19 08:25:55-05:00', true),
    (1008, 5009, 250.25, 20.02, 0.08, '2024-01-19', '2024-01-19 12:10:40-05:00', true),
    (1009, 5010, 725.75, 58.06, 0.08, '2024-01-20', '2024-01-20 14:15:28-05:00', true);
    

  6. Using the CREATE TABLE AS (CTAS) format, create a table from the existing table with no compression:
    CREATE TABLE "iceberg-write-blog@s3tablescatalog".iceberg_write_namespace.orders_new 
    using ICEBERG
    TABLE PROPERTIES ('compression_type'='uncompressed')
    AS
    select * from dev.public.local_orders;
    

  7. Select data with standard SQL using the three-part notation:
    select * from "iceberg-write-blog@s3tablescatalog".iceberg_write_namespace.orders;

    You can also use the USE clause to specify the default database (and omit the database name):

    USE "iceberg-write-blog@s3tablescatalog";
    
    select * from iceberg_write_namespace.orders;

    The resulting table will look like the following image:

  8. Set a schema search path to further simplify table access by omitting the schema name from the notation:
    -- Redshift default database is set to 'iceberg-write-blog@s3tablescatalog'
    USE "iceberg-write-blog@s3tablescatalog";
    
    -- Redshift will search 'iceberg_write_namespace' to resolve table orders
    set search_path to iceberg_write_namespace;
    
    select * from orders;

  9. Show table:
    show table "iceberg-write-blog@s3tablescatalog".iceberg_write_namespace.orders 
    
    --Result
    CREATE TABLE "iceberg-write-blog@s3tablescatalog".iceberg_write_namespace.orders 
    (
    customer_id bigint,
    order_id bigint,
    total_order_amt decimal(10, 2),
    total_order_tax_amt float,
    tax_pct double precision,
    order_date date,
    order_created_at_tz timestamptz,
    is_active_ind Boolean
    )
    USING ICEBERG
    PARTITIONED BY (DAY(order_date))
    TABLE PROPERTIES ('compression_type'='snappy');

Bringing it together

Let’s demonstrate how to combine data from two sources and show how they can work together in a single query.

  • Customer data stored in standard S3 buckets
  • Orders data stored in S3 table buckets
  1. Log in to Redshift using Federated user:
    select 
    b.order_date,
    b.order_id,
    b.total_order_amt,
    CONVERT_TIMEZONE('America/Los_Angeles', b.order_created_at_tz) AS order_pacific_time,
    a.customer_name,
    a.email,
    a.city
    from dev.demo_iceberg.customer a join "iceberg-write-blog@s3tablescatalog".iceberg_write_namespace.orders b
    on(a.customer_id = b.customer_id)
    where b.order_date between '2024-01-15' and '2024-10-25'
    and b.is_active_ind=true;

    The result from the consolidated query:

  2. Drop table:
    Drop TABLE "iceberg-write-blog@s3tablescatalog".iceberg_write_namespace.orders_new;

Clean up

To avoid ongoing charges, follow these steps in order:

  1. Drop Apache Iceberg tables:
    DROP TABLE dev.demo_iceberg.customer;
    DROP TABLE "iceberg-write-blog@s3tablescatalog".iceberg_write_namespace.orders;

  2. Remove S3 objects, replace your-bucket with the name of the bucket you created:
    aws s3 rm s3://<your-bucket>/iceberg/ --recursive

  3. Remove Lake Formation permissions, replace your-bucket with the name of the bucket you created:
    aws lakeformation deregister-resource --resource-arn arn:aws:s3:::<your-bucket>

Conclusion

With Apache Iceberg write support in Amazon Redshift you can to build flexible data architectures that combine the performance of a data warehouse with the scalability of a data lake. You can now write data directly to Apache Iceberg tables while maintaining ACID compliance and partitioning for query optimization. You can use Amazon Redshift to create Apache Iceberg tables in your data lake, making them immediately queryable through Amazon EMR or Amazon Athena.

To learn more, review the Amazon Redshift Iceberg integration and Writing to Apache Iceberg tables documentation. Visit the AWS Database Blog for latest updates.


About the authors

Sanket Hase

Sanket Hase

Sanket is an Engineering Manager with the Amazon Redshift team, leading query execution teams in the areas of data lake analytics, hardware-software co-design, and vectorized query execution.

Harshida Patel

Harshida Patel

Harshida is a Principal Solutions Architect, Analytics with AWS.

Ritesh Kumar Sinha

Ritesh Kumar Sinha

Ritesh is an Analytics Specialist Solutions Architect based out of San Francisco. He has helped customers build scalable data warehousing and big data solutions for over 16 years. He loves to design and build efficient end-to-end solutions on AWS. In his spare time, he loves reading, walking, and doing yoga.

Raghu Kuppala

Raghu Kuppala

Raghu is an Analytics Specialist Solutions Architect experienced working in the databases, data warehousing, and analytics space. Outside of work, he enjoys trying different cuisines and spending time with his family and friends.

Xiening Dai

Xiening Dai

Xiening is a Principal Software Engineer working on Redshift Query Processing and Data Lake.

Secure Amazon Elastic VMware Service (Amazon EVS) with AWS Network Firewall

Post Syndicated from Sheng Chen original https://aws.amazon.com/blogs/architecture/secure-amazon-elastic-vmware-service-amazon-evs-with-aws-network-firewall/

Amazon Elastic VMware Service (Amazon EVS) helps organizations migrate, run, and scale VMware workloads natively on AWS. It delivers a VMware Cloud Foundation (VCF) environment that operates directly within your Amazon Virtual Private Cloud (Amazon VPC) on Amazon EC2 bare-metal instances. The solution helps customers accelerate cloud migrations and data center exits without needing to refactor existing applications.

For customers considering a hybrid cloud architecture, a unified network security solution is required to protect application traffic across Amazon EVS environments, Amazon VPCs, on-premises data centers and the internet. It also needs to provide a single point of control for firewall policy management, centralized logging, and monitoring to streamline network security operations.

AWS Network Firewall is a managed firewall and intrusion detection and prevention service (IDS/IPS) that can help address these requirements. Built on AWS managed infrastructure, it automatically scales with traffic demands while maintaining high availability and consistent performance. The service provides centralized policy management and traffic inspection across multiple VPCs and AWS accounts. Additionally, it provides comprehensive visibility and reporting through firewall log collections to Amazon Simple Storage Service (Amazon S3), Amazon CloudWatch Logs, or Amazon Data Firehose.

In this post, we demonstrate how to utilize AWS Network Firewall to secure an Amazon EVS environment, using a centralized inspection architecture across an EVS cluster, VPCs, on-premises data centers and the internet. We walk through the implementation steps to deploy this architecture using AWS Network Firewall and AWS Transit Gateway.

Architecture overview

AWS Network Firewall operates as a “bump-in-the-wire” solution, which transparently inspects and filters network traffic across Amazon VPCs. It is inserted directly into the traffic path by updating VPC or Transit Gateway route tables, allowing it to examine all packets without requiring any changes to the existing application flow patterns.

The following diagram depicts the architecture overview of our centralized inspection model using AWS Network Firewall.

Figure 1: Secure Amazon EVS with AWS Network Firewall using centralized inspection architecture

Figure 1: Secure Amazon EVS with AWS Network Firewall using centralized inspection architecture

The Amazon EVS environment is deployed directly within a customer VPC (i.e. EVS VPC), which consists of EVS VLAN subnets that form the underlay networks for VCF deployment. This infrastructure provides connectivity for NSX overlay networks, host management, vMotion, and vSANAmazon VPC Route Server enables dynamic routing between the underlay networks and overlay networks. For more information, see Concepts and components of Amazon EVS.

The architecture also includes a standard workload VPC (i.e. VPC01), and a Direct Connect Gateway connects to the on-premises data center via an AWS Direct Connect connection. We use a dedicated egress VPC with NAT gateways for centralized internet egress, and a separate ingress VPC with Application Load Balancers to terminate ingress web traffic and steer flows back to the target services.

With this architecture, the following traffic flow patterns can be inspected:

East-West Traffic:

  • Between EVS VPCs and Workload VPCs
  • Between Workload VPCs

North-South Traffic:

  • Between EVS/Workload VPCs and on-premises
  • Between EVS/Workload VPCs and internet
  • Between on-premises and internet

The centralized inspection architecture provides several benefits:

  • Single point of control for network security inspection across multiple VPCs
  • Enhanced rule enforcement across AWS infrastructure, on-premises resources, and the internet
  • Centralized logging and monitoring

For this demo we use the AWS Network Firewall native integration with AWS Transit Gateway capability to streamline firewall deployment and management. With a native firewall attachment, AWS automatically provisions and manages all the necessary VPC resources, reducing the operational overhead of managing subnets, route tables, and firewall endpoints within the inspection VPC.

Prerequisites

This post assumes familiarity with: AWS Command Line Interface (AWS CLI), Amazon VPC, Amazon EC2, NAT gateway, Application Load Balancer, Internet gateway, AWS Direct Connect, AWS Transit Gateway and the VMware VCF platform.

The following prerequisites are necessary to complete this solution.

  • An EVS VPC includes:
    • An Amazon EVS cluster (minimum 4x i4i nodes)
    • VPC CIDR: 10.0.0.0/16
    • NSX Segments CIDR: 192.168.0.0/19 (summarized)
    • A VPC Route Server deployed in the EVS VPC to receive NSX segment routes via BGP dynamic routing. Refer to the EVS User Guide for more details.
  • A Workload VPC (VPC01):
    • CIDR: 172.21.0.0/16
  • An Egress VPC:
    • CIDR: 172.23.0.0/16
    • 1x Internet Gateway
    • 1x NAT Gateway
  • An Ingress VPC:
    • CIDR: 172.24.0.0/16
    • 1x Internet Gateway
    • 1x Application Load Balancer
  • Optional: a Direct Connect Gateway:
    •  connecting to the on-premises environment (10.0.0.0/8)

Note: The CIDR blocks used in this example are for demo purposes only; change the address spaces to match your own networking environment. The design can also be scaled to include additional EVS environments and/or other VPCs based on workload needs.

Walkthrough

In this section, we walk through the implementation steps to deploy the centralized inspection architecture with AWS Network Firewall and AWS Transit Gateway. We focus on the overall network integration of the architecture without diving into the detailed configurations of AWS Network Firewall or Transit Gateway.

1. Create an AWS Transit Gateway

In the VPC console, create a Transit Gateway. Make sure to deselect the following options:

  • Default route table association
  • Default route table propagation

Create two empty transit gateway route tables and associate them with the Transit Gateway.

  • Pre-inspection route table: steers traffic into the AWS Network Firewall for centralized inspection
  • Post-inspection route table: returns traffic back to its original destination after inspection and is permitted by the AWS Network Firewall

2. Attach VPCs to the Transit Gateway

Attach all four VPCs (EVS, VPC01, Ingress, Egress) to the same Transit Gateway. The Direct Connect Gateway can also be attached to the Transit Gateway if AWS Network Firewall is needed to inspect traffic between the on-premises environment and AWS or the internet.

Figure 2: Attach VPCs to the Transit Gateway

Figure 2: Attach VPCs to the Transit Gateway

Associate all attachments to the pre-inspection Transit Gateway route table.

Figure 3: Associate VPC attachments to the pre-inspection route table

Figure 3: Associate VPC attachments to the pre-inspection route table

3. Create an AWS Network Firewall with Transit Gateway native integration

In the Network Firewall section of the VPC console, choose Create firewall.

At the Attachment type section, select Transit Gateway to enable native integration with the existing Transit Gateway.

Figure 4: Enable AWS Network Firewall native integration with Transit Gateway

Figure 4: Enable AWS Network Firewall native integration with Transit Gateway

At the Logging configuration, enable the following log types with CloudWatch log group as the log destination. Create a log group for each log type in the CloudWatch Console.

  • Alert: /anfw-centralized/anfw01/alert
  • Flow: /anfw-centralized/anfw01/flow

Create and associate an empty firewall policy to deploy the AWS Network Firewall instance. The firewall policy contains a list of rule groups that define how the firewall inspects and manages traffic. This empty firewall policy can be configured later.

With the Transit Gateway native integration enabled, a Transit Gateway attachment is automatically created for the AWS Network Firewall, with the resource type shown as Network Function. In addition, the Appliance Mode is automatically enabled for the firewall attachment to make sure the Transit Gateway continues to use the same Availability Zone (AZ) for the attachment over the lifetime of a flow.

Associate the firewall attachment to the post-inspection Transit Gateway route table.

Figure 5: AWS Network Firewall native attachment

Figure 5: AWS Network Firewall native attachment

4. Update Transit Gateway route tables

Update the pre-inspection Transit Gateway route table with a default route that points to the AWS Network Firewall attachment. This makes sure traffic that arrives to the Transit Gateway from all VPC attachments and the Direct Connect Gateway attachment is sent to the firewall for centralized inspection.

Figure 6: Transit Gateway pre-inspection route table

Figure 6: Transit Gateway pre-inspection route table

Add the following static routes to the post-inspection route table to direct return traffic back to each VPC and the Direct Connect Gateway accordingly.

Figure 7: Transit Gateway post-inspection route table

Figure 7: Transit Gateway post-inspection route table

5. Update VPC route tables

Finally, update route tables at each VPC as per the following table.

Make sure to add the following routes at the relevant VPC route tables:

  • EVS VPC and VPC01 have a default route (marked in blue) to steer all egress flows into AWS Network Firewall for centralized inspection.
  • Ingress VPC and Egress VPC have RFC-1918 routes (marked in green) to direct return traffic to the Transit Gateway.

Within the EVS VPC, notice the NSX segment routes are automatically propagated to the NSX uplink subnet route table and the private subnet route table via the VPC Route Server.

Figure 8: NSX uplink subnet route table within EVS VPC

Figure 8: NSX uplink subnet route table within EVS VPC

A centralized security inspection architecture has now been deployed for the EVS environment, using AWS Network Firewall with Transit Gateway native integration.

6. Testing

Egress inspection (FQDN filtering)

To test egress inspection from EVS VPC or VPC01 to the internet, create a stateful rule group for the firewall instance using FQDN filtering:

  • Rule group format: Domain list
  • Domain names: .google.com
  • Source IPs: 192.168.0.0/19, 172.21.0.0/16
  • Protocols: HTTP & HTTPS
  • Action: Allow

As expected, testing web access from a virtual machine (192.168.12.10) within the EVS environment to the allowed domain (i.e. google.com) is permitted by the AWS Network Firewall. However, access to unauthorized domain (i.e. facebook.com) is blocked at the firewall with an alert trigged, which can be verified at the CloudWatch log group at /aws/network-firewall/alert/.

Figure 9: Egress inspection from EVS to internet with FQDN filtering

Figure 9: Egress inspection from EVS to internet with FQDN filtering

Ingress inspection

Create another stateful rule group to allow Application Load Balancers deployed within the Ingress VPC to access a web server running in the EVS environment via HTTP protocol:

  • Rule group format: Standard stateful rule
  • Geographic IP Filtering: Disable Geographic IP filtering
  • Protocol: HTTP
  • Source: 172.24.0.0/16
  • Source Port: ANY
  • Destination: 192.168.12.10/32
  • Destination Port ANY
  • Traffic direction: Forward
  • Action: Alert

The CloudWatch firewall logs show an Application Load Balancer (172.24.6.45) from the Ingress VPC can establish HTTP connection to the EVS web server (192.168.12.10). Additionally, the Application Load Balancer has successfully registered the EVS web server as a remote IP target.

Figure 10: Ingress inspection from Ingress VPC to EVS

Figure 10: Ingress inspection from Ingress VPC to EVS

East-West inspection

For East-West inspection testing, update the previous stateful rule group to add a new rule to block ICMP traffic from VPC01 to the EVS VPC.

  • Rule group format: Standard stateful rule
  • Geographic IP Filtering: Disable Geographic IP filtering
  • Protocol: ICMP
  • Source: 172.21.0.0/16
  • Source Port: ANY
  • Destination: 192.168.0.0/19
  • Destination Port: ANY
  • Action: Drop

As a result, pings from an EC2 instance (172.21.128.4) from VPC01 to the EVS web server (192.168.12.10) are being dropped.

Figure 11: East-West Inspection from VPC01 to EVS

Figure 11: East-West Inspection from VPC01 to EVS

Conclusion

In this post, we demonstrated how to utilize AWS Network Firewall to secure Amazon EVS workloads and to provide centralized traffic inspection between Amazon EVS environments, Amazon VPCs, on-premises data centers, and the internet. We walked through the implementation steps for deploying the centralized inspection architecture using AWS Network Firewall and AWS Transit Gateway.

To learn more, review these resources:


About the authors

Orchestrating data processing tasks with a serverless visual workflow in Amazon SageMaker Unified Studio

Post Syndicated from Suba Palanisamy original https://aws.amazon.com/blogs/big-data/orchestrating-data-processing-tasks-with-a-serverless-visual-workflow-in-amazon-sagemaker-unified-studio/

Automation of data processing and data integration tasks is essential for data engineers and analysts to maintain up-to-date data pipelines and reports. Amazon SageMaker Unified Studio is a single data and AI development environment where you can find and access the data in your organization and act on it using the ideal tools for your use case. SageMaker Unified Studio offers multiple ways to integrate with data through its editorial tools, including Visual ETL, Query Editor, and JupyterLab builders.

Recently, AWS launched the visual workflow experience in SageMaker Unified Studio IAM-based domains. With visual workflows, you don’t need to code Python DAGs manually or have deep expertise in Apache Airflow. Instead, you can visually define orchestration workflows through an intuitive drag-and-drop interface in SageMaker Unified Studio. The visual definition is automatically converted to workflow definitions that leverage Amazon Managed Workflows for Apache Airflow (Amazon MWAA) Serverless, providing enterprise-grade orchestration capabilities with a simplified user experience.

In this post, we show how to use the new visual workflow experience in SageMaker Unified Studio IAM-based domains to orchestrate an end-to-end machine learning workflow. The workflow ingests weather data, applies transformations, and generates predictions—all through a single, intuitive interface, without writing any orchestration code.

For more details on Amazon MWAA Serverless, see Introducing Amazon MWAA Serverless.

Example use case

To demonstrate how SageMaker Unified Studio simplifies end-to-end workflow orchestration, let’s walk through a real-world scenario from agricultural analytics. The following diagram shows a weather data processing workflow that we will orchestrate using the visual workflow experience in SageMaker Unified Studio.

A regional agricultural extension office collects hourly weather data from multiple stations across farming communities. Their goal is to analyze this data and provide farmers with actionable insights into weather patterns and their impact on crop conditions. To achieve this, the team built a ML–powered analytics workflow using SageMaker Unified Studio to automate the processing of incoming weather data and predict irrigation needs.

In this walkthrough, we demonstrate how the visual workflow experience in Unified Studio can orchestrate an end-to-end data pipeline that:

  • Monitors and ingests hourly weather data from Amazon Simple Storage Service (Amazon S3)
  • Transforms raw weather measurements using Visual ETL jobs (type casting, SQL operations, and data cleansing)
  • Generates seasonal irrigation predictions and crop impact insights using JupyterLab notebooks

Whenever new weather data arrives, the workflow automatically routes it through a series of transformation steps, and produces ready-to-use insights—all visually orchestrated in SageMaker Unified Studio with no custom orchestration code required.

Prerequisites

Before you begin, complete the following steps:

  1. Signup for an AWS account and create a user with administrative access using the setup guide.
  2. Setup your SageMaker Unified Studio IAM-based domain:
    1. Navigate to the Amazon SageMaker console and use the Region selector in the top navigation bar to choose the appropriate AWS Region.
    2. On the Amazon SageMaker home page, choose Get started.
    3. For Project data access, choose to Auto-create a new role with admin permissions.
    4. Select the checkbox for S3 table integration with AWS Analytics services, for Data encryption choose Use AWS owned key, and then Set up.
  3. Go back to the Amazon SageMaker home page and choose Open to access the SageMaker Unified Studio experience.
  4. From the SageMaker Studio UI you can access the project in the SageMaker Unified Studio IAM-based domain. This project curates all assets accessible through the designated Execution IAM role.

Workflow implementation steps

In this section, we use Amazon SageMaker Studio to create an end-to-end visual workflow in IAM-based domain.

Step 1: Set up data storage and import weather dataset

First, we’ll prepare the Amazon S3 storage locations for raw and processed data:

  1. Download this weather dataset file to your local environment.
  2. From the left menu of the project, choose Files. Under Shared, create two new folders raw_data and processed_data.
  3. Upload the weather dataset file downloaded locally into raw_data folder.

Step 2: Create the weather data transformation job using Visual ETL

Next, create a Visual ETL job to transform the raw weather data through type casting, SQL transformations, and data cleansing:

  1. From the left menu, under Data Analytics, choose Visual ETL and Create Visual Job.
  2. Choose the + sign, and under Data sources, choose Amazon S3.
  3. For the Amazon S3 node settings, choose the following:
    • S3 URI: Choose Browse S3 and Select
    • Delimiter: ,
    • Multiline: Disabled
    • Header: Enabled
    • Infer schema: Disabled
    • Recursive file lookup: Disabled

  4. Choose the + sign next to the Amazon S3 box to add another node, under Transforms select Change columns.
  5. Connect the Amazon S3 node to the change columns node.
  6. Select the Change columns node to open the configuration window.
    • Choose Add type cast. Select temperature_2m (¬∞C) as the source column and add temperature_celsius as the target column. Select float as the Type.
    • Select precipitation (mm) as the source column and add Precipitation_mm as the target column. Select float as the Type.
    • Select rain (mm) as the source column and add Rain_mm as the target column. Select float as the Type.
    • Select windspeed_10m (km/h) as the source column and add windspeed as the target column. Select float as the Type.
    • Close the configuration window.

  7. Choose the + sign to add another node, under Transforms select SQL query . In the configuration window, paste in the following SQL statement:
    SELECT 
        MAKE_TIMESTAMP(2016, 1, day_of_year, hour, 0, 0) as timestamp,
        (temperature_celsius * 9 / 5) + 32 as temp_f,
        rain_mm * 25.4 as rain_inches,
        windspeed
    FROM {myDataSource}

  8. Choose the + sign to add another node, under Data targets, choose Amazon S3 and provide the following options:
    • S3 URI: Choose Browse S3 and select the processed_data folder created in Step 1.
    • Format: CSV
    • Update catalog: true
    • Database: sagemaker_sample_db
    • Table: weather_data
    • Include header: true
    • Ouput to a single file: false

  9. Connect the nodes to create a complete job.
  10. Save the Visual ETL and name it DataProcessing.

Step 3: Create the analysis and prediction notebook using JupyterLab

Now, we’ll set up the JupyterLab notebook that performs seasonal irrigation analysis and crop impact predictions based on temperature, rainfall, and wind speed patterns.Complete the following steps:

  1. Download the Crop Irrigation Prediction Python notebook to your local environment.
  2. In the SageMaker Unified Studio, from the left menu, choose JupyterLab. Wait for a few seconds for JupyterLab to be set up if you are trying for the first time.
  3. Upload CropIrrigationPrediction.ipynb using the upload files option.
  4. Review the notebook code to understand how it processes the weather data and generates irrigation predictions.

Step 4: Orchestrate the workflow

Finally, we will use the visual workflow to orchestrate tasks. With visual workflows, you can define a collection of tasks organized as a directed acyclic graph (DAG) that can run on a user-defined schedule.

  1. Choose Workflows from the left menu.
  2. Choose Create new Workflow.
  3. Rename the workflow to WeatherDataProcessingOrchestration.
  4. Create S3 task for monitoring and ingesting raw weather data:
    1. Choose the + sign, then choose S3 Key Sensor.
    2. Select S3-task to open the configuration window.
    3. For Bucket key choose Browse S3 and choose the synthetic_weather_hourly_data.csv file from the shared/raw_data S3 folder.



  5. Create a Glue task to transform the weather data:
    1. Choose the + Sign and add Data Processing Job / Glue Job Operator.
    2. Select Glue-task node to open the configuration window. For Operation type select Choose an existing Glue job.
    3. For Job name, choose Browse Jobs and select DataProcessing (this is the visual ETL job we created in the previous step.


  6. Choose the + sign and add SageMaker Unified Studio Jupyter Notebook Operator.
  7. Select the Notebook-task to open the configuration window. For Source, choose Browse Files and choose CropIrrigationPrediction.ipynb.

  8. Connect the tasks to create the complete workflow.
  9. Review the Workflow settings and choose Save.
    1. Provide a workflow description, “Workflow for Weather Data Processing”
    2. For Trigger, choose Manual only, because in this example you will trigger the workflow manually. You can also configure the workflow to trigger automatically on a schedule or disable it from running

Step 5: Execute and monitor the workflow

To run your workflow, complete the following steps:

  1. Choose Run to trigger workflow execution.
  2. Choose View runs to see the running workflow.
  3. Choose the Run ID for detailed logs on the execution.
  4. When the run is complete, you can review the task logs by choosing the Task ID.

The model’s output is written to the S3 processed data output folder. You can review the crop irrigation prediction results to verify they reflect realistic weather patterns and field conditions. If any results appear unexpected or unclear, examine the upstream transformation steps or adjust the notebook logic to refine the outputs.

Clean up

To avoid incurring future charges, clean up the resources you created during this walkthrough. Leaving these resources running may result in ongoing costs for storage and compute.To clean up your resources:

  1. On the workflows page, select your workflow, and under Actions, choose Delete workflow.
  2. In Visual ETL, select your weather data transformation flow, and under Actions, choose Delete job.
  3. In Query Editor, use the three dots next to the name of the table weather_data and choose Drop table.
  4. In JupyterLab, in the File Browser sidebar, choose (right-click) your notebook and choose Delete.
  5. In Files, choose the folder raw_data and under Actions, choose Delete. Repeat the steps for the folders processed_data and output.

Conclusion

In this post, you learned how you can use the visual workflow experience in Amazon SageMaker Unified Studio to build end-to-end data processing pipelines through an intuitive, no-code interface. This experience removes the need to write orchestration logic manually while still offering production-grade reliability and scalability powered by Amazon MWAA Serverless. Whether you’re processing weather data for agricultural insights or building more complex machine learning pipelines, the visual workflow experience accelerates development and makes workflow automation accessible to data engineers, analysts, and data scientists alike.As organizations increasingly rely on automated data pipelines to drive business decisions, the visual workflow experience provides the perfect balance of simplicity and power. We encourage you to explore this new capability in Amazon SageMaker Unified Studio and discover how it can transform your data processing workflows.

To learn more, visit the Amazon SageMaker Unified Studio page.


About the authors

Suba Palanisamy

Suba Palanisamy

Suba is an Enterprise Support Lead, helping customers achieve operational excellence on AWS. Suba is passionate about all things data and analytics. She enjoys traveling with her family and playing board games.

Vinod Jayendra

Vinod Jayendra

Vinod is an Enterprise Support Lead in ISV accounts at Amazon Web Services, where he helps customers in solving their architectural, operational, and cost optimization challenges. With a particular focus on Serverless & Analytics technologies, he draws from his extensive background in application development to deliver top-tier solutions. Beyond work, he finds joy in quality family time, embarking on biking adventures, and coaching youth sports team.

Kamen Sharlandjiev

Kamen Sharlandjiev

Kamen is a Senior Worldwide Specialist SA, Big Data expert. He’s on a mission to make life easier for customers who are facing complex data integration and orchestration challenges. His secret weapon? Fully managed AWS services that can get the job done with minimal effort. Follow Kamen on LinkedIn to keep up to date with the latest MWAA and AWS Glue features and news!

Yuhang Huang

Yuhang Huang

Yuhang is a Software Development Manager on the Amazon SageMaker Unified Studio team. He leads the engineering team to design, build, and operate scheduling and orchestration capabilities in SageMaker Unified Studio. In his free time, he enjoys playing tennis.

Vasudevan Venkataramanan

Vasudevan Venkataramanan

Vasudevan is a Senior Software Engineer on the Amazon SageMaker Unified Studio team. He is responsible for technical direction of scheduling and orchestration within SageMaker Unified Studio. Outside of his professional work, he enjoys spending time with his kid and playing pickleball and cricket.

Gal Heyne

Gal Heyne

Gal is a Senior Technical Product Manager for AWS Analytics services with a strong focus on AI/ML and data engineering. She is passionate about developing a deep understanding of customers’ business needs and collaborating with engineers to design simple-to-use data products.

AWS Secrets Manager launches Managed External Secrets for Third-Party Credentials

Post Syndicated from Rohit Panjala original https://aws.amazon.com/blogs/security/aws-secrets-manager-launches-managed-external-secrets-for-third-party-credentials/

Although AWS Secrets Manager excels at managing the lifecycle of Amazon Web Services (AWS) secrets, managing credentials from third-party software providers presents unique challenges for organizations as they scale usage of their cloud applications. Organizations using multiple third-party services frequently develop different security approaches for each provider’s credentials because there hasn’t been a standardized way to manage them. When storing these third-party credentials in Secrets Manager, organizations frequently maintain additional metadata within secret values to facilitate service connections. This approach requires updating entire secret values when metadata changes and implementing provider-specific secret rotation processes that are manual and time consuming. Organizations looking to automate secret rotation typically develop custom functions tailored to each third-party software provider, requiring specialized knowledge of both third-party and AWS systems.

To help customers streamline third-party secrets management, we’re introducing a new feature in AWS Secrets Manager called managed external secrets. In this post, we explore how this new feature simplifies the management and rotation of third-party software credentials while maintaining security best practices.

Introducing managed external secrets

AWS Secrets Manager has a proven track record of helping customers secure and manage secrets for AWS services such as Amazon Relational Database Service (Amazon RDS) or Amazon DocumentDB through managed rotation capabilities. Building on this success, Secrets Manager now introduces managed external secrets, a new secret type that extends this same seamless experience to third-party software applications like Salesforce, simplifying secret management challenges through standardized formats and automated rotation.

You can use this capability to store secrets vended by third-party software providers in predefined formats. These formats were developed in collaboration with trusted integration partners to define both the secret structure and required metadata for rotation, eliminating the need for you to define your own custom storage strategies. Managed external secrets also provides automated rotation by directly integrating with software providers. With no rotation functions to maintain, you can reduce your operational overhead while benefiting from essential security controls, including fine-grained permissions management using AWS Identity and Access Management (IAM), secret access monitoring through Amazon CloudWatch and AWS CloudTrail, and automated secret-specific threat detection through Amazon GuardDuty. Moreover, you can implement centralized and consistent secret management practices across both AWS and third-party secrets from a single service, eliminating the need to operate multiple secrets management solutions at your organization. Managed external secrets follows standard Secrets Manager pricing, with no additional cost for using this new secret type.

Prerequisites

To create a managed external secret, you need an active AWS account with appropriate access to Secrets Manager. The account must have sufficient permissions to create and manage secrets, including the ability to access the AWS Management Console or programmatic access through the AWS Command Line Interface (AWS CLI) or AWS SDKs. At minimum, you need IAM permissions for the following actions: secretsmanager:DescribeSecret, secretsmanager:GetSecretValue, secretsmanager:UpdateSecret, and secretsmanager:UpdateSecretVersionStage.

You must have valid credentials and necessary access permissions for the third-party software provider you plan to have AWS manage secrets for.

For secret encryption, you must decide whether to use an AWS managed AWS Key Management Service (AWS KMS) key or a customer managed KMS key. For customer managed keys, make sure you have the necessary key policies configured. The AWS KMS key policy should allow Secrets Manager to use the key for encryption and decryption operations.

Create a managed external secret

Today, managed external secrets supports three integration partners: Salesforce, Snowflake, and BigID. Secrets Manager will continue to expand its partner list and more third-party software providers will be added over time. For the latest list, refer to Integration Partners.

To create a managed external secret, follow the steps in the following sections.

Note: This example demonstrates the steps for retrieving Salesforce External Client App Credentials, but a similar process can be followed for other third-party vendor credentials integrated with Secrets Manager.

To select secret type and add details:

  1. Go to the Secrets Manager service in the AWS Management Console and choose Store a new secret.
  2. Under Secret type, select Managed external secret.
  3. In the AWS Secrets Manager integrated third-party vendor credential section, select your provider from the available options. For this walkthrough, we select Salesforce External Client App Credential.
  4. Enter your configurations in the Salesforce External Client App Credential secret details section. The Salesforce External Client App credentials consist of several key components:
    1. The Consumer key (client ID), which serves as the credential identifier for OAuth 2.0. You can retrieve the consumer key directly from the Salesforce External Client App Manager OAuth settings.
    2. The Consumer secret (client secret), which functions as the private password for OAuth 2.0 authentication. You can retrieve the consumer secret directly from the Salesforce External Client App Manager OAuth settings.
    3. The Base URI, which is your Salesforce org’s base URL (formatted as https://MyDomainName.my.salesforce.com), is used to interact with Salesforce APIs.
    4. The App ID, which identifies your Salesforce External Client Apps (ECAs) and can be retrieved by calling the Salesforce OAuth usage endpoint.
    5. The Consumer ID, which identifies your Salesforce ECA, can be retrieved by calling the Salesforce OAuth credentials by App ID endpoint. For a list of commands, refer to Stage, Rotate, and Delete OAuth Credentials for an External Client App in the Salesforce documentation.
  5. Select the Encryption key from the dropdown menu. You can use an AWS managed KMS key or a customer managed KMS key.
  6. Choose Next.
Figure 1: Choose secret type

Figure 1: Choose secret type

To configure a secret:

  1. In this section, you need to provide information for your secret’s configuration.
  2. In Secret name, enter a descriptive name and optionally enter a detailed Description that helps identify the secret’s purpose and usage. You also have additional configuration choices available: you can attach Tags for better resource organization, set specific Resource permissions to control access, and select Replicate secret for multi-Region resilience.
  3. Choose Next.
Figure 2: Configure secret

Figure 2: Configure secret

To configure rotation and permissions (optional):

In the optional Configure rotation step, the new secret configuration introduces two key sections focused on metadata management, which are stored separately from the secret value itself.

  1. Under Rotation metadata, specify the API version your Salesforce app is using. To find the API version, refer to List Available REST API Versions in the Salesforce documentation. Note: The minimum version needed is v65.0.
  2. Select an Admin secret ARN, which contains the administrative OAuth credentials that are used to rotate the Salesforce client secret.
  3. In the Service permissions for secret rotation section, Secrets Manager automatically creates a role with necessary permissions to rotate your secret values. These default permissions are transparently displayed in the interface for review when you choose view permission details. You can deselect the default permissions for more granular control over secret rotation management.
  4. Choose Next.
Figure 3: Configure rotation

Figure 3: Configure rotation

To review:

In the final step, you’ll be presented with a summary of your secret’s configuration. On the Review page, you can verify parameters before proceeding with secret creation.

After confirming that the configurations are correct, choose Store to complete the process and create your secret with the specified settings.

Figure 4: Review

Figure 4: Review

After successful creation, your secret will appear on the Secrets tab. You can view, manage, and monitor aspects of your secret, including its configuration, rotation status, and permissions. After creation, review your secret configuration, including encryption settings and resource policies for cross-account access, and examine the sample code provided for different AWS SDKs to integrate secret retrieval into your applications. The Secrets tab provides an overview of your secrets, allowing for central management of secrets. Select your secret to view Secret details.

Figure 5: View secret details

Figure 5: View secret details

Your managed external secret has been successfully created in Secrets Manager. You can access and manage this secret through the Secrets Manager console or programmatically using AWS APIs.

Onboard as an integration partner with Secrets Manager

With the new managed external secret type, third-party software providers can integrate with Secrets Manager and offer their customers a programmatic way to securely manage secrets vended by their applications on AWS. This integration provides their customers with a centralized solution for managing both the lifecycle of AWS and third-party secrets, including automatic rotation capabilities from the moment of secret creation. Software providers like Salesforce are already using this capability.

“At Salesforce, we believe security shouldn’t be a barrier to innovation, it should be an enabler. Our partnership with AWS on managed external secrets represents security-by-default in action, removing operational burden from our customers while delivering enterprise-grade protection. With AWS Secrets Manager now extending to partners and automated zero-touch rotation eliminating human risk, we’re setting a new industry standard where secure credentials become seamless without specialized expertise or additional costs.” — Jay Hurst, Sr. Vice President, Product Management at Salesforce

There is no additional cost to onboard with Secrets Manager as an integration partner. To get started, partners must follow the process listed on the partner onboarding guide. If you have questions about becoming an integration partner, contact our team at [email protected] with Subject: [Partner Name] Onboarding request.

Conclusion

In this post, we introduced managed external secrets, a new secret type in Secrets Manager that addresses the challenges of securely managing the lifecycle of third-party secrets through predefined formats and automated rotation. By eliminating the need to define custom storage strategies and develop complex rotation functions, you can now consistently manage your secrets—whether from AWS services, custom applications, or third-party providers—from a single service. Managed external secrets provide the same security features as standard Secrets Manager secrets, including fine-grained permissions management, observability, and compliance controls, while adding built-in integration with trusted partners at no additional cost.

To get started, refer to the technical documentation. For information on migrating your existing partner secrets to managed external secrets, see Migrating existing secrets. The feature is available in all AWS Regions where AWS Secrets Manager is available. For a list of Regions where Secrets Manager is available, see the AWS Region table. If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on Secrets Manager re:Post or contact AWS Support.

Rohit Panjala

Rohit is a Security Specialist at AWS, focused on data protection and cryptography services. He is responsible for developing and implementing go-to-market (GTM) strategies and driving customer and partner adoptions for AWS data protection services on a global scale. Before joining AWS, he worked in security product management at IBM and electrical engineering roles. He holds a BS in engineering from The Ohio State University.

Rochak Karki

Rochak is a Security Specialist Solutions Architect at AWS, focusing on threat detection, incident response, and data protection to help customers build secure environments. Rochak is a US Army veteran and holds a BS in engineering from the University of Wyoming. Outside of work, he enjoys spending time with family and friends, hiking, and traveling.

How potential performance upside with AWS Graviton helps reduce your costs further

Post Syndicated from Markus Adhiwiyogo original https://aws.amazon.com/blogs/compute/how-potential-performance-upside-with-aws-graviton-helps-reduce-your-costs-further/

Amazon Web Services (AWS) provides many mechanisms to optimize the price performance of workloads running on Amazon Elastic Compute Cloud (Amazon EC2), and the selection of the optimal infrastructure to run on can be one of the most impactful levers. When we started building the AWS Graviton processor, our goal was to optimize AWS Graviton features and capabilities to deliver a processor that provides the best price performance across a broad array of cloud workloads running on Amazon EC2. That goal continues to be our guiding principle, and today customers who adopt AWS Graviton-based EC2 instances see up to 40% better price performance on their cloud workloads when compared to equivalent non-Graviton EC2 instances. The price performance improvement is the result of both the performance improvement and the lower price in using AWS Graviton-based instances.

Price performance blends the cost of infrastructure with the amount of work you can achieve with infrastructure usage. After talking to many AWS Graviton customers, we’ve learned that the cost savings go beyond the lower AWS Graviton-based instances price. Many AWS Graviton customers told us that the performance increase from AWS Graviton allows them to consume fewer computing hours than comparable non-Graviton instances for equivalent workload throughput. In turn, this leads to further cost reduction.

The following are some of examples from our customers:

  • Pinterest achieved 47% cost savings and 38% savings on compute resources while reducing carbon emissions by 62% for its web API workload.
  • SAP powers its SAP HANA Cloud with AWS Graviton to enhance its price performance by 35% while lowering carbon impact by 45%.
  • Sprinklr improved their machine learning (ML) inference workloads’ throughput by up to 20% while reducing costs by up to 25%.

You can find more customer examples in the AWS Graviton testimonials page.

To help organizations capture similar benefits, we’ve enhanced the AWS Graviton Savings Dashboard (GSD) with new features that account for both pricing and performance improvements. In the following section we explore these new capabilities and how they can help optimize your infrastructure costs.

Understanding performance-driven cost optimization in the GSD

The GSD helps organizations identify ideal workloads for AWS Graviton migration through automated resource matching and data-driven visualizations. You can learn the GSD details and setup in this AWS compute post.

Although the dashboard has traditionally focused on calculating direct cost savings from the AWS Graviton pricing advantages, we’ve observed that customers often experience more benefits when their applications perform more efficiently on AWS Graviton processors, leading to decreased compute resource usage. To better reflect these real-world scenarios, we’ve enhanced the dashboard with new features highlighting Normalized Instance Hours (NIH) analysis capabilities so that you can model potential savings based on both pricing benefits and compute hour reductions. Although this tool helps estimate potential savings, actual performance improvements can only be determined by testing your specific workloads on AWS Graviton instances. Performance is always workload and use case specific, so we encourage you to test your AWS Graviton-based workloads using the Optimization and Performance Runbook to help you determine the actual possible NIH percent reduction.

Key dashboard components

This section outlines the following three key dashboard components: NIH reduction analysis, enhanced cost analysis visualizations, and detailed savings analysis.

NIH reduction analysis

The dashboard now features a new slider that lets you model potential cost savings by inputting the percentage reduction in NIH. Many organizations have found it challenging to calculate their total possible savings since the benefits come from two sources: the lower instance pricing of AWS Graviton and the reduced compute hours.

You can use the slider to model different cost scenarios by adjusting a theoretical NIH reduction between 0% and 40%. You can use this slider to input NIH reductions validated through your workload testing, model the combined impact of both pricing benefits and reduced compute hours, and explore different scenarios to help prioritize which workloads to test first.

Figure 1: NIH slider location

Figure 1: NIH slider location

Assume that your testing shows that your workload runs just as effectively with 15% fewer normalized instance hours on AWS Graviton. You can now plug that exact number into the slider to see your modeled savings combining both pricing differences and compute hour reductions. Although we’ve heard success stories of significant reductions from customers, we recommend starting your initial estimate with a conservative 10% baseline and adjusting based on your own testing results.

Enhanced cost analysis visualizations

The dashboard presents key visualizations that demonstrate the direct relationship between NIH reduction and cost savings. First, you see the Potential Graviton Base Savings from pricing differences alone. In the following diagram, we can observe an example of $61.54K of cost savings from migrating to equivalent AWS Graviton instances. Next, the Estimated Additional Savings Due to Performance in the same diagram shows $42.40K in savings if your performance testing confirms a 15% NIH reduction in your workload. Finally, the dashboard sums these two values into the Total Potential Graviton Savings of $103.94K. The Total Potential Graviton Savings helps visualize how both pricing benefits and any validated compute hour reductions could contribute to your overall savings.

Figure 2: Visualization with relationship between NIH reduction and cost savings

Figure 2: Visualization with relationship between NIH reduction and cost savings

The Amortized Cost Breakdown and Normalized Instance Hrs Breakdown charts in the following figure show 6-month historical trends, helping you spot patterns such as seasonal spikes or high-usage periods. These patterns can help you identify where even small efficiency improvements might yield significant savings, for example, workloads with consistently high usage or predictable peak periods that would be good candidates for testing.

Figure 3: Amortized Cost, NIH, and Total Potential Savings Breakdown charts

Figure 3: Amortized Cost, NIH, and Total Potential Savings Breakdown charts

Detailed savings analysis

Building on our commitment to help customers optimize cloud costs, we’ve enhanced the Potential Graviton Savings Details table with two columns focused on performance-based savings modeling. The Estimated Additional Savings Due to Performance column shows the modeled savings based on your chosen NIH reduction percentage, while Total Potential Graviton Savings combines this with the base pricing benefits.

Figure 4: Potential Graviton Savings Details table

Figure 4: Potential Graviton Savings Details table

As you examine your current instance family, you can observe both baseline AWS Graviton savings and these added saving opportunities clearly laid out in a comprehensive breakdown. The analysis presents your total savings potential in both dollar amounts and percentages. This allows you to build a compelling business case for migration. Although this detailed breakdown provides valuable planning insights, remember that actual savings may vary depending on your specific workload patterns, implementation approaches, and operational considerations.

Conclusion

The Graviton Savings Dashboard (GSD) serves as a powerful analytics tool that streamlines your journey to cost-effective cloud computing. The GSD provides clear visualizations and interactive features to help you understand and maximize potential savings when migrating to AWS Graviton-based instances. To further explore the new features, navigate to the GSD interactive demo, where you can model an example of potential savings using the NIH reduction slider and detailed cost breakdowns.

Ready to explore how AWS Graviton can transform your infrastructure costs? Visit the GSD page to deploy or update your GSD dashboard. Access implementation guides, such as the CFM Technical Implementation Playbook (CFM TIPs), and start optimizing your cloud spend today with the enhanced capabilities of the GSD.

Over 85,000 AWS customers have discovered the benefits of AWS Graviton, with many completing their adoptions in just hours. We have created this resource guide so that you can accelerate your AWS Graviton adoption with minimal effort and enjoy significant price performance benefits.

“What I always tell customers is one week, one application, one engineer, and see what you can do. They always are pleasantly surprised by how much progress they can make. If you’re out there and you haven’t yet moved to AWS Graviton, what are you waiting for? Let’s make it happen!”

Dave Brown, VP, AWS Compute & ML Services

Important note about performance testing
The GSD does not attempt to estimate the potential NIH percent reduction or your workload’s performance when transitioned to AWS Graviton. You can use it to perform what-if analysis of your potential savings for a projected NIH percent reduction. In the absence of this variable, GSD only considers the price delta between instance types and misses an important contributor to the overall savings potential of AWS Graviton from the performance upside. Compute performance is always workload and use case specific, so we encourage you to test your AWS Graviton-based workloads using the Optimization and Performance Runbook to help you determine the actual possible NIH percent reduction.

Enhancing API security with Amazon API Gateway TLS security policies

Post Syndicated from Anton Aleksandrov original https://aws.amazon.com/blogs/compute/enhancing-api-security-with-amazon-api-gateway-tls-security-policies/

As compliance frameworks evolve and cryptographic standards advance, organizations are looking for additional controls to improve their cloud security posture. One of the neccesary controls is a more granular TLS configuration, for example when regulatory requirements mandate disabling older ciphers like CBC or enforcing TLS 1.3 as a minimum version.

In this post, you will learn how the new Amazon API Gateway’s enhanced TLS security policies help you meet standards such as PCI DSS, Open Banking, and FIPS, while strengthening how your APIs handle TLS negotiation. This new capability increases your security posture without adding operational complexity, and provides you with a single, consistent way to standardize TLS configuration across your API Gateway infrastructure.

Overview

Previously, API Gateway offered limited control over TLS configuration, and only for custom domain names. Default endpoints used fixed security policies, which meant you often had to introduce additional infrastructure, such as custom Amazon CloudFront distributions, to meet your organization’s security or compliance requirements.

With this launch, you can configure TLS behavior directly on all REST API endpoint types, including Regional, edge-optimized, and private, and apply consistent TLS settings across both your APIs and their custom domain names. You can choose from predefined enhanced security policies to enforce the minimum TLS versions and cipher suites that your workloads require. For example, you can enforce TLS 1.3, use hardened TLS 1.2 without CBC ciphers, adopt FIPS-aligned suites for government workloads, or prepare for the future with policies that include post-quantum cryptography (PQC). The new security policies provide finer-grained control without adding operational complexity, helping you align your APIs with evolving security and compliance expectations.

Understanding API Gateway security policies

A security policy in API Gateway is a predefined combination of a minimum TLS version and a curated set of cipher suites. When a client connects to your REST API or custom domain name, API Gateway uses the selected policy to determine which protocol versions and ciphers it will accept during the TLS handshake. This gives you a predictable and enforceable way to control how clients establish encrypted connections to your APIs.

API Gateway supports two categories of security policies. Legacy policies, such as TLS_1_0 or TLS_1_2, remain available for backwards compatibility. Enhanced policies, identified by the SecurityPolicy_* prefix, provide stricter and more modern controls for regulated workloads, advanced governance, or cryptographic hardening. When you use an enhanced policy, you must also specify an endpoint access mode, which adds additional validation for how traffic reaches your API, as described in the following sections.

Enhanced policies follow a consistent naming patterns that helps you quickly understand what each policy enforces. For example, for REGIONAL and PRIVATE endpoint types, the following pattern applies:

SecurityPolicy_[TLS-Versions]_[Variant]_[YYYY-MM]

From this structure, you can identify the minimum TLS versions supported, any specialized cryptographic variants (such as FIPS, PFS, or PQ), and the release date of the policy. For example, SecurityPolicy_TLS13_1_3_2025_09 accepts only TLS 1.3 traffic, while SecurityPolicy_TLS13_1_2_PFS_PQ_2025_09 supports TLS 1.2 as lowest and TLS 1.3 as highest TLS version with forward secrecy and post-quantum enhancements.

Each policy maps to a curated combination of ciphers. For instance, SecurityPolicy_TLS13_1_3_2025_09 accepts only three TLS 1.3 cipher suites (TLS_AES_128_GCM_SHA256, TLS_AES_256_GCM_SHA384, and TLS_CHACHA20_POLY1305_SHA256) and rejects any other protocol versions or ciphers. For a full list of supported policies and ciphers, and naming pattern for the EDGE endpont type, see the API Gateway documentation.

How security policies apply to default endpoints and custom domains

You can use API Gateway to attach different security policies to your default API endpoint and custom domain names. During TLS negotiation, API Gateway selects the policy based on the Server Name Indication (SNI) value in the client’s TLS handshake, not the HTTP Host header. This means the policy depends on the hostname the client uses when initiating TLS.

For example, if a client connects directly to your default endpoint, such as:

https://abcdef1234.execute-api.us-east-1.amazonaws.com

API Gateway uses the policy attached to that default endpoint because the SNI value matches its hostname.

If the client instead connects through a custom domain name, such as:

https://api.example.com

API Gateway uses the policy attached to that custom domain. In this case, the SNI value api.example.com determines which policy is enforced.

This distinction is important even if you disable your default endpoint. TLS negotiation always occurs before API Gateway evaluates endpoint settings, so the default endpoint security policy still applies to clients that connect directly to its hostname. To avoid unexpected client behavior, you should keep the API and its custom domain name aligned with the same security policy whenever possible.

Understanding endpoint access mode

When you use an enhanced security policy (SecurityPolicy_*), you must also specify an endpoint access mode. Endpoint access mode defines how strictly API Gateway validates the network path a request takes before it reaches your API. This gives you an additional layer of governance and helps you prevent unauthorized or misrouted traffic.

You can choose between two modes:

  • BASIC mode provides standard API Gateway behavior. It is the recommended starting point when you migrate an existing API to an enhanced security policy. Clients can continue reaching your API as they do today, without additional validation.
  • STRICT mode adds enforcement checks to ensure that requests originate from the correct endpoint type, and TLS negotiation aligns with your configuration.

When you enable STRICT mode, API Gateway performs additional validations, such as:

  • The SNI and HTTP Host header values match
  • The request originates from the same endpoint type as your API (Regional, edge-optimized, or private)

If any of these validations fail, API Gateway rejects the request. STRICT is a viable choice when you need stronger security guarantees, such as when running regulated or sensitive workloads. See API Gateway documentation for additional details.

When you switch from BASIC to STRICT mode, it takes up to 15 minutes for the change to fully propagate. Your API remains available during this period. If your endpoint access mode is set to STRICT, you cannot change the endpoint type until you revert the mode back to BASIC.

Applying security policies to new and existing APIs

You can apply a security policy when you create a new REST API or custom domain name, or update an existing resource to use one of the enhanced SecurityPolicy_* options. When migrating existing APIs, the recommended approach is to start with BASIC mode, validate client behavior (SNI and HTTP Host header values match, request originates from the same endpoint type as your API), and then move to STRICT mode once you confirm compatibility.

The following code snippets illustrate how to apply security policies to different scenarios:

Create a REST API with a security policy and STRICT endpoint access mode

You can attach a security policy directly during API creation, removing the need for extra infrastructure just to control TLS negotiation.

aws apigateway create-rest-api \
  --name "your-private-api-name" \
  --endpoint-configuration '{"types":["PRIVATE"]}' \
  --security-policy "SecurityPolicy_TLS13_1_3_2025_09" \
  --endpoint-access-mode STRICT \
  --policy file://api-policy.json

Create a custom domain name with a security policy and STRICT endpoint access mode

You can also specify the security policy when creating a custom domain name. API Gateway applies the selected policy during TLS negotiation based on the SNI value the client provides.

aws apigateway create-domain-name \
  --domain-name api.example.com \
  --regional-certificate-arn arn:aws:acm:region:account-id:certificate/certificate-id \
  --endpoint-configuration '{"types":["REGIONAL"]}' \
  --security-policy SecurityPolicy_TLS13_1_3_2025_09 \
  --endpoint-access-mode STRICT

Updating existing REST API

If you are migrating an existing API, start by applying the enhanced security policy with BASIC mode. After confirming that your clients can connect with BASIC mode as expected, proceed to enable the STRICT mode.

1. Apply the new policy with BASIC mode

aws apigateway update-rest-api --rest-api-id abcd123 --patch-operations '[
    {
         "op": "replace",
         "path": "/securityPolicy",
         "value": "SecurityPolicy_TLS13_1_3_2025_09"
    },
    {
         "op": "replace",
         "path": "/endpointAccessMode",
         "value": "BASIC"
     }
]'

Verify your clients can consume the API as expected using access logs and performance metrics in Amazon CloudWatch.

2. Enable the STRICT mode after validation

aws apigateway update-rest-api --rest-api-id abcd123 --patch-operations '[
    {
        "op": "replace",
        "path": "/endpointAccessMode",
        "value": "STRICT"
     }
]'

Updating existing custom domain name

Custom domain names follow the same migration approach as REST APIs.

1. Apply the new policy with BASIC mode and validate clients can successfully connect.

aws apigateway update-domain-name --domain-name api.example.com --patch-operations '[
    {
        "op": "replace",
        "path": "/securityPolicy",
        "value": "SecurityPolicy_TLS13_1_3_2025_09"
    },
    {
        "op": "replace",
        "path": "/endpointAccessMode",
        "value": "BASIC"
     }
]'

2. Enable the STRICT mode after validation

aws apigateway update-domain-name --domain-name api.example.com --patch-operations '[
    {
        "op": "replace",
        "path": "/endpointAccessMode",
        "value": "STRICT"
     }
]'

After you update your REST API or custom domain configuration, redeploy your API so that stages receive the new settings. When you change a security policy, the update takes up to 15 minutes to complete. The API status appears as UPDATING while the change propagates and returns to AVAILABLE when complete. Your API remains fully functional throughout this process.

Rolling back endpoint access mode

If you notice clients failing to connect to your API after applying the STRICT mode, you can revert the endpoint access mode back to BASIC at any time. Below code snippet illustrates doing this for a REST API.

aws apigateway update-rest-api --rest-api-id abcd123 --patch-operations '[
    {
      "op": "replace",
      "path": "/endpointAccessMode",
      "value": "BASIC"
    }
  ]'

You can use the same approach to update a custom domain name.

Monitoring TLS usage and policy migrations

As you adopt enhanced security policies, it is important to understand how clients negotiate encrypted connections with your API. Monitoring helps you verify client readiness, identify legacy consumers that may require updates, and validate that STRICT endpoint access mode behaves as expected during rollout. Use the following API Gateway access logs variables to monitor protocol and cipher usage over time.

  • $context.tlsVersion – the negotiated TLS version
  • $context.cipherSuite – the cipher suite selected during the handshake

You can use these variables to confirm that:

  • Clients are using the expected minimum TLS version
  • BC-based ciphers are no longer used after you move to a hardened policy
  • PQC and FIPS-aligned policies are being exercised by the appropriate clients

Access logs are especially useful during migrations, where validating the actual client behavior is a prerequisite before enabling STRICT mode. For example, if you still observe live clients negotiating TLS 1.0 or TLS 1.2 CBC ciphers after applying a hardened policy in BASIC mode, you can identify the affected clients and plan remediation before switching to STRICT mode.

Future-proof security configurations

Some of the new policies combine TLS 1.3 with post-quantum cryptography (PQC) to help you prepare for a future where quantum-capable threat actors exist. With these policies you can start testing and adopting quantum-resistant algorithms without redesigning your API architecture.

As standards evolve and new cipher suites are introduced, API Gateway’s policy model provides you with a clear path for adding new variants while keeping your configuration simple and predictable.

Conclusion and next steps

Enhanced TLS security policies and endpoint access mode in the Amazon API Gateway gives you direct control over how clients establish secure connections to your APIs. You can choose the policies that match your compliance needs, such as PCI DSS, FIPS, Open Banking, PQC, and use STRICT mode to control how traffic reaches your endpoints and apply additional domain-level validations, further hardening security of your APIs

To get started:

  1. Review the list of available security policies in the API Gateway documentation.
  2. Identify which REST APIs and domains require stronger TLS controls.
  3. Apply an appropriate SecurityPolicy-* policy with BASIC mode.
  4. Validate client behavior using access logs and CloudWatch metrics.
  5. Move to STRICT mode when you are ready to enforce additional connection-level protection.

For more information about building Serverless architectures, see ServerlessLand.com

Improving throughput of serverless streaming workloads for Kafka

Post Syndicated from Anton Aleksandrov original https://aws.amazon.com/blogs/compute/improving-throughput-of-serverless-streaming-workloads-for-kafka/

Event-driven applications often need to process data in real-time. When you use AWS Lambda to process records from Apache Kafka topics, you frequently encounter two typical requirements: you need to process very high volumes of records in close to real-time, and you want your consumers to have the ability to scale rapidly to handle traffic spikes. Achieving both necessitates understanding how Lambda consumes Kafka streams, where the potential bottlenecks are, and how to optimize configurations for high throughput and best performance.

In this post, we discuss how to optimize Kafka processing with Lambda for both high throughput and predictable scaling. We explore the Lambda’s Kafka Event Source Mappings (ESMs) scaling, optimization techniques available during record consumption, how to use ESM Provisioned Mode for bursty workloads, and which observability metrics you need to use for performance optimization.

Overview

To start processing records from a Kafka topic with a Lambda function, whether using Amazon Managed Streaming for Apache Kafka (Amazon MSK) or a self-managed Kafka cluster, you create an ESM: a lightweight serverless resource that consumes records from Kafka topics and invokes your function.

The scaling behavior of Kafka ESMs is based on the offset lag. This is a metric indicating the number of records in the topic that have not yet been consumed by the Lambda function. This metric typically grows when producers publish new records faster than consumers process them. As the lag grows, the Lambda service gradually adds more Kafka consumers (also known as pollers) to your ESM. To preserve ordering guarantees, the maximum number of pollers is capped by the number of partitions in the topic. Lambda also scales pollers down automatically when lag decreases.

Each ESM follows a consistent polling workflow: poll -> filter -> batch -> invoke, as shown in the following diagram. Every stage has configurable options that directly affect performance, latency, and cost.


Figure 1. ESM processing workflow.

Polling: Increasing predictability with Provisioned Mode

By default, Kafka ESM uses the on-demand polling mode. In this mode, ESM starts with one poller, automatically adds more pollers when the offset lag grows, and scales the number of pollers down as lag decreases. On-demand mode does not need upfront scaling configuration and is the lowest-cost option for steady workloads. For many applications, this behavior is sufficient: scaling up can take several minutes, but the throughput eventually catches up, and you only pay for the resources you use, such as number of invocations.

However, if your workloads are bursty and latency-sensitive, then on-demand scaling may not be fast enough and can result in a rapidly growing lag. This can be addressed by switching to Provisioned Mode, which gives you more fine-grained control to configure a minimum and maximum number of always-on pollers for your Kafka ESM. These pollers remain connected even when traffic is low, so consumption begins immediately when a spike occurs, and scaling within the configured range is faster and more predictable.

The following diagram shows the performance improvements of using the ESM in Provisioned Mode for bursty workloads. You can see that in on-demand mode it took ESM over 15 minutes to eventually catch up to the new traffic volume, while in Provisioned Mode the ESM handled the traffic increase instantly.


Figure 2. Comparing Kafka ESM on-demand and Provisioned Mode.

Best practices for using Provisioned Mode:

  • Start small: Provisioned Mode is a paid capability. AWS recommends that for smaller topics (less than 10 partitions) you start with a single provisioned poller to evaluate throughput and observe workload behavior. For larger topics, you can start with a higher number of provisioned pollers to accommodate the baseline consumption. You can adjust this configuration at any time as you learn traffic patterns and refine your performance targets.
  • Estimate throughput: A single provisioned poller can process up to 5 MB/s of Kafka data. Monitor your average record size and per-record processing time to establish a baseline for minimum and maximum pollers, then validate with real workload metrics.
  • Set a low floor and flexible ceiling: Choose a minimum number of pollers that makes sure that latency targets are met when a traffic burst occurs, then allow the ESM to scale toward a higher maximum as needed.

See Low latency processing for Kafka event sources for more information.

To summarize:

  • Use Provisioned Mode for bursty traffic, strict SLOs, or when backlogs pose downstream risk.
  • Use on-demand polling mode for steady traffic, flexible latency requirements, or when minimizing cost is the primary objective.

Filtering: Drop irrelevant records early

By default, all records from Kafka are delivered to your Lambda function. This approach is direct and flexible. Your handler code decides which records to process and which to ignore. This default behavior is highly efficient for workloads where nearly all records are valuable.

When you find yourself discarding a large portion of records in your handler code, you can use native ESM filtering capabilities to drop irrelevant records before they reach your function. You can filter early to reduce cost, free up concurrency, increase throughput, and make sure that your Lambda function spends cycles on valuable work only.

The following diagram shows the application of an ESM filter to only process telemetry that meets a specified condition.


Figure 3. ESM filtering configuration.

Batching: Processing more records per invocation

You can batch multiple Kafka records together to process more data per invocation and increase the efficiency of your Lambda functions. Larger batches help you achieve higher throughput and reduce costs by making better use of each invocation run. To get the best results, you should balance batch size and latency targets and adjust the configuration based on your workload’s specific traffic patterns and SLOs.

Lambda gives you two primary controls for configuring ESM batching behavior:

  • Batch window: This is how long the ESM waits to accumulate records before invoking your function. A shorter window produces smaller batches and more frequent invocations. A longer window (up to 5 minutes) produces larger batches and less frequent invocations.
  • Batch size: This is the maximum number of records that the ESM can accumulate before invoking your function, up to 10,000.

There’s no single setting that universally works for all workloads. Your optimal configuration depends on workload characteristics such as latency tolerance and record size. AWS recommends starting with the default values and then gradually adjusting the configuration based on your requirements. For example, you can increase the batch size while monitoring function duration, error rates, and end-to-end latency.

The following diagram shows how to configure batch window and size using Terraform:


Figure 4. ESM batch window and batch size configuration with Terraform.

The ESM invokes your function when one of the following three conditions is met:

  1. The batch window elapses.
  2. The accumulated batch reaches the configured maximum batch size.
  3. The accumulated payload approaches the 6 MB maximum invocation payload limit of Lambda.

When using higher batch window values during traffic spikes, you typically see more records-per-batch and longer function invocation durations. This is normal: larger batches can take longer to process. Always interpret the Duration metric in the context of the batch size being processed.

Invoke: Process each batch faster and more efficiently

You control how quickly each batch completes through two main factors: the efficiency of your function code and the compute resources you allocate to your functions. You can improve both to process more records per second, reduce the necessary concurrency, and lower cost.

Optimize your code: Review your function handler code to identify where you can reduce work per record. For example, eliminate redundant serialization, initialize dependencies once during function startup, and consider parallel processing within the handler (where applicable). For performance-critical workloads, you can also choose languages that compile to binary, such as Go or Rust, which typically deliver high performance with lower resource usage.

Tune compute resources: Increasing the memory function allocation proportionally increases vCPU. Use the Lambda PowerTuning tool to find the memory configuration that best balances performance and cost for your workload.

Correlate metrics: As you optimize, monitor Duration and Concurrency. You should see the concurrency drop as duration improves. That correlation confirms that your changes are improving the system throughput and efficiency.

When you combine handler optimizations with early filtering and efficient batching, even small improvements can make your pipeline noticeably faster to operate under load.

Observability drives good decisions

You can’t optimize what you can’t see. To tune your data processing pipeline, use a combination of OffsetLag, function invocation metrics, and Kafka broker metrics to understand your data processing performance. OffsetLag tells you whether your function is keeping up with incoming records, as shown in the following figure. Function metrics such as Duration, Concurrency, Errors, and Throttles show how efficiently your code is processing record batches. If you use Provisioned Mode, then you can use the Provisioned Pollers metric to track the poller capacity.


Figure 5. Kafka consumption observability with Amazon CloudWatch.

Always interpret function duration in the context of batch size. During traffic spikes, you can typically observe both duration and actual batch size increase, which is expected amortization, not a regression. For alerting, monitor lag growth, unexpected drops in invocation rate, and error spikes. With these signals in place, you can detect issues early and tune your configuration with confidence.

A sample step-by-step optimization loop

  1. Establish a clean baseline: Make your handler idempotent and batch-aware, start with a short batch window and moderate batch size. Monitor your ESM and confirm offset lag stays near zero at steady state.
  2. Filter early: Move static checks (record type, version, other custom properties) into ESM filtering and verify invoked counts drop relative to polled counts, proving the filter saves cost and concurrency.
  3. Increase batch size gradually while monitoring the duration, error rates, and latency metrics. Extend the batch window slightly if spikes cause too many invocations.
  4. Speed up the handler: Increase memory for more CPU, reduce per-record I/O, remove redundant serialization, and parallelize safely inside the batch while tracking duration and concurrency metrics together.
  5. Prove spike readiness: Replay realistic surges, monitor offset lag and drain time, and enable Provisioned Mode with a small minimum if recovery takes too long, adjusting with MB/s-per-poller estimates.
  6. Implement alerting: Watch for sustained lag growth, unexpected gaps between polled and invoked, and error spikes tied to partitions or large batches. Always read metrics in context with batch size.
  7. Re-evaluate periodically: Re-measure system throughput, confirm filter effectiveness, and retune batch and memory settings regularly as workloads evolve.

Conclusion

Optimizing Kafka streams processing with AWS Lambda necessitates understanding how ESMs work and tuning consumption components: polling, filtering, batching, and invoking. Filtering redundant records early removes unnecessary work, batching helps you process more records per invocation, and handler optimizations make sure that you make the most of the compute that you allocate. Together, these adjustments let you scale efficiently and keep offset lag under control.

When your workload is bursty, use Provisioned Mode to absorb spikes without long recovery times. With the right alerts on lag, errors, and unexpected polled versus invoked behavior, you can spot problems early and adjust before they impact users. Following this optimization guide gives you a practical way to measure, tune, and revisit your setup as traffic patterns change.

To learn more about optimizing Kafka consumption, see the AWS re:Invent 2024 session about Improving throughput and monitoring of serverless streaming workloads.

To learn more about building Serverless architectures see Serverless Land.

Build scalable REST APIs using Amazon API Gateway private integration with Application Load Balancer

Post Syndicated from Christian Silva original https://aws.amazon.com/blogs/compute/build-scalable-rest-apis-using-amazon-api-gateway-private-integration-with-application-load-balancer/

This post is written by Vijay Menon, Principal Solutions Architect, and Christian Silva, Senior Solutions Architect.

Today, we announced Amazon API Gateway REST API’s support for private integration with Application Load Balancers (ALBs). You can use this new capability to securely expose your VPC-based applications through your REST APIs without exposing your ALBs to the public internet.

Prior to this launch, if you wanted to connect API Gateway to private ALBs, you would have had to use a Network Load Balancer (NLB) as an intermediary, increasing cost and complexity. Now, you can directly integrate API Gateway with private ALBs without requiring an NLB, reducing operational overhead and optimizing cost.

Previous architecture: Connecting API Gateway to private ALBs

Before this launch, API Gateway REST APIs connect to private ALB resources through an NLB positioned in front of the ALB. Many customers have successfully built and operated production workloads using this architecture, demonstrating its reliability for business-critical applications. The following architecture demonstrates this setup.

Figure 1. Previous architecture: API Gateway to private ALB via intermediary NLB

In response to customer feedback for a simplified architecture and reduced costs, we’ve extended VPC link v2 support to REST APIs. This feature now enables direct private ALB integration for REST APIs, eliminating the need for an intermediary NLB.

New architecture: Connecting API Gateway to private ALBs

With direct private ALB integration, this architecture becomes simpler and more efficient. The integration removes the need for an intermediate NLB, reducing the number of hops between client and your services. This streamlined setup simplifies the architecture for applications, allowing more efficient use of ALB’s layer-7 load-balancing capabilities, authentication, and authorization features. While these ALB features were technically accessible before, the new architecture removes the overhead and complexity of managing an additional NLB. Here’s how the simplified architecture looks now:

Figure 2. Direct integration between API Gateway and private ALB

Benefits of a direct integration between your API Gateway endpoint and your private ALB

  • Architectural simplification and operational excellence: Now that your API Gateway can directly connect to your private ALB, you no longer need an NLB to act as a bridge between your API Gateway and your private ALB. This eliminates the need to provision, configure, manage, or monitor an intermediate load balancer. The reduction in infrastructure components translates to reduced operational overhead and fewer potential failure points. Traffic flows directly from API Gateway to your ALB within the Amazon Web Services (AWS) network, reducing network hops and latency.
  • Improved scalability: VPC link v2 supports a one-to-many relationship with load balancers. A single VPC link v2 allows API Gateway to integrate with multiple ALBs or NLBs within your VPC. This architectural advantage is particularly valuable for organizations managing complex applications with multiple microservices, each potentially behind its own ALB, or those running numerous APIs. The ability to consolidate multiple load balancer connections through a single VPC link not only reduces administrative overhead but also provides greater flexibility in scaling your architecture. As your application grows and you add more services or load balancers, you won’t need to provision additional VPC links, making it easier to expand your infrastructure while maintaining operational efficiency.
  • Cost optimization: You can remove the NLB from your architecture and thereby eliminate both the hourly charges for running the NLB and the associated Network Load Balancer Capacity Units (NLCU) used per hour. For organizations running multiple environments or numerous APIs, these savings can accumulate to thousands of dollars annually. Moreover, your data transfer patterns become more efficient. Traffic flows directly from API Gateway to your ALB within the AWS network, which avoids any unnecessary hops that could incur more data transfer charges. This streamlined path not only reduces costs but also improves performance by minimizing network latency.

Getting started

This tutorial demonstrates the setup using both the AWS Management Console and AWS Command Line Interface (AWS CLI). Before you begin, make sure that you have an internal ALB configured in your VPC. For resources that need naming, use appropriate names for your environment.

Step 1: Create a VPC link v2
The first step in our process is to create a VPC link v2, which will enable API Gateway to route traffic to your internal ALB. Here’s how to set it up:

  1. Navigate to the API Gateway console.
  2. In the left navigation pane, choose VPC links.
  3. Choose Create VPC link.
  4. Choose VPC link v2 as the VPC link type.
  5. Provide a descriptive name for your VPC link.
  6. Choose your VPC and subnets where your ALB resides. For high availability, choose subnets in multiple AWS Availability Zones (AZs) that match your ALB configuration.
  7. Assign one or more security groups to your VPC link. These security groups will control the traffic flow between API Gateway and your VPC.
  8. Choose Create and wait for the VPC link status to become Available. This process can take a few minutes.

Alternatively, you can create a VPC link v2 using the AWS CLI:

# Create VPC link v2
aws apigatewayv2 create-vpc-link \
    --name "test-vpc-link-v2" \
    --subnet-ids "<your-subnet1-id>" "<your-subnet2-id>" \
    --security-group-ids "<your-security-group-id>" \
    --region <your-AWS-region>

# Check VPC link v2 status
aws apigatewayv2 get-vpc-link \
    --vpc-link-id "<your-vpc-link-v2-id>" \
    --region <your-AWS-region>

Step 2: Create a REST API and configure integration
With your VPC link v2 now available, the next step is to create a REST API and configure it to use the VPC Link. This process involves creating the API, setting up resources and methods, and configuring the integration with your internal ALB.

  1. In the API Gateway console, choose Create API.
  2. Choose REST API.
  3. Enter an API name and choose Create API.
  4. Create a new resource by choosing Actions, then choose Create resource. This resource will represent the endpoint for your API.
  5. Create a method by choosing Actions, then choose Create method. The method defines the type of request your API will accept (GET, POST, etc.).
  6. Now, configure the integration. This is where you’ll connect your API to your internal ALB via the VPC link v2:
    1. Choose VPC link as the integration type.
    2. Choose the HTTP method for your backend integration.
    3. Choose your newly created VPC link v2.
    4. Specify your ALB as the Integration target.
    5. Enter the endpoint URL for your integration. The port specified in the URL is used to route requests to the backend.
    6. Set the Integration timeout.

Using the AWS CLI:

# Create REST API
aws apigateway create-rest-api \
    --name "test-rest-api" \
    --description "REST API integration with internal ALB via VPC link v2" \
    --region <your-AWS-region>

# Get REST API’s root resource ID
aws apigateway get-resources \
    --rest-api-id "<your-rest-api-id>" \
    --region <your-AWS-region>

# Create a new resource
aws apigateway create-resource \
    --rest-api-id "<your-rest-api-id>" \
    --parent-id "<your-parent-id>" \
    --path-part "internal-alb" \ 
    --region <your-AWS-region>

# Create a new method
aws apigateway put-method \
    --rest-api-id "<your-rest-api-id>" \
    --resource-id "<your-resource-id>" \
    --http-method ANY \
    --authorization-type NONE \
    --region <your-AWS-region>

# Create the integration
aws apigateway put-integration \
    --rest-api-id "<your-rest-api-id>" \
    --resource-id "<your-resource-id>" \
    --http-method ANY \
    --type HTTP_PROXY \
    --integration-http-method ANY \
    --uri "http://test-internal-alb.com/test" \
    --connection-type VPC_LINK \
    --connection-id "<your-vpc-link-v2-id>" \
    --integration-target "<your-ALB-arn>" \
    --region <your-AWS-region>

Step 3: Deploy and test
With your API configured, it’s time to deploy it and verify that it’s working correctly.

  1. Choose Deploy API to create a new deployment of your API.
  2. Create a new stage (for example “test”). Stages allow you to manage multiple versions of your API.
  3. After deployment, you’ll receive an API endpoint URL. Copy this URL as you’ll need it for testing.

Test your API using your preferred API client or a simple curl command.

Using the AWS CLI:

# Create a new deployment to a test stage
aws apigateway create-deployment \
    --rest-api-id "<your-rest-api-id>" \
    --stage-name "test" \
    --region <your-AWS-region>

Test your API integration using a curl command:

curl https://<rest-api-id>.execute-api.<your-aws-region>.amazonaws.com/internal-alb
{"message": "Hello from internal ALB"}

Step 4: Scale your VPC link v2
A single VPC link can now connect to multiple ALBs or NLBs within your VPC, simplifying infrastructure management. This AWS CLI snippet demonstrates API Gateway integrating with multiple internal services, for example orders and payments services, each behind its own ALB, using a single VPC link v2. Note how the same VPC link ID is used across both integrations.

# Orders service integration (ALB-1)
aws apigateway put-integration \
    --rest-api-id "<your-rest-api-id>" \
    --resource-id "<orders-resource-id>" \
    --http-method ANY \
    --type HTTP_PROXY \
    --integration-http-method ANY \
    --uri "<your-orders-alb-endpoint>" \
    --connection-type VPC_LINK \
    --connection-id "<your-vpc-link-v2-id>" \
    --integration-target "<your-orders-alb-arn>" \
    --region "<your-aws-region>"

# Payments service integration (ALB-2)
aws apigateway put-integration \
    --rest-api-id "<your-rest-api-id>" \
    --resource-id "<payments-resource-id>" \
    --http-method ANY \
    --type HTTP_PROXY \
    --integration-http-method ANY \
    --uri "<your-payments-alb-endpoint>" \
    --connection-type VPC_LINK \
    --connection-id "<your-vpc-link-v2-id>" \
    --integration-target "<your-payments-alb-arn>" \
    --region "<your-aws-region>"

For a detailed, step-by-step guide, please see our official documentation in the API Gateway Developer Guide.

Use cases

Private ALB integration with API Gateway enables architectural patterns that solve enterprise challenges. These are three key scenarios where organizations can use this new capability:

  • Microservices on Amazon ECS and Amazon EKS: Exposing microservices running on Amazon ECS or Amazon EKS becomes simpler with this integration. It allows secure, path-based routing to different services without exposing your ALB to the public internet or using complex NLB proxy patterns.
  • Hybrid cloud architectures: Seamless and secure connectivity between cloud-native APIs and on-premises resources is achieved via AWS Direct Connect or AWS Site-to-Site VPN. This setup allows flexible routing based on HTTP methods and headers to various internal systems.
  • Enterprise modernization: Gradual application modernization is facilitated by enabling phased migration from monolithic architectures to microservices. Organizations can route traffic between legacy and new components while maintaining operational continuity and minimizing risk.

Conclusion

Direct private integration between API Gateway REST APIs and ALBs enhances API architecture on AWS. By simplifying infrastructure and reducing operational overhead, this capability improves performance and efficiency for API-driven applications.

This feature is available today in all AWS Regions where VPC link v2 and ALBs are present. We can’t wait to see what you build with it and how it transforms your API architectures. Get started now by visiting the API Gateway console and creating your first VPC link v2 for direct ALB integration.

For more information, visit the API Gateway product page, review our pricing details, and explore the comprehensive developer documentation to learn about all the powerful features available to help you build world-class APIs on AWS.

Accelerate investigations with AWS Security Incident Response AI-powered capabilities

Post Syndicated from Daniel Begimher original https://aws.amazon.com/blogs/security/accelerate-investigations-with-aws-security-incident-response-ai-powered-capabilities/

If you’ve ever spent hours manually digging through AWS CloudTrail logs, checking AWS Identity and Access Management (IAM) permissions, and piecing together the timeline of a security event, you understand the time investment required for incident investigation. Today, we’re excited to announce the addition of AI-powered investigation capabilities to AWS Security Incident Response that automate this evidence gathering and analysis work.

AWS Security Incident Response helps you prepare for, respond to, and recover from security events faster and more effectively. The service combines automated security finding monitoring and triage, containment, and now AI-powered investigation capabilities with 24/7 direct access to the AWS Customer Incident Response Team (CIRT).

While investigating a suspicious API call or unusual network activity, scoping and validation require querying multiple data sources, correlating timestamps, identifying related events, and building a complete picture of what happened. Security operations center (SOC) analysts devote a significant amount of time to each investigation, with roughly half of that effort spent manually gathering and piecing together evidence from various tools and complex logs. This manual effort can delay your analysis and response.

AWS is introducing an investigative agent to Security Incident Response, changing this paradigm and adding layers of efficiency. The investigative agent helps you reduce the time required to validate and respond to potential security events. When a case for a security concern is created, either by you or proactively by Security Incident Response, the investigative agent asks clarifying questions to make sure it understands the full context of the potential security event. It then automatically gathers evidence from CloudTrail events, IAM configurations, and Amazon Elastic Compute Cloud (Amazon EC2) instance details and even analyzes cost usage patterns. Within minutes, it correlates the evidence, identifies patterns, and presents you with a clear summary.

How it works in practice

Before diving into an example, let’s paint a clear picture of where the investigative agent lives, how it’s accessed, and its purpose and function. The investigative agent is built directly into Security Incident Response and is automatically available when you create a case. Its purpose is to act as your first responder—gathering evidence, correlating data across AWS services, and building a comprehensive timeline of events so you can quickly move from detection to recovery.

For example: you discover that AWS credentials for an IAM user in your account were exposed in a public GitHub repository. You need to understand what actions were taken with those credentials and properly scope the potential security event, including lateral movement and reconnaissance operations. You need to identify persistence mechanisms that might have been created and determine the appropriate containment steps. To get started, you create a case in the Security Incident Response console and describe the situation.

Here’s where the agent’s approach differs from traditional automation: it asks clarifying questions first. When were the credentials first exposed? What’s the IAM user name? Have you already rotated the credentials? Which AWS account is affected?

This interactive step gathers the appropriate details and metadata before it starts gathering evidence. Specifically, you’re not stuck with generic results—the investigation is tailored to your specific concern.

After the agent has what it needs, it investigates. It looks up CloudTrail events to see what API calls were made using the compromised credentials, pulls IAM user and role details to check what permissions were granted, identifies new IAM users or roles that were created, checks EC2 instance information if compute resources were launched, and analyzes cost and usage patterns for unusual resource consumption. Instead of you querying each AWS service, the agent orchestrates this automatically.

Within minutes, you get a summary, as shown in the following figure. The investigation summary includes a high-level summary and critical findings, which include the credential exposure pattern, observed activity and the timeframe, affected resources, and limiting factors.

This response was generated using AWS Generative AI capabilities. You are responsible for evaluating any recommendations in your specific context and implementing appropriate oversight and safeguards. Learn more about AWS Responsible AI requirements.

Note: The preceding example is representative output. Exact formatting will vary depending on findings.

The investigation summary includes various tabs for detailed information, such as technical findings with an events timeline, as shown in the following figure:

Figure 2 – Security event timeline

Figure 2 – Security event timeline

When seconds count, this transparency is paramount to a quick, high-fidelity, and accurate response—especially if you need to escalate to the AWS CIRT, a dedicated group of AWS security experts, or explain your findings to leadership, creating a single lens for stakeholders to view the incident.

When the investigation is complete, you have a high-resolution picture of what happened and can make informed decisions about containment, eradication, and recovery. For the preceding exposed credentials scenario, you might need to:

  • Delete the compromised access keys
  • Remove the newly created IAM role
  • Terminate the unauthorized EC2 instances
  • Review and revert associated IAM policy changes
  • Check for additional access keys created for other users.

When you engage with the CIRT, they can provide additional guidance on containment strategies based on the evidence the agent gathered.

What this means for your security operations

The leaked credentials scenario shows what the agent can do for a single incident. But the bigger impact is on how you operate day-to-day:

  • You spend less time on evidence collection. The investigative agent automates the most time-consuming part of investigations—gathering and correlating evidence from multiple sources. Instead of spending an hour on manual log analysis, you can spend most of that time on making containment decisions and preventing recurrence.
  • You can investigate in plain language. The investigative agent uses natural language processing (NLP), which you can use to describe what you’re investigating in plain language, such as unusual API calls from IP address X or data access from terminated employee’s credentials, and the agent translates that into the technical queries needed. You don’t need to be an expert in AWS log formats or know the exact syntax for querying CloudTrail.
  • You get a foundation for high-fidelity and accurate investigations. The investigative agent handles the initial investigation—gathering evidence, identifying patterns, and providing a comprehensive summary. If your case requires deeper analysis or you need guidance on complex scenarios, you can engage with the AWS CIRT, who can immediately build on the work the agent has already done, speeding up their response time. They see the same evidence and timeline, so they can focus on advanced threat analysis and containment strategies rather than starting from scratch.

Getting started

If you already have Security Incident Response enabled, the AI-powered investigation capabilities are available now—no additional configuration needed. Create your next security case and the agent will start working automatically.

If you’re new to Security Incident Response, here’s how to set it up:

  1. Enable Security Incident Response through your AWS Organizations management account. This takes a few minutes through the AWS Management Console and provides coverage across your accounts.
  2. Create a case. Describe what you’re investigating; you can do this through the Security Incident Response console or an API, or set up automatic case creation from Amazon GuardDuty or AWS Security Hub alerts.
  3. Review the analysis. The agent presents its findings through the Security Incident Response console, or you can access them through your existing ticketing systems such as Jira or ServiceNow.

The investigative agent uses the AWS Support service-linked role to gather information from your AWS resources. This role is automatically created when you set up your AWS account and provides the necessary access for Support tools to query CloudTrail events, IAM configurations, EC2 details, and cost data. Actions taken by the agent are logged in CloudTrail for full auditability.

The investigative agent is included at no additional cost with Security Incident Response, which now offers metered pricing with a free tier covering your first 10,000 findings ingested per month. Beyond that, findings are billed at rates that decrease with volume. With this consumption-based approach, you can scale your security incident response capabilities as your needs grow.

How it fits with existing tools

Security Incident Response cases can be created by customers or proactively by the service. The investigative agent is automatically triggered when a new case is created, and cases can be managed through the console, API, or Amazon EventBridge integrations.

You can use EventBridge to build automated workflows that route security events from GuardDuty, Security Hub, and Security Incident Response itself to create cases and initiate response plans, enabling end-to-end detection-to-investigation pipelines. Before the investigative agent begins its work, the service’s auto-triage system monitors and filters security findings from GuardDuty and third-party security tools through Security Hub. It uses customer-specific information, such as known IP addresses and IAM entities, to filter findings based on expected behavior, reducing alert volume while escalating alerts that require immediate attention. This means the investigative agent focuses on alerts that actually need investigation.

Conclusion

In this post, I showed you how the new investigative agent in AWS Security Incident Response automates evidence gathering and analysis, reducing the time required to investigate security events from hours to minutes. The agent asks clarifying questions to understand your specific concern, automatically queries multiple AWS data sources, correlates evidence, and presents you with a comprehensive timeline and summary while maintaining full transparency and auditability.

With the addition of the investigative agent, Security Incident Response customers now get the speed and efficiency of AI-powered automation, backed by the expertise and oversight of AWS security experts when needed.

The AI-powered investigation capabilities are available today in all commercial AWS Regions where Security Incident Response operates. To learn more about pricing and features, or to get started, visit the AWS Security Incident Response product page.

If you have feedback about this post, submit comments in the Comments section below.

Daniel Begimher

Daniel Begimher

Daniel is a Senior Security Engineer in Global Services Security, specializing in cloud security, application security, and incident response. He co-leads the Application Security focus area within the AWS Security and Compliance Technical Field Community, holds all AWS certifications, and authored Automated Security Helper (ASH), an open source code scanning tool.

Serverless strategies for streaming LLM responses

Post Syndicated from KyungYong Shim original https://aws.amazon.com/blogs/compute/serverless-strategies-for-streaming-llm-responses/

Modern generative AI applications often need to stream large language model (LLM) outputs to users in real-time. Instead of waiting for a complete response, streaming delivers partial results as they become available, which significantly improves the user experience for chat interfaces and long-running AI tasks. This post compares three serverless approaches to handle Amazon Bedrock LLM streaming on Amazon Web Services (AWS), which helps you choose the best fit for your application.

  1. AWS Lambda function URLs with response streaming
  2. Amazon API Gateway WebSocket APIs
  3. AWS AppSync GraphQL subscriptions

We cover how each option works, the implementation details, authentication with Amazon Cognito, and when to choose one over the others.

Lambda function URLs with response streaming

AWS Lambda function URLs provide a direct HTTP(S) endpoint to invoke your Lambda function. Response streaming allows your function to send incremental chunks of data back to the caller without buffering the entire response. This approach is ideal for forwarding the Amazon Bedrock streamed output, providing a faster user experience. Streaming is supported in Node.js 18+. In Node.js, you wrap your handler with awslambda.streamifyResponse(), which provides a stream to write data to, and which sends it immediately to the HTTP response.

Architecture

The following figure shows the architecture.

Lambda function URLs with Amazon Bedrock architecture

  1. The client makes a fetch() call to the Lambda function URL.
  2. Lambda invokes InvokeModelWithResponseStream using the AWS SDK for JavaScript.
  3. As tokens arrive from Amazon Bedrock, they are written to the response stream.

Implementation steps

  1. Create a streaming Lambda function: Use a Node.js 18+ or later runtime (necessary for native streaming). Install the AWS SDK to call Amazon Bedrock. In the handler code, wrap the function with awslambda.streamifyResponse and stream the model output. For example, in Node.js you might do the following:
    const bedrock = new BedrockRuntimeClient({region: “us-east-1”});
    
    // Please consider adding more details when you use it for your application.
    exports.handler = awslambda.streamifyResponse(async (event, responseStream) => 
    {
        // 1. Parse input (e.g., prompt from event)
        const prompt = event.body?.prompt;
        // 2. Call Amazon Bedrock with streaming (using AWS SDK for Amazon Bedrock)
        const command = new InvokeModelWithResponseStreamCommand({ modelId: "YOUR_MODEL_ID", body: { prompt }});
        const response = await bedrock.send(command);
        // 3. Stream Bedrock tokens to client
        for await (const event of response.body) {
            if (event.content) {
                responseStream.write(event.content); // write partial output
            }
        }
        // 4. End stream when done
        responseStream.end();
    });
    

  2. This code snippet uses the Amazon Bedrock SDK’s async iterable to read the event stream of tokens and writes each to the responseStream.
  3. Configure the Lambda role: the execution role must allow the Amazon Bedrock invocation (such as bedrock:InvokeModelWithResponseStream on the LLM model Amazon Resource Name (ARN)).

Authentication with Amazon Cognito

Lambda function URLs can be set to “None” (public) or “AWS_IAM”. Native Cognito User Pool token authentication isn’t supported, thus you need to implement a solution.

  1. JWT verification in Lambda: Allow public access and verify a valid JWT from Amazon Cognito in the request header within your Lambda code. This necessitates development effort.
    // Initialize Cognito JWT Verifier
    const { CognitoJwtVerifier } = require('aws-jwt-verify');
    
    const jwtVerifier = CognitoJwtVerifier.create({
      userPoolId: USER_POOL_ID,
      tokenUse: 'id',
      clientId: USER_POOL_CLIENT_ID
    });
    
    // Verify JWT token from Cognito
    async function verifyToken(token) {
      try {
        if (!token) throw new Error('No authorization token provided');
        
        // Remove 'Bearer ' prefix if present
        if (token.startsWith('Bearer ')) {
          token = token.slice(7);
        }
    
        // Verify the token using Cognito JWT Verifier
        const payload = await jwtVerifier.verify(token);
        logger.info(`Verified token for user: ${payload.sub}`);
        
        return payload;
      } catch (error) {
        logger.error(`Token verification failed: ${error.message}`);
        throw new Error(`Invalid token: ${error.message}`);
      }
    }
    
    //...
    
        // Verify authentication
        let userId;
        try {
          const authHeader = event.headers?.Authorization;
          const payload = await verifyToken(authHeader);
          userId = payload.sub;
          logger.info(`Authenticated user: ${userId}`);
        } catch (error) {
          responseStream.write(`data: ${JSON.stringify({ type: 'error', error: 'Unauthorized', message: error.message })}\n\n`);
          return;
        }
    

  2. IAM authorization with Amazon Cognito identity: Use AWS credentials obtained from Amazon Cognito. A more complex setup, especially for web apps, is potentially overkill for a single function.

Pros and cons of Lambda function URLs

Pros:

  • Clarity: No API Gateway or other services are needed, which minimizes operational overhead.
  • Low latency, high throughput: The response is delivered directly from Lambda to the client. This yields excellent Time To First Byte (TTFB) performance, with no intermediate buffering.
  • Direct implementation: For Node.js developers, enabling streaming is as direct as a wrapper and writing to a stream. This is ideal for quick prototypes or single function microservices.
  • Lower cost for low concurrent usage: You pay only for Lambda execution time. There’s no persistent connection cost, which is the same as with WebSocket or AWS AppSync. If invocations are infrequent or short, then this could be cost-efficient.

Cons:

  • Limited runtime support: Native streaming is only supported in Node.js.
  • No built-in user pool auth: Unlike API Gateway or AWS AppSync, Lambda URLs don’t directly support Amazon Cognito user pool authorizers. You must handle auth either through AWS Identity and Access Management (IAM) or manual token validation, adding some development effort and potential security pitfalls if done incorrectly.
  • Error handling complexity: Streaming makes error propagation trickier. If an error occurs mid-stream, then you need to decide how to inform the client.

API Gateway WebSocket for streaming

API Gateway WebSocket APIs establish persistent, stateful connections between clients and your backend. This is ideal for real-time applications needing server-initiated messages. The client connects once, sends a prompt to Amazon Bedrock through the WebSocket, and the server pushes the streamed response back over the same connection.

Architecture

The following figure shows the architecture.

API Gateway WebSocket with Amazon Bedrock architecture

  1. Client connects through the WebSocket URL and store connectionId.
  2. Client sends a prompt through a custom route to the LLMHandler.
  3. Lambda as LLMHandler invokes Amazon Bedrock and streams back through WebSocket.
  4. Client disconnects through the DisconnectHandler and removes connectionId.

Implementation steps

  1. Create a WebSocket API in API Gateway with routes
    1. $connect: Connected to ConnectHandler Lambda.
    2. $disconnect: Connected to DisconnectHandler Lambda.
    3. $stream: All messages go to StreamHandler Lambda.
  2. Create Lambda Authorizer
    1. Receives the connection request with token in query string.
    2. Validates the JWT token against Amazon Cognito.
    3. Returns Allow/Deny policy for the connection.
      def lambda_handler(event, context):
          # Extract token from querystring
          token = event.get("queryStringParameters", {}).get("token", "")
          
          # Validate JWT token against Cognito
          if validate_token(token):
              return {
                  "isAuthorized": True,
                  # Optionally include context that other handlers can access
                  "context": {
                      "userId": extracted_user_id
                  }
              }
          else:
              return {"isAuthorized": False}
      

  3. Create Connection Handler
    1. Connection Lambda runs after successful authorization.
    2. Receives the new connection’s unique connectionId.
    3. Store connection info in Amazon DynamoDB (optional).
    4. Returns 200 status to complete the connection.
      def lambda_handler(event, context):
          # Extract connectionId
          connection_id = event.get("requestContext", {}).get("connectionId")
          
          # Optionally store in DynamoDB
          # dynamodb.put_item(...)
          
          # Connection established successfully
          return {"statusCode": 200}
      

  4. Create Disconnect Handler
    1. Disconnect Lambda is triggered automatically when clients disconnect.
    2. Receives the terminated connection’s connectionId.
    3. Cleans up any stored connection data.
    4. Returns 200 status
      def lambda_handler(event, context):
          # Extract connectionId
          connection_id = event.get("requestContext", {}).get("connectionId")
          
          # Optionally remove from DynamoDB
          # dynamodb.delete_item(...)
          
          # Disconnection handled successfully
          return {"statusCode": 200}
      

  5. Create LLM Handler
      1. Receives messages sent to the stream route.
      2. Extracts prompt from the message body.
      3. Calls Amazon Bedrock model with streaming response.
      4. Streams tokens back to the client using the connection ID.
        def lambda_handler(event, context):
            # Extract connectionId and domain details for sending responses
            connection_id = event["requestContext"]["connectionId"]
            domain = event["requestContext"]["domainName"]
            stage = event["requestContext"]["stage"]
            
            # Parse message body to get the prompt
            body = json.loads(event.get("body", "{}"))
            prompt = body.get("prompt", "")
            
            # Create API Gateway management client for sending responses
            api_client = boto3.client(
                'apigatewaymanagementapi',
                endpoint_url=f'https://{domain}/{stage}'
            )
            
            # Call Amazon Bedrock with streaming response
            response = bedrock_client.invoke_model_with_response_stream(...)
            
            # Stream tokens back to client
            for chunk in response["body"]:
                # Extract token from chunk
                token = process_chunk(chunk)
                
                # Send token directly back through the WebSocket
                api_client.post_to_connection(
                    ConnectionId=connection_id,
                    Data=json.dumps({"token": token, "isComplete": False})
                )
            
            # Send completion message
            api_client.post_to_connection(
                ConnectionId=connection_id,
                Data=json.dumps({"token": "", "isComplete": True})
            )
            
            return {"statusCode": 200}
        

Authentication with Amazon Cognito

Securing a WebSocket API with Amazon Cognito needs a bit more work. API Gateway WebSocket doesn’t have a built-in Amazon Cognito User Pool authorizer:

  1. Lambda authorizer with JWT authentication: API Gateway invokes your Lambda authorizer upon connection, validating the Amazon Cognito JWT (passed as a query parameter). The Lambda generates an IAM policy granting access and returns it.
  2. IAM authentication for WebSockets: Clients sign requests with SigV4 using AWS credentials from an Amazon Cognito Identity Pool. API Gateway evaluates the request against IAM policies.

Pros and cons of API Gateway WebSocket APIs

Pros:

  • Bidirectional real-time communication: WebSockets are ideal for applications where the server needs to push data such as the LLM’s response without explicit requests.
  • Persistent connection for multi-turn conversations: After the initial handshake, the same connection can be reused for subsequent prompts and responses, avoiding repeated setup latency. This is great for a chat UI where the user asks multiple questions in one session.
  • Scalability: API Gateway is a managed service that can handle 500 connections/second and 10,000 requests/second across APIs, which can be increased by request.

Cons:

  • Higher development complexity: When compared to the clarity of a direct Lambda URL, a WebSocket API involves multiple Lambdas and coordination to manage the connection state.
  • Custom auth implementation: There is no built-in Amazon Cognito user pool integration, thus you must implement a Lambda authorizer.
  • Timeout management: The API Gateway integration timeout is 29 s, thus your Lambda function should return the response promptly.

AWS AppSync GraphQL subscription

AWS AppSync is a fully managed GraphQL service that streamlines building real-time APIs. It handles WebSocket connections and client fan-out automatically. Clients subscribe to a GraphQL subscription, and a Lambda resolver pushes the Amazon Bedrock streamed tokens back.

Architecture

The following figure shows the architecture.

AWS AppSync GraphQL subscription with Amazon Bedrock architecture

  1. Client calls a startStream mutation. AppSync invokes the Request Lambda.
  2. The Request Lambda immediately returns a unique sessionId and sends the processing task to an Amazon Simple Queue Service (Amazon SQS) queue.
  3. Client uses the sessionId to subscribe to an onTokenReceived GraphQL subscription.
  4. The Processing Lambda (triggered by Amazon SQS) invokes Amazon Bedrock and, for each token, calls a publishToken mutation in AWS AppSync.
  5. AWS AppSync automatically pushes the token to all clients subscribed with the matching sessionId.

Implementation steps

  1. Design the GraphQL Schema: define types and operations.
    type StreamResponse {
      sessionId: String!
      status: String!
      message: String
      timestamp: String!
      error: String
    }
    
    type TokenEvent {
      sessionId: String!
      token: String!
      isComplete: Boolean!
      timestamp: String!
    }
    
    type Mutation {
      startStream(prompt: String!): StreamResponse!
      publishToken(sessionId: String!, token: String!, isComplete: Boolean!): TokenEvent!
    }
    
    type Subscription {
      onTokenReceived(sessionId: String!): TokenEvent
    

  2. Create the Request Handler (Request Lambda)
    1. Receives the GraphQL mutation with the prompt.
    2. Generates a unique session ID.
    3. Sends the prompt and session ID to the SQS queue.
    4. Returns the session ID to the client immediately.
      def lambda_handler(event, context):
          # Extract prompt from GraphQL event
          prompt = event["arguments"]["prompt"]
          
          # Generate unique session ID
          session_id = str(uuid.uuid4())
          
          # Send message to SQS queue
          sqs_client.send_message(
              QueueUrl="your-sqs-queue-url",
              MessageBody=json.dumps({
                  "prompt": prompt,
                  "sessionId": session_id
              })
          )
          
          # Return session ID to client
          return {
              "sessionId": session_id,
              "status": "streaming_started",
              "timestamp": datetime.datetime.utcnow().isoformat()
          }
      

  3. Create the Processing Handler (Processing Lambda)
    1. It is triggered by Amazon SQS messages.
    2. It calls Amazon Bedrock with streaming enabled.
    3. For each token generated, it calls the AppSync publishToken mutation.
      def lambda_handler(event, context):
          # Process SQS event records
          for record in event["Records"]:
              body = json.loads(record["body"])
              prompt = body["prompt"]
              session_id = body["sessionId"]
              
              # Call Amazon Bedrock with streaming
              response = bedrock_client.invoke_model_with_response_stream(...)
              
              # Process streaming response
              for chunk in response["body"]:
                  # Extract token from chunk
                  token = process_chunk(chunk)
                  
                  # Publish token to AppSync
                  publish_token_to_appsync(
                      session_id=session_id,
                      token=token,
                      is_complete=False
                  )
              
              # Send completion token
              publish_token_to_appsync(
                  session_id=session_id,
                  token="",
                  is_complete=True
              )
      

  4. Configure GraphQL Resolvers
    1. StartStream resolver: Connect to the Request Lambda.
    2. PublishToken resolver: Trigger subscription with a NONE data source.
  5. Client subscription setup
    1. Make a startStream mutation.
      const { sessionId } = await client.mutate({
        mutation: START_STREAM,
        variables: { prompt }
      });
      

    2. Subscribe to receive tokens.
      client.subscribe({
        query: ON_TOKEN_RECEIVED,
        variables: { sessionId }
      }).subscribe({
        next: ({ data }) => {
          if (data.onTokenReceived.isComplete) {
            // Handle completion
          } else {
            // Append token to UI
            appendToken(data.onTokenReceived.token);
          }
        }
      });
      

Authentication with Amazon Cognito

AWS AppSync integrates seamlessly with Amazon Cognito User Pools. Setting the API’s auth mode to Amazon Cognito User Pool needs a valid JWT for every GraphQL operation. This is the most developer-friendly option for authentication. AWS AppSync handles the handshake and token refresh.

Pros and cons of AWS AppSync subscriptions

Pros:

  • Fully managed real-time protocol: You don’t deal with raw WebSockets or connection IDs at all. AWS AppSync automatically establishes and maintains a secure WebSocket for subscriptions (no need for a connect or disconnect Lambda).
  • Streamlined authentication: Built-in support for Amazon Cognito User Pool tokens means that you can secure the API without writing custom authorizers.

Cons:

  • Potential overhead and complexity: For a direct case (one prompt—one stream), introducing GraphQL and AWS AppSync might be seen as over-engineering if your app doesn’t use GraphQL for other use cases.
  • 30-second resolver limit: AWS AppSync has a 30-second limit for mutation resolvers, thus you need to design the initial request to start the process and immediately return, relying on a subscription to stream the results progressively to avoid blocking the user.

Conclusion

The Amazon Bedrock streaming interface unlocks fluid, low-latency LLM experiences. You can use the right AWS serverless architecture to deliver streamed responses in a secure, scalable, and cost-effective way.

  • Lambda function URLs with streaming: Direct, single-user applications and prototypes.
  • API Gateway WebSocket: Multi-turn conversations, collaborative applications.
  • AppSync: Complex applications already using GraphQL.

Each method is serverless, production-ready, and fully integrated with Amazon Cognito for secure access control. AWS provides the flexibility to design high-quality AI user experiences at scale.

Refer to GitHub sample source code for more details.

Comparative table

Feature LAMBDA FUNCTION URLS API GATEWAY WEBSOCKET APIs APPSYNC GRAPHQL SUBSCRIPTIONS
Complexity Lowest Medium High
Real-time focus Limited Strong Strong
Authentication Needs custom logic Needs custom logic Built-in Amazon Cognito support
Scalability Good Good Excellent
GraphQL support None None Native
Use cases Q&A Chatbots, real-time apps Complex apps, multi-user scenarios
Cost Pay per invocation Connection time and Lambda execution Request/connection-based pricing

 

How to update CRLs without public access using AWS Private CA

Post Syndicated from Rochak Karki original https://aws.amazon.com/blogs/security/how-to-update-crls-without-public-access-using-aws-private-ca/

Certificates and the hierarchy of trust they create are the backbone of a secure infrastructure. AWS Private Certificate Authority is a highly available certificate authority (CA) that you can use to create private CA hierarchies, secure your applications and devices with private certificates, and manage certificate lifecycles.

A certificate revocation list (CRL) is a file that contains a signed list of certificates revoked before their scheduled expiration date. Certificates can be revoked for a variety of reasons, including unintended key exposure, or because of discontinued use.

AWS Private CA writes CRLs to an Amazon Simple Storage Service (Amazon S3) bucket that you specify. CRLs are public, fully qualified domain names (FQDNs), but you might have requirements for a CRL that is only accessible internally to your organization, or you might have security standards that require all S3 buckets to have Amazon S3 block public access enabled.

The recommended practice for S3 buckets is to enable Block Public Access, which enables only authorized and authenticated AWS accounts to have access to a bucket and its contents. However, because some public key infrastructure (PKI) clients retrieve CRLs across the public internet, a workaround might be necessary to serve CRLs without requiring authenticated client access to an S3 bucket. One recommended solution is to use Amazon CloudFront to provide access to the CRL. This will likely be the best solution for most customers. Our documentation specifically highlights CloudFront as the recommended implementation path. However, you might not be able to use CloudFront or might need another option.

You might need a solution where the CRL lookups don’t traverse the public internet. In this post, we go over two different approaches to achieve this.

Option 1: Relocate CRLs to an internally accessible location

By default, AWS Private CA writes CRLs to an S3 bucket that you specify. This solution consists of moving the CRL to a separate location that is internally accessible to your TLS clients, but not accessible via the public internet such as an on-premises server. A CRL distribution point (CDP) is a link that points to the location of the CRL where revoked certificates appear. However, when private certificates are generated by AWS Certificate Manager (ACM), the CDP universal resource identifiers (URI) in the certificates point by default to the S3 bucket initially specified.

This solution uses a custom CNAME in the CDP to indicate, during certificate generation, the location where the CRL will ultimately be located.

The steps in the solution are as follows:

  1. Select the S3 bucket where the CRL will be stored.
  2. Issue a certificate through the CA with a custom CNAME.
  3. Create an AWS Lambda function that moves the CRL file from the S3 bucket to another specified location.
  4. Create an Amazon Simple Notification Service (Amazon SNS) notification that alerts a user to the success metric of the CRL generation event.

Prerequisites:

For this walkthrough, you must have the following resources ready to use:

  1. An AWS account with:
    • An AWS Identity and Access Management (IAM) role with permissions for Amazon S3, ACM Private CA, Amazon EventBridge, and Lambda
    • An ACM private CA root and subordinate CA configured in the same AWS Region
    • An S3 bucket for the CRL with permissions that allow the AWS Private CA service principal to PutObject, PutObjectACL, GetBucketACL and GetBucketLocation (see the following example bucket policy)
{     
    "Version": "2012-10-17",     
    "Statement": [         
        {             
            "Effect": "Allow",             
            "Principal": {                 
                "Service": "acm-pca.amazonaws.com"             
            },             
            "Action": [                 
                "s3:PutObject",                 
                "s3:PutObjectAcl",                 
                "s3:GetBucketAcl",                 
                "s3:GetBucketLocation"             
            ],             
            "Resource": [                 
                "arn:aws:s3:::<name-of-bucket>/*",                 
                "arn:aws:s3:::<name-of-bucket>"             
            ],             
            "Condition": {                 
                "StringEquals": {                     
                    "aws:SourceAccount": "<account-num-here>",                     
                    "aws:SourceArn": "<subordinate-ca-arn-here>"                 
                }             
            }         
        }     
    ] 
}

2. AWS Command Line Interface (AWS CLI) configured

Deploy:

With the prerequisites in place, you’re ready to deploy the first solution.

To enable CRL distribution:

  1. Use your account to sign in to the AWS Management Console for AWS Private Certificate Authority.
  2. Select the name of your subordinate CA. This should take you to another page with more details.
  3. Scroll down and choose the Revocation configuration tab.
  4. Choose Edit on the top right.
  5. Figure 1: Edit the revocation configuration

    Figure 1: Edit the revocation configuration

  6. Select Activate CRL distribution. Select the CRL S3 bucket you created prior to the walkthrough.
  7. Figure 2: Enter a name for your CRL

    Figure 2: Enter a name for your CRL

  8. Modify the CDP by expanding the CRL settings dropdown. In the Custom CRL Name field, enter the URL where you will eventually move the CRL. This should be a place that is accessible by your internal organization, but not accessible externally. If you use partitioned CRLs, select the Enable partitioning checkbox. To learn more about CRL partitioning, see Plan your AWS Private CA certificate revocation method.
  9. Choose Save changes.

To create an SNS topic and Lambda function:

  1. Go to the Amazon SNS console.
  2. Create a standard SNS topic. Leave all options as default and subscribe an appropriate email to the topic.
  3. Figure 3: Create an SNS topic

    Figure 3: Create an SNS topic

  4. Go to the Lambda console.
  5. Choose Create Function.
  6. Enter a name for your function. Under Runtime, select Python 3.12 from the dropdown.
  7. Figure 4: Create a Lambda function

    Figure 4: Create a Lambda function

  8. Verify that the role associated with your Lambda function has permissions to get objects from the S3 bucket where AWS Private CA places the CRL (set when you configured the revocation details for the CA), copy objects in Amazon S3, then put objects in an S3 bucket (or wherever the new CRL distribution point specified in the certificate custom CNAME will be—for example, an internal-only accessible location), and publish to an Amazon SNS topic. The Lambda function also checks the success metric of a CRL generation event. If the event fails, an SNS topic will notify an admin. If the event is successful, a copy of the CRL in the original S3 bucket is created in the new specified location and an SNS topic will notify an admin.

Example code (Python 3.13):

import boto3 
import json 

def lambda_handler(event, context):     
	#create a s3 client     
	s3 = boto3.client('s3')          

	#create a sns client     
	sns = boto3.client('sns')     
    topicArn = "<sns-topic-arn-here>”     
    
    #get name of the CA from the CW event     
    caID = event['resources'][0].split('/')[-1]          
    status = event['detail']['result']     
    if status == 'success':              
    	
        source = '<ORIGINS3BUCKET>'         
        destination = '<DESTINATION-S3BUCKET>'         
        #See below note for more clarification on S3 CRL paths         
        folder = 'crl/'         
        file = caID + '.crl'         
        key = folder + file              
        
        try:             
        	copySource = {                 
            	'Bucket': source,                 
                'Key': key             
           	}                      
            
            s3.copy_object(                 
            	CopySource=copySource,                 
                Bucket=destination,                 
                Key=file             
          	)             
            response = sns.publish(                 
            	TopicArn=<sns-topicArn>,                 
                Message=f'Successfully moved {key} from {source} to {destination} in {caID}',                 
                Subject="CRL Upload Success"             
          	)                      
            
            return {                 
            	'statusCode': 200,                 
                'body': json.dumps(f'Successfully moved {key} from {source} to {destination} in {caID}')             
          	}                  
    	
        except s3.exceptions.NoSuchKey:             
        	response = sns.publish(                 
            	TopicArn=<sns-topicArn>,                 
                Message=f"Object {key} not found in {source}",                 
                Subject='CRL Upload Failure'             
          	)             
            return {                 
            	'statusCode': 404,                 
                'body': json.dumps(f'Object {key} not found in {source}')             
          	}                  
   		except Exception as e:             
    		print(e)             
        	response = sns.publish(                 
        		TopicArn=<sns-topicArn>,                 
            	Message=f'Error moving object: {str(e)}',                 
            	Subject='Failure Uploading CRL'             
     		)             
			return {                 
    			'statusCode': 500,                 
        		'body': json.dumps(f'Error moving object: {str(e)}')             
  			}     
    else:         
    	response = sns.publish(                 
        		TopicArn=<sns-topicArn>,                 
            	Message=f'Certificate Authority {caID} CRL creation {status}',                 
            	Subject='CRL Upload Failure'             
     		)         
        return {             
        	'statusCode': 200,             
            'body': json.dumps(f'Certificate Authority {caID} CRL creation {status}')         
      	}

Note: By default, the non-partitioned CRL path in S3 is <s3-bucket-name>/crl/<CA-ID>.crl. If you used a custom path, modify the path name to the CRL accordingly. Alternatively, if using partitioned CRLs, the path changes to <s3-bucket-name>/crl/<CA-ID>/<partition_GUID>.crl; in that case, you can loop over each file in the <CA-ID> path to achieve the same effect.

To create an EventBridge that deploys your Lambda function:

  1. Go to the EventBridge console. Under Buses, select Rules.
  2. Choose Create Rule.
  3. Enter a name for your rule. Under Rule Type, select Rule with an Event Pattern and choose Next.
  4. Under Events, select AWS events or EventBridge partner events as the Event Source.
  5. For the Event pattern, select Use pattern form. For the Event source, select AWS services. For Event Type, select ACM Private CA CRL Generation.
Figure 5: Configure the event pattern

Figure 5: Configure the event pattern

  1. Choose Next.
  2. Under Target types, choose AWS Service, and then select Lambda function from the Select a target dropdown and select the function that you created earlier.
  3. Figure 6: Select the Lambda function as the target

    Figure 6: Select the Lambda function as the target

  4. Choose Next. Review your topic, then choose Update rule.
  5. To test the success of the Lambda function:

    1. To test the EventBridge topic, create and revoke a certificate. You can do this using the AWS CLI by getting the serial number of a certificate using openSSL:
      openssl x509 -in cert.pem -noout -serial
    2. Use the following command to revoke the certificate:
      aws acm-pca revoke-certificate —certificate-authority-arn <CA ARN> \ —certificate-serial <SERIAL NUMBER RETURNED IN STEP 1> --revocation-reason “UNSPECIFIED”
    3. To make sure that the Lambda function is triggered, wait 5–30 minutes. Check CloudTrail to make sure that RevokeCertificate was called, then monitor the CloudWatch log of the Lambda function. You should also get a notification message from your SNS topic.
    4. You have now successfully moved your CRL to a new location.

    Option 2: Implement Private CRL Access Through AWS Private CA

    This solution provides private Certificate CRL access within AWS Private CA, avoiding the need for public internet exposure. The design centers on establishing root and subordinate CAs with CRL functionality enabled within a dedicated S3 bucket, combined with a private network infrastructure using Gateway VPC endpoints and private subnets. Security is enforced through an S3 bucket policy that accomplishes three critical objectives:

    • Authorizing essential AWS Private CA permissions
    • Constraining CRL access to a designated Gateway VPC endpoint
    • Explicitly blocking access attempts from other sources.

    The solution includes private DNS zone configuration for proper resolution and can be verified through access testing confirming successful CRL retrieval from private VPC instances while making sure that requests from public instances are denied, maintaining a strictly private PKI.

    1. Create a root CA and subordinate CA with CRL enabled
    2. Configure a dedicated S3 bucket for CRL storage
    3. Issue private certificates through ACM
    4. Set up a VPC with private subnets
    5. Configure a Gateway VPC endpoint for Amazon S3
    6. Set up route tables for local traffic only
    7. Implement an S3 bucket policy with specific permissions
    8. Configure private DNS resolution
    9. Set up access controls through VPC endpoints
    10. Test private access from within the VPC
    11. Verify that public access is blocked

    Prerequisites for CRL solution 2

    For this walkthrough, you must have the following resources available:

    Deploy CRL solution 2

    With the prerequisites in place, you’re ready to use the console and AWS CLI to deploy the solution.

    To deploy the solution:

    1. Go to the AWS Private Certificate Authority console.
    2. In the navigation pane, choose Create a Private CA.
      1. Under Mode options, select General-purpose.
      2. For CA type options, select root.
      3. For the Subject distinguished name options: Fill in at least one of the subject distinguished name options: Organization(O), Organization unit (OU), Country(C), State, Locality name, and Common name (CN).
        Figure 7: Create a private CA (root)

        Figure 7: Create a private CA (root)

      4. Select Key algorithm options, for example, RSA 2046.
      5. Under Certificate revocation options, select Activate CRL Distribution, and select or create an S3 bucket for CRL storage.
      6. Under Pricing, select the checkbox to acknowledge pricing and then select Create CA.
    Figure 8: Configure a private CA (root)

    Figure 8: Configure a private CA (root)

    3. After creating a root CA, repeat all of step 2 to create a subordinate CA, selecting
    Subordinate CA under
    CA options (step 2-b). When completed, both the root CA and subordinate CA will be visible on the Private certificate authority page.

    Figure 9: View of root CA and subordinate CA

    Figure 9: View of root CA and subordinate CA

    With the root CA and subordinate CA in place, the next step is to create a VPC gateway endpoint for S3 access to enable private network communication.

    To create a VPC gateway endpoint:

    1. Go to the Amazon VPC console
    2. In the left navigation pane, select Endpoints, and choose Create Endpoint.
    3. Configure the Gateway VPC endpoint settings:
      1. Enter a descriptive name for your endpoint (optional).
      2. Type: Select AWS services.
      3. Services: Select the service name com.amazonaws.[region].s3 from the list.
      4. Type: Verify that Gateway is selected (automatically chosen for Amazon S3).
      5. VPC: Choose the VPC where you want to create the endpoint.
      6. Route tables: Select the route tables associated with the subnets that need Amazon S3 access.
      7. Policy: Select Full Access or create a custom policy to restrict access to specific S3 buckets or actions.
      8. Review your configuration and choose Create endpoint.
    Figure 10: Gateway VPC endpoint configuration

    Figure 10: Gateway VPC endpoint configuration

    1. Create two private subnets:
      1. In the Amazon VPC console, choose Subnets and then Create subnet.
      2. Select your VPC and enter the subnet details (name, Availability Zone, and CIDR block).
      3. Repeat for the second subnet in a different Availability Zone.
    2. Configure route tables:
      1. Navigate to Route Tables and choose Create route table.
      2. Create and name two route tables for your private subnets.
      3. Associate each route table with its corresponding private subnet.
      4. Make sure that each route table contains only local routes (VPC CIDR).
      5. Remove any routes for internet access (0.0.0.0/0).
    Figure 11: Private route table configuration

    Figure 11: Private route table configuration

    1. You can see now see under Resource Map that the Gateway VPC endpoint provides secure access to Amazon S3 resources within the private network.
    Figure 12: VPC private instance configuration

    Figure 12: VPC private instance configuration

    1. Use the following example code to implement a bucket policy that enforces the following key security controls:
      • Grant AWS Private CA the necessary permissions for certificate management.
      • Restrict CRL access exclusively through the specified VPC endpoint.
      • Explicitly deny GetObject requests not originating from the designated Gateway VPC endpoint.
    Figure 13: S3 bucket policy

    Figure 13: S3 bucket policy

    The following is an example S3 bucket policy for private CA CRL access with VPC endpoint restrictions:

    {     
        "Version": "2012-10-17",     
        "Statement": [         
            {             
                "Effect": "Allow",             
                "Principal": {                 
                    "Service": "acm-pca.amazonaws.com"             
                    },             
                "Action": [                 
                    "s3:PutObject",                 
                    "s3:PutObjectAcl",                 
                    "s3:GetBucketAcl",                 
                    "s3:GetBucketLocation"             
                ],             
                "Resource": [                 
                    "<arn:aws:s3:::BUCKET_NAME>",                 
                    "<arn:aws:s3:::BUCKET_NAME>/"           
                ],           
                "Condition": {               
                    "StringEquals": {                   
                        "aws:SourceArn": "<arn:aws:acm-pca:REGION:ACCOUNT_ID:certificate-authority/CA_ID>",                   
                        "aws:SourceAccount": "<ACCOUNT_ID>"               
                        }           
                }       
            },       
            {           
                "Sid": "Allow Access to CRL",           
                "Effect": "Allow",            
                "Principal": "",             
                "Action": "s3:GetObject",             
                "Resource": "<arn:aws:s3:::BUCKET_NAME/crl/CA_ID.crl>",             
                "Condition": {                 
                    "StringEquals": {                     
                        "aws:SourceVpce": "<VPCE_ID>"                 
                        }             
                }         
            },         
            {             
                "Sid": "Access-to-specific-VPCE-only",             
                "Effect": "Deny",             
                "Principal": "",            
                "Action": "s3:GetObject",           
                "Resource": [               
                    "<arn:aws:s3:::BUCKET_NAME>",               
                    "<arn:aws:s3:::BUCKET_NAME>/"             
                ],             
                "Condition": {                 
                    "StringNotEquals": {                     
                        "aws:SourceVpce": "<VPCE_ID>"                 
                        }             
                }         
            }     
        ] 
    }

    Figure 14: S3 bucket CRL properties

    Figure 14: S3 bucket CRL properties

    Create a private hosted zone:

    1. Go to the Route 53 console.
    2. In the left navigation pane, choose Hosted zones.
    3. Choose Create hosted zone.
    4. Configure the following:
      1. Domain name: Enter s3.amazonaws.com
      2. Description: (optional) enter Private hosted zone for S3 CRL endpoint
      3. Type: Select Private hosted zone.
      4. VPC: For Region, select your VPC’s Region; for VPC ID, select your VPC from the dropdown list.
    5. Choose Create hosted zone.

    Create a record set:

    1. Inside your new private hosted zone:
      1. Choose Create record.
      2. Select Simple routing policy.
      3. Choose Next.
    2. Configure record:
      1. Record name: Enter your S3 bucket name.
      2. Record type: Select A – Routes traffic to an IPv4 address.
      3. Alias: Toggle Yes.
      4. Route traffic to: Select Alias to S3 website endpoint.
      5. Region: Select your Region.
      6. S3 endpoint: Select from dropdown list.
      7. TTL: Leave as default (300 seconds).
    3. Choose Create record.
    Figure 15: Hosted zone details

    Figure 15: Hosted zone details

    Verify configuration:

    1. Go to the Amazon EC2 console and choose Launch instance.
    2. Select Amazon Linux 2.
    3. Choose Instance Type.
    4. Select you VPC and subnet.
    5. Under Network settings, select Create security group, then choose Allow SSH traffic from and enter your IP address.
    6. Choose Launch instance.
    7. After the instance is launched, select the instance and choose Connect.
    8. Select EC2 Instance Connect and choose Connect.

    Test the solution

    To test private access from an EC2 instance within your private VPC, verify CRL access using:
    curl -s https://<bucket-name>.s3.<region>.amazonaws.com/crl/<certificate-id>.crl | openssl crl -text -noout

    If successful, the command completes the following steps, as shown in Figure 16:

    1. Retrieves the CRL from Amazon S3
    2. Decodes it using OpenSSL
    3. Displays comprehensive CRL information including issuer details, update timestamps, revoked certificate list, signature algorithm, and other metadata
    Figure 16: Public access verification

    Figure 16: Public access verification

    To validate your security controls, attempt access from a public EC2 instance using the following command:
    curl https://<bucket-name>.s3.<region>.amazonaws.com/crl/<certificate-id>.crl

    This should fail, receiving an access denied error confirming that the CRL cannot be accessed from the public internet, as shown in Figure 17.

    Figure 17: Access denied error confirming that the CRL cannot be accessed from the public internet

    Figure 17: Access denied error confirming that the CRL cannot be accessed from the public internet

    Conclusion

    In this post, we walked you through two solutions that you can use to make your CRLs accessible to your internal organization, but not publicly available. First, we showed you how to configure a custom CNAME in your CRL distribution point and deploy Lambda functions to automatically copy each newly generated CRL from the default S3 bucket into a private S3 store.

    Next, we showed you a VPC architecture that uses an Amazon S3 VPC gateway endpoint, tightly scoped bucket policies, and private Route 53 DNS zones to make sure that CRL retrieval is confined to your VPC. We also covered the essential IAM and bucket policies that your clients need to access those CRLs securely. You can get started with setting up this solution on AWS Private CA today.

    If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Certificate Manager forum or contact AWS Support.

Rochak Karki

Rochak Karki

Rochak is a Security Specialist Solutions Architect at AWS, focusing on threat detection, incident response, and data protection helping customers build secure environments. Rochak is a US Army veteran and holds a Bachelor of Science in Engineering from the University of Wyoming. Outside of work, he enjoys spending time with family and friends, hiking, and traveling.

Cheryl Wang

Cheryl is an Associate Security Solutions Architect at AWS based in the SF Bay Area. Cheryl is passionate about cybersecurity and helping customers improve their security infrastructure. She holds a B.A. in Computer Science from Wellesley College. Outside of work, she enjoys writing and playing guzheng.

Improve API discoverability with the new Amazon API Gateway Portal

Post Syndicated from Giedrius Praspaliauskas original https://aws.amazon.com/blogs/compute/improve-api-discoverability-with-the-new-amazon-api-gateway-portal/

Amazon API Gateway now provides a fully managed portal feature, Amazon API Gateway Portal, that eliminates the need for static websites, open source solutions, or third-party offerings, which often led to fragmented API lifecycle management and increased costs. API Gateway Portal integrates with the API Gateway service and offers features like API products, interactive “Try it” functionality, and documentation for your API portfolio.

This fully managed solution addresses the need for a seamless way to showcase APIs and help developers quickly find, try, and integrate with them. By providing a managed solution that handles infrastructure, security, and scalability, API providers can focus on creating valuable APIs and delivering a great developer experience.

In this post, we will show how you can use the new portal feature to create customizable portals with enhanced security features in minutes, with APIs from multiple accounts, without managing any infrastructure.

Overview

A developer portal is a web page where API providers can share their APIs and API documentation by grouping them into portal products. Each portal product is a logical grouping of REST APIs and contains the documentation that you create and publish for your API consumers. Product pages within a portal contain the custom documentation at the portal product level. Product REST endpoint pages contain the documentation for each of the REST APIs with the details of the path and method of a REST API and the stage it’s deployed to. The combination of Product pages and Product REST endpoint pages provide the complete documentation for our API consumers on how to start using your REST APIs.

This abstraction allows you to organize endpoints from multiple APIs and stages into coherent product offerings for your consumers. For example, if you operate multiple APIs supporting a pet adoption service, you can create an “AdoptAnimals” portal product that groups dog-related endpoints from one API with cat-related endpoints from another API, while organizing user management functions into a separate “AdoptProcess” portal product.

With this flexibility you can present your APIs in a way that matches your business logic rather than your technical architecture and organize your APIs in ways that make the most sense for your consumers. For large enterprises managing extensive API portfolios, API Gateway Portal offers centralized catalogs of APIs across business groups, reducing duplicate work and improving standardization.

The portal feature automatically creates developer portals that display APIs with documentation, interactive testing capabilities, and integrated consumer analytics. The platform uses AWS Resource Access Manager (RAM) for multi-account API sharing, Amazon Cognito for access control, and Amazon CloudWatch for centralized monitoring.

Key features of API Gateway Portal

The API Gateway Portal provides comprehensive functionality for both API providers and consumers.

The following is a list of the key features that were introduced by the service at launch:

Customizable portal experience: You control your portal’s branding through custom logos and color schemes. You can configure custom domain names with SSL certificates managed by AWS Certificate Manager, or use the default domain structure provided by AWS.

Flexible access control: Access to developer portals can be controlled using Amazon Cognito, you can configure portals to be either publicly accessible or require authentication. Integration with Cognito user pools provides secure and scalable identity and access management that is enterprise-grade, cost-effective, and customizable. For organizations using existing identity systems, Cognito supports federation with SAML and OpenID Connect identity providers.

Cross-account API organization: The portal supports sharing portal products across AWS accounts using AWS RAM, so that organizations can create a unified API catalog while maintaining flexibility for API providers to develop and maintain APIs in their own accounts. When you share a portal product with another account, that account cannot modify any properties of your portal product or product endpoint pages, so API providers maintain control over their APIs while still enabling discovery across the organization. The cross-account sharing capabilities provide significant governance benefits for enterprise customers, including centralized discovery, standardization, reduced duplication, clear ownership, and controlled access.

Documentation: Beyond API reference documentation synchronized from your API definitions, you can add supplemental documentation including guides, use cases, and integration examples.

Search, discovery, and interactive API exploration: Consumers can search across your entire catalog. The portal provides intuitive customizable navigation and organization to help users find the right endpoints for their needs. Using the “Try It” functionality consumers can try APIs directly from the portal. Users can input request parameters, headers, and see live responses, reducing time-to-value for API integrations. This environment includes built-in limits for security and cost control.

Access control and governance

Amazon API Gateway Portal provides security and governance capabilities essential for production deployments.

Identity and access management: Integration with Cognito user pools provides secure and scalable identity and access management that is enterprise-grade, cost-effective, and customizable, including multi-factor authentication, password policies, and user lifecycle management.

API authorization: The portal respects existing authorization mechanisms configured on your APIs, including AWS IAM, Lambda authorizers, and Cognito user pools. Portal access doesn’t bypass your established security controls.

Cross-account governance: When sharing portal products across accounts using AWS RAM, the original API owners retain full control over their endpoints, including authorization strategies, integration configurations, and stage settings. Portal owners can use shared portal products but cannot modify the underlying API configurations.

Audit and monitoring: All portal management activities integrate with AWS CloudTrail for comprehensive audit logging. You can use Amazon CloudWatch RUM to perform real user monitoring to collect and view analytics about API consumers in near real time.

Resource limits: The service includes built-in quotas to prevent abuse, including limits on API testing rate limits, payload sizes, and integration timeouts. With these limits the “Try It” functionality cannot impact your production API performance.

Getting Started

Setting up a portal involves three main steps: creating portal products, configuring the portal, and publishing for consumer access. We will walk through those steps in more detail.

Create portal product

The following procedure shows you how to create a portal product:

  1. Navigate to the API Gateway console and select Portal products from the main navigation.
  2. Choose Create portal product and specify your portal product details including name, description, and visibility settings.
  3. Next, select the endpoints you want to include in this portal product. You can choose entire API stages or specific resources and methods, and even rename endpoints with user-friendly names for better discoverability.
  4. The system automatically imports your API documentation. You can improve the documentation with additional context, use cases, and examples later.
  5. Organize product endpoints into custom categories that reflect your business logic rather than technical implementation details.

Configure the developer portal

The following procedure shows how to create a portal.

  1. Select Developer portals in the API Gateway console navigation.
  2. Specify your portal name, description, and domain configuration.
  3. Choose between adding your prefix to the default AWS domain or configuring a custom domain name with your own SSL certificate.
  4. Configure access control by selecting authentication requirements. For internal portals, you might require Amazon Cognito authentication, while public portals can allow anonymous access to documentation.
  5. Upload your logo and select color themes to match your brand identity.
  6. Add your portal products. You can include products from your account or products shared with you from other accounts through AWS RAM. The portal provides search and filtering capabilities for consumers.

Preview and publish

Before making your portal publicly available, use the preview functionality to review the consumer experience. The preview shows exactly how your portal will appear to users, including navigation, documentation, and available API testing capabilities.

When you’re satisfied with the configuration, choose Publish portal to make it accessible to consumers. The publishing process typically completes within a few minutes, and API Gateway provides the final portal URL for distribution to your consumers.

Conclusion and next steps

The new API Gateway Portal eliminates the complexity of building and maintaining custom API documentation sites. Your developers get a professional, feature-rich experience where they can discover and try your APIs immediately. Plus, since everything stays within AWS, you get built-in security, simplified operations, and comprehensive observability through integration with services like CloudWatch and CloudTrail.

Ready to streamline your API discovery experience? Here’s how to get started:

Getting started with Amazon S3 Tables in Amazon SageMaker Unified Studio

Post Syndicated from David Pasha original https://aws.amazon.com/blogs/big-data/getting-started-with-amazon-s3-tables-in-amazon-sagemaker-unified-studio/

Modern data teams face a critical challenge: their analytical datasets are scattered across multiple storage systems and formats, creating operational complexity that slows down insights and hampers collaboration. Data scientists waste valuable time navigating between different tools to access data stored in various locations, while data engineers struggle to maintain consistent performance and governance across disparate storage solutions. Teams often find themselves locked into specific query engines or analytics tools based on where their data resides, limiting their ability to choose the best tool for each analytical task.

Amazon SageMaker Unified Studio addresses this fragmentation by providing a single environment where teams can access and analyze organizational data using AWS analytics and AI/ML services. The new Amazon S3 Tables integration solves a fundamental problem: it enables teams to store their data in a unified, high-performance table format while maintaining the flexibility to query that same data seamlessly across multiple analytics engines—whether through JupyterLab notebooks, Amazon Redshift, Amazon Athena, or other integrated services. This eliminates the need to duplicate data or compromise on tool choice, allowing teams to focus on generating insights rather than managing data infrastructure complexity.

Table buckets are the third type of S3 bucket, taking place alongside the existing general purpose buckets, directory buckets, and now the fourth type – vector buckets. You can think of a table bucket as an analytics warehouse that can store Apache Iceberg tables with various schemas. Additionally, S3 Tables deliver the same durability, availability, scalability, and performance characteristics as S3 itself, and automatically optimize your storage to maximize query performance and to minimize cost.

In this post, you learn how to integrate SageMaker Unified Studio with S3 tables and query your data using Athena, Redshift, or Apache Spark in EMR and Glue.

Integrating S3 Tables with AWS analytics services

S3 table buckets integrate with AWS Glue Data Catalog and AWS Lake Formation to allow AWS analytics services to automatically discover and access your table data. For more information, see creating an S3 Tables catalog.

Before you get started with SageMaker Unified Studio, your administrator must first create a domain in the SageMaker Unified Studio and provide you with the URL. For more information, see the SageMaker Unified Studio Administrator Guide.

If you’ve never used S3 Tables in SageMaker Studio, you can allow it to enable the S3 Tables analytics integration when you create a new S3 Tables catalog in SageMaker Unified Studio.

Note: This integration needs to be configured individually in each AWS Region.

When you integrate using SageMaker Unified Studio, it takes the following actions in your account:

  • Creates a new AWS Identity and Access Management (IAM) service role that gives AWS Lake Formation access to all your tables and table buckets in the same AWS Region where you are going to provision the resources. This allows Lake Formation to manage access, permissions, and governance for all current and future table buckets.
  • Creates a catalog from an S3 table bucket in the AWS Glue Data Catalog.
  • Add the Redshift service role (AWSServiceRoleForRedshift) as a Lake Formation Read-only administrator permissions.

Prerequisites

Creating catalogs from S3 table buckets in SageMaker Unified Studio

To get started using S3 Tables in SageMaker Unified Studio you create a new Lakehouse catalog with S3 table bucket source using the following steps.

  1. Open the SageMaker console and use the region selector in the top navigation bar to choose the appropriate AWS Region.
  2. Select your SageMaker domain.
  3. Select or create a new project you want to create a table bucket in.
  4. In the navigation menu select Data, then select + to add a new data source.
  5. Choose Create Lakehouse catalog.
  6. In the add catalog menu, choose S3 Tables as the source.
  7. Enter a name for the catalog blogcatalog.
  8. Enter database name taxidata.
  9. Choose Create catalog.
  10. The following steps will help you create these resources in your AWS account:
    1. A new S3 table bucket and the corresponding Glue child catalog under the parent Catalog s3tablescatalog.
    2. Go to Glue console, expand Data Catalog, Click databases, a new database within that Glue child catalog. The database name will match the database name you provided.
    3. Wait for the catalog provisioning to finish.
  11. Create tables in your database, then use the Query Editor or a Jupyter notebook to run queries against them.

Creating and querying S3 table buckets

After adding an S3 Tables catalog, it can be queried using the format s3tablescatalog/blogcatalog. You can begin creating tables within the catalog and query them in SageMaker Studio using the Query Editor or JupyterLab. For more information, see Querying S3 Tables in SageMaker Studio.

Note: In SageMaker Unified Studio, you can create S3 tables only using the Athena engine. However, once the tables are created, they can be queried using Athena, Redshift, or through Spark in EMR and Glue.

Using the query editor

Creating a table in the query editor

  1. Navigate to the project you created in the top center menu of the SageMaker Unified Studio home page.
  2. Expand the Build menu in the top navigation bar, then choose Query editor.
  3. Launch a new Query Editor tab. This tool functions as a SQL notebook, enabling you to query across multiple engines and build visual data analytics solutions.
  4. Select a data source for your queries by using the menu in the upper-right corner of the Query Editor.
    1. Under Connections, choose Lakehouse (Athena) to connect to your Lakehouse resources.
    2. Under Catalogs, choose S3tablescatalog/blogcatalog.
    3. Under Databases, choose the name of the database for your S3 tables.
  5. Select Choose to connect to the database and query engine.
  6. Run the following SQL query to create a new table in the catalog.
    CREATE TABLE taxidata.taxi_trip_data_iceberg (
    pickup_datetime timestamp,
    dropoff_datetime timestamp,
    pickup_longitude double,
    pickup_latitude double,
    dropoff_longitude double,
    dropoff_latitude double,
    passenger_count bigint,
    fare_amount double
    )
    PARTITIONED BY
    (day(pickup_datetime))
    TBLPROPERTIES (
    'table_type' = 'iceberg'
    );

    After you create the table, you can browse to it in the Data explorer by choosing S3tablescatalog →s3tableCatalog →taxidata→taxi_trip_data_iceberg.

  7. Insert data into a table with the following DML statement.
    INSERT INTO taxidata.taxi_trip_data_iceberg VALUES (
    TIMESTAMP '2025-07-20 10:00:00',
    TIMESTAMP '2025-07-20 10:45:00',
    -73.985,
    40.758,
    -73.982,
    40.761,
    2, 23.75
    );

  8. Select data from a table with the following query.
    SELECT * FROM taxidata.taxi_trip_data_iceberg
    WHERE pickup_datetime >= TIMESTAMP '2025-07-20'
    AND pickup_datetime < TIMESTAMP '2025-07-21';

You can learn more about the Query Editor and explore additional SQL examples in the SageMaker Unified Studio documentation.

Before proceeding with JupyterLab setup:

To create tables using the Spark engine via a Spark connection, you must grant the S3TableFullAccess permission to the Project Role ARN.

  1. Locate the Project Role ARN in SageMaker Unified Studio Project Overview.
  2. Go to the IAM console then select Roles.
  3. Search for and select the Project Role.
  4. Attach the S3TableFullAccess policy to the role, so that the project has full access to interact with S3 Tables.

Using JupyterLab

  1. Navigate to the project you created in the top center menu of the SageMaker Unified Studio home page.
  2. Expand the Build menu in the top navigation bar, then choose JupyterLab.
  3. Create a new notebook.
  4. Select Python3 Kernel.
  5. Choose PySpark as the connection type.
  6. Select your table bucket and namespace as the data source for your queries:
    1. For Spark engine, execute query USE s3tablescatalog_blogdata

Querying data using Redshift:

In this section, we walk through how to query the data using Redshift within SageMaker Unified Studio.

  1. From the SageMaker Studio home page, choose your project name in the top center navigation bar.
  2. In the navigation panel, expand the Redshift project folder.
  3. Open the blogdata@s3tablescatalog database.
  4. Expand the taxidata schema.
  5. Under the Tables section, locate and expand taxi_trip_data_iceberg.
  6. Review the table metadata to view all columns and their corresponding data types.
  7. Open the Sample data tab to preview a small, representative subset of records.
  8. Choose Actions.
  9. Select Preview data from the dropdown to open and view the full dataset in the data viewer.

When you select your table, the Query Editor automatically opens with a pre-populated SQL query. This default query retrieves the top 10 records from the table, giving you an instant preview of your data. It uses standard SQL naming conventions, referencing the table by its fully qualified name in the format database_schema.table_name. This approach ensures the query accurately targets the intended table, even in environments with multiple databases or schemas.

Best practices and considerations

The following are some considerations you should take note of.

  • When you create an S3 table bucket using the S3 console, integration with AWS analytics services is enabled automatically by default. You can also choose to set up the integration manually through a guided process in the console. Also, when you create S3 Table bucket programmatically using the AWS SDK, or AWS CLI, or REST APIs, the integration with AWS analytics services is not automatically configured. You need to manually perform the steps required to integrate the S3 Table bucket with AWS Glue Data Catalog and Lake Formation, allowing these services to discover and access the table data.
  • When creating an S3 table bucket for use with AWS analytics services like Athena, we recommend using all lowercase letters for the table bucket name. This requirement ensures proper integration and visibility within the AWS analytics ecosystem. Learn more about it from getting started with S3 tables.
  • S3 Tables offer automatic table maintenance features like compaction, snapshot management, and unreferenced file removal to optimize data for analytics workloads. However, there are some limitations to consider. Please read more on it from considerations and limitations for maintenance jobs.

Conclusion

In this post, we discussed how to use SageMaker Unified Studio’s integration with S3 Tables to enhance your data analytics workflows. The post explained the setup process, including creating a Lakehouse catalog with S3 table bucket source, configuring necessary IAM roles, and establishing integration with AWS Glue Data Catalog and Lake Formation. We walked you through practical implementation steps, from creating and managing Apache Iceberg based S3 tables to executing queries through both the Query Editor and JupyterLab with PySpark, as well as accessing and analyzing data using Redshift.

To get started with SageMaker Unified Studio and S3 Tables integration, visit Access Amazon SageMaker Unified Studio documentation.


About authors

Sakti Mishra

Sakti Mishra

Sakti is a Principal Data and AI Solutions Architect at AWS, where he helps customers modernize their data architecture and define end-to end-data strategies, including data security, accessibility, governance, and more. He is also the author of Simplify Big Data Analytics with Amazon EMR and AWS Certified Data Engineer Study Guide. Outside of work, Sakti enjoys learning new technologies, watching movies, and visiting places with family.

Vivek Shrivastava

Vivek Shrivastava

Vivek is a Principal Data Architect, Data Lake in AWS Professional Services. He is a big data enthusiast and holds 14 AWS Certifications. He is passionate about helping customers build scalable and high-performance data analytics solutions in the cloud. In his spare time, he loves reading and finds areas for home automation.

David Pasha

David Pasha

David is a Senior Healthcare and Life Sciences (HCLS) Technical Account Manager with 16 years of expertise in analytics. As an active member of the Analytics Technical Field Community (TFC), he specializes in designing and implementing scalable data warehouse solutions for customers in the cloud.

Debu Panda

Debu Panda

Debu is a Senior Manager, Product Management at AWS. He is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world.

Building responsive APIs with Amazon API Gateway response streaming

Post Syndicated from Anton Aleksandrov original https://aws.amazon.com/blogs/compute/building-responsive-apis-with-amazon-api-gateway-response-streaming/

Today, AWS announced support for response streaming in Amazon API Gateway to significantly improve the responsiveness of your REST APIs by progressively streaming response payloads back to the client. With this new capability, you can use streamed responses to enhance user experience when building LLM-driven applications (such as AI agents and chatbots), improve time-to-first-byte (TTFB) performance for web and mobile applications, stream large files, and perform long-running operations while reporting incremental progress using protocols such as server-sent events (SSE).

In this post you will learn about this new capability, the challenges it addresses, and how to use response streaming to improve the responsiveness of your applications.

Overview

Consider this scenario – you’re running an AI-powered agentic application that uses an Amazon Bedrock foundation model. Your users interact with the application through an API, asking complex questions that require detailed responses. Before response streaming, users would send their prompts and wait to eventually receive the application response, sometimes for tens of seconds. This awkward pause between questions and responses created a disconnected, unnatural experience.

With the new API Gateway response streaming capability, the interaction through the API becomes much more fluid and natural. As soon as your application starts processing the model response, you can stream it back to your users using the API Gateway.

The following animation illustrates this significant user experience improvement. The prompt on the left is processed using a non-streaming response with user having to wait for several seconds to receive the result. The prompt on the right is using the new API Gateway response streaming, significantly reducing TTFB and improving user experience.

Figure 1. Comparing user experience before (left) and after (right) enabling API Gateway response streaming when returning a response from a Bedrock foundational model.

Your users can now see AI responses appear in real-time, word by word, just like watching someone type. This immediate feedback makes your applications feel more responsive and engaging, keeping users connected throughout the interaction. In addition, you don’t have to worry about response size limits or implement complex workarounds – the streaming happens automatically and efficiently, letting you focus on building great user experiences rather than managing infrastructure constraints.

Understanding response steaming

In the traditional request-response model, responses must be fully computed before being sent to the client. This can negatively impact user experience – the client must wait for the complete response to be generated on the server-side and transmitted over-the-wire. This is especially pronounced in interactive, latency-sensitive cloud applications such as AI agents, chatbots, virtual assistants, or music generators.

Figure 2. Response is returned to the client only after it’s been fully generated, increasing time-to-first-byte latency.

Another important scenario is returning larger response payloads, such as images, large documents, or datasets. In some cases, these payloads may exceed the 10 MB response size limit or default integration timeout limit of 29 seconds of API Gateway. Before the launch of response streaming, developers worked around these limitations by using pre-signed Amazon S3 URLs to download large responses or accepting lower RPS for an increase in timeout. While functional, these workarounds introduced additional latency and architectural complexity.

With response streaming support you can address these challenges. You can now update your REST APIs to return streamed responses, significantly enhancing user experience, improving TTFB performance, supporting response payload sizes to exceed 10 MB, and serving requests that can take up to 15 minutes.

Figure 3. Response streaming reduces time-to-first-byte and improves user experience.

The response streaming capability is already delivering significant performance for organizations:

“Working closely with the AWS teams to enable response streaming was instrumental in advancing our roadmap to deliver the most performant storefront experiences for our largest customers at Salesforce Commerce Cloud. Our collaboration exceeded our Core Web Vital goals; we saw our Total Blocking Time metrics drop by over 98%, which will enable our customers to drive higher revenue and conversion rates.”, says Drew Lau, Senior Director of Product Management at Salesforce.

Response streaming is supported for any HTTP-proxy integration, AWS Lambda functions (using proxy integration mode), and private integrations. To get started, configure your API integration to stream the response from your backend, as described in the following sections, and redeploy your API for changes to take effect.

Getting started with response streaming

To enable response streaming for your REST APIs, update your integration configuration to set the response transfer mode to STREAM. This enables API Gateway to start streaming the response to the client as soon as response bytes become available. When using response streaming, you can configure request timeout up to 15 minutes. For best time to first byte user experience, AWS strongly recommends your backend integration also implements response streaming.

You can enable response streaming in several different ways, as illustrated in the following snippets:

Using the API Gateway console, when creating method integrations, select Stream for the Response transfer mode.

Figure 4. Enabling response streaming in API Gateway Console.

Setting response transfer mode using the Open API spec:

paths:
  /products:
    get:
      x-amazon-apigateway-integration:
        httpMethod: "GET"
        uri: "https://example.com"
        type: "http_proxy"
        timeoutInMillis: 300000
        responseTransferMode: "STREAM"

Setting response transfer mode using infrastructure-as-code (IaC) frameworks, such as AWS CloudFormation. Note the /response-streaming-invocations Uri fragment, it tells API Gateway to use the Lambda InvokeWithResponseStreaming endpoint:

MyProxyResourceMethod:
  Type: 'AWS::ApiGateway::Method'
  Properties:
    RestApiId: !Ref LambdaSimpleProxy
    ResourceId: !Ref ProxyResource
    HttpMethod: ANY
    Integration:
      Type: AWS_PROXY
      IntegrationHttpMethod: POST
      ResponseTransferMode: STREAM
      Uri: !Sub arn:aws:apigateway:${APIGW_REGION}:lambda:path/2021-11-
           15/functions/${FN_ARN}/response-streaming-invocations

Updating response transfer mode using the AWS CLI:

aws apigw update-integration \
   --rest-api-id a1b2c2 \
   --resource-id aaa111 \
   --http-method GET \
   --patch-operations "op='replace',path='/responseTransferMode',value=STREAM" \
   --region us-west-2

Using response streaming with Lambda functions

When using Lambda functions as a downstream integration endpoint, your Lambda functions must be streaming-enabled. The API Gateway uses the InvokeWithResponseStreaming API to invoke functions, as illustrated in the following diagram, and requires Lambda proxy integration. See the API Gateway documentation for additional guidance.

Figure 5. Using API Gateway response streaming with Lambda functions for interactive AI applications.

When you use response streaming with Lambda functions, API Gateway expects the handler response stream to contain the following components (in order):

  • JSON response metadata – Must be a valid JSON object and can only contain statusCode, headers, multiValueHeaders, and cookies fields (all optional). Metadata cannot be an empty string; at a minimum it must be an empty JSON object.
  • The 8-null-byte delimiter – Lambda adds this delimiter automatically when you use the built-in awslambda.HttpResponseStream.from() method, as illustrated below. When not using this method, you’re responsible for adding the delimiter yourself.
  • Response payload – Can be empty.

The following code snippet illustrates how you can return a streamed response from your Lambda functions so it will be compatible with API Gateway response streaming:

export const handler = awslambda.streamifyResponse(
   async (event, responseStream, context) => {

      const httpResponseMetadata = {
         statusCode: 200,
         headers: {
            'Content-Type': 'text/plain',
            'X-Custom-Header': 'some-value'
         }
      };

      responseStream = awslambda.HttpResponseStream.from(
         responseStream,
         httpResponseMetadata
      );

      responseStream.write('hello');
      await new Promise(r => setTimeout(r, 1000));
      responseStream.write(' world');
      await new Promise(r => setTimeout(r, 1000));
      responseStream.write('!!!');
      responseStream.end();
   }
);

Refer to the API Gateway documentation for further implementation guidelines.

Using response streaming with HTTP Proxy integrations

You can stream HTTP responses from your applications used as downstream integration endpoints, for example web servers running on Amazon Elastic Container Service (Amazon ECS) or Amazon Elastic Kubernetes Service (Amazon EKS). In this case, you must use HTTP_PROXY integration and specify the response transfer mode as STREAM (using the console, AWS CLI, or IaC). Redeploy your API after modifying it.

Figure 6. Using API Gateway response streaming with HTTP server applications.

Once API Gateway receives a streaming response from your application, it will wait until the HTTP headers block transfer is complete. Then, it will send to the client an HTTP response status code and headers, followed by the content from your application as it gets received by the API Gateway service. It will continue streaming response from your application to the client until the stream ends (up to 15 minutes).

Many popular API and web application development frameworks provide response streaming abstractions. The following code snippet illustrates how you can implement HTTP response streaming using FastAPI:

app = FastAPI()

async def stream_response():
   yield b"Hello "
   await asyncio.sleep(1)
   yield b"World "
   await asyncio.sleep(1)
   yield b"!"

@app.get("/")
async def main():
   return StreamingResponse(stream_response(), media_type="text/plain")

Adding real-time response streaming to your HTTP clients

Different HTTP clients have different ways to process streamed response fragments as they arrive. The following code snippet illustrates how to process a streamed response with a Node.js application:

const request = http.request(options, (response)=>{
   response.on('data', (chunk) => {
      console.log(chunk);
   });

   response.on('end', () => {
      console.log('Response complete’);
   });
});

request.end();

When using CURL, you can use the –no-buffer argument to print response fragments as they arrive.

curl --no-buffer {URL}

Sample code

Clone this sample project from GitHub to see API Gateway response streaming in action. Follow instructions in the README.md to provision the sample project in your AWS account.

Considerations

Before you enable response streaming, consider:

  • Response streaming is available for REST APIs and can be used with HTTP_PROXY integrations, Lambda integrations (in proxy mode), and private integrations.
  • You can use API Gateway response streaming with any endpoint type, such as Regional, Private, and Edge-optimized, with or without custom domain names.
  • When using response streaming, you can configure response timeouts up to 15 minutes, according to your scenario requirements.
  • All streaming responses from Regional or Private endpoints are subject to a 5-minute idle timeout. All streaming responses from edge-optimized endpoints are subject to a 30-second idle timeout.
  • Within each streaming response, the first 10MB of response payload is not subject to any bandwidth restrictions. Response payload data exceeding 10MB is restricted to 2MB/s.
  • Response streaming is compatible with API Gateway security capabilities such as authorizers, WAF, access controls, TLS/mTLS, request throttling, and access logging.
  • When processing streamed responses, the following features are not supported: response transformation with VTL, integration response caching, and content encoding.
  • Always protect your APIs against unauthorized access and other potential security threats by implementing proper authorization with Lambda Authorizers or Amazon Cognito User Pools. Read REST API protection documentation and API Gateway security documentation for additional details.

Observability

You can continue using existing observability capabilities, such as execution logs, access logs, AWS X-Ray integration, and Amazon CloudWatch metrics with API Gateway response streaming.

In addition to the existing access logs variables, the following new variables are available:

  • $content.integration.responseTransferMode – the response transfer mode of your integration. This can be either BUFFERED or STREAMED.
  • $context.integration.timeToAllHeaders – the time between when API Gateway establishes the integration connection to when it receives all integration response headers from the client.
  • $context.integration.timeToFirstContent – the time between when API Gateway establishes the integration connection to when it receives the first content bytes.

See API Gateway documentation for more information.

Pricing

With this new capability, you continue to pay the same API Invoke rates for streamed responses. Each 10MB of response data, rounded up to the nearest 10MB, is billed as a single request. See API Gateway pricing page for additional details.

Conclusion

The new response streaming capability for Amazon API Gateway enhances how you can build and deliver responsive APIs in the cloud. With immediate streaming of response data as it becomes available, you can significantly improve time-to-first-byte performance and overcome traditional payload size and timeout limitations. This is particularly valuable for AI-powered applications, file transfers, and interactive web experiences that demand real-time responsiveness.

To learn more about API Gateway response streaming see the service documentation.

To learn more about building Serverless architectures see Serverless Land.

Optimize latency-sensitive workloads with Amazon EC2 detailed NVMe statistics

Post Syndicated from Sanjeev Malladi original https://aws.amazon.com/blogs/compute/optimize-latency-sensitive-workloads-with-amazon-ec2-detailed-nvme-statistics/

Amazon Elastic Cloud Compute (Amazon EC2) instances with locally attached NVMe storage can provide the performance needed for workloads demanding ultra-low latency and high I/O throughput. High-performance workloads, from high-frequency trading applications and in-memory databases to real-time analytics engines and AI/ML inference, need comprehensive performance tracking. Operating system tools like iostat and sar provide valuable system-level insights, and Amazon CloudWatch offers important disk IOPs and throughput measurements, but high-performance workloads can benefit from even more detailed visibility into instance store performance.

For latency-sensitive applications where every millisecond counts, enhanced performance monitoring tools provide deep visibility into storage systems, so your teams can track and analyze behavior at a 1 second granularity. This detailed insight can help your organization detect bottlenecks quickly, fine-tune application performance, and deliver reliable service.

In this post, we discuss how you can use Amazon EC2 detailed performance statistics for instance store NVMe volumes, a set of new metrics that provide per-second granularity, to provide real-time visibility into your locally attached storage performance. These statistics are similar to the Amazon EBS detailed performance statistics, providing a consistent monitoring experience across both storage types. You can access these statistics directly from your NVMe devices attached to the Amazon EC2 instance using nvme-cli or using CloudWatch agent to monitor I/O performance at the storage level. We also provide examples of how to use these statistics to identify performance bottlenecks.

Feature overview

Amazon EC2 Nitro-based instances with locally attached NVMe instance storage now offer 11 comprehensive metrics at per-second granularity. These metrics, similar to EBS volume metrics, include queue length measurements, IOPS, throughput data, and IO latency histograms for the locally attached NVMe instance storage. Additionally, they also include IO size-specific latency histograms to provide even more detailed insights into performance patterns of the local NVMe instance storage. These metrics are collected and presented separately for each individual NVMe volume available on an instance.

The statistics are presented in three main formats:

    1. Cumulative counters that track IO operations, throughput, and read/write times
    2. Real-time queue length, displaying the current value at the time of your query
    3. Latency histograms visualizing the distribution of IO operations across different latency ranges by displaying both cumulative view and IO size-specific distributions

Prerequisites

To access detailed performance statistics for local instance storage, complete the following steps:

    1. Launch a new Amazon EC2 Nitro instance or use an existing one, then connect to it using SSH or your preferred connection method.
    2. Identify the NVMe device associated with the local storage to query for the performance statistics. For example, you can run the nvme-cli command in the CLI to output all NVMe devices on the instance.
      $ sudo nvme list.

      The following is an example output of the list command that lists the NVMe devices on the instance and their volume Serial Numbers (SN; masked in the below output for privacy). In this demonstration, consider that the local storage used by your application is /dev/nvme1n1.

      Terminal output showing five NVMe devices: one EBS volume and four EC2 instance storage volumes with 3.75TB capacity each

    3. If you are using Amazon Linux 2023 version 2023.8.20250915 (or later) or Amazon Linux 2 2.0.20251014.0 (or later) you can proceed to Step 4 because nvme-cli will use the latest version. If you are using an earlier Amazon Linux version, update the nvme-cli using the following command, where 2023.8.20250915 can be replaced with the latest Amazon Linux 2023 version:
      $ sudo dnf upgrade --releasever=2023.8.20250915
    4. Run the nvme-cli, with the correct permissions, and pass the device as a parameter. You can use --help to get details on the command usage:
      $ sudo nvme amzn stats --help

      Example output:
      Command help output for 'nvme amzn stats' showing usage syntax and format options
      If you prefer output in a JSON format, you can provide the -o json parameter to the command.

      $ sudo nvme amzn stats /dev/nvme1n1 -o json

      The following output (without the -o json parameter) shows cumulative read/write operations, read/write bytes, total processing time (read and write in microseconds), and duration (in microseconds) when application attempted to exceed the instance’s IOPS/throughput limits.
      Storage performance metrics showing read operations count, total bytes, and timing statistics for an EC2 NVMe volume
      It also displays read/write I/O latency histograms, with each row representing completed I/O operations within a specific bin of time (in microseconds).
      Read latency distribution histogram showing operation counts across different microsecond ranges, with peak activity in 2048-4096 rangeWrite latency distribution histogram showing zero operations across all time ranges, indicating no write activity
      If you want to view the latency histograms across 5 different IO bands: (0, 512 Byte], (512B, 4KiB], (4KiB, 8KiB], (8KiB 32KiB], (32 KiB, MAX], you can provide --details or -d parameter to the command:

      $ sudo nvme amzn stats -d /dev/nvme1n

      The following image is an excerpt of the above command’s output, showing the additional latency histograms (read and write) of the 5 different IO bands.
      Dual read/write I/O latency histogram analyzing small block operations from 0-512 bytes with peak at 4096-8192 rangePerformance analysis histogram showing I/O patterns for 512-4K blocks with significant activity in 512-1024 rangeDual histogram showing I/O latency patterns for 4K-8K block operations with concentrated activity at 4096-8192Performance analysis histogram displaying I/O patterns for 8K-32K blocks with peak activity in 4096-8192 rangeComprehensive I/O latency histogram analyzing largest block sizes from 32K to maximum with concentrated activity in 4096-8192

You can run the stats command at a per second granularity. You can also write scripts to pull the stats at a desired interval (every second or any other duration) with each subsequent output reflecting the updated cumulative totals for the metrics. Calculating the difference in the statistics across the last two outputs allows you to derive insight into the instance storage profile during the interval. Below is a sample script you can use to pull the stats at a default interval of 1 second or at your desired interval.

#!/bin/bash 
# interval of 1 second 
INTERVAL=${1:-1} 
while true; do 
	echo "=== $(date) ===" 
	sudo nvme amzn stats /dev/nvme1 || break 
	echo 
	sleep $INTERVAL 
done

You can save this script, make it executable and run it at either the default 1-second interval or provide a custom interval when executing the script. For example, if you saved the script as nvme_stats.sh, you could use the following commands to make it executable and run to get the output at the default 1-second interval (assuming you are in the same directory as that of nvme_stats.sh).

chmod +x nvme_stats.sh
./nvme_stats.sh

If, for instance, you want to get the output at every 5 seconds, you can use the command below (after making the script executable)

./nvme_stats.sh 5

You can also integrate with CloudWatch using CloudWatch agent to collect and publish these statistics for historical tracking, trend visualization through dashboards, and performance-based alerts to correlate with application metrics and automated notifications for performance issues.

Deriving insights from the Amazon EC2 instance store NVMe detailed performance statistics

Similar to EBS detailed performance statistics, you can use Amazon EC2 instance store NVMe statistics to troubleshoot various workload performance issues. As mentioned in the preceding section, you can also use the detailed statistics to view I/O latency histograms to observe the spread of I/O latency within the period. You can use the read/write operations and time spent statistics to calculate the average latency. The detailed statistics show the average latency at per-second granularity.

The next two example scenarios demonstrate key performance analysis using the statistics. In Scenario 1, we will use the EC2 Instance Local Storage Performance Exceeded (us) metric to check if I/O demands exceed instance storage capabilities, helping with instance right-sizing for sufficient I/O application performance. In Scenario 2, we will use IO-size specific histograms (using --details) to diagnose how large block writes affect subsequent read performance – an issue typically hidden by traditional monitoring tools’ aggregated metrics across all IO sizes.

Scenario 1: Identifying when applications exceed instance storage performance limits

Understanding whether your application’s I/O demands exceed your instance store volumes’ capabilities is important for performance troubleshooting. When applications generate I/O workloads that consistently attempt to exceed the IOPS and throughput limits of specific Amazon EC2 instance types, you’ll experience increased latency and degraded performance. The EC2 Instance Local Storage Performance Exceeded (us) metric helps identify these scenarios by showing the duration (in microseconds) when workloads exceeded supported instance performance. A non-zero value or increasing count between snapshots indicates your current instance size or type may not provide sufficient I/O performance for your application.

The following section shows how to identify if an application is sending more IOPS than the instance’s local storage can support.

The example scenario: An application on an i3en.xlarge instance shows elevated write latency of >1ms. You want to determine if the application’s workload is exceeding the instance’s NVMe volume supported performance.

    1. Select the Instance Storage NVMe device you want to analyze – Identify the instance you want to analyze for the application experiencing elevated latency.
    2. Identify the NVMe device – Use the following nvme-cli command, and identify the NVMe device associated with that instance storage.
      $ sudo nvme list

      Example scenario: We used the list and identified /dev/nvme1n1 as the NVMe device associated with the i3en.xlarge instance that is running the application which is currently seeing elevated write latency >1ms (while read latency is <50us as per normal conditions), so now we want to. analyze it.

    3. Collect statistics for the device at a single point in time or at desired intervals – Collect the detailed performance statistics using the nvme-cli command or use the sample script provided in previous section to capture statistics at the desired intervals, if needed.
      $ sudo nvme amzn stats /dev/nvme1n1

      Example scenario: We choose to collect the statistics only once after noticing elevated write latency of the application.

    4. Analyze the statistics to check if the application demands more than the supported performance of the instance storage – Confirm existence of overall I/O latency degradation by comparing two sets of read/write I/O latency histograms taken some time apart.Example scenario: The following output shows Read IO histogram of the NVMe local instance storage taken 40 seconds apart with no read IO latency issues (as normal read latency for this workload is < 50 us).

      Metric captured at time T:
      AWS EC2 storage performance histogram showing read latency distribution, peak at 16-32 microsecond bucket
      Metric captured at time T+40s:
      AWS EC2 storage performance data showing increased read latency concentration in 16-32 microsecond bucket
      The following output shows Write IO histogram taken 40 seconds apart. We can discern that many write IOs fall into the 1ms – 2ms latency range, which is not expected for this application.
      Metric captured at time T:
      AWS EC2 storage write performance data showing majority of operations between 1-2ms latency
      Metric captured at time T+40s:
      AWS EC2 storage performance metrics showing increased write operations clustered in 1-2ms latency range

    5. Analyze the EC2 Instance Local Storage Performance Exceeded (us) metric which shows total time (in microseconds) IOPS requests exceed volume limits. Ideally, the incremental count of this metric between two snapshot times should be minimal, as any value above 0 indicates that the workload demanded more IOPS than the volume could deliver.Example scenario: Comparing metrics 40 seconds apart shows that for more than 34 seconds, the application’s IOPS demands surpassed the IOPS supported by the local instance storage. This explains elevated write latency: excess IOPS above what the underlying storage can physically handle queue the operations, increasing wait times. This indicates that the i3en.xlarge instance chosen to run this application cannot meet the application’s performance requirements, suggesting either upgrading to a larger instance size or re-evaluating the instance type itself.
      Metric captured at time T:
      EC2 Instance Local Storage Performance exceeded output of nvme-cli for the described scenario at time T
      Metric captured at time T+40s:
      EC2 Instance Local Storage Performance exceeded output of nvme-cli for the described scenario at time T+40 with increased count of metric

It’s important to have the right instance size to avoid performance bottlenecks to your application. Refer to the Amazon EC2 instance documentation for more information on the different instances and their storage size.

Scenario 2: Identifying the block size causing elevated latency in your applications

Many storage performance issues arise from complex interactions between read and write operations with different I/O sizes, which traditional system-level monitoring tools like iostat or sar cannot effectively diagnose due to their aggregated metrics across all I/O sizes. EC2 instance store NVMe detailed performance statistics solves this by providing I/O-size specific latency histograms through the --details option in NVMe CLI. These histograms show latency data for different I/O size ranges: (0, 512 Byte], (512B, 4KiB], (4KiB, 8KiB], (8KiB, 32KiB], (32KiB, MAX], for a more precise correlation between application workload patterns and I/O size-specific latency metrics for targeted optimizations.

In this example scenario, your application performs small reads (typically <=4KiB, like metadata read) followed by large writes (>=32KiB) and shows unexpectedly high read latency. This common issue occurs when large writes impact subsequent read operations’ performance, creating a cascading effect on overall I/O performance.

    1. Gather read and write IO latency by size ranges – Use the NVMe CLI with the --details option to gather read and write IO latency by size ranges:
      $ sudo nvme amzn stats /dev/nvme1n1 --details

    2. Confirm existence of overall IO latency degradation – In the example scenario, examining overall IO latency, both read (left) and write (right) operations are showing higher than expected latency.
      NVMe storage read latency histogram highlighting concentrated IO operations in 4K-16K microsecond rangeNVMe storage write latency histogram highlighting concentrated IO operations in 8-32K microsecond range
    3. Examine the output for patterns across different IO size bands – Analyzing latency by operation sizes shows small read operations (512 bytes to 4K), typically fast, are experiencing unexpected latency spikes while large writes (32K+) show significant delays. Small reads should theoretically maintain good performance regardless of other I/O activities.
      NVMe storage read/write latency histogram highlighting concentrated IO operations in 8-16K microsecond range for IO band of 512 - 4KNVMe storage read/write latency histogram highlighting concentrated IO operations in 8-16K microsecond range in IO band 32K and above
      The observed pattern indicates that the backed-up large write operations create system-wide congestion, affecting all I/O operations of types and sizes. Despite the storage system’s capability to handle small reads efficiently, the queued large writes slow down both read and write operations at the application level.

Based on this analysis, we can implement several targeted optimizations to the application, like using smaller block sizes for write operations when possible, or batching smaller writes instead of performing large single writes.

Clean up

If you created an Amazon EC2 instance with NVMe volume for this exercise, then terminate and delete the appropriate instance to avoid future costs.

Conclusion

Amazon EC2 detailed performance statistics for instance store NVMe volumes provide real-time, sub-minute storage performance monitoring, similar to the detailed performance statistics available for Amazon EBS volumes. This offers consistent monitoring experience across both storage types, with additional IO-size based latency histograms for instance storage for better optimization of I/O patterns, and more effective troubleshooting.

To learn more about Amazon EC2 instance store NVMe volumes, optimization techniques for latency-sensitive workloads or other Amazon EC2 related topics, visit the Amazon EC2 documentation page or explore our other AWS Storage Blog posts on performance optimization.

We’d love to hear how you’re using these statistics to enhance your workloads, or if you have any questions, in the comments section below.

Announcing CloudFormation IDE Experience: End-to-End Development in Your IDE

Post Syndicated from Damola Oluyemo original https://aws.amazon.com/blogs/devops/announcing-cloudformation-ide-experience-end-to-end-development-in-your-ide/

If you’ve developed AWS CloudFormation templates, you know the drill; write YAML(YAML Ain’t Markup Language) in your IDE(Integrated Development Environment), switch to the AWS Management Console to validate, jump to documentation to verify property names. Then run CFN Lint(Cloudformation Linter) in your terminal, deploy and wait, then troubleshoot failures back in the console. This constant context switching between your IDE, AWS Console, documentation pages, and validation tools fragments your workflow and kills productivity. What should take 30 minutes often stretches into hours of iteration cycles.

Today, we’re excited to introduce the CloudFormation IDE Experience, a comprehensive solution that brings the entire CloudFormation development lifecycle into your IDE. No more context switching. No more fragmented workflows. Just one unified, intelligent development experience from authoring to deployment.

In this post, you’ll learn how the Cloudformation IDE Experience transforms your workflow with intelligent authoring, real-time validation, AWS integration, and more.

What is the CloudFormation IDE Experience?

The CloudFormation IDE Experience reimagines how you build infrastructure as code by creating an end-to-end development loop entirely within your IDE. Unlike generic YAML or JSON editors, this is a CloudFormation-first solution built specifically for infrastructure developers.

This solution covers the complete lifecycle; from intelligent authoring with smart code completion and navigation that understands CloudFormation semantics, to real-time multi-layer validation that catches issues before deployment. It provides direct AWS integration for seamless resource imports and stack visibility, monitors configuration drift between your templates and deployed resources, and includes server-side pre-deployment checks that prevent common deployment failures.
The result? A development environment that understands your infrastructure code as deeply as your IDE understands your application code.

Core Features

Quick Project Setup with CFN Init

CFN Init streamlines project setup by creating a structured CloudFormation project with environment configurations in seconds. Run “CFN Init: Initialize Project” from the Command Palette, configure your environments (dev, staging, production), and associate each with an AWS profile.

The CloudFormation Explorer displays your environments, letting you switch between them with a single click. Each environment maintains its own deployment settings and parameter values, eliminating manual configuration and ensuring consistent deployments across your infrastructure lifecycle.

Intelligent Authoring with Intelligent Code Completion

The IDE understands CloudFormation semantics and provides context-aware suggestions as you type. Only required properties appear automatically, while optional properties surface on hover, so when you add a Properties section to an EC2 VPC resource, nothing appears because it has no required properties. Create a subnet, however, and VpcId appears immediately because it’s required.

When you use !GetAtt or !Ref, the IDE knows exactly which attributes and resources are available. Navigation features like go-to-definition for logical IDs and hover tooltips let you explore complex templates without losing context. The IDE also provides full support for CloudFormation intrinsic functions and pseudo parameters.

Multi-Layer Validation System

The IDE provides comprehensive validation at multiple levels:

Static Validation (Real-time)

  • CloudFormation Guard Integration: Security and compliance checks using AWS Security pillar rules. For example, it automatically flags insecure configurations like MapPublicIpOnLaunch: true on subnets
  • CFN Lint Integration: Advanced syntax and logic validation, including overlapping CIDR block detection, resource dependency validation, and property checks beyond basic schema validation

Interactive Error Resolution
When errors occur, the IDE doesn’t just highlight them, it helps you fix them. Contextual error messages explain what’s wrong and why it matters, while one-click quick fixes automatically correct common issues like missing required properties or invalid reference formats. If you reference a non-existent resource, the IDE suggests valid alternatives from your template. Reference an invalid attribute with !GetAtt, the IDE immediately shows which attributes are actually available for that resource type.

AWS Resource Integration (CCAPI)

Import existing AWS resources directly into your templates using the Cloud Control API (CCAPI). Browse live resources and view all CloudFormation stacks in your AWS account from within the IDE. Pull resource configurations directly into your template with one click, complete with accurate property values. This transforms existing infrastructure into Infrastructure-as-Code without manual reconstruction or switching to the console to look up property values.

Server-Side Validation

Before you deploy, the IDE performs comprehensive server-side validation through AWS’s intelligent validation service that analyzes your CloudFormation templates against real-world deployment patterns and catches issues static analysis can’t detect.

The AWS’s intelligent validation service uses AWS-managed hooks to analyze your change sets before execution across three categories. Enhanced template validation covers CFN Lint blind spots like transforms and parameter values. Primary identifier conflict detection finds existing resources with the same identifiers before you attempt deployment. Resource state validation checks resource readiness ensuring, for example, that Amazon Simple Storage Service(S3) buckets are empty before deletion attempts.

This validation is based on analysis of the top CloudFormation failure patterns, helping you catch issues before they cause rollbacks or failed states.

Getting Started

Getting started with the CloudFormation IDE Experience is straightforward:

Prerequisite:

  1. Install an IDE that supports the CloudFormation extension, such as Visual Studio Code, Kiro
  2. Download the CloudFormation extension for your platform (available through the AWS Toolkit)
  3. Install the extension following the standard VS Code extension installation process

No complex dependency management or schema updates required—all configuration and updates are handled automatically.

Let’s See How It Works

Let’s walk through a practical example that demonstrates the IDE experience in action. We’ll build a simple Amazon Virtual Private Cloud (Amazon VPC) infrastructure with subnets and an S3 bucket.

Setting Up Your Project

Start by initializing a new CloudFormation project. Open the Command Palette, run “CFN Init: Initialize Project”, choose your project location, and set up environments. For this example, create a “beta” environment and associate it with your AWS development profile. The IDE creates your project structure with configuration files ready to use. You can now select your “beta” environment from the CloudFormation Explorer to ensure all deployments use the correct settings.

Figure 1: Initializing a CloudFormation project with environment configuration

Starting with Intelligent Authoring

Create a new CloudFormation template and start typing AWS::EC2::VPC. The IDE provides intelligent completions as you type.

Cloudformation IDE extension intelligent completion

Figure 2.0: Resource type auto-completion with CloudFormation-aware IntelliSense

When you add the Properties section, notice something interesting: nothing appears automatically. That’s because Amazon Elastic Compute Cloud (Amazon EC2) VPC has no required properties.

Cloudformation IDE extension doesn't suggest optional properties
Figure 2.1: No automatic suggestions for VPC properties since none are required

Hover over Properties to see all available options with their types and documentation links.

Hover information displaying optional properties and their documentation

Figure 2.2: Hover information displaying optional properties and their documentation

Add a CIDR block, then create a subnet. This time, when you type Properties, VpcId appears immediately because it’s required.

Required properties VpcID automatically suggested for EC2 Subnet
Figure 2.3: Required properties VpcID automatically suggested for EC2 Subnet

The IDE provides the resource names in your template, and when you use !GetAtt or !Ref, it knows which attributes are available for each resource type.

Type-aware completions for intrinsic functions like !GetAtt & !Ref

Figure 2.4: Type-aware completions for intrinsic functions like !GetAtt & !Ref

Real-Time Validation in Action

As you continue building, add MapPublicIpOnLaunch: true to make a public subnet. Immediately, a blue squiggly line appears.

CloudFormation Guard warning highlighted in real-time

Figure 3: CloudFormation Guard warning highlighted in real-time

Hovering reveals a CloudFormation Guard warning from the AWS Security pillar rules: this configuration isn’t recommended for security compliance.

Security compliance warning with detailed explanation

Figure 3.1: Security compliance warning with detailed explanation

Create a second subnet by copying the first, but now red squiggly lines appear. CFN Lint has detected overlapping CIDR blocks between your two subnets – an issue that would fail during deployment. You can fix it immediately with the contextual information provided.

CFN Lint error detection for overlapping CIDR blocks providing detailed error information helping you resolve the issue quickly
Figure 3.2: CFN Lint error detection for overlapping CIDR blocks providing detailed error information helping you resolve the issue quickly

Importing Existing Resources

Now you need an S3 bucket. Instead of writing it from scratch, open the Resource Explorer panel on the left. Using CCAPI integration, you can see all your existing AWS resources. Select an S3 bucket and click “Import resource state”. The IDE pulls in the complete resource configuration with all properties already set. You can now iterate on this resource without needing to remember or look up all the configuration details.

Automatically imported resource configuration from live AWS resources

Figure 4: Automatically imported resource configuration from live AWS resources

Developer Experience Benefits

The CloudFormation IDE Experience delivers measurable improvements across productivity and quality:

Productivity Gains:

  • Reduced context switching: Keep your entire workflow in one place
  • Faster iteration cycles: Catch and fix issues in seconds, not minutes or hours
  • Shift-left validation: Identify problems before deployment, not after
  • Intelligent assistance: Spend less time in documentation, more time building

Quality Improvements:

  • Proactive error prevention: Multi-layer validation catches issues early
  • Security by default: Built-in compliance checks from CloudFormation Guard
  • Best practice enforcement: Automated guidance aligned with AWS recommendations
  • Deployment confidence: Pre-deployment validation reduces rollback scenarios

What previously took hours of troubleshooting and multiple deployment attempts now becomes a confident 30-minute development cycle.

“I will definitely use these features; they help to reduce the feedback loop and speed up the development of IaC templates.” – AWS Community Builder

Things to Know

Platform Support

The CloudFormation IDE Experience is available for:

  • Visual Studio Code: Full feature support
  • Kiro: Full feature support
  • Cursor: Full feature support
  • JetBrains IDEs: Complete integration across the IntelliJ family (Fast Follow)
  • Operating Systems: macOS (ARM), Linux (x64) and Windows(…)

Conclusion

The CloudFormation IDE Experience eliminates the context switching that fragments your workflow. Write, validate, and deploy all from one environment. What used to take hours of iteration now takes minutes.

Ready to get started? Install the CloudFormation extension from the AWS Toolkit for VS Code and experience the difference. For detailed setup instructions and feature documentation, see the CloudFormation IDE Experience guide.

About the Authors:

Damola Oluyemo

Damola Oluyemo is a Solutions Architect at Amazon Web Services focused on Enterprise customers. He helps customers design cloud solutions while exploring the potential of Infrastructure as Code and generative AI in software development.

Jehu Gray

Jehu Gray is a Prototyping Architect at Amazon Web Services where he helps customers design solutions that fits their needs. He enjoys exploring what’s possible with IaC.

Safely Handle Configuration Drift with CloudFormation Drift-Aware Change Sets

Post Syndicated from JJ Lei original https://aws.amazon.com/blogs/devops/safely-handle-configuration-drift-with-cloudformation-drift-aware-change-sets/

Introduction

Is configuration drift preventing you from accessing the speed, safety, and governance benefits of AWS CloudFormation for infrastructure management? Configuration drift occurs when cloud resources are modified outside of CloudFormation, leading to a mismatch in the actual state and template definition of resources. Drift tends to accumulate from infrastructure changes that engineers make via the AWS Management Console to resolve production incidents or troubleshoot malfunctioning applications. Drift can cause unexpected changes during subsequent IaC deployments or leave resources in a non-compliant state. Unresolved drift can lead to cost increases when resources are over-provisioned outside of template definitions, or compliance violations that may result in audit penalties. Additionally, drift makes it hard to reproduce applications for testing or disaster recovery.

CloudFormation now offers drift-aware change sets that allow you to safely handle configuration drift and keep your infrastructure in sync with your templates. In this post, we will explore the process of leveraging drift-aware change sets to resolve common scenarios in which drift impacts the availability or security of your application.

Solution Overview

Drift-aware change sets are a type of CloudFormation change sets that can bring drifted resources in line with template definitions and preview the required changes to actual infrastructure states before deployment. Drift-aware change sets surface a three-way comparison of your new template, actual resource states, and previous template before deployment, allowing you to prevent unexpected overwrites of drift. Additionally, drift-aware change sets offer you a systematic mechanism to restore drifted resources to approved template definitions, strengthening the reproducibility and compliance posture of applications. You can create drift-aware change sets either from the CloudFormation Management Console or from the AWS CLI or SDK by passing the --deployment-mode REVERT_DRIFT parameter to the CreateChangeSet API.

Prerequisites

AWS CLI latest version with CloudFormation permissions configured.

AWS Identity and Access Management (IAM) permissions required: Permissions to create and manage CloudFormation stacks, AWS Lambda functions, Security Groups, Amazon Simple Storage Service (Amazon S3) buckets, and IAM roles. PowerUserAccess or Administrator access recommended for testing.

• Test environment (non-production AWS account recommended)

• Basic CloudFormation knowledge (stacks, templates, change sets)

Important Note: These sample templates are provided for educational purposes only and should not be used in production environments without proper security review and testing. You are responsible for testing, securing, and optimizing these templates based on your specific quality control practices and standards. Deploying these templates may incur AWS charges for creating or using AWS resources. Work with your security and legal teams to meet your organizational security, regulatory, and compliance requirements before any production deployment.

Scenario 1: Prevent Dangerous Overwrites

This scenario demonstrates how drift-aware change sets prevent dangerous overwrites when Lambda function memory is increased outside of CloudFormation during an outage, and a subsequent template update could accidentally reduce memory, causing performance issues.

Story: Your team deploys a Lambda function with 128 MB memory via CloudFormation. During a production outage, an engineer increases the memory to 512 MB through the Lambda Console to resolve performance issues. Later, another developer updates the template to 256 MB for a code change, unaware of the console modification. Without drift-aware change sets, CloudFormation would unexpectedly reduce memory from 512 MB to 256 MB—potentially causing the outage to recur.

User journey: Create stack with 128MB => Increase memory to 512MB via console during outage => Create drift-aware change set with 256MB template => Review three-way comparison showing dangerous memory reduction => Cancel change set to prevent outage => Update template to match production state (512MB) => Create and execute drift-aware change set with updated template (512MB) to resolve drift

Scenario Flow

1. Create Stack

Deploy CloudFormation stack with Lambda function (128 MB memory).

Figure 1

CloudFormation stack “lambda-memory-drift-test” successfully deployed with CREATE_COMPLETE status

2. Emergency Memory Increase (Console)

Manually increase Lambda memory to 512 MB through AWS Console (simulating emergency performance fix during outage).

Figure 2

Initial Lambda function showing 128 MB memory as configured in template

Figure 3

Lambda memory increased to 512 MB through console during outage, creating drift from template

3. Create Drift-Aware Change Set

Create change set with 256 MB template using drift-aware mode to reveal the dangerous memory reduction.

Figure 4

CloudFormation console showing the new “Drift aware change set” option selected. This compares the new template with the live state of your stack and shows changes to drifted resources before deployment, unlike standard change sets that only compare templates.

aws cloudformation create-change-set \
--stack-name lambda-memory-drift-test \
--change-set-name detect-memory-overwrite \
--template-body file://lambda-memory-drift-scenario-256mb.yaml \
--deployment-mode REVERT_DRIFT \
--capabilities CAPABILITY_IAM \
--region us-east-1

4. Review Change Set – The Critical Three-Way Comparison

Examine the drift-aware change set to see the dangerous memory reduction that would occur.

Figure 5

Critical insight revealed: The change set shows Live resource state (512 MB) vs Proposed resource state (256 MB), revealing a dangerous memory reduction that would impact performance.

Figure 6: view drift

Drift analysis: Clicking “View drift” reveals the complete picture – Previous template (128 MB) vs Live resource state (512 MB). This shows the live state has 4x more memory than the original template, indicating emergency changes were made during the outage that must be preserved.

Key Insight: The drift-aware change set reveals that:

  • Previous template: 128 MB (original deployment)
  • Live resource state: 512 MB (emergency change during outage)
  • Proposed template: 256 MB (new deployment)

This would cause a dangerous reduction from 512 MB to 256 MB, potentially recreating the original performance issue. Without drift-aware change sets, this critical information would be hidden.

5. Recreate Drift-aware Change Set with Updated Template (512MB) to Resolve Drift

Update the template to match the live production state (512 MB) and create a new drift-aware change set to safely resolve the drift.

Figure 7

Resolution confirmed: The drift-aware change set shows both Live resource state and Proposed resource state at 512 MB, with change set action ” Sync with live”. This verifies that the updated template now matches production, preventing the dangerous memory reduction and safely resolving the drift without impacting performance.

CloudFormation Templates

Initial Template (128 MB):

Resources:
  DriftTestFunction:
    Type: AWS::Lambda::Function
    Properties:
      Runtime: python3.9
      Handler: index.lambda_handler
      MemorySize: 128
      ReservedConcurrentExecutions: 5
      Role: !GetAtt LambdaExecutionRole.Arn
      Code:
        ZipFile: |
          def lambda_handler(event, context):
              return {'statusCode': 200, 'body': 'Hello!'}
  LambdaExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole

Updated Template (256 MB – lambda-memory-drift-scenario-256mb.yaml):

Resources:
  DriftTestFunction:
    Type: AWS::Lambda::Function
    Properties:
      Runtime: python3.9
      Handler: index.lambda_handler
      MemorySize: 256
      ReservedConcurrentExecutions: 5
      Role: !GetAtt LambdaExecutionRole.Arn
      Code:
        ZipFile: |
          def lambda_handler(event, context):
              return {'statusCode': 200, 'body': 'Hello!'}
  LambdaExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole

CLI Commands

  1. Create stack:
aws cloudformation create-stack --stack-name lambda-memory-drift-test --template-body file://lambda-memory-drift-scenario.yaml --capabilities CAPABILITY_IAM --region us-east-1
  1. Get function name:
aws cloudformation describe-stack-resources --stack-name lambda-memory-drift-test --logical-resource-id DriftTestFunction --query 'StackResources[0].PhysicalResourceId' --output text --region us-east-1
  1. Create drift-aware change set:
aws cloudformation create-change-set --stack-name lambda-memory-drift-test --change-set-name detect-memory-overwrite --template-body file://lambda-memory-drift-scenario-256mb.yaml --deployment-mode REVERT_DRIFT --capabilities CAPABILITY_IAM --region us-east-1
  1. Describe change set:
aws cloudformation describe-change-set --change-set-name detect-memory-overwrite --stack-name lambda-memory-drift-test --region us-east-1

Scenario 2: Remediate Unauthorized Changes

This scenario demonstrates how drift-aware change sets systematically remediate unauthorized changes when a developer adds temporary debugging rules to a security group but forgets to remove them, creating a compliance violation.

Story: Your team deploys a security group with only HTTP access via CloudFormation for compliance. During debugging, a developer adds SSH access (port 22) through the AWS Console for their IP address to troubleshoot an application issue. They forget to remove this rule after debugging. Later, security compliance requires reverting to the original template state. A standard change set shows no changes since the template is unchanged, but a drift-aware change set can detect and systematically remove the unauthorized SSH rule.

User journey: Create stack with HTTP-only access => Add SSH rule via console for debugging => Forget to remove SSH rule => Create drift-aware change set with REVERT_DRIFT mode => Review change set showing SSH rule removal => Execute change set to restore compliance

Scenario Flow

1. Create Stack

Deploy CloudFormation stack with security group allowing only HTTP traffic.

Figure 8

CloudFormation stack “sg-revert-drift-test” successfully deployed with DriftTestSecurityGroup resource

2. Make Unauthorized Changes (Console)

Manually add SSH ingress rule through AWS Console (simulating developer debugging access that wasn’t removed).

Figure 9: http only

Initial security group showing only HTTP (port 80) access as configured in template – compliant state

Figure 10: ssh-added

Security group now shows 2 permission entries: SSH (port 22) for specific IP and HTTP (port 80) for all traffic. The SSH rule creates drift and a compliance violation that needs systematic removal.

3. Create Drift-Aware Change Set

Create change set using REVERT_DRIFT mode to systematically remove the unauthorized SSH rule.

Figure 11

Creating drift-aware change set for security group compliance restoration. Note the “Drift aware change set” option is selected to compare with live state and detect unauthorized changes.

aws cloudformation create-change-set \
--stack-name sg-revert-drift-test \
--change-set-name revert-ssh-drift \
--use-previous-template \
--deployment-mode REVERT_DRIFT \
--region us-east-1

4. Review Change Set – Systematic Compliance Restoration

Examine the drift-aware change set to see systematic removal of unauthorized SSH rule.

Figure 12

Compliance violation detected: The drift -aware change set shows that the SSH rule in the live resource state (rule 232 for IP 15.248.7.53/32 on port 22) is not present in the proposed resource state derived from the template. This unauthorized SSH rule violates security policy and will be systematically removed

Key Insight: The drift-aware change set enables systematic compliance restoration by:

  • Previous template: Only HTTP (port 80) access – compliant state
  • Live resource state: HTTP + SSH (port 22) for 15.248.7.53/32 – compliance violation
  • Action: Remove unauthorized SSH rule to restore compliance

This provides a systematic, auditable way to remove unauthorized changes rather than manual cleanup.

Figure 13

Stack events showing successful execution of the drift-aware change set – SSH rule removed

CloudFormation Templates

security-group-drift-scenario.yaml:

Resources:
  DriftTestSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: "Security group for drift testing"
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          CidrIp: 0.0.0.0/0
          Description: "Allow HTTP traffic for demo purposes"
      SecurityGroupEgress:
        - IpProtocol: -1
          CidrIp: 0.0.0.0/0
          Description: "Allow all outbound traffic"

CLI Commands

  1. Create stack:
aws cloudformation create-stack --stack-name sg-revert-drift-test --template-body file://security-group-drift-scenario.yaml --region us-east-1
  1. Get security group ID:
aws ec2 describe-security-groups --filters "Name=tag:aws:cloudformation:stack-name,Values=sg-revert-drift-test" --query 'SecurityGroups[0].GroupId' --output text --region us-east-1
  1. Create drift-aware change set:
aws cloudformation create-change-set --stack-name sg-revert-drift-test --change-set-name revert-ssh-drift --template-body file://security-group-drift-scenario.yaml --deployment-mode REVERT_DRIFT --region us-east-1
  1. Describe change set:
aws cloudformation describe-change-set --change-set-name revert-ssh-drift --stack-name sg-revert-drift-test --region us-east-1

Scenario 3: Recreate Deleted Resources

This scenario demonstrates drift detection when a dependent resource (logs bucket) is accidentally deleted outside of CloudFormation during troubleshooting. The main application bucket depends on this logs bucket for access logging. You need to recreate the deleted resource while maintaining the existing infrastructure dependencies.

Story: Your team deploys a main S3 bucket with a dependent logs bucket for access logging via CloudFormation. During troubleshooting, an operator accidentally deletes the logs bucket through the AWS Console. The main bucket still exists but its logging configuration now references a non-existent bucket. You need to recreate the deleted logs bucket while maintaining the dependency relationship.

User journey: Create stack with main and logs buckets => Accidentally delete logs bucket => Create drift-aware change set with REVERT_DRIFT mode => Review change set showing LogBucket will be recreated => Execute change set to restore deleted resource

Scenario Flow

1. Create Stack

Deploy CloudFormation stack with main S3 bucket and dependent logs bucket.

Figure 14

CloudFormation stack “s3-deletion-drift-test” successfully deployed with both LogBucket and MainBucket resources in CREATE_COMPLETE status

2. Accidental Deletion (Console)

Manually delete the logs bucket through AWS Console (simulating accidental deletion during troubleshooting).

Figure 15

LogBucket accidentally deleted outside of CloudFormation during troubleshooting, creating drift – the MainBucket still exists but its logging configuration now references a non-existent bucket

3. Create Drift-Aware Change Set

Create change set using REVERT_DRIFT mode to recreate the deleted LogBucket.

Figure 16

Creating drift-aware change set with “Drift aware change set” option selected to detect and recreate the deleted resource by comparing template with live state

aws cloudformation create-change-set \
--stack-name s3-deletion-drift-test \
--change-set-name recreate-deleted-bucket \
--use-previous-template \
--deployment-mode REVERT_DRIFT \
--region us-east-1

4. Review Change Set – Resource Recreation

Examine change set to see LogBucket recreation while preserving MainBucket dependencies.

Figure 17

Change set preview showing LogBucket will be recreated to restore the deleted resource and MainBucket updated to maintain infrastructure dependencies

Key Insight: The drift-aware change set detects that:

  • Template expectation: Both LogBucket and MainBucket should exist
  • Live resource state: Only MainBucket exists, LogBucket is missing
  • Action: Recreate LogBucket with original configuration to restore logging functionality

This enables systematic recovery of accidentally deleted resources while maintaining infrastructure dependencies.

CloudFormation Templates

s3-drift-scenario.yaml:

Resources:
  LogBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: AES256
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true
      VersioningConfiguration:
        Status: Enabled
  
  MainBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: AES256
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true
      VersioningConfiguration:
        Status: Enabled
      LoggingConfiguration:
        DestinationBucketName: !Ref LogBucket

CLI Commands

  1. Create stack:
aws cloudformation create-stack --stack-name s3-deletion-drift-test --template-body file://s3-drift-scenario.yaml --region us-east-1
  1. Get LogBucket name:
aws cloudformation describe-stack-resources --stack-name s3-deletion-drift-test --logical-resource-id LogBucket --query 'StackResources[0].PhysicalResourceId' --output text --region us-east-1
  1. Create drift-aware change set:
aws cloudformation create-change-set --stack-name s3-deletion-drift-test --change-set-name recreate-deleted-bucket --template-body file://s3-drift-scenario.yaml --deployment-mode REVERT_DRIFT --region us-east-1
  1. Describe change set:
aws cloudformation describe-change-set --change-set-name recreate-deleted-bucket --stack-name s3-deletion-drift-test --region us-east-1

Best Practices

When working with drift-aware change sets, consider these best practices:

Always review three-way comparisons before executing change sets to understand the full impact

Use REVERT_DRIFT deployment mode when you want to bring resources back to template compliance

Document emergency changes made outside of CloudFormation to inform future template updates

Implement change management processes to minimize unauthorized drift

Regular drift detection helps identify configuration changes before they become problematic

Test drift-aware change sets in non-production environments first

Cleanup

Important: Execute these cleanup commands promptly after completing the scenarios to avoid incurring unnecessary AWS charges. Resources such as Lambda functions, S3 buckets (even if empty), and security groups may incur costs if left running. Ensure all stacks are successfully deleted by verifying the DELETE_COMPLETE status.

Commands to delete all test resources:

# Scenario 1: Lambda Memory Drift
aws cloudformation delete-stack --stack-name lambda-memory-drift-test --region us-east-1

# Scenario 2: Security Group Drift
aws cloudformation delete-stack --stack-name sg-revert-drift-test --region us-east-1

# Scenario 3: S3 Bucket Deletion Drift
aws cloudformation delete-stack --stack-name s3-deletion-drift-test --region us-east-1

# Verify all stacks are deleted
aws cloudformation list-stacks --stack-status-filter DELETE_COMPLETE --region us-east-1

Note: CloudFormation will automatically clean up all resources created by the stacks, including Lambda functions, security groups, and S3 buckets.

Conclusion

Drift-aware change sets enable you to mitigate the operational and security risks of configuration drift, allowing you to confidently automate and govern your infrastructure updates with CloudFormation. Through the scenarios described in this post, you have seen how you can leverage drift-aware change sets to prevent outages in production environments, maintain the integrity of your test environments, and manage the compliance posture of all environments. Remember to thoroughly review the infrastructure changes previewed by drift-aware change sets before executing deployments.

Available Now

Drift-aware change sets are available in AWS Regions where CloudFormation is available. Please refer to the AWS Region table to learn more.

Cross-account lakehouse governance with Amazon S3 Tables and SageMaker Catalog

Post Syndicated from Sneha Rao original https://aws.amazon.com/blogs/big-data/cross-account-lakehouse-governance-with-amazon-s3-tables-and-sagemaker-catalog/

Organizations increasingly face challenges when analyzing data stored across multiple AWS accounts and storage formats. Data teams often need to query both traditional Amazon Simple Storage Service (Amazon S3) objects and Apache Iceberg tables, leading to costly data duplication, potential inconsistencies, and complex permission management across accounts.

To address these challenges, you can combine Amazon S3 Tables, which provides native Apache Iceberg support within S3, with Amazon SageMaker Catalog for unified data governance. This solution supports secure cross-account data access without duplicating datasets or compromising security controls.

In this post, we walk you through a practical solution for secure, efficient cross-account data sharing and analysis. You’ll learn how to set up cross-account access to S3 Tables using federated catalogs in Amazon SageMaker, perform unified queries across accounts with Amazon Athena in Amazon SageMaker Unified Studio, and implement fine-grained access controls at the column level using AWS Lake Formation.

This post helps you establish proper governance and security controls for S3 Tables in a multi-account environment, enabling secure and efficient cross-account data access.

Solution overview

We walk you through implementing a three-account lakehouse governance architecture where you can securely share data. As shown in the following diagram, Account A serves as your data producer with S3 Tables, Account B acts as your central governance hub with SageMaker Catalog, and Account C represents your data consumers. We’ll demonstrate step-by-step how to configure cross-account access and implement governance controls so consumers can discover and query data from both S3 tables and traditional S3 buckets.

Prerequisite and Set up

In this post, we focus on how to do the cross account set up and how to onboard S3 Tables. All three accounts are in the same AWS Region. To implement this solution, you will need three individual accounts (A, B, C). The setup in the accounts should look like the following:

  • Account A (Producer): Create an Amazon S3 Table on the account.
  • Account B (Central governance and producer): This is another account where you have data in Amazon S3 buckets catalog via Glue Catalog. You would onboard these into domain portal.
  • Account C (Consumer account): Identify an account where you have consumers query data using Athena to follow along.

The following are the high-level implementation steps for this solution:

Step 1: Configure cross-account association for governance.
Step 2: Create three Project Profiles in Account B pointing to tables in Account A, B, and C.
Step 3: Create three Projects.
Step 4: Set up permissions for Projects in AWS Lake Formation.
Step 5: In Account B, create Datasource to connect S3 Table from Account A and Glue Catalog Tables from Account B.
Step 6: Publish and Subscribe to asset.
Step 7: Query S3 table (Account A) and S3 (Account B) data together in SQL editor (Account C).

Step 1

A. Configure cross-account association for governance

In this section, we associate Account A and C in the Governance account B.

  1. Open the SageMaker Unified Studio console in Account B.
  2. Navigate to Domains, select your domain, then choose the Account associations tab.
  3. Choose Request association and enter the Account IDs for Account A and Account C.
  4. Submit the association request and verify the accounts appear with “Requested” status.

B. Enable Blueprints for your domain in Accounts A, B, and C

The LakeHouseDatabase blueprint enables SageMaker Unified Studio to securely manage, query, and share data from S3, Redshift, and other sources using open standards—so in this step, you enable it in Accounts A, B, and C to support unified data access and collaboration.

  1. In Account A, in the SageMaker console, navigate to your domain and select the Blueprints tab.
  2. Select the LakeHouseDatabase blueprint and choose Enable.
  3. Keeping the Permissions and resources section at the default settings, choose Enable Blueprint.
  4. Back on the blueprints screen, select the Tooling blueprint and choose Enable.
  5. Keeping the Permissions and resources section at the default settings, configure the Networking section with the desired VPC and subnet configurations.
  6. Choose Enable Blueprint.
  7. Repeat Step1.B and enable the same blueprints in Account B to make S3 data publishable and Account C so consumers can query the data using Athena.

Step 2: Create Project Profiles in Account B

Use the documentation to create three project profiles in Account B using the ‘LakeHouseDatabase’ Blueprint, with each profile configured for Accounts A, B, and C respectively. For this post, we use the following naming convention:

  • datalake-project-profile-s3tables (for Account A)
  • datalake-project-profile (for Account B)
  • datalake-project-profile-consumer (for Account C)

Step 3: Create three Projects for accounts A, B, and C

  1. Using the documentation, create one Project in each account. For this post, we use the following naming convention:
    • ‘producer-s3tables’ – This is configured for Account A
    • ‘producer-s3’ – This is configured for Account B
    • ‘consumer’ – This is configured for Account C
  2. After creating the Project, locate and make note of the Project role ARN listed under Project details on the project overview page.

Step 4: Set up permissions for Projects in AWS Lake Formation

In Account A, onboard the S3 table in SageMaker Lakehouse and grant permissions to the project role:

  1. In the AWS Lake Formation console, choose Permissions, choose Data permissions, and then choose Grant.
  2. Choose Principals, select IAM users and roles, then select the role generated by the project producer-s3tables in Step 3.
  3. In LF-Tags or catalog resources, choose Named data catalog resources, select the S3 table catalog from the Catalogs list.
  4. In Catalog permissions, configure the Catalog permissions and grantable permissions. Choose Grant to apply the following permissions.

In Account A, we repeat these steps for grant permissions to the database:

  1. In the AWS Lake Formation console, choose Permissions, choose Data permissions, and then choose Grant.
  2. Choose Principals, select IAM users and roles, then select the role generated by the project producer-s3tables in Step 3.
  3. In LF-Tags or catalog resources, choose Named data catalog resources, choose both the S3 table catalog and database from their respective dropdown lists.
  4. Configure database permissions and grantable permissions. Choose Grant to apply the following permissions.

In Account A, repeat these steps for grant permissions to the table in the database:

  1. In the AWS Lake Formation console, choose Permissions, choose Data permissions, and then choose Grant.
  2. Choose Principals, select IAM users and roles, then select the role generated by the project producer-s3tables in Step 3.
  3. In LF-Tags or catalog resources, choose Named data catalog resources, choose both the S3 table catalog, database, and S3 table from their respective dropdown lists.
  4. Configure table permissions and grantable permissions. Choose Grant to apply the following permissions.

Repeat Step 4 in Accounts B to onboard S3 to SageMaker Lakehouse and grant the necessary permissions to the role created by your project for Account B.

Step 5: Create Datasource and onboard S3 Table from Account A and Glue Catalog Tables from Account B

To enable unified access and cross-account analytics with data lineage tracking, you’ll connect your SageMaker Unified Studio project to S3 tables from both accounts:

  1. Navigate to your project in SageMaker Unified Studio, select Data sources under the Project catalog section and choose Create data source.
  2. Enter a name, description, and select AWS Glue as the Data source type. Under Data selection, specify the S3 table catalog name.
  3. In this post, we will keep the Publishing setting and Metadata settings as the default configuration.
  4. Choose the run preference as Run on demand to manually initiate data source runs.
  5. Configure any optional connection settings, such as importing data lineage or setting up data quality options. Review your configuration and create the data source.
  6. Once created, run the data source to import the Glue assets into your project’s inventory.
  7. Add asset filter to restrict consumer access, On the Asset filters tab, choose Add asset filter.
  8. Select Column as the filter type, choose the columns for consumer access, and create the asset filter.
  9. Select the assets created and choose Publish assets to the SageMaker Unified Studio catalog to make them discoverable by other users.
  10. Use the documentation to add Glue catalog as data source for S3.

Step 6: Subscribe to the asset from Consumer account in Account C

In Account C, enable the consumer teams to discover, request, and subscribe to those assets for secure, governed data sharing and collaboration across projects.

  1. In SageMaker Unified Studio, select the consumer project.
  2. Use the Discover menu (top navigation) and go to Catalog.
  3. Browse or search for the published asset (S3 tables from Account A).
  4. Select the desired asset (S3 tables from Account A) and choose Subscribe.
  5. In the subscription pop-up:
    1. Choose the target project for asset access.
    2. Provide a short justification for the access request.
  6. Submit the subscription request.
  7. Repeat step 6 to enable the consumer (Account C) teams to discover assets in Account B.

Approve or reject a subscription request

  1. In Account A, open the SageMaker Unified Studio portal.
  2. Under Project catalog, Subscription requests, Incoming requests tab locate and view the subscription request.
  3. Review the requester and justification.
  4. Choose the option to approve with row and column filters. For this post, we use the filter that we created earlier.
  5. Repeat step 6 to enable the consumer (Account C) teams to discover assets in Account B.

Step 7: Analyze S3 table and S3 data together in query editor

Account C (consumer) now has full access to the customer data in S3 from Account B, and the daily_sales_by_customer data in S3 tables from Account A with restricted columns. Both datasets contain a common column Customer_id.

To generate combined insights, assets from Account A and Account B can be queried and joined on Customer_id.

  1. In SageMaker Unified Studio (consumer project in Account C), go to the Build section and select Query Editor.
  2. Run the following SQL query to join the assets from Account B and Account A on the common column Customer_id, enabling unified cross-account analytics.
    SELECT
        c.c_last_name,
        c.c_first_name,
        d.*
    FROM "awsdatacatalog"."glue_db_cqmfkub9co3rqh"."customer" c
    JOIN "awsdatacatalog"."glue_db_cqmfkub9co3rqh"."daily_sales_by_customer" d
        ON c.c_customer_id = d.customer_id
    LIMIT 10;

This approach allows combining filtered, governed data from multiple accounts into a single query for comprehensive insights.

Clean up

To avoid ongoing charges, clean up the resources created during this walkthrough. Complete these steps in the specified order to facilitate proper resource deletion. You might need to add respective delete permissions for databases, table buckets, and tables if your IAM user or role doesn’t already have them.

  1. Delete any created IAM roles or policies.
  2. Delete all the projects you created in the SageMaker Unified Studio domain.
  3. Delete the SageMaker Unified Studio domain you created.

Conclusion

In this post, we explored how Amazon SageMaker Catalog integrates with S3 Tables to provide comprehensive data governance in cross-account environments. We demonstrated how data publishers can onboard S3 Tables to SageMaker Lakehouse while data consumers can efficiently search, request access, and leverage approved datasets for analytics and AI development.

The integration between SageMaker Catalog, S3 Tables, and AWS AWS Lake Formation creates a unified governance framework that eliminates data silos while maintaining robust security controls. Through automated subscription workflows and fine-grained access permissions, organizations can implement self-service data access without compromising compliance or data quality.


About the authors

Sneha Rao

Sneha Rao

Sneha is a Solutions Architect at AWS who helps strategic enterprise customers design architectures on the cloud. She’s passionate about creating inclusive learning experiences that make complex technologies approachable and impactful. Outside of work, Sneha enjoys painting, exploring local coffee shops, and going on outdoor adventures with her Cavapoo, Taz.

Deepmala Agarwal

Deepmala Agarwal

Deepmala is passionate about helping customers build out scalable, distributed, and data-driven solutions on AWS. When not at work, Deepmala likes spending time with family, walking, listening to music, watching movies, and cooking!

Viral Thakkar

Viral Thakkar

Viral is a Software Engineer at AWS, working on Amazon DataZone with a primary focus on distributed systems and data governance with deep expertise in building large-scale data analytics and pipelining solutions. He is passionate about tackling complex distributed systems challenges while also creating tools and automated scripts that simplify day-to-day workflows and improve productivity.

Santhosh Padmanabhan

Santhosh Padmanabhan

Santhosh is a Software Development Manager at AWS, leading the Amazon DataZone engineering team. His team designs, builds, and operates services specializing in data, machine learning, and AI governance. With deep expertise in building distributed data systems at scale, Santhosh plays a key role in advancing AWS’s data governance capabilities.

Abbas Makhdum

Abbas Makhdum

Abbas is Head of Product Marketing for Amazon SageMaker Catalog at AWS, where he leads go-to-market strategy and launches for data and AI governance solutions. With deep expertise across data, AI, and analytics, Abbas has also authored a book on data governance with O’Reilly. He is passionate about helping organizations unlock business value by making data and AI more accessible, transparent, and governed.