Achieve 2x faster data lake query performance with Apache Iceberg on Amazon Redshift

2025-11-27 Kalaiselvi Kamaraj

Post Syndicated from Kalaiselvi Kamaraj original https://aws.amazon.com/blogs/big-data/achieve-2x-faster-data-lake-query-performance-with-apache-iceberg-on-amazon-redshift/

With the growing adoption of open table formats like Apache Iceberg, Amazon Redshift continues to advance its capabilities for open format data lakes. In 2025, Amazon Redshift delivered several optimizations that more than doubled query performance for Iceberg workloads on Amazon Redshift Serverless, improving both performance and cost-effectiveness for your data lake workloads.

In this post, we describe some of the optimizations that led to these performance gains. Data lakes have become a foundation of modern analytics, helping organizations store vast amounts of structured and semi-structured data in cost-effective data formats like Apache Parquet while maintaining flexibility through open table formats. This architecture creates unique performance optimization opportunities across the entire query processing pipeline.

Performance enhancements

Our latest enhancements span multiple areas of the Amazon Redshift SQL query processing engine, including vectorized scanners that accelerate execution, optimal query plans powered by just-in-time (JIT) runtime statistics, distributed Bloom filters, and new decorrelation rules.

The following chart summarizes the performance improvements achieved so far in 2025, as measured by industry standard 10 TB TPC-DS and TPC-H benchmarks run on Iceberg tables on an 88 RPU Redshift Serverless endpoint.

Find the best performance for your workloads

The performance results presented in this post are based on benchmarks derived from the industry-standard TPC-DS and TPC-H benchmarks, and have the following characteristics:

The schema and data of Iceberg tables are used unmodified from TPC-DS. Tables are partitioned to reflect real-world data organization patterns.
The queries are generated using the official TPC-DS and TPC-H kits with query parameters generated using the default random seed of the kits.
The TPC-DS test includes all 99 TPC-DS SELECT queries. It doesn’t include maintenance and throughput steps. The TPC-H test includes all 22 TPC-H SELECT queries.
Benchmarks are run out of the box: no manual tuning or stats collection is done for the workloads.

In the following sections, we discuss key performance improvements delivered in 2025.

Faster data lake scans

To improve data lake read performance, the Amazon Redshift team built a completely new scan layer designed from the ground up for data lakes. This new scan layer includes a purpose-built I/O subsystem that uses smart prefetch to reduce data latency. The scan layer is also optimized for processing Apache Parquet files, the most commonly used file format for Iceberg, through fast vectorized scans.

This new scan layer also includes sophisticated data pruning mechanisms that operate at both partition and file levels, dramatically reducing the volume of data that needs to be scanned. This pruning capability works in harmony with the smart prefetch system, creating a coordinated approach that maximizes efficiency throughout the entire data retrieval process.
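The following is a minimal sketch of partition-level and file-level pruning, not the Redshift implementation. The DataFile class, its field names, and the sample files are hypothetical; the sketch only illustrates how per-file column bounds let a scan skip files that can't contain matching rows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Hypothetical per-file metadata, similar in spirit to table metadata."""
    path: str
    partition: str     # partition value, for example a sale date
    min_price: float   # lower bound of the price column in this file
    max_price: float   # upper bound of the price column in this file


def prune(files, partition, price_at_least):
    """Return only the files that could contain rows matching the filter."""
    kept = []
    for f in files:
        if f.partition != partition:       # partition-level pruning
            continue
        if f.max_price < price_at_least:   # file-level pruning on column bounds
            continue
        kept.append(f)
    return kept


FILES = [
    DataFile("a.parquet", "2025-01-01", 1.0, 9.0),
    DataFile("b.parquet", "2025-01-01", 10.0, 99.0),
    DataFile("c.parquet", "2025-01-02", 50.0, 80.0),
]

# Equivalent filter: partition = '2025-01-01' AND price >= 20
survivors = prune(FILES, partition="2025-01-01", price_at_least=20.0)
print([f.path for f in survivors])
```

In this sketch only one of the three files has to be read; the other two are skipped using metadata alone.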

JIT ANALYZE for Iceberg tables

Unlike traditional data warehouses, data lakes often lack comprehensive table- and column-level statistics about the underlying data, making it challenging for the planner and optimizer in the query engine to choose the best execution plan up front. Suboptimal plans can lead to slower and less predictable performance.

JIT ANALYZE is a new Amazon Redshift feature that automatically collects and uses statistics for Iceberg tables during query execution—minimizing manual statistics collection while giving the planner and optimizer in the query engine the information it needs to generate optimal query plans. The system uses intelligent heuristics to identify queries that will benefit from statistics, performs fast file-level sampling using Iceberg metadata, and extrapolates population statistics using advanced techniques.

JIT ANALYZE delivers out-of-the-box performance nearly equal to that of queries with pre-calculated statistics, while providing the foundation for many other performance optimizations. Some TPC-DS queries ran 50 times faster with these statistics.
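To make the idea of sampling and extrapolation concrete, the following sketch estimates the selectivity of a filter from a small sample of files and scales it to the whole table. This is an illustration only: the function name, the sample size, and the synthetic data are assumptions, and the actual heuristics and estimation techniques used by JIT ANALYZE are not described in this post.

```python
import random


def estimate_selectivity(files, predicate, sample_size, seed=0):
    """Estimate the fraction of rows matching predicate from sampled files."""
    rng = random.Random(seed)              # fixed seed keeps the sketch repeatable
    sampled = rng.sample(files, sample_size)
    rows_seen = sum(len(rows) for rows in sampled)
    rows_matched = sum(1 for rows in sampled for v in rows if predicate(v))
    return rows_matched / rows_seen


# Synthetic table: 200 files of 100 integer values each (hypothetical data).
files = [list(range(i % 10, i % 10 + 100)) for i in range(200)]
total_rows = sum(len(rows) for rows in files)  # in practice, known from metadata

# Read only 20 of the 200 files, then extrapolate to the full table.
selectivity = estimate_selectivity(files, lambda v: v > 89, sample_size=20)
estimated_matches = round(selectivity * total_rows)
print(selectivity, estimated_matches)
```

A planner can use an estimate like this to decide, for example, which side of a join is likely to be smaller, without scanning the whole table first.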

Query optimizations

For correlated subqueries, such as those that contain EXISTS or IN clauses, Amazon Redshift uses decorrelation rules to rewrite the queries. In many cases, these decorrelation rules were not producing optimal plans, which slowed query execution. To address this, we introduced a new internal join type, SEMI JOIN, and a new decorrelation rule based on this join type. This decorrelation rule helps produce better plans, thereby improving execution performance. For instance, one of the TPC-DS queries that contains an EXISTS clause ran 7 times faster with this optimization.
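The following sketch contrasts a correlated EXISTS evaluated row by row with the same query decorrelated into a semi join. It is a simplified illustration with hypothetical table and column names, not the Redshift planner's rewrite.

```python
def exists_correlated(customers, orders):
    """Correlated form: rescan orders once for every customer row."""
    return [c for c in customers
            if any(o["customer_id"] == c["id"] for o in orders)]


def exists_semi_join(customers, orders):
    """Decorrelated form: build a hash set once, then probe it per customer."""
    order_keys = {o["customer_id"] for o in orders}
    return [c for c in customers if c["id"] in order_keys]


customers = [{"id": 1}, {"id": 2}, {"id": 3}]
orders = [{"customer_id": 1}, {"customer_id": 1}, {"customer_id": 3}]

# SELECT * FROM customers c
# WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
result = exists_semi_join(customers, orders)
print(result)
```

A semi join returns each outer row at most once, even when several inner rows match, so customer 1 appears once despite having two orders.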

We also introduced a distributed Bloom filter optimization for data lake workloads. With this optimization, every compute node creates Bloom filters locally and then distributes them to every other node. Distributing Bloom filters can significantly reduce the amount of data that needs to be sent over the network for a join by filtering out tuples earlier. This provides good performance gains for large, complex data lake queries that process and join large amounts of data.
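The following is a minimal sketch of the distributed Bloom filter idea, assuming a simple bitmask filter with hypothetical sizes and node names. Each node builds a filter over its local join keys, the filters are combined, and probe-side tuples are filtered before they would be sent over the network. It is not the Redshift implementation.

```python
import hashlib

NUM_BITS = 1024   # hypothetical filter size
NUM_HASHES = 3    # hypothetical number of hash functions


def bit_positions(key):
    """Yield the bit positions that a key maps to."""
    for seed in range(NUM_HASHES):
        digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
        yield int.from_bytes(digest[:8], "big") % NUM_BITS


def build_filter(keys):
    """Build a Bloom filter (stored as an int bitmask) over local join keys."""
    bits = 0
    for key in keys:
        for pos in bit_positions(key):
            bits |= 1 << pos
    return bits


def might_contain(bits, key):
    """False means definitely absent; True means possibly present."""
    return all((bits >> pos) & 1 for pos in bit_positions(key))


# Each compute node builds a filter over its local slice of the build side.
build_side_by_node = {"node0": [1, 2, 3], "node1": [4, 5]}
local_filters = [build_filter(keys) for keys in build_side_by_node.values()]

# After the filters are exchanged, every node holds the union of all of them.
combined = 0
for local in local_filters:
    combined |= local

# Each node drops probe-side tuples that cannot join before sending them on.
probe_keys = list(range(1, 101))
survivors = [k for k in probe_keys if might_contain(combined, k)]
print(len(probe_keys), "->", len(survivors))
```

A Bloom filter never drops a key that is present on the build side, but it can let a few non-matching keys through, so the join itself still checks every surviving tuple.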

Conclusion

These performance improvements for Iceberg workloads represent a major leap forward in Redshift data lake capabilities. By focusing on out-of-the-box performance, we’ve made it straightforward to achieve exceptional query performance without complex tuning or optimization.

These improvements demonstrate the power of deep technical innovation combined with practical customer focus. JIT ANALYZE reduces the operational burden of statistics management while providing optimal query planning information. The new Redshift data lake query engine on Redshift Serverless was rewritten from the ground up for best-in-class scan performance, and lays the groundwork for more advanced performance optimizations. Semi-join optimizations tackle some of the most challenging query patterns in analytical workloads. You can run complex analytical workloads on your Iceberg data and get fast, predictable query performance.

Amazon Redshift is committed to being the best analytics engine for data lake workloads, and these performance optimizations represent our continued investment in that goal.

To learn more about Amazon Redshift and its performance capabilities, visit the Amazon Redshift product page. To get started with Redshift, you can try Amazon Redshift Serverless and start querying data in minutes without having to set up and manage data warehouse infrastructure. For more details on performance best practices, see the Amazon Redshift Database Developer Guide. To stay up-to-date with the latest developments in Amazon Redshift, subscribe to the What’s New in Amazon Redshift RSS feed.

Special thanks to this post’s contributors: Martin Milenkoski, Gerard Louw, Konrad Werblinski, Mengchu Cai, Mehmet Bulut, Mohammed Alkateb, and Sanket Hase

Introducing catalog federation for Apache Iceberg tables in the AWS Glue Data Catalog

2025-11-27 Debika D

Post Syndicated from Debika D original https://aws.amazon.com/blogs/big-data/introducing-catalog-federation-for-apache-iceberg-tables-in-the-aws-glue-data-catalog/

Apache Iceberg has become the standard choice of open table format for organizations seeking robust and reliable analytics at scale. However, enterprises increasingly find themselves navigating complex multi-vendor landscapes with disparate catalog systems. This fragmentation drives significant operational complexity, particularly around access control and governance. Customers using AWS analytics services such as Amazon Redshift, Amazon EMR, Amazon Athena, Amazon SageMaker, and AWS Glue to analyze Iceberg tables in the AWS Glue Data Catalog want to get the same price-performance for workloads in remote catalogs. Simply migrating or replacing these remote catalogs isn’t practical, so teams end up implementing and maintaining synchronization processes that continuously replicate metadata across systems, which creates operational overhead, escalates costs, and risks data inconsistencies.

AWS Glue now supports catalog federation for remote Iceberg tables in the Data Catalog. With catalog federation, you can query remote Iceberg tables, stored in Amazon Simple Storage Service (Amazon S3) and cataloged in remote Iceberg catalogs, using AWS analytics engines and without moving or duplicating tables. After a remote catalog is integrated, AWS Glue always fetches the latest metadata in the background, so you always have access to the Iceberg metadata through your preferred AWS analytics services. This capability supports both coarse-grained access control and fine-grained permissions through AWS Lake Formation, giving you flexibility over how and when remote Iceberg tables are shared with data consumers. With integration for Snowflake Polaris Catalog, Databricks Unity Catalog, and other custom catalogs supporting Iceberg REST specifications, you can federate to remote catalogs, discover databases and tables, configure access permissions, and begin querying remote Iceberg data.

In this post, we discuss how to get started with catalog federation for Iceberg tables in the Data Catalog.

Solution overview

Catalog federation uses the Data Catalog to communicate with remote catalog systems to discover catalog objects and Lake Formation to authorize access to their data in Amazon S3. When you query a remote Iceberg table, the Data Catalog discovers the latest table information in the remote catalog at query runtime, getting the table’s S3 location, current schema, and partition information. Your analytics engine (Athena, Amazon EMR, or Amazon Redshift) then uses this information to access Iceberg data files directly from Amazon S3. Lake Formation manages access to the table by vending scoped credentials to the table data stored in Amazon S3, allowing the engines to apply fine-grained permissions to the federated table. This approach avoids metadata and data duplication while providing real-time access to remote Iceberg tables through your preferred AWS analytics engines.

The Data Catalog facilitates connectivity to remote catalog systems that support Apache Iceberg by establishing an AWS Glue connection with the remote catalog endpoint. You can connect the Data Catalog to remote Iceberg REST catalogs using OAuth2 or custom authentication mechanisms using an access token. During integration, administrators configure a principal (service account or identity) with the appropriate permissions to access resources in the remote catalog. The AWS Glue connection object uses this configured principal’s credentials to authenticate and access metadata in the remote catalog server. You can also connect the Data Catalog to remote catalogs that use a private link or proxy for isolating and restricting network access. After it’s connected, this integration uses the standardized Iceberg REST API specification to retrieve the most current table metadata information from these remote catalogs. AWS Glue onboards these remote catalogs as federated catalogs within its own catalog infrastructure, enabling unified metadata access across multiple catalog systems.

Lake Formation serves as the centralized authorization layer for managing user access to federated catalog resources. When users attempt to access tables and databases in federated catalogs, Lake Formation evaluates their permissions and enforces fine-grained access control policies.

Beyond metadata authorization, catalog federation also manages secure access to the underlying data files. It accomplishes this through credential vending mechanisms that issue temporary, scope-limited credentials. AWS Glue federated catalogs work with your preferred AWS analytics engines and query services, enabling consistent metadata access and unified data governance across your analytics workloads.

In the following sections, we walk through the steps to integrate the Data Catalog with your remote catalog server:

1. Set up an integration principal in the remote catalog and grant this principal the required access to catalog resources. Enable OAuth-based authentication for the integration principal.
2. Create a federated catalog in the Data Catalog using an AWS Glue connection. Create an AWS Glue connection that uses the credentials of the integration principal (from step 1) to connect to the Iceberg REST endpoint of the remote catalog. Configure an AWS Identity and Access Management (IAM) role with permission to the S3 locations where the remote table data resides. In a cross-account scenario, make sure the bucket policy grants the required access to this IAM role. This federated catalog mirrors the catalog object in your remote catalog server.
3. Discover Iceberg tables in federated catalogs using Lake Formation or AWS Glue APIs. Query Iceberg tables using AWS analytics engines. During query operations, Lake Formation manages fine-grained permissions on federated resources and vends credentials to the underlying data for end users.

Prerequisites

Before you begin, verify you have the following setup in AWS:

An AWS account.
The AWS Command Line Interface (AWS CLI) version 2.31.38 or later installed and configured.
An IAM admin role or user with appropriate permissions to the following services:
- IAM
- AWS Glue Data Catalog
- Amazon S3
- AWS Lake Formation
- AWS Secrets Manager
- Amazon Athena
A data lake administrator. For instructions, see Create a data lake administrator.

Set up authentication credentials in remote Iceberg catalog

Catalog federation to a remote Iceberg catalog uses the OAuth2 credentials of the principal configured with metadata access. This authentication mechanism allows the AWS Glue Data Catalog to access the metadata of objects (such as databases and tables) within the remote catalog, based on the privileges associated with the principal. For federation to work, you must grant the principal the permissions needed to read the metadata of these objects. Generate the CLIENT_ID and CLIENT_SECRET to enable OAuth-based authentication for the integration principal.

Create AWS Glue catalog federation using connection to remote Iceberg catalog

Create a federated catalog in the Data Catalog that mirrors a catalog object in the remote Iceberg catalog server. AWS Glue uses this federated catalog to federate metadata queries such as ListDatabases, ListTables, and GetTable to the remote catalog. As a data lake administrator, you can create a federated catalog in the Data Catalog using an AWS Glue connection object that is registered with AWS Lake Formation.

Configure data source connection for AWS Glue connection

Catalog federation uses an AWS Glue connection for metadata access. You provide the authentication and Iceberg REST API endpoint configurations for the remote catalog. The AWS Glue connection supports OAuth2 or custom as the authentication method.

Connect using OAuth2 authentication

For the OAuth2 authentication method, you can provide the client secret either directly as input or by storing it in AWS Secrets Manager, where the AWS Glue connection object reads it during authentication. AWS Glue internally manages the token refresh upon expiration. To store the client secret in Secrets Manager, complete the following steps:

On the Secrets Manager console, choose Secrets in the navigation pane.
Choose Store a new secret.
Choose Other type of secret, provide the key name as USER_MANAGED_CLIENT_APPLICATION_CLIENT_SECRET, and enter the client secret value.
Choose Next and provide a name for the secret.
Choose Next and choose Store to save the secret.

Connect using custom authentication

For custom authentication, use Secrets Manager to store and retrieve the access token. This access token is created, refreshed, and managed by your own application or system, which gives you control over the authentication process. To store the access token in Secrets Manager, complete the following steps:

On the Secrets Manager console, choose Secrets in the navigation pane.
Choose Store a new secret.
Choose Other type of secret, provide the key name as BEARER_TOKEN, and enter the access token of the integration principal as the value.
Choose Next and provide a name for the secret.
Choose Next and choose Store to save the secret.
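If you prefer to script the console steps for either authentication method, the following sketch shows one possible approach. The key names come from the steps above; the helper function names and the secret name are hypothetical. The sketch assumes the secret is stored as a JSON key-value pair, which is how the console stores the Other type of secret option, and that boto3 is installed and configured if you call store_secret.

```python
import json


def oauth2_secret_string(client_secret):
    """Build the key-value payload for the OAuth2 client secret."""
    return json.dumps(
        {"USER_MANAGED_CLIENT_APPLICATION_CLIENT_SECRET": client_secret}
    )


def bearer_token_secret_string(access_token):
    """Build the key-value payload for custom (access token) authentication."""
    return json.dumps({"BEARER_TOKEN": access_token})


def store_secret(name, secret_string):
    """Store the payload in Secrets Manager (requires AWS credentials)."""
    import boto3  # imported lazily so the helpers above run without AWS access

    client = boto3.client("secretsmanager")
    return client.create_secret(Name=name, SecretString=secret_string)


payload = oauth2_secret_string("example-client-secret")
print(payload)
# To create the secret, call store_secret with a name of your choice, for example:
# store_secret("my-remote-catalog-secret", payload)
```

Avoid hard-coding real secret values in scripts; read them from a secure source at runtime instead.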

Register AWS Glue connection with Lake Formation

Create an IAM role that Lake Formation can use to vend credentials, and attach permissions for the S3 bucket prefixes where the Iceberg tables are stored. Optionally, if you’re using Secrets Manager to store the client secret or are using a network configuration, you can add permissions for those services to this role. For instructions, refer to Catalog federation to remote Iceberg catalogs.

Complete the following steps to register the connection:

On the Lake Formation console, choose Catalogs in the navigation pane.
Choose Create catalog and select the data source.
Provide the federated catalog details:
1. Name of the federated catalog.
2. Catalog name in the remote catalog server. This must exactly match the catalog name in the remote catalog.
Provide AWS Glue connection details. To reuse an existing connection, choose Select existing connection and choose the connection to reuse. For a first-time setup, choose Input new connection configuration and provide the following information:
1. Provide the AWS Glue connection name.
2. Provide the remote catalog Iceberg REST API endpoint.
3. Specify the catalog object casing type. The connection can support uppercase objects through the object hierarchy or lowercase objects.
4. Configure authentication parameters:
  1. For OAuth2: Provide the client ID, the client secret (entered directly or by choosing the secret where it is stored), the token authorization URL, and the scope mapped to the credential.
  2. For custom: Provide the secret managed by Secrets Manager where the access token is stored.
  3. Network configuration: If you have a network and/or proxy setup, you can provide this information. Otherwise, leave this section as default.
Register the connection with Lake Formation using the IAM role with access to the bucket where the remote table metadata and data are stored.
Verify the connection by choosing Run test.
After the test is successful, create the catalog.

You can now discover remote objects under the federated catalog. You can onboard other remote catalogs by reusing the existing connection configured to the same external catalog instance.

Query the federated catalog objects using AWS analytical engines

As the data lake administrator, you can now manage access control on databases and tables in a federated catalog using AWS Lake Formation. You can also use tag-based access control to scale your permission model by tagging the resource based on the access control mechanism.

After permissions are granted, an IAM principal or an IAM user can access the federated tables using AWS analytical services including Athena, Amazon Redshift, Amazon EMR, and Amazon SageMaker. Query the federated Iceberg table using Athena as shown in the following example.

Clean up

To avoid incurring ongoing charges, complete the following steps to clean up the resources created during this walkthrough:

Delete the federated catalog in the Data Catalog:

aws glue delete-catalog \
    --catalog-id <your-federated-catalog-name>

Deregister the AWS Glue connection from Lake Formation:

aws lakeformation deregister-resource \
    --resource-arn <your-glue-connector-arn>

Revoke Lake Formation permissions (if any were granted):

# List existing permissions first
aws lakeformation list-permissions \
    --catalog-id <your-account-id> \
    --resource '{
        "Catalog": {}
    }'

# Revoke permissions as needed
aws lakeformation revoke-permissions \
    --principal '{
        "DataLakePrincipalIdentifier": "<principal-arn>"
    }' \
    --resource '{
        "Database": {
            "CatalogId": "<catalog-id>",
            "Name": "<database-name>"
        }
    }' \
    --permissions "SELECT" "DESCRIBE"

Delete the AWS Glue connection:

aws glue delete-connection \
    --connection-name <your-glue-connection-name>

Delete IAM roles and policies associated with Lake Formation and the AWS Glue connection:

# Detach policies from the role
aws iam detach-role-policy \
    --role-name <your-lakeformation-role-name> \
    --policy-arn <your-lakeformation-policy-arn>

# Delete the custom policy
aws iam delete-policy \
    --policy-arn <your-lakeformation-policy-arn>

# Delete the role
aws iam delete-role \
    --role-name <your-lakeformation-role-name>

# Detach policies from the role
aws iam detach-role-policy \
    --role-name <your-glue-connection-role-name> \
    --policy-arn <your-glue-connection-policy-arn>

# Delete the custom policy
aws iam delete-policy \
    --policy-arn <your-glue-connection-policy-arn>

# Delete the role
aws iam delete-role \
    --role-name <your-glue-connection-role-name>

Delete the Secrets Manager secret:

# Schedule secret for deletion (7-30 days)
aws secretsmanager delete-secret \
    --secret-id <your-remote-catalog-secret>

This teardown doesn’t affect the metadata in the remote catalog server or the data in S3 buckets. It only affects the federation configurations in the Data Catalog and Lake Formation. Any corresponding service principals or configurations in the remote catalog server must be addressed separately.

Make sure you follow the teardown steps in the specified order to avoid dependency conflicts. For example, an AWS Glue connection object can’t be deleted if an AWS Glue catalog object is associated with it.

Additionally, make sure you have the necessary permissions to delete these resources.

Conclusion

In this post, we explored how catalog federation addresses the growing challenge of managing Iceberg tables across multi-vendor catalog environments. We walked through the architecture, demonstrating how the Data Catalog communicates with remote catalog systems, including Snowflake Polaris Catalog, Databricks Unity Catalog, and custom Iceberg REST-compliant catalogs, with centralized authorization and credential vending for secure data access. We covered the setup process, from configuring authentication principals and creating federated catalogs using AWS Glue connections to implementing fine-grained access controls and querying remote Iceberg tables directly from AWS analytics engines.

Catalog federation offers several advantages:

Query your Iceberg data where it lives while maintaining the security, governance, and price-performance benefits of AWS analytics services
Remove the operational overhead and cost of maintaining synchronization processes
Avoid data duplication and inconsistencies
Get real-time access to up-to-date table schemas without migrating or replacing existing catalogs

To learn more, refer to Catalog federation to remote Iceberg catalogs.

Accelerate data lake operations with Apache Iceberg V3 deletion vectors and row lineage

2025-11-27 Ron Ortloff

Post Syndicated from Ron Ortloff original https://aws.amazon.com/blogs/big-data/accelerate-data-lake-operations-with-apache-iceberg-v3-deletion-vectors-and-row-lineage/

Organizations building petabyte-scale data lakes face increasing challenges as their data grows. Batch updates and compliance deletes create a proliferation of positional delete files, slowing downstream data pipelines and driving up storage costs. Tracking data changes for audit trails and incremental processing requires custom, engine-specific implementations that add complexity and maintenance burden. As data volumes scale, these challenges compound, leaving data teams juggling custom solutions and increasing operational costs just to maintain data freshness and compliance.

Apache Iceberg V3 addresses these challenges with two new capabilities: deletion vectors and row lineage. AWS now delivers these capabilities across Apache Spark on Amazon EMR 7.12, AWS Glue, Amazon SageMaker notebooks, Amazon S3 Tables, and the AWS Glue Data Catalog, giving you a complete, integrated V3 experience without custom implementation. This means faster writes, lower storage costs, comprehensive audit trails, and efficient incremental processing, all working seamlessly across your entire data lake architecture.

In this post, we walk you through the new capabilities in Iceberg V3, explain how deletion vectors and row lineage address these challenges, explore real-world use cases across industries, and provide practical guidance on implementing Iceberg V3 features across AWS analytics, catalog, and storage services.

What’s new in Iceberg V3

Iceberg V3 introduces new capabilities and data types. Two key capabilities that address the challenges discussed earlier are deletion vectors and row lineage.

Deletion vectors replace positional delete files with an efficient binary format stored as Puffin files. Instead of creating a separate delete file for each delete operation, Iceberg V3 consolidates the delete references into a single deletion vector per data file. During query execution, engines efficiently filter out deleted rows using these compact vectors, maintaining query performance while removing the need to merge multiple delete files.

This avoids write amplification from random batch updates and GDPR compliance deletes, significantly reducing the overhead of maintaining fresh data. High-frequency update workloads can see immediate improvements in write performance and reduced storage costs from fewer small delete files. Additionally, having fewer small delete files reduces table maintenance costs for compaction operations.
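The following sketch illustrates the idea of one deletion vector per data file. It uses a plain Python integer as the bitmap, which is a stand-in for the compact binary format that Iceberg stores in Puffin files; the class and method names are hypothetical.

```python
class DeletionVectors:
    """Toy model: one bitmap of deleted row positions per data file."""

    def __init__(self):
        self._vectors = {}  # data file path -> int used as a bitmap

    def delete(self, data_file, position):
        """Mark a row position as deleted; reuses the file's existing vector."""
        self._vectors[data_file] = self._vectors.get(data_file, 0) | (1 << position)

    def is_deleted(self, data_file, position):
        return bool((self._vectors.get(data_file, 0) >> position) & 1)

    def scan(self, data_file, rows):
        """Return the rows of a data file with deleted positions filtered out."""
        return [row for pos, row in enumerate(rows)
                if not self.is_deleted(data_file, pos)]

    def vector_count(self):
        return len(self._vectors)


vectors = DeletionVectors()
rows = ["a", "b", "c", "d"]

# Two separate delete operations against the same data file.
vectors.delete("data-001.parquet", 1)
vectors.delete("data-001.parquet", 3)

live_rows = vectors.scan("data-001.parquet", rows)
print(live_rows, vectors.vector_count())
```

Both deletes land in the same vector, so the number of delete artifacts grows with the number of data files touched rather than with the number of delete operations.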

Row lineage enables precise change tracking at the row level with full auditability. Row lineage adds metadata fields to each data file that track when rows were created and last modified. The _row_id field uniquely identifies each row, and the _last_updated_sequence_number field records the sequence number of the snapshot in which the row was last modified. These fields enable efficient change tracking queries without scanning entire tables, and engines maintain them automatically as defined by the Iceberg specification, without requiring custom code.

Before row lineage, change tracking in Iceberg provided only the net changes between snapshots, making it difficult to track individual record modifications. Row lineage metadata fields can now be queried to return all incremental changes, giving you full fidelity for auditing data modifications and regulatory compliance. For data transformations, your downstream systems can process changes incrementally, speeding up data pipelines and reducing compute costs for change data capture (CDC) workflows. Row lineage is engine agnostic, interoperable, and built into the Iceberg V3 specification, alleviating the need for custom, engine-specific change tracking implementations.
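The following toy model shows how the two row lineage fields behave across inserts, updates, and deletes, and how a downstream consumer can ask for everything that changed after a given sequence number. The LineageTable class is hypothetical and greatly simplified; only the _row_id and _last_updated_sequence_number field names come from Iceberg V3.

```python
class LineageTable:
    """Toy table that maintains row lineage fields on every commit."""

    def __init__(self):
        self.rows = {}        # _row_id -> row dict
        self.sequence = 0     # sequence number of the latest commit
        self.next_row_id = 0

    def insert(self, records):
        self.sequence += 1
        for record in records:
            row = dict(record,
                       _row_id=self.next_row_id,
                       _last_updated_sequence_number=self.sequence)
            self.rows[self.next_row_id] = row
            self.next_row_id += 1

    def update(self, customer_id, **changes):
        self.sequence += 1
        for row in self.rows.values():
            if row["customer_id"] == customer_id:
                row.update(changes)  # _row_id stays the same
                row["_last_updated_sequence_number"] = self.sequence

    def delete(self, customer_id):
        self.sequence += 1
        self.rows = {rid: row for rid, row in self.rows.items()
                     if row["customer_id"] != customer_id}

    def changed_since(self, sequence_number):
        """Rows created or modified after the given sequence number."""
        changed = [row for row in self.rows.values()
                   if row["_last_updated_sequence_number"] > sequence_number]
        return sorted(changed, key=lambda row: (
            row["_last_updated_sequence_number"], row["_row_id"]))


table = LineageTable()
table.insert([{"customer_id": i, "name": f"customer-{i}"} for i in range(1, 6)])
table.delete(customer_id=5)
table.update(customer_id=2, name="Mansa Akua")

all_changes = table.changed_since(0)
incremental = table.changed_since(1)
print(len(all_changes), incremental)
```

A downstream pipeline that last processed sequence number 1 only needs to pick up the single updated row, instead of comparing full snapshots.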

Real-world use cases

The new Iceberg V3 capabilities address critical challenges across multiple industries:

Marketing and advertising services organizations – You can now efficiently handle GDPR right-to-be-forgotten requests and regulatory compliance deletes without the write amplification that previously degraded pipeline performance. Row lineage provides complete audit trails for data modifications, meeting strict regulatory requirements for data governance.
Ecommerce platforms processing millions of product updates and inventory changes daily – You can maintain data freshness while reducing storage costs. Deletion vectors enable faster upsert operations, helping teams meet aggressive SLA requirements during peak shopping periods.
Healthcare and life sciences companies – You can track patient data modifications with precision for compliance purposes while efficiently processing large-scale genomic datasets. Row lineage provides the detailed change history required for clinical trial audits and regulatory submissions.
Media and entertainment providers managing large catalogs of user viewing data – You can efficiently process incremental changes for recommendation engines. Row lineage enables downstream analytics systems to process only changed records, reducing compute costs in incremental processing scenarios.

Get started with Iceberg V3

To take advantage of deletion vectors for optimized writes and row lineage for built-in change tracking in Iceberg V3, set the table property format-version = 3 during table creation. Alternatively, setting this property on an existing Iceberg V2 table atomically upgrades the table without data rewrites. Before creating or upgrading V3 tables, make sure the Iceberg engines in your solution are V3-compatible. Refer to Apache Iceberg V3 on AWS for more details.

Create a new V3 table with Apache Spark on Amazon EMR 7.12

The following code creates a new table named customer_data. Setting the table property format-version = 3 creates a V3 table. If the format-version table property is not explicitly set, a V2 table is created. V2 is currently the Iceberg default table version. Setting write.delete.mode, write.update.mode, and write.merge.mode to merge-on-read configures Spark to write deletion vectors for delete, update, or merge statements performed on the table.

CREATE TABLE customer_data (
  customer_id bigint,
  name string,
  email string,
  last_purchase timestamp,
  total_spent decimal(10,2)
)
USING iceberg
TBLPROPERTIES (
  'format-version' = '3',
  'write.delete.mode' = 'merge-on-read',
  'write.update.mode' = 'merge-on-read',
  'write.merge.mode' = 'merge-on-read'
)

Run the following code to insert records into the customer_data table:

INSERT INTO customer_data VALUES
 (1, 'Alejandro Rosalez', '[email protected]', TIMESTAMP '2025-11-24 18:55:27', 42.97)
,(2, 'Akua Mansa', '[email protected]', TIMESTAMP '2025-11-24 17:55:27', 25.02)
,(3, 'Ana Carolina Silva','[email protected]', TIMESTAMP '2025-11-24 16:55:27', 43.67)
,(4, 'Arnav Desai','[email protected]', TIMESTAMP '2025-11-24 15:55:27', 98.32)
,(5, 'Carlos Salazar','[email protected]', TIMESTAMP '2025-11-24 12:55:27', 76.45)

Delete a record where customer_id = 5 to generate a delete file:

DELETE 
  FROM customer_data 
  WHERE customer_id = 5

Updating a record with the following update statement also generates a delete file:

UPDATE customer_data
  SET name = 'Mansa Akua' 
  WHERE customer_id = 2

The last part of this example queries the manifests metadata table to verify that delete files were produced:

SELECT added_snapshot_id
      ,sum(added_delete_files_count) as added_delete_files_count 
FROM customer_data.manifests 
GROUP BY added_snapshot_id 
ORDER BY added_snapshot_id

This query returns three records. The added_delete_files_count for the first snapshot, which inserts the records, should be 0. The next two snapshots, for the delete and update statements, should each have an added_delete_files_count of 1.

Query row lineage for change tracking

Row lineage is automatically enabled on V3 tables. The following example includes row lineage metadata fields and an example of how to query table changes after a row lineage sequence number:

SELECT
customer_id,
name,
email,
_row_id,
_last_updated_sequence_number
FROM customer_data
WHERE _last_updated_sequence_number > 0
ORDER BY _last_updated_sequence_number, _row_id

Running this query after the previous insert, update, and delete statements returns four records, as shown in the following screenshot. The deleted record is removed. The _last_updated_sequence_number is 3 for the update to customer_id = 2.

Upgrade an existing V2 table

You can upgrade your existing V2 tables to V3 with the following command:

ALTER TABLE existing_customer_data
SET TBLPROPERTIES ('format-version' = '3')

When you upgrade a table from V2 to V3, several important operations occur atomically:

A new metadata snapshot is created atomically, resulting in no data loss.
Existing Parquet data files are reused without modification.
Row-lineage fields (_row_id and _last_updated_sequence_number) are added to the table metadata.
The next compaction operation will remove old V2 positional delete files. If new deletion vector files are generated before compaction runs, they will merge existing V2 positional delete files.
New modifications will automatically use V3’s deletion vector files.
The upgrade does not perform a historical backfill of row-lineage change tracking records.

The upgrade process is synchronous and completes in seconds for most tables. If the upgrade fails, an error message is returned immediately, and the table remains in its V2 state.

Getting the most from Iceberg V3

In this section, we share the key things we’ve learned from customers already using these features.

Know your workload pattern

Deletion vectors work best when you’re doing lots of writes, such as high-frequency updates, batch deletes, or CDC workloads making random non-append-only updates. If you’re writing more than you’re reading, deletion vectors will deliver immediate performance gains. To unlock these benefits, set your table to merge-on-read mode for delete, update, and merge operations.

Let AWS handle compaction

Enable automatic compaction through the Data Catalog or use S3 Tables (on by default). You will get hands-free optimization without building custom maintenance jobs. Deletion vectors produce fewer delete files than positional deletes in Iceberg V2. Given a similar pattern and amount of modified records, V3 compaction should be quicker and cost less than V2.

Understand the importance of row lineage when using the V2 changelog

With the Spark changelog procedure in Iceberg V2, if a row gets inserted and then deleted between snapshots, it disappears from your change feed—you never see it. Iceberg V3 row lineage captures both operations because _last_updated_sequence_number updates on each modification. This full fidelity is important for audit trails and regulatory compliance where you need to prove what happened to every record. Performance-wise, the V2 changelog requires scanning and merging delete files to compute changes—that’s compute you’re paying for on every read. V3 row lineage stores metadata fields directly on each row, so filtering by _last_updated_sequence_number is a simple metadata scan.

Test before you upgrade

Iceberg V3 upgrades are atomic and fast, but test in dev first. Make sure all your query engines support Iceberg V3 before upgrading shared tables—mixing V2 and V3 engines causes headaches. After upgrading, keep a few V2 snapshots around temporarily for time-travel queries while you validate performance.

Conclusion

Iceberg V3 support across AWS analytics, catalog, and storage services marks a significant advancement in data lake capabilities. By combining deletion vectors’ write optimization with row lineage’s comprehensive change tracking, you can build more efficient, auditable, and cost-effective data lakes at scale. The seamless interoperability across AWS services makes sure your data lake architecture remains flexible and future-proof.

To learn more about AWS support for Iceberg V3, refer to Using Apache Iceberg on AWS.

To learn more about building modern data lakes with Iceberg on AWS, refer to Analytics on AWS.

Amazon Route 53 launches Accelerated recovery for managing public DNS records

2025-11-26 Micah Walter

Post Syndicated from Micah Walter original https://aws.amazon.com/blogs/aws/amazon-route-53-launches-accelerated-recovery-for-managing-public-dns-records/

Today, we’re announcing Amazon Route 53 Accelerated recovery for managing public DNS records, a new Domain Name System (DNS) business continuity feature that is designed to provide a 60-minute recovery time objective (RTO) during service disruptions in the US East (N. Virginia) AWS Region. This enhancement ensures that customers can continue making DNS changes and provisioning infrastructure even during regional outages, providing greater predictability and resilience for mission-critical applications.

Customers running applications that require business continuity have told us they need additional DNS resilience capabilities to meet their business continuity requirements and regulatory compliance obligations. While AWS maintains exceptional availability across our global infrastructure, organizations in regulated industries like banking, FinTech, and SaaS want the confidence that they will be able to make DNS changes even during unexpected regional disruptions, allowing them to quickly provision standby cloud resources or redirect traffic when needed.

Accelerated recovery for managing public DNS records addresses this need by targeting DNS changes that customers can make within 60 minutes of a service disruption in the US East (N. Virginia) Region. The feature works seamlessly with your existing Route 53 setup, providing access to key Route 53 API operations during failover scenarios, including ChangeResourceRecordSets, GetChange, ListHostedZones, and ListResourceRecordSets. Customers can continue using their existing Route 53 API endpoint without modifying applications or scripts.

Let’s try it out
Configuring a Route 53 hosted zone to use accelerated recovery is simple. Here I am creating a new hosted zone for a new website I’m building.

Once I have created my hosted zone, I see a new tab labeled Accelerated recovery. I can see here that accelerated recovery is disabled by default.

To enable it, I just need to click the Enable button and confirm my choice in the modal that appears as depicted in the dialog below.

Enabling accelerated recovery takes a couple of minutes to complete. Once it’s enabled, I see a green Enabled status as depicted in the screenshot below.

I can disable accelerated recovery at any time from this same area of the AWS Management Console. I can also enable accelerated recovery for any existing hosted zones I have already created.

Enhanced DNS business continuity
With accelerated recovery enabled, customers gain several key capabilities during service disruptions. The feature maintains access to essential Route 53 API operations, ensuring that DNS management remains available when it’s needed most. Organizations can continue to make critical DNS changes, provision new infrastructure, and redirect traffic flows without waiting for full service restoration.

The implementation is designed for simplicity and reliability. Customers don’t need to learn new APIs or modify existing automation scripts. The same Route 53 endpoints and API calls continue to work, providing a seamless experience during both normal operations and failover scenarios.
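As a rough illustration of the kind of DNS change you might make during a failover, the following sketch builds the request parameters for a ChangeResourceRecordSets call that points a record at a standby resource. The function name buildDnsFailoverChange, the hosted zone ID, the record name, and the IP address are all hypothetical placeholders. The sketch only constructs the request object and does not call Route 53.

```javascript
// Hypothetical helper: builds the input for a ChangeResourceRecordSets request
// that upserts an A record pointing at a standby IP address.
function buildDnsFailoverChange(hostedZoneId, recordName, standbyIp, ttlSeconds = 60) {
  return {
    HostedZoneId: hostedZoneId,
    ChangeBatch: {
      Comment: "Redirect traffic to standby resources",
      Changes: [
        {
          // UPSERT creates the record if it is missing, or updates it if it exists
          Action: "UPSERT",
          ResourceRecordSet: {
            Name: recordName,
            Type: "A",
            TTL: ttlSeconds,
            ResourceRecords: [{ Value: standbyIp }],
          },
        },
      ],
    },
  };
}

// Placeholder values for illustration only
const params = buildDnsFailoverChange("Z0000000EXAMPLE", "app.example.com.", "203.0.113.10");
console.log(JSON.stringify(params, null, 2));
```

With the AWS SDK for JavaScript v3, an object of this shape would typically be passed to ChangeResourceRecordSetsCommand from @aws-sdk/client-route-53, using the same endpoint and credentials your automation already uses.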

Now available
Accelerated recovery for Amazon Route 53 public hosted zones is available now. You can enable this feature through the AWS Management Console, AWS Command Line Interface (AWS CLI), AWS SDKs, or infrastructure as code tools like AWS CloudFormation and AWS Cloud Development Kit (AWS CDK). There is no additional cost for using accelerated recovery.

To learn more about accelerated recovery and get started, visit the documentation. This new capability represents our continued commitment to providing customers with the DNS resilience they need to build and operate mission-critical applications in the cloud.

Node.js 24 runtime now available in AWS Lambda

2025-11-26 Andrea Amorosi

Post Syndicated from Andrea Amorosi original https://aws.amazon.com/blogs/compute/node-js-24-runtime-now-available-in-aws-lambda/

You can now develop AWS Lambda functions using Node.js 24, either as a managed runtime or using the container base image. Node.js 24 is in active LTS status and ready for production use. It is expected to be supported with security patches and bugfixes until April 2028.

The Lambda runtime for Node.js 24 includes a new implementation of the Runtime Interface Client (RIC), which integrates your function code with the Lambda service. Written in TypeScript, the new RIC streamlines and simplifies Node.js support in Lambda, removing several legacy features. In particular, callback-based function handlers are no longer supported.

Node.js 24 includes several additions to the language, such as Explicit Resource Management, as well as changes to the runtime implementation and the standard library. With this release, Node.js developers can take advantage of these new features and enhancements when creating serverless applications on Lambda.

You can develop Node.js 24 Lambda functions using the AWS Management Console, AWS Command Line Interface (AWS CLI), AWS SDK for JavaScript, AWS Serverless Application Model (AWS SAM), AWS Cloud Development Kit (AWS CDK), and other infrastructure as code tools. You can use Node.js 24 with Powertools for AWS Lambda (TypeScript), a developer toolkit to implement serverless best practices and increase developer velocity. Powertools includes libraries to support common tasks such as observability, AWS Systems Manager Parameter Store integration, idempotency, batch processing, and more. You can also use Node.js 24 with Lambda@Edge to customize low-latency content delivered through Amazon CloudFront.

This blog post highlights important changes to the Node.js runtime, notable Node.js language updates, and how you can use the new Node.js 24 runtime in your serverless applications.

Node.js 24 runtime changes

The Lambda Runtime for Node.js 24 includes the following changes relative to the Node.js 22 and earlier runtimes.

Removing support for callback-based function handlers

Starting with the Node.js 24 runtime, Lambda no longer supports the callback-based handler signature for asynchronous operations. Callback-based handlers take three parameters, with the third parameter a callback. For example:

export const handler = (event, context, callback) => {
    try {
        // Some processing...
        
        // Success case
        // First parameter (error) is null, second is the result
        callback(null, {
            statusCode: 200,
            body: JSON.stringify({
                message: "Operation completed successfully"
            })
        });
        
    } catch (error) {
        // Error case
        // First parameter contains the error
        callback(error);
    }
};

The modern approach to asynchronous programming in Node.js is to use the async/await pattern. Lambda introduced support for async handlers with the Node.js 8 runtime, launched in 2018. Here’s how the above function looks when using an async handler:

export const handler = async (event, context) => {
    try {
        // Some processing
        
        return {
            statusCode: 200,
            body: JSON.stringify({
                message: "Operation completed successfully"
            })
        };
        
    } catch (error) {
        throw error;
    }
};

The Node.js 24 runtime still supports synchronous function handlers that do not use callbacks:

export const handler = (event, context) => {
    // Perform some synchronous data processing
    const response = { message: "Operation completed successfully" };
    // Return response
    return {
        statusCode: 200,
        body: JSON.stringify(response)
    };
};

And Node.js 24 still supports response streaming, enabling more responsive applications by accelerating the time-to-first-byte:

import { Readable } from "node:stream";
import { pipeline } from "node:stream/promises";

export const handler = awslambda.streamifyResponse(async (event, responseStream, context) => {
    // Convert event to a readable stream
    const requestStream = Readable.from(Buffer.from(JSON.stringify(event)));
    // Stream the response using pipeline
    await pipeline(requestStream, responseStream);
});

This change to remove support for callback-based function handlers only affects Node.js 24 (and later) runtimes. Existing runtimes for Node.js 22 and earlier continue to support callback-based function handlers. When migrating functions that use callback-based handlers to Node.js 24, you need to modify your code to use one of the supported function handler signatures.

As part of this change, context.callbackWaitsForEmptyEventLoop is removed. In addition, the previously deprecated context.succeed, context.fail, and context.done methods have also been removed. This aligns the runtime with modern Node.js patterns for clearer, more consistent error and result handling.
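If rewriting a large callback-based handler in one step is impractical, one possible interim approach is to wrap it in a promise. The following is a minimal sketch, not an AWS-provided utility: fromCallbackHandler and legacyHandler are hypothetical names, and the wrapper does not reproduce the old callbackWaitsForEmptyEventLoop behavior. The snippet uses plain declarations so it runs as a standalone script. In a Lambda ES module, you would export the wrapped handler instead.

```javascript
// Hypothetical helper (not part of the Lambda runtime): wraps a legacy
// (event, context, callback) handler so it exposes the async signature.
function fromCallbackHandler(legacyHandler) {
  return (event, context) =>
    new Promise((resolve, reject) => {
      try {
        legacyHandler(event, context, (error, result) => {
          if (error) {
            reject(error);
          } else {
            resolve(result);
          }
        });
      } catch (error) {
        // Synchronous throws from the legacy handler become rejections
        reject(error);
      }
    });
}

// Example legacy handler, unchanged from its callback-based form
const legacyHandler = (event, context, callback) => {
  if (!event.name) {
    callback(new Error("name is required"));
    return;
  }
  callback(null, {
    statusCode: 200,
    body: JSON.stringify({ greeting: `Hello, ${event.name}` }),
  });
};

// In a Lambda ES module: export const handler = fromCallbackHandler(legacyHandler);
const handler = fromCallbackHandler(legacyHandler);
```

The wrapped function takes only event and context, so it matches the async handler signature shown earlier. Treat this as a stopgap and prefer converting the handler body to async/await when you can.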

Harmonizing streaming and non-streaming behavior for unresolved promises

The Node.js 24 runtime also resolves a previous inconsistency in how unresolved promises were handled. Previously, Lambda would not wait for unresolved promises once the handler returns except when using response streaming. Starting with Node.js 24, the response streaming behavior is now consistent with non-streaming behavior, and Lambda no longer waits for unresolved promises once your handler returns or the response stream ends. Any background work (for example, pending timers, fetches, or queued callbacks) is not awaited implicitly. If your response depends on additional asynchronous operations, ensure you await them in your handler or integrate them into the streaming pipeline before closing the stream or returning, so the response only completes after all required work has finished.
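The following sketch shows one way to make sure background work completes before the handler returns. The recordAudit function and the in-memory auditLog array are hypothetical stand-ins for real asynchronous side effects such as writing to a database or sending a metric.

```javascript
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical side effect: stands in for an asynchronous write to another service
const auditLog = [];
async function recordAudit(entry) {
  await delay(10);
  auditLog.push(entry);
}

// In a Lambda ES module: export const handler = async (event) => { ... };
const handler = async (event) => {
  // Start the background work and keep the promises
  const pending = [
    recordAudit({ id: event.id, action: "received" }),
    recordAudit({ id: event.id, action: "processed" }),
  ];

  // Await everything the response depends on before returning.
  // Without this line, the writes could still be in flight when the handler returns.
  await Promise.all(pending);

  return {
    statusCode: 200,
    body: JSON.stringify({ recorded: auditLog.length }),
  };
};
```

Collecting the promises and awaiting them together keeps the work concurrent while still guaranteeing it has finished by the time the response is returned.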

Experimental Node.js features

Node.js enables certain experimental features by default in the upstream language releases. Such features include support for loading ECMAScript modules (ES modules) using require() and automatically detecting ES vs CommonJS modules. As they are experimental, these features may be unstable or undergo breaking changes in future Node.js updates. To provide a stable experience, Lambda disables these features by default in the corresponding Lambda runtimes.

Lambda allows you to re-enable these features by adding the --experimental-require-module flag or the --experimental-detect-module flag to the NODE_OPTIONS environment variable. Enabling experimental Node.js features may affect performance and stability, and these features can change or be removed in future Node.js releases; such issues are not covered by AWS Support or the Lambda SLA.

ES modules in CloudFormation inline functions

With AWS CloudFormation inline functions, you provide your function code directly in the CloudFormation template. They’re particularly useful when deploying custom resources. With inline functions, the code filename is always index.js, which by default Node.js interprets as a CommonJS module. With the Node.js 24 runtime, you can use ES modules when authoring inline functions by passing the --experimental-detect-module flag via the NODE_OPTIONS environment variable. Previously, you needed a zip or container package to use ES modules. With Node.js 24, you can write inline functions using standard ESM syntax (import/export and top-level await), which simplifies small utilities and bootstrap logic without requiring a packaging step.

Node.js 24 language features

Node.js 24 introduces several language updates and features that enhance developer productivity and improve application performance.

Node.js 24 includes Undici 7, a newer version of the HTTP client that powers global fetch. This version brings performance improvements and broader protocol capabilities. Network-heavy Lambda functions that call AWS services or external APIs can benefit from better connection management and throughput, especially when reusing clients or using HTTP/2 where supported. Most applications should work without changes, but you should validate behavior for advanced scenarios, such as custom headers or streaming bodies, and continue to define HTTP clients outside of the handler to maximize connection reuse across invocations.

The JavaScript Explicit Resource Management syntax (using and await using) enables deterministic clean-up of resources when a block completes. For Lambda handlers, this makes it easier to ensure short-lived objects, such as streams, temporary buffers, or file handles, are disposed of promptly, which reduces the risk of resource leaks across warm invocations. You should continue to define long-lived clients, for example SDK clients or database pools, outside the handler to benefit from connection reuse, and apply explicit disposal only to resources you want to tear down at the end of each invocation.

Finally, the AsyncLocalStorage API now uses AsyncContextFrame by default, improving the performance and reliability of async context propagation. This benefits common serverless patterns, such as correlating logs and managing tracing IDs and request-scoped metadata across timers, streams, and async/await boundaries, without manual parameter threading. If you already use AsyncLocalStorage-based libraries for logging or observability, you may see lower overhead and more consistent context propagation in Node.js 24.
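To make the context propagation pattern concrete, here is a small sketch that attaches a request ID to log lines without passing it through every function call. The names requestContext, log, and loadOrder are hypothetical, and logLines is an in-memory array standing in for a real logger. The snippet uses require so it runs as a standalone script. In an ES module, you would use import { AsyncLocalStorage } from "node:async_hooks" instead.

```javascript
const { AsyncLocalStorage } = require("node:async_hooks");

// One store per module; each invocation gets its own context through run()
const requestContext = new AsyncLocalStorage();
const logLines = [];

// Reads the request ID from the current async context instead of a parameter
function log(message) {
  const store = requestContext.getStore();
  logLines.push(JSON.stringify({ requestId: store?.requestId ?? "unknown", message }));
}

// Hypothetical downstream function: note that it takes no requestId argument
async function loadOrder(orderId) {
  await new Promise((resolve) => setTimeout(resolve, 5));
  log(`loaded order ${orderId}`);
  return { orderId };
}

// In a Lambda ES module: export const handler = async (event, context) => ...
const handler = async (event, context) =>
  requestContext.run({ requestId: context.awsRequestId }, async () => {
    log("start");
    const order = await loadOrder(event.orderId);
    return { statusCode: 200, body: JSON.stringify(order) };
  });
```

Every log line written inside run(), including the one emitted after the timer in loadOrder, carries the awsRequestId of the invocation that triggered it.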

For a detailed overview of Node.js 24 language features, see the Node.js 24 release blog post and the Node.js 24 changelog.

Performance considerations

At launch, new Lambda runtimes receive less usage than existing established runtimes. This can result in longer cold start times due to reduced cache residency within internal Lambda sub-systems. Cold start times typically improve in the weeks following launch as usage increases. As a result, AWS recommends not drawing conclusions from side-by-side performance comparisons with other Lambda runtimes until the performance has stabilized. Since performance is highly dependent on workload, customers with performance-sensitive workloads should conduct their own testing, instead of relying on generic test benchmarks.

Builders should continue to measure and test function performance and optimize function code and configuration for any impact. To learn more about how to optimize Node.js performance in Lambda, see our blog post Optimizing Node.js dependencies in AWS Lambda.
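When running your own tests, it can help to separate cold and warm invocations in your measurements. One common approach, sketched below under the assumption that module-scope code runs once per execution environment, is to keep a flag outside the handler. The field names coldStart and msSinceInit are illustrative choices, not a Lambda convention.

```javascript
// Module scope runs once per execution environment, so this flag is true
// only for the first invocation after a cold start.
let isColdStart = true;
const initializedAt = Date.now();

// In a Lambda ES module: export const handler = async (event) => { ... };
const handler = async (event) => {
  const coldStart = isColdStart;
  isColdStart = false;
  const startedAt = Date.now();

  // ... your function logic goes here ...

  // Emit a structured record that you can filter on when comparing runtimes
  const record = { coldStart, msSinceInit: startedAt - initializedAt };
  console.log(JSON.stringify(record));
  return record;
};
```

Filtering your logs on the coldStart field lets you compare warm-path latency across runtimes separately from cold start behavior, which is the part most likely to change in the weeks after launch.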

Migration from earlier Node.js runtimes

We’ve already discussed changes that are new to the Node.js 24 runtime, such as removing support for callback-based function handlers. As a reminder, we’ll recap some previous changes for customers upgrading from older Node.js runtimes.

AWS SDK for JavaScript

Up until Node.js 16, Lambda’s Node.js runtimes included the AWS SDK for JavaScript version 2. This has since been superseded by the AWS SDK for JavaScript version 3, which became generally available in December 2020. Starting with Node.js 18, and continuing with Node.js 24, the Lambda Node.js runtimes include version 3. When upgrading from Node.js 16 or earlier runtimes and using the included version 2, you must upgrade your code to use the v3 SDK.

For optimal performance, and to have full control over your code dependencies, we recommend bundling and minifying the AWS SDK in your deployment package, rather than using the SDK included in the runtime. For more information, see Optimizing Node.js dependencies in AWS Lambda.

Amazon Linux 2023

The Node.js 24 runtime is based on the provided.al2023 runtime, which is based on the Amazon Linux 2023 minimal container image. The Amazon Linux 2023 minimal image uses microdnf as a package manager, symlinked as dnf. This replaces the yum package manager used in Node.js 18 and earlier AL2-based images. If you deploy your Lambda function as a container image, you must update your Dockerfile to use dnf instead of yum when upgrading to the Node.js 24 base image from Node.js 18 or earlier.

Learn more about the provided.al2023 runtime in the blog post Introducing the Amazon Linux 2023 runtime for AWS Lambda and the Amazon Linux 2023 launch blog post.

Using the Node.js 24 runtime in AWS Lambda

Finally, we’ll review how to configure your functions to use Node.js 24, using a range of deployment tools.

AWS Management Console

When using the AWS Lambda Console, you can choose Node.js 24.x in the Runtime dropdown when creating a function:

Creating Node.js function in the AWS Management Console

To update an existing Lambda function to Node.js 24, navigate to the function in the Lambda console, click Edit in the Runtime settings panel, then choose Node.js 24.x from the Runtime dropdown:

Editing Node.js function runtime

AWS Lambda container image

Change the Node.js base image version by modifying the FROM statement in your Dockerfile.

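FROM public.ecr.aws/lambda/nodejs:24
# Copy function code
COPY lambda_handler.mjs ${LAMBDA_TASK_ROOT}
# Set the handler as <file>.<export>; this assumes lambda_handler.mjs exports a function named handler
CMD [ "lambda_handler.handler" ]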

AWS Serverless Application Model

In AWS SAM, set the Runtime attribute to nodejs24.x to use this version:

AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31
Resources:
  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: lambda_function.lambda_handler
      Runtime: nodejs24.x
      CodeUri: my_function/.
      Description: My Node.js Lambda Function

AWS SAM supports generating this template with Node.js 24 for new serverless applications using the sam init command. For more information, refer to the AWS SAM documentation.

AWS Cloud Development Kit (AWS CDK)

In AWS CDK, set the runtime attribute to Runtime.NODEJS_24_X to use this version.

import * as cdk from "aws-cdk-lib";
import * as lambda from "aws-cdk-lib/aws-lambda";
import * as path from "path";
import { Construct } from "constructs";
export class CdkStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);
    // The code that defines your stack goes here
    // The Node.js 24 enabled Lambda Function
    const lambdaFunction = new lambda.Function(this, "node24LambdaFunction", {
      runtime: lambda.Runtime.NODEJS_24_X,
      code: lambda.Code.fromAsset(path.join(__dirname, "/../lambda")),
      handler: "index.handler",
    });
  }
}

Conclusion

AWS Lambda now supports Node.js 24 as a managed runtime and container base image. This release uses a new runtime interface client, removes support for callback-based function handlers, and includes several other changes to streamline and simplify Node.js support in Lambda.

You can build and deploy Node.js 24 functions using the AWS Management Console, AWS CLI, AWS SDK, AWS SAM, AWS CDK, or your choice of infrastructure as code tool. You can also use the Node.js 24 container base image if you prefer to build and deploy your functions using container images.

To find more Node.js examples, use the Serverless Patterns Collection. For more serverless learning resources, visit Serverless Land.

The attendee’s guide to hybrid cloud and edge computing at AWS re:Invent 2025

2025-11-25 Rachel Zheng

Post Syndicated from Rachel Zheng original https://aws.amazon.com/blogs/compute/the-attendees-guide-to-hybrid-cloud-and-edge-computing-at-aws-reinvent-2025/

AWS re:Invent 2025 returns to Las Vegas, Nevada, from December 1–5, 2025. This year, we’re offering a comprehensive lineup of sessions and booth activities to help you build resilient, performant, and scalable applications wherever you need them—in the cloud, on premises, or at the edge.

Session types

These sessions are available in the following formats. Most of the sessions are under the topic of Hybrid Cloud and Multicloud (HMC) in the event catalog. If you plan to attend in person, lightning talks and theater sessions are walk-up only. For all other session types, you can reserve your seat through the event portal (login required) or join as a walk-up based on availability.

Breakout sessions – Lecture-style 60-minute presentations led by AWS experts and customers.
Lightning talks – 20-minute content on specific topics. Each Hybrid Cloud lightning talk features a real-world customer implementation.
Chalk talks – 60-minute interactive sessions where AWS experts lead discussions and whiteboard in real time.
Code talks – 60-minute sessions featuring coding demonstrations and technical implementations.
Workshops – Hands-on, 120-minute sessions where you work directly with AWS services in a guided environment.
Theater sessions – 15-minute quick sessions on the Expo floor, typically featuring partner solutions.

Our Hybrid Cloud sessions explore how you can extend AWS infrastructure, services, and tools to distributed locations for low latency, data residency, and local processing needs. These sessions focus on AWS Local Zones, AWS Outposts, and AWS Dedicated Local Zones. We’ve curated the following sessions by theme to help you navigate re:Invent and find content most relevant to your needs.

Leadership session

HMC202 | Breakout session | AWS wherever you need it: From the cloud to the edge

Wednesday, Dec 3 | 2:30PM – 3:30PM PST

Wynn | Convention Promenade | Lafite 7 | Content Hub | Turquoise Theater

Presented by our engineering and product management leaders, Spencer Dillard and Madhura Kale, this session provides an overview of all our latest innovations for hybrid cloud and edge computing, and how they help you address infrastructure requirements in digital sovereignty, generative AI with local data processing needs, and migration and modernization with on-premises dependencies.

Theme 1: Generative AI and agentic AI with local data processing

As you scale generative AI implementations from pilots to production, you need to balance speed of innovation with data sovereignty requirements, low-latency edge inference needs, and space, power, and cost efficiency. In the following sessions, you will learn how to address these challenges.

HMC308 | Breakout session | Build generative and agentic AI applications on-premises and at the edge

Thursday, Dec 4 | 2:30PM – 3:30PM PST

Wynn | Upper Convention Promenade | Cristal 7

This session shares reference architectures, best practices, and demos for running small language models (SLMs), Retrieval Augmented Generation (RAG), and agentic AI with AWS Hybrid and Edge services. Gain insights into strategies for model selection and optimization.

HMC324 | Lightning talk | BCC: Hybrid architecture for generative AI to meet regulatory needs

Monday, Dec 1 | 10:00AM – 10:20AM PST

Mandalay Bay | Oceanside C | Content Hub | Lightning Theater

Learn how Bank CenterCredit (BCC) implemented two generative AI use cases: anonymizing personally identifiable information (PII) in customer service calls with Outposts before sending it to the parent AWS Region for foundation model (FM) fine-tuning, and implementing local RAG with regulated data to improve HR efficiency.

HMC311 | Chalk talk | Developing end-to-end SLM pipelines from the cloud to the edge

Thursday, Dec 4 | 11:30AM – 12:30PM PST

MGM | Level 3 | Chairman’s 362

This session presents a comprehensive approach to deploying SLMs to Local Zones and Outposts using Amazon SageMaker and Amazon EKS. Learn how to deliver domain-specific, fine-tuned SLMs from Regions to edge locations for low-latency inference.

HMC312 | Chalk talk | Implement RAG while meeting data residency requirements

Wednesday, Dec 3 | 5:30PM – 6:30PM PST

Mandalay Bay | Level 3 South | South Seas A

This session explores how to implement RAG with on-premises and edge data to help you meet data residency and digital sovereignty needs.

HMC317 | Code talk | Implement Agentic AI at the edge for industrial automation

Monday, Dec 1 | 10:30AM – 11:30AM PST

Mandalay Bay | Level 3 South | Jasmine H

Manufacturing and industrial operations demand real-time, intelligent decision-making in low-connectivity environments. Learn how to deploy SmolVLMs (small vision language models) and AI agents to automate site operations using Outposts and Strands Agents.

HMC302 | Workshop | Implementing agentic AI solutions on-premises and at the edge

Wednesday, Dec 3 | 4:00PM – 6:00PM PST

MGM | Level 3 | Chairman’s 368

In this workshop, learn how to extend Amazon Bedrock AgentCore to Outposts and Local Zones to build distributed agentic applications using Model Context Protocol (MCP) and agent-to-agent (A2A) communication with on-premises data.

HMC305 | Workshop | Low-latency SLM deployment: Optimizing inference on AWS Hybrid and Edge services

Monday, Dec 1 | 8:00AM – 10:00AM PST

MGM | Level 1 | Grand 117

This workshop demonstrates a fully local deployment approach for running SLMs at the edge using Local Zones and Outposts. The implementation focuses on achieving low-latency inference and enabling data sovereignty compliance through RAG within local infrastructure.

Theme 2: Migration and modernization with on-premises or edge dependencies

Certain workloads need to stay on-premises or closer to end users. This can be driven by data residency and digital sovereignty requirements or the need to access legacy on-premises systems at low latency. When a Region is not close enough to meet these needs, AWS brings its infrastructure and services closer to where you need them to accelerate migration and modernization.

HMC309 | Breakout session | Migrating your VMware workloads with on-premises dependencies

Thursday, Dec 4 | 11:30AM – 12:30PM PST

Caesars Forum | Level 1 | Summit 212 | Content Hub | Orange Theater

Learn how AWS can help you migrate VMware-based workloads while addressing data residency requirements and latency-sensitive application interdependencies. Gain insights from a real-world implementation at Caesars Entertainment.

HMC217 | Lightning talk | Rivian: Modernize mission-critical manufacturing applications with AWS

Wednesday, Dec 3 | 2:30PM – 2:50PM PST

Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Lightning Theater

Manufacturing applications like Ignition, SCADA, MES, and robotic control require low-latency connectivity to on-premises manufacturing equipment. Learn how Rivian is modernizing mission-critical and latency-sensitive manufacturing applications with Outposts.

HMC313 | Chalk talk | Extend Amazon EKS clusters for on-premises and edge use cases

Wednesday, Dec 3 | 4:00PM – 5:00PM PST

MGM | Level 3 | Chairman’s 356

Dive deep into strategies for modernizing distributed applications with Amazon EKS across Regions, Local Zones, and Outposts.

HMC303 | Workshop | Migrate and modernize VMware workloads with on-premises dependencies

Tuesday, Dec 2 | 12:30PM – 2:30PM PST

Wynn | Convention Promenade | Margaux 2

This workshop guides you through migrating VMware and other on-premises applications to Outposts while modernizing them through containerization.

Theater session | Deploying robust disaster recovery for mission-critical workloads

Wednesday, Dec 3 | 1:00PM – 1:15PM PST

The Venetian | Level 2 | Expo Hall B | NetApp Booth (#1039)

With Outposts third-party storage integration, you can modernize right inside your data centers while leveraging your investment in on-premises storage solutions. Join this session to learn how you can implement a robust disaster recovery solution for mission-critical workloads using Amazon FSx for NetApp ONTAP and on-premises ONTAP with Outposts. Learn how to perform real-time SnapMirror replication between Regions and on-premises environments and monitor replication status and RPO metrics.

Theme 3: Data residency and digital sovereignty

As organizations scale innovative solutions globally, they need to navigate complex digital sovereignty requirements. Learn how AWS Hybrid and Edge services help you adopt a consistent approach to security, monitoring, management, and auditing across jurisdictions while meeting regulatory obligations.

HMC310 | Breakout session | Digital sovereignty and data residency with AWS Hybrid and Edge services

Tuesday, Dec 2 | 4:30PM – 5:30PM PST

Caesars Forum | Level 1 | Summit 212 | Content Hub | Yellow Theater

This session examines best practices for data residency, security controls, and operational consistency across distributed locations. Hear how AWS helped Geidea, a leading fintech company, accelerate business expansion in the Middle East while meeting country-specific data residency requirements.

HMC214 | Lightning talk | DraftKings: Meeting gaming regulations at Super Bowl scale with AWS Local Zones

Wednesday, Dec 3 | 2:00PM – 2:20PM PST

Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Lightning Theater

Learn how DraftKings built a scalable edge strategy using Regions and Local Zones to meet Federal Wire Act requirements while accelerating expansion into 26 US states.

HMC316 | Chalk talk | Address digital sovereignty with hybrid cloud solutions

Monday, Dec 1 | 1:30PM – 2:30PM PST

Mandalay Bay | Lower Level North | South Pacific D

Learn how to choose the best AWS infrastructure option for your sovereign needs and architect applications for data residency and resiliency. Discover how to implement security controls to regulate how data can be stored, processed, and transferred, and how to prevent unauthorized data access.

Theme 4: Optimizing for low and ultra-low latency

Systems like online ticketing, real-time threat detection, manufacturing control, and financial trading require network latency ranging from double-digit milliseconds to low double-digit microseconds. The Hybrid Cloud track discusses how AWS brings cloud services closer to end users and data generation points to satisfy various latency profiles.

SPF206 | Breakout session | Ticketmaster: Enhancing live event experience for fans with AWS Local Zones

Thursday, Dec 4 | 2:30PM – 3:30PM PST

Caesars Forum | Level 1 | Sports Forum | Mainstage

Discover how Ticketmaster delivers superior live event experiences by bringing its online ticket sales platform closer to fans using Local Zones.

HMC216 | Lightning talk | LSEG: Modernizing critical financial systems with multicloud and hybrid cloud

Tuesday, Dec 2 | 3:30PM – 3:50PM PST

Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Lightning Theater

Learn how LSEG is transforming its global PriceStream FX trading platform, implementing a sophisticated architecture for ultra-low latency with managed services. Additionally, explore how LSEG is modernizing clearing systems as critical national infrastructure, balancing regulatory compliance, operational excellence, and business continuity with strict RPO/RTO requirements.

HMC215 | Lightning talk | AWS Local Zones – Sophos’ new edge in the global race against cyber-attacks

Tuesday, Dec 2 | 3:00PM – 3:20PM PST

Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Lightning Theater

Discover how AWS Hybrid and Edge solutions transform how organizations deliver low-latency applications at the edge. Learn how Local Zones extends Regions and services closer to population centers, fitting use cases from media streaming to real-time gaming and financial trading.

HMC213 | Lightning talk | AWS Local Zones & Outposts: Verifone’s Global Edge Computing Strategy

Tuesday, Dec 2 | 11:30AM – 11:50AM PST

Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Lightning Theater

Hear from Verifone on how it transformed its global payment ecosystem, managing 35 million terminals with innovative edge computing. Dive into strategies for deploying point-of-sale (POS) software using multi-tier architectures.

HMC402 | Chalk talk | Meet ultra-low latency and high throughput needs with AWS Outposts

Wednesday, Dec 3 | 10:00AM – 11:00AM PST

Mandalay Bay | Level 2 South | Lagoon G

Dive deep into the new category of accelerated networking Amazon EC2 instances on Outposts, purpose-built for modernizing ultra-low latency and high-throughput mission-critical workloads.

Theme 5: Architecture considerations for hybrid cloud

Running applications outside of Regions requires special architectural considerations to address space, power, and networking constraints. We will share best practices and implementation guidance on improving security, resiliency, and availability.

HMC327 | Lightning talk | Nasdaq: Build resilient infrastructure for global financial services

Tuesday, Dec 2 | 11:00AM – 11:20AM PST

Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Lightning Theater

This session discusses how Nasdaq modernizes its mission-critical capital markets infrastructure while upholding the highest level of resiliency. Learn how Nasdaq integrates Outposts and multi-Region deployments into its core trading and surrounding systems, balancing cloud flexibility with performance and reliability.

HMC328 | Lightning talk | Build resilient and low-latency hybrid telecom infrastructure at scale

Thursday, Dec 4 | 5:00PM – 5:20PM PST

Mandalay Bay | Level 2 South | Oceanside C | Content Hub | Lightning Theater

Discover how Liberty Latin America (LLA) built telecom infrastructure in a hybrid cloud architecture for millions of subscribers. Learn how LLA created a resilient, low-latency network with over 180 VPCs.

HMC314 | Chalk talk | Deploying for resilience: HA/DR strategies for AWS Outposts and AWS Local Zones

Monday, Dec 1 | 1:30PM – 2:30PM PST

MGM | Level 1 | Boulevard 169

In this chalk talk, learn how to plan and implement resilient deployments to deliver high availability and disaster recovery, especially for business-critical or mission-critical workloads.

HMC315 | Chalk talk | Deep dive on AWS hybrid and edge networking architectures

Tuesday, Dec 2 | 1:30PM – 2:30PM PST

MGM | Level 1 | Boulevard 156

This chalk talk covers patterns such as private vs. public connectivity, service link and on-premises connectivity, and designing resilient Multi-AZ architecture across Outposts and Local Zones.

HMC403 | Code talk | Build and optimize edge architecture for resiliency with AI

Wednesday, Dec 3 | 2:30PM – 3:30PM PST

MGM | Level 3 | Chairman’s 356

This live coding session explores how to automate edge infrastructure operations with AI. Discover how to build resilient architectures with the latest Outposts and Local Zones APIs and Strands Agents.

HMC301 | Workshop | Build and operate resilient and performant distributed applications [REPEAT]

HMC301-R: Monday, Dec 1 | 3:00PM – 5:00PM PST | MGM | Grand 117

HMC301-R1: Thursday, Dec 4 | 3:00PM – 5:00PM PST | MGM | Premier 318

This workshop explores how to design and implement applications for multi-geo operations while meeting data residency and performance requirements. You will learn how to design fault-tolerant, latency-sensitive applications across distributed locations with limited hardware resources.

Activities in the Expo

Beyond sessions, join us in the re:Invent Expo (The Venetian, Level 2, Hall B) to meet with AWS experts and learn through interactive demos.

AWS Village (Booth #750)

Connect with AWS experts in the Hybrid Cloud kiosk and AWS Global Infrastructure kiosk to discuss your hybrid cloud and edge computing needs. Watch demos on the following topics and ask questions:

Consistent Amazon EKS experience across distributed locations and how it can simplify GPU resource orchestration
How to optimize and deploy SLMs locally for AI inference with low latency or data residency needs
How to build agentic AI workflows at the edge for manufacturing automation

Stop by the Migration & Modernization area of the AWS Village to see hardware innovations inside the latest generation of Outposts up close and personal.

AWS for Industries Pavilion (Booth #111)

Explore how AWS Hybrid and Edge services unlock new use cases and improve operational efficiency across multiple industries through immersive experiences:

AWS for Telecom – Modernizing telecom infrastructure while meeting data sovereignty and performance requirements
AWS for Public Sector – Accelerating tactical edge deployments to improve mission outcomes
AWS for Automotive – Advancing global R&D of embedded hardware and software

Partner booths

Discover how AWS Hybrid and Edge services and AWS Partner solutions unlock additional use cases through demos:

Pure Storage (Booth #1756) – Simplifying cloud migration with on-premises dependencies through Outposts integration with Pure Storage FlashArray
Intel (Booth #1010) – Powering physical AI and agentic systems for real-time operations in manufacturing
Seagate (Booth #159) – Facilitating data ingestion and pre-processing at the edge

Ready for re:Invent 2025?

We hope this guide to hybrid cloud and edge computing helps you maximize your learning experience at re:Invent. Not able to attend in person? Register for the virtual-only event, offered at no additional cost, to livestream keynotes and innovation talks and access on-demand session content. See you in Las Vegas or virtually!

Introducing guidelines for network scanning

2025-11-25 Stephen Goodman

Post Syndicated from Stephen Goodman original https://aws.amazon.com/blogs/security/introducing-guidelines-for-network-scanning/

Amazon Web Services (AWS) is introducing guidelines for network scanning of customer workloads. By following these guidelines, conforming scanners will collect more accurate data, minimize abuse reports, and help improve the security of the internet for everyone.

Network scanning is a practice in modern IT environments that can be either used for legitimate security needs or abused for malicious activity. On the legitimate side, organizations conduct network scans to maintain accurate inventories of their assets, verify security configurations, and identify potential vulnerabilities or outdated software versions that require attention. Security teams, system administrators, and authorized third-party security researchers use scanning in their standard toolkit for collecting security posture data. However, scanning is also performed by threat actors attempting to enumerate systems, discover weaknesses, or gather intelligence for attacks. Distinguishing between legitimate scanning activity and potentially harmful reconnaissance is a constant challenge for security operations.

When software vulnerabilities are found through scanning a given system, it’s particularly important that the scanner is well-intentioned. If a software vulnerability is discovered and attacked by a threat actor, it could allow unauthorized access to an organization’s IT systems. Organizations must effectively manage their software vulnerabilities to protect themselves from ransomware, data theft, operational issues, and regulatory penalties. At the same time, the scale of known vulnerabilities is growing rapidly, at a rate of 21% per year for the past 10 years as reported in the NIST National Vulnerability Database.

With these factors at play, network scanners need to scan and manage the collected security data with care. There are a variety of parties interested in security data, and each group uses the data differently. If security data is discovered and abused by threat actors, then system compromises, ransomware, and denial of service can create disruption and costs for system owners. With the exponential growth of data centers and connected software workloads providing critical services across energy, manufacturing, healthcare, government, education, finance, and transportation sectors, the impact of security data in the wrong hands can have significant real-world consequences.

Multiple parties

Multiple parties have vested interests in security data, including at least the following groups:

Organizations want to understand their asset inventories and patch vulnerabilities quickly to protect their assets.
Program auditors want evidence that organizations have robust controls in place to manage their infrastructure.
Cyber insurance providers want risk evaluations of organizational security posture.
Investors performing due diligence want to understand the cyber risk profile of an organization.
Security researchers want to identify risks and notify organizations to take action.
Threat actors want to exploit unpatched vulnerabilities and weaknesses for unauthorized access.

The sensitive nature of security data creates a complex ecosystem of competing interests, where an organization must maintain different levels of data access for different parties.

Motivation for the guidelines

We’ve described both the legitimate and malicious uses of network scanning, and the different parties that have an interest in the resulting data. We’re introducing these guidelines because we need to protect our networks and our customers, and telling the difference between these parties is challenging. There’s no single standard for the identification of network scanners on the internet. As such, system owners and defenders often don’t know who is scanning their systems. Each system owner is independently responsible for managing identification of these different parties. Network scanners might use unique methods to identify themselves, such as reverse DNS, custom user agents, or dedicated network ranges. Malicious actors might attempt to evade identification altogether. This degree of identity variance makes it difficult for system owners to know the motivation of parties performing network scanning.

To address this challenge, we’re introducing behavioral guidelines for network scanning. AWS seeks to provide network security for every customer; our goal is to screen out abusive scanning that doesn’t meet these guidelines. Parties that broadly network scan can follow these guidelines to receive more reliable data from AWS IP space. Organizations running on AWS receive a higher degree of assurance in their risk management.

When network scanning is managed according to these guidelines, it helps system owners strengthen their defenses and improve visibility across their digital ecosystem. For example, Amazon Inspector can detect software vulnerabilities and prioritize remediation efforts while conforming to these guidelines. Similarly, partners in AWS Marketplace use these guidelines to collect internet-wide signals and help organizations understand and manage cyber risk.

“When organizations have clear, data-driven visibility into their own security posture and that of their third parties, they can make faster, smarter decisions to reduce cyber risk across the ecosystem.” – Dave Casion, CTO, Bitsight

Of course, security works better together, so AWS customers can report abusive scanning to our Trust & Safety Center as type Network Activity > Port Scanning and Intrusion Attempts. Each report helps improve the collective protection against malicious use of security data.

The guidelines

To help ensure that legitimate network scanners can clearly differentiate themselves from threat actors, AWS offers the following guidance for scanning customer workloads. This guidance on network scanning complements the policies on penetration testing and vulnerability reporting. AWS reserves the right to limit or block traffic that appears non-compliant with these guidelines. A conforming scanner adheres to the following practices:

Observational

Perform no actions that attempt to create, modify, or delete resources or data on discovered endpoints.
Respect the integrity of targeted systems. Scans cause no degradation to system function and cause no change in system configuration.
Examples of non-mutating scanning include:
- Initiating and completing a TCP handshake
- Retrieving the banner from an SSH service

Identifiable

Provide transparency by publishing sources of scanning activity.
Implement a verifiable process for confirming the authenticity of scanning activities.
Examples of identifiable scanning include:
- Supporting reverse DNS lookups to one of your organization’s public DNS zones for scanning IPs.
- Publishing scanning IP ranges, organized by types of requests (such as service existence, vulnerability checks).
- For HTTP scanning, including meaningful content in user agent strings (such as names from your public DNS zones or a URL for opt-out).
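
From the system owner's side, the reverse DNS example above could be checked with a forward-confirmed reverse DNS lookup. The following is a minimal sketch and not part of the AWS guidance: the function names, the `published_zones` parameter, and the injectable `reverse` and `forward` resolvers are assumptions made for illustration.

```python
import ipaddress
import socket
from typing import Callable, Iterable, List


def reverse_lookup(ip: str) -> str:
    """PTR lookup: return the primary hostname registered for the address."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> List[str]:
    """Forward lookup: return every IPv4/IPv6 address the hostname resolves to."""
    return sorted({info[4][0] for info in socket.getaddrinfo(hostname, None)})


def in_zone(hostname: str, zone: str) -> bool:
    """True if hostname equals the zone or is a subdomain of it."""
    zone = zone.rstrip(".").lower()
    return hostname == zone or hostname.endswith("." + zone)


def is_identified_scanner(
    ip: str,
    published_zones: Iterable[str],
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> bool:
    """Forward-confirmed reverse DNS check for a source address."""
    try:
        source = ipaddress.ip_address(ip)
        hostname = reverse(ip).rstrip(".").lower()
        if not any(in_zone(hostname, zone) for zone in published_zones):
            return False
        # A PTR record alone can be set by whoever controls the address block,
        # so confirm that the claimed hostname resolves back to the same address.
        return any(ipaddress.ip_address(addr) == source for addr in forward(hostname))
    except (OSError, ValueError):
        # Lookup failures and unparsable addresses are treated as "not identified".
        return False
```

A call such as `is_identified_scanner("192.0.2.10", ["scanner.example.com"])` uses live DNS. Passing stub resolvers keeps the check testable offline.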

Cooperative

Limit scan rates to minimize impact on target systems.
Provide an opt-out mechanism for verified resource owners to request cessation of scanning activity.
Honor opt-out requests within a reasonable response period.
Examples of cooperative scanning include:
- Limit scanning to one service transaction per second per destination service.
- Respect site settings as expressed in robots.txt and security.txt and other such industry standards for expressing site owner intent.
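
The one-transaction-per-second example above can be sketched as a small per-destination throttle. This is an illustration only, not an AWS-provided mechanism: the class name `PerServiceThrottle`, the `(host, port)` key used to identify a destination service, and the injectable clock are assumptions chosen for the example.

```python
import time
from typing import Callable, Dict, Tuple

# Hypothetical key: a (host, port) pair identifies one destination service.
Service = Tuple[str, int]


class PerServiceThrottle:
    """Space transactions to each destination service at least `min_interval` seconds apart."""

    def __init__(self, min_interval: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.min_interval = min_interval
        self.clock = clock
        self._next_allowed: Dict[Service, float] = {}

    def reserve(self, service: Service) -> float:
        """Reserve the next slot for `service` and return the delay in seconds before it."""
        now = self.clock()
        start = max(now, self._next_allowed.get(service, now))
        # The following transaction to the same service may start one interval later.
        self._next_allowed[service] = start + self.min_interval
        return start - now
```

A scanner loop would call `time.sleep(throttle.reserve((host, port)))` before each transaction. Different services are tracked independently, so slowing down for one destination does not delay the others.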

Confidential

Maintain secure infrastructure and data handling practices as reflected by industry-standard certifications such as SOC2.
Ensure no unauthenticated or unauthorized access to collected scan data.
Implement user identification and verification processes.

See the full guidance on AWS.

What’s next?

As more network scanners follow this guidance, system owners will benefit from reduced risk to their confidentiality, integrity, and availability. Legitimate network scanners will send a clear signal of their intention and improve the quality of their visibility. With the constantly changing state of networking, we expect that this guidance will evolve along with technical controls over time. We look forward to input from customers, system owners, network scanners, and others to continue improving security posture across AWS and the internet.

If you have feedback about this post, submit comments in the Comments section below or contact AWS Support.

The Future of AWS CodeCommit

2025-11-24 Anthony Hayes

Post Syndicated from Anthony Hayes original https://aws.amazon.com/blogs/devops/aws-codecommit-returns-to-general-availability/

Back in July 2024, we announced plans to de-emphasize AWS CodeCommit based on adoption patterns and our assessment of customer needs. We never stopped looking at the data or listening to you, and what you’ve shown us is clear: you need an AWS-managed solution for your code repositories. Based on this feedback, CodeCommit is returning to full General Availability, effective immediately.

We Listened, and We Heard You

After the de-emphasis announcement last year, we heard from many of you. Your feedback was direct and revealing. You told us that CodeCommit isn’t just another code repository for you—it’s a critical piece of your infrastructure. Its deep IAM integration, VPC endpoint support, CloudTrail logging, and seamless connectivity with CodePipeline and CodeBuild provide value that’s difficult to replicate with third-party solutions, especially for teams operating in regulated industries or those who want all their development infrastructure within AWS boundaries. In short, we learned that CodeCommit is essential for many of you, so we’re bringing it back.

We acknowledge the uncertainty the de-emphasis has caused. If you invested time and resources planning or executing a migration away from CodeCommit, we apologize. We’ve learned from this, and we’re committed to doing better.

What’s Changing Today

Here’s what you need to know:

CodeCommit is open to new customers again – New customer sign-ups are open as of today. If you’ve been waiting to onboard new accounts or create repositories, you can do so right now through the AWS Console, CLI, or APIs.

For current and former customers – If you already migrated away, we understand you may have completed your transition to GitHub, GitLab, Bitbucket, or another provider. Those are excellent platforms, and we fully support your decision to use them. If you’re interested in returning to CodeCommit, our support team and account teams are available to help.

If you’re mid-migration, you can pause or reverse your plans. Contact AWS Support or your account team to discuss your specific situation and determine the best path forward.

If you stayed with CodeCommit, thank you for your patience during this period. We’re working through the backlog of feature requests and support tickets that accumulated, prioritizing by customer need. Continue to tell us how we can improve the service and support your workflows (human, machine, and agentic) moving forward.

What’s Coming Next

We’re not just maintaining CodeCommit—we’re investing in it. Here’s what’s on the roadmap:

Git LFS Support (Q1 2026) – This has been your most requested feature. Git Large File Storage will enable you to efficiently manage large binary files like images, videos, design assets, and compiled binaries without bloating your repositories. You’ll get faster clones, better performance, and cleaner version history for large assets.

Regional Expansions (Starting Q3 2026) – CodeCommit will expand to the additional AWS Regions eu-south-2 and ca-west-1, bringing the service closer to where you’re building and deploying your applications.

We’ll share more details about these features and additional roadmap items in the coming months. Keep an eye on our What’s New feed for the latest AWS launches.

Pricing, SLA, and Getting Started

Pricing remains unchanged—you can review the current structure on the CodeCommit pricing page. We continue to maintain our 99.9% uptime SLA as defined in our service terms.

If you’re new to CodeCommit or returning after a migration, check out our Getting Started Guide for step-by-step instructions. For migration assistance or questions about your specific setup, contact AWS Support or your account team.

Available Now

AWS CodeCommit is available now in 29 AWS Regions. New customers can begin creating repositories immediately. Visit the CodeCommit console to get started.

Thank you for your feedback, your patience, and your continued trust in AWS. We’re committed to making CodeCommit the best integrated Git repository service for AWS development.

Enhancing API security with Amazon API Gateway TLS security policies

2025-11-21 Anton Aleksandrov

Post Syndicated from Anton Aleksandrov original https://aws.amazon.com/blogs/compute/enhancing-api-security-with-amazon-api-gateway-tls-security-policies/

As compliance frameworks evolve and cryptographic standards advance, organizations are looking for additional controls to improve their cloud security posture. One of the necessary controls is a more granular TLS configuration, for example when regulatory requirements mandate disabling older ciphers like CBC or enforcing TLS 1.3 as a minimum version.

In this post, you will learn how the new enhanced TLS security policies in Amazon API Gateway help you meet standards such as PCI DSS, Open Banking, and FIPS, while strengthening how your APIs handle TLS negotiation. This new capability improves your security posture without adding operational complexity and gives you a single, consistent way to standardize TLS configuration across your API Gateway infrastructure.

Overview

Previously, API Gateway offered limited control over TLS configuration, and only for custom domain names. Default endpoints used fixed security policies, which meant you often had to introduce additional infrastructure, such as custom Amazon CloudFront distributions, to meet your organization’s security or compliance requirements.

With this launch, you can configure TLS behavior directly on all REST API endpoint types, including Regional, edge-optimized, and private, and apply consistent TLS settings across both your APIs and their custom domain names. You can choose from predefined enhanced security policies to enforce the minimum TLS versions and cipher suites that your workloads require. For example, you can enforce TLS 1.3, use hardened TLS 1.2 without CBC ciphers, adopt FIPS-aligned suites for government workloads, or prepare for the future with policies that include post-quantum cryptography (PQC). The new security policies provide finer-grained control without adding operational complexity, helping you align your APIs with evolving security and compliance expectations.

Understanding API Gateway security policies

A security policy in API Gateway is a predefined combination of a minimum TLS version and a curated set of cipher suites. When a client connects to your REST API or custom domain name, API Gateway uses the selected policy to determine which protocol versions and ciphers it will accept during the TLS handshake. This gives you a predictable and enforceable way to control how clients establish encrypted connections to your APIs.

API Gateway supports two categories of security policies. Legacy policies, such as TLS_1_0 or TLS_1_2, remain available for backwards compatibility. Enhanced policies, identified by the SecurityPolicy_* prefix, provide stricter and more modern controls for regulated workloads, advanced governance, or cryptographic hardening. When you use an enhanced policy, you must also specify an endpoint access mode, which adds additional validation for how traffic reaches your API, as described in the following sections.

Enhanced policies follow a consistent naming pattern that helps you quickly understand what each policy enforces. For example, for REGIONAL and PRIVATE endpoint types, the following pattern applies:

SecurityPolicy_[TLS-Versions]_[Variant]_[YYYY-MM]

From this structure, you can identify the supported TLS versions, any specialized cryptographic variants (such as FIPS, PFS, or PQ), and the release date of the policy. For example, SecurityPolicy_TLS13_1_3_2025_09 accepts only TLS 1.3 traffic, while SecurityPolicy_TLS13_1_2_PFS_PQ_2025_09 accepts TLS 1.2 as the lowest version and TLS 1.3 as the highest, with forward secrecy and post-quantum enhancements.

Each policy maps to a curated combination of ciphers. For instance, SecurityPolicy_TLS13_1_3_2025_09 accepts only three TLS 1.3 cipher suites (TLS_AES_128_GCM_SHA256, TLS_AES_256_GCM_SHA384, and TLS_CHACHA20_POLY1305_SHA256) and rejects any other protocol versions or ciphers. For a full list of supported policies and ciphers, and the naming pattern for the EDGE endpoint type, see the API Gateway documentation.
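
To make the naming pattern concrete, the following sketch splits an enhanced policy name into its parts. It is inferred only from the two example names in this post and covers the REGIONAL and PRIVATE pattern. The regular expression, the `ParsedPolicy` type, and the reading of `TLS13` as the highest version are assumptions, so treat the API Gateway documentation as the source of truth.

```python
import re
from dataclasses import dataclass
from typing import Tuple

# Assumed layout: SecurityPolicy_TLS<max>_<min>[_<VARIANT>...]_<YYYY>_<MM>
_POLICY_RE = re.compile(
    r"^SecurityPolicy_"
    r"TLS(?P<max>\d{2})_"                       # highest version, e.g. TLS13 -> 1.3
    r"(?P<min>\d_\d)"                           # lowest version, e.g. 1_2 -> 1.2
    r"(?:_(?P<variant>[A-Z]+(?:_[A-Z]+)*))?"    # optional variants, e.g. PFS_PQ
    r"_(?P<year>\d{4})_(?P<month>\d{2})$"       # release date
)


@dataclass(frozen=True)
class ParsedPolicy:
    min_tls: str
    max_tls: str
    variants: Tuple[str, ...]
    released: str  # formatted as YYYY-MM


def parse_policy_name(name: str) -> ParsedPolicy:
    """Parse an enhanced policy name; raise ValueError for anything else."""
    match = _POLICY_RE.match(name)
    if match is None:
        raise ValueError(f"not a recognized enhanced policy name: {name!r}")
    max_digits = match["max"]
    variant = match["variant"]
    return ParsedPolicy(
        min_tls=match["min"].replace("_", "."),
        max_tls=f"{max_digits[0]}.{max_digits[1]}",
        variants=tuple(variant.split("_")) if variant else (),
        released=f"{match['year']}-{match['month']}",
    )
```

With this sketch, `parse_policy_name("SecurityPolicy_TLS13_1_2_PFS_PQ_2025_09")` yields a minimum of 1.2, a maximum of 1.3, the variants PFS and PQ, and a release date of 2025-09. Legacy names such as TLS_1_2 are rejected with a `ValueError`.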

How security policies apply to default endpoints and custom domains

You can use API Gateway to attach different security policies to your default API endpoint and custom domain names. During TLS negotiation, API Gateway selects the policy based on the Server Name Indication (SNI) value in the client’s TLS handshake, not the HTTP Host header. This means the policy depends on the hostname the client uses when initiating TLS.

For example, if a client connects directly to your default endpoint, such as:

https://abcdef1234.execute-api.us-east-1.amazonaws.com

API Gateway uses the policy attached to that default endpoint because the SNI value matches its hostname.

If the client instead connects through a custom domain name, such as:

https://api.example.com

API Gateway uses the policy attached to that custom domain. In this case, the SNI value api.example.com determines which policy is enforced.

This distinction is important even if you disable your default endpoint. TLS negotiation always occurs before API Gateway evaluates endpoint settings, so the default endpoint security policy still applies to clients that connect directly to its hostname. To avoid unexpected client behavior, you should keep the API and its custom domain name aligned with the same security policy whenever possible.
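
The selection rule described in this section can be modeled as a lookup keyed by the SNI hostname. The sketch below is a mental model only, not how API Gateway is implemented. The mapping, the function name, and the choice to attach two different policies (to make the effect visible) are assumptions.

```python
from typing import Dict, Optional

# Hypothetical configuration: each TLS hostname and the policy attached to it.
POLICY_BY_HOSTNAME: Dict[str, str] = {
    "abcdef1234.execute-api.us-east-1.amazonaws.com": "SecurityPolicy_TLS13_1_2_PFS_PQ_2025_09",
    "api.example.com": "SecurityPolicy_TLS13_1_3_2025_09",
}


def policy_for_handshake(sni: str, host_header: Optional[str] = None) -> Optional[str]:
    """Return the policy selected for a handshake.

    The lookup uses only the SNI value. `host_header` is accepted purely to
    show that it plays no part, because TLS negotiation happens before any
    HTTP header is read.
    """
    return POLICY_BY_HOSTNAME.get(sni.rstrip(".").lower())
```

A client that opens TLS to the default endpoint hostname but sends `Host: api.example.com` is still negotiated under the default endpoint's policy, which is why keeping both policies aligned avoids surprises.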

Understanding endpoint access mode

When you use an enhanced security policy (SecurityPolicy_*), you must also specify an endpoint access mode. Endpoint access mode defines how strictly API Gateway validates the network path a request takes before it reaches your API. This gives you an additional layer of governance and helps you prevent unauthorized or misrouted traffic.

You can choose between two modes:

BASIC mode provides standard API Gateway behavior. It is the recommended starting point when you migrate an existing API to an enhanced security policy. Clients can continue reaching your API as they do today, without additional validation.
STRICT mode adds enforcement checks to ensure that requests originate from the correct endpoint type and that TLS negotiation aligns with your configuration.

When you enable STRICT mode, API Gateway performs additional validations, such as:

The SNI and HTTP Host header values match
The request originates from the same endpoint type as your API (Regional, edge-optimized, or private)

If any of these validations fail, API Gateway rejects the request. STRICT is a viable choice when you need stronger security guarantees, such as when running regulated or sensitive workloads. See API Gateway documentation for additional details.
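
Before enabling STRICT mode, you might want to check recorded client requests against the two validations listed above. The following sketch shows one way to express those checks. It does not reproduce the validation API Gateway performs, and the `ObservedRequest` type, its `arrived_via` label, and the simplified hostname normalization are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ObservedRequest:
    sni: str          # hostname from the TLS ClientHello
    host_header: str  # value of the HTTP Host header
    arrived_via: str  # hypothetical label: "REGIONAL", "EDGE", or "PRIVATE"


def _hostname(value: str) -> str:
    """Lowercase a hostname and drop any :port suffix or trailing dot (simplified)."""
    return value.strip().lower().split(":")[0].rstrip(".")


def strict_mode_findings(request: ObservedRequest, api_endpoint_type: str) -> List[str]:
    """Return the reasons a request would likely fail the checks listed in this post."""
    findings: List[str] = []
    if _hostname(request.sni) != _hostname(request.host_header):
        findings.append("SNI and Host header differ")
    if request.arrived_via.upper() != api_endpoint_type.upper():
        findings.append("request arrived through a different endpoint type")
    return findings
```

An empty list means the request passes both checks in this sketch. Any finding points to a client or routing path worth investigating while the API is still in BASIC mode.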

When you switch from BASIC to STRICT mode, it takes up to 15 minutes for the change to fully propagate. Your API remains available during this period. If your endpoint access mode is set to STRICT, you cannot change the endpoint type until you revert the mode back to BASIC.

Applying security policies to new and existing APIs

You can apply a security policy when you create a new REST API or custom domain name, or update an existing resource to use one of the enhanced SecurityPolicy_* options. When migrating existing APIs, the recommended approach is to start with BASIC mode, validate client behavior (confirm that the SNI and HTTP Host header values match and that requests originate from the same endpoint type as your API), and then move to STRICT mode once you confirm compatibility.

The following code snippets illustrate how to apply security policies to different scenarios:

Create a REST API with a security policy and STRICT endpoint access mode

You can attach a security policy directly during API creation, removing the need for extra infrastructure just to control TLS negotiation.

aws apigateway create-rest-api \
  --name "your-private-api-name" \
  --endpoint-configuration '{"types":["PRIVATE"]}' \
  --security-policy "SecurityPolicy_TLS13_1_3_2025_09" \
  --endpoint-access-mode STRICT \
  --policy file://api-policy.json

Create a custom domain name with a security policy and STRICT endpoint access mode

You can also specify the security policy when creating a custom domain name. API Gateway applies the selected policy during TLS negotiation based on the SNI value the client provides.

aws apigateway create-domain-name \
  --domain-name api.example.com \
  --regional-certificate-arn arn:aws:acm:region:account-id:certificate/certificate-id \
  --endpoint-configuration '{"types":["REGIONAL"]}' \
  --security-policy SecurityPolicy_TLS13_1_3_2025_09 \
  --endpoint-access-mode STRICT

Updating existing REST API

If you are migrating an existing API, start by applying the enhanced security policy with BASIC mode. After confirming that your clients can connect with BASIC mode as expected, proceed to enable the STRICT mode.

1. Apply the new policy with BASIC mode

aws apigateway update-rest-api --rest-api-id abcd123 --patch-operations '[
    {
        "op": "replace",
        "path": "/securityPolicy",
        "value": "SecurityPolicy_TLS13_1_3_2025_09"
    },
    {
        "op": "replace",
        "path": "/endpointAccessMode",
        "value": "BASIC"
    }
]'

Verify your clients can consume the API as expected using access logs and performance metrics in Amazon CloudWatch.

2. Enable the STRICT mode after validation

aws apigateway update-rest-api --rest-api-id abcd123 --patch-operations '[
    {
        "op": "replace",
        "path": "/endpointAccessMode",
        "value": "STRICT"
    }
]'

Updating existing custom domain name

Custom domain names follow the same migration approach as REST APIs.

1. Apply the new policy with BASIC mode and validate clients can successfully connect.

aws apigateway update-domain-name --domain-name api.example.com --patch-operations '[
    {
        "op": "replace",
        "path": "/securityPolicy",
        "value": "SecurityPolicy_TLS13_1_3_2025_09"
    },
    {
        "op": "replace",
        "path": "/endpointAccessMode",
        "value": "BASIC"
    }
]'

2. Enable the STRICT mode after validation

aws apigateway update-domain-name --domain-name api.example.com --patch-operations '[
    {
        "op": "replace",
        "path": "/endpointAccessMode",
        "value": "STRICT"
    }
]'

After you update your REST API or custom domain configuration, redeploy your API so that stages receive the new settings. When you change a security policy, the update takes up to 15 minutes to complete. The API status appears as UPDATING while the change propagates and returns to AVAILABLE when complete. Your API remains fully functional throughout this process.
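As a minimal sketch, the wait for a custom domain name to return to AVAILABLE can be scripted by polling get-domain-name. The function name wait_for_domain_available, the 30-second interval, and the 30-attempt ceiling (chosen to match the 15-minute upper bound mentioned above) are illustrative assumptions. The sketch reads the status from the domainNameStatus field of the response.

```shell
# Hypothetical helper: poll a custom domain name until its status is AVAILABLE.
# Usage: wait_for_domain_available <domain-name> [interval-seconds] [max-attempts]
wait_for_domain_available() {
    domain="$1"
    interval="${2:-30}"      # seconds between polls (assumed default)
    max_attempts="${3:-30}"  # 30 polls x 30 s = 15 minutes, the bound cited above
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        status=$(aws apigateway get-domain-name \
            --domain-name "$domain" \
            --query 'domainNameStatus' \
            --output text)
        echo "attempt $attempt: $status"
        if [ "$status" = "AVAILABLE" ]; then
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "$interval"
    done
    echo "timed out waiting for $domain" >&2
    return 1
}

# Example (requires AWS credentials):
#   wait_for_domain_available api.example.com
```

The function returns a non-zero status on timeout, so a deployment script can stop before switching to STRICT mode if the update has not finished.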

Rolling back endpoint access mode

If you notice clients failing to connect to your API after applying STRICT mode, you can revert the endpoint access mode to BASIC at any time. The following snippet shows how to do this for a REST API.

aws apigateway update-rest-api --rest-api-id abcd123 --patch-operations '[
    {
        "op": "replace",
        "path": "/endpointAccessMode",
        "value": "BASIC"
    }
]'

You can use the same approach to update a custom domain name.
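For a custom domain name, the rollback might look like the sketch below, which mirrors the update-domain-name commands shown earlier. The helper names endpoint_access_mode_patch and rollback_domain_to_basic are hypothetical. They only wrap the same patch document so that the mode cannot be mistyped.

```shell
# Hypothetical helper: print a JSON Patch document that sets the endpoint access mode.
# Usage: endpoint_access_mode_patch BASIC|STRICT
endpoint_access_mode_patch() {
    case "$1" in
        BASIC|STRICT) ;;
        *) echo "mode must be BASIC or STRICT" >&2; return 1 ;;
    esac
    printf '[{"op":"replace","path":"/endpointAccessMode","value":"%s"}]\n' "$1"
}

# Hypothetical helper: roll a custom domain name back to BASIC mode.
# Usage: rollback_domain_to_basic <domain-name>
rollback_domain_to_basic() {
    aws apigateway update-domain-name \
        --domain-name "$1" \
        --patch-operations "$(endpoint_access_mode_patch BASIC)"
}

# Example (requires AWS credentials):
#   rollback_domain_to_basic api.example.com
```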

Monitoring TLS usage and policy migrations

As you adopt enhanced security policies, it is important to understand how clients negotiate encrypted connections with your API. Monitoring helps you verify client readiness, identify legacy consumers that may require updates, and validate that STRICT endpoint access mode behaves as expected during rollout. Use the following API Gateway access log variables to monitor protocol and cipher usage over time.

$context.tlsVersion – the negotiated TLS version
$context.cipherSuite – the cipher suite selected during the handshake

You can use these variables to confirm that:

Clients are using the expected minimum TLS version
CBC-based ciphers are no longer used after you move to a hardened policy
PQC and FIPS-aligned policies are being exercised by the appropriate clients

Access logs are especially useful during migrations, where validating the actual client behavior is a prerequisite before enabling STRICT mode. For example, if you still observe live clients negotiating TLS 1.0 or TLS 1.2 CBC ciphers after applying a hardened policy in BASIC mode, you can identify the affected clients and plan remediation before switching to STRICT mode.
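The check described above can be sketched as a small tally over exported access log lines. This assumes a JSON access log format in which $context.tlsVersion is written to a "tlsVersion" key and $context.cipherSuite to a "cipherSuite" key, in that order and without spaces after the colons. The key names, the helper name tally_tls_usage, and the sample log lines are all made up for illustration.

```shell
# Hypothetical helper: count requests per negotiated TLS version and cipher suite.
# Reads JSON access log lines on stdin. Assumes "tlsVersion" appears before
# "cipherSuite" on each line and that values are plain quoted strings.
tally_tls_usage() {
    sed -n 's/.*"tlsVersion":"\([^"]*\)".*"cipherSuite":"\([^"]*\)".*/\1 \2/p' \
        | sort | uniq -c | sort -rn
}

# Sample log lines (illustrative values only).
tally_tls_usage <<'EOF'
{"requestId":"r1","tlsVersion":"TLSv1.3","cipherSuite":"TLS_AES_128_GCM_SHA256"}
{"requestId":"r2","tlsVersion":"TLSv1.3","cipherSuite":"TLS_AES_128_GCM_SHA256"}
{"requestId":"r3","tlsVersion":"TLSv1.2","cipherSuite":"ECDHE-RSA-AES128-SHA256"}
EOF
```

With the sample input, the tally lists the TLS 1.3 suite with a count of 2 ahead of the TLS 1.2 CBC suite with a count of 1. Any row that shows a protocol or suite outside the hardened policy points to a client that needs remediation before you enable STRICT mode.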

Future-proof security configurations

Some of the new policies combine TLS 1.3 with post-quantum cryptography (PQC) to help you prepare for a future where quantum-capable threat actors exist. With these policies you can start testing and adopting quantum-resistant algorithms without redesigning your API architecture.

As standards evolve and new cipher suites are introduced, API Gateway’s policy model provides you with a clear path for adding new variants while keeping your configuration simple and predictable.

Conclusion and next steps

Enhanced TLS security policies and endpoint access mode in Amazon API Gateway give you direct control over how clients establish secure connections to your APIs. You can choose the policies that match your compliance needs, such as PCI DSS, FIPS, Open Banking, and PQC, and use STRICT mode to control how traffic reaches your endpoints and apply additional domain-level validations, further hardening the security of your APIs.

To get started:

Review the list of available security policies in the API Gateway documentation.
Identify which REST APIs and domains require stronger TLS controls.
Apply an appropriate SecurityPolicy_* policy with BASIC mode.
Validate client behavior using access logs and CloudWatch metrics.
Move to STRICT mode when you are ready to enforce additional connection-level protection.

For more information about building serverless architectures, see ServerlessLand.com.

Build scalable REST APIs using Amazon API Gateway private integration with Application Load Balancer

2025-11-21 Christian Silva

Post Syndicated from Christian Silva original https://aws.amazon.com/blogs/compute/build-scalable-rest-apis-using-amazon-api-gateway-private-integration-with-application-load-balancer/

This post is written by Vijay Menon, Principal Solutions Architect, and Christian Silva, Senior Solutions Architect.

Today, we announced Amazon API Gateway REST API’s support for private integration with Application Load Balancers (ALBs). You can use this new capability to securely expose your VPC-based applications through your REST APIs without exposing your ALBs to the public internet.

Prior to this launch, if you wanted to connect API Gateway to private ALBs, you would have had to use a Network Load Balancer (NLB) as an intermediary, increasing cost and complexity. Now, you can directly integrate API Gateway with private ALBs without requiring an NLB, reducing operational overhead and optimizing cost.

Previous architecture: Connecting API Gateway to private ALBs

Before this launch, API Gateway REST APIs connected to private ALB resources through an NLB positioned in front of the ALB. Many customers have successfully built and operated production workloads using this architecture, demonstrating its reliability for business-critical applications. The following architecture demonstrates this setup.

Figure 1. Previous architecture: API Gateway to private ALB via intermediary NLB

In response to customer feedback for a simplified architecture and reduced costs, we’ve extended VPC link v2 support to REST APIs. This feature now enables direct private ALB integration for REST APIs, eliminating the need for an intermediary NLB.

New architecture: Connecting API Gateway to private ALBs

With direct private ALB integration, this architecture becomes simpler and more efficient. The integration removes the need for an intermediate NLB, reducing the number of hops between client and your services. This streamlined setup simplifies the architecture for applications, allowing more efficient use of ALB’s layer-7 load-balancing capabilities, authentication, and authorization features. While these ALB features were technically accessible before, the new architecture removes the overhead and complexity of managing an additional NLB. Here’s how the simplified architecture looks now:

Figure 2. Direct integration between API Gateway and private ALB

Benefits of a direct integration between your API Gateway endpoint and your private ALB

Architectural simplification and operational excellence: Now that your API Gateway can directly connect to your private ALB, you no longer need an NLB to act as a bridge between your API Gateway and your private ALB. This eliminates the need to provision, configure, manage, or monitor an intermediate load balancer. The reduction in infrastructure components translates to reduced operational overhead and fewer potential failure points. Traffic flows directly from API Gateway to your ALB within the Amazon Web Services (AWS) network, reducing network hops and latency.
Improved scalability: VPC link v2 supports a one-to-many relationship with load balancers. A single VPC link v2 allows API Gateway to integrate with multiple ALBs or NLBs within your VPC. This architectural advantage is particularly valuable for organizations managing complex applications with multiple microservices, each potentially behind its own ALB, or those running numerous APIs. The ability to consolidate multiple load balancer connections through a single VPC link not only reduces administrative overhead but also provides greater flexibility in scaling your architecture. As your application grows and you add more services or load balancers, you won’t need to provision additional VPC links, making it easier to expand your infrastructure while maintaining operational efficiency.
Cost optimization: You can remove the NLB from your architecture and thereby eliminate both the hourly charges for running the NLB and the associated Network Load Balancer Capacity Units (NLCU) used per hour. For organizations running multiple environments or numerous APIs, these savings can accumulate to thousands of dollars annually. Moreover, your data transfer patterns become more efficient. Traffic flows directly from API Gateway to your ALB within the AWS network, which avoids any unnecessary hops that could incur more data transfer charges. This streamlined path not only reduces costs but also improves performance by minimizing network latency.

Getting started

This tutorial demonstrates the setup using both the AWS Management Console and AWS Command Line Interface (AWS CLI). Before you begin, make sure that you have an internal ALB configured in your VPC. For resources that need naming, use appropriate names for your environment.

Step 1: Create a VPC link v2
The first step in our process is to create a VPC link v2, which will enable API Gateway to route traffic to your internal ALB. Here’s how to set it up:

Navigate to the API Gateway console.
In the left navigation pane, choose VPC links.
Choose Create VPC link.
Choose VPC link v2 as the VPC link type.
Provide a descriptive name for your VPC link.
Choose your VPC and subnets where your ALB resides. For high availability, choose subnets in multiple AWS Availability Zones (AZs) that match your ALB configuration.
Assign one or more security groups to your VPC link. These security groups will control the traffic flow between API Gateway and your VPC.
Choose Create and wait for the VPC link status to become Available. This process can take a few minutes.

Alternatively, you can create a VPC link v2 using the AWS CLI:

# Create VPC link v2
aws apigatewayv2 create-vpc-link \
    --name "test-vpc-link-v2" \
    --subnet-ids "<your-subnet1-id>" "<your-subnet2-id>" \
    --security-group-ids "<your-security-group-id>" \
    --region <your-AWS-region>

# Check VPC link v2 status
aws apigatewayv2 get-vpc-link \
    --vpc-link-id "<your-vpc-link-v2-id>" \
    --region <your-AWS-region>
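When scripting this step, it can help to capture the new VPC link ID and test for the Available state instead of reading the full JSON output. The sketch below assumes the response fields VpcLinkId and VpcLinkStatus. The helper names create_vpc_link and vpc_link_is_available are hypothetical.

```shell
# Hypothetical helper: create a VPC link v2 and print only its ID.
# Usage: create_vpc_link <name> <region> <security-group-id> <subnet-id>...
create_vpc_link() {
    name="$1"; region="$2"; sg="$3"
    shift 3
    aws apigatewayv2 create-vpc-link \
        --name "$name" \
        --subnet-ids "$@" \
        --security-group-ids "$sg" \
        --region "$region" \
        --query 'VpcLinkId' \
        --output text
}

# Hypothetical helper: succeed only when the VPC link reports AVAILABLE.
# Usage: vpc_link_is_available <vpc-link-id> <region>
vpc_link_is_available() {
    status=$(aws apigatewayv2 get-vpc-link \
        --vpc-link-id "$1" \
        --region "$2" \
        --query 'VpcLinkStatus' \
        --output text)
    [ "$status" = "AVAILABLE" ]
}

# Example (requires AWS credentials; IDs are placeholders):
#   id=$(create_vpc_link test-vpc-link-v2 us-east-1 sg-0123 subnet-a subnet-b)
#   until vpc_link_is_available "$id" us-east-1; do sleep 10; done
```

The until loop in the example never exits if the link ends in a failed state, so a production script would also want a timeout or a check for a failure status.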

Step 2: Create a REST API and configure integration
With your VPC link v2 now available, the next step is to create a REST API and configure it to use the VPC Link. This process involves creating the API, setting up resources and methods, and configuring the integration with your internal ALB.

In the API Gateway console, choose Create API.
Choose REST API.
Enter an API name and choose Create API.
Create a new resource by choosing Actions, then choose Create resource. This resource will represent the endpoint for your API.
Create a method by choosing Actions, then choose Create method. The method defines the type of request your API will accept (GET, POST, etc.).
Now, configure the integration. This is where you’ll connect your API to your internal ALB via the VPC link v2:
1. Choose VPC link as the integration type.
2. Choose the HTTP method for your backend integration.
3. Choose your newly created VPC link v2.
4. Specify your ALB as the Integration target.
5. Enter the endpoint URL for your integration. The port specified in the URL is used to route requests to the backend.
6. Set the Integration timeout.

Using the AWS CLI:

# Create REST API
aws apigateway create-rest-api \
    --name "test-rest-api" \
    --description "REST API integration with internal ALB via VPC link v2" \
    --region <your-AWS-region>

# Get REST API’s root resource ID
aws apigateway get-resources \
    --rest-api-id "<your-rest-api-id>" \
    --region <your-AWS-region>

# Create a new resource
aws apigateway create-resource \
    --rest-api-id "<your-rest-api-id>" \
    --parent-id "<your-parent-id>" \
    --path-part "internal-alb" \
    --region <your-AWS-region>

# Create a new method
aws apigateway put-method \
    --rest-api-id "<your-rest-api-id>" \
    --resource-id "<your-resource-id>" \
    --http-method ANY \
    --authorization-type NONE \
    --region <your-AWS-region>

# Create the integration
aws apigateway put-integration \
    --rest-api-id "<your-rest-api-id>" \
    --resource-id "<your-resource-id>" \
    --http-method ANY \
    --type HTTP_PROXY \
    --integration-http-method ANY \
    --uri "http://test-internal-alb.com/test" \
    --connection-type VPC_LINK \
    --connection-id "<your-vpc-link-v2-id>" \
    --integration-target "<your-ALB-arn>" \
    --region <your-AWS-region>
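The commands above pass IDs from one step to the next. One way to avoid copying them by hand is to extract each ID with the AWS CLI --query option, as in this sketch. The helper names root_resource_id and create_child_resource are hypothetical, and the sketch assumes the root resource is the item whose path is "/".

```shell
# Hypothetical helper: print the ID of the root ("/") resource of a REST API.
# Usage: root_resource_id <rest-api-id> <region>
root_resource_id() {
    aws apigateway get-resources \
        --rest-api-id "$1" \
        --region "$2" \
        --query "items[?path=='/'].id" \
        --output text
}

# Hypothetical helper: create a child resource and print its ID.
# Usage: create_child_resource <rest-api-id> <parent-id> <path-part> <region>
create_child_resource() {
    aws apigateway create-resource \
        --rest-api-id "$1" \
        --parent-id "$2" \
        --path-part "$3" \
        --region "$4" \
        --query 'id' \
        --output text
}

# Example (requires AWS credentials; the API ID is a placeholder):
#   parent=$(root_resource_id abcd123 us-east-1)
#   resource=$(create_child_resource abcd123 "$parent" internal-alb us-east-1)
```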

Step 3: Deploy and test
With your API configured, it’s time to deploy it and verify that it’s working correctly.

Choose Deploy API to create a new deployment of your API.
Create a new stage (for example “test”). Stages allow you to manage multiple versions of your API.
After deployment, you’ll receive an API endpoint URL. Copy this URL as you’ll need it for testing.

Test your API using your preferred API client or a simple curl command.

Using the AWS CLI:

# Create a new deployment to a test stage
aws apigateway create-deployment \
    --rest-api-id "<your-rest-api-id>" \
    --stage-name "test" \
    --region <your-AWS-region>

Test your API integration using a curl command:

curl https://<rest-api-id>.execute-api.<your-aws-region>.amazonaws.com/test/internal-alb

Example response:

{"message": "Hello from internal ALB"}
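The default invoke URL for a REST API places the stage name between the host and the resource path, so a request to the "test" stage created above includes /test/ in the URL. A minimal sketch for building the URL and checking the HTTP status follows. The helper names invoke_url and check_endpoint are hypothetical.

```shell
# Hypothetical helper: build the default invoke URL for a REST API stage.
# Usage: invoke_url <rest-api-id> <region> <stage> <resource-path>
invoke_url() {
    # ${4#/} drops one leading slash so "/internal-alb" and "internal-alb" both work.
    printf 'https://%s.execute-api.%s.amazonaws.com/%s/%s\n' "$1" "$2" "$3" "${4#/}"
}

# Hypothetical helper: call the endpoint and print only the HTTP status code.
# Usage: check_endpoint <rest-api-id> <region> <stage> <resource-path>
check_endpoint() {
    curl -s -o /dev/null -w '%{http_code}\n' "$(invoke_url "$@")"
}

# Example (the API ID is a placeholder):
#   check_endpoint abcd123 us-east-1 test internal-alb
```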

Step 4: Scale your VPC link v2
A single VPC link can now connect to multiple ALBs or NLBs within your VPC, simplifying infrastructure management. This AWS CLI snippet demonstrates API Gateway integrating with multiple internal services, for example orders and payments services, each behind its own ALB, using a single VPC link v2. Note how the same VPC link ID is used across both integrations.

# Orders service integration (ALB-1)
aws apigateway put-integration \
    --rest-api-id "<your-rest-api-id>" \
    --resource-id "<orders-resource-id>" \
    --http-method ANY \
    --type HTTP_PROXY \
    --integration-http-method ANY \
    --uri "<your-orders-alb-endpoint>" \
    --connection-type VPC_LINK \
    --connection-id "<your-vpc-link-v2-id>" \
    --integration-target "<your-orders-alb-arn>" \
    --region "<your-aws-region>"

# Payments service integration (ALB-2)
aws apigateway put-integration \
    --rest-api-id "<your-rest-api-id>" \
    --resource-id "<payments-resource-id>" \
    --http-method ANY \
    --type HTTP_PROXY \
    --integration-http-method ANY \
    --uri "<your-payments-alb-endpoint>" \
    --connection-type VPC_LINK \
    --connection-id "<your-vpc-link-v2-id>" \
    --integration-target "<your-payments-alb-arn>" \
    --region "<your-aws-region>"

For a detailed, step-by-step guide, please see our official documentation in the API Gateway Developer Guide.

Use cases

Private ALB integration with API Gateway enables architectural patterns that solve enterprise challenges. These are three key scenarios where organizations can use this new capability:

Microservices on Amazon ECS and Amazon EKS: Exposing microservices running on Amazon ECS or Amazon EKS becomes simpler with this integration. It allows secure, path-based routing to different services without exposing your ALB to the public internet or using complex NLB proxy patterns.
Hybrid cloud architectures: Seamless and secure connectivity between cloud-native APIs and on-premises resources is achieved via AWS Direct Connect or AWS Site-to-Site VPN. This setup allows flexible routing based on HTTP methods and headers to various internal systems.
Enterprise modernization: Gradual application modernization is facilitated by enabling phased migration from monolithic architectures to microservices. Organizations can route traffic between legacy and new components while maintaining operational continuity and minimizing risk.

Conclusion

Direct private integration between API Gateway REST APIs and ALBs enhances API architecture on AWS. By simplifying infrastructure and reducing operational overhead, this capability improves performance and efficiency for API-driven applications.

This feature is available today in all AWS Regions where VPC link v2 and ALBs are present. We can’t wait to see what you build with it and how it transforms your API architectures. Get started now by visiting the API Gateway console and creating your first VPC link v2 for direct ALB integration.

For more information, visit the API Gateway product page, review our pricing details, and explore the comprehensive developer documentation to learn about all the powerful features available to help you build world-class APIs on AWS.

Accelerate investigations with AWS Security Incident Response AI-powered capabilities

2025-11-21 Daniel Begimher

Post Syndicated from Daniel Begimher original https://aws.amazon.com/blogs/security/accelerate-investigations-with-aws-security-incident-response-ai-powered-capabilities/

If you’ve ever spent hours manually digging through AWS CloudTrail logs, checking AWS Identity and Access Management (IAM) permissions, and piecing together the timeline of a security event, you understand the time investment required for incident investigation. Today, we’re excited to announce the addition of AI-powered investigation capabilities to AWS Security Incident Response that automate this evidence gathering and analysis work.

AWS Security Incident Response helps you prepare for, respond to, and recover from security events faster and more effectively. The service combines automated security finding monitoring and triage, containment, and now AI-powered investigation capabilities with 24/7 direct access to the AWS Customer Incident Response Team (CIRT).

While investigating a suspicious API call or unusual network activity, scoping and validation require querying multiple data sources, correlating timestamps, identifying related events, and building a complete picture of what happened. Security operations center (SOC) analysts devote a significant amount of time to each investigation, with roughly half of that effort spent manually gathering and piecing together evidence from various tools and complex logs. This manual effort can delay your analysis and response.

AWS is introducing an investigative agent to Security Incident Response, changing this paradigm and adding layers of efficiency. The investigative agent helps you reduce the time required to validate and respond to potential security events. When a case for a security concern is created, either by you or proactively by Security Incident Response, the investigative agent asks clarifying questions to make sure it understands the full context of the potential security event. It then automatically gathers evidence from CloudTrail events, IAM configurations, and Amazon Elastic Compute Cloud (Amazon EC2) instance details and even analyzes cost usage patterns. Within minutes, it correlates the evidence, identifies patterns, and presents you with a clear summary.

How it works in practice

Before diving into an example, let’s paint a clear picture of where the investigative agent lives, how it’s accessed, and its purpose and function. The investigative agent is built directly into Security Incident Response and is automatically available when you create a case. Its purpose is to act as your first responder—gathering evidence, correlating data across AWS services, and building a comprehensive timeline of events so you can quickly move from detection to recovery.

For example: you discover that AWS credentials for an IAM user in your account were exposed in a public GitHub repository. You need to understand what actions were taken with those credentials and properly scope the potential security event, including lateral movement and reconnaissance operations. You need to identify persistence mechanisms that might have been created and determine the appropriate containment steps. To get started, you create a case in the Security Incident Response console and describe the situation.

Here’s where the agent’s approach differs from traditional automation: it asks clarifying questions first. When were the credentials first exposed? What’s the IAM user name? Have you already rotated the credentials? Which AWS account is affected?

This interactive step gathers the appropriate details and metadata before the agent starts collecting evidence. As a result, you’re not stuck with generic results: the investigation is tailored to your specific concern.

After the agent has what it needs, it investigates. It looks up CloudTrail events to see what API calls were made using the compromised credentials, pulls IAM user and role details to check what permissions were granted, identifies new IAM users or roles that were created, checks EC2 instance information if compute resources were launched, and analyzes cost and usage patterns for unusual resource consumption. Instead of you querying each AWS service, the agent orchestrates this automatically.
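For comparison, the manual equivalent of the first of these lookups might resemble the sketch below, which queries CloudTrail event history for a single access key. It illustrates the kind of query an analyst would otherwise run by hand and is not a description of how the agent is implemented. The helper name events_for_access_key and the example key ID are hypothetical.

```shell
# Hypothetical helper: list API calls recorded for one access key since a start time.
# Usage: events_for_access_key <access-key-id> <start-time>
#   start-time example: 2025-11-01T00:00:00Z
events_for_access_key() {
    aws cloudtrail lookup-events \
        --lookup-attributes "AttributeKey=AccessKeyId,AttributeValue=$1" \
        --start-time "$2" \
        --query 'Events[].[EventTime,EventSource,EventName,Username]' \
        --output text
}

# Example (requires AWS credentials; the key ID is a placeholder):
#   events_for_access_key AKIAEXAMPLEKEYID 2025-11-01T00:00:00Z
```

This lookup covers only management events from the last 90 days in one Region. IAM, EC2, and cost data each need their own queries, which is the manual correlation work the agent takes over.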

Within minutes, you get a summary, as shown in the following figure. The investigation summary includes a high-level summary and critical findings, which include the credential exposure pattern, observed activity and the timeframe, affected resources, and limiting factors.

Figure 1 – Investigation summary

This response was generated using AWS Generative AI capabilities. You are responsible for evaluating any recommendations in your specific context and implementing appropriate oversight and safeguards. Learn more about AWS Responsible AI requirements.

Note: The preceding example is representative output. Exact formatting will vary depending on findings.

The investigation summary includes various tabs for detailed information, such as technical findings with an events timeline, as shown in the following figure:

Figure 2 – Security event timeline

When seconds count, this transparency is essential to a quick and accurate response. It matters most when you need to escalate to the AWS CIRT, a dedicated group of AWS security experts, or explain your findings to leadership, because it gives every stakeholder a single view of the incident.

When the investigation is complete, you have a high-resolution picture of what happened and can make informed decisions about containment, eradication, and recovery. For the preceding exposed credentials scenario, you might need to:

Delete the compromised access keys
Remove the newly created IAM role
Terminate the unauthorized EC2 instances
Review and revert associated IAM policy changes
Check for additional access keys created for other users
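A few of these steps map to AWS CLI commands, sketched below behind a dry-run wrapper so that nothing destructive runs by accident. The helper names, the DRY_RUN variable, and the example user name and key ID are hypothetical. Which actions are appropriate depends on the findings of your own investigation.

```shell
# Hypothetical wrapper: print the command unless DRY_RUN=0 is set explicitly.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "DRY RUN: $*"
    fi
}

# Deactivate a compromised access key (reversible).
# Usage: deactivate_access_key <user-name> <access-key-id>
deactivate_access_key() {
    run aws iam update-access-key --user-name "$1" --access-key-id "$2" --status Inactive
}

# Delete a compromised access key (not reversible).
# Usage: delete_access_key <user-name> <access-key-id>
delete_access_key() {
    run aws iam delete-access-key --user-name "$1" --access-key-id "$2"
}

# Terminate EC2 instances identified as unauthorized.
# Usage: terminate_instances <instance-id>...
terminate_instances() {
    run aws ec2 terminate-instances --instance-ids "$@"
}

# Example: preview the commands first, then rerun with DRY_RUN=0 to execute.
#   deactivate_access_key example-user AKIAEXAMPLEKEYID
```

Keeping deactivation and deletion as separate steps leaves room to confirm that no legitimate workload depended on the key before it is removed permanently.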

When you engage with the CIRT, they can provide additional guidance on containment strategies based on the evidence the agent gathered.

What this means for your security operations

The leaked credentials scenario shows what the agent can do for a single incident. But the bigger impact is on how you operate day-to-day:

You spend less time on evidence collection. The investigative agent automates the most time-consuming part of investigations—gathering and correlating evidence from multiple sources. Instead of spending an hour on manual log analysis, you can spend most of that time on making containment decisions and preventing recurrence.
You can investigate in plain language. The investigative agent uses natural language processing (NLP), so you can describe what you’re investigating in plain language, such as “unusual API calls from IP address X” or “data access from a terminated employee’s credentials,” and the agent translates that into the technical queries needed. You don’t need to be an expert in AWS log formats or know the exact syntax for querying CloudTrail.
You get a foundation for high-fidelity and accurate investigations. The investigative agent handles the initial investigation—gathering evidence, identifying patterns, and providing a comprehensive summary. If your case requires deeper analysis or you need guidance on complex scenarios, you can engage with the AWS CIRT, who can immediately build on the work the agent has already done, speeding up their response time. They see the same evidence and timeline, so they can focus on advanced threat analysis and containment strategies rather than starting from scratch.

Getting started

If you already have Security Incident Response enabled, the AI-powered investigation capabilities are available now—no additional configuration needed. Create your next security case and the agent will start working automatically.

If you’re new to Security Incident Response, here’s how to set it up:

Enable Security Incident Response through your AWS Organizations management account. This takes a few minutes through the AWS Management Console and provides coverage across your accounts.
Create a case. Describe what you’re investigating; you can do this through the Security Incident Response console or an API, or set up automatic case creation from Amazon GuardDuty or AWS Security Hub alerts.
Review the analysis. The agent presents its findings through the Security Incident Response console, or you can access them through your existing ticketing systems such as Jira or ServiceNow.

The investigative agent uses the AWS Support service-linked role to gather information from your AWS resources. This role is automatically created when you set up your AWS account and provides the necessary access for Support tools to query CloudTrail events, IAM configurations, EC2 details, and cost data. Actions taken by the agent are logged in CloudTrail for full auditability.

The investigative agent is included at no additional cost with Security Incident Response, which now offers metered pricing with a free tier covering your first 10,000 findings ingested per month. Beyond that, findings are billed at rates that decrease with volume. With this consumption-based approach, you can scale your security incident response capabilities as your needs grow.

How it fits with existing tools

Security Incident Response cases can be created by customers or proactively by the service. The investigative agent is automatically triggered when a new case is created, and cases can be managed through the console, API, or Amazon EventBridge integrations.

You can use EventBridge to build automated workflows that route security events from GuardDuty, Security Hub, and Security Incident Response itself to create cases and initiate response plans, enabling end-to-end detection-to-investigation pipelines. Before the investigative agent begins its work, the service’s auto-triage system monitors and filters security findings from GuardDuty and third-party security tools through Security Hub. It uses customer-specific information, such as known IP addresses and IAM entities, to filter findings based on expected behavior, reducing alert volume while escalating alerts that require immediate attention. This means the investigative agent focuses on alerts that actually need investigation.
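As one hedged sketch of the first stage of such a pipeline, the following creates an EventBridge rule that matches GuardDuty findings at or above a chosen severity. The rule name, the severity threshold, and the helper names are assumptions for illustration. The sketch does not attach a target. A target that opens a case (for example, a function that calls the Security Incident Response API) would be added separately with put-targets.

```shell
# Hypothetical helper: print an event pattern matching GuardDuty findings
# at or above a minimum severity.
# Usage: guardduty_pattern <min-severity>
guardduty_pattern() {
    printf '{"source":["aws.guardduty"],"detail-type":["GuardDuty Finding"],"detail":{"severity":[{"numeric":[">=",%s]}]}}\n' "$1"
}

# Hypothetical helper: create (or update) an enabled rule with that pattern.
# Usage: create_guardduty_rule <rule-name> <min-severity>
create_guardduty_rule() {
    aws events put-rule \
        --name "$1" \
        --event-pattern "$(guardduty_pattern "$2")" \
        --state ENABLED
}

# Example (requires AWS credentials; the rule name is a placeholder):
#   create_guardduty_rule high-severity-guardduty-findings 7
```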

Conclusion

In this post, I showed you how the new investigative agent in AWS Security Incident Response automates evidence gathering and analysis, reducing the time required to investigate security events from hours to minutes. The agent asks clarifying questions to understand your specific concern, automatically queries multiple AWS data sources, correlates evidence, and presents you with a comprehensive timeline and summary while maintaining full transparency and auditability.

With the addition of the investigative agent, Security Incident Response customers now get the speed and efficiency of AI-powered automation, backed by the expertise and oversight of AWS security experts when needed.

The AI-powered investigation capabilities are available today in all commercial AWS Regions where Security Incident Response operates. To learn more about pricing and features, or to get started, visit the AWS Security Incident Response product page.

If you have feedback about this post, submit comments in the Comments section below.

Introducing Cluster insights: Unified monitoring dashboard for Amazon OpenSearch Service clusters

2025-11-21 Siddhant Gupta

Post Syndicated from Siddhant Gupta original https://aws.amazon.com/blogs/big-data/introducing-cluster-insights-unified-monitoring-dashboard-for-amazon-opensearch-service-clusters/

Amazon OpenSearch Service clusters offer a wealth of operational metrics accessible through CloudWatch and the Amazon OpenSearch Service console to support effective performance monitoring and alert creation. Yet, pinpointing resiliency and performance challenges within your cluster can prove daunting. The process of identifying resource-intensive queries or understanding performance degradation trends can be time-consuming.

To address these challenges, we launched Cluster insights, which presents a unified dashboard delivering curated insights along with actionable mitigation steps. The dashboard displays detailed metrics at the node, index, and shard levels, coupled with a concise summary of security and resiliency best practices to uphold peak resiliency and availability.

This post guides you through setting up and using Cluster insights, including key features and metrics. By the end, you’ll understand how to use Cluster insights to recognize and address performance and resiliency issues within your OpenSearch Service clusters.

Getting Started with Cluster insights

Cluster insights is available at no additional cost to OpenSearch Service users running OpenSearch version 2.17 or later. Accessing Cluster insights requires admin-level permissions for your OpenSearch domain. Cluster insights is available only through the OpenSearch UI, which supports multiple data sources, zero-downtime upgrades for your dashboard experience, and curated workspaces for effective team collaboration. You first need to associate a data source (your clusters) with an OpenSearch UI application. Detailed steps are described in the user guide. Your OpenSearch UI console experience will look like the following screenshots.

To access Cluster insights using the OpenSearch UI application:

In the Amazon OpenSearch Service console, navigate to OpenSearch UI (Dashboards) and choose the Application URL to access your OpenSearch UI application.
In the OpenSearch UI application, choose the settings icon in the bottom-left corner, then choose Data administration.
On the Data administration overview page, or under Manage data in the left navigation, select Cluster insights.

Cluster insights overview

The Cluster insights – Overview acts as a landing page to show health and insights for all connected OpenSearch domains. It is organized into five sections:

Current cluster status – Displays cluster health status (Green, Yellow, and Red) in a donut chart.
Insights trend – Tracks issue patterns over the past 30 days, helping you identify emerging problems and track resolution progress. This trend analysis becomes particularly valuable when monitoring the impact of operational changes or troubleshooting recurring issues.
Current open insights – Shows the count and severity breakdown of currently active insights across your clusters.
OpenSearch service clusters – Lists all domains with their vital statistics such as health status, insights count, nodes, shards, and active queries.
Top insights by severity – Prioritizes issues that need immediate attention. Each insight comes with a clear description and specific recommendations, transforming complex monitoring data into actionable tasks. This prioritized view helps teams focus on critical issues first, whether they’re addressing shard size problems, disk space issues, or performance bottlenecks.

Together, these sections provide a comprehensive view of your OpenSearch Service infrastructure so you can assess cluster health, identify trends, and take action on critical issues from a single dashboard.

Cluster health

When you choose a specific cluster from the OpenSearch domains on the Cluster insights – Overview page, you will see cluster-specific details including health status, active insights, and performance metrics. The overview section displays cluster health along with essential metrics, including the count of shards, nodes, and indices, and the total document size. You can also review the configuration best practices followed by the domain across resiliency and security areas.

The lower section contains a table of actionable insights that presents a detailed view of current issues. This table mirrors the insights from the landing page but focuses specifically on issues affecting the selected cluster. You can observe high-severity issues such as low disk space and shard count problems, as well as medium-severity concerns that may impact cluster performance.

Each insight entry serves as an interactive element – selecting any issue reveals an in-depth analysis complete with root cause identification and specific remediation steps. The table includes important metadata such as generation timestamps, severity levels, recommendation counts, and current status, so users can prioritize and address issues effectively.

Insight details

Every insight offers detailed analysis and actionable recommendations. Take the Shard Count insight as an example: selecting it reveals a comprehensive breakdown of the issue. You’ll see that your OpenSearch cluster has breached the number of shards allowed on the nodes based on its JVM heap size, along with a detailed list of affected resources.

The detailed view includes a resource map that precisely identifies each impacted node and index, displaying critical information such as node IDs, shard counts, and the indices contributing to the issue.

The recommendations are organized into two levels: cluster-level recommendations address overall architecture improvements, such as scaling your cluster or adjusting global shard allocation settings. Index-level recommendations provide specific actions for individual indices—for example, you might see suggestions to move idle shards to UltraWarm storage. These are shards without any search or indexing operations for the last 10 days and are at least 5 days old, making them ideal candidates for warm storage to reduce the active shard count. All of this guidance is available directly within the Cluster insights interface, eliminating the need to switch between different tools or consoles.
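The idle-shard criterion described above (no search or indexing activity in the last 10 days, and at least 5 days old) can be sketched as a small predicate. This is only an illustration of the rule as stated in this post: the `ShardActivity` record and its field names are hypothetical and are not part of any OpenSearch Service API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Thresholds taken from the text above.
IDLE_WINDOW = timedelta(days=10)
MIN_AGE = timedelta(days=5)


@dataclass
class ShardActivity:
    """Hypothetical record of a shard's activity (not a service API type)."""
    index: str
    created_at: datetime
    last_search_at: Optional[datetime]  # None means the shard was never searched
    last_index_at: Optional[datetime]   # None means nothing was ever indexed


def is_idle(shard: ShardActivity, now: datetime) -> bool:
    """Return True if the shard meets the idle criterion described in the text."""
    cutoff = now - IDLE_WINDOW
    old_enough = now - shard.created_at >= MIN_AGE
    no_recent_search = shard.last_search_at is None or shard.last_search_at < cutoff
    no_recent_index = shard.last_index_at is None or shard.last_index_at < cutoff
    return old_enough and no_recent_search and no_recent_index


now = datetime(2025, 11, 27)
shards = [
    ShardActivity("logs-2025-10", datetime(2025, 10, 1),
                  datetime(2025, 10, 20), datetime(2025, 10, 31)),
    ShardActivity("logs-2025-11", datetime(2025, 11, 1),
                  datetime(2025, 11, 26), datetime(2025, 11, 27)),
]
idle = [s.index for s in shards if is_idle(s, now)]
print(idle)
```

In this made-up data set, only the October index qualifies as a candidate for warm storage, because the November index was searched and written to within the last 10 days.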

Node, Index, Shard, and Query view

Next to cluster health, you can review Node, Index, Shard, and Query details for a specific cluster. These views present critical metrics such as resource utilization (CPU, memory, disk) and search and indexing latency.

Node view

The Node view tab provides a comprehensive view of individual node performance across your cluster. This table displays critical metrics for each node including heat score indicating overall node health, resource utilization (CPU, memory, disk), search and indexing latency and rates, along with quick links to view top N shards and queries running on each node.

This view helps you identify nodes experiencing high resource utilization or performance degradation. You can drill deeper into each node by clicking on the node ID to view detailed time-based metrics showing resource usage trends over time. Additionally, you can click the top N shards link to navigate directly to the Shard View, automatically filtered to show only the shards running on the selected node, allowing you to pinpoint which specific shards are contributing to performance issues.

Index view

The Index view tab shows performance metrics aggregated at the index level. For each index, you can monitor document count and storage size, search latency and rate, indexing latency and rate, and access top N queries affecting the index. This perspective is valuable for understanding which indices are driving cluster load and identifying optimization opportunities at the index configuration level.

Shard view

The Shard view tab offers the most granular view of cluster performance by displaying metrics for individual shards. Each row shows shard ID and its assigned node, index association and resource pressure metrics (CPU, memory), along with search and indexing latency per shard. This detailed view enables you to pinpoint specific shards causing performance issues, identify shard placement imbalances, and take targeted remediation actions.

Query view

The Query view on the Cluster insights page presents live dashboards that break down execution stats, CPU and memory usage, and completion progress for every query. This helps you monitor which queries are driving the biggest resource consumption (the Top-N queries). With intuitive donut charts and scoreboards showing distribution by node, index, and user, this interface helps operators quickly pinpoint performance bottlenecks and heavy workloads, supporting targeted optimization and confident scaling decisions.

Query insights

In addition to Cluster insights, you can use Query insights to view the exact queries running and their latencies across the Expand, Query, and Fetch phases, which gives search developers the detail they need to further fine-tune their queries.

Conclusion

Cluster insights transforms OpenSearch Service cluster management from reactive troubleshooting to proactive optimization. By providing unified dashboards with heat scores and best practices across stability, resiliency, and security pillars, it offers visibility into your search infrastructure at the account level.

The actionable recommendations and step-by-step remediation guidance help users of all experience levels effectively resolve complex issues like shard imbalances and resource bottlenecks.

The integration with Query insights delivers real-time visibility into resource consumption patterns so that teams can identify and optimize performance-critical queries through detailed profiling and latency analysis.

For more information, see the Amazon OpenSearch Service User Guide.

Introducing VPC encryption controls: Enforce encryption in transit within and across VPCs in a Region

2025-11-21 Sébastien Stormacq

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/introducing-vpc-encryption-controls-enforce-encryption-in-transit-within-and-across-vpcs-in-a-region/

Today, we’re announcing virtual private cloud (VPC) encryption controls, a new capability of Amazon Virtual Private Cloud (Amazon VPC) that helps you audit and enforce encryption in transit for all traffic within and across VPCs in a Region.

Organizations across financial services, healthcare, government, and retail face significant operational complexity in maintaining encryption compliance across their cloud infrastructure. Traditional approaches require piecing together multiple solutions and managing complex public key infrastructure (PKI), while manually tracking encryption across different network paths using spreadsheets—a process prone to human error that becomes increasingly challenging as infrastructure scales.

Although AWS Nitro based instances automatically encrypt traffic at the hardware layer without affecting performance, organizations need simple mechanisms to extend these capabilities across their entire VPC infrastructure. This is particularly important for demonstrating compliance with regulatory frameworks such as the Health Insurance Portability and Accountability Act (HIPAA), Payment Card Industry Data Security Standard (PCI DSS), and Federal Risk and Authorization Management Program (FedRAMP), which require proof of end-to-end encryption across environments. Organizations need centralized visibility and control over their encryption status, without having to manage performance trade-offs or complex key management systems.

VPC encryption controls address these challenges by providing two operational modes: monitor and enforce. In monitor mode, you can audit the encryption status of your traffic flows and identify resources that allow plaintext traffic. The feature adds a new encryption-status field to VPC flow logs, giving you visibility into whether traffic is encrypted using Nitro hardware encryption, application-layer encryption (TLS), or both.

After you’ve identified resources that need modification, you can take steps to implement encryption. AWS services, such as Network Load Balancer, Application Load Balancer, and AWS Fargate tasks, will automatically and transparently migrate your underlying infrastructure to Nitro hardware without any action required from you and with no service interruption. For other resources, such as the previous generation of Amazon Elastic Compute Cloud (Amazon EC2) instances, you will need to switch to modern Nitro based instance types or configure TLS encryption at application level.

You can switch to enforce mode after all resources have been migrated to encryption-compliant infrastructure. This migration to encryption-compliant hardware and communication protocols is a prerequisite for enabling enforce mode. You can configure specific exclusions for resources, such as internet gateways or NAT gateways, that don’t support encryption (because the traffic flows outside of your VPC or the AWS network).

Other resources must be encryption-compliant and can’t be excluded. After activation, enforce mode ensures that all future resources are only created on compatible Nitro instances, and unencrypted traffic is dropped when incorrect protocols or ports are detected.

Let me show you how to get started

For this demo, I started three EC2 instances. I use one as a web server with Nginx installed on port 80, serving a clear text HTML page. The other two are continuously making HTTP GET requests to the server. This generates clear text traffic in my VPC. I use the m7g.medium instance type for the web server and one of the two clients. This instance type uses the underlying Nitro System hardware to automatically encrypt in-transit traffic between instances. I use a t4g.medium instance for the other web client. The network traffic of that instance is not encrypted at the hardware level.

To get started, I enable encryption controls in monitor mode. In the AWS Management Console, I select Your VPCs in the left navigation pane, then I switch to the VPC encryption controls tab. I choose Create encryption control and select the VPC I want to create the control for.

Each VPC can have only one VPC encryption control associated with it, creating a one-to-one relationship between the VPC ID and the VPC encryption control ID. When creating VPC encryption controls, you can add tags to help with resource organization and management. You can also activate VPC encryption control when you create a new VPC.

I enter a Name for this control. I select the VPC I want to control. For existing VPCs, I have to start in Monitor mode, and I can turn on Enforce mode when I’m sure there is no unencrypted traffic. For new VPCs, I can enforce encryption at the time of creation.

Optionally, I can define tags when creating encryption controls for an existing VPC. However, when enabling encryption controls during VPC creation, separate tags can’t be created for VPC encryption controls—because they automatically inherit the same tags as the VPC. When I’m ready, I choose Create encryption control.

Alternatively, I can use the AWS Command Line Interface (AWS CLI):

aws ec2 create-vpc-encryption-control --vpc-id vpc-123456789

Next, I audit the encryption status of my VPC using the console, command line, or flow logs:

aws ec2 create-flow-logs \
  --resource-type VPC \
  --resource-ids vpc-123456789 \
  --traffic-type ALL \
  --log-destination-type s3 \
  --log-destination arn:aws:s3:::vpc-flow-logs-012345678901/vpc-flow-logs/ \
  --log-format '${flow-direction} ${traffic-path} ${srcaddr} ${dstaddr} ${srcport} ${dstport} ${encryption-status}'
{
    "ClientToken": "F7xmLqTHgt9krTcFMBHrwHmAZHByyDXmA1J94PsxWiU=",
    "FlowLogIds": [
        "fl-0667848f2d19786ca"
    ],
    "Unsuccessful": []
}

After a few minutes, I see this traffic in my logs:

flow-direction traffic-path srcaddr dstaddr srcport dstport encryption-status
ingress - 10.0.133.8 10.0.128.55 43236 80 1 # <-- HTTP between web client and server. Encrypted at hardware-level
egress 1 10.0.128.55 10.0.133.8 80 43236 1
ingress - 10.0.133.8 10.0.128.55 36902 80 1
egress 1 10.0.128.55 10.0.133.8 80 36902 1
ingress - 10.0.130.104 10.0.128.55 55016 80 0 # <-- HTTP between web client and server. Not encrypted at hardware-level
egress 1 10.0.128.55 10.0.130.104 80 55016 0
ingress - 10.0.130.104 10.0.128.55 60276 80 0
egress 1 10.0.128.55 10.0.130.104 80 60276 0

10.0.128.55 is the web server with hardware-encrypted traffic, serving clear text traffic at application level.
10.0.133.8 is the web client with hardware-encrypted traffic.
10.0.130.104 is the web client with no encryption at the hardware level.

The encryption-status field tells me the status of the encryption for the traffic between the source and destination address:

0 means the traffic is in clear text
1 means the traffic is encrypted at the network layer (Level 3) by the Nitro system
2 means the traffic is encrypted at the application layer (Level 7, TCP Port 443 and TLS/SSL)
3 means the traffic is encrypted both at the application layer (TLS) and the network layer (Nitro)
“-” means VPC encryption controls are not enabled, or VPC flow logs don’t have the status information.

The traffic originating from the web client on the instance that isn’t Nitro based (10.0.130.104) is flagged as 0. The traffic initiated from the web client on the Nitro based instance (10.0.133.8) is flagged as 1.
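To make this audit repeatable, the flow log records can be parsed with a few lines of Python. The following is a minimal sketch that assumes the custom log format used in this post (seven space-separated fields, with encryption-status last) and that the header line has already been skipped. The helper names `parse_line` and `clear_text_sources` are illustrative, and the sample records are copied from the output shown earlier.

```python
from collections import namedtuple

# Field order matches the custom --log-format used in this post.
FIELDS = [
    "flow_direction", "traffic_path", "srcaddr", "dstaddr",
    "srcport", "dstport", "encryption_status",
]
FlowRecord = namedtuple("FlowRecord", FIELDS)

# Meaning of each encryption-status value, as listed in this post.
STATUS_MEANING = {
    "0": "clear text",
    "1": "encrypted at the network layer (Nitro)",
    "2": "encrypted at the application layer (TLS)",
    "3": "encrypted at both the application and network layers",
    "-": "status not available",
}


def parse_line(line):
    """Parse one record, ignoring any trailing '# ...' annotation."""
    parts = line.split("#", 1)[0].split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"unexpected number of fields: {line!r}")
    return FlowRecord(*parts)


def clear_text_sources(lines):
    """Return the source addresses of ingress flows recorded with status 0."""
    records = (parse_line(line) for line in lines)
    return sorted({
        r.srcaddr for r in records
        if r.flow_direction == "ingress" and r.encryption_status == "0"
    })


SAMPLE = [
    "ingress - 10.0.133.8 10.0.128.55 43236 80 1",
    "egress 1 10.0.128.55 10.0.133.8 80 43236 1",
    "ingress - 10.0.130.104 10.0.128.55 55016 80 0",
    "egress 1 10.0.128.55 10.0.130.104 80 55016 0",
]

for source in clear_text_sources(SAMPLE):
    print(source, "->", STATUS_MEANING["0"])
```

On the sample records, the only source reported is 10.0.130.104, the client running on the instance that isn’t Nitro based. Real flow log files delivered to Amazon S3 would first need to be downloaded and decompressed, which this sketch doesn’t cover.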

I also use the console to identify resources that need modification. It reports two nonencrypted resources: the internet gateway and the elastic network interface (ENI) of the instance that isn’t based on Nitro.

I can also check for nonencrypted resources using the CLI:

aws ec2 get-vpc-resources-blocking-encryption-enforcement --vpc-id vpc-123456789

After updating my resources to support encryption, I can use the console or the CLI to switch to enforce mode.

In the console, I select the VPC encryption control. Then, I select Actions and Switch mode.

Or the equivalent CLI:

aws ec2 modify-vpc-encryption-control --vpc-id vpc-123456789 --mode enforce

How to modify the resources that are identified as nonencrypted?

All your VPC resources must support traffic encryption, either at the hardware layer or at the application layer. For most resources, you don’t need to take any action.

AWS services accessed through AWS PrivateLink and gateway endpoints automatically enforce encryption at the application layer. These services only accept TLS-encrypted traffic. AWS will automatically drop any traffic that isn’t encrypted at the application layer.

When you enable monitor mode, we automatically and gradually migrate your Network Load Balancers, Application Load Balancers, AWS Fargate clusters, and Amazon Elastic Kubernetes Service (Amazon EKS) clusters to hardware that inherently supports encryption. This migration happens transparently without any action required from you.

Some VPC resources require you to select the underlying instances that support modern Nitro hardware-layer encryption. These include EC2 instances, Auto Scaling groups, Amazon Relational Database Service (Amazon RDS) databases (including Amazon DocumentDB), Amazon ElastiCache node-based clusters, Amazon Redshift provisioned clusters, EKS clusters, ECS with EC2 capacity, MSK Provisioned, Amazon OpenSearch Service, and Amazon EMR. To migrate your Redshift clusters, you must create a new cluster or namespace from a snapshot.

If you use newer-generation instances, you likely already have encryption-compliant infrastructure because all recent instance types support encryption. For older-generation instances that don’t support encryption in transit, you’ll need to upgrade to supported instance types.

Something to know when using AWS Transit Gateway

When creating a Transit Gateway through AWS CloudFormation with VPC encryption enabled, you need two additional AWS Identity and Access Management (IAM) permissions: ec2:ModifyTransitGateway and ec2:ModifyTransitGatewayOptions. These permissions are required because CloudFormation uses a two-step process to create a Transit Gateway. It first creates the Transit Gateway with basic configuration, then calls ModifyTransitGateway to enable encryption support. Without these permissions, your CloudFormation stack will fail during creation when attempting to apply the encryption configuration, even if you’re only performing what appears to be a create operation.

Pricing and availability

You can start using VPC encryption controls today in these AWS Regions: US East (Ohio, N. Virginia), US West (N. California, Oregon), Africa (Cape Town), Asia Pacific (Hong Kong, Hyderabad, Jakarta, Melbourne, Mumbai, Osaka, Singapore, Sydney, Tokyo), Canada (Central), Canada West (Calgary), Europe (Frankfurt, Ireland, London, Milan, Paris, Stockholm, Zurich), Middle East (Bahrain, UAE), and South America (São Paulo).

VPC encryption controls is free of cost until March 1, 2026. The VPC pricing page will be updated with details as we get closer to that date.

To learn more, visit the VPC encryption controls documentation or try it out in your AWS account. I look forward to hearing how you use this feature to strengthen your security posture and help you meet compliance standards.

— seb

Introducing attribute-based access control for Amazon S3 general purpose buckets

2025-11-21 Matheus Guimaraes

Post Syndicated from Matheus Guimaraes original https://aws.amazon.com/blogs/aws/introducing-attribute-based-access-control-for-amazon-s3-general-purpose-buckets/

As organizations scale, managing access permissions for storage resources becomes increasingly complex and time-consuming. As new team members join, existing staff changes roles, and new S3 buckets are created, organizations must constantly update multiple types of access policies to govern access across their S3 buckets. This challenge is especially pronounced in multi-tenant S3 environments where administrators must frequently update these policies to control access across shared datasets and numerous users.

Today we’re introducing attribute-based access control (ABAC) for Amazon Simple Storage Service (S3) general purpose buckets, a new capability you can use to automatically manage permissions for users and roles by controlling data access through tags on S3 general purpose buckets. Instead of managing permissions individually, you can use tag-based IAM or bucket policies to automatically grant or deny access based on tags between users, roles, and S3 general purpose buckets. Tag-based authorization makes it easy to grant S3 access based on project, team, cost center, data classification, or other bucket attributes instead of bucket names, dramatically simplifying permissions management for large organizations.

How ABAC works
Here’s a common scenario: as an administrator, I want to give developers access to all S3 buckets meant to be used in development environments.

With ABAC, I can tag my development environment S3 buckets with a key-value pair such as environment:development and then attach an ABAC policy to an AWS Identity and Access Management (IAM) principal that checks for the same environment:development tag. If the bucket tag matches the condition in the policy, the principal is granted access.

Let’s see how this works.

Getting started
First, I need to explicitly enable ABAC on each S3 general purpose bucket where I want to use tag-based authorization.

I navigate to the Amazon S3 console, select my general purpose bucket then navigate to Properties where I can find the option to enable ABAC for this bucket.

I can also use the AWS Command Line Interface (AWS CLI) to enable it programmatically by using the new PutBucketAbac API. Here I am enabling ABAC on a bucket called my-demo-development-bucket located in the US East (Ohio) us-east-2 AWS Region.

aws s3api put-bucket-abac --bucket my-demo-development-bucket --abac-status Status=Enabled --region us-east-2

Alternatively, if you use AWS CloudFormation, you can enable ABAC by setting the AbacStatus property to Enabled in your template.

Next, let’s tag our S3 general purpose bucket. I add an environment:development tag which will become the criteria for my tag-based authorization.

Now that my S3 bucket is tagged, I’ll create an ABAC policy that verifies matching environment:development tags and attach it to an IAM role called dev-env-role. By managing developer access to this role, I can control permissions to all development environment buckets in a single place.

I navigate to the IAM console, choose Policies, and then Create policy. In the Policy editor, I switch to JSON view and create a policy that allows users to read, write, and list S3 objects, but only when the bucket has a tag with a key of “environment” attached and its value matches the one declared in the policy (development). I give this policy the name s3-abac-policy and save it.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/environment": "development"
                }
            }
        }
    ]
}

I then attach this s3-abac-policy to the dev-env-role.

That’s it! Now a user assuming the dev-env-role can access any ABAC-enabled bucket with the tag environment:development such as my-demo-development-bucket.
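To build intuition for how the condition in s3-abac-policy behaves, here is a toy model in Python. It is not how IAM evaluates policies. It only mimics a StringEquals check on aws:ResourceTag keys against a bucket’s tags, and the function name `condition_matches` is hypothetical.

```python
# The Condition block from s3-abac-policy shown earlier.
POLICY_CONDITION = {
    "StringEquals": {
        "aws:ResourceTag/environment": "development"
    }
}

RESOURCE_TAG_PREFIX = "aws:ResourceTag/"


def condition_matches(condition, bucket_tags):
    """Toy check: every StringEquals entry must match the bucket's tags.

    Only the StringEquals operator and aws:ResourceTag/<key> condition keys
    are modeled here; anything else raises an error.
    """
    for operator, entries in condition.items():
        if operator != "StringEquals":
            raise ValueError(f"operator not modeled: {operator}")
        for condition_key, expected in entries.items():
            if not condition_key.startswith(RESOURCE_TAG_PREFIX):
                raise ValueError(f"condition key not modeled: {condition_key}")
            tag_key = condition_key[len(RESOURCE_TAG_PREFIX):]
            # A missing tag never matches.
            if bucket_tags.get(tag_key) != expected:
                return False
    return True


dev_bucket = {"environment": "development"}
prod_bucket = {"environment": "production"}
untagged_bucket = {}

print(condition_matches(POLICY_CONDITION, dev_bucket))
print(condition_matches(POLICY_CONDITION, prod_bucket))
print(condition_matches(POLICY_CONDITION, untagged_bucket))
```

In this model, only the bucket tagged environment:development satisfies the condition, which mirrors the intent of the policy: access follows the tag, not the bucket name.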

Using your existing tags
Keep in mind that although you can use your existing tags for ABAC, because these tags will now be used for access control, we recommend reviewing your current tag setup before enabling the feature. This includes reviewing your existing bucket tags and tag-based policies to prevent unintended access, and updating your tagging workflows to use the standard TagResource API (since enabling ABAC on your buckets will block the use of the PutBucketTagging API). You can use AWS Config to check which buckets have ABAC enabled and review your usage of the PutBucketTagging API in your application using AWS CloudTrail management events.

Additionally, the same tags you use for ABAC can also serve as cost allocation tags for your S3 buckets. Activate them as cost allocation tags in the AWS Billing Console or through APIs, and your AWS Cost Explorer and Cost and Usage Reports will automatically organize spending data based on these tags.

Enforcing tags on creation
To help standardize access control across your organization, you can now enforce tagging requirements when buckets are created through service control policies (SCPs) or IAM policies using the aws:TagKeys and aws:RequestTag condition keys. Then you can enable ABAC on these buckets to provide consistent access control patterns across your organization. To tag a bucket during creation you can add the tags to your CloudFormation templates or provide them in the request body of your call to the existing S3 CreateBucket API. For example, I could enforce a policy for my developers to create buckets with the tag environment=development so all my buckets are tagged accurately for cost allocation. If I want to use the same tags for access control, I can then enable ABAC for these buckets.
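As an illustration of the tag enforcement described above, the following Python sketch builds a policy document that denies bucket creation unless the request carries the expected tag. It is an assumption about how such a policy could be written, not a policy taken from the S3 documentation: the statement ID and the helper name `require_tag_on_create` are made up, and only the aws:RequestTag condition key is used.

```python
import json


def require_tag_on_create(tag_key, tag_value):
    """Build a policy that denies s3:CreateBucket without the expected tag.

    StringNotEquals also evaluates to true when the tag is absent from the
    request, so untagged creation requests are denied as well.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyCreateBucketWithoutTag",  # hypothetical name
                "Effect": "Deny",
                "Action": "s3:CreateBucket",
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {
                        f"aws:RequestTag/{tag_key}": tag_value
                    }
                },
            }
        ],
    }


policy = require_tag_on_create("environment", "development")
print(json.dumps(policy, indent=4))
```

The printed JSON could then be adapted for an SCP or an IAM policy. Review and test any such policy in a non-production account first, because a Deny statement applies to every principal it is attached to.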

Things to know

With ABAC for Amazon S3, you can now implement scalable, tag-based access control across your S3 buckets. This feature makes writing access control policies simpler, and reduces the need for policy updates as principals and resources come and go. This helps you reduce administrative overhead while maintaining strong security governance as you scale.

Attribute-based access control for Amazon S3 general purpose buckets is available now through the AWS Management Console, API, AWS SDKs, AWS CLI, and AWS CloudFormation at no additional cost. Standard API request rates apply according to Amazon S3 pricing. There’s no additional charge for tag storage on S3 resources.

You can use AWS CloudTrail to audit access requests and understand which policies granted or denied access to your resources.

You can also use ABAC with other S3 resources such as S3 directory buckets, S3 access points, and S3 table buckets and tables. To learn more about ABAC on S3 buckets, see the Amazon S3 User Guide.

You can use the same tags you use for access control for cost allocation as well. You can activate them as cost allocation tags through the AWS Billing Console or APIs. Check out the documentation for more details on how to use cost allocation tags.

Introducing the Landing Zone Accelerator on AWS Universal Configuration and LZA Compliance Workbook

2025-11-21 Kevin Donohue

Post Syndicated from Kevin Donohue original https://aws.amazon.com/blogs/security/introducing-the-landing-zone-accelerator-on-aws-universal-configuration-and-lza-compliance-workbook/

We’re pleased to announce the availability of the latest sample security baseline from Landing Zone Accelerator on AWS (LZA): the Universal Configuration. Developed from years of field experience with highly regulated customers including governments across the world, and in consultation with AWS Partners and industry experts, the Universal Configuration was built to help you implement security and compliance at scale for your regulated workloads. By setting a high bar with the latest AWS security best practices, the Universal Configuration can help address technical control requirements from compliance frameworks across different geographic regions and industry verticals. The Universal Configuration’s multi-account security architecture provides a foundation to host your diverse workload requirements today along with providing the ability to explore the generative AI and agentic AI solutions that will shape your organization in the future. It can also replace months of complex planning and design by deploying a comprehensive security and compliance-driven environment based on AWS Well-Architected principles in a matter of hours.

As organizations grow, they typically pursue or must adhere to new security compliance certifications. LZA and the Universal Configuration help organizations of all sizes and phases in their security and compliance journey. The speed of deployment, step-by-step documentation, and compliance resources can reduce traditional assessment and authorization timelines by months and result in more predictable and successful audit outcomes. This enables more freedom to invest resources to grow the business instead of choosing between security and compliance tradeoffs.

The Universal Configuration helps organizations:

Automate the deployment of a secure multi-account AWS environment
- Foundational security controls based on AWS Well-Architected best practices
- Apply consistent and predictable security controls post-deployment
- Enable and integrate with native AWS security, identity, and compliance services
Implement controls across system layers
- Organization-wide security architecture
- Perimeter and resource-specific preventative, proactive, and detective controls
- Support for multi-AWS Region resilience, disaster recovery, and active failover
Establish a foundation for security and compliance readiness
- Built-in AWS security best practices and technical implementation statements
- Map LZA capabilities across global and industry-specific compliance frameworks
- Deploy hundreds of controls in hours instead of months

The LZA Compliance Workbook

The LZA engine has been a trusted tool for quickly deploying secure multi-account AWS environments for over 4 years. It is also cost effective because you pay only for the AWS services used to operate your environment. The Universal Configuration is the first sample configuration accompanied by the LZA Compliance Workbook available on AWS Artifact. It’s a first-of-its-kind resource with detailed control mappings showing how the Universal Configuration can help you address requirements from frameworks including NIST 800-53 Rev5, CMMC/NIST 800-171, ISO-27001, HIPAA, C5:2020, NATO D-32 (Appendix B), and DoD CCI.

The LZA Compliance Workbook is regularly maintained to reflect the latest Universal Configuration baseline and will include additional compliance mappings in future releases. The workbook contains detailed security configuration descriptions based on the Universal Configuration deployment files, along with control requirement mappings and implementation statements that translate its security capabilities into a compliance-friendly format. By combining AWS security best practices with global compliance expertise, the Universal Configuration delivers predictable security outcomes while also helping you meet regional and industry requirements.

Getting started

To get started with the Landing Zone Accelerator on AWS Universal Configuration, the LZA Implementation Guide walks you through the steps, use cases, and considerations when deploying with LZA. You can download the LZA Compliance Workbook from AWS Artifact today and configure notifications to receive emails when future versions are released. You can view the deployment files and additional technical implementation guidance on the GitHub Universal Configuration sample and documentation page. Additionally, visit the AWS Partner Network (APN) for help with audit and advisory initiatives, cloud migrations, deploying the LZA Universal Configuration, and other services. You can visit the AWS Partner Finder tool and search by solution for Landing Zone Accelerator for the latest LZA Partner offerings.

If you have feedback about this post, submit comments in the Comments section below.

Enforce business glossary classification rules in Amazon SageMaker Catalog

2025-11-20 Ramesh H Singh

Post Syndicated from Ramesh H Singh original https://aws.amazon.com/blogs/big-data/enforce-business-glossary-classification-rules-in-amazon-sagemaker-catalog/

Organizations are scaling their data catalogs faster than ever. Maintaining consistent metadata standards across teams remains a challenge. Business glossaries define the language of the enterprise—terms like Customer Profile, Transaction, or Confidential Data—but assets are often published without these classifications, leading to inconsistent metadata and poor discoverability.

To address this, Amazon SageMaker Catalog now supports metadata enforcement rules for glossary terms classification (tagging) at the asset level. With this capability, administrators can require that assets include specific business terms or classifications. Data producers must apply required glossary terms or classifications before an asset can be published. This enforces metadata consistency across the catalog and makes sure assets carry the business context needed for effective discovery and governance.

This capability builds on existing metadata rule features for enforcing required metadata fields during asset publishing. The new addition extends those rules to cover glossary term validation, strengthening the link between business language and technical data assets.

In this post, we show how to enforce business glossary classification rules in SageMaker Catalog.

Why metadata enforcement matters

A common governance challenge is the lack of standardized tagging and classification for assets entering enterprise catalogs. Without enforcement, data producers might publish assets missing required business terms (such as data sensitivity level or product domain), resulting in inconsistent metadata that confuses business users, unreliable search and filtering results, and manual cleanup and downstream compliance risks.

SageMaker Catalog automatically validates metadata when assets are published. This offers the following key benefits:

Assets are classified with approved business terms before publication
Validation supports compliance with internal glossary and classification standards
Consistent tagging enhances search accuracy and reduces noise
Incomplete or incorrectly tagged assets don’t reach consumers

How metadata enforcement works

On the Amazon SageMaker Unified Studio console, administrators navigate to Catalog, Governance, Rules and create metadata rules targeting the asset publishing workflow. Rules can specify required glossary terms or classification fields (for example, Business Unit, PII Category, or Data Sensitivity). Rules can apply organization-wide or within specific domains or projects.

When a producer attempts to publish an asset, SageMaker Catalog checks that the asset includes the required glossary terms or classifications. If any required metadata is missing, the publish action fails with a clear error message. After the metadata is added, the asset can be published successfully.
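
The publish-time check described above can be pictured with a minimal sketch. This is a hypothetical model for illustration only, not the SageMaker Catalog API: the function names, the rule shape (`requiredTerms`), and the asset shape (`glossaryTerms`) are all assumptions.

```javascript
// Hypothetical model of a publish-time glossary check (not a SageMaker API).
// Returns the required terms that are not attached to the asset.
function findMissingTerms(requiredTerms, assetTerms) {
  const attached = new Set(assetTerms);
  return requiredTerms.filter((term) => !attached.has(term));
}

// Publishing is allowed only when no required term is missing.
function validatePublish(asset, rule) {
  const missing = findMissingTerms(rule.requiredTerms, asset.glossaryTerms);
  return { allowed: missing.length === 0, missing };
}

// Assumed rule and asset shapes, for illustration.
const financeRule = { requiredTerms: ['Finance'] };
console.log(validatePublish({ name: 'test_table', glossaryTerms: [] }, financeRule));
console.log(validatePublish({ name: 'test_table', glossaryTerms: ['Finance'] }, financeRule));
```

The first call reports `Finance` as missing and blocks the publish; the second passes once the term is attached, which mirrors the fail-then-fix flow a producer sees in the console.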

Enforced tagging makes sure published assets can be searched and filtered using consistent business terminology, improving catalog usability for analysts and business users.

Solution overview

For this post, we explore a financial services use case. In our example, a financial services company defines a rule requiring all datasets published from the project to have the ‘Finance’ glossary term associated:

A data producer attempting to publish a new dataset without this tag receives a validation error
After applying the correct classification, the dataset publishes successfully
Analysts can now filter the catalog to find only Finance datasets or join assets consistently tagged with the same glossary term

In the following sections, we walk through the steps to configure this solution. We create a rule that all assets published from a specific project should have a business unit tag called Finance.

Prerequisites

To test this solution, you should have a SageMaker Unified Studio domain set up with domain owner or domain unit owner privileges. You should also have an existing project to publish assets and catalog assets. For instructions to create these assets, see the Getting started guide.

In this example, we created a project named financial_analysis and a test table. For instructions to create a table, see Get started with Amazon S3 Tables in Amazon SageMaker Unified Studio. To ingest the sample data to SageMaker Catalog and generate business metadata, see Create an Amazon SageMaker Unified Studio data source for Amazon Redshift in the project catalog.

Create glossary and add terms

Complete the following steps to create a new glossary and add terms:

In SageMaker Unified Studio, on the Discover menu, choose Glossaries.
Choose Create glossary.
Provide details for your glossary, including name, owning project, and optional description.
For Glossary restriction, turn on Enabled.
Choose Create.
Create the term Finance in the Business Unit Details glossary.

Create rule to enforce glossary terms

Complete the following steps to create a rule that enforces glossary terms:

On the Govern menu, choose Domain units.
On the Rules tab, choose Add.
Add a publishing rule for the Finance project to have the Finance tag for all assets published to the catalog.
Choose Add rule.

The following screenshot shows the configuration details for your new rule.

Publish asset with enforced rules

Complete the following steps to publish your asset with the enforced rules:

On the financial_analysis project page, go to your asset.
In the Glossary terms section, choose Add terms.

If you choose Publish without adding the needed term, you get an error stating the Finance term should be assigned.
Choose Finance to add the required term.
Choose Publish asset.

The following screenshot shows the published asset and the required terms in the glossary.

Conclusion

With metadata enforcement rules for glossary terms, SageMaker Catalog brings stronger control and consistency to how organizations publish and manage their data assets. By requiring approved business classifications before publication, teams can make sure assets adhere to enterprise metadata standards, improving governance, discoverability, and trust in shared catalogs. This capability helps organizations scale their catalog governance without adding manual overhead—embedding compliance and quality directly into the publishing workflow.

Metadata enforcement rules for glossary terms are available in AWS Regions where SageMaker Catalog operates. To get started with this capability, refer to the user guide.

Enhanced data discovery in Amazon SageMaker Catalog with custom metadata forms and rich text documentation

2025-11-20 Ramesh H Singh

Post Syndicated from Ramesh H Singh original https://aws.amazon.com/blogs/big-data/enhanced-data-discovery-in-amazon-sagemaker-catalog-with-custom-metadata-forms-and-rich-text-documentation/

Amazon SageMaker Catalog now supports custom metadata forms and rich text descriptions at the column level, extending existing curation capabilities for business names, descriptions, and glossary term classifications.

With these new features, data stewards can define and capture business-specific metadata directly in individual columns, and authors can use markdown-enabled rich text to provide detailed documentation and business context. Both form fields and formatted descriptions are indexed in real time, making them immediately discoverable through catalog search.

Column-level context is essential for understanding and trusting data. This release helps organizations improve data discoverability, collaboration, and governance by letting metadata stewards document columns using structured and formatted information that aligns with internal standards.

In this post, we show how to enhance data discovery in SageMaker Catalog with custom metadata forms and rich text documentation at the schema level.

Key capabilities

SageMaker Catalog now offers the following key capabilities:

Custom metadata forms – Data stewards can now use custom metadata forms to capture organization-specific metadata fields for columns such as Business Owner, Regulatory Classification, Units of Measure, or Approved Use Case. Each field is stored as a key-value pair and indexed for search, enabling business-level queries like “find columns where sensitivity = confidential.”
Rich text (markdown) descriptions – Each column supports a markdown-enabled description field. Authors can format text with headings, bullet lists, and hyperlinks to add deeper business or operational context—for example, logic definitions, sample values, or data lineage references.
Real-time indexing for search – Custom form values and rich text content are indexed as soon as they are saved. Users can search using a metadata value, keyword, or glossary term across columns.

Solution overview

For this post, we explore a financial services use case. Our example financial services organization defines a column metadata form that includes several fields, as illustrated in the following table.

Field	Example Value
Approved Use Case	Financial revenue modeling
Business Owner	Finance Office
Domain	RF

For a dataset column named revenue, the author adds the following markdown description:

# Business Revenue

- Use for Financial Modeling
- Use only for batch use cases

When analysts search for Domain = RF, this column appears in results with complete business context.
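
As a rough illustration of why key-value form fields make such a search possible, the following sketch filters a small in-memory list of columns by a form field. It is not the catalog's search implementation: the `columns` array, the second `region` column, and the `searchByField` helper are hypothetical.

```javascript
// Hypothetical in-memory stand-in for indexed column metadata.
const columns = [
  {
    asset: 'business_finance',
    column: 'revenue',
    form: {
      'Approved Use Case': 'Financial revenue modeling',
      'Business Owner': 'Finance Office',
      Domain: 'RF',
    },
  },
  // Invented second column so the filter has something to exclude.
  { asset: 'business_finance', column: 'region', form: { Domain: 'GEO' } },
];

// Return "asset.column" for every column whose form field equals the value.
function searchByField(index, field, value) {
  return index
    .filter((entry) => entry.form[field] === value)
    .map((entry) => `${entry.asset}.${entry.column}`);
}

console.log(searchByField(columns, 'Domain', 'RF'));
```

Only the `revenue` column matches `Domain = RF`, and its other form fields travel with it as business context.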

In the following sections, we demonstrate how to use metadata forms for columns and add rich text descriptions that are searchable.

Prerequisites

To test this solution, you should have an Amazon SageMaker Unified Studio domain set up with domain owner or domain unit owner privileges. You should also have an existing project to publish assets and catalog assets. For instructions to create these assets, see the Getting started guide.

In this example, we created a project named financial_analysis and a test table. To create a similar table, see Get started with Amazon S3 Tables in Amazon SageMaker Unified Studio. To ingest the sample data to SageMaker Catalog and generate business metadata, see Create an Amazon SageMaker Unified Studio data source for Amazon Redshift in the project catalog.

Create new metadata form

Complete the following steps to create a new metadata form:

In SageMaker Unified Studio, go to your project.
Under Project catalog in the navigation pane, choose Metadata entities.
Choose Create metadata form.
Provide an optional display name, a technical name, and an optional description, then choose Create metadata form.
Define the form fields. In this example, we add the fields Domain, Business Owner, and Approved Use Case.
For Requirement Options, select the configuration for each field. For our use case, we select Always required.
Choose Create field.
Turn on Enabled so the form is visible and can be used for assets.

Attach metadata form to column

Complete the following steps to attach the metadata form to a column:

Under Project catalog in the navigation pane, choose Assets.
Search for and select your asset (for this example, we use the asset business_finance).
On the Schema tab, choose View/Edit next to the revenue field.
Choose Add metadata form.
Choose the form you created and choose Add.
Add details for the metadata form fields.

Add additional context as formatted text

Next, we enter a rich text description for each column using the markdown editor, including headings, bullet lists, links, and sample values. Complete the following steps:

Choose Edit next to README for the revenue field where you added the metadata form.
Enter details and choose Save.
Choose Preview to view the formatted README at the column level.

Publish and verify search

Now you’re ready to publish the asset. The metadata form values and markdown descriptions become part of the catalog record and are indexed for search. You can also see the history of revisions on the History tab. Other project users can see the metadata form and rich text description for the published assets and subscribe to the data asset. You can create more data products with these assets, and they will also have the column metadata form and README.

In the catalog search UI, data users can now filter on custom form fields (for example, “Domain = RF”) or search in natural language for text that matches the column description.

Best practices

Consider the following best practices when using this feature:

Define metadata forms aligned with your business vocabulary (domains, owners, sensitivity levels) proactively before publishing assets at scale.
Make column descriptions actionable—include business definitions, value ranges, logic, update cadence, and dependencies.
Verify the catalog indexing is timely; publish changes proactively so search results reflect new metadata.
Use governance controls. You can combine column-level metadata with existing asset-level templates and approval workflows to enforce publishing standards.
Monitor search usage and metadata completeness; target high-value datasets for complete column-level documentation first.
Do not store confidential or sensitive information in your metadata forms.

Conclusion

With column-level metadata forms and rich text descriptions, SageMaker Catalog helps organizations deliver higher-quality metadata, stronger governance, and better data discovery. These features make it straightforward for teams to capture complete business context and for analysts to quickly locate and understand the data they need.

Custom metadata forms and rich text descriptions at the column level are now available in AWS Regions where SageMaker is supported.

To learn more about SageMaker, see the Amazon SageMaker User Guide. To get started with this capability, refer to the user guide.

Building multi-tenant SaaS applications with AWS Lambda’s new tenant isolation mode

2025-11-20 Anton Aleksandrov

Post Syndicated from Anton Aleksandrov original https://aws.amazon.com/blogs/compute/building-multi-tenant-saas-applications-with-aws-lambdas-new-tenant-isolation-mode/

Today, AWS announced a new tenant isolation mode for AWS Lambda that allows you to process function invocations in separate execution environments for each application end-user or tenant invoking your Lambda function. This capability simplifies building secure multi-tenant SaaS applications by managing tenant-level compute environment isolation and request routing for you. As a result, you can focus on your core business logic rather than implementing your own tenant-aware compute environment isolation.

Overview

Lambda runs your function code in secure execution environments that leverage Firecracker virtualization to provide isolation. These execution environments never share or reuse virtual resources (such as vCPU, disk, or memory) across functions, or even across different versions of the same function. However, Lambda can reuse execution environments for multiple invocations of the same function version, as these execution environments are fully set-up and can therefore deliver faster request processing for your functions.

Figure 1. Incoming invocations processed by a collection of execution environments that belong to a single function.

Multi-tenant SaaS applications that handle sensitive tenant-specific data or execute code supplied dynamically by tenants may need a higher degree of isolation—at the individual application tenant level rather than at the function level—for secure code execution and to reduce the risk of cross-tenant data access.

Prior to today’s launch, developers would implement custom solutions, such as SDKs or application logic to manage isolation within function code. This approach was bug-prone, required more work from application development teams, and didn’t ensure isolation at the compute environment level.

Alternatively, developers adopted the approach of creating separate functions per application tenant, replicating the same code across hundreds or thousands of tenants. This approach provided stronger compute environment isolation than sharing compute environments across multiple tenants of the same function, but increased implementation overhead and operational complexity as workloads grew to support a larger number of tenants over time.

Figure 2. Using function-per-tenant model, each tenant’s requests are processed by a separate function.

Starting today, AWS Lambda offers a new tenant isolation mode that lets you isolate execution environments used across different tenants of your multi-tenant SaaS applications, even when all of the tenants invoke the same function. When you enable the new tenant isolation mode, you include a tenant identifier with each function invocation. Lambda uses this identifier to route the request to the correct execution environment. As a result, each execution environment is reused only for invocations from the same tenant. This means you still get the performance benefits of warm execution environments, while ensuring that each tenant’s workloads remain isolated.

Figure 3. With the new tenant isolation capability, Lambda creates separate execution environments per tenant for a single function.

For organizations handling sensitive tenant-specific data or running untrusted code supplied dynamically by end-users, Lambda’s new tenant isolation mode provides the security benefits of per-tenant compute environment separation without the operational complexity of managing individual functions or infrastructure for each tenant.

Example scenario

Consider building a multi-tenant serverless SaaS application. To optimize performance, your function handler can retrieve tenant-specific configuration and data, cache it in memory, and reuse it for subsequent invocations from the same tenant. For example, you might cache tenant-specific database location, feature flags, or business rules that are frequently accessed during request processing. You may store this information within the application runtime process as global variables or as files in the /tmp directory. However, if the underlying execution environment is used to serve multiple tenants, this approach can potentially expose data across tenants.

With tenant isolation mode, you can address this risk with a much simpler architecture and configuration. This built-in capability makes Lambda an excellent choice for multi-tenant SaaS applications needing isolated compute environments for individual tenants.
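
To make the risk concrete, here is a minimal sketch of the in-memory caching pattern from this scenario, simulated outside Lambda. The `loadTenantConfig` helper, the event shape, and the database location format are hypothetical. The two calls stand in for two invocations served by the same warm execution environment.

```javascript
// Hypothetical loader standing in for a database or parameter lookup.
function loadTenantConfig(tenantId) {
  return { tenantId, dbLocation: `db-${tenantId}.example.internal` };
}

// Module-level state survives across invocations in a warm environment.
let cachedConfig = null;

function naiveHandler(event) {
  // Caches the first caller's config and reuses it for every later caller.
  if (cachedConfig === null) {
    cachedConfig = loadTenantConfig(event.tenantId);
  }
  return cachedConfig.dbLocation;
}

// Simulate two different tenants served by the same warm environment.
const first = naiveHandler({ tenantId: 'BlueTenant' });
const second = naiveHandler({ tenantId: 'GreenTenant' });
console.log(first);  // db-BlueTenant.example.internal
console.log(second); // db-BlueTenant.example.internal (BlueTenant's data)
```

In a shared environment, the second tenant receives the first tenant's cached value. When an execution environment is only ever reused for the same tenant, this simple cache no longer crosses tenant boundaries, although keying caches by tenant ID remains a reasonable defensive habit.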

Getting Started with Lambda Tenant Isolation Mode

Use the new tenancy-config parameter to configure tenant isolation mode when you create your function. You can only apply this configuration at function creation time; it cannot be updated for existing functions. The following snippet creates a function with tenancy config using the AWS CLI.

aws lambda create-function \
   --function-name my-function1 \
   --runtime nodejs22.x \
   --zip-file fileb://my-function1.zip \
   --handler index.handler \
   --role arn:aws:iam::123456789012:role/my-function-role \
   --tenancy-config '{"TenantIsolationMode": "PER_TENANT"}'

After the function is created, you must provide the tenant ID parameter with each invocation. Lambda uses this identifier to ensure that the execution environment used for a particular tenant is never reused for other tenants. For subsequent invocations from the same tenant, Lambda may reuse the execution environment to optimize performance. Specify this tenant-id parameter as illustrated below:

aws lambda invoke \
   --function-name my-function1 \
   --tenant-id BlueTenant \
   response.json

The new tenant-id parameter is required for functions using the tenant isolation mode. Function invocations omitting this parameter will fail with an invocation error, as shown below:

aws lambda invoke --function-name multitenant-function out.json

An error occurred (InvalidParameterValueException) when calling the Invoke operation:
The invoked function is enabled with tenancy configuration. 
Add a valid tenant ID in your request and try again.

Lambda makes the tenant ID parameter available through your function handler’s context object. This allows you to access tenant-specific information in your code, for example if you wish to implement custom logic based on the tenant identity, as shown below:

exports.handler = async function (event, context) {
   const tenantId = context.tenantId;

   // Process tenant-specific logic

   return {
      statusCode: 200,
      body: `OK for tenantId=${tenantId}`
   };
};

The following table outlines differences between Lambda functions with and without tenant isolation mode enabled:

Feature	Without the new tenant isolation mode	With the new tenant isolation mode
Execution environment isolation	Isolated per function version.	Isolated per end-user or tenant invoking a function version.
Execution environment reuse	Can be reused to process all invocations of a function version.	Can only be reused to process invocations from the same tenant invoking a function version.
Data stored on local disk and in-memory	Potentially accessible across all invocations of a function version.	Potentially accessible across invocations from the same tenant. Not accessible for invocations from other tenants.
Cold starts	Occur when there are no warm execution environments available to process incoming invocation.	Occur when there are no tenant-specific warm execution environments available to process incoming invocation. More cold starts expected due to tenant-specific execution environments.
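
The cold start row can be illustrated with a toy model. This sketch is not how Lambda schedules environments: it assumes sequential invocations, environments that stay warm forever once created, and a hypothetical traffic pattern. It only shows why routing by tenant tends to increase the number of environments that must be initialized.

```javascript
// Toy model: count how many environments must be created (cold starts)
// for a sequence of invocations, each identified by its tenant ID.
function countColdStarts(invocations, perTenant) {
  const warm = new Set();
  let coldStarts = 0;
  for (const tenantId of invocations) {
    // Without isolation every invocation can share one environment;
    // with isolation each tenant needs its own.
    const key = perTenant ? tenantId : 'shared';
    if (!warm.has(key)) {
      warm.add(key);
      coldStarts += 1;
    }
  }
  return coldStarts;
}

// Hypothetical traffic pattern across three tenants.
const traffic = ['BlueTenant', 'GreenTenant', 'BlueTenant', 'RedTenant', 'GreenTenant'];
console.log(countColdStarts(traffic, false)); // 1
console.log(countColdStarts(traffic, true));  // 3
```

In this simplified model the number of cold starts grows with the number of distinct tenants, which is worth factoring into latency expectations for applications with many low-traffic tenants.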

Integrating with Amazon API Gateway

Amazon API Gateway uses Lambda’s Invoke API to invoke Lambda functions. When using the Invoke API, Lambda expects the tenant ID parameter to be passed using the X-Amz-Tenant-Id HTTP header. You can configure API Gateway to inject this HTTP header into the Lambda invocation request with a value obtained from client request properties such as HTTP header, query parameter, or path parameter. When using Lambda Authorizers, you can obtain the value from authorization context information returned by the authorizer, such as principal ID or JWT claim. See API Gateway documentation to learn how you can return authorization information from Lambda authorizers to be used for the X-Amz-Tenant-Id header value.

Figure 4. Obtaining X-Amz-Tenant-Id header value from authentication sources.

The following screenshot illustrates API Gateway Lambda integration configuration, where the incoming request to API Gateway includes an x-tenant-id header that is mapped to the X-Amz-Tenant-Id request header to invoke a Lambda function using tenant isolation mode.

Figure 5. Mapping client request header to Lambda tenant-id header.

The following code snippet illustrates this configuration implemented with the AWS CDK.

// Assumes AWS CDK v2. `fn` is an existing Lambda function construct and
// `resource` is an existing API Gateway resource.
import * as ApiGw from 'aws-cdk-lib/aws-apigateway';

const lambdaIntegration = new ApiGw.LambdaIntegration(fn, {
   requestParameters: {
      // This configures API Gateway to inject the X-Amz-Tenant-Id header
      // into downstream requests. The header value is obtained from the
      // x-tenant-id header in the client request.
      'integration.request.header.X-Amz-Tenant-Id': 'method.request.header.x-tenant-id'
   }
});

resource.addMethod('GET', lambdaIntegration, {
   requestParameters: {
      // This enables API Gateway to use the x-tenant-id header value
      // obtained from the client request. The header name is arbitrary;
      // you can use any other header name.
      'method.request.header.x-tenant-id': true
   }
});
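
The same mapping idea extends to the other tenant ID sources mentioned earlier in this section (query parameter, path parameter, or authorizer context). The following sketch builds the request parameter mapping for each source using API Gateway's mapping expression prefixes. The `tenantHeaderMapping` helper and the `tenantId` property name are hypothetical; treat this as a starting point to verify against your own API configuration.

```javascript
// Build the integration request mapping that fills X-Amz-Tenant-Id
// from a chosen source. `tenantHeaderMapping` is a hypothetical helper.
function tenantHeaderMapping(source, name) {
  const expressions = {
    header: `method.request.header.${name}`,
    query: `method.request.querystring.${name}`,
    path: `method.request.path.${name}`,
    authorizer: `context.authorizer.${name}`,
  };
  if (!Object.prototype.hasOwnProperty.call(expressions, source)) {
    throw new Error(`Unsupported tenant ID source: ${source}`);
  }
  return { 'integration.request.header.X-Amz-Tenant-Id': expressions[source] };
}

console.log(tenantHeaderMapping('header', 'x-tenant-id'));
console.log(tenantHeaderMapping('authorizer', 'tenantId'));
```

Taking the value from authorizer context rather than a client-supplied header means the tenant ID comes from an authenticated source, which is generally harder for a caller to tamper with.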

Tenant-aware observability

For functions using tenant isolation, Lambda automatically includes the tenant ID in function logs when you have JSON logging enabled, making it easier to monitor and debug tenant-specific issues. Note that the tenantId property is available during function invocation, rather than during function initialization. The tenantId property is included for both platform events (like platform.start and platform.report) and custom logs you print in your function code, as shown in the following screenshot:

Figure 6. Lambda function logs with tenantId.

Lambda creates a separate CloudWatch log stream for each execution environment. You can use CloudWatch Logs Insights to find log streams that belong to a particular tenant by filtering by tenant ID:

fields @logStream, @message
| filter tenantId=='BlueTenant' or record.tenantId=='BlueTenant'
| stats count() as logCount by @logStream
| sort logCount desc

You can also retrieve tenant-specific logs across all log streams:

fields @message
| filter tenantId=='BlueTenant' or record.tenantId=='BlueTenant'
| limit 1000

Each log stream starts with function initialization logs followed by the invocation logs. This structure helps you to debug tenant-specific issues and understand the lifecycle of each tenant’s execution environments.
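
If you export log lines for offline analysis, the same tenant grouping can be approximated in code. The sketch below assumes JSON log lines where the tenant ID appears either at the top level or under `record`, mirroring the two fields used in the queries above. The sample lines are invented for illustration and are not real Lambda output.

```javascript
// Count JSON log lines per tenant, checking both `tenantId` and
// `record.tenantId`. Lines that are not JSON are skipped.
function countLogsByTenant(lines) {
  const counts = {};
  for (const line of lines) {
    let entry;
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // plain-text line, ignore
    }
    const tenantId = entry?.tenantId ?? entry?.record?.tenantId;
    if (typeof tenantId !== 'string') continue;
    counts[tenantId] = (counts[tenantId] ?? 0) + 1;
  }
  return counts;
}

// Invented sample lines; field layout is an assumption.
const sampleLines = [
  '{"type":"platform.start","record":{"requestId":"r1","tenantId":"BlueTenant"}}',
  '{"level":"INFO","message":"processing","tenantId":"BlueTenant"}',
  '{"level":"INFO","message":"processing","tenantId":"GreenTenant"}',
  'plain text line without JSON',
];
console.log(countLogsByTenant(sampleLines)); // { BlueTenant: 2, GreenTenant: 1 }
```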

Considerations

When using the new tenant isolation for Lambda functions, consider the following:

Each tenant’s execution environments are isolated from other tenants so that tenant-specific data stored on disk or in memory remain separated from other tenants invoking the same Lambda function.
All tenants share the function’s execution role. For more fine-grained permissions for individual tenants, consider propagating tenant-scoped credentials from the upstream application components invoking your Lambda function.
Your application may experience a higher percentage of cold starts, as Lambda processes requests in separate execution environments for each tenant invoking your functions.
You pay a fee for each new tenant-specific execution environment created, depending on the memory configured for your function. See the Lambda pricing page for details.

Best practices

When using the new tenant isolation mode for Lambda functions, AWS recommends the following best practices:

Implement robust tenant ID validation at the application layer to prevent unauthorized access through tenant ID manipulation. Consider using a dedicated service or database to maintain valid tenant IDs.
Monitor and audit tenant access patterns regularly to detect potential security anomalies or unauthorized cross-tenant access attempts.
Be aware of Lambda concurrency quotas when building multi-tenant applications. You might need to request quota increases based on your tenant count and usage patterns.
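
As one way to approach the first recommendation, the following sketch validates a tenant ID before it is passed to Lambda. Everything here is an assumption for illustration: the in-memory `knownTenants` registry (a database or dedicated service in practice), the ID format pattern, and the idea of comparing against a tenant ID taken from the caller's authenticated identity.

```javascript
// Hypothetical registry of valid tenants.
const knownTenants = new Set(['BlueTenant', 'GreenTenant']);

// Assumed format for this sketch only: 1-64 letters, digits, hyphens, underscores.
const TENANT_ID_PATTERN = /^[A-Za-z0-9_-]{1,64}$/;

// Reject malformed or unknown IDs, and IDs that do not match the
// tenant the caller authenticated as.
function validateTenantId(tenantId, authenticatedTenantId) {
  if (typeof tenantId !== 'string' || !TENANT_ID_PATTERN.test(tenantId)) {
    return { ok: false, reason: 'malformed' };
  }
  if (!knownTenants.has(tenantId)) {
    return { ok: false, reason: 'unknown' };
  }
  if (tenantId !== authenticatedTenantId) {
    return { ok: false, reason: 'mismatch' };
  }
  return { ok: true };
}

console.log(validateTenantId('BlueTenant', 'BlueTenant'));  // { ok: true }
console.log(validateTenantId('GreenTenant', 'BlueTenant')); // { ok: false, reason: 'mismatch' }
```

The mismatch check is the part that addresses tenant ID manipulation: a well-formed, known tenant ID is still rejected if it does not belong to the authenticated caller.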

Sample code

Follow the instructions in this GitHub repository to provision a sample project in your own account and see the new Lambda tenant isolation mode in action. The sample project illustrates how to integrate a function using the new tenant isolation mode with Amazon API Gateway and propagate tenant identity from client requests.

Conclusion

The new tenant isolation mode for Lambda simplifies building serverless multi-tenant SaaS applications on AWS. By automatically managing application tenant-level compute environment isolation, this capability eliminates the need for custom isolation logic or separate tenant functions, allowing you to focus on the core business logic while AWS handles the complexities of tenant-aware compute environment isolation.

Combined with the existing security features in Lambda, rapid scaling, and pay-per-use pricing, tenant isolation mode makes Lambda an even more compelling choice for modern SaaS applications, whether you’re building new solutions or enhancing existing ones.

To learn more, refer to the documentation for tenant isolation. For details on pricing, refer to Lambda’s pricing page.

Improve API discoverability with the new Amazon API Gateway Portal

2025-11-20 Giedrius Praspaliauskas

Post Syndicated from Giedrius Praspaliauskas original https://aws.amazon.com/blogs/compute/improve-api-discoverability-with-the-new-amazon-api-gateway-portal/

Amazon API Gateway now provides a fully managed portal feature, Amazon API Gateway Portal, that eliminates the need for static websites, open source solutions, or third-party offerings, which often led to fragmented API lifecycle management and increased costs. API Gateway Portal integrates with the API Gateway service and offers features like API products, interactive “Try it” functionality, and documentation for your API portfolio.

This fully managed solution addresses the need for a seamless way to showcase APIs and help developers quickly find, try, and integrate with them. By providing a managed solution that handles infrastructure, security, and scalability, API providers can focus on creating valuable APIs and delivering a great developer experience.

In this post, we will show how you can use the new portal feature to create customizable portals with enhanced security features in minutes, with APIs from multiple accounts, without managing any infrastructure.

Overview

A developer portal is a web page where API providers can share their APIs and API documentation by grouping them into portal products. Each portal product is a logical grouping of REST APIs and contains the documentation that you create and publish for your API consumers. Product pages within a portal contain the custom documentation at the portal product level. Product REST endpoint pages contain the documentation for each of the REST APIs with the details of the path and method of a REST API and the stage it’s deployed to. The combination of Product pages and Product REST endpoint pages provides the complete documentation your API consumers need to start using your REST APIs.

This abstraction allows you to organize endpoints from multiple APIs and stages into coherent product offerings for your consumers. For example, if you operate multiple APIs supporting a pet adoption service, you can create an “AdoptAnimals” portal product that groups dog-related endpoints from one API with cat-related endpoints from another API, while organizing user management functions into a separate “AdoptProcess” portal product.
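
The grouping in this example can be pictured as a simple data model. The sketch below is illustrative only and does not use the API Gateway API: the API names, stages, paths, and the `buildPortalProducts` helper are hypothetical.

```javascript
// Hypothetical endpoints spread across three separate APIs.
const endpoints = [
  { api: 'dogs-api', stage: 'prod', method: 'GET', path: '/dogs' },
  { api: 'cats-api', stage: 'prod', method: 'GET', path: '/cats' },
  { api: 'users-api', stage: 'prod', method: 'POST', path: '/users' },
];

// Which portal product each API's endpoints belong to.
const productForApi = {
  'dogs-api': 'AdoptAnimals',
  'cats-api': 'AdoptAnimals',
  'users-api': 'AdoptProcess',
};

// Group endpoints by product, regardless of which API they come from.
function buildPortalProducts(allEndpoints, assignment) {
  const products = {};
  for (const e of allEndpoints) {
    const product = assignment[e.api];
    if (product === undefined) continue; // endpoint not published in any product
    (products[product] ??= []).push(`${e.method} ${e.path} (${e.api}/${e.stage})`);
  }
  return products;
}

console.log(buildPortalProducts(endpoints, productForApi));
```

The resulting products follow the consumer-facing grouping (animals versus adoption process) rather than the boundaries of the underlying APIs.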

With this flexibility you can present your APIs in a way that matches your business logic rather than your technical architecture and organize your APIs in ways that make the most sense for your consumers. For large enterprises managing extensive API portfolios, API Gateway Portal offers centralized catalogs of APIs across business groups, reducing duplicate work and improving standardization.

The portal feature automatically creates developer portals that display APIs with documentation, interactive testing capabilities, and integrated consumer analytics. The platform uses AWS Resource Access Manager (RAM) for multi-account API sharing, Amazon Cognito for access control, and Amazon CloudWatch for centralized monitoring.

Key features of API Gateway Portal

The API Gateway Portal provides comprehensive functionality for both API providers and consumers.

The following is a list of the key features that were introduced by the service at launch:

Customizable portal experience: You control your portal’s branding through custom logos and color schemes. You can configure custom domain names with SSL certificates managed by AWS Certificate Manager, or use the default domain structure provided by AWS.

Flexible access control: You can control access to developer portals using Amazon Cognito, and configure portals either to be publicly accessible or to require authentication. Integration with Cognito user pools provides secure and scalable identity and access management that is enterprise-grade, cost-effective, and customizable. For organizations using existing identity systems, Cognito supports federation with SAML and OpenID Connect identity providers.

Cross-account API organization: The portal supports sharing portal products across AWS accounts using AWS RAM, so that organizations can create a unified API catalog while maintaining flexibility for API providers to develop and maintain APIs in their own accounts. When you share a portal product with another account, that account cannot modify any properties of your portal product or product endpoint pages, so API providers maintain control over their APIs while still enabling discovery across the organization. The cross-account sharing capabilities provide significant governance benefits for enterprise customers, including centralized discovery, standardization, reduced duplication, clear ownership, and controlled access.

Documentation: Beyond API reference documentation synchronized from your API definitions, you can add supplemental documentation including guides, use cases, and integration examples.

Search, discovery, and interactive API exploration: Consumers can search across your entire catalog. The portal provides intuitive customizable navigation and organization to help users find the right endpoints for their needs. Using the “Try It” functionality consumers can try APIs directly from the portal. Users can input request parameters, headers, and see live responses, reducing time-to-value for API integrations. This environment includes built-in limits for security and cost control.

Access control and governance

Amazon API Gateway Portal provides security and governance capabilities essential for production deployments.

Identity and access management: Integration with Cognito user pools provides secure and scalable identity and access management that is enterprise-grade, cost-effective, and customizable, including multi-factor authentication, password policies, and user lifecycle management.

API authorization: The portal respects existing authorization mechanisms configured on your APIs, including AWS IAM, Lambda authorizers, and Cognito user pools. Portal access doesn’t bypass your established security controls.

Cross-account governance: When sharing portal products across accounts using AWS RAM, the original API owners retain full control over their endpoints, including authorization strategies, integration configurations, and stage settings. Portal owners can use shared portal products but cannot modify the underlying API configurations.

Audit and monitoring: All portal management activities integrate with AWS CloudTrail for comprehensive audit logging. You can use Amazon CloudWatch RUM to perform real user monitoring to collect and view analytics about API consumers in near real time.

Resource limits: The service includes built-in quotas to prevent abuse, including limits on API testing rates, payload sizes, and integration timeouts. With these limits, the “Try It” functionality cannot impact your production API performance.

Getting Started

Setting up a portal involves three main steps: creating portal products, configuring the portal, and publishing for consumer access. We will walk through those steps in more detail.

Create portal product

The following procedure shows you how to create a portal product:

Navigate to the API Gateway console and select Portal products from the main navigation.
Choose Create portal product and specify your portal product details including name, description, and visibility settings.
Next, select the endpoints you want to include in this portal product. You can choose entire API stages or specific resources and methods, and even rename endpoints with user-friendly names for better discoverability.
The system automatically imports your API documentation. You can improve the documentation with additional context, use cases, and examples later.
Organize product endpoints into custom categories that reflect your business logic rather than technical implementation details.

Configure the developer portal

The following procedure shows how to create a portal.

Select Developer portals in the API Gateway console navigation.
Specify your portal name, description, and domain configuration.
Choose between adding your prefix to the default AWS domain or configuring a custom domain name with your own SSL certificate.
Configure access control by selecting authentication requirements. For internal portals, you might require Amazon Cognito authentication, while public portals can allow anonymous access to documentation.
Upload your logo and select color themes to match your brand identity.
Add your portal products. You can include products from your account or products shared with you from other accounts through AWS RAM. The portal provides search and filtering capabilities for consumers.

Preview and publish

Before making your portal publicly available, use the preview functionality to review the consumer experience. The preview shows exactly how your portal will appear to users, including navigation, documentation, and available API testing capabilities.

When you’re satisfied with the configuration, choose Publish portal to make it accessible to consumers. The publishing process typically completes within a few minutes, and API Gateway provides the final portal URL for distribution to your consumers.

Conclusion and next steps

The new API Gateway Portal eliminates the complexity of building and maintaining custom API documentation sites. Your developers get a professional, feature-rich experience where they can discover and try your APIs immediately. Plus, since everything stays within AWS, you get built-in security, simplified operations, and comprehensive observability through integration with services like CloudWatch and CloudTrail.

Ready to streamline your API discovery experience? Here’s how to get started:

Visit the API Gateway console to create your first portal
Follow our step-by-step tutorial in the Developer Guide
Learn more about API development on Serverless Land

Building responsive APIs with Amazon API Gateway response streaming

2025-11-20 Anton Aleksandrov

Post Syndicated from Anton Aleksandrov original https://aws.amazon.com/blogs/compute/building-responsive-apis-with-amazon-api-gateway-response-streaming/

Today, AWS announced support for response streaming in Amazon API Gateway to significantly improve the responsiveness of your REST APIs by progressively streaming response payloads back to the client. With this new capability, you can use streamed responses to enhance user experience when building LLM-driven applications (such as AI agents and chatbots), improve time-to-first-byte (TTFB) performance for web and mobile applications, stream large files, and perform long-running operations while reporting incremental progress using protocols such as server-sent events (SSE).

In this post you will learn about this new capability, the challenges it addresses, and how to use response streaming to improve the responsiveness of your applications.

Overview

Consider this scenario – you’re running an AI-powered agentic application that uses an Amazon Bedrock foundation model. Your users interact with the application through an API, asking complex questions that require detailed responses. Before response streaming, users would send their prompts and wait to eventually receive the application response, sometimes for tens of seconds. This awkward pause between questions and responses created a disconnected, unnatural experience.

With the new API Gateway response streaming capability, the interaction through the API becomes much more fluid and natural. As soon as your application starts processing the model response, you can stream it back to your users using the API Gateway.

The following animation illustrates this significant user experience improvement. The prompt on the left is processed using a non-streaming response, so the user has to wait several seconds to receive the result. The prompt on the right uses the new API Gateway response streaming, significantly reducing TTFB and improving user experience.

Figure 1. Comparing user experience before (left) and after (right) enabling API Gateway response streaming when returning a response from a Bedrock foundation model.

Your users can now see AI responses appear in real-time, word by word, just like watching someone type. This immediate feedback makes your applications feel more responsive and engaging, keeping users connected throughout the interaction. In addition, you don’t have to worry about response size limits or implement complex workarounds – the streaming happens automatically and efficiently, letting you focus on building great user experiences rather than managing infrastructure constraints.

Understanding response streaming

In the traditional request-response model, responses must be fully computed before being sent to the client. This can negatively impact user experience – the client must wait for the complete response to be generated on the server-side and transmitted over-the-wire. This is especially pronounced in interactive, latency-sensitive cloud applications such as AI agents, chatbots, virtual assistants, or music generators.

Figure 2. Response is returned to the client only after it’s been fully generated, increasing time-to-first-byte latency.

Another important scenario is returning larger response payloads, such as images, large documents, or datasets. In some cases, these payloads may exceed the 10 MB response size limit or default integration timeout limit of 29 seconds of API Gateway. Before the launch of response streaming, developers worked around these limitations by using pre-signed Amazon S3 URLs to download large responses or accepting lower RPS for an increase in timeout. While functional, these workarounds introduced additional latency and architectural complexity.

With response streaming support, you can address these challenges. You can now update your REST APIs to return streamed responses, significantly enhancing user experience, improving TTFB performance, supporting response payloads larger than 10 MB, and serving requests that take up to 15 minutes.

Figure 3. Response streaming reduces time-to-first-byte and improves user experience.

The response streaming capability is already delivering significant performance improvements for organizations:

“Working closely with the AWS teams to enable response streaming was instrumental in advancing our roadmap to deliver the most performant storefront experiences for our largest customers at Salesforce Commerce Cloud. Our collaboration exceeded our Core Web Vital goals; we saw our Total Blocking Time metrics drop by over 98%, which will enable our customers to drive higher revenue and conversion rates,” says Drew Lau, Senior Director of Product Management at Salesforce.

Response streaming is supported for any HTTP-proxy integration, AWS Lambda functions (using proxy integration mode), and private integrations. To get started, configure your API integration to stream the response from your backend, as described in the following sections, and redeploy your API for changes to take effect.

Getting started with response streaming

To enable response streaming for your REST APIs, update your integration configuration to set the response transfer mode to STREAM. This enables API Gateway to start streaming the response to the client as soon as response bytes become available. When using response streaming, you can configure a request timeout of up to 15 minutes. For the best time-to-first-byte experience, AWS strongly recommends that your backend integration also implement response streaming.

You can enable response streaming in several different ways, as illustrated in the following snippets:

Using the API Gateway console, when creating method integrations, select Stream for the Response transfer mode.

Figure 4. Enabling response streaming in API Gateway Console.

Setting response transfer mode using the OpenAPI specification:

paths:
  /products:
    get:
      x-amazon-apigateway-integration:
        httpMethod: "GET"
        uri: "https://example.com"
        type: "http_proxy"
        timeoutInMillis: 300000
        responseTransferMode: "STREAM"

Setting response transfer mode using infrastructure-as-code (IaC) frameworks, such as AWS CloudFormation. Note the /response-streaming-invocations fragment at the end of the Uri, which tells API Gateway to use the Lambda InvokeWithResponseStreaming endpoint:

MyProxyResourceMethod:
  Type: 'AWS::ApiGateway::Method'
  Properties:
    RestApiId: !Ref LambdaSimpleProxy
    ResourceId: !Ref ProxyResource
    HttpMethod: ANY
    Integration:
      Type: AWS_PROXY
      IntegrationHttpMethod: POST
      ResponseTransferMode: STREAM
      Uri: !Sub arn:aws:apigateway:${APIGW_REGION}:lambda:path/2021-11-15/functions/${FN_ARN}/response-streaming-invocations

Updating response transfer mode using the AWS CLI:

aws apigateway update-integration \
   --rest-api-id a1b2c2 \
   --resource-id aaa111 \
   --http-method GET \
   --patch-operations "op='replace',path='/responseTransferMode',value=STREAM" \
   --region us-west-2
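
The same change can also be scripted. The following is a minimal sketch, assuming the boto3 SDK is installed and that its update_integration call accepts the same patch operation as the CLI example above. The helper names and the stage name argument are hypothetical. The sketch also creates a deployment, because the change takes effect only after the API is redeployed.

```python
from typing import Any


def build_stream_patch(rest_api_id: str, resource_id: str, http_method: str) -> dict[str, Any]:
    """Build update_integration arguments that set the transfer mode to STREAM."""
    return {
        "restApiId": rest_api_id,
        "resourceId": resource_id,
        "httpMethod": http_method,
        "patchOperations": [
            # Mirrors the --patch-operations value from the CLI example.
            {"op": "replace", "path": "/responseTransferMode", "value": "STREAM"}
        ],
    }


def enable_streaming(rest_api_id: str, resource_id: str, http_method: str,
                     stage_name: str, region: str) -> None:
    """Apply the patch and redeploy the API (hypothetical helper)."""
    import boto3  # imported lazily so build_stream_patch works without the SDK

    client = boto3.client("apigateway", region_name=region)
    client.update_integration(**build_stream_patch(rest_api_id, resource_id, http_method))
    # Redeploy so the updated integration takes effect on the stage.
    client.create_deployment(restApiId=rest_api_id, stageName=stage_name)
```

For example, enable_streaming("a1b2c2", "aaa111", "GET", "prod", "us-west-2") would apply the change to the placeholder API from the CLI example; "prod" is an assumed stage name.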

Using response streaming with Lambda functions

When using Lambda functions as a downstream integration endpoint, your Lambda functions must be streaming-enabled. API Gateway uses the InvokeWithResponseStreaming API to invoke functions, as illustrated in the following diagram, and requires Lambda proxy integration. See the API Gateway documentation for additional guidance.

Figure 5. Using API Gateway response streaming with Lambda functions for interactive AI applications.

When you use response streaming with Lambda functions, API Gateway expects the handler response stream to contain the following components (in order):

JSON response metadata – Must be a valid JSON object and can only contain statusCode, headers, multiValueHeaders, and cookies fields (all optional). Metadata cannot be an empty string; at a minimum it must be an empty JSON object.
The 8-null-byte delimiter – Lambda adds this delimiter automatically when you use the built-in awslambda.HttpResponseStream.from() method, as illustrated below. When not using this method, you’re responsible for adding the delimiter yourself.
Response payload – Can be empty.
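
To make the byte layout of these three components concrete, here is a minimal sketch in Python. It only illustrates the framing described in the list above; it is not how the Lambda runtime implements it, and the function names frame_response and split_response are hypothetical.

```python
import json

DELIMITER = b"\x00" * 8  # the 8-null-byte delimiter described above
ALLOWED_FIELDS = {"statusCode", "headers", "multiValueHeaders", "cookies"}


def frame_response(metadata: dict, payload: bytes = b"") -> bytes:
    """Return metadata JSON, then the delimiter, then the (possibly empty) payload."""
    unexpected = set(metadata) - ALLOWED_FIELDS
    if unexpected:
        raise ValueError(f"unsupported metadata fields: {sorted(unexpected)}")
    # An empty dict serializes to "{}", the minimum valid metadata.
    return json.dumps(metadata).encode("utf-8") + DELIMITER + payload


def split_response(stream: bytes) -> tuple[dict, bytes]:
    """Split a framed stream back into its metadata and payload."""
    head, _, body = stream.partition(DELIMITER)
    return json.loads(head), body


framed = frame_response({"statusCode": 200}, b"hello world")
metadata, payload = split_response(framed)
```

Because JSON escapes control characters, the metadata section never contains raw null bytes, so the first run of eight null bytes always marks the boundary.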

The following code snippet illustrates how you can return a streamed response from your Lambda functions so it will be compatible with API Gateway response streaming:

export const handler = awslambda.streamifyResponse(
   async (event, responseStream, context) => {

      const httpResponseMetadata = {
         statusCode: 200,
         headers: {
            'Content-Type': 'text/plain',
            'X-Custom-Header': 'some-value'
         }
      };

      responseStream = awslambda.HttpResponseStream.from(
         responseStream,
         httpResponseMetadata
      );

      responseStream.write('hello');
      await new Promise(r => setTimeout(r, 1000));
      responseStream.write(' world');
      await new Promise(r => setTimeout(r, 1000));
      responseStream.write('!!!');
      responseStream.end();
   }
);

Refer to the API Gateway documentation for further implementation guidelines.

Using response streaming with HTTP Proxy integrations

You can stream HTTP responses from your applications used as downstream integration endpoints, for example web servers running on Amazon Elastic Container Service (Amazon ECS) or Amazon Elastic Kubernetes Service (Amazon EKS). In this case, you must use HTTP_PROXY integration and specify the response transfer mode as STREAM (using the console, AWS CLI, or IaC). Redeploy your API after modifying it.

Figure 6. Using API Gateway response streaming with HTTP server applications.

Once API Gateway receives a streaming response from your application, it waits until the transfer of the HTTP headers block is complete. It then sends the HTTP response status code and headers to the client, followed by the content from your application as API Gateway receives it. It continues streaming the response from your application to the client until the stream ends (up to 15 minutes).

Many popular API and web application development frameworks provide response streaming abstractions. The following code snippet illustrates how you can implement HTTP response streaming using FastAPI:

import asyncio

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def stream_response():
   yield b"Hello "
   await asyncio.sleep(1)
   yield b"World "
   await asyncio.sleep(1)
   yield b"!"

@app.get("/")
async def main():
   return StreamingResponse(stream_response(), media_type="text/plain")

Adding real-time response streaming to your HTTP clients

Different HTTP clients have different ways to process streamed response fragments as they arrive. The following code snippet illustrates how to process a streamed response with a Node.js application:

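// API Gateway endpoints are served over HTTPS, so this uses the https module.
import https from 'node:https';

// Placeholder target; replace with your API's hostname and path.
const options = { hostname: 'example.com', path: '/', method: 'GET' };

const request = https.request(options, (response) => {
   response.on('data', (chunk) => {
      // Each chunk is a Buffer containing the newest response fragment.
      console.log(chunk.toString());
   });

   response.on('end', () => {
      console.log('Response complete');
   });
});

request.end();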

When using curl, you can use the --no-buffer argument to print response fragments as they arrive.

curl --no-buffer {URL}
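
A Python client can consume fragments in a similar way. The following is a minimal sketch using only the standard library; the function names are hypothetical, and it assumes the response body is UTF-8 text. It relies on read1(), which returns as soon as some data is available rather than waiting for the whole body.

```python
import codecs
import io
import urllib.request
from typing import Iterator


def iter_fragments(stream: io.BufferedIOBase, max_bytes: int = 4096) -> Iterator[bytes]:
    """Yield response fragments as they arrive instead of waiting for the full body."""
    while True:
        # read1() returns whatever is available, up to max_bytes.
        fragment = stream.read1(max_bytes)
        if not fragment:  # an empty result means the stream has ended
            return
        yield fragment


def print_streamed(url: str) -> None:
    """Print a streamed text response fragment by fragment."""
    # The incremental decoder handles characters split across fragments.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    with urllib.request.urlopen(url) as response:
        for fragment in iter_fragments(response):
            print(decoder.decode(fragment), end="", flush=True)
    print(decoder.decode(b"", final=True))


# Local demonstration with an in-memory stream instead of a network call.
demo = list(iter_fragments(io.BytesIO(b"hello world"), max_bytes=5))
```

Calling print_streamed with the URL of a streaming endpoint prints each fragment as soon as it is received.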

Sample code

Clone this sample project from GitHub to see API Gateway response streaming in action. Follow instructions in the README.md to provision the sample project in your AWS account.

Considerations

Before you enable response streaming, consider:

Response streaming is available for REST APIs and can be used with HTTP_PROXY integrations, Lambda integrations (in proxy mode), and private integrations.
You can use API Gateway response streaming with any endpoint type, such as Regional, Private, and Edge-optimized, with or without custom domain names.
When using response streaming, you can configure response timeouts up to 15 minutes, according to your scenario requirements.
All streaming responses from Regional or Private endpoints are subject to a 5-minute idle timeout. All streaming responses from edge-optimized endpoints are subject to a 30-second idle timeout.
Within each streaming response, the first 10MB of response payload is not subject to any bandwidth restrictions. Response payload data exceeding 10MB is restricted to 2MB/s.
Response streaming is compatible with API Gateway security capabilities such as authorizers, WAF, access controls, TLS/mTLS, request throttling, and access logging.
When processing streamed responses, the following features are not supported: response transformation with VTL, integration response caching, and content encoding.
Always protect your APIs against unauthorized access and other potential security threats by implementing proper authorization with Lambda Authorizers or Amazon Cognito User Pools. Read REST API protection documentation and API Gateway security documentation for additional details.
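
The bandwidth rule above gives a rough lower bound on how long a large response takes to deliver. The following sketch works through that arithmetic. It assumes 1 MB means 1,048,576 bytes, which the source does not specify, and it ignores network conditions and backend speed, so treat the result as an estimate only.

```python
MB = 1024 * 1024  # assumption: the source does not say whether MB is binary or decimal
UNRESTRICTED_BYTES = 10 * MB          # first 10MB is not bandwidth-restricted
THROTTLED_BYTES_PER_SECOND = 2 * MB   # the remainder is limited to 2MB/s


def min_throttled_seconds(payload_bytes: int) -> float:
    """Minimum time spent sending the bandwidth-restricted part of a payload."""
    throttled_bytes = max(0, payload_bytes - UNRESTRICTED_BYTES)
    return throttled_bytes / THROTTLED_BYTES_PER_SECOND


# A 30 MB response has 20 MB subject to the 2MB/s limit: at least 10 seconds.
seconds_for_30_mb = min_throttled_seconds(30 * MB)
```

Comparing this estimate with your configured timeout can help you judge whether a given payload size is practical to stream.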

Observability

You can continue using existing observability capabilities, such as execution logs, access logs, AWS X-Ray integration, and Amazon CloudWatch metrics with API Gateway response streaming.

In addition to the existing access logs variables, the following new variables are available:

$context.integration.responseTransferMode – the response transfer mode of your integration. This can be either BUFFERED or STREAMED.
$context.integration.timeToAllHeaders – the time between when API Gateway establishes the integration connection to when it receives all integration response headers from the client.
$context.integration.timeToFirstContent – the time between when API Gateway establishes the integration connection to when it receives the first content bytes.
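
As an illustration, these variables can be combined into a single-line JSON access log format. The sketch below builds such a format string in Python. The JSON key names are arbitrary choices made for this example, and $context.requestId is an existing access log variable included for correlation.

```python
import json


def access_log_format() -> str:
    """Build a single-line JSON access log format using the streaming variables."""
    fields = {
        "requestId": "$context.requestId",
        "transferMode": "$context.integration.responseTransferMode",
        "timeToAllHeaders": "$context.integration.timeToAllHeaders",
        "timeToFirstContent": "$context.integration.timeToFirstContent",
    }
    # json.dumps emits one line, which suits an access log format string.
    return json.dumps(fields)


log_format = access_log_format()
```

The resulting string can be supplied as the access log format for a stage. With this format, the timing values are logged as quoted strings.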

See API Gateway documentation for more information.

Pricing

With this new capability, you continue to pay the same API Invoke rates for streamed responses. Each 10MB of response data, rounded up to the nearest 10MB, is billed as a single request. See API Gateway pricing page for additional details.
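
The rounding rule can be expressed as a short calculation. This sketch assumes that 10MB means 10 × 1,048,576 bytes and that an empty response still counts as one request; neither detail is stated above, so check the pricing page before relying on it.

```python
BILLING_UNIT_BYTES = 10 * 1024 * 1024  # assumption: binary megabytes


def billed_requests(response_bytes: int) -> int:
    """Number of billed requests for one streamed response, rounding up per 10MB."""
    # Integer ceiling division avoids floating-point rounding issues.
    units = -(-response_bytes // BILLING_UNIT_BYTES)
    # Assumption: a response with no payload is still billed as one request.
    return max(1, units)


# A 25 MB streamed response spans three 10MB units.
requests_for_25_mb = billed_requests(25 * 1024 * 1024)
```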

Conclusion

The new response streaming capability for Amazon API Gateway enhances how you can build and deliver responsive APIs in the cloud. With immediate streaming of response data as it becomes available, you can significantly improve time-to-first-byte performance and overcome traditional payload size and timeout limitations. This is particularly valuable for AI-powered applications, file transfers, and interactive web experiences that demand real-time responsiveness.

To learn more about API Gateway response streaming, see the service documentation.

To learn more about building serverless architectures, see Serverless Land.