Tag Archives: Intermediate (200)

Achieve 2x faster data lake query performance with Apache Iceberg on Amazon Redshift

Post Syndicated from Kalaiselvi Kamaraj original https://aws.amazon.com/blogs/big-data/achieve-2x-faster-data-lake-query-performance-with-apache-iceberg-on-amazon-redshift/

With the growing adoption of open table formats like Apache Iceberg, Amazon Redshift continues to advance its capabilities for open format data lakes. In 2025, Amazon Redshift delivered several performance optimizations that improved query performance over twofold for Iceberg workloads on Amazon Redshift Serverless, delivering exceptional performance and cost-effectiveness for your data lake workloads.

In this post, we describe some of the optimizations that led to these performance gains. Data lakes have become a foundation of modern analytics, helping organizations store vast amounts of structured and semi-structured data in cost-effective data formats like Apache Parquet while maintaining flexibility through open table formats. This architecture creates unique performance optimization opportunities across the entire query processing pipeline.

Performance enhancements

Our latest enhancements span multiple areas of the Amazon Redshift SQL query processing engine, including vectorized scanners that accelerate execution, optimal query plans powered by just-in-time (JIT) runtime statistics, distributed Bloom filters, and new decorrelation rules.

The following chart summarizes the performance improvements achieved so far in 2025, as measured by industry standard 10 TB TPC-DS and TPC-H benchmarks run on Iceberg tables on an 88 RPU Redshift Serverless endpoint.

Find the best performance for your workloads

The performance results presented in this post are based on benchmarks derived from the industry-standard TPC-DS and TPC-H benchmarks, and have the following characteristics:

  • The schema and data of Iceberg tables are used unmodified from TPC-DS. Tables are partitioned to reflect real-world data organization patterns.
  • The queries are generated using the official TPC-DS and TPC-H kits with query parameters generated using the default random seed of the kits.
  • The TPC-DS test includes all 99 TPC-DS SELECT queries. It doesn’t include maintenance and throughput steps. The TPC-H test includes all 22 TPC-H SELECT queries.
  • Benchmarks are run out of the box: no manual tuning or stats collection is done for the workloads.

In the following sections, we discuss key performance improvements delivered in 2025.

Faster data lake scans

To improve data lake read performance, the Amazon Redshift team built a completely new scan layer designed from the ground up for data lakes. This new scan layer includes a purpose-built I/O subsystem that incorporates smart prefetch capabilities to reduce data latency. Additionally, the new scan layer is optimized for processing Apache Parquet files, the most commonly used file format for Iceberg, through fast vectorized scans.

This new scan layer also includes sophisticated data pruning mechanisms that operate at both partition and file levels, dramatically reducing the volume of data that needs to be scanned. This pruning capability works in harmony with the smart prefetch system, creating a coordinated approach that maximizes efficiency throughout the entire data retrieval process.
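
To make the two pruning levels concrete, the following is a minimal Python sketch, not the Redshift implementation. It assumes each data file carries a partition value and min/max bounds for one column, as table metadata commonly does. The `DataFile` type, its field names, and the file names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    partition: str    # hypothetical partition value, for example a sale date
    min_price: float  # lower bound of the price column in this file
    max_price: float  # upper bound of the price column in this file


def prune(files, partition, price_at_least):
    """Keep only files that could contain rows matching both predicates."""
    kept = []
    for f in files:
        if f.partition != partition:
            continue  # partition-level pruning: wrong partition, skip the file
        if f.max_price < price_at_least:
            continue  # file-level pruning: column bounds rule the file out
        kept.append(f)
    return kept


files = [
    DataFile("a.parquet", "2025-01-01", 1.0, 9.0),
    DataFile("b.parquet", "2025-01-01", 10.0, 99.0),
    DataFile("c.parquet", "2025-01-02", 50.0, 80.0),
]
to_scan = prune(files, partition="2025-01-01", price_at_least=20.0)
print([f.path for f in to_scan])
```

Only the files that survive both checks need to be fetched, which is also what lets a prefetcher focus on data that will actually be read.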

JIT ANALYZE for Iceberg tables

Unlike traditional data warehouses, data lakes often lack comprehensive table- and column-level statistics about the underlying data, making it challenging for the planner and optimizer in the query engine to choose the best execution plan up front. Sub-optimal plans can lead to slower and less predictable performance.

JIT ANALYZE is a new Amazon Redshift feature that automatically collects and uses statistics for Iceberg tables during query execution—minimizing manual statistics collection while giving the planner and optimizer in the query engine the information it needs to generate optimal query plans. The system uses intelligent heuristics to identify queries that will benefit from statistics, performs fast file-level sampling using Iceberg metadata, and extrapolates population statistics using advanced techniques.

JIT ANALYZE delivers out-of-the-box performance nearly equal to that of queries that have pre-calculated statistics, while providing the foundation for many other performance optimizations. Some TPC-DS queries ran 50 times faster with these statistics.
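
The general idea of sampling files and extrapolating can be illustrated with a small Python sketch. This is an assumption-laden illustration, not how JIT ANALYZE is implemented: it takes the total row count as known (as it would be from table metadata), reads only every fourth file, and scales the observed selectivity to the whole table. The function name `estimate_matching_rows` and the synthetic data are hypothetical.

```python
def estimate_matching_rows(files, predicate, sample_every=4):
    """Estimate how many rows match `predicate` by reading a sample of files.

    `files` is a list of lists; each inner list stands in for one data file.
    """
    total_rows = sum(len(f) for f in files)  # assumed known from metadata
    sample = files[::sample_every]           # systematic file-level sample
    sampled_rows = sum(len(f) for f in sample)
    if sampled_rows == 0:
        return 0
    matches = sum(1 for f in sample for value in f if predicate(value))
    selectivity = matches / sampled_rows     # fraction of sampled rows that match
    return round(selectivity * total_rows)   # extrapolate to the full table


# 20 hypothetical files of 100 values each, cycling through 0..9.
files = [[(i + j) % 10 for j in range(100)] for i in range(20)]
estimate = estimate_matching_rows(files, lambda v: v < 3)
actual = sum(1 for f in files for v in f if v < 3)
print(estimate, actual)
```

An estimate like this is enough for a planner to decide, for example, which side of a join is smaller, without scanning every file first.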

Query optimizations

For correlated subqueries, such as those that contain EXISTS or IN clauses, Amazon Redshift uses decorrelation rules to rewrite the queries. In many cases, these decorrelation rules did not produce optimal plans, which degraded query execution performance. To address this, we introduced a new internal join type, SEMI JOIN, and a new decorrelation rule based on this join type. This rule helps produce better plans, thereby improving execution performance. For instance, one of the TPC-DS queries that contains an EXISTS clause ran 7 times faster with this optimization.
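
The following Python sketch shows why a semi join is a natural target for decorrelating an EXISTS subquery. It is a conceptual illustration under simple assumptions (in-memory lists, a hash set as the build side), not the Redshift rewrite rule itself. The table and column names are hypothetical.

```python
def exists_nested_loop(customers, orders):
    """Correlated EXISTS evaluated naively: rescan orders for every customer."""
    return [c for c in customers
            if any(o["customer_id"] == c["id"] for o in orders)]


def exists_semi_join(customers, orders):
    """Same result as a hash semi join: build once, probe once per customer."""
    order_keys = {o["customer_id"] for o in orders}  # build side
    return [c for c in customers if c["id"] in order_keys]  # probe side


customers = [{"id": 1}, {"id": 2}, {"id": 3}]
orders = [{"customer_id": 1}, {"customer_id": 1}, {"customer_id": 3}]

print(exists_nested_loop(customers, orders))
print(exists_semi_join(customers, orders))
```

Unlike an inner join, the semi join returns each customer at most once even when several orders match, so no extra deduplication step is needed after the rewrite.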

We also introduced a distributed Bloom filter optimization for data lake workloads. With this optimization, each compute node builds Bloom filters locally and then distributes them to every other node. Distributing Bloom filters can significantly reduce the amount of data sent over the network for a join by filtering out non-matching tuples earlier. This provides good performance gains for large, complex data lake queries that process and join large amounts of data.
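
As a rough illustration of the build, exchange, and filter steps, here is a small self-contained Python sketch. It is not the Redshift implementation: the `BloomFilter` class, its sizes, the SHA-256 based hashing, and the three simulated nodes are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive several bit positions per key by salting the hash input.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

    def merge(self, other):
        """Union two filters that were built with the same parameters."""
        self.bits |= other.bits


# Each simulated node builds a filter over its local slice of the build side.
node_build_keys = [[1, 2, 3], [40, 50], [600]]
local_filters = []
for keys in node_build_keys:
    bf = BloomFilter()
    for k in keys:
        bf.add(k)
    local_filters.append(bf)

# After the filters are exchanged, every node holds the union of all of them.
combined = BloomFilter()
for bf in local_filters:
    combined.merge(bf)

# Probe-side rows are filtered locally before being sent over the network.
survivors = [k for k in range(1000) if combined.might_contain(k)]
print(len(survivors))
```

A Bloom filter never drops a key that is actually present, so the join result is unchanged; the occasional false positive only means a few extra rows are shipped and then discarded by the join itself.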

Conclusion

These performance improvements for Iceberg workloads represent a major leap forward in Redshift data lake capabilities. By focusing on out-of-the-box performance, we’ve made it straightforward to achieve exceptional query performance without complex tuning or optimization.

These improvements demonstrate the power of deep technical innovation combined with practical customer focus. JIT ANALYZE reduces the operational burden of statistics management while providing optimal query planning information. The new Redshift data lake query engine on Redshift Serverless was rewritten from the ground up for best-in-class scan performance, and lays the groundwork for more advanced performance optimizations. Semi-join optimizations tackle some of the most challenging query patterns in analytical workloads. You can run complex analytical workloads on your Iceberg data and get fast, predictable query performance.

Amazon Redshift is committed to being the best analytics engine for data lake workloads, and these performance optimizations represent our continued investment in that goal.

To learn more about Amazon Redshift and its performance capabilities, visit the Amazon Redshift product page. To get started with Redshift, you can try Amazon Redshift Serverless and start querying data in minutes without having to set up and manage data warehouse infrastructure. For more details on performance best practices, see the Amazon Redshift Database Developer Guide. To stay up-to-date with the latest developments in Amazon Redshift, subscribe to the What’s New in Amazon Redshift RSS feed.


Special thanks to this post’s contributors: Martin Milenkoski, Gerard Louw, Konrad Werblinski, Mengchu Cai, Mehmet Bulut, Mohammed Alkateb, and Sanket Hase.

Accelerate data lake operations with Apache Iceberg V3 deletion vectors and row lineage

Post Syndicated from Ron Ortloff original https://aws.amazon.com/blogs/big-data/accelerate-data-lake-operations-with-apache-iceberg-v3-deletion-vectors-and-row-lineage/

Organizations building petabyte-scale data lakes face increasing challenges as their data grows. Batch updates and compliance deletes create a proliferation of positional delete files, slowing downstream data pipelines and driving up storage costs. Tracking data changes for audit trails and incremental processing requires custom, engine-specific implementations that add complexity and maintenance burden. As data volumes scale, these challenges compound, leaving data teams juggling custom solutions and increasing operational costs just to maintain data freshness and compliance.

Apache Iceberg V3 addresses these challenges with two new capabilities: deletion vectors and row lineage. AWS now delivers these capabilities across Apache Spark on Amazon EMR 7.12, AWS Glue, Amazon SageMaker notebooks, Amazon S3 Tables, and the AWS Glue Data Catalog, giving you a complete, integrated V3 experience without custom implementation. This means faster writes, lower storage costs, comprehensive audit trails, and efficient incremental processing, all working seamlessly across your entire data lake architecture.

In this post, we walk you through the new capabilities in Iceberg V3, explain how deletion vectors and row lineage address these challenges, explore real-world use cases across industries, and provide practical guidance on implementing Iceberg V3 features across AWS analytics, catalog, and storage services.

What’s new in Iceberg V3

Iceberg V3 introduces new capabilities and data types. Two key capabilities that address the challenges discussed earlier are deletion vectors and row lineage.

Deletion vectors replace positional delete files with an efficient binary format stored as Puffin files. Instead of creating a separate delete file for each delete operation, Iceberg V3 consolidates the delete references into a single deletion vector per data file. During query execution, engines efficiently filter out deleted rows using these compact vectors, maintaining query performance while removing the need to merge multiple delete files.

This avoids write amplification from random batch updates and GDPR compliance deletes, significantly reducing the overhead of maintaining fresh data. High-frequency update workloads can see immediate improvements in write performance and reduced storage costs from fewer small delete files. Additionally, having fewer small delete files reduces table maintenance costs for compaction operations.
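
The following Python sketch illustrates the idea of one deletion vector per data file. It is a simplified model, not the on-disk format: a plain Python integer stands in for the compact binary encoding, and the `DeletionVector` class and row values are hypothetical.

```python
class DeletionVector:
    """One bitmap of deleted row positions for a single data file."""

    def __init__(self):
        self.bitmap = 0  # bit i set means row position i is deleted

    def delete(self, position):
        self.bitmap |= 1 << position

    def is_deleted(self, position):
        return bool((self.bitmap >> position) & 1)


rows = ["r0", "r1", "r2", "r3", "r4"]  # rows of one data file, by position
dv = DeletionVector()

# Three separate delete operations all update the same vector instead of
# each producing its own delete file.
for positions in ([1], [3], [1, 4]):
    for pos in positions:
        dv.delete(pos)

# At read time, the engine skips any position marked in the vector.
live_rows = [r for i, r in enumerate(rows) if not dv.is_deleted(i)]
print(live_rows)
```

Because every delete against the file lands in the same structure, a reader consults a single vector per data file rather than merging several delete files.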

Row lineage enables precise change tracking at the row level with full auditability. Row lineage adds metadata fields to each data file that track when rows were created and last modified. The _row_id field uniquely identifies each row, and the _last_updated_sequence_number field tracks the snapshot when the row was last modified. These fields enable efficient change tracking queries without scanning entire tables, and they’re automatically maintained by engines that implement the Iceberg V3 specification, without requiring custom code.

Before row lineage, change tracking in Iceberg provided only the net changes between snapshots, making it difficult to track individual record modifications. Row lineage metadata fields can now be queried to return all incremental changes, giving you full fidelity for auditing data modifications and regulatory compliance. For data transformations, your downstream systems can process changes incrementally, speeding up data pipelines and reducing compute costs for change data capture (CDC) workflows. Row lineage is engine agnostic, interoperable, and built into the Iceberg V3 specification, alleviating the need for custom, engine-specific change tracking implementations.
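
To show how a downstream consumer might use these fields for incremental processing, here is a minimal Python sketch over in-memory rows. The two lineage field names come from the text above; the `changes_since` helper, the checkpoint handling, and the sample rows are assumptions for illustration only.

```python
def changes_since(rows, checkpoint):
    """Return rows modified after `checkpoint`, plus the next checkpoint."""
    changed = [
        r for r in rows if r["_last_updated_sequence_number"] > checkpoint
    ]
    new_checkpoint = max(
        (r["_last_updated_sequence_number"] for r in changed),
        default=checkpoint,  # nothing changed: keep the old checkpoint
    )
    return changed, new_checkpoint


# Hypothetical table contents after an insert (sequence 1) and an update (3).
table = [
    {"_row_id": 0, "_last_updated_sequence_number": 1, "name": "a"},
    {"_row_id": 1, "_last_updated_sequence_number": 3, "name": "b-updated"},
    {"_row_id": 2, "_last_updated_sequence_number": 1, "name": "c"},
]

changed, checkpoint = changes_since(table, checkpoint=1)
print([r["_row_id"] for r in changed], checkpoint)

# A second run with the advanced checkpoint finds nothing new to process.
changed_again, checkpoint_again = changes_since(table, checkpoint)
print(changed_again, checkpoint_again)
```

Storing the last processed sequence number between runs lets a pipeline pick up only the rows touched since its previous run.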

Real-world use cases

The new Iceberg V3 capabilities address critical challenges across multiple industries:

  • Marketing and advertising services organizations – You can now efficiently handle GDPR right-to-be-forgotten requests and regulatory compliance deletes without the write amplification that previously degraded pipeline performance. Row lineage provides complete audit trails for data modifications, meeting strict regulatory requirements for data governance.
  • Ecommerce platforms processing millions of product updates and inventory changes daily – You can maintain data freshness while reducing storage costs. Deletion vectors enable faster upsert operations, helping teams meet aggressive SLA requirements during peak shopping periods.
  • Healthcare and life sciences companies – You can track patient data modifications with precision for compliance purposes while efficiently processing large-scale genomic datasets. Row lineage provides the detailed change history required for clinical trial audits and regulatory submissions.
  • Media and entertainment providers managing large catalogs of user viewing data – You can efficiently process incremental changes for recommendation engines. Row lineage enables downstream analytics systems to process only changed records, reducing compute costs in incremental processing scenarios.

Get started with Iceberg V3

To take advantage of deletion vectors for optimized writes and row lineage for built-in change tracking in Iceberg V3, set the table property format-version = 3 during table creation. Alternatively, setting this property on an existing Iceberg V2 table atomically upgrades the table without data rewrites. Before creating or upgrading V3 tables, make sure the Iceberg engines in your solution are V3-compatible. Refer to Apache Iceberg V3 on AWS for more details.

Create a new V3 table with Apache Spark on Amazon EMR 7.12

The following code creates a new table named customer_data. Setting the table property format-version = 3 creates a V3 table. If the format-version table property is not explicitly set, a V2 table is created. V2 is currently the Iceberg default table version. Setting write.delete.mode, write.update.mode, and write.merge.mode to merge-on-read configures Spark to write deletion vectors for delete, update, or merge statements performed on the table.

CREATE TABLE customer_data (
  customer_id bigint,
  name string,
  email string,
  last_purchase timestamp,
  total_spent decimal(10,2)
)
USING iceberg
TBLPROPERTIES (
  'format-version' = '3',
  'write.delete.mode' = 'merge-on-read',
  'write.update.mode' = 'merge-on-read',
  'write.merge.mode' = 'merge-on-read'
)

Run the following code to insert records into the customer_data table:

INSERT INTO customer_data VALUES
 (1, 'Alejandro Rosalez', '[email protected]', TIMESTAMP '2025-11-24 18:55:27', 42.97)
,(2, 'Akua Mansa', '[email protected]', TIMESTAMP '2025-11-24 17:55:27', 25.02)
,(3, 'Ana Carolina Silva','[email protected]', TIMESTAMP '2025-11-24 16:55:27', 43.67)
,(4, 'Arnav Desai','[email protected]', TIMESTAMP '2025-11-24 15:55:27', 98.32)
,(5, 'Carlos Salazar','[email protected]', TIMESTAMP '2025-11-24 12:55:27', 76.45)

Delete a record where customer_id = 5 to generate a delete file:

DELETE 
  FROM customer_data 
  WHERE customer_id = 5

Updating a record with the following update statement also generates a delete file:

UPDATE customer_data
  SET name = 'Mansa Akua' 
  WHERE customer_id = 2

The last part of this example queries the manifests metadata table to verify that delete files were produced:

SELECT added_snapshot_id
      ,sum(added_delete_files_count) as added_delete_files_count 
FROM customer_data.manifests 
GROUP BY added_snapshot_id 
ORDER BY added_snapshot_id

This query returns three records, as shown in the following screenshot. The added_delete_files_count for the first snapshot, which inserts the records, should be 0. The next two snapshots, for the delete and update statements, should each have an added_delete_files_count of 1.

Query row lineage for change tracking

Row lineage is automatically enabled on V3 tables. The following example includes row lineage metadata fields and an example of how to query table changes after a row lineage sequence number:

SELECT
  customer_id,
  name,
  email,
  _row_id,
  _last_updated_sequence_number
FROM customer_data
WHERE _last_updated_sequence_number > 0
ORDER BY _last_updated_sequence_number, _row_id

Running this query after the previous insert, update, and delete statements returns four records, as shown in the following screenshot. The deleted record is removed. The _last_updated_sequence_number is 3 for the update to customer_id = 2.

Upgrade an existing V2 table

You can upgrade your existing V2 tables to V3 with the following command:

ALTER TABLE existing_customer_data
SET TBLPROPERTIES ('format-version' = '3')

When you upgrade a table from V2 to V3, the following happens:

  • A new metadata snapshot is created atomically, resulting in no data loss.
  • Existing Parquet data files are reused without modification.
  • Row-lineage fields (_row_id and _last_updated_sequence_number) are added to the table metadata.
  • The next compaction operation will remove old V2 positional delete files. If new deletion vector files are generated before compaction runs, they will merge existing V2 positional delete files.
  • New modifications will automatically use V3’s deletion vector files.
  • The upgrade does not perform a historical backfill of row-lineage change tracking records.

The upgrade process is synchronous and completes in seconds for most tables. If the upgrade fails, an error message is returned immediately, and the table remains in its V2 state.

Getting the most from Iceberg V3

In this section, we share the key things we’ve learned from customers already using these features.

Know your workload pattern

Deletion vectors work best when you’re doing lots of writes, such as high-frequency updates, batch deletes, or CDC workloads making random non-append-only updates. If you’re writing more than you’re reading, deletion vectors will deliver immediate performance gains. To unlock these benefits, set your table to merge-on-read mode for delete, update, and merge operations.

Let AWS handle compaction

Enable automatic compaction through the Data Catalog or use S3 Tables (on by default). You will get hands-free optimization without building custom maintenance jobs. Deletion vectors produce fewer delete files than positional deletes in Iceberg V2. Given a similar pattern and amount of modified records, V3 compaction should be quicker and cost less than V2.

Understand the importance of row lineage when using the V2 changelog

With the Spark changelog procedure in Iceberg V2, if a row gets inserted and then deleted between snapshots, it disappears from your change feed—you never see it. Iceberg V3 row lineage captures both operations because _last_updated_sequence_number updates on each modification. This full fidelity is important for audit trails and regulatory compliance where you need to prove what happened to every record. Performance-wise, the V2 changelog requires scanning and merging delete files to compute changes—that’s compute you’re paying for on every read. V3 row lineage stores metadata fields directly on each row, so filtering by _last_updated_sequence_number is a simple metadata scan.

Test before you upgrade

Iceberg V3 upgrades are atomic and fast, but test in dev first. Make sure all your query engines support Iceberg V3 before upgrading shared tables—mixing V2 and V3 engines causes headaches. After upgrading, keep a few V2 snapshots around temporarily for time-travel queries while you validate performance.

Conclusion

Iceberg V3 support across AWS analytics, catalog, and storage services marks a significant advancement in data lake capabilities. By combining deletion vectors’ write optimization with row lineage’s comprehensive change tracking, you can build more efficient, auditable, and cost-effective data lakes at scale. The seamless interoperability across AWS services makes sure your data lake architecture remains flexible and future-proof.

To learn more about AWS support for Iceberg V3, refer to Using Apache Iceberg on AWS.

To learn more about building modern data lakes with Iceberg on AWS, refer to Analytics on AWS.


About the authors

Ron Ortloff

Ron is a Principal Product Manager at AWS.

AWS Private Certificate Authority now supports partitioned CRLs

Post Syndicated from Kartik Bhatnagar original https://aws.amazon.com/blogs/security/aws-private-certificate-authority-now-supports-partitioned-crls/

Public Key Infrastructure (PKI) is essential for securing and establishing trust in digital communications. As you scale your digital operations, you’ll issue and revoke certificates. Revoking certificates is especially useful when employees leave, when you migrate to a new certificate authority hierarchy, when you need to meet compliance requirements, or when you respond to security incidents. You can track revoked certificates using a certificate revocation list (CRL) or the Online Certificate Status Protocol (OCSP). You can use Amazon Web Services (AWS) Private Certificate Authority (AWS Private CA) to create a certificate authority (CA), which publishes revocation information through these methods so that systems can verify certificate validity.

As enterprises continue to scale their operations, they face limitations when using complete CRLs to issue and revoke more than 1 million certificates. The workaround of increasing CRL file sizes isn’t viable, because many applications can’t process large CRL files (with some needing a 1 MB maximum). Furthermore, alternative solutions like OCSP may be rejected by major trust stores and browser vendors due to privacy concerns and compliance requirements. These constraints significantly impact your ability to scale PKI infrastructure efficiently while maintaining security and compliance standards.

Feature release: Addressing challenges

AWS Private CA addresses these challenges with partitioned CRLs, which enable the issuance and revocation of up to 100 million certificates per CA. This feature distributes revocation information across multiple smaller, manageable CRL partitions, each maintaining a maximum size of 1 MB for more effective application compatibility. At the time of issuance, certificates are automatically bound to specific CRL partitions through a critical Issuing Distribution Point (IDP) extension, which contains a unique URI identifying the partition. Validation works by comparing the CRL URI in the certificate’s CRL Distribution Point (CDP) extension against the CRL’s IDP extension, which provides accurate certificate validation.
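
The CDP-to-IDP comparison can be sketched in a few lines of Python. This is a conceptual model only: it skips signature checks, validity periods, and real X.509 parsing, and the `Certificate` and `CrlPartition` types and the example.com URIs are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Certificate:
    serial: int
    cdp_uri: str  # URI from the certificate's CDP extension


@dataclass(frozen=True)
class CrlPartition:
    idp_uri: str  # URI from the CRL's IDP extension
    revoked_serials: frozenset


def is_revoked(cert, crl):
    """Check revocation only against the partition the certificate is bound to."""
    if cert.cdp_uri != crl.idp_uri:
        # A partition that does not cover this certificate proves nothing
        # about it, so it must not be used for the decision.
        raise ValueError("CRL partition does not cover this certificate")
    return cert.serial in crl.revoked_serials


cert = Certificate(serial=1001, cdp_uri="http://crl.example.com/partition-1.crl")
matching_crl = CrlPartition(
    idp_uri="http://crl.example.com/partition-1.crl",
    revoked_serials=frozenset({1001, 1002}),
)
other_crl = CrlPartition(
    idp_uri="http://crl.example.com/partition-2.crl",
    revoked_serials=frozenset(),
)

print(is_revoked(cert, matching_crl))
```

The URI check matters because an empty partition that does not cover a certificate would otherwise make a revoked certificate look valid.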

Partitioned CRLs provide the following:

  • Automatic scaling of certificate issuance limits from 1M to 100M certificates per CA
  • Support for both new and existing CAs
  • Flexible configuration options for CRL naming and paths
  • Backward compatibility, preserving existing complete CRL functionality while offering partitioned CRL as an optional feature
  • Compliance with industry standards such as RFC 5280 while maintaining security and operational efficiency

Configuring Partitioned CRLs in AWS Private CA

You can configure Partitioned CRLs for existing CAs in AWS Private CA by using the following steps.

  1. Choose Private certificate authorities in the left navigation bar.
  2. Choose the hyperlink in the Subject column that is your CA to go into its details.

    Note: Verify that you are in the correct AWS Region.

    Figure 1: Certificate Authority selection

  3. Choose the Revocation configuration tab to see whether CRL distribution is enabled or disabled. If it is disabled, you enable it in the next steps.
    Figure 2: Certification Authority general configuration information

  4. Choose Edit.
  5. Select the Activate CRL distribution checkbox.

    If CRL distribution was enabled already, then skip to step 7.

  6. Under S3 bucket URI, choose an existing bucket from the list. For detailed steps, see Step 6 of the instructions in Create a private CA in AWS Private CA.

    You must verify that Amazon S3 Block Public Access (BPA) is disabled for the account and for the bucket, and you must manually attach a policy to the bucket before you can begin generating CRLs. Use one of the policy patterns described in Access policies for CRLs in Amazon S3. For more information, go to Adding a bucket policy using the Amazon S3 console.

  7. Expand CRL settings for more configuration options.
  8. Select the Enable partitioning checkbox. This creates a partitioned CRL.

    If you don’t enable partitioning, then a complete CRL is created and your CA is subject to the limit of 1M issued or revoked certificates. For more information, go to AWS Private CA quotas.

    Figure 3: Certificate revocation options

  9. Choose Save changes.
  10. CRL distribution shows as enabled with partitioned CRLs. The limit of 1M automatically updates to 100M per CA.
    Figure 4: Certificate revocation configuration

Conclusion

Partitioned CRLs in AWS Private CA can deliver substantial benefits across multiple dimensions. From a security perspective, the feature maintains certificate validation while supporting comprehensive revocation capabilities for up to 100M certificates per CA, so you can respond effectively to security incidents or key compromises. Operationally, it reduces CA rotation, lessening administrative overhead and streamlining PKI management. Keeping each CRL partition at a maximum of 1 MB provides broad compatibility with applications while supporting automated partition management, which makes the feature particularly valuable when you need scalable, standards-compliant certificate management. Regarding compliance, the feature supports WebTrust principles and criteria and ETSI TSP standards, maintains compatibility with RFC 5280, aligns with browser trust store requirements for both CRL and OCSP support, and provides the flexibility needed for emerging standards such as Matter.

Lastly, you can maximize the value of your general purpose or short-lived CA while all certificates remain revocable by enabling Partitioned CRL for no added charge on top of AWS Private CA and Amazon Simple Storage Service (Amazon S3).

Start creating your CA in AWS Private CA using the AWS Management Console.


Kartik Bhatnagar

Kartik is a San Francisco-based Solutions Architect at AWS, specializing in data security. With experience serving both startups and enterprises across fintech, healthcare, and media industries as a DevOps Engineer and Systems Architect, he helps customers design secure, scalable AWS solutions. Off-duty, he enjoys cricket, tennis, food hopping, and hiking.

How Octus achieved 85% infrastructure cost reduction with zero downtime migration to Amazon OpenSearch Service

Post Syndicated from Vaibhav Sabharwal original https://aws.amazon.com/blogs/big-data/how-octus-achieved-85-infrastructure-cost-reduction-with-zero-downtime-migration-to-amazon-opensearch-service/

As data volumes continue to grow exponentially, there is increasing pressure to optimize search infrastructure costs while maintaining the high performance and reliability that mission-critical workloads demand. Many companies find themselves managing complex, expensive search systems that require significant operational overhead and limit their ability to scale efficiently. The challenge becomes even more acute when organizations need to migrate between search systems, a process that traditionally involves substantial downtime, complex data synchronization, and significant impact on business operations. Enterprise applications cannot afford service interruptions that could impact customer experiences, business intelligence, or operational continuity. Migration strategies need to deliver cost optimization and operational improvements while maintaining zero downtime and facilitating complete data integrity throughout the transition process.

Founded in 2013, Octus, formerly Reorg, is the essential credit intelligence and data provider for the world’s leading buy side firms, investment banks, law firms and advisory firms. By surrounding unparalleled human expertise with proven technology, data and AI tools, Octus unlocks powerful truths that fuel decisive action across financial industries.

This post highlights how Octus migrated its Elasticsearch workloads running on Elastic Cloud to Amazon OpenSearch Service. The journey traces Octus’s shift from managing multiple systems to adopting a cost-efficient solution powered by OpenSearch Service. Along the way, we share the architecture choices and implementation strategies that made the migration successful. The result is uninterrupted service availability throughout migration, with improved performance and greater cost efficiency.

Strategic requirements

Octus identified several requirements that made Amazon OpenSearch Service the right choice for its migration:

  • Cost efficiency: The OpenSearch Service pricing model enabled Octus to optimize cloud spend without compromising performance.
  • Responsive support: AWS provided dependable, high-quality support to accelerate issue resolution and instill confidence.
  • Consistent reliability: OpenSearch Service provides an SLA of up to 99.99%, offering the reliability required for Octus’s mission-critical workloads.
  • Seamless migration with no query downtime: Migration Assistant for Amazon OpenSearch Service provided Octus with a migration path while maintaining uninterrupted query availability during the migration, facilitating business continuity.
  • Operational simplification: Consolidating onto AWS reduced infrastructure complexity while maintaining high security standards.

Solution overview

The Migration Assistant for Amazon OpenSearch Service provides a suite of tools to aid in Elasticsearch to OpenSearch Service migrations. Octus used the following capabilities for its migration:

  • Metadata migration: The tool enabled Octus to migrate dozens of indices with diverse mappings and settings. When a backward incompatibility was identified with timestamp metadata, a custom JavaScript transformation, integrated directly into the Migration Assistant tooling, was applied to automatically adjust the mappings across the indices and facilitate compatibility.
  • Historical data migration: Octus used Reindex-from-Snapshot to migrate the historical documents from a point-in-time snapshot of the source cluster, scaling this process without impacting the source cluster since the snapshot was stored in Amazon Simple Storage Service (Amazon S3). Reindex-from-Snapshot also enabled Octus to adjust the sharding scheme during migration, helping to optimize cluster performance on the target.
  • Live Traffic Replay: Once backfill was complete, Octus used Migration Assistant’s Traffic Replayer to send the captured live traffic (from the Traffic Capture Proxy) to the target cluster with required request transformations for OpenSearch Service compatibility, resulting in the target cluster containing the documents from the source cluster with updates being performed in real time.

The following diagram illustrates the implementation architecture for this migration.


Figure 1 – Migration Assistant architecture with migration steps

For more information about the Migration Assistant for Amazon OpenSearch Service, visit the AWS Solutions home page.

Each node in the diagram correlates to the following steps in the migration process:

  1. Client traffic is directed to the existing cluster.
  2. An Application Load Balancer with capture proxies relays traffic to the source cluster while replicating data to Amazon Managed Streaming for Apache Kafka (Amazon MSK).
  3. Using the migration console, a point-in-time snapshot is taken. Once the snapshot completes, the Metadata Migration Tool is used to establish indexes, templates, component templates, and aliases on the target cluster. With continuous traffic capture in place, Reindex-from-Snapshot migrates data from the source.
  4. Once Reindex-from-Snapshot is complete, captured traffic is replayed from Amazon Managed Streaming for Apache Kafka (Amazon MSK) to the target cluster by Traffic Replayer.
  5. Performance and behavior of traffic sent to the source and target clusters are compared by reviewing logs and metrics.
  6. After confirming that the target cluster’s functionality meets expectations, clients are redirected to the new target.

Complete migration and optimization journey

Octus’s migration from Elastic Cloud to Amazon OpenSearch Service encompassed both the core migration effort and subsequent optimization phases. The goal was to successfully migrate the search infrastructure, applications, and data from Elastic Cloud to a new OpenSearch Service domain with minimal disruption, while continuously optimizing performance and costs based on real-world usage data.

Octus used their in-house custom infrastructure frameworks (their internal tooling for infrastructure automation) to build, deploy and monitor the target OpenSearch Service 1.3 domain, establishing a solid foundation for the migration. This approach used familiar internal processes while moving to the fully managed AWS service. Refer to AWS documentation to implement security best practices when using OpenSearch Service.

Pre-migration optimization

Prior to initiating the migration, Octus conducted optimization activities on the source Elasticsearch cluster to streamline the migration process. This included removing unused indexes that had accumulated over time and removing large documents that would unnecessarily extend migration duration and increase storage transfer costs. These preparatory steps significantly reduced the data volume requiring migration and minimized the overall migration complexity, enabling more efficient use of the Migration Assistant tools.

Technical constraints and version considerations

The migration involved specific version compatibility challenges that influenced the technical approach. The source Elasticsearch cluster was running version 7.17, and the Python client applications were also constrained to Elasticsearch 7.17 compatibility. To support the transition, the team used Reindex-from-Snapshot, which enables cross-system migrations by reindexing data from existing snapshots into a new OpenSearch Service cluster. RFS also rewrites indices created on older versions of Lucene, simplifying future upgrades to the latest version of OpenSearch Service. While evaluating a move to OpenSearch 1 or 2, Octus selected OpenSearch 1.3 as the target to minimize client-side changes and reduce migration complexity, while positioning themselves for simpler upgrades later.

The version selection particularly impacted the R application environment, as R language (an open-source programming language for statistical computing and data analysis) lacked native OpenSearch 1.3 client support. This constraint required Octus to develop a custom client solution using the ropensci/elastic library to integrate with the new OpenSearch Service domain. The Python environment presented similar challenges, where the Elasticsearch 7.17 client constraints necessitated careful consideration of the migration approach. These client compatibility concerns were among the factors that influenced the choice of Migration Assistant tools over traditional snapshot-based methods, as the Migration Assistant provided better support for managing version-specific client interactions during the transition.

Looking forward, Octus plans to upgrade to newer OpenSearch versions as their application stack evolves and client library support matures, so that they can leverage the latest features and performance improvements while maintaining the stability achieved through this migration.

Application modernization across multiple languages

The application changes represented a significant technical undertaking across multiple programming environments:

  • Legacy PHP systems (5.6 and Laravel 4.2): Octus handled the mapping type deprecation on OpenSearch requests, because specifying mapping types is not supported, while continuing to use the elasticsearch connector library with username/password authentication.
  • Modern PHP applications (8.1 and Laravel 9): These underwent more comprehensive changes, replacing the elasticsearch/elasticsearch library with the opensearch-project/opensearch-php client and leveraging IAM authentication to connect to the clusters.
  • Python environment: Applications spanning versions 3.8, 3.10, 3.11, and 3.13 with Django frameworks 2.1, 3.2, and 5.2 required replacing the elasticsearch library with opensearch-py and transitioning to IAM authentication.
  • R applications: For R 4.5.1 applications, Octus utilized a custom library ropensci/elastic to facilitate compatibility.
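
To make the mapping type change concrete, the following is a minimal sketch of unwrapping a legacy, typed mappings block into the typeless form. It is written in Python for illustration only and is not the code Octus used in their PHP applications; the function name to_typeless_mappings and the sample index body are hypothetical.

```python
import copy


def to_typeless_mappings(index_body):
    """Return a copy of an index-creation body with a legacy mapping type unwrapped.

    Legacy (typed) form:   {"mappings": {"<type>": {"properties": {...}}}}
    Typeless form:         {"mappings": {"properties": {...}}}
    """
    body = copy.deepcopy(index_body)
    mappings = body.get("mappings")

    # Nothing to do: no mappings at all, or the body is already typeless.
    if not isinstance(mappings, dict) or not mappings or "properties" in mappings:
        return body

    # A single top-level key wrapping "properties" is treated as the legacy type name.
    if len(mappings) == 1:
        type_mapping = next(iter(mappings.values()))
        if isinstance(type_mapping, dict) and "properties" in type_mapping:
            body["mappings"] = type_mapping
            return body

    # Anything else is ambiguous, so fail loudly instead of guessing.
    raise ValueError("unrecognized mappings layout; inspect manually")


# Hypothetical legacy body with a custom type called "doc".
legacy_body = {"mappings": {"doc": {"properties": {"title": {"type": "text"}}}}}
typeless_body = to_typeless_mappings(legacy_body)
print(typeless_body)
```

The sketch copies the input instead of mutating it, so the same request body can still be sent to the source cluster while traffic is being captured.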

Traffic routing and enhanced monitoring

To facilitate the migration, Octus redirected their existing clients to route requests to the source cluster through Migration Assistant’s Traffic Capture Proxy, migrating the data from live traffic to their target cluster.

The monitoring infrastructure underwent significant enhancement during this process. Octus’s observability infrastructure monitors the overall health of OpenSearch Service clusters, which includes cluster manager and data nodes, network, data storage, security, and IAM access. It also monitors the indexing and search performance of their applications. This removed the need for a separate monitoring cluster because logs and metrics were shipped directly to Datadog, significantly improving observability. The Datadog monitors were defined using Infrastructure-as-Code and integrated into their infrastructure frameworks.

Cutover and initial results

The Site Reliability Engineering team meticulously planned the release, completing the migration from Elasticsearch to OpenSearch Service and the cutover from the Elasticsearch client to the OpenSearch Service clients with no downtime for the system application and zero data loss. The initial migration phase resulted in a 52% cost reduction, along with operational benefits including full Infrastructure-as-Code implementation for infrastructure and monitoring, and enhanced observability.

Post-migration optimization

Following the migration, Octus conducted comprehensive optimization based on operational data from production and other environments in the new OpenSearch Service setup. This real-world usage data provided valuable insights into actual resource consumption, enabling informed decisions regarding further cluster resizing.

Through usage metric analysis and strategic resizing, Octus aligned cluster size more precisely with operational needs, facilitating continued performance while minimizing expenditure. This optimization phase delivered an additional 33% cost reduction compared to the original Elastic Cloud costs, bringing the total reduction to 85% while maintaining consistent and optimal performance.
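
Because both percentages are stated relative to the original Elastic Cloud costs, they add up directly instead of compounding. The following sketch shows the arithmetic; the baseline of 100,000 is a hypothetical figure chosen only to make the numbers easy to read, and the function names are illustrative.

```python
def additive_reduction_pct(*reductions_pct):
    """Total reduction when every percentage is measured against the same original cost."""
    return sum(reductions_pct)


def compounded_reduction_pct(*reductions_pct):
    """Total reduction if each percentage applied to the cost left by the previous step."""
    remaining = 1.0
    for pct in reductions_pct:
        remaining *= 1 - pct / 100
    return (1 - remaining) * 100


baseline_monthly_cost = 100_000.0  # hypothetical; the post does not state absolute costs

total_pct = additive_reduction_pct(52, 33)
cost_after = baseline_monthly_cost * (1 - total_pct / 100)

print(f"additive total: {total_pct}%")
print(f"remaining cost: {cost_after:,.0f}")
print(f"compounded reading (not what the post describes): {compounded_reduction_pct(52, 33):.2f}%")
```

The compounded figure is shown only as a contrast: reading the 33% as a reduction of the already reduced bill would give a smaller total than the 85% reported here.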

Operational monitoring

Octus uses Datadog to monitor both search and indexing latency, providing real-time visibility into Amazon OpenSearch Service cluster performance. The following screenshot shows how custom Datadog dashboards provide a live view of the OpenSearch Service clusters. This visualization offers both a high-level overview and detailed insights into the ingestion process, helping us understand the storage and document count. The bottom half of the dashboard presents a time-series view of individual node health and performance metrics such as read and write latency, throughput, and IOPS.


Figure 2 – Datadog dashboards

Migration observability

Migration Assistant for Amazon OpenSearch Service provides several dashboards to observe and validate the progress of a migration. By using these observability features, customers can track both backfill and live capture and replay progress, facilitating confidence before switching production workloads to the target cluster. The following graphs are an example from Octus’s migration, where approximately 4 TB of data was migrated in about 9 hours (from 08:00 to 17:00).


Figure 3 – Backfill progress by disk usage


Figure 4 – Backfill progress by searchable documents

Once the backfill is complete, the captured traffic is replayed to synchronize ongoing activity between the source and target clusters.

At the time the backfill finished (around 17:00), the target cluster was approximately 467 minutes behind the source. The replay process rapidly reduced this lag by processing captured traffic at a faster rate than it was originally ingested at the source.


Figure 5 – Replay lag after backfill completion

When the lag time reached 0, the target cluster was fully in sync and production traffic could safely be rerouted. Octus chose to observe replayed traffic on the target for several days before making the final switchover.
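
The backfill and replay figures above lend themselves to a quick back-of-the-envelope calculation. The following sketch estimates the average backfill throughput and how long a replay takes to close a given lag. It assumes decimal terabytes and a constant replay speedup; the speedup values are hypothetical, because the post does not state how much faster than real time the Traffic Replayer ran.

```python
def backfill_throughput_mb_s(terabytes, hours):
    """Average throughput in MB/s, assuming decimal units (1 TB = 1e12 bytes)."""
    return terabytes * 1e12 / (hours * 3600) / 1e6


def catch_up_minutes(lag_minutes, replay_speedup):
    """Wall-clock minutes needed for the replay lag to reach zero.

    While replaying at `replay_speedup` times real time, new traffic keeps
    arriving at 1x, so the lag shrinks by (replay_speedup - 1) minutes per minute.
    """
    if replay_speedup <= 1:
        raise ValueError("replay must run faster than real time to catch up")
    return lag_minutes / (replay_speedup - 1)


print(f"backfill: {backfill_throughput_mb_s(4, 9):.1f} MB/s on average")

# Hypothetical speedups applied to the 467-minute lag observed after backfill.
for speedup in (2.0, 5.0):
    print(f"{speedup}x replay closes a 467-minute lag in {catch_up_minutes(467, speedup):.1f} minutes")
```

The second function makes the key constraint explicit: the lag only converges to zero when captured traffic is replayed faster than it was originally ingested.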

Achieving excellence

Octus’s migration to Amazon OpenSearch Service has yielded remarkable results:

  • Scalability – Octus has almost doubled the number of documents available for Q&A across three environments in days instead of weeks. Their use of Amazon Elastic Container Service (Amazon ECS) with AWS Fargate with auto scaling rules and controls gives them elastic scalability for their services during peak usage hours.
  • Cost reduction – By moving away from Elastic Cloud to OpenSearch Service, Octus’s monthly infrastructure costs are now 85% lower.
  • Enhanced search performance – Octus maintained consistent response times throughout the migration with no negative impact on latency, while achieving a 20% improvement in query throughput and overall search performance.
  • Zero downtime – Octus experienced zero downtime during migration and 100% uptime overall for the whole application.
  • Reduced operational overhead – Post-migration, Octus’s DevOps and SRE teams see 30% less maintenance burden and overheads. Supporting SOC2 compliance is also straightforward now that they’re using one system.
  • Accelerated timeline delivery – The entire migration was completed ahead of schedule, moving from planning to full completion in under one quarter.

“Moving from Elastic Cloud to Amazon OpenSearch Service was a key component of our broader strategy to minimize third-party dependencies and strengthen the reliability of Octus’ system infrastructure. Migration Assistant for Amazon OpenSearch Service enabled us to execute a seamless transition with zero data loss and virtually no downtime for our users.” – Vishal Saxena, CTO, Octus

Conclusion

In this post, we showed you how Octus successfully migrated their Elasticsearch workloads from Elastic Cloud to Amazon OpenSearch Service using the Migration Assistant for OpenSearch Service, achieving zero downtime and significant operational improvements.

The Migration Assistant for OpenSearch Service supported this complex migration through its comprehensive suite of tools. The Metadata Migration capability migrated dozens of indices with diverse mappings and settings, with custom JavaScript transformations handling backward incompatibilities. Reindex-from-Snapshot migrated the historical documents from point-in-time snapshots without impacting the source cluster, while also optimizing the sharding scheme for improved performance. Live Traffic Replay made sure the target cluster remained synchronized with real-time updates throughout the migration process.

The migration delivered substantial results across several dimensions. Octus achieved an 85% reduction in monthly infrastructure costs while nearly doubling the number of documents available for search across three environments. Query throughput improved by 20%, with consistent response times and no negative impact on latency. The migration maintained zero downtime and 100% uptime for the entire application, with DevOps and SRE teams experiencing 30% less maintenance burden and operational overhead. The entire migration was completed ahead of schedule in under one quarter.

To learn more about the Migration Assistant for OpenSearch Service and how it can help you achieve similar results, visit the AWS Solutions home page.

Visit Octus to learn how we deliver rigorously verified intelligence at speed and create a complete picture for professionals across the entire credit lifecycle. Follow Octus on LinkedIn and X.


About the Authors

Harmandeep Sethi

Harmandeep is Head of SRE Engineering and Infrastructure Frameworks at Octus, with nearly 10 years of experience leading high-performing teams in the implementation of large-scale systems. He has played a pivotal role in transforming and modernizing Octus’s Search Engine infrastructure and services by driving best practices in observability, resilience engineering, and the automation of operational processes through Infrastructure Frameworks.

Serhii Shevchenko

Serhii is a Site Reliability Engineer at Octus. With 9 years of combined experience in software development and site reliability engineering, his expertise focuses on enhancing system reliability and performance. He was a key developer on the application side for the company’s critical migration from Elasticsearch Cloud to AWS OpenSearch. His planning was instrumental in executing the transition with zero client-facing downtime.

Govind Bajaj

Govind is a Senior Site Reliability Engineer at Octus, specializing in architecting and implementing scalable infrastructure that supports high-performing engineering teams and critical systems. With over 8 years of experience, he excels at breaking down complex problems and turning them into practical, well-designed solutions, with a strong focus on building secure, observable, and resilient platforms.

Virendra Shinde

Virendra is the Head of Platform at Octus, where he oversees cloud infrastructure, site reliability, and the core frameworks that power the Octus product suite. Before joining Octus, he spent two years at Grayscale Investments building an investor portal and data APIs from the ground up. Prior to that, he spent eight years at Blackstone leading multiple development teams. He holds a Master’s degree in Information Management from the University of Maryland.

Brian Presley

Brian is a Software Development Manager at OpenSearch, leading teams behind OpenSearch Migrations and OpenSearch Serverless to build scalable, high-impact search and analytics solutions.

Andre Kurait

Andre is a Software Development Engineer II at AWS, based in Austin, Texas. He is currently working on Migration Assistant for Amazon OpenSearch Service. Prior to joining Amazon OpenSearch, Andre worked within Amazon Health Services. In his free time, Andre enjoys traveling, cooking, and playing in his church sport leagues. Andre holds Bachelor of Science degrees from the University of Kansas in Computer Science and Mathematics.

Vaibhav Sabharwal

Vaibhav is a Senior Solutions Architect at AWS based out of New York. He is passionate about learning new cloud technologies and assisting customers in building cloud adoption strategies, designing innovative solutions, and driving operational excellence. As a member of the Financial Services and Storage Technical Field Communities at AWS, he actively contributes to the collaborative efforts within the industry.

AWS Secrets Manager launches Managed External Secrets for Third-Party Credentials

Post Syndicated from Rohit Panjala original https://aws.amazon.com/blogs/security/aws-secrets-manager-launches-managed-external-secrets-for-third-party-credentials/

Although AWS Secrets Manager excels at managing the lifecycle of Amazon Web Services (AWS) secrets, managing credentials from third-party software providers presents unique challenges for organizations as they scale usage of their cloud applications. Organizations using multiple third-party services frequently develop different security approaches for each provider’s credentials because there hasn’t been a standardized way to manage them. When storing these third-party credentials in Secrets Manager, organizations frequently maintain additional metadata within secret values to facilitate service connections. This approach requires updating entire secret values when metadata changes and implementing provider-specific secret rotation processes that are manual and time consuming. Organizations looking to automate secret rotation typically develop custom functions tailored to each third-party software provider, requiring specialized knowledge of both third-party and AWS systems.

To help customers streamline third-party secrets management, we’re introducing a new feature in AWS Secrets Manager called managed external secrets. In this post, we explore how this new feature simplifies the management and rotation of third-party software credentials while maintaining security best practices.

Introducing managed external secrets

AWS Secrets Manager has a proven track record of helping customers secure and manage secrets for AWS services such as Amazon Relational Database Service (Amazon RDS) or Amazon DocumentDB through managed rotation capabilities. Building on this success, Secrets Manager now introduces managed external secrets, a new secret type that extends this same seamless experience to third-party software applications like Salesforce, simplifying secret management challenges through standardized formats and automated rotation.

You can use this capability to store secrets vended by third-party software providers in predefined formats. These formats were developed in collaboration with trusted integration partners to define both the secret structure and required metadata for rotation, eliminating the need for you to define your own custom storage strategies. Managed external secrets also provides automated rotation by directly integrating with software providers. With no rotation functions to maintain, you can reduce your operational overhead while benefiting from essential security controls, including fine-grained permissions management using AWS Identity and Access Management (IAM), secret access monitoring through Amazon CloudWatch and AWS CloudTrail, and automated secret-specific threat detection through Amazon GuardDuty. Moreover, you can implement centralized and consistent secret management practices across both AWS and third-party secrets from a single service, eliminating the need to operate multiple secrets management solutions at your organization. Managed external secrets follows standard Secrets Manager pricing, with no additional cost for using this new secret type.

Prerequisites

To create a managed external secret, you need an active AWS account with appropriate access to Secrets Manager. The account must have sufficient permissions to create and manage secrets, including the ability to access the AWS Management Console or programmatic access through the AWS Command Line Interface (AWS CLI) or AWS SDKs. At minimum, you need IAM permissions for the following actions: secretsmanager:DescribeSecret, secretsmanager:GetSecretValue, secretsmanager:UpdateSecret, and secretsmanager:UpdateSecretVersionStage.
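
As an illustration, the four actions listed above can be collected into an identity-based policy document. The following is a minimal sketch that builds the JSON with the Python standard library. The statement ID, account ID, Region, and secret name pattern are placeholders, and your environment might need additional permissions that this post does not list.

```python
import json

# The four actions named in the prerequisites above.
REQUIRED_ACTIONS = [
    "secretsmanager:DescribeSecret",
    "secretsmanager:GetSecretValue",
    "secretsmanager:UpdateSecret",
    "secretsmanager:UpdateSecretVersionStage",
]


def build_policy(secret_arn_pattern):
    """Return an IAM policy document allowing the required actions on matching secrets."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ManagedExternalSecretAccess",  # placeholder statement ID
                "Effect": "Allow",
                "Action": REQUIRED_ACTIONS,
                "Resource": secret_arn_pattern,
            }
        ],
    }


# Placeholder ARN pattern; replace the Region, account ID, and name prefix with your own.
policy = build_policy("arn:aws:secretsmanager:us-east-1:111122223333:secret:example/*")
print(json.dumps(policy, indent=2))
```

Scoping the Resource element to a name prefix, instead of using a wildcard for all secrets, keeps the policy closer to least privilege.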

You must also have valid credentials and the necessary access permissions for the third-party software provider whose secrets you plan to manage with Secrets Manager.

For secret encryption, you must decide whether to use an AWS managed AWS Key Management Service (AWS KMS) key or a customer managed KMS key. For customer managed keys, make sure you have the necessary key policies configured. The AWS KMS key policy should allow Secrets Manager to use the key for encryption and decryption operations.

Create a managed external secret

Today, managed external secrets supports three integration partners: Salesforce, Snowflake, and BigID. Secrets Manager will continue to expand its partner list and more third-party software providers will be added over time. For the latest list, refer to Integration Partners.

To create a managed external secret, follow the steps in the following sections.

Note: This example demonstrates the steps for retrieving Salesforce External Client App Credentials, but a similar process can be followed for other third-party vendor credentials integrated with Secrets Manager.

To select secret type and add details:

  1. Go to the Secrets Manager service in the AWS Management Console and choose Store a new secret.
  2. Under Secret type, select Managed external secret.
  3. In the AWS Secrets Manager integrated third-party vendor credential section, select your provider from the available options. For this walkthrough, we select Salesforce External Client App Credential.
  4. Enter your configurations in the Salesforce External Client App Credential secret details section. The Salesforce External Client App credentials consist of several key components:
    1. The Consumer key (client ID), which serves as the credential identifier for OAuth 2.0. You can retrieve the consumer key directly from the Salesforce External Client App Manager OAuth settings.
    2. The Consumer secret (client secret), which functions as the private password for OAuth 2.0 authentication. You can retrieve the consumer secret directly from the Salesforce External Client App Manager OAuth settings.
    3. The Base URI, which is your Salesforce org’s base URL (formatted as https://MyDomainName.my.salesforce.com), is used to interact with Salesforce APIs.
    4. The App ID, which identifies your Salesforce External Client Apps (ECAs) and can be retrieved by calling the Salesforce OAuth usage endpoint.
    5. The Consumer ID, which identifies your Salesforce ECA, can be retrieved by calling the Salesforce OAuth credentials by App ID endpoint. For a list of commands, refer to Stage, Rotate, and Delete OAuth Credentials for an External Client App in the Salesforce documentation.
  5. Select the Encryption key from the dropdown menu. You can use an AWS managed KMS key or a customer managed KMS key.
  6. Choose Next.
Figure 1: Choose secret type
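
Before entering the Base URI, you might want to check that it matches the pattern shown in step 4. The following sketch is a local format check only, based solely on the https://MyDomainName.my.salesforce.com pattern given in this post; the function name is hypothetical, and the validation that Secrets Manager or Salesforce performs might differ.

```python
from urllib.parse import urlparse

SALESFORCE_SUFFIX = ".my.salesforce.com"


def looks_like_salesforce_base_uri(value):
    """Return True if value has the form https://<MyDomainName>.my.salesforce.com."""
    parsed = urlparse(value.strip())
    host = parsed.hostname or ""  # hostname is lowercased by urlparse
    return (
        parsed.scheme == "https"
        and host.endswith(SALESFORCE_SUFFIX)
        and len(host) > len(SALESFORCE_SUFFIX)  # a domain name must precede the suffix
        and parsed.path in ("", "/")            # base URL only, no API path
        and not parsed.query
        and not parsed.fragment
    )


for candidate in (
    "https://MyDomainName.my.salesforce.com",
    "http://MyDomainName.my.salesforce.com",
    "https://MyDomainName.my.salesforce.com/services/data",
):
    print(candidate, "->", looks_like_salesforce_base_uri(candidate))
```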

To configure a secret:

  1. In this section, you need to provide information for your secret’s configuration.
  2. In Secret name, enter a descriptive name and optionally enter a detailed Description that helps identify the secret’s purpose and usage. You also have additional configuration choices available: you can attach Tags for better resource organization, set specific Resource permissions to control access, and select Replicate secret for multi-Region resilience.
  3. Choose Next.
Figure 2: Configure secret

To configure rotation and permissions (optional):

In the optional Configure rotation step, the new secret configuration introduces two key sections focused on metadata management, which are stored separately from the secret value itself.

  1. Under Rotation metadata, specify the API version your Salesforce app is using. To find the API version, refer to List Available REST API Versions in the Salesforce documentation. Note: The minimum version needed is v65.0.
  2. Select an Admin secret ARN, which contains the administrative OAuth credentials that are used to rotate the Salesforce client secret.
  3. In the Service permissions for secret rotation section, Secrets Manager automatically creates a role with necessary permissions to rotate your secret values. These default permissions are transparently displayed in the interface for review when you choose view permission details. You can deselect the default permissions for more granular control over secret rotation management.
  4. Choose Next.
Figure 3: Configure rotation
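
The minimum API version noted in step 1 is easy to check before you configure rotation. The following is a small sketch that compares a version string against v65.0; the helper names are hypothetical, and it assumes version strings of the form v65.0 or 65.0.

```python
MINIMUM_API_VERSION = (65, 0)  # from the note in step 1 above


def parse_api_version(version):
    """Parse 'v65.0' or '65.0' into a (major, minor) tuple of integers."""
    text = version.strip().lower().lstrip("v")
    major, _, minor = text.partition(".")
    return int(major), int(minor or 0)


def meets_minimum(version):
    """Return True if the version is at least the minimum needed for rotation."""
    return parse_api_version(version) >= MINIMUM_API_VERSION


for version in ("v64.0", "v65.0", "66.0"):
    print(version, "->", meets_minimum(version))
```

Comparing integer tuples instead of raw strings avoids ordering mistakes such as treating "v100.0" as lower than "v65.0".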

To review:

In the final step, you’ll be presented with a summary of your secret’s configuration. On the Review page, you can verify parameters before proceeding with secret creation.

After confirming that the configurations are correct, choose Store to complete the process and create your secret with the specified settings.

Figure 4: Review

After successful creation, your secret will appear on the Secrets tab. You can view, manage, and monitor aspects of your secret, including its configuration, rotation status, and permissions. After creation, review your secret configuration, including encryption settings and resource policies for cross-account access, and examine the sample code provided for different AWS SDKs to integrate secret retrieval into your applications. The Secrets tab provides an overview of your secrets, allowing for central management of secrets. Select your secret to view Secret details.

Figure 5: View secret details

Your managed external secret has been successfully created in Secrets Manager. You can access and manage this secret through the Secrets Manager console or programmatically using AWS APIs.
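
For programmatic access, the console provides sample code for each AWS SDK. As a rough illustration of the pattern, the following sketch wraps the GetSecretValue call and decodes a JSON secret string. It accepts any client object that exposes get_secret_value(SecretId=...), which is the method the AWS SDK for Python (Boto3) Secrets Manager client provides. The stand-in client, the secret name, and the field name in the canned payload are made up so that the example runs offline; the actual field names are defined by the partner's predefined format and are not shown in this post.

```python
import json


def read_secret_json(client, secret_id):
    """Fetch a secret through the given client and decode its JSON string payload."""
    response = client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])


class FakeSecretsClient:
    """Offline stand-in for a Secrets Manager client; returns a canned response."""

    def get_secret_value(self, SecretId):
        # Placeholder payload: "exampleField" is not a real managed external secret field.
        return {
            "Name": SecretId,
            "SecretString": json.dumps({"exampleField": "example-value"}),
        }


secret = read_secret_json(FakeSecretsClient(), "example/managed-external-secret")
print(sorted(secret))  # print field names only, never the secret values

# With real credentials (assumes boto3 is installed and configured):
#   import boto3
#   secret = read_secret_json(boto3.client("secretsmanager"), "<your secret name or ARN>")
```

Passing the client in as a parameter keeps the retrieval logic testable without network access or AWS credentials.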

Onboard as an integration partner with Secrets Manager

With the new managed external secret type, third-party software providers can integrate with Secrets Manager and offer their customers a programmatic way to securely manage secrets vended by their applications on AWS. This integration provides their customers with a centralized solution for managing both the lifecycle of AWS and third-party secrets, including automatic rotation capabilities from the moment of secret creation. Software providers like Salesforce are already using this capability.

“At Salesforce, we believe security shouldn’t be a barrier to innovation, it should be an enabler. Our partnership with AWS on managed external secrets represents security-by-default in action, removing operational burden from our customers while delivering enterprise-grade protection. With AWS Secrets Manager now extending to partners and automated zero-touch rotation eliminating human risk, we’re setting a new industry standard where secure credentials become seamless without specialized expertise or additional costs.” — Jay Hurst, Sr. Vice President, Product Management at Salesforce

There is no additional cost to onboard with Secrets Manager as an integration partner. To get started, partners must follow the process listed on the partner onboarding guide. If you have questions about becoming an integration partner, contact our team at [email protected] with Subject: [Partner Name] Onboarding request.

Conclusion

In this post, we introduced managed external secrets, a new secret type in Secrets Manager that addresses the challenges of securely managing the lifecycle of third-party secrets through predefined formats and automated rotation. By eliminating the need to define custom storage strategies and develop complex rotation functions, you can now consistently manage your secrets—whether from AWS services, custom applications, or third-party providers—from a single service. Managed external secrets provide the same security features as standard Secrets Manager secrets, including fine-grained permissions management, observability, and compliance controls, while adding built-in integration with trusted partners at no additional cost.

To get started, refer to the technical documentation. For information on migrating your existing partner secrets to managed external secrets, see Migrating existing secrets. The feature is available in all AWS Regions where AWS Secrets Manager is available. For a list of Regions where Secrets Manager is available, see the AWS Region table. If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on Secrets Manager re:Post or contact AWS Support.

Rohit Panjala

Rohit is a Security Specialist at AWS, focused on data protection and cryptography services. He is responsible for developing and implementing go-to-market (GTM) strategies and driving customer and partner adoptions for AWS data protection services on a global scale. Before joining AWS, he worked in security product management at IBM and electrical engineering roles. He holds a BS in engineering from The Ohio State University.

Rochak Karki

Rochak is a Security Specialist Solutions Architect at AWS, focusing on threat detection, incident response, and data protection to help customers build secure environments. Rochak is a US Army veteran and holds a BS in engineering from the University of Wyoming. Outside of work, he enjoys spending time with family and friends, hiking, and traveling.

How potential performance upside with AWS Graviton helps reduce your costs further

Post Syndicated from Markus Adhiwiyogo original https://aws.amazon.com/blogs/compute/how-potential-performance-upside-with-aws-graviton-helps-reduce-your-costs-further/

Amazon Web Services (AWS) provides many mechanisms to optimize the price performance of workloads running on Amazon Elastic Compute Cloud (Amazon EC2), and the selection of the optimal infrastructure to run on can be one of the most impactful levers. When we started building the AWS Graviton processor, our goal was to optimize AWS Graviton features and capabilities to deliver a processor that provides the best price performance across a broad array of cloud workloads running on Amazon EC2. That goal continues to be our guiding principle, and today customers who adopt AWS Graviton-based EC2 instances see up to 40% better price performance on their cloud workloads when compared to equivalent non-Graviton EC2 instances. The price performance improvement is the result of both the performance improvement and the lower price in using AWS Graviton-based instances.

Price performance blends the cost of infrastructure with the amount of work you can achieve with infrastructure usage. After talking to many AWS Graviton customers, we’ve learned that the cost savings go beyond the lower AWS Graviton-based instances price. Many AWS Graviton customers told us that the performance increase from AWS Graviton allows them to consume fewer computing hours than comparable non-Graviton instances for equivalent workload throughput. In turn, this leads to further cost reduction.

The following are some examples from our customers:

  • Pinterest achieved 47% cost savings and 38% savings on compute resources while reducing carbon emissions by 62% for its web API workload.
  • SAP powers its SAP HANA Cloud with AWS Graviton to enhance its price performance by 35% while lowering carbon impact by 45%.
  • Sprinklr improved their machine learning (ML) inference workloads’ throughput by up to 20% while reducing costs by up to 25%.

You can find more customer examples in the AWS Graviton testimonials page.

To help organizations capture similar benefits, we’ve enhanced the AWS Graviton Savings Dashboard (GSD) with new features that account for both pricing and performance improvements. In the following section we explore these new capabilities and how they can help optimize your infrastructure costs.

Understanding performance-driven cost optimization in the GSD

The GSD helps organizations identify ideal workloads for AWS Graviton migration through automated resource matching and data-driven visualizations. You can learn more about the GSD and how to set it up in this AWS compute post.

Although the dashboard has traditionally focused on calculating direct cost savings from the AWS Graviton pricing advantages, we’ve observed that customers often experience more benefits when their applications perform more efficiently on AWS Graviton processors, leading to decreased compute resource usage. To better reflect these real-world scenarios, we’ve enhanced the dashboard with new features highlighting Normalized Instance Hours (NIH) analysis capabilities so that you can model potential savings based on both pricing benefits and compute hour reductions. Although this tool helps estimate potential savings, actual performance improvements can only be determined by testing your specific workloads on AWS Graviton instances. Performance is always workload and use case specific, so we encourage you to test your AWS Graviton-based workloads using the Optimization and Performance Runbook to help you determine the actual possible NIH percent reduction.

Key dashboard components

This section outlines the following three key dashboard components: NIH reduction analysis, enhanced cost analysis visualizations, and detailed savings analysis.

NIH reduction analysis

The dashboard now features a new slider that lets you model potential cost savings by inputting the percentage reduction in NIH. Many organizations have found it challenging to calculate their total possible savings since the benefits come from two sources: the lower instance pricing of AWS Graviton and the reduced compute hours.

You can use the slider to model different cost scenarios by adjusting a theoretical NIH reduction between 0% and 40%. You can use this slider to input NIH reductions validated through your workload testing, model the combined impact of both pricing benefits and reduced compute hours, and explore different scenarios to help prioritize which workloads to test first.

Figure 1: NIH slider location

Assume that your testing shows that your workload runs just as effectively with 15% fewer normalized instance hours on AWS Graviton. You can now plug that exact number into the slider to see your modeled savings combining both pricing differences and compute hour reductions. Although we’ve heard success stories of significant reductions from customers, we recommend starting your initial estimate with a conservative 10% baseline and adjusting based on your own testing results.

Enhanced cost analysis visualizations

The dashboard presents key visualizations that demonstrate the direct relationship between NIH reduction and cost savings. First, you see the Potential Graviton Base Savings from pricing differences alone. In the following diagram, we can observe an example of $61.54K of cost savings from migrating to equivalent AWS Graviton instances. Next, the Estimated Additional Savings Due to Performance in the same diagram shows $42.40K in savings if your performance testing confirms a 15% NIH reduction in your workload. Finally, the dashboard sums these two values into the Total Potential Graviton Savings of $103.94K. The Total Potential Graviton Savings helps visualize how both pricing benefits and any validated compute hour reductions could contribute to your overall savings.

Figure 2: Visualization with relationship between NIH reduction and cost savings
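The arithmetic behind these three figures can be sketched in a few lines of Python. This is a simplified model, not the dashboard's actual calculation: the function name `model_graviton_savings` and the assumption that an NIH reduction lowers the remaining Graviton spend proportionally are illustrative choices made here, not documented GSD behavior.

```python
def model_graviton_savings(current_cost, base_savings, nih_reduction):
    """Model savings from a price delta plus an NIH reduction.

    Assumption (not taken from the GSD): fewer normalized instance hours
    reduce the post-migration Graviton spend proportionally.
    """
    if not 0.0 <= nih_reduction <= 0.40:  # slider range described in the post
        raise ValueError("nih_reduction must be between 0.0 and 0.40")
    graviton_cost = current_cost - base_savings
    additional_savings = graviton_cost * nih_reduction
    return additional_savings, base_savings + additional_savings


# The example from the post: the total is the sum of the two components.
base_k, additional_k = 61.54, 42.40
total_k = round(base_k + additional_k, 2)
print(f"Total potential savings: ${total_k}K")
```

Only the final sum is taken from the post; treat the proportional model as a starting point for what-if estimates and replace it with numbers from your own workload testing.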

The Amortized Cost Breakdown and Normalized Instance Hrs Breakdown charts in the following figure show 6-month historical trends, helping you spot patterns such as seasonal spikes or high-usage periods. These patterns can help you identify where even small efficiency improvements might yield significant savings, for example, workloads with consistently high usage or predictable peak periods that would be good candidates for testing.

Figure 3: Amortized Cost, NIH, and Total Potential Savings Breakdown charts

Detailed savings analysis

Building on our commitment to help customers optimize cloud costs, we’ve enhanced the Potential Graviton Savings Details table with two columns focused on performance-based savings modeling. The Estimated Additional Savings Due to Performance column shows the modeled savings based on your chosen NIH reduction percentage, while Total Potential Graviton Savings combines this with the base pricing benefits.

Figure 4: Potential Graviton Savings Details table

As you examine your current instance family, you can observe both baseline AWS Graviton savings and these added saving opportunities clearly laid out in a comprehensive breakdown. The analysis presents your total savings potential in both dollar amounts and percentages. This allows you to build a compelling business case for migration. Although this detailed breakdown provides valuable planning insights, remember that actual savings may vary depending on your specific workload patterns, implementation approaches, and operational considerations.

Conclusion

The Graviton Savings Dashboard (GSD) serves as a powerful analytics tool that streamlines your journey to cost-effective cloud computing. The GSD provides clear visualizations and interactive features to help you understand and maximize potential savings when migrating to AWS Graviton-based instances. To further explore the new features, navigate to the GSD interactive demo, where you can model an example of potential savings using the NIH reduction slider and detailed cost breakdowns.

Ready to explore how AWS Graviton can transform your infrastructure costs? Visit the GSD page to deploy or update your GSD dashboard. Access implementation guides, such as the CFM Technical Implementation Playbook (CFM TIPs), and start optimizing your cloud spend today with the enhanced capabilities of the GSD.

Over 85,000 AWS customers have discovered the benefits of AWS Graviton, with many completing their adoptions in just hours. We have created this resource guide so that you can accelerate your AWS Graviton adoption with minimal effort and enjoy significant price performance benefits.

“What I always tell customers is one week, one application, one engineer, and see what you can do. They always are pleasantly surprised by how much progress they can make. If you’re out there and you haven’t yet moved to AWS Graviton, what are you waiting for? Let’s make it happen!”

Dave Brown, VP, AWS Compute & ML Services

Important note about performance testing
The GSD does not attempt to estimate the potential NIH percent reduction or your workload’s performance when transitioned to AWS Graviton. You can use it to perform what-if analysis of your potential savings for a projected NIH percent reduction. In the absence of this variable, GSD only considers the price delta between instance types and misses an important contributor to the overall savings potential of AWS Graviton from the performance upside. Compute performance is always workload and use case specific, so we encourage you to test your AWS Graviton-based workloads using the Optimization and Performance Runbook to help you determine the actual possible NIH percent reduction.

Enhancing API security with Amazon API Gateway TLS security policies

Post Syndicated from Anton Aleksandrov original https://aws.amazon.com/blogs/compute/enhancing-api-security-with-amazon-api-gateway-tls-security-policies/

As compliance frameworks evolve and cryptographic standards advance, organizations are looking for additional controls to improve their cloud security posture. One of the necessary controls is a more granular TLS configuration, for example when regulatory requirements mandate disabling older ciphers like CBC or enforcing TLS 1.3 as a minimum version.

In this post, you will learn how the new Amazon API Gateway’s enhanced TLS security policies help you meet standards such as PCI DSS, Open Banking, and FIPS, while strengthening how your APIs handle TLS negotiation. This new capability increases your security posture without adding operational complexity, and provides you with a single, consistent way to standardize TLS configuration across your API Gateway infrastructure.

Overview

Previously, API Gateway offered limited control over TLS configuration, and only for custom domain names. Default endpoints used fixed security policies, which meant you often had to introduce additional infrastructure, such as custom Amazon CloudFront distributions, to meet your organization’s security or compliance requirements.

With this launch, you can configure TLS behavior directly on all REST API endpoint types, including Regional, edge-optimized, and private, and apply consistent TLS settings across both your APIs and their custom domain names. You can choose from predefined enhanced security policies to enforce the minimum TLS versions and cipher suites that your workloads require. For example, you can enforce TLS 1.3, use hardened TLS 1.2 without CBC ciphers, adopt FIPS-aligned suites for government workloads, or prepare for the future with policies that include post-quantum cryptography (PQC). The new security policies provide finer-grained control without adding operational complexity, helping you align your APIs with evolving security and compliance expectations.

Understanding API Gateway security policies

A security policy in API Gateway is a predefined combination of a minimum TLS version and a curated set of cipher suites. When a client connects to your REST API or custom domain name, API Gateway uses the selected policy to determine which protocol versions and ciphers it will accept during the TLS handshake. This gives you a predictable and enforceable way to control how clients establish encrypted connections to your APIs.

API Gateway supports two categories of security policies. Legacy policies, such as TLS_1_0 or TLS_1_2, remain available for backwards compatibility. Enhanced policies, identified by the SecurityPolicy_* prefix, provide stricter and more modern controls for regulated workloads, advanced governance, or cryptographic hardening. When you use an enhanced policy, you must also specify an endpoint access mode, which adds additional validation for how traffic reaches your API, as described in the following sections.

Enhanced policies follow a consistent naming pattern that helps you quickly understand what each policy enforces. For example, for REGIONAL and PRIVATE endpoint types, the following pattern applies:

SecurityPolicy_[TLS-Versions]_[Variant]_[YYYY-MM]

From this structure, you can identify the TLS versions supported, any specialized cryptographic variants (such as FIPS, PFS, or PQ), and the release date of the policy. For example, SecurityPolicy_TLS13_1_3_2025_09 accepts only TLS 1.3 traffic, while SecurityPolicy_TLS13_1_2_PFS_PQ_2025_09 supports TLS 1.2 as the lowest and TLS 1.3 as the highest TLS version, with forward secrecy and post-quantum enhancements.

Each policy maps to a curated combination of ciphers. For instance, SecurityPolicy_TLS13_1_3_2025_09 accepts only three TLS 1.3 cipher suites (TLS_AES_128_GCM_SHA256, TLS_AES_256_GCM_SHA384, and TLS_CHACHA20_POLY1305_SHA256) and rejects any other protocol versions or ciphers. For a full list of supported policies and ciphers, and the naming pattern for the EDGE endpoint type, see the API Gateway documentation.
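To make the naming pattern concrete, the following Python sketch splits a policy name into its parts. The regular expression is inferred only from the two example names in this post (reading `TLS13` as the highest version and the following `1_2` or `1_3` as the lowest); it is an illustration, not an official grammar, and the function name `parse_policy_name` is hypothetical.

```python
import re

# Inferred from the two example names in this post; other policy names,
# including EDGE policies, may not fit this pattern.
_POLICY_RE = re.compile(
    r"^SecurityPolicy_TLS(\d)(\d)_(\d)_(\d)((?:_[A-Z]+)*)_(\d{4})_(\d{2})$"
)


def parse_policy_name(name):
    """Split an enhanced policy name into readable components."""
    match = _POLICY_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized policy name: {name}")
    hi_major, hi_minor, lo_major, lo_minor, variants, year, month = match.groups()
    return {
        "max_tls": f"{hi_major}.{hi_minor}",
        "min_tls": f"{lo_major}.{lo_minor}",
        "variants": [v for v in variants.split("_") if v],
        "release": f"{year}-{month}",
    }


for policy in (
    "SecurityPolicy_TLS13_1_3_2025_09",
    "SecurityPolicy_TLS13_1_2_PFS_PQ_2025_09",
):
    print(policy, "->", parse_policy_name(policy))
```

A helper like this can be handy in inventory scripts that group APIs by minimum TLS version, but always confirm the meaning of a policy against the documentation rather than its name alone.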

How security policies apply to default endpoints and custom domains

You can use API Gateway to attach different security policies to your default API endpoint and custom domain names. During TLS negotiation, API Gateway selects the policy based on the Server Name Indication (SNI) value in the client’s TLS handshake, not the HTTP Host header. This means the policy depends on the hostname the client uses when initiating TLS.

For example, if a client connects directly to your default endpoint, such as:

https://abcdef1234.execute-api.us-east-1.amazonaws.com

API Gateway uses the policy attached to that default endpoint because the SNI value matches its hostname.

If the client instead connects through a custom domain name, such as:

https://api.example.com

API Gateway uses the policy attached to that custom domain. In this case, the SNI value api.example.com determines which policy is enforced.

This distinction is important even if you disable your default endpoint. TLS negotiation always occurs before API Gateway evaluates endpoint settings, so the default endpoint security policy still applies to clients that connect directly to its hostname. To avoid unexpected client behavior, you should keep the API and its custom domain name aligned with the same security policy whenever possible.

Understanding endpoint access mode

When you use an enhanced security policy (SecurityPolicy_*), you must also specify an endpoint access mode. Endpoint access mode defines how strictly API Gateway validates the network path a request takes before it reaches your API. This gives you an additional layer of governance and helps you prevent unauthorized or misrouted traffic.

You can choose between two modes:

  • BASIC mode provides standard API Gateway behavior. It is the recommended starting point when you migrate an existing API to an enhanced security policy. Clients can continue reaching your API as they do today, without additional validation.
  • STRICT mode adds enforcement checks to ensure that requests originate from the correct endpoint type, and TLS negotiation aligns with your configuration.

When you enable STRICT mode, API Gateway performs additional validations, such as:

  • The SNI and HTTP Host header values match
  • The request originates from the same endpoint type as your API (Regional, edge-optimized, or private)

If any of these validations fail, API Gateway rejects the request. STRICT is a viable choice when you need stronger security guarantees, such as when running regulated or sensitive workloads. See API Gateway documentation for additional details.

When you switch from BASIC to STRICT mode, it takes up to 15 minutes for the change to fully propagate. Your API remains available during this period. If your endpoint access mode is set to STRICT, you cannot change the endpoint type until you revert the mode back to BASIC.
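The interplay between SNI-based policy selection and endpoint access mode can be modeled with a small Python sketch. This is a toy model for reasoning about client behavior, not API Gateway's implementation: the `Request` class, its `arrived_via` field, and both function names are hypothetical, and only the two STRICT validations named above are modeled.

```python
from dataclasses import dataclass


@dataclass
class Request:
    sni: str          # hostname sent in the TLS handshake
    host_header: str  # value of the HTTP Host header
    arrived_via: str  # "REGIONAL", "EDGE", or "PRIVATE" (hypothetical field)


def select_policy(request, policies_by_hostname):
    """The policy is chosen from the SNI value, not the Host header."""
    return policies_by_hostname.get(request.sni)


def passes_access_mode(request, api_endpoint_type, mode):
    """Apply the two example STRICT validations described in the post."""
    if mode == "BASIC":
        return True  # no additional validation
    if mode != "STRICT":
        raise ValueError(f"unknown endpoint access mode: {mode}")
    return (
        request.sni.lower() == request.host_header.lower()
        and request.arrived_via == api_endpoint_type
    )


policies = {
    "abcdef1234.execute-api.us-east-1.amazonaws.com": "TLS_1_2",
    "api.example.com": "SecurityPolicy_TLS13_1_3_2025_09",
}
mismatched = Request(
    sni="abcdef1234.execute-api.us-east-1.amazonaws.com",
    host_header="api.example.com",
    arrived_via="REGIONAL",
)
print(select_policy(mismatched, policies))
print(passes_access_mode(mismatched, "REGIONAL", "STRICT"))
```

In this model, a client that opens TLS to the default endpoint hostname but sends the custom domain in the Host header is evaluated against the default endpoint's policy and fails the STRICT check, which is the kind of client worth finding while still in BASIC mode.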

Applying security policies to new and existing APIs

You can apply a security policy when you create a new REST API or custom domain name, or update an existing resource to use one of the enhanced SecurityPolicy_* options. When migrating existing APIs, the recommended approach is to start with BASIC mode, validate client behavior (SNI and HTTP Host header values match, request originates from the same endpoint type as your API), and then move to STRICT mode once you confirm compatibility.

The following code snippets illustrate how to apply security policies to different scenarios:

Create a REST API with a security policy and STRICT endpoint access mode

You can attach a security policy directly during API creation, removing the need for extra infrastructure just to control TLS negotiation.

aws apigateway create-rest-api \
  --name "your-private-api-name" \
  --endpoint-configuration '{"types":["PRIVATE"]}' \
  --security-policy "SecurityPolicy_TLS13_1_3_2025_09" \
  --endpoint-access-mode STRICT \
  --policy file://api-policy.json

Create a custom domain name with a security policy and STRICT endpoint access mode

You can also specify the security policy when creating a custom domain name. API Gateway applies the selected policy during TLS negotiation based on the SNI value the client provides.

aws apigateway create-domain-name \
  --domain-name api.example.com \
  --regional-certificate-arn arn:aws:acm:region:account-id:certificate/certificate-id \
  --endpoint-configuration '{"types":["REGIONAL"]}' \
  --security-policy SecurityPolicy_TLS13_1_3_2025_09 \
  --endpoint-access-mode STRICT

Updating existing REST API

If you are migrating an existing API, start by applying the enhanced security policy with BASIC mode. After confirming that your clients can connect with BASIC mode as expected, proceed to enable the STRICT mode.

1. Apply the new policy with BASIC mode

aws apigateway update-rest-api --rest-api-id abcd123 --patch-operations '[
    {
         "op": "replace",
         "path": "/securityPolicy",
         "value": "SecurityPolicy_TLS13_1_3_2025_09"
    },
    {
         "op": "replace",
         "path": "/endpointAccessMode",
         "value": "BASIC"
     }
]'

Verify your clients can consume the API as expected using access logs and performance metrics in Amazon CloudWatch.

2. Enable the STRICT mode after validation

aws apigateway update-rest-api --rest-api-id abcd123 --patch-operations '[
    {
        "op": "replace",
        "path": "/endpointAccessMode",
        "value": "STRICT"
     }
]'
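If you script this two-step migration across many APIs, it can help to generate the patch documents rather than hand-edit them. The following Python sketch builds the same JSON shown in the two steps above; the helper name `build_patch_operations` is hypothetical, and the paths and values are the ones used in this post.

```python
import json


def build_patch_operations(security_policy=None, endpoint_access_mode=None):
    """Build a patch-operations list for the two settings used in this post."""
    operations = []
    if security_policy is not None:
        operations.append(
            {"op": "replace", "path": "/securityPolicy", "value": security_policy}
        )
    if endpoint_access_mode is not None:
        if endpoint_access_mode not in ("BASIC", "STRICT"):
            raise ValueError("endpoint_access_mode must be BASIC or STRICT")
        operations.append(
            {
                "op": "replace",
                "path": "/endpointAccessMode",
                "value": endpoint_access_mode,
            }
        )
    if not operations:
        raise ValueError("nothing to change")
    return operations


# Step 1: enhanced policy with BASIC mode. Step 2: switch to STRICT.
step1 = build_patch_operations("SecurityPolicy_TLS13_1_3_2025_09", "BASIC")
step2 = build_patch_operations(endpoint_access_mode="STRICT")
print(json.dumps(step1, indent=2))
print(json.dumps(step2, indent=2))
```

The printed JSON can be passed as the --patch-operations argument of the commands shown above.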

Updating existing custom domain name

Custom domain names follow the same migration approach as REST APIs.

1. Apply the new policy with BASIC mode and validate clients can successfully connect.

aws apigateway update-domain-name --domain-name api.example.com --patch-operations '[
    {
        "op": "replace",
        "path": "/securityPolicy",
        "value": "SecurityPolicy_TLS13_1_3_2025_09"
    },
    {
        "op": "replace",
        "path": "/endpointAccessMode",
        "value": "BASIC"
     }
]'

2. Enable the STRICT mode after validation

aws apigateway update-domain-name --domain-name api.example.com --patch-operations '[
    {
        "op": "replace",
        "path": "/endpointAccessMode",
        "value": "STRICT"
     }
]'

After you update your REST API or custom domain configuration, redeploy your API so that stages receive the new settings. When you change a security policy, the update takes up to 15 minutes to complete. The API status appears as UPDATING while the change propagates and returns to AVAILABLE when complete. Your API remains fully functional throughout this process.

Rolling back endpoint access mode

If you notice clients failing to connect to your API after applying STRICT mode, you can revert the endpoint access mode to BASIC at any time. The following code snippet illustrates doing this for a REST API.

aws apigateway update-rest-api --rest-api-id abcd123 --patch-operations '[
    {
      "op": "replace",
      "path": "/endpointAccessMode",
      "value": "BASIC"
    }
  ]'

You can use the same approach to update a custom domain name.

Monitoring TLS usage and policy migrations

As you adopt enhanced security policies, it is important to understand how clients negotiate encrypted connections with your API. Monitoring helps you verify client readiness, identify legacy consumers that may require updates, and validate that STRICT endpoint access mode behaves as expected during rollout. Use the following API Gateway access logs variables to monitor protocol and cipher usage over time.

  • $context.tlsVersion – the negotiated TLS version
  • $context.cipherSuite – the cipher suite selected during the handshake

You can use these variables to confirm that:

  • Clients are using the expected minimum TLS version
  • CBC-based ciphers are no longer used after you move to a hardened policy
  • PQC and FIPS-aligned policies are being exercised by the appropriate clients

Access logs are especially useful during migrations, where validating the actual client behavior is a prerequisite before enabling STRICT mode. For example, if you still observe live clients negotiating TLS 1.0 or TLS 1.2 CBC ciphers after applying a hardened policy in BASIC mode, you can identify the affected clients and plan remediation before switching to STRICT mode.
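As an illustration of this kind of check, the following Python sketch tallies TLS versions and cipher suites from access log lines. It assumes you configured a JSON access log format that writes $context.tlsVersion and $context.cipherSuite to keys named `tlsVersion` and `cipherSuite`; those key names, the function name, and the sample values are assumptions for this example, so adjust them to match what your logs actually contain.

```python
import json
from collections import Counter


def tally_tls_usage(log_lines, allowed_versions):
    """Count (version, cipher) pairs and flag versions outside the allowed set."""
    usage = Counter()
    legacy = Counter()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        key = (record.get("tlsVersion"), record.get("cipherSuite"))
        usage[key] += 1
        if record.get("tlsVersion") not in allowed_versions:
            legacy[key] += 1
    return usage, legacy


# Illustrative sample lines; real values depend on your log format.
sample = [
    '{"tlsVersion": "TLSv1.3", "cipherSuite": "TLS_AES_128_GCM_SHA256"}',
    '{"tlsVersion": "TLSv1.3", "cipherSuite": "TLS_AES_128_GCM_SHA256"}',
    '{"tlsVersion": "TLSv1.2", "cipherSuite": "EXAMPLE-LEGACY-CIPHER"}',
]
usage, legacy = tally_tls_usage(sample, allowed_versions={"TLSv1.3"})
for (version, cipher), count in legacy.items():
    print(f"{count} request(s) still using {version} with {cipher}")
```

Adding a client identifier to the log format, such as a source IP or API key ID, lets you extend the tally to show which consumers need remediation.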

Future-proof security configurations

Some of the new policies combine TLS 1.3 with post-quantum cryptography (PQC) to help you prepare for a future where quantum-capable threat actors exist. With these policies you can start testing and adopting quantum-resistant algorithms without redesigning your API architecture.

As standards evolve and new cipher suites are introduced, API Gateway’s policy model provides you with a clear path for adding new variants while keeping your configuration simple and predictable.

Conclusion and next steps

Enhanced TLS security policies and endpoint access mode in Amazon API Gateway give you direct control over how clients establish secure connections to your APIs. You can choose the policies that match your compliance needs, such as PCI DSS, FIPS, Open Banking, and PQC, and use STRICT mode to control how traffic reaches your endpoints and apply additional domain-level validations, further hardening the security of your APIs.

To get started:

  1. Review the list of available security policies in the API Gateway documentation.
  2. Identify which REST APIs and domains require stronger TLS controls.
  3. Apply an appropriate SecurityPolicy_* policy with BASIC mode.
  4. Validate client behavior using access logs and CloudWatch metrics.
  5. Move to STRICT mode when you are ready to enforce additional connection-level protection.

For more information about building serverless architectures, see ServerlessLand.com.

Practical steps to minimize key exposure using AWS Security Services

Post Syndicated from Jennifer Paz original https://aws.amazon.com/blogs/security/practical-steps-to-minimize-key-exposure-using-aws-security-services/

Exposed long-term credentials continue to be the top entry point used by threat actors in security incidents observed by the AWS Customer Incident Response Team (CIRT). The exposure and subsequent use of long-term credentials or access keys by threat actors poses security risks in cloud environments. Additionally, poor key rotation practices, sharing of access keys among multiple users, or failing to revoke unused credentials can leave systems exposed.

Using long-term credentials is strongly discouraged and presents an opportunity to migrate towards AWS Identity and Access Management (IAM) roles and federated access. While our recommended best practice is for customers to migrate away from long-term credentials, we recognize that this transition might not be immediately feasible for all organizations.

Building a comprehensive defense against unintended access to long-term credentials requires a strategic layered approach. This approach is intended to bridge the gap between ideal security practices and real-world operational constraints, providing actionable steps for teams managing legacy AWS workloads that require the use of long-term credentials.

In this post, you learn how to build your defense, starting with identifying existing risks and potential exposures through services such as Amazon CodeGuru Security and AWS IAM Access Analyzer, providing visibility into credential risks across the environment. This is then complemented by establishing strict boundaries through service control policies (SCPs) and data perimeters to control how and where credentials can be created and used. With these mechanisms in place, you can strengthen your position with network-level controls that help protect the infrastructure where access keys might be used, implementing services such as AWS WAF and Amazon Inspector to help protect against exploitation of vulnerabilities. Finally, you implement operational best practices such as automated secret rotation to maintain ongoing security hygiene and minimize the impact of potential compromise.

Detect current access keys and exposure

Audit current access keys

For comprehensive auditing, organizations should regularly generate credential reports to identify IAM user ownership of long-lived credentials and other relevant information such as the last time the key was rotated, last time it was used, last service used and last region used. These reports provide essential visibility into your credential landscape, enabling you to spot unused or potentially compromised credentials by focusing on access keys with stale activity, keys exceeding rotation policies, and unexpected usage patterns from unfamiliar regions.
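As a sketch of how such a review might be automated, the following Python example scans a credential report for active access keys that exceed a rotation threshold. The sample CSV is trimmed to the columns this example reads (a real report has many more), the 90-day threshold is an arbitrary illustration rather than a recommendation from this post, and the function name `find_stale_keys` is hypothetical.

```python
import csv
import io
from datetime import datetime, timedelta, timezone


def find_stale_keys(report_csv, max_age_days, now):
    """Return (user, key column prefix, age in days) for overdue active keys."""
    stale = []
    for row in csv.DictReader(io.StringIO(report_csv)):
        for n in ("1", "2"):
            if row.get(f"access_key_{n}_active") != "true":
                continue
            rotated = row.get(f"access_key_{n}_last_rotated", "N/A")
            if rotated == "N/A":
                continue
            age = now - datetime.fromisoformat(rotated)
            if age > timedelta(days=max_age_days):
                stale.append((row["user"], f"access_key_{n}", age.days))
    return stale


# Trimmed, illustrative report content.
sample_report = """user,access_key_1_active,access_key_1_last_rotated,access_key_2_active,access_key_2_last_rotated
alice,true,2024-01-15T00:00:00+00:00,false,N/A
bob,true,2025-05-01T00:00:00+00:00,false,N/A
"""
now = datetime(2025, 6, 1, tzinfo=timezone.utc)
for user, key, age_days in find_stale_keys(sample_report, 90, now):
    print(f"{user}: {key} last rotated {age_days} days ago")
```

The same loop can be extended to the last-used columns of the report to surface keys with stale activity or usage from unfamiliar Regions.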

Detect exposed access keys

A common source of credential compromise occurs through inadvertent commits to public repositories. When developers accidentally commit credentials to public repositories, these credentials can be harvested by automated scanning tools used by adversaries. Code scanning is a foundational step that helps catch these critical security issues early, before sensitive credentials can be accidentally committed to code repositories or deployed to production environments where they could be exploited.

You can use the secrets detection capability of CodeGuru Security to proactively identify exposed sensitive data in your codebase.

The tool integrates with AWS Secrets Manager, employing detection mechanisms to locate unencrypted secrets in your code, such as AWS secret access keys, embedded passwords, and database connection strings.

When CodeGuru Security discovers unprotected secrets during a scan, it creates a finding with recommended remediation to address the vulnerability.

AWS Trusted Advisor also contains an exposed access key check that checks popular code repositories for access keys that have been exposed to the public and for irregular Amazon Elastic Compute Cloud (Amazon EC2) usage that could be the result of a compromised access key.

Note that while these are valuable security tools, they cannot detect secrets or access keys stored in locations outside their scanning scope, such as local development machines or external systems. They should be used as part of a broader security strategy, not as the sole method for identifying and preventing credential exposure.

When addressing potentially compromised access keys, it is advised to immediately rotate the keys. See instructions on how to rotate access keys for IAM Users.

Detect unused access

Beyond identifying exposed credentials, detecting unused access keys helps minimize the attack surface. IAM Access Analyzer contains an unused access analyzer that looks for access permissions that are either overly generous or that have fallen into disuse, including unused IAM roles, access keys for IAM users, passwords for IAM users, and services and actions for active IAM roles and users. After reviewing the findings generated by an organization-wide or account-specific analyzer, you can remove or modify permissions that aren’t needed. By identifying and revoking unused credentials and access, you can limit the impact if credentials have been obtained by a threat actor.

By implementing these tools, you can gain insights into credential risks across your environment. The combined capabilities help surface embedded secrets, exposed access keys, and credentials requiring removal.

Preventive guardrails: Establish a data perimeter

Now that you’ve learned how to identify exposed or unused credentials, let’s explore how you can use SCPs and resource control policies (RCPs) to create a data perimeter and help make sure that only your trusted identities are accessing trusted resources from expected networks. Implementing preventive guardrails around your AWS environment is crucial for helping protect against unauthorized access and potential access key compromises. For more information on what a data perimeter is and how to establish one, see the Establishing a Data Perimeter on AWS blog post series.

The following SCP denies the use of an IAM user's credentials from outside expected networks (your corporate Classless Inter-Domain Routing (CIDR) range or a specific virtual private cloud (VPC)). The NotAction element lists several actions that are exempted from the deny because restricting them would disrupt access to the corresponding services. Examples of SCPs and RCPs can be found in the data-perimeter-policy-examples, which is the source of truth for newly revised policies. The following example has been updated to address the use case of user credentials being used outside of expected networks.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "EnforceNetworkPerimeterOnIAMUsers",
            "Effect": "Deny",
            "NotAction": [
                "es:ES*",
                "dax:GetItem",
                "dax:BatchGetItem",
                "dax:Query",
                "dax:Scan",
                "dax:PutItem",
                "dax:UpdateItem",
                "dax:DeleteItem",
                "dax:BatchWriteItem",
                "dax:ConditionCheckItem",
                "neptune-db:*",
                "kafka-cluster:*",
                "elasticfilesystem:client*",
                "rds-db:connect"
            ],
            "Resource": "*",
            "Condition": {
                "BoolIfExists": {
                    "aws:ViaAWSService": "false"
                },
                "NotIpAddressIfExists": {
                    "aws:SourceIp": [
                        "<my-corporate-cidr>"
                    ]
                },
                "StringNotEqualsIfExists": {
                    "aws:SourceVpc": [
                        "<my-vpc>"
                    ]
                },
                "ArnLike": {
                    "aws:PrincipalArn": [
                        "arn:aws:iam::*:user/*"
                    ]
                }
            }
        }
    ]
}

By implementing this network perimeter, you can reduce the risk of credential compromise leading to unauthorized access and data exposure. Threat actors attempting to use stolen credentials from a coffee shop or home network will be blocked, helping to limit the impact of unintended access to credentials.
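To reason about which requests the SCP's condition block would deny, the following Python sketch models its four conditions. It is a simplified illustration, not a policy simulator: it ignores the NotAction exemptions, approximates ArnLike with shell-style wildcard matching, and uses a documentation CIDR range and a made-up VPC ID in place of the placeholders. All conditions in a statement must be true for the deny to apply, and each IfExists operator evaluates to true when its key is absent from the request.

```python
import ipaddress
from fnmatch import fnmatchcase


def scp_denies(principal_arn, source_ip=None, source_vpc=None,
               via_aws_service=None,
               corporate_cidr="203.0.113.0/24",   # stand-in for <my-corporate-cidr>
               allowed_vpc="vpc-0example"):        # stand-in for <my-vpc>
    """Return True when every condition of the deny statement matches."""
    # ArnLike on aws:PrincipalArn (approximated with wildcard matching)
    is_iam_user = fnmatchcase(principal_arn, "arn:aws:iam::*:user/*")
    # BoolIfExists aws:ViaAWSService == false (true when the key is absent)
    not_via_service = via_aws_service is None or via_aws_service is False
    # NotIpAddressIfExists aws:SourceIp (true when the key is absent)
    outside_cidr = source_ip is None or (
        ipaddress.ip_address(source_ip)
        not in ipaddress.ip_network(corporate_cidr)
    )
    # StringNotEqualsIfExists aws:SourceVpc (true when the key is absent)
    outside_vpc = source_vpc is None or source_vpc != allowed_vpc
    return is_iam_user and not_via_service and outside_cidr and outside_vpc


user = "arn:aws:iam::111122223333:user/ci-deployer"
print(scp_denies(user, source_ip="198.51.100.7"))     # outside the CIDR
print(scp_denies(user, source_ip="203.0.113.10"))     # inside the CIDR
print(scp_denies(user, source_vpc="vpc-0example"))    # through the allowed VPC
```

Because the conditions are combined with a logical AND, a request is allowed through as soon as it comes from either the corporate range or the allowed VPC, and IAM roles are unaffected by this statement.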

To further increase your defense in depth, you can use RCPs to help protect your data, such as by using them to control which identities can access your resources. For example, you might want to allow identities in your organization to access resources in your organization. You might also want to prevent identities external to your organization from accessing your resources. You can enforce this control using RCPs. You can use RCPs to restrict the maximum available access to your resources and include which principals, both inside and outside your organization, can access your resources. SCPs can only impact the effective permissions for principals within your AWS organization.

By implementing the following RCP, you can help make sure that if long-lived credentials are accidentally exposed, unauthorized users from outside your organization will be blocked from using them to access your critical data and resources. The policy denies the listed actions, including Amazon Simple Storage Service (Amazon S3) actions, unless the request comes from your corporate CIDR range (NotIpAddressIfExists with aws:SourceIp) or from your VPC (StringNotEqualsIfExists with aws:SourceVpc). See the list of AWS services that support RCPs. Examples of SCPs and RCPs can be found in this GitHub repository, which is the source of truth for newly revised policies. The following example has been updated to address the use case discussed in this post.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EnforceNetworkPerimeter",
      "Effect": "Deny",
      "Principal": "*",
      "Action": [
        "s3:*",
        "sqs:*",
        "kms:*",
        "secretsmanager:*",
        "sts:AssumeRole",
        "sts:DecodeAuthorizationMessage",
        "sts:GetAccessKeyInfo",
        "sts:GetFederationToken",
        "sts:GetServiceBearerToken",
        "sts:GetSessionToken",
        "sts:SetContext",
        "aoss:*",
        "ecr:*"
      ],
      "Resource": "*",
      "Condition": {
        "NotIpAddressIfExists": {
          "aws:SourceIp": "<my-corporate-cidr>"
        },
        "StringNotEqualsIfExists": {
          "aws:SourceVpc": "<my-vpc>"
        },
        "BoolIfExists": {
          "aws:PrincipalIsAWSService": "false",
          "aws:ViaAWSService": "false"
        }
      }
    }
  ]
}

If you’re ready to begin migrating away from long-term credentials, using an SCP to deny access key creation and deny updates to existing keys helps enforce the use of more secure authentication methods like IAM roles and federated access. This policy denies principals from creating or updating an AWS access key.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Action": [
                "iam:CreateAccessKey",
                "iam:UpdateAccessKey"
            ],
            "Resource": "*"
        }
    ]
}

In addition to establishing these data perimeter controls, let’s examine how network controls protect the runtime environments where access keys operate.

Network controls: Protecting the runtime environment for access keys

Beyond building a data perimeter and using SCPs and RCPs, protecting the compute and network infrastructure that uses these access keys is essential. The risk of credential exposure through compromised runtime environments makes infrastructure protection a critical component of access key security, because bad actors often target these environments to gain unauthorized access.

Security groups and network ACLs (NACLs)

Use network-level protections that act as firewalls at different levels, such as the instance level (security groups) or the subnet level (NACLs), to help protect against unauthorized access.

  • Restricting critical ports, such as SSH (port 22) and RDP (port 3389), is essential because they’re prime targets for bad actors seeking unauthorized system access. Open administrative ports in your security groups can increase your attack surface and security risk. Using AWS Systems Manager Session Manager helps provide secure remote access without exposing inbound ports, alleviating the need for bastion hosts or SSH key management.
  • NACLs effectively block access at the subnet level by acting as stateless packet filters at subnet boundaries. Unlike security groups that protect individual instances, NACLs help secure entire subnets with explicit allow/deny rules for both inbound and outbound traffic. They create a critical perimeter defense layer that filters traffic before reaching your instances. When deployed as part of a defense-in-depth approach, NACLs provide subnet-level isolation between application tiers, block malicious traffic patterns at the network edge, and maintain protection even if other security layers are compromised, helping to facilitate comprehensive network security through multiple independent control points.
  • For enhanced network protection beyond NACLs, AWS Network Firewall enables enterprise-grade perimeter defense through comprehensive VPC protection. It combines intrusion prevention systems, domain filtering, deep packet inspection, and geographic IP controls, while automatically safeguarding your cloud environment against emerging threats using global threat intelligence gathered by Amazon. By using Network Firewall and AWS Transit Gateway integration, you can implement consistent security policies across your VPCs and Availability Zones with centralized management.
  • To automate and scale network security across your organization, AWS Firewall Manager provides centralized administration of both Network Firewall rules and security group policies. As your organization grows, Firewall Manager helps maintain security by automating the deployment of common security group policies, cleaning up unused groups, and remediating overly permissive rules across multiple accounts and organizational units.
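As an illustration of the first point, the following Python sketch flags security group rules that open SSH or RDP to the whole internet. It assumes input shaped like the SecurityGroups list returned by the Amazon EC2 DescribeSecurityGroups API (GroupId, IpPermissions, IpRanges, Ipv6Ranges). The function name and sample data are hypothetical, and the check is deliberately narrow rather than a complete audit.

```python
ADMIN_PORTS = {22: "SSH", 3389: "RDP"}
OPEN_CIDRS = {"0.0.0.0/0", "::/0"}


def find_open_admin_ports(security_groups: list[dict]) -> list[tuple[str, str]]:
    """Return (group ID, port name) pairs for admin ports open to any address."""
    findings = []
    for group in security_groups:
        for rule in group.get("IpPermissions", []):
            # Only TCP rules ("tcp" or protocol number "6") and all-traffic
            # rules ("-1") can expose SSH or RDP.
            if rule.get("IpProtocol") not in ("tcp", "6", "-1"):
                continue
            cidrs = {r.get("CidrIp") for r in rule.get("IpRanges", [])}
            cidrs |= {r.get("CidrIpv6") for r in rule.get("Ipv6Ranges", [])}
            if not cidrs & OPEN_CIDRS:
                continue
            # All-traffic rules omit FromPort/ToPort, so default to the full range.
            low, high = rule.get("FromPort", 0), rule.get("ToPort", 65535)
            for port, name in ADMIN_PORTS.items():
                if low <= port <= high:
                    findings.append((group["GroupId"], name))
    return findings


# Hypothetical sample data for illustration only.
sample_groups = [
    {"GroupId": "sg-1", "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]},
    {"GroupId": "sg-2", "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]},
]
print(find_open_admin_ports(sample_groups))
```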

Amazon Inspector

To help identify unintended network exposure at scale, consider using Amazon Inspector. Amazon Inspector continually scans AWS workloads for software vulnerabilities and unintended network exposure, helping you identify and remediate security vulnerabilities before they can be exploited.

Key capabilities include:

  • Package vulnerability: Package vulnerability findings identify software packages in your AWS environment that are exposed to Common Vulnerabilities and Exposures (CVEs). Bad actors can exploit these unpatched vulnerabilities to compromise the confidentiality, integrity, or availability of data, or to access other systems.
  • Code vulnerability: Code vulnerability findings identify lines in your AWS Lambda code that bad actors could exploit. Code vulnerabilities include injection flaws, data leaks, weak cryptography, or missing encryption in your code. It identifies policy violations and vulnerabilities based on internal detectors developed in collaboration with CodeGuru Security. For a list of possible detections, see the Amazon Q Detector Library.
  • Network reachability: Network reachability findings show whether your ports are reachable from the internet through an internet gateway (including instances behind Application Load Balancers or Classic Load Balancers), a VPC peering connection, or a VPN through a virtual gateway. These findings highlight network configurations that may be overly permissive, such as mismanaged security groups, NACLs or internet gateways, or that might allow for potentially malicious access. It can help identify open SSH ports on your instance security groups.

AWS WAF

Complementing your network security controls and vulnerability management, AWS WAF provides an additional layer of defense by filtering malicious web traffic that could lead to credential exposure.

AWS WAF offers several managed rule groups to protect against unauthorized access and common vulnerabilities:

  • AWS WAF Fraud Control account creation fraud prevention (ACFP) rule group: ACFP uses request tokens to gather information about the client browser and about the level of human interactivity in the creation of the account creation request. The rule group detects and manages bulk account creation attempts by aggregating requests by IP address and client session and aggregating by the provided account information such as the physical address and phone number. Additionally, the rule group detects and blocks the creation of new accounts using credentials that have been compromised, which helps protect the security posture of your application and of your new users.
  • AWS WAF Fraud Control account takeover prevention (ATP) rule group: To help prevent account takeovers that might lead to fraudulent activity, ATP gives you visibility and control over anomalous sign-in attempts and sign-in attempts that use stolen credentials. For Amazon CloudFront distributions, in addition to inspecting incoming sign-in requests, the ATP rule group inspects your application’s responses to sign-in attempts to track success and failure rates. ATP checks email and password combinations against its stolen credential database, which is updated regularly as new leaked credentials are found on the dark web.

Operational best practices

To complement these protective layers and maintain ongoing security posture, implement automated credential management through Secrets Manager to help facilitate proper rotation and lifecycle management of access keys throughout your environment. This automation reduces human error, helps facilitate timely credential updates and limits the exposure window if credentials are compromised.

It’s recommended to rotate keys at least every 90 days. Secrets Manager helps by automating the process of rotating secrets on a schedule, helping to make sure that credentials are regularly updated without manual intervention. It also centralizes the storage of secrets, reducing the likelihood of sharing access keys among multiple users. With Secrets Manager, you can configure automatic key rotation using a Lambda integration.

There is also an existing solution that can be deployed to implement automatic access key rotation at scale. This pattern helps you automatically rotate IAM access keys by using AWS CloudFormation templates, which are provided in the GitHub IAM key rotation repository.

If you’re unable to implement automatic rotation and need a quicker way to identify access keys that need to be rotated, AWS Trusted Advisor has a security check for IAM access key rotation that checks for active IAM access keys that haven’t been rotated in the last 90 days. You can use the security check to drill down on which access keys in your environment need to be rotated if you need to perform manual rotation.
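If you want to run a similar check yourself, the 90-day rule can be sketched in a few lines of Python. This example assumes records shaped like the AccessKeyMetadata entries returned by the IAM ListAccessKeys API (AccessKeyId, Status, and a timezone-aware CreateDate). The function name and sample keys are hypothetical.

```python
from datetime import datetime, timedelta, timezone

MAX_KEY_AGE = timedelta(days=90)


def keys_due_for_rotation(key_metadata, now=None):
    """Return the IDs of active access keys created more than 90 days ago."""
    now = now or datetime.now(timezone.utc)
    return [
        key["AccessKeyId"]
        for key in key_metadata
        if key["Status"] == "Active" and now - key["CreateDate"] > MAX_KEY_AGE
    ]


# Hypothetical sample data for illustration only.
sample_keys = [
    {"AccessKeyId": "AKIAOLDEXAMPLE", "Status": "Active",
     "CreateDate": datetime(2025, 1, 1, tzinfo=timezone.utc)},
    {"AccessKeyId": "AKIANEWEXAMPLE", "Status": "Active",
     "CreateDate": datetime(2025, 5, 1, tzinfo=timezone.utc)},
    {"AccessKeyId": "AKIAOFFEXAMPLE", "Status": "Inactive",
     "CreateDate": datetime(2024, 1, 1, tzinfo=timezone.utc)},
]
print(keys_due_for_rotation(sample_keys, now=datetime(2025, 6, 1, tzinfo=timezone.utc)))
```

Inactive keys are skipped here because they can no longer be used to sign requests; you might still want to review and delete them separately.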

Detect anomalous IAM activity

Finally, while proactive measures to secure your IAM infrastructure are crucial, it’s equally important to have robust detection and alerting mechanisms in place. No matter how diligent your efforts, there is still a possibility of unforeseen threats or unauthorized activities. That’s why a comprehensive defense-in-depth strategy should include the ability to quickly identify and respond to anomalous IAM-related events. Amazon GuardDuty combines machine learning and integrated threat intelligence to help protect AWS accounts, workloads, and data from threats.

GuardDuty Extended Threat Detection automatically correlates multiple events across different data sources to identify potential threats within AWS environments. When Extended Threat Detection detects suspicious sequences of activities, it generates comprehensive attack sequence findings. The system analyzes individual API activities as weak signals, which might not indicate risks independently, but when observed together in specific patterns can reveal potential security issues.

This capability is enabled by default when GuardDuty is activated in an AWS account, helping provide protection without additional configuration.

The specific attack sequence finding related to compromised credentials is AttackSequence:IAM/CompromisedCredentials, which has Critical severity. This finding informs you that GuardDuty detected a sequence of suspicious actions made by using AWS credentials that impacts one or more resources in your environment. Multiple suspicious and anomalous behaviors were observed from the same credentials, resulting in higher confidence that the credentials are being misused.
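One way to act on this finding is to route it through Amazon EventBridge, where GuardDuty findings arrive with the source aws.guardduty and the detail-type "GuardDuty Finding". The following Python sketch shows an event pattern for this finding type and a deliberately minimal matcher for checking sample events locally. The matcher only handles exact-value patterns and is an assumption for illustration, not a reimplementation of EventBridge matching.

```python
import json

# Event pattern that selects only the compromised-credentials attack sequence finding.
EVENT_PATTERN = {
    "source": ["aws.guardduty"],
    "detail-type": ["GuardDuty Finding"],
    "detail": {"type": ["AttackSequence:IAM/CompromisedCredentials"]},
}


def matches(pattern: dict, event: dict) -> bool:
    """Minimal local matcher: lists mean 'any of these values', dicts nest."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


# Hypothetical, heavily trimmed sample event for illustration only.
sample_event = {
    "source": "aws.guardduty",
    "detail-type": "GuardDuty Finding",
    "detail": {"type": "AttackSequence:IAM/CompromisedCredentials"},
}
print(matches(EVENT_PATTERN, sample_event))
print(json.dumps(EVENT_PATTERN, indent=2))
```

The JSON printed at the end is the form you would supply as the rule's event pattern.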

Conclusion

The security best practices outlined in this post provide a comprehensive, multi-layered approach to mitigate the risks associated with long-term credentials. By implementing proactive code scanning, automated key rotation, network-level controls, data perimeter restrictions, and threat detection, you can significantly reduce the attack surface and better protect your organization’s AWS resources until a full migration to temporary credentials is feasible.

While the recommendations provided in this post represent an ample set of controls to put organizations in a good security posture, there might be additional security measures that can be taken depending on the specific needs and risk profile of each environment. The key is to adopt a holistic, layered approach to credential management and protection. By doing so, you can bridge the gap until a complete transition to temporary credentials becomes possible.

Implementing these security measures can help reduce risks, but long-term credentials inherently carry security risks. Even with strict best practices and comprehensive security controls, the possibility of credential compromise cannot be removed completely. You should consider evaluating your organization’s security posture and prioritizing temporary credentials through IAM roles and federation whenever possible. If you have questions or need help, AWS is here to support you.

Jennifer Paz

Jennifer is a Security Engineer with over a decade of experience, currently serving on the AWS Customer Incident Response Team (CIRT). Jennifer enjoys helping customers tackle security challenges and implementing complex solutions to help enhance their security posture. When not at work, Jennifer is an avid runner, pickleball enthusiast, traveler, and foodie, always on the hunt for new culinary adventures.
Samantha Tavares

Samantha is an Incident Responder on the AWS Customer Incident Response Team. She’s passionate about helping customers protect their cloud environments. When she’s not diving into security challenges, she’s sweating at CrossFit, or planning her next travel adventure.

Accelerate investigations with AWS Security Incident Response AI-powered capabilities

Post Syndicated from Daniel Begimher original https://aws.amazon.com/blogs/security/accelerate-investigations-with-aws-security-incident-response-ai-powered-capabilities/

If you’ve ever spent hours manually digging through AWS CloudTrail logs, checking AWS Identity and Access Management (IAM) permissions, and piecing together the timeline of a security event, you understand the time investment required for incident investigation. Today, we’re excited to announce the addition of AI-powered investigation capabilities to AWS Security Incident Response that automate this evidence gathering and analysis work.

AWS Security Incident Response helps you prepare for, respond to, and recover from security events faster and more effectively. The service combines automated security finding monitoring and triage, containment, and now AI-powered investigation capabilities with 24/7 direct access to the AWS Customer Incident Response Team (CIRT).

While investigating a suspicious API call or unusual network activity, scoping and validation require querying multiple data sources, correlating timestamps, identifying related events, and building a complete picture of what happened. Security operations center (SOC) analysts devote a significant amount of time to each investigation, with roughly half of that effort spent manually gathering and piecing together evidence from various tools and complex logs. This manual effort can delay your analysis and response.

AWS is introducing an investigative agent to Security Incident Response, changing this paradigm and adding layers of efficiency. The investigative agent helps you reduce the time required to validate and respond to potential security events. When a case for a security concern is created, either by you or proactively by Security Incident Response, the investigative agent asks clarifying questions to make sure it understands the full context of the potential security event. It then automatically gathers evidence from CloudTrail events, IAM configurations, and Amazon Elastic Compute Cloud (Amazon EC2) instance details and even analyzes cost usage patterns. Within minutes, it correlates the evidence, identifies patterns, and presents you with a clear summary.

How it works in practice

Before diving into an example, let’s paint a clear picture of where the investigative agent lives, how it’s accessed, and its purpose and function. The investigative agent is built directly into Security Incident Response and is automatically available when you create a case. Its purpose is to act as your first responder—gathering evidence, correlating data across AWS services, and building a comprehensive timeline of events so you can quickly move from detection to recovery.

For example: you discover that AWS credentials for an IAM user in your account were exposed in a public GitHub repository. You need to understand what actions were taken with those credentials and properly scope the potential security event, including lateral movement and reconnaissance operations. You need to identify persistence mechanisms that might have been created and determine the appropriate containment steps. To get started, you create a case in the Security Incident Response console and describe the situation.

Here’s where the agent’s approach differs from traditional automation: it asks clarifying questions first. When were the credentials first exposed? What’s the IAM user name? Have you already rotated the credentials? Which AWS account is affected?

This interactive step collects the relevant details and metadata before the agent starts gathering evidence. As a result, you’re not stuck with generic results; the investigation is tailored to your specific concern.

After the agent has what it needs, it investigates. It looks up CloudTrail events to see what API calls were made using the compromised credentials, pulls IAM user and role details to check what permissions were granted, identifies new IAM users or roles that were created, checks EC2 instance information if compute resources were launched, and analyzes cost and usage patterns for unusual resource consumption. Instead of you querying each AWS service, the agent orchestrates this automatically.
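To give a sense of the manual correlation work being automated, here is a small Python sketch that builds a timeline for one access key from CloudTrail records. It is not how the investigative agent is implemented. It assumes records that follow the CloudTrail record format (userIdentity.accessKeyId, eventTime, eventSource, eventName, sourceIPAddress); the function name and sample records are hypothetical.

```python
from collections import Counter


def build_timeline(records: list[dict], access_key_id: str):
    """Return a time-ordered list of calls made with one access key, plus call counts."""
    hits = [
        r for r in records
        if r.get("userIdentity", {}).get("accessKeyId") == access_key_id
    ]
    # CloudTrail eventTime values are ISO 8601 UTC strings, so they sort lexicographically.
    hits.sort(key=lambda r: r["eventTime"])
    timeline = [
        (r["eventTime"], r["eventSource"], r["eventName"], r.get("sourceIPAddress"))
        for r in hits
    ]
    return timeline, Counter(r["eventName"] for r in hits)


# Hypothetical sample records for illustration only.
sample_records = [
    {"eventTime": "2025-01-01T12:05:00Z", "eventSource": "iam.amazonaws.com",
     "eventName": "CreateUser", "sourceIPAddress": "198.51.100.7",
     "userIdentity": {"accessKeyId": "AKIAIOSFODNN7EXAMPLE"}},
    {"eventTime": "2025-01-01T12:00:00Z", "eventSource": "sts.amazonaws.com",
     "eventName": "GetCallerIdentity", "sourceIPAddress": "198.51.100.7",
     "userIdentity": {"accessKeyId": "AKIAIOSFODNN7EXAMPLE"}},
    {"eventTime": "2025-01-01T12:01:00Z", "eventSource": "s3.amazonaws.com",
     "eventName": "ListBuckets", "sourceIPAddress": "203.0.113.10",
     "userIdentity": {"accessKeyId": "AKIAOTHEREXAMPLE"}},
]
timeline, counts = build_timeline(sample_records, "AKIAIOSFODNN7EXAMPLE")
for entry in timeline:
    print(entry)
```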

Within minutes, you get a summary, as shown in the following figure. The investigation summary includes a high-level summary and critical findings, which include the credential exposure pattern, observed activity and the timeframe, affected resources, and limiting factors.

This response was generated using AWS Generative AI capabilities. You are responsible for evaluating any recommendations in your specific context and implementing appropriate oversight and safeguards. Learn more about AWS Responsible AI requirements.

Note: The preceding example is representative output. Exact formatting will vary depending on findings.

The investigation summary includes various tabs for detailed information, such as technical findings with an events timeline, as shown in the following figure:

Figure 2 – Security event timeline


When seconds count, this transparency is paramount to a quick, high-fidelity, and accurate response—especially if you need to escalate to the AWS CIRT, a dedicated group of AWS security experts, or explain your findings to leadership, creating a single lens for stakeholders to view the incident.

When the investigation is complete, you have a high-resolution picture of what happened and can make informed decisions about containment, eradication, and recovery. For the preceding exposed credentials scenario, you might need to:

  • Delete the compromised access keys
  • Remove the newly created IAM role
  • Terminate the unauthorized EC2 instances
  • Review and revert associated IAM policy changes
  • Check for additional access keys created for other users

When you engage with the CIRT, they can provide additional guidance on containment strategies based on the evidence the agent gathered.

What this means for your security operations

The leaked credentials scenario shows what the agent can do for a single incident. But the bigger impact is on how you operate day-to-day:

  • You spend less time on evidence collection. The investigative agent automates the most time-consuming part of investigations—gathering and correlating evidence from multiple sources. Instead of spending an hour on manual log analysis, you can spend most of that time on making containment decisions and preventing recurrence.
  • You can investigate in plain language. The investigative agent uses natural language processing (NLP), so you can describe what you’re investigating in plain language, such as unusual API calls from IP address X or data access from a terminated employee’s credentials, and the agent translates that into the technical queries needed. You don’t need to be an expert in AWS log formats or know the exact syntax for querying CloudTrail.
  • You get a foundation for high-fidelity and accurate investigations. The investigative agent handles the initial investigation—gathering evidence, identifying patterns, and providing a comprehensive summary. If your case requires deeper analysis or you need guidance on complex scenarios, you can engage with the AWS CIRT, who can immediately build on the work the agent has already done, speeding up their response time. They see the same evidence and timeline, so they can focus on advanced threat analysis and containment strategies rather than starting from scratch.

Getting started

If you already have Security Incident Response enabled, the AI-powered investigation capabilities are available now—no additional configuration needed. Create your next security case and the agent will start working automatically.

If you’re new to Security Incident Response, here’s how to set it up:

  1. Enable Security Incident Response through your AWS Organizations management account. This takes a few minutes through the AWS Management Console and provides coverage across your accounts.
  2. Create a case. Describe what you’re investigating; you can do this through the Security Incident Response console or an API, or set up automatic case creation from Amazon GuardDuty or AWS Security Hub alerts.
  3. Review the analysis. The agent presents its findings through the Security Incident Response console, or you can access them through your existing ticketing systems such as Jira or ServiceNow.

The investigative agent uses the AWS Support service-linked role to gather information from your AWS resources. This role is automatically created when you set up your AWS account and provides the necessary access for Support tools to query CloudTrail events, IAM configurations, EC2 details, and cost data. Actions taken by the agent are logged in CloudTrail for full auditability.

The investigative agent is included at no additional cost with Security Incident Response, which now offers metered pricing with a free tier covering your first 10,000 findings ingested per month. Beyond that, findings are billed at rates that decrease with volume. With this consumption-based approach, you can scale your security incident response capabilities as your needs grow.

How it fits with existing tools

Security Incident Response cases can be created by customers or proactively by the service. The investigative agent is automatically triggered when a new case is created, and cases can be managed through the console, API, or Amazon EventBridge integrations.

You can use EventBridge to build automated workflows that route security events from GuardDuty, Security Hub, and Security Incident Response itself to create cases and initiate response plans, enabling end-to-end detection-to-investigation pipelines. Before the investigative agent begins its work, the service’s auto-triage system monitors and filters security findings from GuardDuty and third-party security tools through Security Hub. It uses customer-specific information, such as known IP addresses and IAM entities, to filter findings based on expected behavior, reducing alert volume while escalating alerts that require immediate attention. This means the investigative agent focuses on alerts that actually need investigation.

Conclusion

In this post, I showed you how the new investigative agent in AWS Security Incident Response automates evidence gathering and analysis, reducing the time required to investigate security events from hours to minutes. The agent asks clarifying questions to understand your specific concern, automatically queries multiple AWS data sources, correlates evidence, and presents you with a comprehensive timeline and summary while maintaining full transparency and auditability.

With the addition of the investigative agent, Security Incident Response customers now get the speed and efficiency of AI-powered automation, backed by the expertise and oversight of AWS security experts when needed.

The AI-powered investigation capabilities are available today in all commercial AWS Regions where Security Incident Response operates. To learn more about pricing and features, or to get started, visit the AWS Security Incident Response product page.

If you have feedback about this post, submit comments in the Comments section below.

Daniel Begimher


Daniel is a Senior Security Engineer in Global Services Security, specializing in cloud security, application security, and incident response. He co-leads the Application Security focus area within the AWS Security and Compliance Technical Field Community, holds all AWS certifications, and authored Automated Security Helper (ASH), an open source code scanning tool.

Introducing Cluster insights: Unified monitoring dashboard for Amazon OpenSearch Service clusters

Post Syndicated from Siddhant Gupta original https://aws.amazon.com/blogs/big-data/introducing-cluster-insights-unified-monitoring-dashboard-for-amazon-opensearch-service-clusters/

Amazon OpenSearch Service clusters offer a wealth of operational metrics accessible through CloudWatch and the Amazon OpenSearch Service console to support effective performance monitoring and alert creation. Yet, pinpointing resiliency and performance challenges within your cluster can prove daunting. The process of identifying resource-intensive queries or understanding performance degradation trends can be time-consuming.

To address these challenges, we launched Cluster insights, which presents a unified dashboard delivering curated insights along with actionable mitigation steps. The dashboard displays detailed metrics at the node, index, and shard levels, coupled with a concise summary of security and resiliency best practices to uphold peak resiliency and availability.

This post guides you through setting up and using Cluster insights, including its key features and metrics. By the end, you’ll understand how to use Cluster insights to recognize and address performance and resiliency issues within your OpenSearch Service clusters.

Getting Started with Cluster insights

Cluster insights is available at no additional cost to OpenSearch Service users running OpenSearch version 2.17 or later. Accessing Cluster insights requires admin-level permissions for your OpenSearch domain. Cluster insights is available only through the OpenSearch UI, which supports multiple data sources, zero-downtime upgrades for your dashboard experience, and curated workspaces for effective team collaboration. You first need to associate a data source (your clusters) with an OpenSearch UI application. Detailed steps are described in the user guide. Your OpenSearch UI console experience will look like the following screenshots.

To access Cluster insights using the OpenSearch UI application:

  1. In the Amazon OpenSearch Service console, navigate to OpenSearch UI (Dashboards) and choose the Application URL to access your OpenSearch UI application.
  2. In the OpenSearch UI application, choose the settings icon in the bottom-left corner, then choose Data administration.
  3. On the Data administration overview page, or under Manage data in the left navigation, select Cluster insights.

Cluster insights overview

The Cluster insights – Overview acts as a landing page to show health and insights for all connected OpenSearch domains. It is organized into five sections:

  1. Current cluster status – Displays cluster health status (Green, Yellow, and Red) in a donut chart.
  2. Insights trend – Tracks issue patterns over the past 30 days, helping you identify emerging problems and track resolution progress. This trend analysis becomes particularly valuable when monitoring the impact of operational changes or troubleshooting recurring issues.
  3. Current open insights – Shows the count and severity breakdown of currently active insights across your clusters.
  4. OpenSearch Service clusters – Lists all domains with their vital statistics such as health status, insights count, nodes, shards, and active queries.
  5. Top insights by severity – Prioritizes issues that need immediate attention. Each insight comes with a clear description and specific recommendations, transforming complex monitoring data into actionable tasks. This prioritized view helps teams focus on critical issues first, whether they’re addressing shard size problems, disk space issues, or performance bottlenecks.

Together, these sections provide a comprehensive view of your OpenSearch Service infrastructure so you can assess cluster health, identify trends, and take action on critical issues from a single dashboard.

Cluster health

When you choose a specific cluster from the OpenSearch domains on the Cluster insights – Overview page, you will see cluster-specific details including health status, active insights, and performance metrics. The overview section displays cluster health along with essential metrics including the count of shards, nodes, and indices, and the total document size. You can also review the configuration best practices followed by the domain across resiliency and security areas.

The lower section contains a table of actionable insights that presents a detailed view of current issues. This table mirrors the insights from the landing page but focuses specifically on issues affecting the selected cluster. You can observe high-severity issues such as low disk space and shard count problems, as well as medium-severity concerns that may impact cluster performance.

Each insight entry serves as an interactive element – selecting any issue reveals an in-depth analysis complete with root cause identification and specific remediation steps. The table includes important metadata such as generation timestamps, severity levels, recommendation counts, and current status, so users can prioritize and address issues effectively.

Insight details

Every insight offers detailed analysis and actionable recommendations. Take the Shard Count insight as an example: selecting it reveals a comprehensive breakdown of the issue. You’ll see that your OpenSearch cluster has breached the number of shards allowed on the nodes based on its JVM heap size, along with a detailed list of affected resources.

The detailed view includes a resource map that precisely identifies each impacted node and index, displaying critical information such as node IDs, shard counts, and the indices contributing to the issue.

The recommendations are organized into two levels: cluster-level recommendations address overall architecture improvements, such as scaling your cluster or adjusting global shard allocation settings. Index-level recommendations provide specific actions for individual indices—for example, you might see suggestions to move idle shards to UltraWarm storage. These are shards without any search or indexing operations for the last 10 days and are at least 5 days old, making them ideal candidates for warm storage to reduce the active shard count. All of this guidance is available directly within the Cluster insights interface, eliminating the need to switch between different tools or consoles.
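The idle-shard criteria described above can be expressed as a simple filter. The following Python sketch is an illustration only: the record fields (shard, created, last_activity) and the function name are hypothetical and do not correspond to an OpenSearch API, and the thresholds are taken directly from the description.

```python
from datetime import datetime, timedelta, timezone


def idle_shard_candidates(shards, now,
                          idle_for=timedelta(days=10),
                          min_age=timedelta(days=5)):
    """Return shards with no search or indexing activity for `idle_for`
    that are also at least `min_age` old."""
    return [
        s["shard"]
        for s in shards
        if now - s["last_activity"] >= idle_for and now - s["created"] >= min_age
    ]


# Hypothetical sample data for illustration only.
now = datetime(2025, 6, 20, tzinfo=timezone.utc)
sample_shards = [
    {"shard": "logs-2025.05[0]",
     "created": datetime(2025, 5, 1, tzinfo=timezone.utc),
     "last_activity": datetime(2025, 6, 1, tzinfo=timezone.utc)},
    {"shard": "logs-2025.06[0]",
     "created": datetime(2025, 6, 1, tzinfo=timezone.utc),
     "last_activity": datetime(2025, 6, 15, tzinfo=timezone.utc)},
]
print(idle_shard_candidates(sample_shards, now))
```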

Node, Index, Shard, and Query view

Next to cluster health, you can review Node, Index, Shard, and Query details for a specific cluster. These views present critical metrics such as resource (CPU, memory, disk) utilization, search and index latency.

Node view

The Node view tab provides a comprehensive view of individual node performance across your cluster. This table displays critical metrics for each node including heat score indicating overall node health, resource utilization (CPU, memory, disk), search and indexing latency and rates, along with quick links to view top N shards and queries running on each node.

This view helps you identify nodes experiencing high resource utilization or performance degradation. You can drill deeper into each node by clicking on the node ID to view detailed time-based metrics showing resource usage trends over time. Additionally, you can click the top N shards link to navigate directly to the Shard View, automatically filtered to show only the shards running on the selected node, allowing you to pinpoint which specific shards are contributing to performance issues.

Index view

The Index view tab shows performance metrics aggregated at the index level. For each index, you can monitor document count and storage size, search latency and rate, indexing latency and rate, and access top N queries affecting the index. This perspective is valuable for understanding which indices are driving cluster load and identifying optimization opportunities at the index configuration level.

Shard view

The Shard view tab offers the most granular view of cluster performance by displaying metrics for individual shards. Each row shows shard ID and its assigned node, index association and resource pressure metrics (CPU, memory), along with search and indexing latency per shard. This detailed view enables you to pinpoint specific shards causing performance issues, identify shard placement imbalances, and take targeted remediation actions.

Query view

The Query view on the Cluster insights page presents live dashboards that break down execution stats, CPU and memory usage, and completion progress for every query. This helps you monitor which queries are driving the biggest resource consumption (the Top-N queries). With intuitive donut charts and scoreboards showing distribution by node, index, and user, this interface helps operators quickly pinpoint performance bottlenecks and heavy workloads, supporting targeted optimization and confident scaling decisions.
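Conceptually, the Top-N ranking and the per-node, per-index, and per-user breakdowns are a sort and a group-by over per-query measurements. The following Python sketch shows that idea with hypothetical field names (id, node, index, user, cpu_nanos); it does not reflect the actual data model behind the dashboard.

```python
import heapq
from collections import defaultdict


def top_n_queries(queries, n=3, metric="cpu_nanos"):
    """Return the n queries with the highest value for `metric`."""
    return heapq.nlargest(n, queries, key=lambda q: q[metric])


def usage_by(queries, dimension, metric="cpu_nanos"):
    """Sum `metric` per value of `dimension` (for example node, index, or user)."""
    totals = defaultdict(int)
    for q in queries:
        totals[q[dimension]] += q[metric]
    return dict(totals)


# Hypothetical sample data for illustration only.
sample_queries = [
    {"id": "q1", "node": "node-a", "index": "logs", "user": "alice", "cpu_nanos": 500},
    {"id": "q2", "node": "node-b", "index": "logs", "user": "bob", "cpu_nanos": 900},
    {"id": "q3", "node": "node-a", "index": "metrics", "user": "alice", "cpu_nanos": 100},
]
print([q["id"] for q in top_n_queries(sample_queries, n=2)])
print(usage_by(sample_queries, "node"))
```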

Query insights

In addition to Cluster insights, you can use Query insights to view the exact queries running and their latencies across the Expand, Query, and Fetch phases, which helps search developers further fine-tune their queries.

Conclusion

Cluster insights transforms OpenSearch Service cluster management from reactive troubleshooting to proactive optimization. By providing unified dashboards with a heat score and best practices across the stability, resiliency, and security pillars, it offers visibility into your search infrastructure at the account level.

The actionable recommendations and step-by-step remediation guidance help users of all experience levels effectively resolve complex issues like shard imbalances and resource bottlenecks.

The integration with Query insights delivers real-time visibility into resource consumption patterns so that teams can identify and optimize performance-critical queries through detailed profiling and latency analysis.

For more information, see the Amazon OpenSearch Service User Guide.


About the authors

Siddhant Gupta

Siddhant is a Senior Product Manager (Technical) at AWS, leading AI innovation for OpenSearch. He focuses on democratizing advanced AI capabilities, making them accessible and practical for customers regardless of their technical expertise. His work centers on seamlessly integrating cutting-edge AI technologies into scalable, user-friendly solutions.

Varunsrivathsa Venkatesha

Varunsrivathsa is a Software Development Manager at AWS, leading the Intelligent Domain Management team. He focuses on monitoring and recovery services for Amazon OpenSearch Service and on leveraging these services to provide a seamless domain management experience for customers.

Gagandeep Juneja

Gagandeep is a senior software development engineer at AWS working on OpenSearch.

Jinhwan Hyon

Jinhwan is a Specialist Solutions Architect at AWS focused on Amazon OpenSearch Service, based in Seoul, South Korea. His interests center on data and analytics, with a passion for helping customers integrate AI into their data strategies. He’s particularly fascinated by generative AI and intelligent agents, exploring how these technologies can revolutionize decision-making and solve complex business challenges.

The Agentic AI Security Scoping Matrix: A framework for securing autonomous AI systems

Post Syndicated from Aaron Brown original https://aws.amazon.com/blogs/security/the-agentic-ai-security-scoping-matrix-a-framework-for-securing-autonomous-ai-systems/

As generative AI became mainstream, Amazon Web Services (AWS) launched the Generative AI Security Scoping Matrix to help organizations understand and address the unique security challenges of foundation model (FM)-based applications. This framework has been adopted not only by AWS customers across the globe, but also widely referenced by organizations such as OWASP, CoSAI, and other industry standards bodies, partners, systems integrators (SIs), analysts, auditors, and more. Now, as long-running, function-calling agentic AI systems emerge with capabilities for autonomous decision-making, we’re creating an additional framework to address an entirely new set of security challenges.

Agentic AI systems can autonomously execute multi-step tasks, make decisions, and interact with infrastructure and data. This is a paradigm shift, and organizations must adapt to it. Unlike traditional FMs that operate in stateless request-response patterns, agentic AI systems introduce autonomous capabilities, persistent memory, tool orchestration, identity and agency challenges, and external system integration, expanding the risks that organizations must address.

Working with customers deploying these systems, we’ve observed that traditional AI security frameworks don’t always extend into the agentic space. The autonomous nature of agentic systems requires fundamentally different security approaches. To address this gap, we’ve developed the Agentic AI Security Scoping Matrix, a mental model and framework that categorizes four distinct agentic architectures based on connectivity and autonomy levels, mapping critical security controls across each.

Understanding the agentic paradigm shift

FM-powered applications operate in a now well-understood, predictable pattern, even though the responses that an FM produces are non-deterministic and stateless. These applications, in their most basic form, receive a prompt or instruction, generate a response, then terminate the session. Security and safety controls focus on basic measures such as input validation, output filtering, and content moderation guardrails, while governance focuses on the overall risk profiles and the resilience of models. This model works because security failures have limited scope: a compromised interaction affects only that specific request and response, without persisting or propagating to other systems or users.

Agentic AI systems fundamentally change this security model through several key capabilities:

Autonomous execution and agency: Agents initiate actions based on goals and environmental triggers that might, or might not, require human prompts or approval. This creates risks of unauthorized actions, runaway processes, and decisions that exceed intended boundaries when agents misinterpret objectives or operate on compromised instructions.

When AI agents are given instructions or permissions to act based on the data, parameters, instructions, and responses given to them, the boundaries of independence or autonomy they are permitted to act within are important to define. In discussing agentic AI systems, it’s important to clarify the distinction between agency and autonomy, because these related but different concepts inform our security approach.

Agency refers to the scope of actions an AI system is permitted and enabled to take within the operating environment, and how much a human bounds an agent’s actions or capabilities. This includes what systems it can interact with, what operations it can perform, and what resources it can modify. Agency is fundamentally about capabilities and permissions—what the system is allowed to do within its operational environment. For example, an AI agent with no agency would be guided by human-defined workflow, process, tools, or orchestration compared to an AI agent with full agency that can self-determine how to accomplish a human-defined goal.

Autonomy, in contrast, refers to the degree of independent decision-making and action the system can take without human intervention. This includes when it operates, how it chooses between available actions, and whether it requires human approval for execution. Autonomy is about independence in decision-making and execution—how freely the system can act within its granted agency. For example, an AI agent might have high agency (able to perform many actions) but low autonomy (requiring human approval for each action), or vice versa.

Understanding this distinction is crucial for implementing appropriate security controls. Agency requires boundaries and permission systems, while autonomy requires oversight mechanisms and behavioral controls. Both dimensions must be carefully managed to create secure agentic AI systems.
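
As a minimal sketch of this distinction, agency can be modeled as the set of actions an agent is permitted to take, and autonomy as the subset of those actions that may run without human approval. The class, field, and action names below are hypothetical and are not part of any AWS API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AgentPolicy:
    """Hypothetical policy object; all names are illustrative only."""
    # Agency: the actions the agent is permitted to take at all.
    allowed_actions: frozenset = field(default_factory=frozenset)
    # Autonomy: the subset of allowed actions that need no human approval.
    unattended_actions: frozenset = field(default_factory=frozenset)

    def decide(self, action: str) -> str:
        """Return 'deny', 'needs_approval', or 'allow' for an action."""
        if action not in self.allowed_actions:
            return "deny"  # outside the agent's agency
        if action not in self.unattended_actions:
            return "needs_approval"  # within agency, but not autonomous
        return "allow"


# High agency, low autonomy: several actions, but only reads run unattended.
policy = AgentPolicy(
    allowed_actions=frozenset({"read_calendar", "create_invite", "cancel_invite"}),
    unattended_actions=frozenset({"read_calendar"}),
)
```

Keeping the two sets separate means you can widen what an agent is able to do without also widening what it may do unattended.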

It’s important to determine how much agency and autonomy you want to grant your AI agents. After you have determined the appropriate level for a given agent, you can evaluate the security controls needed to restrict that agency to a risk tolerance that is acceptable for your agentic application and your organization.

Persistent memory: Agents often benefit from maintaining context and learned behaviors across sessions, building knowledge bases that inform future decisions in the form of short- and long-term memory. This data persistence introduces additional data protection requirements and can add new risk vectors such as memory poisoning attacks where adversaries inject false information that corrupts decision-making across multiple interactions and users.

Tool orchestration: Agents directly integrate via functions with connections to databases, APIs, services, and potentially other agents or orchestration components to execute complex tasks autonomously depending on the tool abstraction level. This expanded attack surface creates risks of cascading compromises where a single agent breach can propagate through connected systems, multi-agent workflows, and downstream services and data stores.

External connectivity: Agents operate across network boundaries, accessing internet resources, third-party APIs, and enterprise systems. Like traditional non-agentic systems, expanded connectivity can help unlock new business value, but this access should be designed with security controls that limit risks such as data exfiltration, lateral movement, and external manipulation. Threat modeling your agentic AI applications should be a high priority and can help directly align security controls that assist your implementation of zero-trust principles into your strategy.

Self-directed behavior: Advanced agents can initiate activities based on environmental monitoring, scheduling, or learned patterns without human instantiation or review, depending on configuration. This self-direction introduces the risk of uncontrolled operations, complicates explainability and auditability, and makes it difficult to maintain predictable security boundaries.

These capabilities transform security from a boundary problem to a continuous monitoring and control challenge. A compromised agent doesn’t just leak information—it could autonomously execute unauthorized transactions, modify critical infrastructure, or operate maliciously for extended periods without detection.

The Agentic AI Security Scoping Matrix

Working with customers and the community, we’ve identified four architectural scopes that represent the evolution of agentic AI systems based on two critical dimensions: level of human oversight compared with autonomy and the level of agency the AI system is permitted to act within. Each scope introduces new capabilities—and corresponding security requirements—that organizations must prioritize when addressing agentic AI risk. Figure 1 shows the Agentic AI Security Scoping Matrix.

Figure 1 – The Agentic AI Security Scoping Matrix

Scope 1: No agency

In this most basic scope, systems operate with human-initiated processes and no autonomous or even human-approved change capabilities through the agent itself. The agents are, essentially, read-only. These systems follow predefined execution paths and operate under strict human-triggered workflows, which are usually predefined and follow discrete steps, but could be augmented with non-deterministic outputs from an FM. Security focuses primarily on process integrity and boundary enforcement: operations stay within predetermined limits, and agents are tightly controlled and prohibited from executing changes or taking unbounded actions.

Key characteristics:

  • Agents can’t directly execute change in the environment
  • Fixed step-by-step execution following predetermined paths
  • Generative AI components process data within individual workflow nodes
  • Conditional branching only where explicitly designed into the workflow
  • No dynamic planning or autonomous goal-seeking behavior
  • State persistence limited to workflow execution context
  • Tool access restricted to specific predefined workflow steps

Security focus: Protecting data integrity within the environment and keeping agents within their boundaries, especially the limits on environment and data modification. Primary concerns include securing state transitions between steps, validating data passed between workflow nodes, and preventing AI components from modifying the orchestration logic or escaping their designated boundaries within the workflow.

Example: We will use a very simplistic example, across all four scopes, of a use case for an agent that is designed to help you create calendar invites. Let’s say you need to book a meeting with another colleague. In Scope 1, you might have an agent that you instantiate through a workflow or prompt to look at your calendar and your colleague’s calendar for available meeting times. In this case, you initiate the request, and the agent executes a contextual search using a Model Context Protocol (MCP) server connected to your enterprise calendaring application. The agent is only allowed to look at available times, analyze the best times to meet, and provide a response back, which a human can then use to manually set up a meeting. In this example, the human defines specific workflows and orchestrations (no agency) and reviews and approves the actions taken (no autonomous change).

Scope 2: Prescribed agency

Moving up in agency and risk, Scope 2 systems are also instantiated by a human, but can now perform actions (limited agency) that could change the environment. However, every action of consequence requires explicit human approval, commonly referred to as human in the loop (HITL). These systems can gather information, analyze data, and prepare recommendations, but cannot execute actions that modify external systems or access sensitive resources without human authorization. Agents can also request human input to clarify ambiguities, provide missing context, or optimize their approach before presenting recommendations.

Key characteristics:

  • Agents can execute change in the environment with human review and approval
  • Real-time human oversight with approval workflows
  • Bidirectional human interaction—agents can query humans for context
  • Limited autonomous actions restricted to read-only operations (such as querying data or running analysis jobs)
  • Agent-initiated requests for clarification or additional information
  • Audit trails of all human approval decisions and context exchanges

Security focus: Implementing robust approval workflows and preventing agents from bypassing human authorization controls. Key concerns include preventing privilege escalation, enforcing appropriate identity contexts, securing the approval process itself, validating human-provided context to prevent injection attacks, and maintaining visibility into all agent recommendations and their rationale.

Example: In our calendaring example, a Scope 2 agentic system is instantiated by a human. The agent then looks up the stakeholders’ calendar availability, does its analysis, returns a recommendation for a meeting time to the user, and asks the user if they want the agent to send the invitation out on their behalf. The user looks at the response and recommendation of the agent, validates that it meets their requirements, and then acknowledges and approves the agent’s request to modify the calendars and send the invitation. In this example, the human orchestrates a structured workflow, but the agent can now initiate human-reviewed change through bounded actions (limited agency and limited autonomy).
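
The approval step in this example can be sketched as a simple gate. This is an illustration under stated assumptions, not a reference implementation: the action names are hypothetical, and the approve callback stands in for whatever approval workflow you actually use.

```python
from typing import Callable

# Hypothetical action names for the calendaring example.
READ_ONLY_ACTIONS = {"find_free_slots"}
CHANGE_ACTIONS = {"send_invite"}


def run_action(action: str, approve: Callable[[str], bool]) -> str:
    """Run read-only actions directly; ask a human before any change."""
    if action in READ_ONLY_ACTIONS:
        return f"executed {action}"
    if action in CHANGE_ACTIONS:
        # Human in the loop: the callback stands in for a real approval
        # workflow (for example, a prompt shown to the requesting user).
        if approve(action):
            return f"executed {action} (approved)"
        return f"blocked {action} (not approved)"
    # Anything not listed is outside the agent's prescribed agency.
    return f"denied {action} (outside prescribed agency)"
```

Calling run_action("send_invite", approve) only proceeds when the callback returns True, and unlisted actions are denied even if the human would have approved them.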

Scope 3: Supervised agency

In Scope 3, we expand the agency to allow greater agentic autonomy (high agency) in execution. These are AI systems that execute complex autonomous tasks that are initiated by humans (or at least from an upstream human-managed workflow), with the ability to make decisions and take actions on connected systems without further approval or HITL mechanisms. Humans define the objectives and trigger execution, but agents operate independently to achieve goals through dynamic planning and tool usage. During execution, agents can request human guidance to optimize trajectory or handle edge cases, though they can continue operating without it.

Key characteristics:

  • Agents can execute change in the environment, with no (or optional) human interaction or review
  • Human-triggered execution with autonomous task completion
  • Dynamic planning and decision-making during execution
  • Optional human intervention points for trajectory optimization
  • Human ability to adjust parameters or provide context mid-execution
  • Direct access to external APIs and systems for task completion
  • Persistent memory across extended execution sessions
  • Autonomous tool selection and orchestration within defined boundaries

Security focus: Implementing comprehensive monitoring of agent actions during autonomous execution phases and establishing clear agency boundaries for agent operations—the bounds you’re willing to let the agents operate within, and actions that would be out of bounds and must be prevented. Critical concerns include securing the human intervention channel to prevent unauthorized modifications, preventing scope creep during task execution, implementing trusted identity propagation constructs, monitoring for behavioral anomalies, and validating that agents remain aligned with original human intent throughout extended operations even when trajectory adjustments are made.

Example: In our calendaring example, a Scope 3 agentic system can still be instantiated by a human. The agent then looks up the stakeholders’ calendar availability, does its analysis, and returns a recommendation for a meeting time to the user; however, it’s within the agent’s bounds to act upon its own recommendation on behalf of the user to automatically book the best available slot. The user is not prompted or expected to give the agent permission to do so prior to its actions. The result is that all stakeholders have a calendar entry added to their calendar in the context of the calling human user. In this example, the human defines an outcome but with more freedom for the agent to determine how to achieve that goal, and the agent now can take autonomous action without human review (high agency and high autonomy).

Scope 4: Full agency

Scope 4 includes fully autonomous AI systems that can initiate their own activities based on environmental monitoring, learned patterns, or predefined conditions, and execute complex tasks without human intervention. These systems represent the highest level of AI agency, operating continuously and making independent decisions about when and how to act. It’s key to note that AI systems within Scope 4 could have full agency when executing within their designed bounds; therefore, it’s critical that humans maintain supervisory oversight with the ability to provide strategic guidance, course corrections, or interventions when needed. Continuous compliance, auditing, and full-lifecycle management mechanisms, including both human and automated reviews (which could also be aided by AI), are critical to successfully securing and governing Scope 4 agentic AI systems while limiting risk.

Key characteristics:

  • Self-directed activity initiation based on environmental triggers
  • Continuous operation with minimal human oversight or HITL processes during execution
  • Human ability to inject strategic guidance without disrupting operations
  • High to full degrees of autonomy in goal setting, planning, and execution
  • Dynamic interaction with multiple external systems and agents
  • Capability for recursive self-improvement and capability expansion

Security focus: Implementing advanced guardrails for behavioral monitoring, anomaly detection, scope-based tool access controls, and fail-safe mechanisms to prevent runaway operations. Primary concerns include maintaining alignment with organizational objectives, securing human intervention channels against adversarial manipulation, preventing unauthorized capability expansion, preventing human oversight mechanisms from being disabled by the agent, and enabling graceful degradation when agents encounter unexpected situations.

Example: Let’s look at how we might deploy our AI calendaring example in Scope 4. Let’s say you have implemented a generative AI meeting summarizer. This agent is automatically enabled when you host a web conference. At the conclusion of the meeting, the calendaring agent learns from the meeting summarizer agent that a new meeting occurred. It looks at the action items that were summarized and determines that six people agreed to a whiteboard session on Friday. The calendaring agent might either have a statically defined API configuration or leverage dynamic discovery on MCP servers to help with calendaring. It then finds availability for the six identified people and, using the appropriate identity context of the user who is asking for the meeting, autonomously books the best available slot. At no point does a user directly instantiate the request for calendaring; it is fully automated and driven off environment changes that the agent is instructed to look for (full agency and full autonomy).

Comparison across the scopes

In the context of the security scoping matrix, let’s compare how autonomy and agency characteristics shift depending on the scope:

Scope 1: No agency
  • Agency level: None. Read-only operations; fixed workflow paths.
  • Autonomy level: None. Human-initiated only; predefined execution steps.

Scope 2: Prescribed agency
  • Agency level: Limited. Can modify systems; access to multiple tools.
  • Autonomy level: Limited. Requires human approval for actions; HITL for all changes.

Scope 3: Supervised agency
  • Agency level: High. Can modify multiple systems; dynamic tool selection.
  • Autonomy level: High. Autonomous execution after human initiation; optional human guidance.

Scope 4: Full agency
  • Agency level: Full. Comprehensive system access; multi-system orchestration; self-adaptive.
  • Autonomy level: Full. Self-initiated actions; continuous autonomous operation; strategic human oversight.

Table 1 – Scope impacts on agency and autonomy levels

Critical security dimensions

Each architectural scope requires specific security controls and considerations across six critical dimensions. Table 2 illustrates how security requirements escalate with increasing agency and autonomy:

Security dimension

Scope 1: No agency

Scope 2: Prescribed agency

Scope 3: Supervised agency

Scope 4: Full agency

Identity context (authN and authZ)

  • User authentication
  • Service authentication
  • Limited system permissions (read-only)
  • Limited system access (only necessary, known systems needed for the workflow)
  • User authentication
  • Service authentication
  • Human identity verification for approvals
  • User authentication
  • Service authentication
  • Agent authentication
  • Identity delegation for autonomous actions
  • Dynamic identity lifecycle
  • Federated authentication
  • Continuous identity verification
  • Agent identity attestation

Data, memory, and state protection

  • Local resource permissions
  • File system access controls
  • Role-based access control
  • Human approval workflows
  • Read-mostly permissions for agents
  • Context-aware authorization
  • Just-in-time privilege elevation
  • Dynamic permission boundaries
  • Behavioral authorization
  • Adaptive access controls
  • Continuous authorization validation

Audit and logging

  • Local activity logs
  • Change tracking
  • Integrity monitoring
  • Policy enforcement
  • Human decision audit trails
  • Agent recommendation logging
  • Approval process tracking
  • Comprehensive action logging
  • Reasoning chain capture
  • Extended session tracking
  • Continuous behavioral logging
  • Pattern analysis
  • Predictive monitoring
  • Automated incident correlation

Agent and FM controls

  • Process isolation
  • Input/output validations
  • Guardrails
  • Approval gateway enforcement
  • Extended session monitoring
  • Container isolation
  • Long-running process management
  • Tool invocation sandboxing
  • Behavioral analysis
  • Anomaly detection
  • Automated containment
  • Self-healing security

Agency perimeters and policies

  • Fixed execution boundaries
  • Predefined action limits
  • Static resource quotas
  • Hard-coded constraints
  • Approval-based boundary modification
  • Human-validated constraint changes
  • Time-bound elevated access
  • Multi-step validation
  • Dynamic boundary adjustment
  • Runtime constraint evaluation
  • Resource scaling limits
  • Automated safety checks
  • Self-adjusting boundaries
  • Context-aware constraints
  • Cross-system resource management
  • Autonomous limit adaptation

Orchestration

  • Simple workflow orchestration
  • Fixed execution paths
  • Single or limited system integration points
  • Multi-step workflow orchestration
  • Approval-gated tool access
  • Human-validated tool chains
  • Dynamic tool orchestration
  • Parallel execution paths
  • Cross-system integration
  • Autonomous multi-agent orchestration
  • Cross-session learning
  • Dynamic service discovery

Table 2 — Critical security dimensions per scope

Security implementation by scope

Now that we’ve outlined each of the scopes and the associated levels of agency and autonomy, let’s discuss some primary security challenges per scope and key considerations that should be taken to address the associated risks.

Scope 1: No agency
Primary security challenges: Protecting workflow integrity, preventing prompt injection from breaking predetermined flows, and maintaining isolation between workflow executions.

Implementation considerations:

  • Comprehensive monitoring with anomaly detection
  • Strict data validation and integrity checking
  • Input validation at each workflow step boundary
  • Immutable workflow definitions with version control
  • State encryption and validation between workflow nodes
  • Monitoring for attempts to escape workflow boundaries
  • Segregation between different workflow executions
  • Fixed timeout and resource limits per workflow step
  • Audit trails showing actual compared to expected execution paths
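
Two of these considerations, immutable workflow definitions and audit trails that compare actual with expected execution paths, can be sketched as follows. The workflow contents and function names are hypothetical, and a real system would store the pinned fingerprint somewhere the agent cannot write to.

```python
import hashlib
import json

# Hypothetical workflow definition for the calendaring example.
WORKFLOW = {"name": "find_meeting_time", "steps": ["read_calendars", "rank_slots", "reply"]}


def fingerprint(workflow: dict) -> str:
    """Hash a canonical JSON form so any change to the definition is detectable."""
    canonical = json.dumps(workflow, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


# Recorded when this workflow version is reviewed and approved.
PINNED = fingerprint(WORKFLOW)


def audit_run(workflow: dict, actual_steps: list) -> list:
    """Return a list of findings; an empty list means the run matched expectations."""
    findings = []
    if fingerprint(workflow) != PINNED:
        findings.append("workflow definition changed")
    if actual_steps != workflow["steps"]:
        findings.append(f"path deviation: expected {workflow['steps']}, got {actual_steps}")
    return findings
```

A run that follows the approved steps produces no findings, while an altered definition or an unexpected step sequence is reported for review.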

Scope 2: Prescribed agency
Primary security challenges: Securing approval workflows, preventing human authorization bypass, and maintaining oversight effectiveness.

Implementation considerations:

  • Multi-factor authentication for all human approvers
  • Cryptographically signed approval decisions
  • Securing bidirectional human-agent communication channels
  • Time-bounded approval tokens with automatic expiration
  • Comprehensive logging of all approval interactions
  • Regular training for human approvers on agent capabilities and risks
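
Signed, time-bounded approvals can be sketched with Python’s standard library. This example uses an HMAC as a simple stand-in for a signature; it is an assumption made for illustration, and a production system would more likely use asymmetric signatures and a managed key. The key, token layout, and function names are hypothetical, and the layout assumes that approver and action names contain no "|" character.

```python
import hashlib
import hmac
import time
from typing import Optional

SECRET_KEY = b"demo-only-key"  # assumption: real systems would use a managed key


def issue_approval(approver: str, action: str, ttl_seconds: int,
                   now: Optional[float] = None) -> str:
    """Return 'approver|action|expiry|mac' for an approved action."""
    now = time.time() if now is None else now
    expiry = int(now) + ttl_seconds
    payload = f"{approver}|{action}|{expiry}"
    mac = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}|{mac}"


def verify_approval(token: str, action: str, now: Optional[float] = None) -> bool:
    """Accept the token only if it is untampered, unexpired, and for this action."""
    now = time.time() if now is None else now
    try:
        approver, token_action, expiry, mac = token.split("|")
    except ValueError:
        return False  # malformed token
    payload = f"{approver}|{token_action}|{expiry}"
    expected = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(mac, expected):
        return False  # payload or MAC was altered
    # The expiry is covered by the MAC, so it is safe to parse at this point.
    return token_action == action and now < int(expiry)
```

Because the action and expiry are bound into the signed payload, an approval for one action cannot be replayed for another, and an old approval stops working after its time window closes.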

Scope 3: Supervised agency
Primary security challenges: Maintaining control during autonomous execution, scope management, explainability and auditability, and behavioral monitoring.

Implementation considerations:

  • Clear execution boundaries defined at initiation
  • Real-time monitoring of agent actions during execution
  • Automated kill switches for runaway processes
  • Non-blocking intervention mechanisms
  • Behavioral baselines for normal agent operations
  • Regular validation of agent alignment with original objectives
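
As a rough sketch of execution boundaries combined with an automated kill switch, the following monitor halts an agent when it calls a tool outside its boundary or exceeds an action budget. The class name, tool names, and thresholds are hypothetical, and a real monitor would run outside the agent’s own process so the agent cannot disable it.

```python
class KillSwitchTripped(Exception):
    """Raised when the agent must stop immediately."""


class ExecutionMonitor:
    """Hypothetical monitor; thresholds and names are illustrative."""

    def __init__(self, allowed_tools: set, max_actions: int):
        self.allowed_tools = allowed_tools  # boundary defined at initiation
        self.max_actions = max_actions      # simple runaway-process budget
        self.log = []                       # record of every attempted action
        self.halted = False

    def record(self, tool: str) -> None:
        """Log an action; trip the kill switch on a boundary or budget breach."""
        if self.halted:
            raise KillSwitchTripped("agent already halted")
        self.log.append(tool)
        if tool not in self.allowed_tools:
            self.halted = True
            raise KillSwitchTripped(f"tool outside boundary: {tool}")
        if len(self.log) > self.max_actions:
            self.halted = True
            raise KillSwitchTripped("action budget exceeded")
```

Once tripped, the monitor stays halted, so resuming the agent requires a deliberate decision to create a new monitor.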

Scope 4: Full agency
Primary security challenges: Continuous behavioral validation, enforcing agency boundaries, preventing capability drift, and maintaining organizational alignment.

Implementation considerations:

  • Advanced AI safety techniques including reward modeling
  • Continuous monitoring with machine learning-based anomaly detection
  • Automated response systems for behavioral deviations
  • Regular alignment validation through systematic testing
  • Tamper-proof human override mechanisms
  • Failsafe mechanisms that can halt operations when confidence drops

Key architectural patterns

Successful agentic deployments share common patterns that balance autonomy with control.

Progressive autonomy deployment: Start with Scope 1 or 2 implementations and gradually advance through the scopes as organizational confidence and security capabilities mature. This approach minimizes risk while building operational experience. Be cautious and selective when analyzing use cases and bounding controls for Scope 4 implementations: confirm that you can address the risks at the lower scopes, and understand how those risks increase as you move up the scopes.

Layered security architecture: Implement defense-in-depth with security controls at multiple levels (network, application, agent, and data layers) so that a compromise at one level doesn’t lead to complete system failure. Although the combination of these controls is what enables a high security bar, spend considerable effort on making sure that identity and authorization concerns are addressed for both machines and humans. This helps prevent issues such as the confused deputy problem, in which a human or service with lesser permissions is able to elevate permissions through agents that might themselves have more entitlements and privileges.
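
One common way to reduce confused deputy risk is to authorize each action against the intersection of the agent’s permissions and the calling identity’s permissions. The sketch below illustrates that idea only; the permission names and functions are hypothetical.

```python
def effective_permissions(agent_perms: set, caller_perms: set) -> set:
    """An action is allowed only if both the agent and its caller hold it."""
    return agent_perms & caller_perms


def authorize(action: str, agent_perms: set, caller_perms: set) -> bool:
    """Check an action against the intersection, never the agent's rights alone."""
    return action in effective_permissions(agent_perms, caller_perms)


# Hypothetical permission names: the agent is broadly entitled, the caller is not.
AGENT = {"calendar:read", "calendar:write", "hr:read"}
INTERN = {"calendar:read"}
```

With this check, a caller who holds only calendar:read cannot use the agent’s broader entitlements to reach hr:read or calendar:write.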

Continuous validation loops: Establish automated systems that continuously verify agent behavior against expected patterns, and that have escalation procedures for when deviations are detected. Auditability and explainability are key requirements to confirm that agents are performing within the bounds intended and to help you determine control effectiveness, adjust parameters, and validate your orchestration workflows.

Human oversight integration: Even in highly autonomous systems, maintain meaningful human oversight through strategic checkpoints, behavioral reporting, and manual override capabilities. It might be reasonable to assume that human oversight decreases when moving from Scope 1 to Scope 4 agency, but the truth is that it simply shifts focus. For example, the human requirement to instantiate, review, and approve certain agentic actions is higher in Scopes 1 and 2 and lower in Scopes 3 and 4; however, the human requirement to audit, assess, validate, and implement more complex security and operational controls is much higher in Scopes 3 and 4 than it is in Scopes 1 and 2.

Graceful degradation: Design systems to automatically reduce autonomy levels when security events are detected, allowing operations to continue safely while human operators investigate. If your agents start to act in ways that go beyond the intended bounds of their design, anomalous behavior is detected, or they begin to perform actions deemed particularly risky or sensitive to your business, then consider having detective controls that automatically inject tighter restrictions, such as requiring more HITL or reducing the actions an agent can take. You might do this as incremental degradation, or you might choose to disable the agent entirely if it’s acting in ways that negatively impact the environment. Consider these agentic safety mechanisms, which can implement additional restrictions or even disable an agent, when building or deploying agents.
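
Incremental degradation can be sketched as stepping an agent down through a small set of operating modes. The modes, severity labels, and step sizes below are assumptions chosen for illustration; your own levels would map to the controls you have in place.

```python
from enum import IntEnum


class Mode(IntEnum):
    """Hypothetical operating modes, ordered from least to most autonomous."""
    DISABLED = 0
    READ_ONLY = 1
    HUMAN_APPROVAL = 2
    AUTONOMOUS = 3


# Assumed mapping from event severity to the number of levels to step down.
STEP_DOWN = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def degrade(current: Mode, severity: str) -> Mode:
    """Lower the autonomy level for this severity, never below DISABLED."""
    steps = STEP_DOWN.get(severity, 3)  # unknown severities fail closed
    return Mode(max(Mode.DISABLED, current - steps))
```

For example, a medium-severity event moves an autonomous agent to human approval, while a critical event disables it. Treating unknown severities as critical keeps the mechanism failing closed.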

Conclusion

The Agentic AI Security Scoping Matrix provides a structured mental model and framework for understanding and addressing the security challenges of autonomous agentic AI systems across four distinct scopes. By accurately assessing your current scope and implementing appropriate controls across all six security dimensions, organizations can confidently deploy agentic AI while managing the landscape of associated risks.

The progression from basic and highly constrained agents to fully autonomous and even self-directing agents represents a fundamental shift in how we approach AI security. Each scope requires specific security capabilities, and organizations must build these capabilities systematically to support their agentic ambitions safely.

Next steps

To implement the Agentic AI Security Scoping Matrix in your organization:

  1. Assess your current agentic use cases and maturity against the four scopes to understand your security requirements and associated risks, and integrate this assessment into your procurement and SDLC processes.
  2. Identify capability gaps across the six security dimensions for your target scope.
  3. Develop a progressive deployment strategy that builds security capabilities as you advance through scopes.
  4. Implement continuous monitoring and behavioral analysis appropriate for your scope level.
  5. Establish governance processes for scope progression and security validation.
  6. Train your teams on the unique security challenges of each scope.

You can find additional information on the Agentic AI Security Scoping matrix here, along with additional information on AI security topics. For additional resources on securing AI workloads, see the AI for security and security for AI: Navigating Opportunities and Challenges whitepaper and explore purpose-built platforms designed for the unique challenges of agentic AI.

If you have feedback about this post, submit comments in the Comments section below.

Aaron Brown

Aaron Brown is a Senior AI Security Architect at AWS with more than 8 years of experience designing, building, and shipping AI solutions in the offensive and defensive security domains.

Matt Saner

Matt Saner is a global security leader helping customers unblock and accelerate complex security challenges. He plays a key role in the development of security standards for AI, including serving on the project governing board and executive steering committee for the Coalition for Secure AI (CoSAI) and as a distinguished review board member for OWASP’s GenAI and Agentic AI security projects.

Introducing the Landing Zone Accelerator on AWS Universal Configuration and LZA Compliance Workbook

Post Syndicated from Kevin Donohue original https://aws.amazon.com/blogs/security/introducing-the-landing-zone-accelerator-on-aws-universal-configuration-and-lza-compliance-workbook/

We’re pleased to announce the availability of the latest sample security baseline from Landing Zone Accelerator on AWS (LZA): the Universal Configuration. Developed from years of field experience with highly regulated customers, including governments across the world, and in consultation with AWS Partners and industry experts, the Universal Configuration was built to help you implement security and compliance at scale for your regulated workloads. By setting a high bar with the latest AWS security best practices, the Universal Configuration can help address technical control requirements from compliance frameworks across different geographic regions and industry verticals. The Universal Configuration’s multi-account security architecture provides a foundation to host your diverse workload requirements today, along with the ability to explore the generative AI and agentic AI solutions that will shape your organization in the future. It can also replace months of complex planning and design by deploying a comprehensive security and compliance-driven environment based on AWS Well-Architected principles in a matter of hours.

As organizations grow, they typically pursue or must adhere to new security compliance certifications. LZA and the Universal Configuration help organizations of all sizes and phases in their security and compliance journey. The speed of deployment, step-by-step documentation, and compliance resources can reduce traditional assessment and authorization timelines by months and result in more predictable and successful audit outcomes. This enables more freedom to invest resources to grow the business instead of choosing between security and compliance tradeoffs.

The Universal Configuration helps organizations:

  • Automate the deployment of a secure multi-account AWS environment
    • Foundational security controls based on AWS Well-Architected best practices
    • Apply consistent and predictable security controls post-deployment
    • Enable and integrate with native AWS security, identity, and compliance services
  • Implement controls across system layers
    • Organization-wide security architecture
    • Perimeter and resource-specific preventative, proactive, and detective controls
    • Support for multi-AWS Region resilience, disaster recovery, and active failover
  • Establish a foundation for security and compliance readiness
    • Built-in AWS security best practices and technical implementation statements
    • Map LZA capabilities across global and industry-specific compliance frameworks
    • Deploy hundreds of controls in hours instead of months

The LZA Compliance Workbook

The LZA engine has been a trusted tool for quickly deploying secure multi-account AWS environments for over 4 years. It is also cost effective because you pay only for the AWS services used to operate your environment. The Universal Configuration is the first sample configuration accompanied by the LZA Compliance Workbook available on AWS Artifact. It’s a first-of-its-kind resource with detailed control mappings showing how the Universal Configuration can help you address requirements from frameworks including NIST 800-53 Rev5, CMMC/NIST 800-171, ISO-27001, HIPAA, C5:2020, NATO D-32 (Appendix B), and DoD CCI.

The LZA Compliance Workbook is regularly maintained to reflect the latest Universal Configuration baseline and will include additional compliance mappings in future releases. The workbook contains detailed security configuration descriptions based on the Universal Configuration deployment files, along with control requirement mappings and implementation statements that translate its security capabilities into a compliance-friendly format. By combining AWS security best practices with global compliance expertise, the Universal Configuration delivers predictable security outcomes while also helping you meet regional and industry requirements.

Getting started

To get started with the Landing Zone Accelerator on AWS Universal Configuration, the LZA Implementation Guide walks you through the steps, use cases, and considerations when deploying with LZA. You can download the LZA Compliance Workbook from AWS Artifact today and configure notifications to receive emails when future versions are released. You can view the deployment files and additional technical implementation guidance on the GitHub Universal Configuration sample and documentation page. Additionally, visit the AWS Partner Network (APN) for help with audit and advisory initiatives, cloud migrations, deploying the LZA Universal Configuration, and other services. You can visit the AWS Partner Finder tool and search by solution for Landing Zone Accelerator for the latest LZA Partner offerings.

If you have feedback about this post, submit comments in the Comments section below.

Kevin Donohue

Kevin is a Senior Security Compliance Engineer at AWS, where he builds solutions and resources to help AWS customers achieve their security and compliance goals. Prior to joining the Landing Zone Accelerator team in AWS Professional Services in 2024, Kevin began his tenure with AWS Security in 2019 specializing in FedRAMP compliance and the shared responsibility model.

Christine Screnci

Christine is a Principal Technical Product Manager at AWS, where she specializes in developing and scaling enterprise-level solutions. Christine began her tenure with AWS in 2016 working with Worldwide Public Sector customers to improve the migration and modernization journey through globally scaled solutions. She is passionate about hypothesis-driven development and experimentation to improve customer experiences with AWS technologies.

Bhavish Khatri

Bhavish is a Senior Delivery Engineer at AWS, where he builds enterprise-scale solutions to help large organizations achieve their compliance goals. Bhavish started at AWS in 2018, specializing in multi-account AWS deployments and focusing on LZA and the Universal Configuration solution. He helps organizations build secure, scalable cloud environments that align with global compliance frameworks and regulatory requirements across diverse sectors.

Enforce business glossary classification rules in Amazon SageMaker Catalog

Post Syndicated from Ramesh H Singh original https://aws.amazon.com/blogs/big-data/enforce-business-glossary-classification-rules-in-amazon-sagemaker-catalog/

Organizations are scaling their data catalogs faster than ever. Maintaining consistent metadata standards across teams remains a challenge. Business glossaries define the language of the enterprise—terms like Customer Profile, Transaction, or Confidential Data—but assets are often published without these classifications, leading to inconsistent metadata and poor discoverability.

To address this, Amazon SageMaker Catalog now supports metadata enforcement rules for glossary terms classification (tagging) at the asset level. With this capability, administrators can require that assets include specific business terms or classifications. Data producers must apply required glossary terms or classifications before an asset can be published. This enforces metadata consistency across the catalog and makes sure assets carry the business context needed for effective discovery and governance.

This capability builds on existing metadata rule features for enforcing required metadata fields during asset publishing. The new addition extends those rules to cover glossary term validation, strengthening the link between business language and technical data assets.

In this post, we show how to enforce business glossary classification rules in SageMaker Catalog.

Why metadata enforcement matters

A common governance challenge is the lack of standardized tagging and classification for assets entering enterprise catalogs. Without enforcement, data producers might publish assets missing required business terms (such as data sensitivity level or product domain), resulting in inconsistent metadata that confuses business users, unreliable search and filtering results, and manual cleanup and downstream compliance risks.

SageMaker Catalog automatically validates metadata when assets are published. This offers the following key benefits:

  • Assets are classified with approved business terms before publication
  • Validation supports compliance with internal glossary and classification standards
  • Consistent tagging enhances search accuracy and reduces noise
  • Incomplete or incorrectly tagged assets don’t reach consumers

How metadata enforcement works

On the Amazon SageMaker Unified Studio console, administrators navigate to Catalog, Governance, Rules and create metadata rules targeting the asset publishing workflow. Rules can specify required glossary terms or classification fields (for example, Business Unit, PII Category, or Data Sensitivity). Rules can apply organization-wide or within specific domains or projects.

When a producer attempts to publish an asset, SageMaker Catalog checks that the asset includes the required glossary terms or classifications. If any required metadata is missing, the publish action fails with a clear error message. After the metadata is added, the asset can be published successfully.

Enforced tagging makes sure published assets can be searched and filtered using consistent business terminology, improving catalog usability for analysts and business users.
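
Conceptually, the publish-time check described above is a comparison between the terms a rule requires and the terms attached to an asset. The following Python sketch models that comparison. The `Asset` class, `PublishError`, and the `publish` function are hypothetical names for illustration only and do not represent the SageMaker Catalog API.

```python
from dataclasses import dataclass, field


class PublishError(Exception):
    """Raised when an asset is missing required glossary terms."""


@dataclass
class Asset:
    name: str
    glossary_terms: set[str] = field(default_factory=set)


def publish(asset: Asset, required_terms: set[str]) -> str:
    """Publish only if every required glossary term is attached to the asset."""
    missing = required_terms - asset.glossary_terms
    if missing:
        # Fail with a message that names exactly what the producer must add
        raise PublishError(
            f"Cannot publish {asset.name!r}: missing required glossary terms: "
            + ", ".join(sorted(missing))
        )
    return f"{asset.name} published"
```

A producer who calls `publish` on an asset without the required term gets an error naming the missing term, and the same call succeeds after the term is added to `glossary_terms`.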

Solution overview

For this post, we explore a financial services use case. In our example, a financial services company defines a rule requiring all datasets published from the project to have the Finance glossary term associated:

  • A data producer attempting to publish a new dataset without this tag receives a validation error
  • After applying the correct classification, the dataset publishes successfully
  • Analysts can now filter the catalog to find only Finance datasets or join assets consistently tagged with the same glossary term

In the following sections, we walk through the steps to configure this solution. We create a rule that all assets published from a specific project should have a business unit tag called Finance.

Prerequisites

To test this solution, you should have a SageMaker Unified Studio domain set up with domain owner or domain unit owner privileges. You should also have an existing project to publish assets and catalog assets. For instructions to create these assets, see the Getting started guide.

In this example, we created a project named financial_analysis and a test table. For instructions to create a table, see Get started with Amazon S3 Tables in Amazon SageMaker Unified Studio. To ingest the sample data to SageMaker Catalog and generate business metadata, see Create an Amazon SageMaker Unified Studio data source for Amazon Redshift in the project catalog.

Create glossary and add terms

Complete the following steps to create a new glossary and add terms:

  1. In SageMaker Unified Studio, on the Discover menu, choose Glossaries.
  2. Choose Create glossary.
  3. Provide details for your glossary, including name, owning project, and optional description.
  4. For Glossary restriction, turn on Enabled.
  5. Choose Create.
  6. Create the term Finance in the Business Unit Details glossary.

Create rule to enforce glossary terms

Complete the following steps to create a rule that enforces glossary terms:

  1. On the Govern menu, choose Domain units.
  2. On the Rules tab, choose Add.
  3. Add a publishing rule for the Finance project to have the Finance tag for all assets published to the catalog.
  4. Choose Add rule.

    The following screenshot shows the configuration details for your new rule.

Publish asset with enforced rules

Complete the following steps to publish your asset with the enforced rules:

  1. On the financial_analysis project page, go to your asset.
  2. In the Glossary terms section, choose Add terms.

    If you choose Publish without adding the needed term, you get an error stating the Finance term should be assigned.
  3. Choose Finance to add the required term.
  4. Choose Publish asset.

The following screenshot shows the published asset and the required terms in the glossary.

Conclusion

With metadata enforcement rules for glossary terms, SageMaker Catalog brings stronger control and consistency to how organizations publish and manage their data assets. By requiring approved business classifications before publication, teams can make sure assets adhere to enterprise metadata standards, improving governance, discoverability, and trust in shared catalogs. This capability helps organizations scale their catalog governance without adding manual overhead—embedding compliance and quality directly into the publishing workflow.

Metadata enforcement rules for glossary terms are available in AWS Regions where SageMaker Catalog operates. To get started with this capability, refer to the user guide.


About the Authors

Ramesh Singh

Ramesh is a Senior Product Manager Technical (External Services) at AWS in Seattle, Washington, currently with the Amazon SageMaker team. He is passionate about building high-performance ML/AI and analytics products that help enterprise customers achieve their critical goals using cutting-edge technology.

Pradeep Misra

Pradeep is a Principal Analytics and Applied AI Solutions Architect at AWS. He is passionate about solving customer challenges using data, analytics, and AI/ML. Outside of work, he likes exploring new places, trying new cuisines, and playing badminton with his family. He also likes doing science experiments, building LEGOs, and watching anime with his daughters.

Pradyut Singh

Pradyut is a Software Development Engineer at AWS, working with the Amazon SageMaker team with a focus on Data and AI services. Outside of work, he has a passion for travel and enjoys going on long road trips, exploring diverse cuisines and discovering new places along the way.

Manny Pelaez

Manny is a UX Designer at AWS working on Amazon SageMaker Unified Studio. He is passionate about creating intuitive user experiences by listening to customers and focusing on their pain points. Outside of work, he enjoys driving, exploring food, art, sketching, and working on side projects. He also teaches a design course, sharing his expertise with aspiring designers.

Improve API discoverability with the new Amazon API Gateway Portal

Post Syndicated from Giedrius Praspaliauskas original https://aws.amazon.com/blogs/compute/improve-api-discoverability-with-the-new-amazon-api-gateway-portal/

Amazon API Gateway now provides a fully managed portal feature, Amazon API Gateway Portal, that eliminates the need for static websites, open source solutions, or third-party offerings, which often led to fragmented API lifecycle management and increased costs. API Gateway Portal integrates with the API Gateway service and offers features like API products, interactive “Try it” functionality, and documentation for your API portfolio.

This fully managed solution addresses the need for a seamless way to showcase APIs and help developers quickly find, try, and integrate with them. By providing a managed solution that handles infrastructure, security, and scalability, API providers can focus on creating valuable APIs and delivering a great developer experience.

In this post, we will show how you can use the new portal feature to create customizable portals with enhanced security features in minutes, with APIs from multiple accounts, without managing any infrastructure.

Overview

A developer portal is a web page where API providers can share their APIs and API documentation by grouping them into portal products. Each portal product is a logical grouping of REST APIs and contains the documentation that you create and publish for your API consumers. Product pages within a portal contain the custom documentation at the portal product level. Product REST endpoint pages contain the documentation for each REST API, with the details of its path, method, and the stage it’s deployed to. Together, Product pages and Product REST endpoint pages provide the complete documentation your API consumers need to start using your REST APIs.

This abstraction allows you to organize endpoints from multiple APIs and stages into coherent product offerings for your consumers. For example, if you operate multiple APIs supporting a pet adoption service, you can create an “AdoptAnimals” portal product that groups dog-related endpoints from one API with cat-related endpoints from another API, while organizing user management functions into a separate “AdoptProcess” portal product.
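
To make the grouping concrete, the following Python sketch models the pet adoption example as plain data structures. The `Endpoint` and `PortalProduct` classes and the API identifiers (`dogs-api`, `cats-api`, `users-api`) are hypothetical and only illustrate the idea that one portal product can draw endpoints from several APIs. They are not the API Gateway data model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Endpoint:
    api_id: str  # hypothetical identifier of the REST API that owns the endpoint
    stage: str   # stage the endpoint is deployed to
    method: str
    path: str


@dataclass
class PortalProduct:
    name: str
    endpoints: list[Endpoint] = field(default_factory=list)

    def source_apis(self) -> set[str]:
        """Return the distinct APIs this product draws endpoints from."""
        return {endpoint.api_id for endpoint in self.endpoints}


# One product groups dog and cat endpoints that live in two different APIs
adopt_animals = PortalProduct("AdoptAnimals", [
    Endpoint("dogs-api", "prod", "GET", "/dogs"),
    Endpoint("cats-api", "prod", "GET", "/cats"),
])

# User management functions are presented as a separate product
adopt_process = PortalProduct("AdoptProcess", [
    Endpoint("users-api", "prod", "POST", "/users"),
])
```

Here `adopt_animals.source_apis()` spans two APIs, while consumers browsing the portal would see a single product.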

With this flexibility you can present your APIs in a way that matches your business logic rather than your technical architecture and organize your APIs in ways that make the most sense for your consumers. For large enterprises managing extensive API portfolios, API Gateway Portal offers centralized catalogs of APIs across business groups, reducing duplicate work and improving standardization.

The portal feature automatically creates developer portals that display APIs with documentation, interactive testing capabilities, and integrated consumer analytics. The platform uses AWS Resource Access Manager (RAM) for multi-account API sharing, Amazon Cognito for access control, and Amazon CloudWatch for centralized monitoring.

Key features of API Gateway Portal

The API Gateway Portal provides comprehensive functionality for both API providers and consumers.

The following is a list of the key features that were introduced by the service at launch:

Customizable portal experience: You control your portal’s branding through custom logos and color schemes. You can configure custom domain names with SSL certificates managed by AWS Certificate Manager, or use the default domain structure provided by AWS.

Flexible access control: Access to developer portals can be controlled using Amazon Cognito. You can configure portals to be either publicly accessible or to require authentication. Integration with Cognito user pools provides secure and scalable identity and access management that is enterprise-grade, cost-effective, and customizable. For organizations using existing identity systems, Cognito supports federation with SAML and OpenID Connect identity providers.

Cross-account API organization: The portal supports sharing portal products across AWS accounts using AWS RAM, so that organizations can create a unified API catalog while maintaining flexibility for API providers to develop and maintain APIs in their own accounts. When you share a portal product with another account, that account cannot modify any properties of your portal product or product endpoint pages, so API providers maintain control over their APIs while still enabling discovery across the organization. The cross-account sharing capabilities provide significant governance benefits for enterprise customers, including centralized discovery, standardization, reduced duplication, clear ownership, and controlled access.

Documentation: Beyond API reference documentation synchronized from your API definitions, you can add supplemental documentation including guides, use cases, and integration examples.

Search, discovery, and interactive API exploration: Consumers can search across your entire catalog. The portal provides intuitive, customizable navigation and organization to help users find the right endpoints for their needs. Using the “Try It” functionality, consumers can try APIs directly from the portal. Users can enter request parameters and headers and see live responses, reducing time-to-value for API integrations. This environment includes built-in limits for security and cost control.

Access control and governance

Amazon API Gateway Portal provides security and governance capabilities essential for production deployments.

Identity and access management: Integration with Cognito user pools provides secure and scalable identity and access management that is enterprise-grade, cost-effective, and customizable, including multi-factor authentication, password policies, and user lifecycle management.

API authorization: The portal respects existing authorization mechanisms configured on your APIs, including AWS IAM, Lambda authorizers, and Cognito user pools. Portal access doesn’t bypass your established security controls.

Cross-account governance: When sharing portal products across accounts using AWS RAM, the original API owners retain full control over their endpoints, including authorization strategies, integration configurations, and stage settings. Portal owners can use shared portal products but cannot modify the underlying API configurations.

Audit and monitoring: All portal management activities integrate with AWS CloudTrail for comprehensive audit logging. You can use Amazon CloudWatch RUM to perform real user monitoring to collect and view analytics about API consumers in near real time.

Resource limits: The service includes built-in quotas to prevent abuse, including limits on API testing rates, payload sizes, and integration timeouts. With these limits, the “Try It” functionality cannot impact your production API performance.

Getting Started

Setting up a portal involves three main steps: creating portal products, configuring the portal, and publishing for consumer access. We will walk through those steps in more detail.

Create portal product

The following procedure shows you how to create a portal product:

  1. Navigate to the API Gateway console and select Portal products from the main navigation.
  2. Choose Create portal product and specify your portal product details including name, description, and visibility settings.
  3. Next, select the endpoints you want to include in this portal product. You can choose entire API stages or specific resources and methods, and even rename endpoints with user-friendly names for better discoverability.
  4. The system automatically imports your API documentation. You can improve the documentation with additional context, use cases, and examples later.
  5. Organize product endpoints into custom categories that reflect your business logic rather than technical implementation details.

Configure the developer portal

The following procedure shows how to create a portal.

  1. Select Developer portals in the API Gateway console navigation.
  2. Specify your portal name, description, and domain configuration.
  3. Choose between adding your prefix to the default AWS domain or configuring a custom domain name with your own SSL certificate.
  4. Configure access control by selecting authentication requirements. For internal portals, you might require Amazon Cognito authentication, while public portals can allow anonymous access to documentation.
  5. Upload your logo and select color themes to match your brand identity.
  6. Add your portal products. You can include products from your account or products shared with you from other accounts through AWS RAM. The portal provides search and filtering capabilities for consumers.

Preview and publish

Before making your portal publicly available, use the preview functionality to review the consumer experience. The preview shows exactly how your portal will appear to users, including navigation, documentation, and available API testing capabilities.

When you’re satisfied with the configuration, choose Publish portal to make it accessible to consumers. The publishing process typically completes within a few minutes, and API Gateway provides the final portal URL for distribution to your consumers.

Conclusion and next steps

The new API Gateway Portal eliminates the complexity of building and maintaining custom API documentation sites. Your developers get a professional, feature-rich experience where they can discover and try your APIs immediately. Plus, since everything stays within AWS, you get built-in security, simplified operations, and comprehensive observability through integration with services like CloudWatch and CloudTrail.

Ready to streamline your API discovery experience? Here’s how to get started:

Building an AI gateway to Amazon Bedrock with Amazon API Gateway

Post Syndicated from Thomas Natschläger original https://aws.amazon.com/blogs/architecture/building-an-ai-gateway-to-amazon-bedrock-with-amazon-api-gateway/

When building generative AI applications, enterprises need to govern foundation model usage through authorization, quota management, tenant isolation, and cost control. To meet these needs, Dynatrace developed a robust AI gateway architecture that has evolved into a reusable reference pattern for organizations looking to control access to Amazon Bedrock services at scale.

This pattern uses Amazon API Gateway as the access layer in front of Amazon Bedrock. It supports key capabilities such as request authorization with seamless integration into existing identity systems (for example, JWT validation), usage quotas and request throttling, lifecycle management, canary releases, and AWS WAF integration. The gateway also uses Amazon API Gateway response streaming, launched today, for real-time delivery of API model outputs that stream to users as they are generated. The complete solution code is available in our GitHub repository.

In this blog post, you’ll explore the underlying architecture, learn how to deploy and configure the solution, and discover further enhancement ideas.

Architecture of the AI gateway

The reference architecture gives you granular control over LLM access using fully managed AWS services. It is transparent to client applications and seamlessly integrates into existing enterprise environments.


Figure 1. Reference architecture of the AI gateway.

The solution consists of the following components (Amazon Route 53 is optional):

  1. Amazon Route 53 (optional) manages custom domain routing, allowing clients to access the gateway through a company-specific endpoint instead of the default Amazon API Gateway domain.
  2. Amazon API Gateway serves as the entry point for the requests and provides capabilities like authorization, request throttling, and lifecycle management.
  3. AWS Lambda authorizer handles request authorization, which in the Dynatrace implementation involves validating JWT tokens with existing authentication systems. For your specific implementation, you can implement your own authorization logic in a Lambda authorizer, integrate with Amazon Cognito user pools, or use other API Gateway authorization mechanisms.
  4. Lambda integration is a dynamic request forwarder that signs incoming requests with AWS credentials and routes them to the appropriate Amazon Bedrock endpoints. The function preserves the original request details, including the API action and parameters, to support current or future Amazon Bedrock APIs without code changes. The complete implementation is available in the integration Lambda function.
  5. Amazon Bedrock provides access to foundation models and AI capabilities.
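
For orientation, the following Python sketch shows the general shape of a token-based Lambda authorizer: it reads the bearer token from the event, validates it, and returns an IAM policy that allows or denies `execute-api:Invoke` on the called method. This is not the Dynatrace implementation. The HS256 check against a `SHARED_SECRET` is a stand-in chosen so the example stays self-contained, and it assumes a TOKEN-type authorizer. A production authorizer would more likely verify tokens against your identity provider’s published keys.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

SHARED_SECRET = b"demo-secret"  # hypothetical; for illustration only


def _b64url_decode(segment: str) -> bytes:
    # JWT segments omit base64 padding, so restore it before decoding
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def verify_hs256(token: str, secret: bytes) -> Optional[dict]:
    """Return the claims if signature, algorithm, and expiry check out."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        signed_part = f"{header_b64}.{payload_b64}".encode("ascii")
        expected = hmac.new(secret, signed_part, hashlib.sha256).digest()
        if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
            return None
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
        if header.get("alg") != "HS256" or claims.get("exp", 0) < time.time():
            return None
        return claims
    except (ValueError, TypeError, AttributeError):
        # Malformed tokens are treated the same as invalid ones
        return None


def handler(event, context):
    """Lambda authorizer entry point (assumes a TOKEN-type authorizer event)."""
    token = event.get("authorizationToken", "").removeprefix("Bearer ")
    claims = verify_hs256(token, SHARED_SECRET)
    return {
        "principalId": claims.get("sub", "unknown") if claims else "anonymous",
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [{
                "Action": "execute-api:Invoke",
                "Effect": "Allow" if claims is not None else "Deny",
                "Resource": event["methodArn"],
            }],
        },
    }
```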

The benefit of this architecture is the transparency to client applications and future-proof design. Clients can use AWS SDKs (like Boto3) to access Amazon Bedrock functionalities (such as LLMs and Knowledge Bases) exactly as they would when calling the Amazon Bedrock API directly. Meanwhile, the AI gateway handles authorization, quota management, and other capabilities behind the scenes.

When a client makes an Amazon Bedrock API call to the AI gateway endpoint, the Lambda integration function:

  1. Captures the original request with its details (headers, body, and parameters).
  2. Applies AWS Signature Version 4 authentication.
  3. Forwards the request to the correct Amazon Bedrock service endpoint.

With this approach the AI gateway can support current and new Amazon Bedrock features without requiring specific API knowledge or code updates, reducing gateway maintenance as the available features grow.
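
The sign-and-forward steps can be sketched with the Python standard library alone. The example below is not the integration Lambda function from the repository. It is a simplified illustration of AWS Signature Version 4 that assumes the URL path is already URI-encoded, there is no query string, and only the `host` and `x-amz-date` headers are signed. It also omits the `x-amz-security-token` header that temporary credentials require. The `target_url` and `sign_request` helpers are hypothetical names, and the regional endpoint format in `target_url` is an assumption.

```python
import hashlib
import hmac
from datetime import datetime, timezone
from urllib.parse import urlsplit


def target_url(endpoint_prefix: str, region: str, path: str) -> str:
    """Build the service URL from the endpoint prefix sent by the client.

    Assumption: the service uses the standard regional endpoint naming.
    """
    return f"https://{endpoint_prefix}.{region}.amazonaws.com{path}"


def _hmac_sha256(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def sign_request(method, url, body, *, service, region,
                 access_key, secret_key, now=None):
    """Return the headers that carry a SigV4 signature for the request."""
    now = now or datetime.now(timezone.utc)
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    date_stamp = now.strftime("%Y%m%d")
    parts = urlsplit(url)

    # 1. Canonical request: method, path, (empty) query, headers, payload hash
    canonical_headers = f"host:{parts.netloc}\nx-amz-date:{amz_date}\n"
    signed_headers = "host;x-amz-date"
    payload_hash = hashlib.sha256(body).hexdigest()
    canonical_request = "\n".join(
        [method, parts.path or "/", "", canonical_headers,
         signed_headers, payload_hash]
    )

    # 2. String to sign, scoped to date, Region, and service
    scope = f"{date_stamp}/{region}/{service}/aws4_request"
    string_to_sign = "\n".join(
        ["AWS4-HMAC-SHA256", amz_date, scope,
         hashlib.sha256(canonical_request.encode("utf-8")).hexdigest()]
    )

    # 3. Derive the signing key through the HMAC chain, then sign
    key = ("AWS4" + secret_key).encode("utf-8")
    for part in (date_stamp, region, service, "aws4_request"):
        key = _hmac_sha256(key, part)
    signature = hmac.new(
        key, string_to_sign.encode("utf-8"), hashlib.sha256
    ).hexdigest()

    return {
        "Host": parts.netloc,
        "X-Amz-Date": amz_date,
        "Authorization": (
            f"AWS4-HMAC-SHA256 Credential={access_key}/{scope}, "
            f"SignedHeaders={signed_headers}, Signature={signature}"
        ),
    }
```

Because neither helper inspects the request body or the API action, the same code path would serve any operation the client sends, which is the property that lets the gateway support new Amazon Bedrock features without code changes. In practice, an SDK signer that handles encoding, query strings, and session tokens is the safer choice.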

Deploying with AWS CloudFormation

This walkthrough will deploy a private AI gateway with authorization disabled for initial testing. You’ll create the core infrastructure (API Gateway, Lambda functions, and VPC endpoints) and then test basic functionality before optionally adding security features.

The quickest way to deploy this solution is with AWS CloudFormation:

  1. Sign in as an administrator to the AWS Management Console and use the navigation bar to select your desired AWS Region for deployment.
  2. Choose the following Launch Stack button:

launch stack button

  3. In the Quick create stack page, configure the key parameters as follows. For complete parameter descriptions, see the documentation.

     Parameter | Description | Choose value | Why
     EndpointType | API Gateway endpoint accessibility (PRIVATE or REGIONAL) | PRIVATE | Secure internal access only
     EnableAuthorizer | Enable Lambda Authorizer for API Gateway | false | Start without auth for simpler testing
     CustomDomain | Custom domain name for API Gateway | (leave empty) | Use default domain initially
     HostedZoneId | Route 53 Hosted Zone ID for custom domain SSL validation | (leave empty) | Not needed with default domain

  4. Select the capability I acknowledge that AWS CloudFormation might create IAM resources.
  5. Leave all other configurations at their default values and choose Create Stack.
  6. In the stack page, wait until the Status of the stack transitions to CREATE_COMPLETE.
  7. Choose Outputs and copy the values for GatewayUrl, VpcId, and ApiId – you’ll use these to test your gateway later.

Testing the deployment

Your gateway is now running privately inside its VPC, but that means you can’t reach it from the outside. You’ll now create an AWS CloudShell environment inside the VPC to test the gateway:

  1. Open the CloudShell console page, choose the + icon and then choose Create VPC environment.
  2. On the Create a VPC environment page, configure:
    1. Name: for example, AIGatewayTest
    2. Virtual Private Cloud (VPC): the VpcId you copied earlier
    3. Subnet: any available subnet
    4. Security group: the default VPC security group
  3. Choose Create to create your VPC environment.

Once your CloudShell environment is ready, you’ll create a client that properly routes requests through your private API endpoint. First, execute this command in CloudShell to create a reusable client factory that routes requests through your gateway while maintaining the standard boto3 interface:

cat > boto3_client_factory.py << 'EOF'
import boto3
from botocore import UNSIGNED
from botocore.client import BaseClient
from botocore.config import Config

class Boto3ClientFactory:
    """Utility class for creating boto3 clients."""
    
    @classmethod
    def create(cls, service_name: str, endpoint_url: str, jwt_token: str = None) -> BaseClient:
        """Create a boto3 client instance usable with the sign-and-forward Lambda.
        
        Parameters
        ----------
        service_name : str
            The service name to be used when initiating boto3.client.
        endpoint_url: str
            URL pointing to API gateway invoke endpoint.
        jwt_token: str, optional
            JWT token to include in Authorization header for all requests.
            If provided, adds 'Authorization: Bearer {jwt_token}' header.
        """
        generic_client = boto3.client(
            service_name = service_name,
            endpoint_url = endpoint_url,
            # do not sign the request at the client side; authentication is done
            # in API gateway
            config = Config(signature_version = UNSIGNED),
            # a non-None region_name needs to be passed due to validation logic
            # it is NOT used if we pass endpoint_url above
            region_name="",
        )
        
        def add_client_headers(model, params, **kwargs):
            """Hook to add custom headers to each request."""
            headers = params["headers"]
            
            # the header "aws-endpoint-prefix" is used by the Lambda integration
            headers["aws-endpoint-prefix"] = model.service_model.endpoint_prefix
            
            # Add Authorization header if jwt_token is provided
            if jwt_token:
                headers["Authorization"] = f"Bearer {jwt_token}"
        
        # register the hook to add custom headers before each call
        generic_client.meta.events.register("before-call.*.*", add_client_headers)
        
        return generic_client
EOF

With the factory in place, you can now create clients for different Amazon Bedrock services. Set your configuration variables:

# Replace with your actual Gateway URL from CloudFormation Outputs (used in all examples)
export GATEWAY_URL="https://your-api-id.execute-api.region.amazonaws.com/v1"
# Replace with the ID of one of your existing Amazon Bedrock knowledge bases (optional, only needed for the knowledge base example)
export KB_ID="your-kb-id"

Test model inference with Amazon Bedrock ConverseStream API:

cat > test_converse_stream.py << 'EOF'
import os
from boto3_client_factory import Boto3ClientFactory
import json

# Get configuration from environment variables
api_gateway_url = os.environ['GATEWAY_URL']

# Create client for Bedrock Runtime (model inference)
bedrock_runtime_client = Boto3ClientFactory.create(
    service_name = "bedrock-runtime",
    endpoint_url = api_gateway_url
)

response = bedrock_runtime_client.converse_stream(
    modelId = 'global.anthropic.claude-haiku-4-5-20251001-v1:0',
    messages = [{"role": "user", "content": [{"text": "Who invented the airplane?"}]}]
)

print("Model Response:")
# Stream the response as it arrives
for event in response['stream']:
    if 'contentBlockDelta' in event:
        delta = event['contentBlockDelta']['delta']
        if 'text' in delta:
            print(delta['text'], end='', flush=True)  # Print each chunk immediately
    elif 'messageStop' in event:
        print("\n")  # End of stream
        break
EOF

python test_converse_stream.py
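
If you want the complete answer as a single string rather than printing chunks as they arrive, the same event loop can accumulate the text. This is a minimal sketch; collect_stream_text is a hypothetical helper that handles only the contentBlockDelta and messageStop events used in the preceding script.

```python
def collect_stream_text(events) -> str:
    """Join the text deltas of a ConverseStream event iterable into one string."""
    chunks = []
    for event in events:
        if "contentBlockDelta" in event:
            delta = event["contentBlockDelta"]["delta"]
            if "text" in delta:
                chunks.append(delta["text"])
        elif "messageStop" in event:
            break  # end of the model's message
    return "".join(chunks)
```

In the test script, you would call collect_stream_text(response['stream']) in place of the printing loop.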

Test retrieval from Amazon Bedrock Knowledge Bases:

cat > test_knowledge_base.py << 'EOF'
import os
from boto3_client_factory import Boto3ClientFactory

# Get configuration from environment variables
api_gateway_url = os.environ['GATEWAY_URL']
knowledge_base_id = os.environ['KB_ID']

# Create client for Bedrock Agent Runtime (knowledge bases)
bedrock_kb_client = Boto3ClientFactory.create(
    service_name = "bedrock-agent-runtime",
    endpoint_url = api_gateway_url
)

response = bedrock_kb_client.retrieve(
    knowledgeBaseId = knowledge_base_id,
    retrievalQuery = {'text': 'Who invented the airplane?'},
    retrievalConfiguration = {
        'vectorSearchConfiguration': {
            'numberOfResults': 5
        }
    }
)

print("Knowledge base retrieval results:")
print(response)
EOF

python test_knowledge_base.py
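
The raw response printed by this script can be long. As a rough sketch, the following hypothetical helper reduces it to a list of score and text-preview pairs, assuming the response carries a retrievalResults list whose entries have content.text and score fields.

```python
def summarize_results(response: dict, max_chars: int = 80) -> list:
    """Return (score, text preview) pairs from a Retrieve response."""
    rows = []
    for result in response.get("retrievalResults", []):
        text = result.get("content", {}).get("text", "")
        rows.append((result.get("score"), text[:max_chars]))
    return rows
```

You could then print each pair on its own line instead of printing the whole response.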

Configuring authorization

After testing the basic functionality, you can now enable authorization by updating your deployed stack with custom authorization logic. For example, to implement JWT validation in a Lambda authorizer:

  1. Open the CloudFormation template file with your favorite text editor and replace the following placeholder code in the Lambda authorizer with your own authorization logic (see examples):
def lambda_handler(event, context):
    try:
        # Placeholder for JWT validation
        # Implement your authorization logic
        token = event['authorizationToken']
        
        # By default, deny all requests
        return generate_policy('user', 'Deny', event['methodArn'])
    except Exception as e:
        return generate_policy('user', 'Deny', event['methodArn'])
  2. Update the CloudFormation stack to enable your new authorization logic:
    1. Go back to the CloudFormation console
    2. Select your bedrock-llm-gateway stack, choose Update stack, and choose Make a direct update
    3. Choose Replace existing template, upload your modified template file, and choose Next
    4. In the parameters section, change EnableAuthorizer from false to true and choose Next
    5. Select the capability I acknowledge that AWS CloudFormation might create IAM resources.
    6. Choose Next and choose Submit
  3. Once the CloudFormation stack update is complete, deploy your changes to the API Gateway. API Gateway requires an explicit deployment step to activate configuration changes. While CloudFormation updates the API Gateway resources, the changes won’t be active until you create a new deployment to push them to the stage.
    1. Navigate to the API Gateway console
    2. Choose your API (find it using the ApiId from your CloudFormation outputs)
    3. Choose Deploy API
    4. For Stage select v1, and choose Deploy
  4. Test your authorization by going back to CloudShell and running this example. This example passes a JWT token to test the authorization – replace it with your actual token or other authorization parameters you configured in the Lambda authorizer.
cat > test_with_auth.py << 'EOF'
import os
from boto3_client_factory import Boto3ClientFactory

# Get configuration from environment variables
api_gateway_url = os.environ['GATEWAY_URL']

# Replace "your-jwt-token" with your actual JWT token
jwt_token = "your-jwt-token"

bedrock_runtime_client_with_jwt = Boto3ClientFactory.create(
    service_name = "bedrock-runtime",
    endpoint_url = api_gateway_url,
    jwt_token = jwt_token
)

response = bedrock_runtime_client_with_jwt.converse_stream(
    modelId = 'global.anthropic.claude-haiku-4-5-20251001-v1:0',
    messages = [{"role": "user", "content": [{"text": "Who invented the airplane?"}]}]
)

print("Model Response:")
# Stream the response as it arrives
for event in response['stream']:
    if 'contentBlockDelta' in event:
        delta = event['contentBlockDelta']['delta']
        if 'text' in delta:
            print(delta['text'], end='', flush=True)  # Print each chunk immediately
    elif 'messageStop' in event:
        print("\n")  # End of stream
        break
EOF

python test_with_auth.py
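
To make the placeholder authorizer from step 1 more concrete, here is one possible sketch of the authorization logic. It is an illustration under several assumptions, not the code shipped in the template: it validates an HS256-signed JWT with a hypothetical shared secret using only the standard library, and generate_policy is reconstructed here in the usual Lambda authorizer response shape. Production setups more commonly verify RS256 tokens against an identity provider's published keys.

```python
import base64
import hashlib
import hmac
import json
import time

# Hypothetical shared secret; a real deployment would load it from a secret store.
SHARED_SECRET = b"replace-with-your-secret"


def _b64url_decode(part: str) -> bytes:
    """Decode base64url text, restoring the padding that JWTs omit."""
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))


def verify_hs256(token: str, secret: bytes) -> dict:
    """Return the claims of an HS256-signed JWT, or raise ValueError."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(signature_b64)
    except Exception as exc:
        raise ValueError("malformed token") from exc
    if header.get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    signed_part = f"{header_b64}.{payload_b64}".encode("ascii")
    expected = hmac.new(secret, signed_part, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, signature):
        raise ValueError("bad signature")
    if claims.get("exp", 0) <= time.time():
        raise ValueError("token expired")
    return claims


def generate_policy(principal_id: str, effect: str, resource: str) -> dict:
    """Build an authorizer response that allows or denies execute-api:Invoke."""
    return {
        "principalId": principal_id,
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {"Action": "execute-api:Invoke", "Effect": effect, "Resource": resource}
            ],
        },
    }


def lambda_handler(event, context):
    try:
        token = event["authorizationToken"]
        # The client factory sends "Bearer <token>", so strip the scheme if present.
        if token.startswith("Bearer "):
            token = token[len("Bearer "):]
        claims = verify_hs256(token, SHARED_SECRET)
        return generate_policy(claims.get("sub", "user"), "Allow", event["methodArn"])
    except Exception:
        # Any failure results in an explicit deny.
        return generate_policy("user", "Deny", event["methodArn"])
```

The handler keeps the deny-by-default behavior of the placeholder: only a token that passes every check produces an Allow policy.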

Enhancement options for the AI gateway

The solution can be enhanced using additional API Gateway capabilities. Here are some examples:

  • Rate limiting and throttling: Control request rates using usage plans and API keys. This is especially important in multi-tenant SaaS applications to avoid noisy neighbor problems. For examples of throttling scenarios, see the throttling documentation.
  • Private or edge-optimized endpoints: Configure endpoint types to optimize for internal access or global performance.
  • Lifecycle management and canary releases: Manage multiple API versions and implement gradual rollouts with stage variables and canary deployments.
  • WAF integration: Add AWS WAF rules to help protect from common exploits.
  • Prompt and response caching: Implement caching strategies to reduce costs and improve response times for frequently requested prompts using API Gateway caching.
  • Content filtering: In addition to the safeguards offered by Amazon Bedrock Guardrails, add custom filtering in the Lambda integration layer to screen for sensitive content such as personally identifiable information (PII).

For more information about these capabilities, visit the API Gateway features page.
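
As an illustration of the content filtering option, the following sketch shows the kind of redaction step you might add to the Lambda integration layer. The patterns and the redact helper are hypothetical and intentionally simple; they cover only email addresses and US Social Security number formats and are not a complete PII detector.

```python
import re

# Illustrative patterns only; real PII screening needs broader coverage.
_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace each match of a known pattern with a bracketed label."""
    for label, pattern in _PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

A filter like this could run on the prompt before it is forwarded to Amazon Bedrock, on the response before it is returned, or both.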

Conclusion

The AI gateway pattern demonstrated in this post provides a scalable way to manage access to foundation models and agent tools through Amazon Bedrock. Initially developed and implemented by Dynatrace to serve their global user base, this pattern has proven its effectiveness at enterprise scale. By using the enterprise features of Amazon API Gateway, organizations can implement the necessary controls while maintaining the benefits of a serverless architecture.

To start using this solution today, follow the walkthrough in this blog post or check out our GitHub repository. To learn more about the services used in this solution, explore the Amazon API Gateway features page, or visit the documentation for Amazon API Gateway and Amazon Bedrock. To learn how this solution streams foundation model responses, see the documentation for the new Amazon API Gateway response streaming capability.



How to register for a US toll-free number with AWS End User Messaging

Post Syndicated from Tyler Holmes original https://aws.amazon.com/blogs/messaging-and-targeting/how-to-register-for-a-us-toll-free-number-with-aws-end-user-messaging/

As businesses increasingly use SMS messaging to engage with customers at scale, having the right origination identity is crucial. Toll-free numbers (TFNs) are the quickest way to begin sending to the United States and offer a trusted, high-visibility option that can drive greater response and brand recognition. This post is for every company that wants to send to the US or internationally.

Obtaining and properly registering a US toll-free number requires a registration process and adhering to requirements set forth by mobile carriers. This comprehensive guide walks you through the step-by-step procedure for registering a US toll-free number through AWS End User Messaging, which provides robust SMS capabilities to AWS customers.

The benefits of using a US toll-free number

TFNs offer several key advantages over other SMS origination types in the US market:

Toll-free facts

  • The opt-out flow for US TFNs is managed at a network level and enforced by US carriers. If a user sends the word stop (or any of the other supported keywords) to the TFN, the carrier sends the following outbound message to the user: NETWORK MSG: You replied with the word "stop" which blocks all texts sent from this number. Text back unstop or start to receive messages again. This behavior cannot be changed.
  • Toll-free numbers have a throughput of three Message Parts per Second (MPS).
  • International toll-free numbers are two-way capable in the US and Canada but are one-way only in all other supported countries. Depending on the country being sent to, if not the US or Canada, your end-user can receive your message from an originator other than your TFN. This feature can be turned on before or after registration.
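
To see what a throughput of three message parts per second means in practice, the following sketch estimates how long a send takes. It assumes the commonly documented SMS part sizes (160 characters for a single GSM-7 message and 153 per part when concatenated; 70 and 67 for UCS-2), which are not stated in this post, and the helper names are hypothetical.

```python
import math

MPS = 3  # toll-free throughput in message parts per second, as stated above


def message_parts(length: int, unicode: bool = False) -> int:
    """Estimate how many message parts a message of the given length uses."""
    single, multi = (70, 67) if unicode else (160, 153)
    if length <= single:
        return 1
    return math.ceil(length / multi)


def seconds_to_send(recipients: int, length: int, unicode: bool = False) -> float:
    """Estimate the minimum time to send one message to every recipient."""
    return recipients * message_parts(length, unicode) / MPS
```

For example, a 200-character GSM-7 message uses two parts, so sending it to 900 recipients takes at least 900 × 2 / 3 = 600 seconds.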

The TFN registration process

To get started, you need to create a US toll-free number registration in the AWS Management Console for AWS End User Messaging or use the API.

  1. Company information: Provide details about your business, including the company name, website, and headquarters address.
  2. Contact information: Enter the name, email, and phone number of the individual who will serve as the main point of contact for your TFN program. This email address should match the domain of the company being registered and cannot be a distribution list, contact group, or mailing list. This information will be used for verification or in the event of something needing to be communicated to you about your TFN. It will not be public knowledge.
  3. Messaging use case: Describe how you intend to use the TFN, including your estimated monthly SMS volume, and select the Use Case Category (such as two-factor authentication, notifications, or marketing).
  4. Use case details: It’s critical that the Use Case Details field and all message templates are consistent with the Use Case Category you selected in the previous step.

For example, if you select two-factor authentication or one-time passwords, your Use Case Details should explain how you plan to use your TFN for that use case, who you will interact with, and why. Answers must be written in English, and it is very important to be clear and concise in this section. Humans are reviewing these, so make sure that everything you write can be understood without prior knowledge of your company or your use case.

  1. Opt-in Workflow Description: This has several boilerplate components that must be present at the point of opt-in and are discussed in depth in this blog post. If you have a verbal opt-in, you can include the script in this field. If you have a publicly available form, you can supply the URL in the description. Regardless of the format, you must include the following elements at the point of opt-in:
    1. Program (brand) name.
    2. Explicitly state the purpose of the SMS program that your end-users are opting into.
    3. Have no prefilled checkboxes, radio buttons, or other fields.
    4. Message frequency disclosure. For example: Message frequency varies or One message per login.
    5. Customer care contact information. For example, Text HELP or call 1-800-111-2222 for support.
    6. Opt-out information. For example: Text STOP to opt-out of future messages.
    7. Include Message and data rates may apply disclosure.
    8. Link to a publicly accessible terms and conditions page.
        • Note: See this post on opt-in processes for terms that must be included.
        • If you are unable to include a public link to your terms, you can include them in the Opt-in workflow image field or alternatively attach them to the registration form or another method like an Amazon S3 presigned URL. Make sure to keep it separate from the actual opt-in screenshots.
    9. Link to a publicly accessible privacy policy page.
        • Note: Carriers are primarily concerned with data sharing of opt-in information to third parties. It’s recommended to have a specific SMS section that addresses that no data gathered during opt-in is shared. See this post on opt-in processes for more details on creating a compliant privacy policy.
        • If you’re unable to include a public link, you can include the full terms in the Opt-in workflow image field or alternatively attach them to the registration form or another method like an Amazon S3 presigned URL. Make sure to keep it separate from the actual opt-in screenshots.
  2. Opt-in workflow image: Upload an image showing how users consent to receiving messages.
    • The maximum file size is 500 KB, and valid file extensions are PDF, JPEG, and PNG.
    • This could be a screenshot of a non-public form, a written consent form, or other evidence of a compliant explicit opt-in that includes all the elements detailed previously.
    • Make sure that the screenshot is clear and readable; degraded image quality will likely be rejected regardless of compliance.
  3. Message samples: Each sample message should reflect actual messages to be sent, should match the Use Case Category you indicated previously, and should follow these best practices:

    • Indicate any variable fields with brackets and make sure to be clear what information can be replaced.
    • Example: Hi, [FirstName] this is AnyCompany letting you know that your delivery is ready.
    • Each sample message must be at least 20 characters. If you plan to use multiple message templates, include them too.
    • Ensure that all messages include your brand name and that it’s consistent with the previously entered information.
    • Make sure your messaging doesn’t involve prohibited content such as cannabis, hate speech, and so on; and that your use case is compliant with AWS Messaging Policy.
  4. Review and submit: Verify that all information is accurate before submitting your registration for approval. There are no exceptions to an explicit opt-in—this includes one-time password use cases, so make sure that your registration includes all the required elements.
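
Before you submit, you can run a quick self-check on your message samples. The following is a minimal sketch of the rules listed above (minimum length, brand name present, bracketed variable fields); check_sample is a hypothetical helper, and passing it does not guarantee approval by the reviewers.

```python
import re


def check_sample(message: str, brand: str) -> list:
    """Return a list of problems found in one sample message."""
    problems = []
    if len(message) < 20:
        problems.append("shorter than 20 characters")
    if brand.lower() not in message.lower():
        problems.append("brand name missing")
    # Flag placeholder styles other than [brackets], such as {name}, <name>, or %name%.
    if re.search(r"\{[^}]*\}|<[^>]*>|%\w+%", message):
        problems.append("variable fields should use [brackets]")
    return problems
```

An empty list means the sample passed these basic checks.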

The TFN provisioning process

After your TFN registration is submitted, it is reviewed by the same third party that reviews toll-free registrations for all other SMS vendors across the globe, not by AWS. You can find current registration time estimates in the number registration process. While waiting, you can monitor your registration status for rejection or acceptance. This AWS blog post has an example of using AWS Lambda to monitor status changes.

If your registration is rejected, the status will change to REQUIRES_UPDATES and should have at least one rejection reason that needs to be reviewed and updated before resubmitting. Follow these instructions to update a rejected registration.
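
As a rough sketch of checking for rejected registrations from code, the following filters a DescribeRegistrations response for the REQUIRES_UPDATES status. The helper names are hypothetical, the sketch assumes the response fields Registrations, RegistrationId, and RegistrationStatus, and it reads only the first page of results.

```python
def registrations_needing_updates(response: dict) -> list:
    """Return the IDs of registrations whose status is REQUIRES_UPDATES."""
    return [
        registration["RegistrationId"]
        for registration in response.get("Registrations", [])
        if registration.get("RegistrationStatus") == "REQUIRES_UPDATES"
    ]


def fetch_registrations() -> dict:
    """Fetch the first page of registrations (requires AWS credentials)."""
    import boto3  # imported lazily so the filter above works without boto3

    client = boto3.client("pinpoint-sms-voice-v2")
    return client.describe_registrations()
```

You could run registrations_needing_updates(fetch_registrations()) on a schedule and alert on a non-empty result.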

Sending SMS messages and monitoring delivery receipts

After your TFN is activated, you can begin sending SMS messages through AWS End User Messaging. It’s important to monitor your program closely and maintain compliance, because carriers might filter or block your messages if there are issues with your program. This blog post reviews best practices for how to monitor deliverability of SMS messages.

Conclusion

Make sure to follow each step carefully and answer each question completely. Human reviewers read these registrations, so it’s important that your answers are succinct and clear.

As an AWS customer, you have access to powerful messaging capabilities through AWS End User Messaging. By following the steps outlined in this guide, you can quickly register for a US toll-free number to start your SMS outreach. Maintaining compliance is key, and with a TFN in place, you’ll be well on your way to delivering highly effective, compliant SMS messaging that drives real business impact. If you have other questions about AWS End User Messaging, see the comprehensive API specs, the User Guide, or reach out to AWS Support.



Python 3.14 runtime now available in AWS Lambda

Post Syndicated from Leandro Cavalcante Damascena original https://aws.amazon.com/blogs/compute/python-3-14-runtime-now-available-in-aws-lambda/

AWS Lambda now supports Python 3.14 as both a managed runtime and container base image. Python is a popular language for building serverless applications. Developers can now take advantage of new features and enhancements when creating serverless applications on Lambda.

You can develop Lambda functions in Python 3.14 using the AWS Management Console, AWS Command Line Interface (AWS CLI), AWS SDK for Python (Boto3), AWS Serverless Application Model (AWS SAM), AWS Cloud Development Kit (AWS CDK), and other infrastructure as code tools.

The Python 3.14 runtime supports Powertools for AWS Lambda (Python), a developer toolkit that helps you to implement serverless best practices. Powertools includes observability, batch processing, AWS Systems Manager Parameter Store integration, idempotency, feature flags, Amazon CloudWatch metrics, structured logging, and more.

Lambda@Edge allows you to use Python 3.14 to customize low-latency content delivered through Amazon CloudFront.

This blog post highlights notable Python language updates, Python Lambda runtime features and support, and how you can use the new Python 3.14 runtime in your serverless applications.

New Python features

Python 3.14 contains the following notable updates.

Template string literals

Template strings introduce a new mechanism for custom string processing using the t prefix instead of f for f-strings. Unlike f-strings that return a simple string, t-strings return an object representing both static and interpolated parts.

Evaluation of type annotations

With the implementation of PEP 649, Python 3.14 defers type annotation evaluation until required. This reduces import time overhead and resolves forward reference issues.

Improved error messages

The interpreter now suggests the likely intended keyword when it detects a typo in a Python keyword. Other improved messages cover incorrect control flow structures, misused conditional expressions, string syntax errors, incompatible type usage in dicts/sets, and context manager protocol mismatches.

whille :

Traceback (most recent call last):
  File "<stdin>", line 1
    whille :
    ^^^^^^
SyntaxError: invalid syntax. Did you mean 'while'?

Standard library

The standard library includes a new compression.zstd module that provides native support for zstandard compression, offering better compression ratios and faster decompression compared to existing algorithms.

Python 3.14 also adds enhanced asyncio introspection capabilities.
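
The following sketch shows how a function might use the new compression.zstd module while still running on earlier Python versions. The zlib fallback and the helper names are assumptions added for illustration; on the Python 3.14 runtime, only the zstd branch would be used.

```python
import zlib

try:
    from compression import zstd  # standard library module, new in Python 3.14
except ImportError:
    zstd = None  # earlier Python versions: fall back to zlib


def compress_payload(data: bytes) -> tuple:
    """Return (codec name, compressed bytes) using zstd when available."""
    if zstd is not None:
        return "zstd", zstd.compress(data)
    return "zlib", zlib.compress(data)


def decompress_payload(codec: str, blob: bytes) -> bytes:
    """Reverse compress_payload for the given codec name."""
    if codec == "zstd":
        if zstd is None:
            raise RuntimeError("zstd payload received but compression.zstd is unavailable")
        return zstd.decompress(blob)
    return zlib.decompress(blob)
```

Returning the codec name alongside the payload lets the reader of the data pick the matching decompressor.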

Lambda runtime changes

The Lambda Python runtime contains the following changes.

Python 3.14 features that are not available

Python 3.14 includes some features that are not enabled for the Lambda managed runtime or base images. These features must be enabled when the Python runtime is compiled and cannot be enabled via an execution-time flag. The just-in-time (JIT) compiler is not available in the Lambda runtime because it’s still in an experimental phase. Free-threaded mode, running Python without the global interpreter lock, is supported in Python 3.14, but it is not enabled in the Lambda runtime due to potential performance impact. To use these features in Lambda, you can deploy your own Python runtime build with these features enabled, using a container image or custom runtime.
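
If you deploy your own Python build and want to confirm which of these features it actually has, a small introspection helper can report them. This is a minimal sketch; runtime_features is a hypothetical name, and it relies on sys._is_gil_enabled (Python 3.13 and later) and the sys._jit namespace (Python 3.14), falling back to conservative defaults when they are absent.

```python
import sys
import sysconfig


def runtime_features() -> dict:
    """Report free-threading and JIT status of the running interpreter."""
    gil_check = getattr(sys, "_is_gil_enabled", None)  # present on Python 3.13+
    jit = getattr(sys, "_jit", None)  # present on Python 3.14+
    return {
        "python": "{}.{}".format(*sys.version_info[:2]),
        "free_threaded_build": bool(sysconfig.get_config_var("Py_GIL_DISABLED")),
        "gil_enabled": bool(gil_check()) if callable(gil_check) else True,
        "jit_available": bool(jit.is_available()) if jit is not None else False,
        "jit_enabled": bool(jit.is_enabled()) if jit is not None else False,
    }
```

Logging this dictionary once during initialization is enough to verify what a custom runtime or container image provides.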

Amazon Linux 2023

As with the Python 3.12 and Python 3.13 runtimes, the Python 3.14 runtime is based on the provided.al2023 runtime, which is based on the Amazon Linux 2023 minimal container image. The Amazon Linux 2023 minimal image uses microdnf as a package manager, symlinked as dnf. This replaces the yum package manager used in Python 3.11 and earlier AL2-based images. If you deploy your Lambda functions as container images, you must update your Dockerfiles to use dnf instead of yum when upgrading to the Python 3.14 base image from Python 3.11 or earlier base images.

Learn more about the provided.al2023 runtime in the blog post Introducing the Amazon Linux 2023 runtime for AWS Lambda and the Amazon Linux 2023 launch blog post.

Using Python 3.14 in Lambda

You can use Python 3.14 for your Lambda functions in the AWS Management Console, an AWS Lambda container image, AWS SAM, or the AWS Cloud Development Kit (AWS CDK).

AWS Management Console

To use the Python 3.14 runtime to develop your Lambda functions, specify a runtime parameter value of Python 3.14 when creating or updating a function. On the Create Function page of the AWS Lambda console, Python 3.14 is available in the Runtime dropdown menu.

Create function page of the AWS Lambda console

To update an existing Lambda function to Python 3.14, navigate to the function in the Lambda console and choose Edit in the Runtime settings panel. The new version of Python is available in the Runtime dropdown menu.

The runtime dropdown menu

Upgrading a function to Python 3.14

To upgrade a function to Python 3.14, check your code and dependencies for compatibility with Python 3.14, run tests, and update as necessary. Consider using generative AI coding assistants like Amazon Q Developer, Amazon Q Developer for CLI, or Kiro to help with upgrades.
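
To find out which functions in an account are candidates for an upgrade, you can list the functions and filter on their runtime. The following is a rough sketch with hypothetical helper names; it assumes the list_functions paginator of the Lambda client and treats any Python runtime other than python3.14 as a candidate.

```python
def functions_to_upgrade(functions: list, target: str = "python3.14") -> list:
    """Return the names of functions on a Python runtime other than the target."""
    return sorted(
        fn["FunctionName"]
        for fn in functions
        # Container image functions have no Runtime key and are skipped.
        if fn.get("Runtime", "").startswith("python") and fn["Runtime"] != target
    )


def list_all_functions() -> list:
    """Collect every function configuration in the current Region."""
    import boto3  # imported lazily so the filter above works without boto3

    paginator = boto3.client("lambda").get_paginator("list_functions")
    return [fn for page in paginator.paginate() for fn in page["Functions"]]
```

Running functions_to_upgrade(list_all_functions()) gives a list to review and test before you change any runtime setting.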

AWS Lambda container image

Change the Python base image version by modifying the FROM statement in your Dockerfile:

FROM public.ecr.aws/lambda/python:3.14
# Copy function code
COPY lambda_handler.py ${LAMBDA_TASK_ROOT}

AWS Serverless Application Model (AWS SAM)

In AWS SAM, set the Runtime attribute to python3.14 to use this version.

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: Simple Lambda Function
Resources:
  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      Description: My Python Lambda Function
      CodeUri: my_function/
      Handler: lambda_function.lambda_handler
      Runtime: python3.14

AWS SAM supports generating this template with Python 3.14 for new serverless applications using the sam init command. Refer to the AWS SAM documentation.

AWS Cloud Development Kit

In the AWS CDK, set the runtime attribute to lambda.Runtime.PYTHON_3_14 to use this version.

In Python CDK:

from constructs import Construct
from aws_cdk import Stack, aws_lambda as _lambda


class SampleLambdaStack(Stack):
    def __init__(self, scope: Construct, id: str, **kwargs) -> None:
        super().__init__(scope, id, **kwargs)

        # The Python 3.14 enabled Lambda function
        base_lambda = _lambda.Function(
            self,
            'python314LambdaFunction',
            handler='lambda_handler.handler',
            runtime=_lambda.Runtime.PYTHON_3_14,
            code=_lambda.Code.from_asset('lambda'),
        )

In TypeScript CDK:

import * as cdk from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda'
import * as path from 'path';
import { Construct } from 'constructs';
export class SampleLambdaStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);
    // The code that defines your stack goes here
    // The python3.14 enabled Lambda Function
    const lambdaFunction = new lambda.Function(this, 'python314LambdaFunction', {
      runtime: lambda.Runtime.PYTHON_3_14,
      memorySize: 512,
      code: lambda.Code.fromAsset(path.join(__dirname, '/../lambda')),
      handler: 'lambda_handler.handler'
    })
  }
}

The Serverless Land Patterns AWS Top Picks for Python now use Python 3.14.

Performance considerations

At launch, new Lambda runtimes receive less usage than existing established runtimes. This can result in longer cold start times due to reduced cache residency within internal Lambda sub-systems. Cold start times typically improve in the weeks following launch as usage increases. As a result, AWS recommends not drawing conclusions from side-by-side performance comparisons with other Lambda runtimes until the performance has stabilized. Since performance is highly dependent on workload, customers with performance-sensitive workloads should conduct their own testing instead of relying on generic test benchmarks.

Conclusion

Lambda now supports Python 3.14 as a managed language runtime to help developers build more efficient, powerful, and scalable serverless applications. Python 3.14 language additions include data model improvements, typing changes, and updates to the standard library. The Lambda managed runtime does not include the option to disable the global interpreter lock (GIL) or use the experimental JIT compiler.

You can build and deploy Python 3.14 functions with the AWS Management Console, AWS CLI, AWS SDK, AWS SAM, AWS CDK, or your choice of infrastructure as code tool. You can also use the Python 3.14 container base image if you prefer to build and deploy your functions using container images.

Try the Python 3.14 runtime in Lambda today and experience the benefits of this updated language version.

To find more Python examples, use the Serverless Patterns Collection. For more serverless learning resources, visit Serverless Land.

How to automate Session Manager preferences across your organization

Post Syndicated from Nima Fotouhi original https://aws.amazon.com/blogs/security/how-to-automate-session-manager-preferences-across-your-organization/

AWS Systems Manager Session Manager is a fully managed service that provides secure, interactive, one-click access to your Amazon Elastic Compute Cloud (Amazon EC2) instances, edge devices, and virtual machines (VMs) through a browser-based shell or AWS Command Line Interface (AWS CLI), without requiring open inbound ports, bastion hosts, or SSH keys. Session Manager helps you maintain security compliance and controlled access while providing users with access to managed nodes. When starting a session, you must specify a preferences document (known as the Session Manager preferences document) to set the session parameters.

Managing these preferences consistently across multiple AWS Regions and accounts in a large organization can be challenging. Organizations often need to maintain standardized security settings, logging configurations, and session controls across their entire AWS footprint. Manual configuration of these preferences in each Region and account is not only time-consuming but also prone to human error and can lead to security gaps or compliance violations. Additionally, tracking and maintaining these configurations becomes increasingly complex as the organization scales.

You can use Session Manager to control various session options including data encryption for session data in transit and session logs at rest, session duration, and logging. For example, you can specify whether to store session log data in an Amazon Simple Storage Service (Amazon S3) bucket or Amazon CloudWatch Logs log group. In this post, I demonstrate how to manage Session Manager preferences across your organization using AWS CloudFormation StackSets. You can use CloudFormation StackSets to manage resources and configurations, such as Session Manager preferences, across different AWS accounts and Regions using standardized templates to maintain consistent security and compliance standards across your entire AWS infrastructure.

Prerequisites

You need to meet the following prerequisites to deploy the solution in this post:

  • Basic understanding of CloudFormation
  • Trusted access enabled between CloudFormation StackSets and AWS Organizations
  • Access to an AWS management account or StackSet delegated admin account
  • Appropriate AWS Identity and Access Management (IAM) permissions to create and manage StackSets

The Session Manager environment has some additional prerequisites:

  • For EC2 instances with internet access, allow HTTPS (port 443) outbound traffic to:
    • ec2messages.<region>.amazonaws.com
    • ssm.<region>.amazonaws.com
    • ssmmessages.<region>.amazonaws.com

    Note: <region> represents the actual Region where you are deploying your instances.

  • Additional endpoints required for specific features:
    • For CloudWatch Logs integration: logs.<region>.amazonaws.com
    • For Amazon S3 log storage: s3.<region>.amazonaws.com
    • For session data encryption: kms.<region>.amazonaws.com

    Note: For EC2 instances without internet access, you must configure virtual private cloud (VPC) endpoints to maintain connectivity with Systems Manager and related services.

  • SSM Agent requirements:
    • Minimum version 2.3.68.0 for basic session connectivity
    • Version 3.0.222.0 or later for port forwarding and SSH sessions

    Note: Many AWS-provided and trusted third-party Amazon Machine Images (AMIs) come with the SSM Agent pre-installed. For more information, see Find AMIs with the SSM Agent preinstalled.

For a complete list of requirements, see Setting up Session Manager.
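
The endpoint and SSM Agent version requirements above lend themselves to a quick scripted check. The following sketch is an illustration with hypothetical helper names: one function expands the endpoint list for a Region, and the other compares an agent version string against the minimums listed above.

```python
def required_endpoints(region: str, cloudwatch: bool = False,
                       s3: bool = False, kms: bool = False) -> list:
    """List the HTTPS endpoints an instance must reach for Session Manager."""
    services = ["ec2messages", "ssm", "ssmmessages"]
    if cloudwatch:
        services.append("logs")
    if s3:
        services.append("s3")
    if kms:
        services.append("kms")
    return [f"{service}.{region}.amazonaws.com" for service in services]


def parse_version(version: str) -> tuple:
    """Turn a dotted version such as 3.0.222.0 into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def agent_supports(version: str, feature: str = "basic") -> bool:
    """Check an SSM Agent version against the minimums listed above."""
    minimums = {"basic": "2.3.68.0", "port_forwarding_ssh": "3.0.222.0"}
    return parse_version(version) >= parse_version(minimums[feature])
```

Comparing versions as integer tuples avoids the ordering mistakes of plain string comparison, where 2.3.100.0 would sort before 2.3.68.0.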

Solution overview

This solution, shown in Figure 1, automatically configures the SSM-SessionManagerRunShell document with customizable preferences that govern how Session Manager behaves across your AWS accounts. It creates resources for logging, encryption, and session controls, and updates the SSM-SessionManagerRunShell document with these preferences. The document is updated by an AWS Lambda function that helps make sure that the preferences are correctly applied. It transforms the default Session Manager preferences document to meet your enterprise compliance requirements. Changes are deployed using the CloudFormation template provided in the GitHub repository. The solution supports multiple logging destinations, encryption options, and session controls to meet various security and compliance requirements.

Figure 1: Solution overview

Walkthrough

To deploy the solution, complete the following steps.

Step 1: Download or clone the repository

The first step is to download or clone the GitHub repository.

To download the repository:

  1. Go to the main page of the repository on GitHub.
  2. Choose Code and then choose Download ZIP.

To clone the repository:

  1. Make sure that you have Git installed.
  2. Run the following command in your terminal:
    git clone https://github.com/aws-samples/<repo-link>

Step 2: Create the CloudFormation StackSet

In this step, you deploy the solution’s resources by creating a CloudFormation StackSet using the provided CloudFormation template. Sign in to your management account or StackSet delegated admin account. To create the stack, follow the steps in Get started with StackSets using a sample template. Create the StackSet in each of the accounts and Regions where you plan to implement the solution. Note that you need to provide values for the parameters defined in the template to deploy the stack. The following table lists the parameters that you need to provide.

Parameter | Description
--- | ---
S3Logging | Enables storing session logs in an S3 bucket.
S3BucketName | Name of the S3 bucket for session logs. The bucket must exist or the deployment will fail.
S3KeyPrefix | Key prefix for session logs. The account ID and Region are appended to the prefix.
S3EncryptionEnabled | If set to true, the S3 bucket you specified in the S3BucketName parameter must be encrypted.
CreateCWLogGroup | If set to true, a CloudWatch log group is created; otherwise, the existing log group named in CWLogGroupName is used.
CWLogGroupName | The name of the CloudWatch log group you want to send session logs to.
CWEncryptionEnabled | If set to true, the CloudWatch log group you specified in the CWLogGroupName parameter must be encrypted.
CWStreamingEnabled | If set to true, a continual stream of session data logs is sent to the log group.
SessionDataEncryption | If set to true, session data is encrypted with a key created by the stack.
RunAsEnabled | If set to true, sessions run as a user other than ssm-user. The Run As feature is only supported for connecting to Linux and macOS managed nodes.
RunAsDefaultUser | The name of the user account to start sessions with on Linux and macOS managed nodes when the RunAsEnabled parameter is set to true.
IdleSessionTimeout | The amount of inactivity you want to allow before a session ends, in minutes.
MaxSessionDuration | The maximum amount of time you want to allow before a session ends, in minutes.
WinShellProfile | The shell preferences, environment variables, working directories, and commands you specify for sessions on Windows Server managed nodes.
LinuxShellProfile | The shell preferences, environment variables, working directories, and commands you specify for sessions on Linux and macOS managed nodes.

Step 3: Update your EC2 instance profiles with proper permissions

Depending on the parameter values you pass when deploying the template, you need to update your EC2 instance profiles with proper permissions. For example, if you have enabled session data and session log encryption, you need to add the following policy to your instance profiles.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "logs:DescribeLogGroups"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "logs:DescribeLogStreams"
            ],
            "Resource": "arn:aws:logs:*:123456789012:log-group:ssm-sessionmanager-logs",
            "Effect": "Allow"
        },
        {
            "Action": [
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:*:123456789012:log-group:ssm-sessionmanager-logs:log-stream:*",
            "Effect": "Allow"
        },
        {
            "Condition": {
                "Null": {
                    "kms:ResourceAliases": "false"
                },
                "ForAllValues:StringLike": {
                    "kms:ResourceAliases": [
                        "alias/session-manager/data"
                    ]
                }
            },
            "Action": [
                "kms:Decrypt"
            ],
            "Resource": "arn:aws:kms:us-east-1:123456789012:key/*",
            "Effect": "Allow"
        }
    ]
}

Note: If you enable S3 logging, you need to add the required permissions for that as well. See the Configure a central S3 bucket for Session Manager logging article on AWS re:Post for more information about how to configure your S3 bucket and EC2 instance profile for centralized logging. Same-account logging follows a similar pattern.
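
Because the policy embeds an account ID, a log group name, and a Region, it can help to generate it rather than edit the JSON by hand. The following is a minimal sketch of that idea; the function name `instance_profile_policy` and its defaults are assumptions for illustration, and the statements simply mirror the policy shown above.

```python
import json


def instance_profile_policy(account_id, log_group="ssm-sessionmanager-logs",
                            kms_region="us-east-1",
                            key_alias="alias/session-manager/data"):
    """Build the instance profile policy for one account (hypothetical helper)."""
    log_group_arn = f"arn:aws:logs:*:{account_id}:log-group:{log_group}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": ["logs:DescribeLogGroups"], "Resource": "*"},
            {"Effect": "Allow", "Action": ["logs:DescribeLogStreams"],
             "Resource": log_group_arn},
            {
                "Effect": "Allow",
                "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
                "Resource": f"{log_group_arn}:log-stream:*",
            },
            {
                "Effect": "Allow",
                "Action": ["kms:Decrypt"],
                "Resource": f"arn:aws:kms:{kms_region}:{account_id}:key/*",
                # Only keys that carry the session data alias may be used.
                "Condition": {
                    "Null": {"kms:ResourceAliases": "false"},
                    "ForAllValues:StringLike": {"kms:ResourceAliases": [key_alias]},
                },
            },
        ],
    }


print(json.dumps(instance_profile_policy("123456789012"), indent=4))
```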

Step 4: Verify the solution implementation

You can verify that the Session Manager preferences are correctly configured across your environment. Here’s a systematic approach to validation:

Verify preference configuration

Through the AWS Management Console, navigate to AWS Systems Manager Session Manager, choose Preferences and review the configured Session Manager preferences. Alternatively, verify the configuration through AWS CLI using:

aws ssm get-document --name "SSM-SessionManagerRunShell" --document-version \$LATEST
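
If you save the command output as JSON, you can compare it with the values you expect instead of reading it by eye. The sketch below assumes the response carries the document body as a JSON string in its Content field with the preferences under an inputs key; the helper name `preference_mismatches` is hypothetical, and the sample response is hand-written for illustration rather than captured from a real account.

```python
import json


def preference_mismatches(get_document_response, expected_inputs):
    """Return {key: (expected, actual)} for every input that differs."""
    # The document body is assumed to be a JSON string in the "Content" field.
    content = json.loads(get_document_response["Content"])
    actual_inputs = content.get("inputs", {})
    return {
        key: (expected, actual_inputs.get(key))
        for key, expected in expected_inputs.items()
        if actual_inputs.get(key) != expected
    }


# Hand-written sample response for illustration, not real command output.
sample_response = {
    "Name": "SSM-SessionManagerRunShell",
    "Content": json.dumps({
        "schemaVersion": "1.0",
        "sessionType": "Standard_Stream",
        "inputs": {
            "cloudWatchLogGroupName": "ssm-sessionmanager-logs",
            "cloudWatchStreamingEnabled": True,
            "idleSessionTimeout": "20",
        },
    }),
}

print(preference_mismatches(sample_response, {
    "cloudWatchLogGroupName": "ssm-sessionmanager-logs",
    "idleSessionTimeout": "15",
}))
```

An empty result means every expected preference matched; any other result names the keys to investigate.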

Validate session functionality

Start a new session following the AWS Systems Manager User Guide and perform the following validations:

  1. Verify the encryption configuration by starting a new session. If session data encryption is enabled, you should see the message This session is encrypted using AWS KMS when the session begins.
  2. For CloudWatch logging verification, navigate to the CloudWatch console and access the Log groups section. Confirm that your specified log group exists and KMS encryption is enabled if you configured it during deployment. Execute some commands in your session and observe the real-time log streaming to your configured log group.
  3. To verify S3 logging, establish a session and execute several commands. Terminate the session and check your configured S3 bucket for the session logs. Remember that S3 logs are only generated after the session is terminated.
  4. If you enabled the RunAsEnabled option, verify the configuration by executing the whoami command in your session. The output should match your configured RunAs user.

Resources

The following is a list of resources created by this solution:

AWS::Lambda::Function (UpdateSessionManagerFunction)
This resource creates a Lambda function that:

  • Updates the SSM-SessionManagerRunShell document with the specified preferences
  • Handles CloudFormation create, update, and delete events
  • Performs deep comparison of document contents to avoid unnecessary updates
  • Includes error handling and retry logic
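
The deep comparison mentioned in the list above can be illustrated with a short sketch. This is not the code from the repository; it only shows one plausible way to decide whether an update is needed, with the hypothetical function name `needs_update`.

```python
import json


def needs_update(current_content, desired_preferences):
    """Return True when the stored document differs from the desired one."""
    try:
        current = json.loads(current_content)
    except (TypeError, ValueError):
        # Unreadable or missing content is treated as "must update".
        return True
    # Dict equality in Python is recursive and ignores key order,
    # so formatting-only differences do not trigger an update.
    return current != desired_preferences


desired = {"inputs": {"runAsEnabled": True, "shellProfile": {"linux": "bash"}}}
stored = '{"inputs": {"shellProfile": {"linux": "bash"}, "runAsEnabled": true}}'
print(needs_update(stored, desired))
```

Comparing parsed structures rather than raw strings means whitespace or key ordering changes do not create new document versions.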

AWS::IAM::Role (LambdaExecutionRole)
This resource creates an IAM role that allows the Lambda function to:

  • Execute with basic Lambda permissions
  • Access and modify the SSM-SessionManagerRunShell document
  • Access SSM parameters storing session data encryption key ID

AWS::KMS::Key (SessionDataKMSKey)
This conditional resource creates a KMS key for encrypting session data when the SessionDataEncryption parameter is set to enabled. The key has a policy allowing key management with IAM.

AWS::KMS::Alias (SessionDataKeyAlias)
This conditional resource creates a friendly alias (alias/session-manager/data) for the session data encryption key. This value cannot be changed.

AWS::SSM::Parameter (SessionKeyID)
This conditional resource creates a Systems Manager parameter to store the KMS key ID for session data encryption, making it accessible to other components.

Note: The session data KMS key ID is stored in a Systems Manager parameter to decouple components and help prevent circular dependencies and failures due to race conditions.

AWS::KMS::Key (SessionLogsKMSKey)
This conditional resource creates a KMS key for encrypting CloudWatch logs when the CWEncryptionEnabled parameter is set to enabled. The key has a policy allowing the CloudWatch Logs service to use it.

Note: SessionLogsKMSKey is used to encrypt logs at rest and is not used by the SSM Agent, so your instance profile does not need permission to use this key. Logs are encrypted in transit and are encrypted by the CloudWatch service after they are received.

AWS::KMS::Alias (SessionLogsKeyAlias)
This conditional resource creates a friendly alias (alias/session-manager/logs) for the CloudWatch Logs encryption key.

AWS::Logs::LogGroup (SessionManagerLogGroup)
This conditional resource creates a CloudWatch Logs log group for session logs when the CreateCWLogGroup parameter is set to enabled. The log group:

  • Uses the specified name (controlled by the CWLogGroupName parameter, and defaults to ssm-sessionmanager-logs)
  • Sets a 90-day retention period
  • Uses KMS encryption if enabled

Custom::UpdateSessionManager (UpdateSessionManagerCustomResource)
This custom resource invokes the Lambda function to update the SSM-SessionManagerRunShell document with the specified preferences.

Parameter groups

The following template parameters are available for customizing Session Manager behavior:

Parameter group | Parameters | Description
--- | --- | ---
S3 logging | S3Logging, S3BucketName, S3KeyPrefix, S3EncryptionEnabled | Controls logging to Amazon S3
CloudWatch logging | CreateCWLogGroup, CWLogGroupName, CWEncryptionEnabled, CWStreamingEnabled | Controls logging to CloudWatch Logs
Encryption | SessionDataEncryption | Controls encryption of session data
Session controls | RunAsEnabled, RunAsDefaultUser, IdleSessionTimeout, MaxSessionDuration | Controls session behavior
Shell profiles | WinShellProfile, LinuxShellProfile | Controls shell environment

Conclusion

In this post, we explored how to implement and manage Session Manager preferences across your organization using CloudFormation StackSets. This solution enables centralized management of Session Manager configurations across multiple accounts and Regions from a single account, significantly simplifying the administration of remote access to your compute resources. Through automated deployment of security controls including session encryption, logging, and access restrictions, the solution helps facilitate consistent compliance with organizational security requirements while reducing manual configuration efforts and the risk of human error. As your organization grows, this solution scales seamlessly to accommodate new accounts and Regions while maintaining uniform security standards across your infrastructure.

Remember to regularly review and update your Session Manager preferences to align with evolving security requirements and organizational needs. For more information about AWS Systems Manager Session Manager, visit the official AWS documentation.

If you have feedback about this post, submit comments in the Comments section below.

Nima Fotouhi

Nima is a Security Consultant at AWS. He’s a builder with a passion for infrastructure as code (IaC) and policy as code (PaC) and helps customers build secure infrastructure on AWS. In his spare time, he loves to hit the slopes and go snowboarding.

Introducing Amazon MWAA Serverless

Post Syndicated from John Jackson original https://aws.amazon.com/blogs/big-data/introducing-amazon-mwaa-serverless/

Today, AWS announced Amazon Managed Workflows for Apache Airflow (MWAA) Serverless. This is a new deployment option for MWAA that eliminates the operational overhead of managing Apache Airflow environments while optimizing costs through serverless scaling. This new offering addresses key challenges that data engineers and DevOps teams face when orchestrating workflows: operational scalability, cost optimization, and access management.

With MWAA Serverless, you can focus on your workflow logic rather than monitoring provisioned capacity. You can now submit your Airflow workflows for execution on a schedule or on demand, paying only for the actual compute time used during each task’s execution. The service automatically handles all infrastructure scaling so that your workflows run efficiently regardless of load.

Beyond simplified operations, MWAA Serverless introduces an updated security model for granular control through AWS Identity and Access Management (IAM). Each workflow can now have its own IAM permissions, running on a VPC of your choosing so you can implement precise security controls without creating separate Airflow environments. This approach significantly reduces security management overhead while strengthening your security posture.

In this post, we demonstrate how to use MWAA Serverless to build and deploy scalable workflow automation solutions. We walk through practical examples of creating and deploying workflows, setting up observability through Amazon CloudWatch, and converting existing Apache Airflow DAGs (Directed Acyclic Graphs) to the serverless format. We also explore best practices for managing serverless workflows and show you how to implement monitoring and logging.

How does MWAA Serverless work?

MWAA Serverless processes your workflow definitions and executes them efficiently in service-managed Airflow environments, automatically scaling resources based on workflow demands. MWAA Serverless uses the Amazon Elastic Container Service (Amazon ECS) executor to run each individual task on its own ECS Fargate container, on either your VPC or a service-managed VPC. Those containers then communicate back to their assigned Airflow cluster using the Airflow 3 Task API.


Figure 1: Amazon MWAA Architecture

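MWAA Serverless uses declarative YAML configuration files based on the popular open source DAG Factory format to enhance security through task isolation. You have two options for creating these workflow definitions: write the YAML directly, or convert existing Python DAGs with the conversion tool described later in this post.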

This declarative approach provides two key benefits. First, because MWAA Serverless reads workflow definitions from YAML, it can determine task scheduling without running any workflow code. Second, this allows MWAA Serverless to grant execution permissions only when tasks run, rather than requiring broad permissions at the workflow level. The result is a more secure environment where task permissions are precisely scoped and time-limited.
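
To make the first benefit concrete, the sketch below derives a valid task order purely from declared dependencies, without importing or executing any task code. It illustrates the general idea only and is not how the service is implemented; the dictionary mirrors the shape of a DAG Factory style definition, and the function name `execution_order` is hypothetical.

```python
from graphlib import TopologicalSorter

# Dict mirroring the structure of a DAG Factory style YAML file (illustrative).
workflow = {
    "tasks": {
        "list_objects": {"dependencies": []},
        "create_object_list": {"dependencies": ["list_objects"]},
    }
}


def execution_order(definition):
    """Derive a valid task order from declared dependencies alone."""
    tasks = definition["tasks"]
    graph = {}
    for name, spec in tasks.items():
        deps = spec.get("dependencies", [])
        unknown = [d for d in deps if d not in tasks]
        if unknown:
            raise ValueError(f"{name} depends on undefined task(s): {unknown}")
        graph[name] = set(deps)
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    return list(TopologicalSorter(graph).static_order())


print(execution_order(workflow))
```

Because the dependency graph is plain data, problems such as cycles or references to undefined tasks can be reported before anything runs.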

Service considerations for MWAA Serverless

MWAA Serverless has the following limitations that you should consider when deciding between serverless and provisioned MWAA deployments:

  • Operator support
    • MWAA Serverless only supports operators from the Amazon Provider Package.
    • To execute custom code or scripts, you’ll need to use AWS services, such as:
  • User interface
    • MWAA Serverless operates without using the Airflow web interface.
    • For workflow monitoring and management, we provide integration with Amazon CloudWatch and AWS CloudTrail.

Working with MWAA Serverless

Complete the following prerequisites and steps to use MWAA Serverless.

Prerequisites

Before you begin, verify you have the following requirements in place:

  • Access and permissions
    • An AWS account
    • AWS Command Line Interface (AWS CLI) version 2.31.38 or later installed and configured
    • The appropriate permissions to create and modify IAM roles and policies, including the following required IAM permissions:
      • airflow-serverless:CreateWorkflow
      • airflow-serverless:DeleteWorkflow
      • airflow-serverless:GetTaskInstance
      • airflow-serverless:GetWorkflowRun
      • airflow-serverless:ListTaskInstances
      • airflow-serverless:ListWorkflowRuns
      • airflow-serverless:ListWorkflows
      • airflow-serverless:StartWorkflowRun
      • airflow-serverless:UpdateWorkflow
      • iam:CreateRole
      • iam:DeleteRole
      • iam:DeleteRolePolicy
      • iam:GetRole
      • iam:PutRolePolicy
      • iam:UpdateAssumeRolePolicy
      • logs:CreateLogGroup
      • logs:CreateLogStream
      • logs:PutLogEvents
      • airflow:GetEnvironment
      • airflow:ListEnvironments
      • s3:DeleteObject
      • s3:GetObject
      • s3:ListBucket
      • s3:PutObject
    • Access to an Amazon Virtual Private Cloud (VPC) with internet connectivity
  • Required AWS services – In addition to MWAA Serverless you will need access to the following AWS services:
    • Amazon MWAA to access your existing Airflow environment(s)
    • Amazon CloudWatch to view logs
    • Amazon S3 for DAG and YAML file management
    • AWS IAM to control permissions
  • Development environment
  • Additional requirements
    • Basic familiarity with Apache Airflow concepts
    • Understanding of YAML syntax
    • Knowledge of AWS CLI commands

Note: Throughout this post, we use example values that you’ll need to replace with your own:

  • Replace amzn-s3-demo-bucket with your S3 bucket name
  • Replace 111122223333 with your AWS account number
  • Replace us-east-2 with your AWS Region. MWAA Serverless is available in multiple AWS Regions. Check the List of AWS Services Available by Region for current availability.

Creating your first serverless workflow

Let’s start by defining a simple workflow that gets a list of S3 objects and writes that list to a file in the same bucket. Create a new file called simple_s3_test.yaml with the following content:

simples3test:
  dag_id: simples3test
  schedule: 0 0 * * *
  tasks:
    list_objects:
      operator: airflow.providers.amazon.aws.operators.s3.S3ListOperator
      bucket: 'amzn-s3-demo-bucket'
      prefix: ''
      retries: 0
    create_object_list:
      operator: airflow.providers.amazon.aws.operators.s3.S3CreateObjectOperator
      data: '{{ ti.xcom_pull(task_ids="list_objects", key="return_value") }}'
      s3_bucket: 'amzn-s3-demo-bucket'
      s3_key: 'filelist.txt'
      dependencies: [list_objects]

For this workflow to run, you must create an Execution role that has permissions to list and write to the above bucket. The role also needs to be assumable from MWAA Serverless. The following CLI commands create this role and its associated policy:

aws iam create-role \
--role-name mwaa-serverless-access-role \
--assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {
          "Service": [
            "airflow-serverless.amazonaws.com"
          ]
        },
        "Action": "sts:AssumeRole"
      },
      {
        "Sid": "AllowAirflowServerlessAssumeRole",
        "Effect": "Allow",
        "Principal": {
          "Service": "airflow-serverless.amazonaws.com"
        },
        "Action": "sts:AssumeRole",
        "Condition": {
          "StringEquals": {
            "aws:SourceAccount": "${aws:PrincipalAccount}"
          },
          "ArnLike": {
            "aws:SourceArn": "arn:aws:*:*:${aws:PrincipalAccount}:workflow/*"
          }
        }
      }
    ]
  }'

aws iam put-role-policy \
  --role-name mwaa-serverless-access-role \
  --policy-name mwaa-serverless-policy   \
  --policy-document '{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "CloudWatchLogsAccess",
			"Effect": "Allow",
			"Action": [
				"logs:CreateLogGroup",
				"logs:CreateLogStream",
				"logs:PutLogEvents"
			],
			"Resource": "*"
		},
		{
			"Sid": "S3DataAccess",
			"Effect": "Allow",
			"Action": [
				"s3:ListBucket",
				"s3:GetObject",
				"s3:PutObject"
			],
			"Resource": [
				"arn:aws:s3:::amzn-s3-demo-bucket",
				"arn:aws:s3:::amzn-s3-demo-bucket/*"
			]
		}
	]
}'

Next, copy your YAML DAG to the same S3 bucket and create your workflow, passing the ARN of the role returned by the create-role command above.

aws s3 cp "simple_s3_test.yaml" \
s3://amzn-s3-demo-bucket/yaml/simple_s3_test.yaml

aws mwaa-serverless create-workflow \
--name simple_s3_test \
--definition-s3-location '{ "Bucket": "amzn-s3-demo-bucket", "ObjectKey": "yaml/simple_s3_test.yaml" }' \
--role-arn arn:aws:iam::111122223333:role/mwaa-serverless-access-role \
--region us-east-2

The output of the last command returns a WorkflowARN value, which you then use to run the workflow:

aws mwaa-serverless start-workflow-run \
--workflow-arn arn:aws:airflow-serverless:us-east-2:111122223333:workflow/simple_s3_test-abc1234def \
--region us-east-2

The output returns a RunId value, which you then use to check the status of the workflow run that you just executed.

aws mwaa-serverless get-workflow-run \
--workflow-arn arn:aws:airflow-serverless:us-east-2:111122223333:workflow/simple_s3_test-abc1234def \
--run-id ABC123456789def \
--region us-east-2

If you need to make a change to your YAML, you can copy back to S3 and run the update-workflow command.

aws s3 cp "simple_s3_test.yaml" \
s3://amzn-s3-demo-bucket/yaml/simple_s3_test.yaml

aws mwaa-serverless update-workflow \
--workflow-arn arn:aws:airflow-serverless:us-east-2:111122223333:workflow/simple_s3_test-abc1234def \
--definition-s3-location '{ "Bucket": "amzn-s3-demo-bucket", "ObjectKey": "yaml/simple_s3_test.yaml" }' \
--role-arn arn:aws:iam::111122223333:role/mwaa-serverless-access-role \
--region us-east-2

Converting Python DAGs to YAML format

AWS has published a conversion tool that uses the open-source Airflow DAG processor to serialize Python DAGs into YAML DAG factory format. To install, you run the following:

pip3 install python-to-yaml-dag-converter-mwaa-serverless
dag-converter convert source_dag.py --output output_yaml_folder

For example, create the following DAG and name it create_s3_objects.py:

from datetime import datetime
from airflow import DAG
from airflow.models.param import Param
from airflow.providers.amazon.aws.operators.s3 import S3CreateObjectOperator

default_args = {
    'start_date': datetime(2024, 1, 1),
    'retries': 0,
}

dag = DAG(
    'create_s3_objects',
    default_args=default_args,
    description='Create multiple S3 objects in a loop',
    schedule=None
)

# Set number of files to create
LOOP_COUNT = 3
s3_bucket = 'md-workflows-mwaa-bucket'
s3_prefix = 'test-files'

# Create multiple S3 objects using loop
last_task=None
for i in range(1, LOOP_COUNT + 1):  
    create_object = S3CreateObjectOperator(
        task_id=f'create_object_{i}',
        s3_bucket=s3_bucket,
        s3_key=f'{s3_prefix}/{i}.txt',
        data='{{ ds_nodash }}-{{ ts_nodash | lower }}',
        replace=True,
        dag=dag
    )
    if last_task:
        last_task >> create_object
    last_task = create_object

Once you have installed python-to-yaml-dag-converter-mwaa-serverless, you run:

dag-converter convert "/path_to/create_s3_objects.py" --output "/path_to/yaml/"

The output ends with:

YAML validation successful, no errors found

YAML written to /path_to/yaml/create_s3_objects.yaml

The resulting YAML will look like:

create_s3_objects:
  dag_id: create_s3_objects
  params: {}
  default_args:
    start_date: '2024-01-01'
    retries: 0
  schedule: None
  tasks:
    create_object_1:
      operator: airflow.providers.amazon.aws.operators.s3.S3CreateObjectOperator
      aws_conn_id: aws_default
      data: '{{ ds_nodash }}-{{ ts_nodash | lower }}'
      encrypt: false
      outlets: []
      params: {}
      priority_weight: 1
      replace: true
      retries: 0
      retry_delay: 300.0
      retry_exponential_backoff: false
      s3_bucket: md-workflows-mwaa-bucket
      s3_key: test-files/1.txt
      task_id: create_object_1
      trigger_rule: all_success
      wait_for_downstream: false
      dependencies: []
    create_object_2:
      operator: airflow.providers.amazon.aws.operators.s3.S3CreateObjectOperator
      aws_conn_id: aws_default
      data: '{{ ds_nodash }}-{{ ts_nodash | lower }}'
      encrypt: false
      outlets: []
      params: {}
      priority_weight: 1
      replace: true
      retries: 0
      retry_delay: 300.0
      retry_exponential_backoff: false
      s3_bucket: md-workflows-mwaa-bucket
      s3_key: test-files/2.txt
      task_id: create_object_2
      trigger_rule: all_success
      wait_for_downstream: false
      dependencies: [create_object_1]
    create_object_3:
      operator: airflow.providers.amazon.aws.operators.s3.S3CreateObjectOperator
      aws_conn_id: aws_default
      data: '{{ ds_nodash }}-{{ ts_nodash | lower }}'
      encrypt: false
      outlets: []
      params: {}
      priority_weight: 1
      replace: true
      retries: 0
      retry_delay: 300.0
      retry_exponential_backoff: false
      s3_bucket: md-workflows-mwaa-bucket
      s3_key: test-files/3.txt
      task_id: create_object_3
      trigger_rule: all_success
      wait_for_downstream: false
      dependencies: [create_object_2]
  catchup: false
  description: Create multiple S3 objects in a loop
  max_active_runs: 16
  max_active_tasks: 16
  max_consecutive_failed_dag_runs: 0

Note that, because the YAML conversion is done after the DAG parsing, the loop that creates the tasks is run first and the resulting static list of tasks is written to the YAML document with their dependencies.
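
The effect of that parse-time expansion can be sketched in a few lines. This is an illustration of the principle, not the converter's implementation; the function name `unroll_tasks` is hypothetical and only the fields relevant to ordering are reproduced.

```python
def unroll_tasks(loop_count, s3_prefix="test-files"):
    """Mimic parse-time loop expansion into a static, ordered task mapping."""
    tasks = {}
    previous = None
    for i in range(1, loop_count + 1):
        task_id = f"create_object_{i}"
        tasks[task_id] = {
            "s3_key": f"{s3_prefix}/{i}.txt",
            # Each task depends only on the one created just before it.
            "dependencies": [previous] if previous else [],
        }
        previous = task_id
    return tasks


for task_id, spec in unroll_tasks(3).items():
    print(task_id, spec["dependencies"])
```

One consequence is that changing LOOP_COUNT in the Python source has no effect on an already converted workflow until you convert and update it again.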

Migrating an MWAA environment’s DAGs to MWAA Serverless

You can take advantage of a provisioned MWAA environment to develop and test your workflows and then move them to serverless to run efficiently at scale. Further, if your MWAA environment uses only operators that are compatible with MWAA Serverless, you can convert all of the environment’s DAGs at once. The first step is to allow MWAA Serverless to assume the MWAA Execution role through a trust relationship. This is a one-time operation for each MWAA Execution role, and can be performed manually in the IAM console or using AWS CLI commands as follows:

MWAA_ENVIRONMENT_NAME="MyAirflowEnvironment"
MWAA_REGION=us-east-2

MWAA_EXECUTION_ROLE_ARN=$(aws mwaa get-environment --region $MWAA_REGION --name $MWAA_ENVIRONMENT_NAME --query 'Environment.ExecutionRoleArn' --output text )
MWAA_EXECUTION_ROLE_NAME=$(echo $MWAA_EXECUTION_ROLE_ARN | xargs basename) 
MWAA_EXECUTION_ROLE_POLICY=$(aws iam get-role --role-name $MWAA_EXECUTION_ROLE_NAME --query 'Role.AssumeRolePolicyDocument' --output json | jq '.Statement[0].Principal.Service += ["airflow-serverless.amazonaws.com"] | .Statement[0].Principal.Service |= unique | .Statement += [{"Sid": "AllowAirflowServerlessAssumeRole", "Effect": "Allow", "Principal": {"Service": "airflow-serverless.amazonaws.com"}, "Action": "sts:AssumeRole", "Condition": {"StringEquals": {"aws:SourceAccount": "${aws:PrincipalAccount}"}, "ArnLike": {"aws:SourceArn": "arn:aws:*:*:${aws:PrincipalAccount}:workflow/*"}}}]')

aws iam update-assume-role-policy --role-name $MWAA_EXECUTION_ROLE_NAME --policy-document "$MWAA_EXECUTION_ROLE_POLICY"
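
The jq expression above is dense, so here is a Python sketch of the same transformation for readers who want to follow it step by step. The function name `add_serverless_trust` is hypothetical. It goes slightly beyond the jq command in two ways that are assumptions on our part: it tolerates a single-string Service value, and it skips the append when the statement is already present so that it can be run more than once.

```python
import copy

SERVERLESS_PRINCIPAL = "airflow-serverless.amazonaws.com"
STATEMENT_SID = "AllowAirflowServerlessAssumeRole"


def add_serverless_trust(trust_policy):
    """Return a copy of the trust policy that MWAA Serverless can assume."""
    policy = copy.deepcopy(trust_policy)
    principal = policy["Statement"][0]["Principal"]
    services = principal.get("Service", [])
    if isinstance(services, str):  # a single principal may be a plain string
        services = [services]
    # Sort and de-duplicate, as `unique` does in the jq expression.
    principal["Service"] = sorted(set(services + [SERVERLESS_PRINCIPAL]))

    already_added = any(s.get("Sid") == STATEMENT_SID for s in policy["Statement"])
    if not already_added:
        policy["Statement"].append({
            "Sid": STATEMENT_SID,
            "Effect": "Allow",
            "Principal": {"Service": SERVERLESS_PRINCIPAL},
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {"aws:SourceAccount": "${aws:PrincipalAccount}"},
                "ArnLike": {
                    "aws:SourceArn": "arn:aws:*:*:${aws:PrincipalAccount}:workflow/*"
                },
            },
        })
    return policy
```

The input is left untouched, which makes it easy to print a before and after comparison prior to applying the change.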

Now we can loop through each successfully converted DAG and create a serverless workflow for it. The loop below assumes that the converted YAML files are in /tmp/yaml.

S3_BUCKET=$(aws mwaa get-environment --name $MWAA_ENVIRONMENT_NAME --query 'Environment.SourceBucketArn' --output text --region us-east-2 | cut -d':' -f6)

for file in /tmp/yaml/*.yaml; do MWAA_WORKFLOW_NAME=$(basename "$file" .yaml); \
      aws s3 cp "$file" s3://$S3_BUCKET/yaml/$MWAA_WORKFLOW_NAME.yaml --region us-east-2; \
      aws mwaa-serverless create-workflow --name $MWAA_WORKFLOW_NAME \
      --definition-s3-location "{\"Bucket\": \"$S3_BUCKET\", \"ObjectKey\": \"yaml/$MWAA_WORKFLOW_NAME.yaml\"}" --role-arn $MWAA_EXECUTION_ROLE_ARN  \
      --region us-east-2  
      done
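
The loop above builds the --definition-s3-location value with hand-escaped quotes. As a hedged alternative, the sketch below lets a JSON encoder produce that argument. The helper name `create_workflow_command` is hypothetical; the CLI flags are the ones used in this post, and nothing is executed unless you pass the result to subprocess yourself.

```python
import json
import shlex
from pathlib import Path


def create_workflow_command(yaml_path, bucket, role_arn, region):
    """Build the argument list for one converted YAML file (hypothetical helper)."""
    name = Path(yaml_path).stem
    location = json.dumps({"Bucket": bucket, "ObjectKey": f"yaml/{name}.yaml"})
    return [
        "aws", "mwaa-serverless", "create-workflow",
        "--name", name,
        "--definition-s3-location", location,
        "--role-arn", role_arn,
        "--region", region,
    ]


cmd = create_workflow_command(
    "/tmp/yaml/create_s3_objects.yaml",
    "amzn-s3-demo-bucket",
    "arn:aws:iam::111122223333:role/mwaa-serverless-access-role",
    "us-east-2",
)
# Print a shell-safe rendering; use subprocess.run(cmd, check=True) to execute it.
print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting issues when bucket or workflow names contain unusual characters.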

To see a list of your created workflows, run:

aws mwaa-serverless list-workflows --region us-east-2

Monitoring and observability

MWAA Serverless workflow execution status is returned by the GetWorkflowRun operation (the get-workflow-run CLI command shown earlier), which returns details for that particular run. If there are errors in the workflow definition, they are returned under RunDetail in the ErrorMessage field, as in the following example:

{
  "WorkflowVersion": "7bcd36ce4d42f5cf23bfee67a0f816c6",
  "RunId": "d58cxqdClpTVjeN",
  "RunType": "SCHEDULE",
  "RunDetail": {
    "ModifiedAt": "2025-11-03T08:02:47.625851+00:00",
    "ErrorMessage": "expected token ',', got 'create_test_table'",
    "TaskInstances": [],
    "RunState": "FAILED"
  }
}

Workflows that are properly defined, but whose tasks fail, will return "ErrorMessage": "Workflow execution failed":

{
  "WorkflowVersion": "0ad517eb5e33deca45a2514c0569079d",
  "RunId": "ABC123456789def",
  "RunType": "SCHEDULE",
  "RunDetail": {
    "StartedOn": "2025-11-03T13:12:09.904466+00:00",
    "CompletedOn": "2025-11-03T13:13:57.620605+00:00",
    "ModifiedAt": "2025-11-03T13:16:08.888182+00:00",
    "Duration": 107,
    "ErrorMessage": "Workflow execution failed",
    "TaskInstances": [
      "ex_5496697b-900d-4008-8d6f-5e43767d6e36_create_bucket_1"
    ],
    "RunState": "FAILED"
  }
}
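
The two responses above suggest a simple way to tell the failure modes apart in a script. The sketch below is a heuristic based only on those two examples (an empty TaskInstances list for definition errors, a populated one for task failures), so treat it as an assumption rather than documented behavior; the function name `classify_run` is hypothetical.

```python
def classify_run(run):
    """Summarize a GetWorkflowRun response dict (heuristic, hypothetical helper)."""
    detail = run.get("RunDetail", {})
    if detail.get("RunState") != "FAILED":
        return "not failed"
    if not detail.get("TaskInstances"):
        # No task instance was recorded, so the definition itself was rejected.
        return f"definition error: {detail.get('ErrorMessage')}"
    count = len(detail["TaskInstances"])
    return f"task failure: inspect {count} task instance(s)"


definition_error = {
    "RunDetail": {
        "ErrorMessage": "expected token ',', got 'create_test_table'",
        "TaskInstances": [],
        "RunState": "FAILED",
    }
}
task_failure = {
    "RunDetail": {
        "ErrorMessage": "Workflow execution failed",
        "TaskInstances": ["ex_5496697b-900d-4008-8d6f-5e43767d6e36_create_bucket_1"],
        "RunState": "FAILED",
    }
}
print(classify_run(definition_error))
print(classify_run(task_failure))
```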

MWAA Serverless task logs are stored in the CloudWatch log group /aws/mwaa-serverless/<workflow id>/ (where /<workflow id> is the same string as the unique workflow id in the ARN of the workflow). For specific task log streams, you will need to list the tasks for the workflow run and then get each task’s information. You can combine these operations into a single CLI command.

aws mwaa-serverless list-task-instances \
  --workflow-arn arn:aws:airflow-serverless:us-east-2:111122223333:workflow/simple_s3_test-abc1234def \
  --run-id ABC123456789def \
  --region us-east-2 \
  --query 'TaskInstances[].TaskInstanceId' \
  --output text | tr '\t' '\n' | xargs -I {} aws mwaa-serverless get-task-instance \
  --workflow-arn arn:aws:airflow-serverless:us-east-2:111122223333:workflow/simple_s3_test-abc1234def \
  --run-id ABC123456789def \
  --task-instance-id {} \
  --region us-east-2 \
  --query '{Status: Status, StartedAt: StartedAt, LogStream: LogStream}'

This returns output similar to the following:

{
    "Status": "SUCCESS",
    "StartedAt": "2025-10-28T21:21:31.753447+00:00",
    "LogStream": "//aws/mwaa-serverless/simple_s3_test_3-abc1234def//workflow_id=simple_s3_test-abc1234def/run_id=ABC123456789def/task_id=list_objects/attempt=1.log"
}
{
    "Status": "FAILED",
    "StartedAt": "2025-10-28T21:23:13.446256+00:00",
    "LogStream": "//aws/mwaa-serverless/simple_s3_test_3-abc1234def//workflow_id=simple_s3_test-abc1234def/run_id=ABC123456789def/task_id=create_object_list/attempt=1.log"
}

You can then use the CloudWatch LogStream value to debug your workflow.
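
When a run has many tasks, it can help to narrow the output to the streams worth reading. The sketch below derives the log group name from a workflow ARN using the naming pattern described above and picks out the log streams of failed tasks; both function names are hypothetical, and the input is assumed to be a list of dictionaries shaped like the output shown above.

```python
def log_group_for(workflow_arn):
    """Derive the CloudWatch log group name from a workflow ARN."""
    workflow_id = workflow_arn.split(":workflow/", 1)[1]
    return f"/aws/mwaa-serverless/{workflow_id}/"


def failed_log_streams(task_instances):
    """Return the LogStream of every task instance whose Status is FAILED."""
    return [t["LogStream"] for t in task_instances if t.get("Status") == "FAILED"]


arn = "arn:aws:airflow-serverless:us-east-2:111122223333:workflow/simple_s3_test-abc1234def"
tasks = [
    {"Status": "SUCCESS", "LogStream": "task_id=list_objects/attempt=1.log"},
    {"Status": "FAILED", "LogStream": "task_id=create_object_list/attempt=1.log"},
]
print(log_group_for(arn))
print(failed_log_streams(tasks))
```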

You can also view and manage your workflows in the Amazon MWAA Serverless console.

For an example that creates detailed metrics and monitoring dashboard using AWS Lambda, Amazon CloudWatch, Amazon DynamoDB, and Amazon EventBridge, review the example in this GitHub repository.

Clean up resources

To avoid incurring ongoing charges, follow these steps to clean up all resources created during this tutorial:

  1. Delete MWAA Serverless workflows – Run this AWS CLI command to delete all workflows:
    aws mwaa-serverless list-workflows --query 'Workflows[*].WorkflowArn' --output text | tr '\t' '\n' | while read -r workflow; do aws mwaa-serverless delete-workflow --workflow-arn "$workflow"; done

  2. Remove the IAM roles and policies created for this tutorial:
    aws iam delete-role-policy --role-name mwaa-serverless-access-role --policy-name mwaa-serverless-policy
    aws iam delete-role --role-name mwaa-serverless-access-role

  3. Remove the YAML workflow definitions from your S3 bucket:
    aws s3 rm s3://amzn-s3-demo-bucket/yaml/ --recursive

After completing these steps, verify in the AWS Management Console that all resources have been properly removed. Remember that CloudWatch Logs are retained by default and may need to be deleted separately if you want to remove all traces of your workflow executions.

If you encounter any errors during cleanup, verify you have the necessary permissions and that resources exist before attempting to delete them. Some resources may have dependencies that require them to be deleted in a specific order.

Conclusion

In this post, we explored Amazon MWAA Serverless, a new deployment option that simplifies Apache Airflow workflow management. We demonstrated how to create workflows using YAML definitions, convert existing Python DAGs to the serverless format, and monitor your workflows.

MWAA Serverless offers several key advantages:

  • No provisioning overhead
  • Pay-per-use pricing model
  • Automatic scaling based on workflow demands
  • Enhanced security through granular IAM permissions
  • Simplified workflow definitions using YAML

To learn more about MWAA Serverless, review the documentation.


About the authors

John Jackson

John has over 25 years of software experience as a developer, systems architect, and product manager in both startups and large corporations and is the AWS Principal Product Manager responsible for Amazon MWAA.

Building serverless applications with Rust on AWS Lambda

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/building-serverless-applications-with-rust-on-aws-lambda/

Today, AWS Lambda is promoting Rust support from Experimental to Generally Available. This means you can now use Rust to build business-critical serverless applications, backed by AWS Support and the Lambda availability SLA.

Rust is a popular programming language due to its combination of high performance, memory safety, and developer experience. It offers speed and memory utilization efficiency comparable with C++, together with the reliability normally associated with higher-level languages.

This post shows you how to build and deploy Rust-based Lambda functions using Cargo Lambda, a third-party open source tool for working with Lambda functions in Rust. We’ll also cover how to deploy your functions using the Cargo Lambda AWS Cloud Development Kit (AWS CDK) construct.

Prerequisites

Before you begin, make sure you have:

  • An AWS account with appropriate permissions
  • The AWS Command Line Interface (AWS CLI) configured with your credentials
  • Rust installed on your development machine (version 1.70 or later)
  • Node.js 20 or later (for AWS CDK deployment)
  • AWS CDK installed: npm install -g aws-cdk

Solution overview

This post takes you through the following steps:

  1. Install and configure Cargo Lambda.
  2. Create and deploy a basic HTTP Lambda function using Cargo Lambda.
  3. Build a complete serverless API using AWS CDK with Rust Lambda functions.

Install and configure Cargo Lambda

Cargo is the package manager and build system for Rust. Cargo Lambda is a third-party open source extension to the cargo command-line tool that simplifies building and deploying Rust Lambda functions.

To install Cargo Lambda on Linux systems, run:

curl -fsSL https://cargo-lambda.info/install.sh | sh

For additional installation options, see the Cargo Lambda installation documentation.

Creating your first Rust Lambda function

Create an HTTP-based Lambda function:

cargo lambda new hi_api

When prompted for Is this function an HTTP function?, enter y.

cd hi_api

This creates a project with the following structure:

├── Cargo.toml
├── README.md
└── src
    ├── http_handler.rs
    └── main.rs

The project includes:

  • main.rs – The function entry point where you configure dependencies and shared state
  • http_handler.rs – The primary function logic

The main.rs file contains the following code:

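use lambda_http::{run, service_fn, tracing, Error};

mod http_handler;
use http_handler::function_handler;

#[tokio::main]
async fn main() -> Result<(), Error> {
    // Set up the default tracing subscriber so logs reach CloudWatch
    tracing::init_default_subscriber();

    // Start the runtime loop and hand each event to function_handler
    run(service_fn(function_handler)).await
}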

The key part of the main.rs file is run(service_fn(function_handler)).await. The run function is part of the lambda_http crate and starts the Lambda Rust runtime interface client (RIC), which actively polls for events from the Lambda Runtime API. The function_handler is the function defined in the http_handler.rs file. When the Runtime API returns the invoke event, the RIC calls the function_handler from http_handler.rs:

use lambda_http::{Body, Error, Request, RequestExt, Response};

pub(crate) async fn function_handler(event: Request) -> Result<Response<Body>, Error> {
    // Extract some useful information from the request
    let who = event
        .query_string_parameters_ref()
        .and_then(|params| params.first("name"))
        .unwrap_or("world");
    let message = format!("Hello {who}, this is an AWS Lambda HTTP request");

    // Return something that implements IntoResponse.
    // It will be serialized to the right response event automatically by the runtime
    let resp = Response::builder()
        .status(200)
        .header("content-type", "text/html")
        .body(message.into())
        .map_err(Box::new)?;
    Ok(resp)
}

The function_handler function signature includes a variable event of type Request. The event contents depend on the service triggering the function. For example, it may contain HTTP request information such as path parameters if the request is coming via HTTP, or even an array of Amazon Kinesis stream records.

For non-HTTP functions, events can be strongly typed. Additionally, you can accept any structure as input as long as it implements serde::Serialize and serde::Deserialize.

The example parses the query string parameters and uses the first parameter named name, falling back to world if no such parameter is present.
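To make the lookup-with-default behavior concrete, here is a minimal sketch that uses only the Rust standard library. It is not how lambda_http parses requests: the first_param and greeting helpers are hypothetical names introduced for illustration, and the sketch skips percent-decoding entirely.

```rust
/// Returns the first value for `key` in a raw query string such as
/// "name=Ada&name=Bob". Hypothetical helper; no percent-decoding is done.
fn first_param<'a>(query: &'a str, key: &str) -> Option<&'a str> {
    query
        .split('&')
        // Skip fragments that have no '=' separator
        .filter_map(|pair| pair.split_once('='))
        // Stop at the first pair whose key matches
        .find(|(k, _)| *k == key)
        .map(|(_, v)| v)
}

/// Builds the same message as the handler, defaulting to "world"
/// when the `name` parameter is missing.
fn greeting(query: &str) -> String {
    let who = first_param(query, "name").unwrap_or("world");
    format!("Hello {who}, this is an AWS Lambda HTTP request")
}
```

With these helpers, greeting("name=Ada&name=Bob") uses Ada because only the first match counts, and greeting("") falls back to world.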

The lambda_http crate provides an idiomatic way to return a response, using a builder pattern. The function returns the response wrapped in Ok(), which gives the Result that the run function in main.rs expects.
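The following standard-library-only sketch illustrates the general shape of such a builder and why its final step returns a Result. SketchBuilder and SketchResponse are hypothetical types invented for this illustration, not the real lambda_http types, and the validation rule is an assumption chosen to keep the example small.

```rust
/// Hypothetical response type used only in this sketch.
#[derive(Debug, PartialEq)]
struct SketchResponse {
    status: u16,
    headers: Vec<(String, String)>,
    body: String,
}

/// Hypothetical builder: each setter consumes and returns the builder,
/// and any validation error is remembered until `body` is called.
#[derive(Default)]
struct SketchBuilder {
    status: Option<u16>,
    headers: Vec<(String, String)>,
    error: Option<String>,
}

impl SketchBuilder {
    fn status(mut self, code: u16) -> Self {
        if (100..=999).contains(&code) {
            self.status = Some(code);
        } else if self.error.is_none() {
            // Keep only the first error; later setters still chain
            self.error = Some(format!("invalid status code {code}"));
        }
        self
    }

    fn header(mut self, name: &str, value: &str) -> Self {
        self.headers.push((name.to_string(), value.to_string()));
        self
    }

    /// Finishes the chain. Returning a Result is what lets callers use `?`.
    fn body(self, body: impl Into<String>) -> Result<SketchResponse, String> {
        match self.error {
            Some(err) => Err(err),
            None => Ok(SketchResponse {
                status: self.status.unwrap_or(200),
                headers: self.headers,
                body: body.into(),
            }),
        }
    }
}
```

Because errors are deferred to the last call, the chain stays fluent and the caller handles failure in exactly one place, as the handler does with map_err(Box::new)?.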

Logging

The main.rs file includes the following line by default:

tracing::init_default_subscriber();

The Rust Lambda runtime integrates natively with Tracing libraries for logging and tracing, and supports JSON structured logging. With this line in place, Lambda sends the function's logs to Amazon CloudWatch. The INFO log level is enabled by default, and you can change it by setting the RUST_LOG environment variable.

To write logs, use the tracing crate and send events using the following syntax:

tracing::info!("This is a log entry");

Building

To build the Lambda function, use cargo lambda build. When compiling the Lambda function, the AWS Lambda Runtime is built into your binary. The compiled binary file is called bootstrap. It is packaged in the function artifact .zip file and visible as a file in the AWS Lambda console.

When Lambda executes this binary, it starts an infinite loop (the run function). The loop polls the Lambda Runtime API to receive the invoke request and then calls your handler, the function_handler function.

Figure: The Lambda runtime execution environment

Your function code runs and then sends the function response back to the Lambda Runtime API, which forwards it onto the caller.
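The poll, handle, respond cycle described above can be sketched with the standard library alone. This is a simplified model, not the actual runtime: MockRuntimeApi, run_loop, and sketch_handler are hypothetical names, the real runtime talks to the Runtime API over HTTP and loops indefinitely, and this mock simply stops when its queue is empty.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for the Lambda Runtime API: a queue of pending
/// invoke events plus a record of the responses posted back.
struct MockRuntimeApi {
    pending: VecDeque<String>,
    responses: Vec<String>,
}

impl MockRuntimeApi {
    fn new(events: &[&str]) -> Self {
        Self {
            pending: events.iter().map(|e| e.to_string()).collect(),
            responses: Vec::new(),
        }
    }

    /// Models fetching the next invocation.
    fn next_invocation(&mut self) -> Option<String> {
        self.pending.pop_front()
    }

    /// Models posting the function response back to the API.
    fn post_response(&mut self, body: String) {
        self.responses.push(body);
    }
}

/// Simplified loop: fetch an event, call the handler, post the result.
fn run_loop<F>(api: &mut MockRuntimeApi, handler: F)
where
    F: Fn(&str) -> Result<String, String>,
{
    while let Some(event) = api.next_invocation() {
        let body = match handler(&event) {
            Ok(body) => body,
            Err(err) => format!("error: {err}"),
        };
        api.post_response(body);
    }
}

/// Hypothetical handler used to exercise the loop.
fn sketch_handler(event: &str) -> Result<String, String> {
    if event.is_empty() {
        Err("empty event".to_string())
    } else {
        Ok(format!("handled {event}"))
    }
}
```

Calling run_loop with a mock holding a few events processes them in order and records one response per event, whether the handler succeeded or returned an error.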

Testing

Before deploying the function, you can debug/test the function locally using cargo lambda.

cargo lambda watch sets up an environment that emulates the Lambda execution environment. This allows you to send requests to the Lambda function and see the results.

To send invocation requests, you can use either cargo lambda or send a curl request to the Lambda emulator.

To use cargo lambda, run the following command, replacing <lambda-function-name> with hi_api for this example:

cargo lambda invoke <lambda-function-name> --data-example apigw-request

You can use any of the built-in example payloads with the --data-example parameter. Use --data-ascii <payload> to provide your own payload.

To invoke the function using curl, pass the JSON format payload to the local emulator’s address:

curl -v -X POST \
  'http://127.0.0.1:9000/lambda-url/<lambda-function-name>/' \
  -H 'content-type: application/json' \
  -d '{ "command": "hi" }'

Deploying with Cargo Lambda

Once you have built the function using cargo lambda build, you can deploy it to your AWS account.

To deploy your function:

cargo lambda deploy

Once the Lambda function is deployed, you can test it remotely. cargo lambda invoke tests the remote Lambda function using a payload stored in a .json file:

cargo lambda invoke --remote hi_api --data-file <event file>

Infrastructure-as-Code with AWS CDK

You can create a serverless API in front of this Rust Lambda function using Amazon API Gateway. This example uses the AWS CDK. This example does not have authentication configured for the API Gateway endpoint as it is a sample. The AWS best practice is to implement relevant security controls where necessary.

  1. First, create a new CDK project:
    mkdir rusty_cdk
    cd rusty_cdk
    cdk init --language=typescript

    The easiest way to deploy a Rust Lambda function using the AWS CDK is to use the cargo lambda CDK Construct. This comes with everything required to run Rust Lambda functions on AWS. It is part of the cargo lambda project.

  2. Install the Cargo Lambda CDK construct:
    npm i cargo-lambda-cdk

  3. Create a new HTTP Lambda function in your project:
    mkdir lambda
    cd lambda
    cargo lambda new helloRust

    When prompted for Is this function an HTTP function?, enter y.

  4. Update your CDK stack lib/rusty_cdk-stack.ts to include both the Lambda function and API Gateway.
    import * as cdk from 'aws-cdk-lib';
    import { HttpApi, HttpMethod } from 'aws-cdk-lib/aws-apigatewayv2';
    import { HttpLambdaIntegration } from 'aws-cdk-lib/aws-apigatewayv2-integrations';
    import { RustFunction } from 'cargo-lambda-cdk';
    import { Construct } from 'constructs';
    export class RustyCdkStack extends cdk.Stack {
      constructor(scope: Construct, id: string, props?: cdk.StackProps) {
        super(scope, id, props);
        const helloRust = new RustFunction(this, 'helloRust',{
          manifestPath: './lambda/helloRust',
          runtime: 'provided.al2023',
          timeout: cdk.Duration.seconds(30),
        });
    
        const api = new HttpApi(this, 'rustyApi');
        const helloInteg = new HttpLambdaIntegration('helloInteg', helloRust);
    
        api.addRoutes({
          path: '/hello',
          methods: [HttpMethod.GET],
          integration: helloInteg,
        })
        new cdk.CfnOutput(this, 'apiUrl',{
          description: 'The URL of the API Gateway',
          value: `https://${api.apiId}.execute-api.${this.region}.amazonaws.com`,
        })
      }
    }

  5. Bootstrap your AWS account and AWS Region for the AWS CDK:
    cdk bootstrap

  6. Deploy your stack:
    cdk deploy

Testing the API

Test your deployed API using the URL provided in the AWS CDK output:

curl https://<YOUR_API_URL>/hello

Clean up

To avoid ongoing charges, remove the deployed resources:

cdk destroy

Conclusion

AWS Lambda support for Rust is now Generally Available, so you can use it to build high-performance, memory-efficient serverless applications. Cargo Lambda is a third-party extension to the Rust cargo CLI that simplifies developing, testing, and deploying Rust applications to Lambda.

To find more Rust code examples, use the Serverless Patterns Collection. For more serverless learning resources, visit Serverless Land.