Tag Archives: Intermediate (200)

Remote access to AWS: A guide for hybrid workforces

2025-07-01 Itay Meller

Post Syndicated from Itay Meller original https://aws.amazon.com/blogs/security/remote-access-to-aws-a-guide-for-hybrid-workforces/

Amazon Web Services (AWS) customers can enable secure remote access to their cloud resources, supporting business operations with both speed and agility. As organizations embrace flexible work environments, employees can safely connect to AWS resources from various locations using different devices. AWS provides comprehensive security solutions that help organizations maintain strong protection of corporate resources, manage appropriate access controls, and meet compliance requirements while enabling productive remote work environments.

Because there are different types of workloads—from Amazon Elastic Compute Cloud (Amazon EC2) instances to web applications—running in the AWS Cloud, there are correspondingly multiple remote access use cases for using or operating these workloads. For example, access to an EC2 instance and its operating system to perform operations such as troubleshooting, log analysis, and data retrieval. Other use cases require access to web applications such as Jenkins, Salesforce, or the Kubernetes UI deployed on AWS.

To support these use cases, AWS provides multiple services and features that help you address access patterns using different approaches. One of the key challenges that you might face when implementing remote access solutions is understanding the tradeoffs of the different approaches and solutions. This post is designed to help you decide which remote access approach is best for your use-case.

Use cases

In this post, we address the following use cases:

User access to internal web applications deployed in a virtual private cloud (VPC).
Internal operator access to EC2 and Amazon Relational Database Service (Amazon RDS) instances deployed in a VPC.
Analyst access to sensitive data residing on Amazon Simple Storage Service (Amazon S3).
User access to SAML 2.0 and OAuth 2.0 applications.

Challenges associated with remote access

Cost: The cost of a remote access solution is a key factor for businesses.
Increased exposure surface: Securing a VPC with several EC2 instances, S3 buckets, and a database is a different task than securing the identities, devices, and communications channels used for remote access to the infrastructure.
Increased risk: Susceptibility to social engineering threats. Humans accessing workloads are the weakest link in any security program, introducing risks to data and infrastructure that otherwise wouldn’t have existed.
User experience (UX): The UX is a key factor in remote access. Lacking a well-designed UX can introduce risks by making it difficult to conduct day-to-day operations or respond quickly to incidents that affect users at scale.

A solution to mitigate the risks associated with remote access is to not provide it at certain levels, and you might sometimes choose this approach. In these cases, access to workloads that must be secure is only possible from trusted locations (such as company offices) and managed devices (such as company-issued laptops). For the remainder of this post, we talk about approaches and solutions available for you when you need to provide remote access from various locations and devices.

The different approaches

Before diving deeper into the services and features, let’s explore the different approaches for providing remote access to your users (shown in Figure 1). The main differentiator among them is where the trust boundary lies.

Figure 1: The different approaches along with the corresponding solutions

Network-based approach: Users are given access to your network through VPCs and are granted broad access to the actual target resource, web application, or EC2 instances. The trust boundary in this case is the VPC.
Host-based approach: Users have access to the host running the application. This is commonly used for operator access. The trust boundary is the host.
Application-based approach: Users access the application using their corporate credentials. This is commonly the case for software as a service (SaaS) applications. The trust boundary is the application.
End-user computing approach: End-user computing (EUC) is a combination of technologies, policies, and processes that gives users secure, remote access to applications, desktops, and data that they need to get their work done. Desktops are operated centrally in the cloud and interacted with using streamed pixels to users’ devices. This approach shifts the trust boundary from the user device to desktops and data residing in the cloud.

These approaches aren’t mutually exclusive and occasionally overlap or can be combined in a zero trust model. Zero trust is centered on the idea that access to resources shouldn’t be based solely on the network location but on authentication and authorization of each request using multiple factors; including the user identity, device, and location, among others.

The trust boundary primarily depends on the criticality of the target resource, the risk tolerance of the organization, and the complexity of the implementation. Wider trust boundaries (such as in a network-based approach) increase the exposed surface area—because the whole network is exposed to trusted users and access to the network grants access to all the resources inside it—but are the simplest to implement. Tighter trust boundaries (such as a zero trust model) considerably reduce the exposed surface area but require implementing multiple factors that feed into the authorization context.

For example, organizations might provide network-based access from trusted devices to operators for a VPC with web-servers and databases, but only allow end-user computing based access for contractors or third-party users using non-corporate devices, also known as bring your own device (BYOD).

When selecting your remote access solution, you need to consider the desired trust boundary, authentication, authorization, user experience, access visibility and cost, which we explore in the following sections.

Network-based approach

The network-based approach is popular when users need access to multiple resources residing in specific networks in a straightforward manner, while keeping the networks disconnected from the public internet. When providing access at the network level, managing security configurations such as authorization, authentication, and auditing happens at the resource (application or machine) and client device, introducing challenges at scale.

AWS Client VPN is a fully managed service that you can use to securely connect users to VPCs from virtually any location using OpenVPN-based clients. Users can authenticate using your organization’s identity provider (IdP) in combination with certificate-based authentication. The service supports authorization rules that act as firewall rules to grant users access to specific CIDR blocks based on membership in an Active Directory group or a group defined in a SAML-based IdP. Additionally, you can use client connect handlers to run custom authorization logic based on device, user, and connection attributes.

After the required infrastructure is set up, users can connect to the target VPC and access EC2 instances or web applications at the network level inside their authorization scope. The UX is as straightforward as connecting using client software installed on the user’s device and authenticating using corporate authentication policies. A client VPN provides visibility into users’ connections to the VPN through connection logs, which are streamed to an Amazon CloudWatch log group. Connection logging provides visibility into each user’s initial VPN connection; getting visibility into what happened during the connection requires gathering the data from the target resource, network, or the user’s device.

After the user is authenticated and authorized, their device gains network access to the relevant VPC—and potentially other VPCs that are peered—or is connected through AWS Transit Gateway to that VPC. This can potentially provide network access to resources and networks outside the scope of the user.

A client VPN-based solution should be implemented when the network is the intended trust boundary around resources (for example, at the subnet or security group level) and group-level access control is sufficient. See Get started with AWS Client VPN.

Host-based approach

Providing access to hosts isn’t always necessary. One way to mitigate risks related to unauthorized host access is to not allow it and rely on fully automated operations instead. In practice, operators and developers still require access to hosts for visibility, tuning the operating system settings, applying patches, or manually restarting a service.

For access to EC2 instances, you can use features such as AWS Systems Manager Session Manager or EC2 Instance Connect Endpoint. Both of these features provide access to a host without exposing it to the internet, and because they use AWS Identity and Access Management (IAM), they authenticate, authorize, and log every session request made to AWS CloudTrail, and provide capabilities such as IAM Conditions (for example, aws:SourceIp) to apply conditional access, often minimizing the need for a network-based approach.

These two features mainly differ in the way they operate. Session Manager requires an agent, which is installed by default on several Amazon Machine Images (AMIs). The agent establishes an outbound connection—through the internet or a VPC endpoint—to the service endpoints, so you don’t have to modify the host’s inbound security group rules. It allows SSH connections tunneled over a proxy connection and provides in-session logging providing visibility into users’ commands within a session.

EC2 Instance Connect doesn’t require that you install an agent; it allows a secure native SSH connection, using short-lived SSH keys. As such, it requires that inbound connections from the EC2 Instance Connect service on port 22 be allowed on the host’s security group.

Most customers use Session Manager unless they don’t want to have an agent installed on the virtual machine or require a native SSH experience.

End-user computing approach

End-user computing services like the Amazon WorkSpaces Family or Amazon AppStream 2.0 stream desktops and applications as encrypted pixels to remote users while keeping data safely within your Amazon Virtual Private Cloud (Amazon VPC) and connected private networks. Unauthorized access to the client device is exposed only to encrypted pixels, which essentially moves the trust boundary from the device accessing the resources to the virtual desktop running in the cloud.

You can dive deeper into the differences between these different services in Unified access to AWS End User Computing services.

These services are particularly popular among customers who want to minimize the user’s device as the trust boundary. This can improve operational efficiency, especially when dealing with untrusted devices or highly sensitive data, because you significantly narrow the scope of what needs to be protected. This can also reduce the use (and costs) of expensive hardware.

The idea is that a user first authenticates using credentials provided by the corporate Active Directory or a SAML federation to the corporate identity provider. After the user is authenticated and authorized, an encrypted streaming session begins and the client is remotely operating a desktop or application that’s deployed in an Amazon VPC, with an elastic network interface (ENI) deployed to the customer’s managed VPC.

When adopting end-user computing for remote access, you can choose the UX and the cost structure that best fit your use case (for example, persistent access to a desktop or on-demand access to specific applications). You can also select different compute and storage options depending on the desired performance.

AWS End User Computing provides different machine types to accommodate different UX requirements and with different pricing models depending on the consumption model being used.

For more information, see Getting Started with Amazon Workspaces and Getting Started with AppStream 2.0.

Application-based approach

IAM Identity Center is primarily known for simplifying user access to AWS accounts within an organization at scale. It also provides single sign-on (SSO) access to supported web applications, giving users seamless access to these applications after they sign in using their directory credentials. Identity Center supports two application types:

AWS managed applications, such as Amazon Q Developer and Amazon QuickSight.
Customer managed applications, such as Salesforce, Microsoft365, and other commonly used applications.

If you use customer managed applications that support SAML 2.0 and OAuth 2.0, you can federate your IdP to IAM Identity Center through SAML 2.0 and use Identity Center to manage user access to those applications.

For organizations operating AWS environments at scale with multiple accounts, using IAM Identity Center is the recommended service to provide access to web applications and is provided at no additional cost.

Combining multiple approaches in a zero trust model

The zero trust model combines multiple factors including the user identity, device, location, and others, to evaluate and grant access requests. One way that you can implement this model to provide remote access to workloads deployed in a VPC is to use AWS Verified Access. With Verified Access, you can provide secure access to corporate applications without a client VPN and support TCP-based connections to a VPC, be it to web applications or EC2 or RDS instances. Authentication can be done using an existing IdP or AWS IAM Identity Center and a device management service that can provide additional information to improve authorization decisions based on context from the device. Those authorization decisions are expressed as Cedar policies that you author based on your access requirements. The service provides extensive logging for each web request, so you can investigate anomalies and view information about the access that was granted. For more information, see Get Started with Amazon Verified Access.

Understanding the tradeoffs

To select the right solution for your workforce, work backwards from the use case. Start by identifying and classifying the asset inventory and mapping the users accessing it and their access patterns.

Things to consider based on the classification:

Visibility: Determine the level of visibility into remote access activities and the type of information that you’ll need to detect and recover from a security event or to comply with regulatory and compliance requirements.
Authentication and authorization: Determine if your existing IAM mechanisms are sufficient. You might need to identify a temporary access management system or include information coming from user devices to address risks of compromised employees.
Network access: Know your users and what type of network access, if any, they need. When considering network access, include the potential risks of overly permissive access.
Cost: To determine costs, you need to know how many users and resources will be supported by remote access. Also, how many connections you expect and for how much time. Use that information to help determine the total cost of ownership of your solution.
Endpoint security: For each resource, understand risks associated with providing access to it from a user’s device. Know what mechanisms you have (or can implement) to detect threats and unauthorized access or provide additional context for the authorization decision granting access to a resource.
User experience: Compare the cost of a streamed user experience to one that’s locally installed to see if any additional cost is balanced by the improved security of the streamed UX.

The following provides an overview of the different solutions and the factors that can help you make an informed decision.

Solution	Use cases	Trust Boundary	Provides access to	Protocol	User experience	Authentication	Authorization	Visibility	Cost
Client VPN	User access to internal applications Operator access to IP resources in VPC	Network	VPCs, subnets, security groups	IP	Client based,native	Single sign-on (SAML-based) Active Directory Mutual (certificate based)	Per CIDR (Authorization rules) Lambda Authorizer for custom code	Connection logging (CloudWatch)	Connection time and endpoint association
AWS Session Manager	Operator access to EC2 or on-premises instances	Host	EC2 Instances: Linux, Windows, or MacOS (EC2 only)	SSH or RDP	Native	IAM	IAM	CloudTrail, or in-session logging using CloudWatch and Amazon S3)	No additional cost for accessing EC2 instances
EC2 Instance Connect Endpoint	Operator access to EC2 instances	Host	EC2 Instances: Linux or Windows	SSH or RDP	Native	IAM	IAM	CloudTrail	No additional cost
IAM Identity Center	User access to SAML 2.0 and OAuth 2.0 applications	Application	Web applications	HTTP(S)	Native	IAM Identity Center	IAM Identity Center	CloudTrail	No additional cost
Amazon Verified Access	User accesses web or TCP-based applications deployed in a VPC	Amazon Verified Access	Web applications TCP resources	HTTP(S) or TCP	Native	IAM Identity Center or OIDC	Custom using Cedar policies Allows device signals	Per request logging	Per application or bandwidth
Amazon Workspaces	User accesses a virtual persistent desktop in a VPC	Cloud desktop	Persistent virtual desktop	WSP or PCoIP	Client based, or non-native	Identity provider	Group membership	CloudTrail	Per instance
Amazon AppStream 2.0	User accesses a virtual desktop in a VPC	Cloud desktop	Non persistent virtual desktops and applications	NICE DCV	Client based or non-native	Identity provider	Group membership	CloudTrail	Per instance

Conclusion

In this post, you learned about different approaches and solutions for providing remote access for your organization’s workforce. This included tactical recommendations on how to find the remote access solution that suits your needs best based on factors such as costs, user experience, and risk. By understanding those tradeoffs, you can now map out the different use-cases based on your infrastructure and threat model and build a remote access strategy to meet your needs. As you experiment and adopt the different tools, careful planning is required when designing and deploying the services. For example, which account to deploy the service to or how to provision access to the services. Use resources such as the AWS Security Reference Architecture (AWS SRA) and the individual service documentation pages to help guide your journey.

If you have feedback about this post, submit comments in the Comments section below.

Enhance data ingestion performance in Amazon Redshift with concurrent inserts

2025-06-26 Raghu Kuppala

Post Syndicated from Raghu Kuppala original https://aws.amazon.com/blogs/big-data/enhance-data-ingestion-performance-in-amazon-redshift-with-concurrent-inserts/

Amazon Redshift is a fully managed petabyte data warehousing service in the cloud. Its massively parallel processing (MPP) architecture processes data by distributing queries across compute nodes. Each node executes identical query code on its data portion, enabling parallel processing.

Amazon Redshift employs columnar storage for database tables, reducing overall disk I/O requirements. This storage method significantly improves analytic query performance by minimizing data read during queries. Data has become many organizations’ most valuable asset, driving demand for real-time or near real-time analytics in data warehouses. This demand necessitates systems that support simultaneous data loading while maintaining query performance. This post showcases the key improvements in Amazon Redshift concurrent data ingestion operations.

Challenges and pain points for write workloads

In a data warehouse environment, managing concurrent access to data is crucial yet challenging. Customers using Amazon Redshift ingest data using various approaches. For example, you might commonly use INSERT and COPY statements to load data to a table, which are also called pure write operations. You might have requirements for low-latency ingestions to maximize data freshness. To achieve this, you can submit queries concurrently to the same table. To enable this, Amazon Redshift implements snapshot isolation by default. Snapshot isolation provides data consistency when multiple transactions are running simultaneously. Snapshot isolation guarantees that each transaction sees a consistent snapshot of the database as it existed at the start of the transaction, preventing read and write conflicts that could compromise data integrity. With snapshot isolation, read queries are able to execute in parallel, so you can take advantage of the full performance that the data warehouse has to offer.

However, pure write operations execute sequentially. Specifically, pure write operations need to acquire an exclusive lock during the entire transaction. They only release the lock when the transaction has committed the data. In these cases, the performance of the pure write operations is constrained by the speed of serial execution of the writes across sessions.

To understand this better, let’s look at how a pure write operation works. Every pure write operation includes pre-ingestion tasks such as scanning, sorting, and aggregation on the same table. After the pre-ingestion tasks are complete, the data is written to the table while maintaining data consistency. Because the pure write operations run serially, even the pre-ingestion steps ran serially due to lack of concurrency. This means that when multiple pure write operations are submitted concurrently, they are processed one after another, with no parallelization even for the pre-ingestion steps. To improve the concurrency of ingestion to the same table and meet low latency requirements for ingestion, customers often use workarounds through the use of staging tables. Specifically, you can submit INSERT ... VALUES(..) statements into staging tables. Then, you perform joins with other tables, such FACT and DIMENSION tables, prior to appending data using ALTER TABLE APPEND into your target tables. This approach isn’t desirable because it requires you to maintain staging tables and potentially have a larger storage footprint due to data block fragmentation from the use of ALTER TABLE APPEND statements.

In summary, the sequential execution of concurrent INSERT and COPY statements, due to their exclusive locking behavior, creates challenges if you want to maximize the performance and efficiency of your data ingestion workflows in Amazon Redshift. To overcome these limitations, you must adopt workaround solutions, introducing additional complexity and overhead. The following section outlines how Amazon Redshift has addressed these pain points with improvements to concurrent inserts.

Concurrent inserts and its benefits

With Amazon Redshift patch 187, Amazon Redshift has introduced significant improvement in concurrency for data ingestion with support for concurrent inserts. This improves concurrent execution of pure write operations such as COPY and INSERT statements, accelerating the time for you to load data into Amazon Redshift. Specifically, multiple pure write operations are able to progress simultaneously and complete pre-ingestion tasks such as scanning, sorting, and aggregation in parallel.

To visualize this improvement, let’s consider an example of two queries, executed concurrently from different transactions.

The following is query 1 in transaction 1:

INSERT INTO table_a SELECT * FROM table_b WHERE table_b.column_x = 'value_a';

The following is query 2 in transaction 2:

INSERT INTO table_a SELECT * FROM table_c WHERE table_c.column_y = 'value_b'

The following figure illustrates a simplified visualization of pure write operations without concurrent inserts.

Without concurrent inserts, the key components are as follows:

First, both pure write operations (INSERT) need to read data from table b and table c, respectively.
The segment in pink is the scan step (reading data) and the segment in green is write step (actually inserting the data).
In the “Before concurrent inserts” state, both queries would run sequentially. Specifically, the scan step in query 2 waits for the insert step in query 1 to complete before it begins.

For example, consider two identically sized queries across different transactions. Both queries need to scan the same amount of data and insert the same amount of data into the target table. Let’s say both are issued at 10:00 AM. First, query 1 would spend from 10:00 AM to 10:50 AM scanning the data and 10:50 AM to 11:00 AM inserting the data. Next, because query 2 is identical in scan and insertion volumes, query 2 would spend from 11:00 AM to 11:50 AM scanning the data and 11:50 AM to 12:00 PM inserting the data. Both transactions started at 10:00 AM. The end-to-end runtime is 2 hours (transaction 2 ends at 12:00 PM).The following figure illustrates a simplified visualization of pure write operations with concurrent inserts, compared with the previous example.

With concurrent inserts enabled, the scan step of query 1 and query 2 can progress simultaneously. When either of the queries need to insert data, they now do so serially. Let’s consider the same example, with two identically sized queries across different transactions. Both queries need to scan the same amount of data and insert the same amount of data into the target table. Again, let’s say both are issued at 10:00 AM. At 10:00 AM, query 1 and query 2 begin executing concurrently. From 10:00 AM to 10:50 AM, query 1 and query 2 are able to scan the data in parallel. From 10:50 AM to 11:00 AM, query 1 inserts the data into the target table. Next, from 11:00 AM to 11:10 AM, query 2 inserts the data into the target table. The total end-to-end runtime for both transactions is now reduced to 1 hour and 10 minutes, with query 2 completing at 11:10 AM. In this scenario, the pre-ingestion steps (scanning the data) for both queries are able to run concurrently, taking the same amount of time as in the previous example (50 minutes). However, the actual insertion of data into the target table is now executed serially, with query 1 completing the insertion first, followed by query 2. This demonstrates the performance benefits of the concurrent inserts feature in Amazon Redshift. By allowing the pre-ingestion steps to run concurrently, the overall runtime is improved by 50 minutes compared to the sequential execution before the feature was introduced.

With concurrent inserts, pre-ingestion steps are able to progress simultaneously. Pre-ingestion tasks could be one or a combination of tasks, such as scanning, sorting, and aggregation. There are significant performance benefits achieved in the end-to-end runtime of the queries.

Benefits

You can now benefit from these performance improvements without any additional configuration because the concurrent processing is handled automatically by the service. There are multiple benefits from the improvements in concurrent inserts. You can experience the improvement of end-to-end performance of ingestion workloads when you’re writing to the same table. Internal benchmarking shows that concurrent inserts can improve end-to-end runtime by up to 40% for concurrent insert transactions to the same tables. This feature is particularly beneficial for scan-heavy queries (queries that spend more time reading data than they spend time writing data). The higher the ratio of scan:insert in any query, higher the performance improvement expected.

This feature also improves the throughput and performance for multi-warehouse writes through data sharing. Multi-warehouse writes through data sharing helps you scale your write workloads across dedicated Redshift clusters or serverless workgroups, optimizing resource utilization and achieving more predictable performance for your extract, transform, and load (ETL) pipelines. Specifically, in multi-warehouse writes through data sharing, queries from different warehouses can write data on the same table. Concurrent inserts improve the end-to-end performance of these queries by reducing resource contention and enabling them to make progress simultaneously.

The following figure shows the performance improvements from internal tests from concurrent inserts, with the orange bar indicating the performance improvement for multi-warehouse writes through data sharing and the blue bar denoting the performance improvement for concurrent inserts on the same warehouse. As the graph indicates, queries with higher scan components relative to insert components benefit up to 40% with this new feature.

You can also experience additional benefits as a result of using concurrent inserts to manage your ingestion pipelines. When you directly write data to the same tables by using the benefit of concurrent inserts instead of using workarounds with ALTER TABLE APPEND statements, you can reduce your storage footprint. This comes in two forms: first from the elimination of temporary tables, and second from the reduction in table fragmentation from frequent ALTER TABLE APPEND statements. Additionally, you can avoid operational overhead of managing complex workarounds and rely on frequent background and customer-issued VACUUM DELETE operations to manage the fragmentation caused by appending temporary tables to your target tables.

Considerations

Although the concurrent insert enhancements in Amazon Redshift provide significant benefits, it’s important to be aware of potential deadlock scenarios that can arise in a snapshot isolation environment. Specifically, in a snapshot isolation environment, deadlocks can occur in certain conditions when running concurrent write transactions on the same table. The snapshot isolation deadlock happens when concurrent INSERT and COPY statements are sharing a lock and making progress, and another statement needs to perform an operation (UPDATE, DELETE, MERGE, or DDL operation) that requires an exclusive lock on the same table.

Consider the following scenario:

Transaction 1:
```
INSERT/COPY INTO table_A;
```

Transaction 2:

INSERT/COPY INTO table_A;
<UPDATE/DELETE/MERGE/DDL statement> table_A

A deadlock can occur when multiple transactions with INSERT and COPY operations are running concurrently on the same table with a shared lock, and one of those transactions follows its pure write operation with an operation that requires an exclusive lock, such as an UPDATE, MERGE, DELETE, or DDL statement. To avoid the deadlock in these situations, you can separate statements requiring an exclusive lock (UPDATE, MERGE, DELETE, DDL statements) to a different transaction so that INSERT and COPY statements can progress simultaneously, and the statements requiring exclusive locks can execute after them. Alternatively, for transactions with INSERT and COPY statements and MERGE, UPDATE, and DELETE statements on same table, you can include retry logic in your applications to work around potential deadlocks. Refer to Potential deadlock situation for concurrent write transactions involving a single table for more information about deadlocks, and see Concurrent write examples for examples of concurrent transactions.

Conclusion

In this post, we demonstrated how Amazon Redshift has addressed a key challenge: improving concurrent data ingestion performance into a single table. This enhancement can help you meet your requirements for low latency and stricter SLAs when accessing the latest data. The update exemplifies our commitment to implementing critical features in Amazon Redshift based on customer feedback.

About the authors

Raghu Kuppala is an Analytics Specialist Solutions Architect experienced working in the databases, data warehousing, and analytics space. Outside of work, he enjoys trying different cuisines and spending time with his family and friends.

Sumant Nemmani is a Senior Technical Product Manager at AWS. He is focused on helping customers of Amazon Redshift benefit from features that use machine learning and intelligent mechanisms to enable the service to self-tune and optimize itself, ensuring Redshift remains price-performant as they scale their usage.

Gagan Goel is a Software Development Manager at AWS. He ensures that Amazon Redshift features meet customer needs by prioritising and guiding the team in delivering customer-centric solutions, monitor and enhance query performance for customer workloads.

Kshitij Batra is a Software Development Engineer at Amazon, specializing in building resilient, scalable, and high-performing software solutions.

Sanuj Basu is a Principal Engineer at AWS, driving the evolution of Amazon Redshift into a next-generation, exabyte-scale cloud data warehouse. He leads engineering for Redshift’s core data platform — including managed storage, transactions, and data sharing — enabling customers to power seamless multi-cluster analytics and modern data mesh architectures. Sanuj’s work helps Redshift customers break through th

Amazon OpenSearch Service 101: Create your first search application with OpenSearch

2025-06-25 Sriharsha Subramanya Begolli

Post Syndicated from Sriharsha Subramanya Begolli original https://aws.amazon.com/blogs/big-data/amazon-opensearch-service-101-create-your-first-search-application-with-opensearch/

Organizations today face the challenge of managing and deriving insights from an ever-expanding universe of data in real time. Industrial Internet of Things (IoT) sensors stream millions of temperature, pressure, and performance metrics from field equipment every second. Ecommerce platforms need to surface relevant products from vast catalogs instantly. Security teams must analyze system logs in real time to detect threats. As data volumes grow, organizations increasingly struggle with fragmented monitoring tools that create critical visibility gaps and slow incident response times. The cost of commercial observability solutions becomes prohibitive, forcing teams to manage multiple separate tools and increasing both operational overhead and troubleshooting complexity. Across these diverse scenarios, the ability to efficiently search, analyze, and visualize data in real time has become crucial for business success.

Amazon OpenSearch Service addresses these challenges by providing a fully managed search and analytics service. This managed service configures, manages, and scales OpenSearch clusters so you can focus on your search workloads and end customers. Amazon OpenSearch Serverless further makes it straightforward to run search and log analytics workloads by automatically scaling compute and storage resources up and down to match your application’s demands—with no infrastructure to manage. Whether you’re processing continuous streams of IoT telemetry, enabling product discovery, or performing security analytics, OpenSearch Service scales to meet your needs.

In this post, we walk you through a search application building process using Amazon OpenSearch Service. Whether you’re a developer new to search or looking to understand OpenSearch fundamentals, this hands-on post shows you how to build a search application from scratch—starting with the initial setup; diving into core components such as indexing, querying, result presentation; and culminating in the execution of your first search query.

Components of OpenSearch Service

Before building your first search application, it’s important to understand some key architectural components in OpenSearch. The fundamental unit of information in OpenSearch is a document stored in JSON format. These documents are organized into indices—collections of related documents that function similar to database tables. When you search for information, OpenSearch queries these indices to find matching documents.

OpenSearch operates on a distributed architecture where multiple servers, called nodes, work together in a cluster or domain. Each cluster can utilize dedicated master nodes that focus solely on cluster management tasks, such as maintaining cluster state, managing indices, and orchestrating shard allocation. These specialized nodes enhance cluster stability by offloading cluster management duties from data nodes. Data nodes, on the other hand, handle the storage, indexing, and querying of data—essentially performing the heavy lifting of data operations. Together, they provide scalability, availability, and efficient data processing in the cluster. Configure dedicated coordinator nodes that specialize in routing and distributing search and indexing requests across the cluster. These nodes reduce the load on data nodes, which allows them to focus on data storage, indexing, and search operations.

Coordinator nodes in OpenSearch are most beneficial in the following scenarios:

Large cluster deployments – When managing substantial data volumes across many nodes.
Query-intensive workloads – For environments handling frequent search queries or aggregations, especially those with complex date histograms or multiple aggregations, benefit from faster query processing.
Heavy dashboard utilization – OpenSearch Dashboards can be resource-intensive. Offloading this responsibility to dedicated coordinator nodes reduces the strain on data nodes.

To manage large datasets efficiently, OpenSearch splits indices into smaller pieces called shards. Each shard is distributed across the cluster, with a recommended size of 10–50 GB for optimal performance. For reliability and high availability, OpenSearch maintains replica copies of these shards on different nodes, which means that your data remains accessible even if some nodes fail.

Search operations in OpenSearch are powered by inverted indices, a data structure that maps terms to the documents containing them. The BM25 ranking algorithm helps make sure that search results are relevant to users’ queries. Although searches happen in near real time, with configurable refresh intervals, individual document retrievals are immediate.

This architecture provides the foundation for handling high-volume IoT data streams, complex full-text search operations, and real-time analytics, all while maintaining fault tolerance. Understanding these components will help you make informed decisions as you build your search application.OpenSearch Dashboards is a visualization and analytics tool for exploring, analyzing, and visualizing data in real time. It provides an intuitive interface for querying, monitoring, and reporting on OpenSearch data using visualizations such as charts, graphs, and maps. Key features include interactive dashboards, alerting, anomaly detection, security monitoring, and trace analytics.

Sample Amazon OpenSearch Service tutorial application overview

The following architecture diagram demonstrates how to build and deploy a scalable, fully managed search application on Amazon Web Services (AWS). The architecture uses Amazon OpenSearch Service for indexing and searching data. The UI application is deployed on AWS App Runner and interacts with Amazon OpenSearch Service through secure serverless Amazon API Gateway and AWS Lambda.

Here is the end-to-end workflow for our application detailing how user requests are handled from initial access through to data retrieval or indexing:

Users access the application through AWS App Runner, which hosts the frontend interface.
Amazon Cognito handles user authentication and authorization for secure access to the application.
When users interact with the application, their requests are sent to API Gateway. API Gateway communicates with Amazon Cognito to verify user authentication status. It serves as the primary entry point for all API operations and routes the requests appropriately. It forwards requests to Lambda functions within the virtual private cloud (VPC).
Lambda functions process the requests, performing either:
Data indexing operations into OpenSearch Service
Search queries against the OpenSearch Service cluster
The OpenSearch Service cluster resides within a private subnet in a VPC for enhanced security.

Prerequisites

Before you deploy the solution, review the prerequisites.

Install the sample app

The entire infrastructure is deployed using AWS Cloud Development Kit (AWS CDK), with cluster configurations customizable through the cdk.json file on GitHub. This deployment approach provides consistent and repeatable infrastructure creation while maintaining security best practices. The steps to deploy this infrastructure are available in this README file. After deployment, you’ll access a comprehensive search application built with Cloudscape React components that includes:

Interactive search functionality – Test various OpenSearch query methods including prefix match keyword searches, phrase matching, fuzzy searches, and field-specific queries against the sample product dataset
Document management tools – Bulk index the product catalog with a single click or delete and recreate the index as needed for testing purposes
Educational resources – Access embedded guides explaining OpenSearch concepts, query syntax, and best practices

Index the documents

After you’ve deployed this search application, the first step is to index some documents into OpenSearch Service. Sign in to the search application UI and follow these steps:

To trigger a bulk index process, under Index Documents in the navigation pane, choose Bulk Index Product Catalog.
Choose Index Product catalog, as shown in the following screenshot.

The Lambda function indexes a comprehensive ecommerce product catalog into your newly created OpenSearch Service cluster. This sample dataset includes detailed fashion and lifestyle products spanning multiple categories. Each product record contains rich metadata, including title, detailed description, category, color, and price.

Keyword searches

OpenSearch Service offers multiple search features. For an exhaustive list, refer to Search features. We focus on a few keyword search types to help you get started with OpenSearch.

With the product catalog in OpenSearch, you can perform prefix searches through the search application’s intuitive interface. To better understand the search functionality, expand the Guide section at the top of the interface. This interactive guide explains how various kinds of searches work, complete with a practical example in context of the product catalog dataset. The guide includes best practices and a link to the detailed documentation to help you make the most of OpenSearch’s powerful query capabilities.

You can do a prefix search on any of the three key search fields: Title, Description, or Color.

A typical prefix match query looks like this:

{
  "query": {
    "match_phrase_prefix": {
      "attribute_name": {
        "query": "attribute_value",
        "max_expansions": 10,
        "slop": 1
      }
    }
  }
}

You can use this query pattern to find documents where specific fields begin with your search term, offering an intuitive “starts with” search experience.

The following image illustrates a practical example of the Prefix Match search. Entering “Ru” in the title field matches products with titles such as “Running”, “Runners” and “Ruby.” Prefix Match search is particularly useful when users only remember the beginning of a product name or are searching across multiple variations or simply exploring product categories.

Multi Match search enables searching across multiple fields simultaneously. For example, you can search for “Coral” across product title, description, and color fields simultaneously. The search query can be customized using field boosting in which matches in certain fields carry more weight than others.

A typical multi match query looks like this:

{
  "query": {
    "multi_match": {
      "query": "Coral",
      "fields": [
        "title^3",
        "description",
        "color"
      ],
      "type": "best_fields"
    }
  }
}

You can explore Wildcard Match, Range Filter, and other search features through the search application. For developers and administrators managing this search infrastructure, OpenSearch Dashboards is a native, developer-friendly interface for indexing, searching, and managing your data. It serves as a comprehensive control center where you can interact directly with your indices, test queries, and monitor performance in real time. The following screenshot shows OpenSearch Dashboards which provides an interactive UI to explore, analyze and visualize search and log data.

While our example demonstrates lexical search functionality on a sample product catalog, OpenSearch Service is equally powerful for observability usecases. When handling time-series data from logs, metrics, or traces, OpenSearch excels at real-time analytics and visualization. For instance, DevOps teams can index application logs and system telemetry data, then use date histograms and statistical aggregations to identify performance bottlenecks or security anomalies as they occur. This real-time search allows IT teams to detect and respond to incidents with minimal delay. Using OpenSearch Dashboards, teams can create live operational dashboards that update automatically as new data streams in. For IoT applications monitoring thousands of sensors, this means temperature anomalies or equipment failures can trigger immediate alerts through OpenSearch’s alerting capabilities. These observability workloads benefit from the same distributed architecture that powers our product search example, with the added advantage of time-series optimized indices and retention policies for managing high-volume streaming data efficiently.

Beyond search management, you can configure alerts for specific conditions, set up notification channels for operational events, and enable data discovery features. If you want to experiment with the same search queries we implemented in our application, you can launch OpenSearch Dashboards and use relevant index and search APIs from the Dev Tools section, which is an ideal environment for developing and testing before implementing in your production application. Because our OpenSearch Service cluster resides within a private subnet, you need to create a Secure Shell (SSH) tunnel to access the dashboard. For more information and steps to do this, refer to How do I use an SSH tunnel to access OpenSearch Dashboards with Amazon Cognito authentication from outside a VPC? in the Knowledge Center. So far, we’ve explored OpenSearch’s query domain-specific language (DSL). However, for those coming in from a traditional database background, OpenSearch also offers SQL and Piped Processing Language (PPL) functionality, making the transition smoother. You can explore more on this at SQL and PPL in the OpenSearch documentation.

In this post, we introduced you to different types of keyword searches. You can also store documents as vector embeddings in OpenSearch and use it for semantic search, hybrid search, multimodal search, or to implement Retrieval Augmented Generation (RAG) pattern.

Conclusion

You can now build sample search applications by following the steps outlined in this post and the implementation details available at sample-for-amazon-opensearch-service-tutorials-101 on GitHub. By using the distributed architecture of Amazon OpenSearch Service, an AWS managed service, you get fast, scalable search capabilities that grow with your business, built-in security and compliance controls, and automated cluster management—all with pay-only-for-what-you-use pricing flexibility.

Ready to learn more? Check out the Amazon OpenSearch Service Developer Guide. For more insights, best practices and architectures, and industry trends, refer to Amazon OpenSearch Service blog posts and hands-on workshops at AWS Workshops. Please also visit the OpenSearch Service Migration Hub if you are ready to migrate legacy or self-managed workloads to OpenSearch Service.

We hope this detailed guide and accompanying code will help you get started. Try it out, let us know your thoughts in the comments section, and feel free to reach out to us for questions!

About the authors

Sriharsha Subramanya Begolli works as a Senior Solutions Architect with Amazon Web Services (AWS), based in Bengaluru, India. His primary focus is assisting large enterprise customers in modernizing their applications and developing cloud-based systems to meet their business objectives. His expertise lies in the domains of data and analytics.

Fraser Sequeira is a Startups Solutions Architect with Amazon Web Services (AWS) based in Melbourne, Australia. In his role at AWS, Fraser works closely with startups to design and build cloud-native solutions on AWS, with a focus on analytics and streaming workloads. With over 10 years of experience in cloud computing, Fraser has deep expertise in big data, real-time analytics, and building event-driven architecture on AWS. He enjoys staying on top of the latest technology innovations from AWS and sharing his learnings with customers. He spends his free time tinkering with new open source technologies.

Implement secure hybrid and multicloud log ingestion with Amazon OpenSearch Ingestion

2025-06-25 Xiaoxue Xu

Post Syndicated from Xiaoxue Xu original https://aws.amazon.com/blogs/big-data/implement-secure-hybrid-and-multicloud-log-ingestion-with-amazon-opensearch-ingestion/

Running applications across hybrid or multicloud environments creates a common challenge: fragmented logs scattered across different platforms. This fragmentation complicates monitoring, slows troubleshooting, and reduces operational visibility. To address this, many organizations seek to implement secure log ingestion from all environments into a centralized platform.

Amazon OpenSearch Service provides a unified solution for real-time search, analytics, and log management across your entire infrastructure. Amazon OpenSearch Ingestion, a fully managed data collector, simplifies data processing with built-in capabilities to filter, transform, and enrich your logs before analysis.

However, securely sending logs from non-AWS environments presents a challenge. Every request to OpenSearch Ingestion requires AWS Signature Version 4 (AWS SigV4) authentication, traditionally requiring long-term credentials that introduce security risks. AWS Identity and Access Management Roles Anywhere solves this problem by providing temporary credentials for workloads running outside AWS.

In this post, we demonstrate how to configure Fluent Bit, a fast and flexible log processor and router supported by various operating systems, to securely send logs from any environment to OpenSearch Ingestion using IAM Roles Anywhere. This approach alleviates the need for long-term credentials while providing a comprehensive view of your application logs across all environments—improving security, simplifying operations, and enhancing your ability to quickly resolve issues.

Solutions overview

The solution in this post uses Fluent Bit to collect logs, retrieve temporary credentials from IAM Roles Anywhere, and sign HTTP log ingestion requests with AWS SigV4 before sending them to the OpenSearch Ingestion pipeline. The following diagram shows the architecture.

Architecture for log ingestion with AWS IAM Roles Anywhere

This solution provisions the following key components:

Certificate authority – For this post, we use AWS Private Certificate Authority (AWS Private CA) as the certificate authority (CA) source. Alternatively, you can integrate with an external CA; for more details, see IAM Roles Anywhere with an external certificate authority. Certificates issued from public CAs can’t be used as trust anchors for IAM Roles Anywhere.
X.509 Certificate – We use a sample private certificate stored in AWS Certificate Manager (ACM) and issued by AWS Private CA.
IAM Roles Anywhere configuration – This includes the following:
- Trust anchor – Establishes trust between IAM Roles Anywhere and the specified CA.
- IAM role – Grants permissions for log ingestion and trusts the IAM Roles Anywhere service principal. At minimum, this role must be granted permission for the osis:Ingest action.
- Profile – Defines which roles IAM Roles Anywhere can assume and the maximum permissions granted with the temporary credentials.
OpenSearch Service domain – For this post, we use an OpenSearch Service domain, which is an AWS provisioned equivalent of an open source OpenSearch cluster. We create the domain within a virtual private cloud (VPC); see VPC versus public domains for more information. Alternatively, you can use an Amazon OpenSearch Serverless collection, which is an OpenSearch cluster that scales compute capacity based on your application’s needs.
OpenSearch Ingestion – This is configured to receive logs over HTTP as the pipeline source and forward them to the OpenSearch Service domain as the pipeline sink.

Connectivity between AWS and your hybrid or multicloud environments

You can access your OpenSearch Ingestion pipelines using an interface VPC endpoint with push-based HTTP source, which provides private IP address connectivity. For production environments, we recommend using these private connections through interface endpoints for enhanced security.

Setting up this connectivity requires additional configuration, such as creating an AWS Site-to-Site VPN connection with your hybrid and multicloud network. Although this post focuses on the log ingestion solution, you can find detailed guidance on network connectivity in the following resources:

Hybrid connectivity – Learn about different methods to connect your on-premises networks to AWS
Configuring VPC access for Amazon OpenSearch Ingestion pipelines – Set up secure private access to your ingestion pipelines
Access Amazon OpenSearch Service using an OpenSearch Service-managed VPC endpoint (AWS PrivateLink) – Configure private endpoints for your OpenSearch Service domain

How Fluent Bit retrieves temporary credentials using IAM Roles Anywhere

Using the HTTP output plugin, Fluent Bit can send logs to the OpenSearch Ingestion pipeline. The following diagram is a simplified view of how Fluent Bit retrieves AWS credentials.

How Fluent Bit retrieve AWS credentials

On Linux systems, Fluent Bit can use an AWS Command Line Interface (AWS CLI) profile that uses the credential_process parameter to trigger an external process. This external process is invoked to generate or retrieve credentials not directly supported by the AWS CLI.

The following are two common mechanisms for the external process:

IAM Roles Anywhere – Uses X.509 certificates to authenticate and returns temporary IAM credentials through IAM Roles Anywhere
OpenID Connect (OIDC) federation – Exchanges an OIDC authentication token for temporary AWS credentials

Although both options are viable, this post focuses on IAM Roles Anywhere. In this setup, the AWS IAM Roles Anywhere Credential Helper is executed to handle the signing process for the CreateSession API. This returns credentials in a JSON format that Fluent Bit can consume.

As of this writing, the Fluent Bit aws_profile configuration is supported only on Linux. It is untested on other Unix-based systems (such as macOS) and is not implemented for Windows.

Prerequisites

Before you begin this walkthrough, make sure you have the following:

AWS account requirements – This includes:
- An AWS account with permissions to deploy AWS CloudFormation templates.
- Access to AWS CloudShell for exporting a sample private certificate we will create using AWS CloudFormation in a later step.
Remote (hybrid or multicloud) environment – You must have a remote machine with Linux-based operating system. This solution was tested on Ubuntu 24.04 with the following additional tooling installed:

Deploy AWS resources with AWS CloudFormation

Follow these steps to deploy AWS resources required for this solution:

Choose Launch Stack:
Enter a unique name for Stack name. The default value is osis-with-iamra.
Configure the stack parameters. Default values are provided in the following table.

Parameter	Default value	Description
`CACommonName`	`example.com`	Common Name for the CA
`CACountry`	`US`	Organization for the CA
`CAOrganization`	`Example Org`	Country for the CA
`CAValidityInDays`	`1826`	Validity period in days for the CA certificate
`VPCCIDR`	`10.0.0.0/16`	IPv4 CIDR range for the VPC used for OpenSearch Service domain
`PublicSubnetCIDR`	`10.0.0.0/24`	IPv4 CIDR range for public subnet
`PrivateSubnet1CIDR`	`10.0.1.0/24`	IPv4 CIDR range for private subnet
`PrivateSubnet2CIDR`	`10.0.2.0/24`	IPv4 CIDR range for private subnet
`DomainName`	`test-domain`	Name of the OpenSearch Service domain
`PipelineName`	`test-pipeline`	Name of the OpenSearch Ingestion pipeline
`PipelineIngestionPath`	`/test-ingestion-path`	Ingestion path for the OpenSearch Ingestion pipeline

Select the acknowledgement check box and choose Create Stack.
Stack deployment takes about 30 minutes to complete.
When stack creation is complete, navigate to the Outputs tab on the AWS CloudFormation console and note down the values for the resources created.
The following table summarizes the output values.

Output	Description	Example value
`ACMCertificateArn`	Amazon Resource Name (ARN) of the ACM certificate. You will use this for exporting certificate and private key files using the AWS CLI in a later step.	`arn:aws:acm:aa-example-1:111122223333:certificate/a1b2c3d4-5678-90ab-cdef-EXAMPLE11111`
`CertificateAuthorityArn`	ARN of the Private CA.	`arn:aws:acm-pca:aa-example-1:111122223333:certificate-authority/a1b2c3d4-5678-90ab-cdef-EXAMPLE22222`
`TrustAnchorArn`	ARN of the IAM Roles Anywhere profile. You will use this value for configuring `credential_process` for IAM Roles Anywhere in a later step.	`arn:aws:rolesanywhere:aa-example-1:111122223333:trust-anchor/a1b2c3d4-5678-90ab-cdef-EXAMPLE33333`
`IngestionRoleArn`	ARN of the OpenSearch Ingestion role. You will use this value for configuring `credential_process` for IAM Roles Anywhere in a later step.	`arn:aws:iam::111122223333:role/role-name-with-path`
`ProfileArn`	ARN of the IAM Roles Anywhere profile. You will use this value for configuring `credential_process` for IAM Roles Anywhere in a later step.	`arn:aws:rolesanywhere:aa-example-1:111122223333:profile/a1b2c3d4-5678-90ab-cdef-EXAMPLE44444`
`OpenSearchDomainEndpoint`	Endpoint of the VPC OpenSearch domain. You will use this public endpoint for querying your index after ingestion.	`vpc-my-domain-123456789012.aa-example-1.es.amazonaws.com`
`PipelineEndpoint`	Endpoint of the OpenSearch Ingestion pipeline. You will use this public endpoint in the Fluent Bit configuration.	`my-pipeline-123456789012.aa-example-1.osis.amazonaws.com`
`PipelineIngestionPath`	Ingestion path for the OpenSearch Ingestion pipeline.	`/test-ingestion-path`

Export a sample private certificate using CloudShell

Follow these steps to export the sample private certificate created by the CloudFormation stack:

Open CloudShell. For more details, see Navigating the AWS CloudShell interface.
Export the certificate ARN from the CloudFormation outputs. If you changed the stack name in the previous step, use that value for <stack-name>, otherwise use the default value osis-with-iamra.

export CERT_ARN=$(aws cloudformation describe-stacks \
    --stack-name <stack-name> \
    --query 'Stacks[0].Outputs[?OutputKey==`ACMCertificateArn`].OutputValue' \
    --output text)

Extract the certificate and private key files:

# Generate and save the passphrase
export PASSPHRASE=$(openssl rand -base64 32)

# Export certificate using environment variables
aws acm export-certificate \
    --certificate-arn $CERT_ARN \
    --passphrase $(echo -n "$PASSPHRASE" | base64) \
    > cert_export.json

# Extract components to separate files
jq -r '.Certificate' cert_export.json > certificate.pem
jq -r '.PrivateKey' cert_export.json > encrypted_private_key.pem

# Decrypt the private key
openssl rsa -in encrypted_private_key.pem -out private_key.pem -passin pass:"$PASSPHRASE"

# Clear environment variables
unset PASSPHRASE CERT_ARN

Download the extracted certificate and private key files from CloudShell:
1. /home/cloudshell-user/certificate.pem
2. /home/cloudshell-user/private_key.pem

Configure an AWS CLI profile

Follow these steps to configure an AWS CLI profile for your log ingestion environment:

Store the downloaded certificate and private key to your environment. For an automated approach to generate and rotate certificates, see Set up AWS Private Certificate Authority to issue certificates for use with IAM Roles Anywhere.
Create a new profile named osis-pipeline-credentials that invokes the credential process. Replace the placeholders with your specific values. Find the values for trusted-anchor-arn, profile-arn, and ingestion-role-arn in your CloudFormation stack outputs.

aws configure set profile.osis-pipeline-credentials.credential_process "</path/to/aws_signing_helper> credential-process \
      --certificate </path/to/certificate.pem> \
      --private-key </path/to/private_key.pem> \
      --trust-anchor-arn <trusted-anchor-arn> \
      --profile-arn <profile-arn> \
      --role-arn <ingestion-role-arn>"

Verify your configuration. Open the ~/.aws/config file and confirm it contains a profile named osis-pipeline-credentials similar to the following:

[profile osis-pipeline-credentials]
credential_process = </path/to/aws_signing_helper> credential-process       --certificate </path/to/certificate.pem>       --private-key </path/to/private_key.pem>       --trust-anchor-arn <trusted-anchor-arn>       --profile-arn <profile-arn>       --role-arn <ingestion-role-arn>

Configure Fluent Bit

Run the following command to create a Fluent Bit configuration. Replace the placeholders with your specific values. Find the osis-pipeline-endpoint and pipeline-ingestion-path values in your CloudFormation stack outputs.

cat << 'EOF' > ~/fluent-bit.conf
[INPUT]
  name                  tail
  path                  /var/log/syslog
  read_from_head        true
  refresh_interval      5
[OUTPUT]
  name 			http
  match	   		*
  aws_service 		osis
  host 			<osis-pipeline-endpoint>
  port 			443
  uri 			<pipeline-ingestion-path> 
  format 		json
  aws_auth	 	true
  aws_region 		<aa-example-1>   
  aws_profile 		osis-pipeline-credentials
  tls 			On
EOF

This example configuration includes the following:

Uses the tail input plugin to monitor the /var/log/syslog file
Uses the http output plugin to flush log records to the OpenSearch Ingestion pipeline endpoint
Uses the osis-pipeline-credentials profile to obtain temporary AWS credentials for SigV4 authentication (aws_auth set to true)

Test the solution

Follow these steps to test the setup:

Start the Fluent Bit client with the configuration file fluent-bit.conf that you created in the previous step. Replace the placeholder with the value applicable to your environment. For Ubuntu 24.04, the default path of the Fluent Bit client is /opt/fluent-bit/bin/fluent-bit. Adjust the path if using other distributions.

sudo AWS_CONFIG_FILE=~/.aws/config <path-to-fluent-bit> -c ~/fluent-bit.conf

Because the solution in this post launched the OpenSearch Service domain within a VPC, you will need an environment that has connectivity to the VPC. For this post, we create a CloudShell VPC environment to run the commands in the next step. Find the VPC, subnet, and security group to use from your CloudFormation stack outputs.
The solution that you deployed through AWS CloudFormation dynamically creates indexes based on ingestion timestamps, format logs-%{yyyy.MM.dd}. You can specify your preferred naming using OpenSearch Ingestion index management. You can query your OpenSearch index using your preferred tool to see the ingested logs from Fluent Bit. We use awscurl in a CloudShell environment as shown in the following example. Replace the placeholders with your specific values. Find the opensearch-domain-endpoint value in your CloudFormation stack outputs.

pip install awscurl

export OPENSEARCH_DOMAIN_ENDPOINT=https://<opensearch-domain-endpoint>

# List indices matching logs-%{yyyy.MM.dd} format and get most recent one to query
export INDEX=$(awscurl --service es "$OPENSEARCH_DOMAIN_ENDPOINT/_cat/indices?v" | grep -E "logs-[0-9]{4}\.[0-9]{2}\.[0-9]{2}" | sort -r | head -1 | awk '{print $3}')

awscurl --service es $OPENSEARCH_DOMAIN_ENDPOINT/$INDEX/_search \
        -X GET -H "Content-Type: application/json" \
        -d '{
            "size": 10,
            "sort": [
              {"@timestamp": {"order": "desc"}}
            ],
            "query": { "match_all": {} }
          }' | jq '.hits.hits[]._source'

The following is an example of the expected output:

{
  "date": 1732039662.399506,
  "log": "2024-11-19T18:07:42.399375+00:00 test-server fluent-bit[9986]: 200 OK",
  "@timestamp": "2024-11-19T18:07:42.812Z"
}
{
  "date": 1732039662.399501,
  "log": "2024-11-19T18:07:42.399224+00:00 test-server fluent-bit[9986]: [2024/11/19 18:07:42] [ info] [output:http:http.0] test-pipeline-123456789012.us-east-2.osis.amazonaws.com:443, HTTP status=200",
  "@timestamp": "2024-11-19T18:07:42.812Z"
}
...

Clean up

To avoid future charges, remove the deployed resources:

Delete the CloudFormation stack.
Remove generated files from CloudShell:

rm cert_export.json encrypted_private_key.pem certificate.pem private_key.pem

Conclusion

In this post, we demonstrated how to obtain temporary credentials from IAM Roles Anywhere and securely ingest logs from hybrid or multicloud environments into OpenSearch Service using OpenSearch Ingestion. This approach minimizes the risk of credential exposure while enabling centralized log collection from distributed workloads. This solution is particularly valuable for organizations managing complex infrastructures across multiple environments and looking to consolidate observability data in OpenSearch Service. For additional details, refer to the following resources:

If you have questions or feedback about this post, please leave them in the comments section.

About the Authors

Xiaoxue Xu is a Solutions Architect for AWS based in Toronto. She primarily works with financial services customers to help secure their workload and design scalable solutions on the AWS Cloud.

Simran Singh is a Senior Solutions Architect at AWS. In this role, he assists our large enterprise customers in meeting their key business objectives using AWS. His areas of expertise include artificial intelligence and machine learning, security, and improving the experience of developers building on AWS.

Amazon Bedrock baseline architecture in an AWS landing zone

2025-06-23 Abdel-Rahman Awad

Post Syndicated from Abdel-Rahman Awad original https://aws.amazon.com/blogs/architecture/amazon-bedrock-baseline-architecture-in-an-aws-landing-zone/

As organizations increasingly adopt Amazon Bedrock to build and deploy large-scale AI applications, it’s important that they understand and adopt critical network access controls to protect their data and workloads. These generative AI-enabled applications might have access to sensitive or confidential information within their knowledge bases, Retrieval Augmented Generation (RAG) data sources, or models themselves, which could pose a risk if exposed to unauthorized parties. Additionally, organizations might want to limit access to certain AI models to specific teams or services, making sure only authorized users can use the most powerful capabilities. Another important consideration is cost optimization, because organizations need to be able to monitor and control access to manage various aspects of their cloud spending.

In this post, we explore the Amazon Bedrock baseline architecture and how you can secure and control network access to your various Amazon Bedrock capabilities within AWS network services and tools. We discuss key design considerations, such as using Amazon VPC Lattice auth policies, Amazon Virtual Private Cloud (Amazon VPC) endpoints, and AWS Identity and Access Management (IAM) to restrict and monitor access to your Amazon Bedrock capabilities.

By the end of this post, you will have a better understanding of how to configure your AWS landing zone to establish secure and controlled network connectivity to Amazon Bedrock across your organization using VPC Lattice.

Solution overview

Addressing the aforementioned challenges requires a well-designed network architecture and security controls. For this, we use the standard AWS Landing Zone Accelerator networking configuration. It provides a good starting point for managing network communication across multiple accounts. On top of the AWS Landing Zone Accelerator network design, we add two shared accounts.

In this solution design, we create a centralized architecture for managing organization AI capabilities across different accounts. The architecture consists of three main parts that work together to provide secure and controlled access to AI services:

Service network account – This account serves as the central networking hub for the organization, managing network connectivity and access policies. Through this account, network administrators can centrally manage and control access to AI services across the organization. The account follows AWS Landing Zone Accelerator networking practices that scale with enterprise organizational needs.
Generative AI account – This account hosts the organization’s Amazon Bedrock capabilities and serves as the central point for AI/ML management. The organization’s AI/ML scientists and prompt engineers will centrally build and manage Amazon Bedrock capabilities. The account provides access to various large language models (LLMs) through Amazon Bedrock by using VPC interface endpoints, while also enabling centralized monitoring of cost consumption and access patterns.
Workload accounts (dev, test, prod) – These accounts represent different environments where teams develop and deploy applications that consume AI services. Through secure network connections established through the service network account, these workload accounts can access the AI capabilities hosted in the generative AI account. This separation enforces proper isolation between development, testing, and production workloads while maintaining secure access to AI services.

Amazon Bedrock baseline architecture in an AWS landing zone

The following diagram illustrates the solution architecture.

The service network account has its own VPC Lattice service network—a centralized networking construct that enables service-to-service communication across your organization, which is shared with workload accounts using AWS Resource Access Manager (AWS RAM) to enable VPC Lattice Service network sharing.

Workload accounts (dev, test, prod) establish VPC associations with the shared VPC Lattice service network by creating a service network association in their VPC. When an application in these accounts makes a request, it first queries the VPC resolver for DNS resolution. The resolver routes the traffic to the VPC Lattice service network.

Access control is implemented through an VPC Lattice auth policy. The service network policies determine which accounts can access the VPC Lattice service network, and service-level policies control access to specific AI services and define what actions each account can perform.

In the central AI services account, we find the proxy layer, we create a VPC Lattice service that points to a proxy layer, which acts as a single entry point, providing workload accounts access to Amazon Bedrock. This proxy layer then connects to Amazon Bedrock through VPC endpoints. Through this setup, the AI team can configure which foundation models (FMs) are available and manage access permissions for different workload accounts. After the necessary policies and connections are in place, workload accounts can access Amazon Bedrock capabilities through the established secure pathway. This setup enables secure, cross-account access to AI services while maintaining centralized control and monitoring.

Network components

We use VPC Lattice, which is a fully managed application networking service that helps you simplify network connectivity, security, and monitoring for service-to-service communication needs. With VPC Lattice, organizations can achieve a centralized connectivity pattern to control and monitor access to the services required for building generative AI applications.

For details about VPC Lattice, refer to the Amazon VPC Lattice User Guide. The following is an overview of the constructs you can use in setting up the centralized pattern in this solution:

VPC Lattice service network – You can use the VPC Lattice service network to provide central connectivity and security to the central AI services account. The service network is a logical grouping mechanism that simplifies how you can enable connectivity across VPCs or accounts, and apply common security policies for application communication patterns. You can create a service network in an account and share it with other accounts within or outside AWS Organizations using AWS RAM.
VPC Lattice service – In a service network, you can associate a VPC Lattice service, which consists of a listener (protocol and port number), routing rules that allow for control of the application flow (for example, path, method, header-based, or weighted routing), and target group, which defines the application infrastructure. A service can have multiple listeners to meet various client capabilities. Supported protocols include HTTP, HTTPS, gRPC, and TLS. The path-based routing allows control to various high-performing FMs and other capabilities you would need to build a generative AI application.
Proxy layer – You use a proxy layer for the VPC Lattice service target group. The proxy layer can be built based on your organization’s preference of AWS services, such as AWS Lambda, AWS Fargate, or Amazon Elastic Kubernetes Service (Amazon EKS). The purpose of the proxy layer is to provide a single entry point to access LLMs, knowledge bases, and other capabilities that are tested and approved according to your organization’s compliance requirements.
VPC Lattice auth policies – For security, you use VPC Lattice auth policies. VPC Lattice auth policies are specified using the same syntax as IAM policies. You can apply an auth policy to VPC Lattice service network as well as to the VPC Lattice service.
Fully Qualified Domain Names –To facilitate service discovery, VPC Lattice supports custom domain names for your services and resources, and maintains a Fully Qualified Domain Name (FQDN) for each VPC Lattice service and resource you define. You can use these FQDNs in your Amazon Route 53 private hosted zone configurations, and empower business units or teams to discover and access services and resources.
Service network VPC – Business units or teams can access generative AI services in a service network using service network VPC associations or a service network VPC endpoint.
Monitoring – You can choose to enable monitoring at the VPC Lattice service network level and VPC Lattice service level. VPC Lattice generates metrics and logs for requests and responses, making it more efficient to monitor and troubleshoot applications

The preceding guidance takes a “secure by default” approach—you must be explicit about which features, models, and so on should be accessed by which business unit. The setup also enables you to implement a defense-in-depth strategy at multiple layers of the network:

The first level of defense is that business team needs to connect to the service network in order to get access to the generative AI service through the central AI service account.
The second level includes network-level security protections in the business team’s VPC for the service network, such as security groups and network access control lists (ACLs). By using these, you can allow access to specific workloads or teams in a VPC.
The third level is through the VPC Lattice auth policy, which you can apply at two layers: at the service network level to allow authenticated requests within the organization, and at the service level to allow access to specific models and features.

VPC Lattice auth policy

This solution makes it possible to centrally manage access to Amazon Bedrock resources across your organization. This approach uses an VPC Lattice auth policy to centrally control Amazon Bedrock resources and manage it from one location across all the organization accounts.

Typically, the auth policy on the service network is operated by the network or cloud administrator. For example, allowing only authenticated requests from specific workloads or teams in your AWS organization. In the following example, access is granted to invoke the generated AI service for authenticated requests and to principals that are part of the o-123456example organization:

{
   "Version": "2012-10-17",
   "Statement": [
      {
         "Effect": "Allow",
         "Principal": "*",
         "Action": "vpc-lattice-svcs:Invoke",
         "Resource": "*",
         "Condition": {
            "StringEquals": {
               "aws:PrincipalOrgID": [ 
                  "o-123456example"
               ]
            }
         }
      }
   ]
}

The auth policy at the service level is managed by the central AI service team to set fine-grained controls, which can be more restrictive than the coarse-grained authorization applied at the service network level. For example, the following policy restricts access to claude-3-haiku for only business-team1:

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect": "Allow",
         "Principal": {
            "AWS": [
               "arn:aws:iam::<account-number>:role/businss-team1"
            ]
         },
         "Action": "vpc-lattice-svcs:Invoke",
         "Resource": [
            "arn:aws:vpc-lattice:<aws-region>:<account-number>:service/svc-0123456789abcdef0/*"
         ],
         "Condition": {
            "StringEquals": {
               "vpc-lattice-svcs:RequestQueryString/modelid": "claude-3-haiku"            }
         }
      }
   ]
}

Monitoring and tracking

This design employs three monitoring approaches, using Amazon CloudWatch, AWS CloudTrail, and VPC Lattice access logs. This strategy provides a view of service usage, security, and performance.

CloudWatch metrics offer real-time monitoring of VPC Lattice service performance and usage. CloudWatch tracks metrics such as request counts and response times for Amazon Bedrock related endpoints, allowing for the setup of alarms for proactive management of service health and capacity. This enables monitoring of overall usage patterns of Amazon Bedrock models across different business units, facilitating capacity planning and resource allocation. CloudTrail provides detailed API-level auditing of Amazon Bedrock related actions. It logs cross-account access attempts and interactions with Amazon Bedrock services, providing a compliance and security audit trail. This tracking of who is accessing which Amazon Bedrock models, when, and from which accounts helps organizations adhere to their organizational policies.VPC Lattice access logs provide detailed insights into HTTP/HTTPS requests to Amazon Bedrock services, capturing specific usage patterns of AI models by different business teams. These logs contain client-specific information, which for example can be used to enable organizations to implement capabilities such as charge-back models. This allows for accurate attribution of AI service usage to specific teams or departments, facilitating fair cost allocation and responsible resource utilization across the organization. These services work together to enhance security, optimize performance, and provide valuable insights for managing cross-account Amazon Bedrock access.

Conclusion

In this post, we explored the importance of securing and controlling network access to Amazon Bedrock capabilities within an organization’s AWS landing zone. We discussed the key business challenges, such as the need to protect sensitive information in Amazon Bedrock knowledge bases, limit access to AI models, and optimize cloud costs by monitoring and controlling Amazon Bedrock capabilities. To address these challenges, we outlined a multi-layered network solution that uses AWS networking services, including a VPC Lattice auth policy to restrict and monitor access to Amazon Bedrock capabilities. Try out this solution for your own use case, and share your feedback in the comments.

About the authors

Secure access to a cross-account Amazon MSK cluster from Amazon MSK Connect using IAM authentication

2025-06-19 Venkata Sai Mahesh Swargam

Post Syndicated from Venkata Sai Mahesh Swargam original https://aws.amazon.com/blogs/big-data/secure-access-to-a-cross-account-amazon-msk-cluster-from-amazon-msk-connect-using-iam-authentication/

Amazon Managed Streaming for Apache Kafka (MSK) Connect is a fully managed, scalable, and highly available service that enables the streaming of data between Apache Kafka and other data systems. Amazon MSK Connect is built on top of Kafka Connect, an open-source framework that provides a standard way to connect Kafka with external data systems. Kafka Connect supports a variety of connectors, which are used to stream data in and out of Kafka. MSK Connect extends the capabilities of Kafka Connect by providing a managed service with added security features, straightforward configuration, and automatic scaling capabilities, enabling businesses to focus on their data streaming needs without the overhead of managing the underlying infrastructure.

In some use cases, you might need to use an MSK cluster in one AWS account, but MSK Connect is located in a separate account. In this post, we demonstrate how to create a connector to achieve this use case. At the time of writing, MSK Connect connectors can be created only for MSK clusters that have AWS Identity and Access Management (IAM) role-based authentication or no authentication. We demonstrate how to implement IAM authentication after establishing network connectivity. IAM provides enhanced security measures, making sure your systems are protected against unauthorized access.

Solution overview

The connector can be configured for a variety of purposes, such as sinking data to an Amazon Simple Storage Service (Amazon S3) bucket, tracking the source database changes, or serving as a migration tool such as MirrorMaker2 on MSK Connect to transfer data from a source cluster to a target cluster this is located in a different account.

The following diagram illustrates a use case using Debezium and Amazon S3 source connectors.

The following diagram illustrates using S3 Sink and migration to a cross-account failover cluster using a MirrorMaker connector deployed on MSK Connect.

Currently MSK Connect connectors can be created only for MSK clusters which have IAM role-based authentication or no authentication. In this blog, I’ll guide you through the essential steps for implementing the industry-recommended IAM (Identity and Access Management) authentication after establishing network connectivity. IAM provides enhanced security measures, ensuring your systems are protected against unauthorized access.

The launch of multi-VPC private connectivity (powered by AWS PrivateLink) and cluster policy support for MSK clusters simplifies the connectivity of Kafka clients to brokers. By enabling this feature on the MSK cluster, you can use the cluster-based policy to manage all access control centrally in one place. In this post, we cover the process of enabling this feature on the source MSK cluster.

We don’t fully utilize the multi-VPC connectivity provided by this new feature because that requires you to use different bootstrap URLs with port numbers (14001:3) that are not supported by MSK Connect as of writing of this post. We explore a secure network connectivity solution that uses private connectivity patterns, as detailed in How Goldman Sachs builds cross-account connectivity to their Amazon MSK clusters with AWS PrivateLink.

Connecting to a cross-account MSK cluster from MSK Connect involves the following steps.

Steps to configure the MSK cluster in Account A:

Enable the multi-VPC private connectivity(Private Link) feature for IAM authentication scheme that is enabled for your MSK cluster.
Configure the cluster policy to allow a cross-account connector.
Implement one of the preceding network connectivity patterns according to your use case to establish the connectivity with the Account B VPC and make network changes accordingly.

Steps to configure the MSK connector in Account B:

Create an MSK connector in private subnets using the AWS Command Line Interface (AWS CLI).
Verify the network connectivity from Account A and make network changes accordingly.
Check the destination service to verify the incoming data.

Prerequisites

To follow along with this post, you should have an MSK cluster in one AWS account and MSK Connect in a separate account.

Set up the MSK cluster setup in Account A:

In this post, we only show the important steps that are required to enable the multi-VPC feature on an MSK cluster:

Create a provisioned MSK cluster in Account A’s VPC with the following considerations, which are required for the multi-VPC feature:
- Cluster version must be 2.7.1 or higher.
- Instance type must be m5.large or higher.
- Authentication should be IAM (you must not enable unauthenticated access for this cluster).
After you create the cluster, go to the Networking settings section of your cluster and choose Edit. Then choose Turn on multi-VPC connectivity.

Select IAM role-based authentication and choose Turn on selection.

It might take around 30 minutes to enable. This step is required to enable the cluster policy feature that allows the cross-account connector to access the MSK cluster.

After it has been enabled, scroll down to Security settings and choose Edit cluster policy.
Define your cluster policy and choose Save changes.

The new cluster policy allows for defining a Basic or Advanced cluster policy. With the Basic option, it only allows CreateVPCConnection, GetBootstrapBrokers, DescribeCluster, and DescribeClusterV2 actions that are required for creating the cross-VPC connectivity to your cluster. However, we have to use Advanced to allow more actions that are required by the MSK Connector. The policy should be as follows:

{

    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {
            "AWS": "Connector-AccountId"
        },
        "Action": [
            "kafka:CreateVpcConnection",
            "kafka:GetBootstrapBrokers",
            "kafka:DescribeCluster",
            "kafka:DescribeClusterV2",
            "kafka-cluster:Connect",
            "kafka-cluster:DescribeCluster",
            "kafka-cluster:ReadData",
            "kafka-cluster:DescribeTopic",
            "kafka-cluster:WriteData",
            "kafka-cluster:CreateTopic",
            "kafka-cluster:AlterGroup",
            "kafka-cluster:DescribeGroup"
        ],
"Resource": [
                "arn:aws:kafka:<region>:<Cluster-AccountId>:cluster/<cluster-name>/<uuid>",
                "arn:aws:kafka:<region>:<Cluster-AccountId>:topic/<cluster-name>/<uuid>/<specific-topic-name>",
                "arn:aws:kafka:<region>:<Cluster-AccountId>:group/<cluster-name>/<uuid>/<specific-group-name>"
            ]
    }]
}

You might need to modify the preceding permissions to limit access to your resources (topics, groups). Also, you can restrict access to a specific connector by giving the connector IAM role, or you can mention the account number to allow the connectors in that account.

Now the cluster is ready. However, you need to make sure of the network connectivity between the cross-account connector VPC and the MSK cluster VPC.

If you’re using VPC peering or Transit Gateway while connecting to MSK Connect either from cross-account or the same account, do not configure your connector to reach the peered VPC resources with IPs in the following CIDR ranges (for more details, see Connecting from connectors):

10.99.0.0/16
192.168.0.0/16
172.21.0.0/16

In the MSK cluster security group, make sure you allowed port 9098 from Account B network resources and make changes in the subnets according to your network connectivity pattern.

Set up the MSK connector in Account B:

In this section, we demonstrate how to use the S3 Sink connector. However, you can use a different connector according to your use case and make the changes accordingly.

Create an S3 bucket (or use an existing bucket).
Make sure that the VPC that you’re using in this account has a security group and private subnets. If your connector for MSK Connect needs access to the internet, refer to Enable internet access for Amazon MSK Connect.
Verify the network connectivity between Account A and Account B by using the telnet command to the broker endpoints with port 9098.
Create an S3 VPC endpoint.
Create a connector plugin according to your connector plugin provider (confluent or lenses). Make a note of the custom plugin Amazon Resource Name (ARN) to use in a later step.

Create an IAM role for your connector to allow access to your S3 bucket and the MSK cluster.

The IAM role’s trust relationship should be as follows:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "kafkaconnect.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

Add the following S3 access policy to your IAM role:

{
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "s3:ListAllMyBuckets",
            "s3:ListBucket",
            "s3:GetBucketLocation",
            "s3:DeleteObject",
            "s3:PutObject",
            "s3:GetObject",
            "s3:AbortMultipartUpload",
            "s3:ListMultipartUploadParts",
            "s3:ListBucketMultipartUploads"
        ],
        "Resource": [
                        "arn:aws:s3:::<destination-bucket>",
                     "arn:aws:s3:::<destination-bucket>/*"
        ],
           "Condition": {
        "StringEquals": {
               "aws:SourceVpc": "vpc-xxxx"
               }
               }
    }]
}

The following policy contains the required actions by the connector:

{
"Version": "2012-10-17",
"Statement": [
   {
        "Effect": "Allow",
        "Action": [
            "kafka-cluster:Connect",
            "kafka-cluster:DescribeCluster",
            "kafka-cluster:ReadData",
            "kafka-cluster:DescribeTopic",
            "kafka-cluster:WriteData",
            "kafka-cluster:CreateTopic",
            "kafka-cluster:AlterGroup",
            "kafka-cluster:DescribeGroup"
        ],
        "Resource": [
            "arn:aws:kafka:<region>:<Cluster-AccountId>:cluster/<cluster-name>/<uuid>",
            "arn:aws:kafka:<region>:<Cluster-AccountId>:topic/<cluster-name>/<uuid>/<specific-topic-name>",
            "arn:aws:kafka:<region>:<Cluster-AccountId>:group/<cluster-name>/<uuid>/<specific-group-name>"
        ]
    }
]
}

You might need to modify the preceding permissions to limit access to your resources (topics, groups)

Finally, it’s time to create the MSK connector. Because the Amazon MSK console doesn’t allow viewing MSK clusters in other accounts, we show you how to use the AWS CLI instead. We also use basic Amazon S3 configuration for testing purposes. You might need to modify the configuration according to your connector’s use case.

Create a connector using the AWS CLI with the following command with the required parameters of the connector, along with Account A’s MSK cluster broker endpoints:

aws kafkaconnect create-connector \
--capacity "autoScaling={maxWorkerCount=2,mcuCount=1,minWorkerCount=1,scaleInPolicy={cpuUtilizationPercentage=10},scaleOutPolicy={cpuUtilizationPercentage=80}}" \
--connector-configuration \
"connector.class=io.confluent.connect.s3.S3SinkConnector, \
s3.region=<region>, \
schema.compatibility=NONE, \
flush.size=2, \
tasks.max=1, \
topics=<MSK-Cluster-topic>, \
security.protocol=SASL_SSL, \
s3.compression.type=gzip, \
format.class=io.confluent.connect.s3.format.json.JsonFormat, \
sasl.mechanism=AWS_MSK_IAM, \
sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required, \
sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler, \
value.converter=org.apache.kafka.connect.storage.StringConverter, \
storage.class=io.confluent.connect.s3.storage.S3Storage, \
s3.bucket.name=<s3-bucket-name>, \
timestamp.extractor=Record, \
key.converter=org.apache.kafka.connect.storage.StringConverter" \
--connector-name "Connector-name" \
--kafka-cluster '{"apacheKafkaCluster": {"bootstrapServers": "<broker-strings>:9098","vpc": {"securityGroups": ["sg-0b36a015789f859a3"],"subnets": ["subnet-07950da1ebb8be6d8","subnet-026a729668f3f9728"]}}}' \
--kafka-cluster-client-authentication "authenticationType=IAM" \
--kafka-cluster-encryption-in-transit "encryptionType=TLS" \
--kafka-connect-version "2.7.1" \
--log-delivery workerLogDelivery='{cloudWatchLogs={enabled=true,logGroup="<MSKConnect-log-group-name>"}}' \
--plugins "customPlugin={customPluginArn=<Custom-Plugin-ARN>,revision=1}" \
--service-execution-role-arn "<IAM-role-ARN>"

After you create the connector, connect the producer to your topic and insert data into it. In the following code, we use a Kafka client to insert data for testing purposes:
```
bin/kafka-console-producer.sh --broker-list <broker-string> --producer.config client.properties --topic <topic-name>
```

If everything is set up correctly, you should see the data in your destination S3 bucket. If not, check the troubleshooting tips in the following section.

Troubleshooting tips

After deploying the connector, if it’s in the CREATING state on the connector details page, access the Amazon CloudWatch log group specified in your connector creation request. Review the logs for any errors. If no errors are found, wait for the connector to complete its creation process.

Additionally, make sure the IAM roles have their required permissions, and check the security groups and NACLs for proper connectivity between VPCs.

Clean up

When you’re done testing this solution, clean up any unwanted resources to avoid ongoing charges

Conclusion

In this post, we demonstrated how to create an MSK connector when you need to use an MSK cluster in one AWS account, but MSK Connect is located in a separate account. This architecture includes an S3 Sink connector for demonstration purposes, but it can accommodate other types of sink and source connectors. Additionally, this architecture focuses solely on IAM authenticated connectors. If an unauthenticated connector is desired, the multi-VPC connectivity (PrivateLink) and cluster policy components can be ignored. The remaining process, which involves creating a network connection between the account VPCs, remains the same.

Try out the solution for yourself, and let us know your questions and feedback in the comments section.

Check out more AWS Partners or contact an AWS Representative to learn how we can help accelerate your business.

About the Author

Venkata Sai Mahesh Swargam is a Cloud Engineer at AWS in Hyderabad. He specializes in Amazon MSK and Amazon Kinesis services. Mahesh is dedicated to helping customers by providing technical guidance and solving issues related to their Amazon MSK architectures. In his free time, he enjoys being with family and traveling around the world.

Build a multi-Region analytics solution with Amazon Redshift, Amazon S3, and Amazon QuickSight

2025-06-19 Donatas Kuchalskis

Post Syndicated from Donatas Kuchalskis original https://aws.amazon.com/blogs/big-data/build-a-multi-region-analytics-solution-with-amazon-redshift-amazon-s3-and-amazon-quicksight/

Organizations increasingly face complex requirements balancing regional data sovereignty with global analytics needs. Regulatory frameworks like GDPR, HIPAA, and local data protection laws often mandate storing data in specific geographic regions, and business operations require global teams to access and analyze this data efficiently.

This post explores how to effectively architect a solution that addresses this specific challenge: enabling comprehensive analytics capabilities for global teams while making sure that your data remains in the AWS Regions required by your compliance framework. We use a variety of AWS services, including Amazon Redshift, Amazon Simple Storage Service (Amazon S3), and Amazon QuickSight.

It’s important to note that this solution focuses primarily on data residency (where data is stored) and not on preventing data from being in transit between Regions. Organizations with strict data transit restrictions might need additional controls beyond what’s covered here. We show how you can configure AWS across Regions to help meet business needs and regulatory requirements simultaneously.

Cross-Region architecture requirements

Before implementing a cross-Region solution, it’s important to understand when this approach is actually necessary. Although single-Region deployments offer simplicity and cost advantages, several specific business and regulatory scenarios warrant a cross-Region approach:

Data sovereignty and residency requirements – When regulations like GDPR, HIPAA, or local data sovereignty laws require data to remain in specific geographic boundaries while still enabling global analytics capabilities
Global operations with local compliance – When your organization operates globally, but needs to adhere to regional compliance frameworks while maintaining unified analytics
Performance optimization for global users – When your organization needs to optimize analytics performance for users in different geographic areas while centralizing data governance
Enhanced business continuity – When your analytics capabilities need higher availability and Regional redundancy to support mission-critical business processes

Use case: Financial services analytics with Regional data residency

Consider a financial services company with the following business and regulatory requirements:

Data residency requirement – All customer financial data must remain in the Bahrain Region (me-south-1) to comply with local financial regulations.
Global analytics capability – The organization’s data science team operates from European offices and needs to access and analyze the financial data without moving it out of its mandated storage Region.
Advanced analytics requirements – Business leaders need interactive data exploration and natural language query capabilities to derive insights from financial data.
Performance requirement – Specific dashboard queries require subsecond response times for both local executives and the global management team.

This specific combination of requirements can’t be met with a single-Region deployment. Let’s explore how to architect a solution.

Solution overview

The following architecture is designed to address the specific challenge of using QuickSight in one Region while maintaining data in another Region.

As shown in the architecture diagram, data engineers based in Bahrain (me-south-1) work with local data, whereas data engineers in Stockholm (eu-north-1) and analysts in Ireland (eu-west-1) can securely access the same data through Redshift datashares and virtual private cloud (VPC) peering connections. This approach maintains data residency in me-south-1 while enabling global access.

The solution consists of the following key components:

Primary data Region (me-south-1):
- Redshift cluster (primary data repository)
- S3 buckets for data lake storage
- Private and public subnets with appropriate security controls
- Data must remain in this Region for compliance reasons
Analytics services Region (eu-west-1):
- QuickSight deployment
- Cross-Region VPC peering connection to the primary Region
- Data access using Redshift datashares (no data replication)
Data engineering Region (eu-north-1):
- Redshift consumer cluster for data engineering workloads
- Data access using Redshift datashares from me-south-1
- Makes it possible for data engineering teams in eu-north-1 to access and work with data while maintaining compliance

Before implementing this architecture, evaluate whether:

Your requirements actually necessitate a cross-Region approach
The performance impact is acceptable for your use case
The additional cost is justified by your business requirements

For most analytics workloads, a single-Region architecture remains the recommended approach for simplicity, performance, and cost-effectiveness. Consider cross-Region architectures only when specific business and compliance requirements make them necessary.

Establish cross-Region network connectivity: Amazon Redshift to QuickSight

The foundation of a cross-Region solution is secure, reliable network connectivity. VPC peering provides a straightforward approach for connecting VPCs across Regions. To implement VPC peering in Amazon Virtual Private Cloud (Amazon VPC), complete the following steps:

Create a new VPC in the secondary Region (eu-west-1):
1. Open the Amazon VPC console in the eu-west-1 Region.
2. Choose Create VPC.
3. Set IPv4 CIDR block to 172.32.0.0/16 (verify there is no overlap with the primary Region VPC).
4. Select Auto-generate to create subnets automatically within this new VPC.
5. Leave other settings as default and choose Create VPC.

Set up VPC peering:
1. On the Amazon VPC console, choose Peering connections in the navigation pane and choose Create peering connection.
2. Select the new eu-west-1 VPC as the requester.
3. For Select another VPC to peer with, select My account and Another Region.
4. Choose the primary Region (me-south-1) and enter the VPC ID.
5. Choose Create peering connection.

Accept the VPC peering connection:
1. Switch to the primary Region on the Amazon VPC console.
2. Choose Peering connections in the navigation pane and select the pending connection.
3. On the Actions dropdown menu, choose Accept request.

Update the route tables:
1. On the secondary Region Amazon VPC console, choose Route tables in the navigation pane.
2. Choose the route table for the new VPC.
3. Choose Edit routes and add a new route:
  - Destination: Primary Region VPC CIDR (e.g., 172.31.0.0/16).
  - Target: Choose the peering connection.
4. On the primary Region Amazon VPC console, repeat the process, adding a route to the secondary Region VPC CIDR (172.32.0.0/16) using the peering connection.

Configure security groups:
1. On the secondary Region Amazon VPC console, choose Security groups in the navigation pane and create a new security group.
2. Add an outbound rule:
  - Type: Custom TCP
  - Port range: 5439
  - Destination: Primary Region VPC CIDR
3. On the primary Region Amazon VPC console, locate the Redshift cluster’s security group.
4. Add an inbound rule:
  - Type: Custom TCP
  - Port range: 5439
  - Source: Secondary Region VPC CIDR

Configure DNS settings:
1. On the Amazon VPC console for both Regions, choose Your VPCs in the navigation pane.
2. Select each VPC, and on the Actions dropdown menu, choose Edit DNS hostnames.
3. Select Enable DNS resolution and Enable DNS hostnames.

Implement cross-Region data sharing

Rather than replicating data, which could create compliance issues, you can use Redshift datashares to provide secure, read-only access to data across Regions. Complete the following steps to set up your datashares:

Create producer datashares in the primary Region:

On the Amazon Redshift console, choose Query editor v2 in the navigation pane to connect to your primary Region Redshift cluster (me-south-1).

Run the following commands:

-- In Primary Region Redshift

CREATE DATASHARE datashare_1;
ALTER DATASHARE datashare_1 ADD SCHEMA analytics;
ALTER DATASHARE datashare_1 ADD TABLE analytics.customers;
ALTER DATASHARE datashare_1 ADD TABLE analytics.transactions;

-- Grant usage permissions
	
GRANT USAGE ON DATASHARE datashare_1 TO ACCOUNT '123456789012';

Create a consumer database in the secondary Region:

Connect to your secondary Region Redshift cluster (eu-west-1) using the query editor and run the following commands:

-- In Secondary Region Redshift

CREATE DATABASE consumer_db FROM DATASHARE datashare_1 OF ACCOUNT '123456789012'REGION 'me-south-1';

Verify the datashare configuration with the following code:

-- In Secondary Region Redshift

SELECT * FROM SVV_DATASHARE_CONSUMERS;
SELECT * FROM SVV_DATASHARE_OBJECTS;

This approach maintains data residency in the primary Region while enabling analytics access from another Region, addressing the core challenge of Regional service limitations. For our financial services company example, this makes sure that customer financial data remains in Bahrain (me-south-1) while making it securely accessible to the data science team in Europe (eu-west-1).

Configure QuickSight in the analytics Region

With network connectivity and data sharing established, complete the following steps to configure QuickSight to securely access the Redshift data:

Set up a QuickSight VPC connection:
1. Open the QuickSight console in the secondary Region.
2. Choose Manage QuickSight, VPC connections, and Add VPC connection.
3. Configure the connection:
  - Name: Enter a name (for example, Cross-Region-Connection).
  - VPC: Choose the secondary Region VPC.
  - Subnet: Choose the automatically created subnets.
  - Security group: Choose the security group created for cross-Region access.

Add a QuickSight IP range to the data source security group:
1. Open the Amazon Elastic Compute Cloud (Amazon EC2) console in the primary Region.
2. Choose Security groups in the navigation pane and find the security group for your data source.
3. Edit the inbound rules.
4. Add a new rule:
  - Type: HTTPS (443)
  - Protocol: TCP
  - Port range: 443
  - Source: QuickSight IP range for the secondary Region (for example, 52.210.255.224/27 for eu-west-1).

QuickSight IP ranges can change over time. Refer to AWS Regions, websites, IP address ranges, and endpoints for current IP ranges.

Create a QuickSight data source:
1. On the QuickSight console, choose Datasets in the navigation pane.
2. Choose New dataset, then choose Redshift.
3. Configure the connection:
  - Data source name: Enter a descriptive name.
  - Connection type: Choose the VPC connection.
  - Database server: Enter the Redshift cluster endpoint from the primary Region.
  - Port: 5439
  - Database name: Enter the consumer database name.
  - Username and Password: Enter credentials (consider using AWS Secrets Manager).
4. Choose Validate connection to test.
5. Choose Create data source.

Verify the connection and create datasets:
1. Choose the schema and tables from the consumer database.
2. Configure appropriate refresh schedules.
3. Create calculations and visualizations as needed.

Performance considerations for cross-Region analytics

When implementing a cross-Region analytics architecture, be aware of the following performance implications:

Query performance impact – Cross-Region queries can experience higher latency than single-Region queries. To mitigate this, consider the following:
- Use SPICE for QuickSight – Import frequently-used datasets into SPICE (Super-fast, Parallel, In-memory Calculation Engine) to help avoid repeated cross-Region queries. SPICE is the QuickSight in-memory engine that enables fast, interactive visualizations by precomputing and storing datasets locally in the QuickSight Region.
- Implement efficient query patterns – Minimize the amount of data transferred between Regions.
- Use appropriate caching – Enable result caching where possible.
- Monitoring cross-Region performance – Implement monitoring to identify and address performance issues:
  - Set up Amazon CloudWatch metrics to track cross-Region query performance
  - Create dashboards to visualize latency trends
  - Establish performance baselines and alerts for degradation

Security considerations

Maintaining security in a cross-Region architecture requires additional attention:

Network security:
- Limit VPC peering connections to only necessary VPCs
- Implement restrictive security groups that allow only required traffic
- Consider using VPC endpoints for service access when possible
Data access controls:
- Use AWS Identity and Access Management (IAM) policies consistently across Regions
- Implement fine-grained access controls in Redshift datashares
- Enable audit logging in relevant Regions
Compliance monitoring:
- Implement AWS CloudTrail in all Regions
- Create centralized logging for cross-Region activities
- Regularly review cross-Region access patterns

Cost implications

Before implementing a cross-Region architecture, consider these cost factors:

Data transfer costs – Data transfer between Regions incurs charges
Additional infrastructure – You might need Redshift clusters in multiple Regions
VPC peering costs – Data transfer costs are associated with VPC peering
Operational overhead – Managing multi-Region deployments requires additional resources
Workload-based sizing – You should size each Regional Redshift cluster according to the specific workloads it will handle

Conclusion

The cross-Region architecture described in this post addresses specific challenges related to Regional compliance requirements and global analytics needs, particularly in the following scenarios:

Your data must remain in a specific Region for compliance reasons
You have teams in different Regions who need to access and analyze this data
Different user groups have distinct workload requirements

The datasharing capabilities of Amazon Redshift and Regional storage options in Amazon S3 are key enablers of this solution, allowing data to remain in the required Region while still being accessible for analytics across Regions. However, it’s worth emphasizing that this architecture supports data storage in specific Regions but doesn’t prevent data from traveling between Regions during processing. Organizations concerned about data transit restrictions should evaluate additional controls to address those specific requirements. Combined with secure VPC peering connections and QuickSight visualizations, this architecture creates a complete solution that satisfies both compliance requirements and business needs.

For our financial services example, this architecture successfully enables the company to keep its customer financial data in Bahrain while providing seamless analytics capabilities to the European data science team and delivering interactive dashboards to global business leaders.

For more information, refer to Building a Cloud Security Posture Dashboard with Amazon QuickSight. For hands-on experience, explore the Amazon QuickSight Workshops. Visit the Amazon Redshift console or Amazon QuickSight console to start building your first dashboard, and explore our AWS Big Data Blog for more customer success stories and implementation patterns

Try out this solution for your own use case, and share your thoughts in the comments.

About the Authors

Donatas Kuchalskis is a Cloud Operations Architect at AWS, based in London, focusing on Financial Services customers in the UK. He helps customers optimize their AWS environments for cost, security, and resiliency while providing strategic cloud guidance. Prior to this role, he served as a Prototyping Architect specializing in Big Data and as a Specialist Solutions Architect for Retail. Before joining AWS, Donatas spent 6 years as a technical consultant in the retail sector.

Jumana Nagaria is a Prototyping Architect at AWS. She builds innovative prototypes with customers to solve their business challenges. She is passionate about cloud computing and data analytics. Outside of work, Jumana enjoys travelling, reading, painting, and spending quality time with friends and family.

Introducing the new console experience for AWS WAF

2025-06-17 Harith Gaddamanugu

Post Syndicated from Harith Gaddamanugu original https://aws.amazon.com/blogs/security/introducing-the-new-console-experience-for-aws-waf/

Protecting publicly facing web applications can be challenging due to the constantly evolving threat landscape. You must defend against sophisticated threats, including zero-day vulnerabilities, automated events, and changing compliance requirements. Navigating through consoles and selecting the protections best suited to your use case can be complicated, requiring not only security expertise but also a deep understanding of the applications you want to protect.

Today, we’re excited to share that AWS WAF has launched a new console experience designed to simplify security configuration. This enhanced interface provides an integrated security solution that provides comprehensive protection across your application landscape, featuring a streamlined single-page workflow that can reduce deployment time from hours to minutes. AWS WAF now provides pre-configured rule packs for specific workloads, automated security recommendations, and a unified dashboard for clear visibility into security status. The new console also offers flexible protection levels and integrated partner solutions to help you scale security operations effectively.

In this post, we walk you through the features of the enhanced console experience for AWS WAF. You can now select your workload type to automatically apply expert-curated protection packs, helping to maintain comprehensive protection across your workloads.

Reduce web application security implementation steps by 80 percent

The simplified interface consolidates the configuration steps into a single page, reducing the need to navigate through multiple pages. This consolidation creates an intuitive workflow that significantly reduces setup complexity. You can complete configurations that previously took hours within minutes, representing an 80 percent reduction in implementation time.

The new experience offers pre-configured security defaults and documentation, helping users of all technical abilities to effectively secure applications. When starting the setup process, you select the application category and feature focus to protect an API or web application, and AWS WAF automatically customizes the protection parameters accordingly.

AWS WAF now also integrates with AWS Marketplace through a dedicated page for direct deployment of partner solutions.

This streamlined approach, shown in Figure 1, transforms what was previously a complex, time-consuming process into a straightforward, efficient workflow that maintains robust security standards while dramatically reducing implementation time and potential configuration errors.

Figure 1: Workload specific protection

Simplified AWS WAF rule deployment through protection packs

AWS WAF simplifies security deployment through three configuration options, each based on AWS security expertise and ready for immediate implementation. These protection packs provide comprehensive configuration optimized for various application types.

Recommended: Enables recommended protections for the selected application categories and traffic sources
Essentials: Enables essential protections for the selected application categories and traffic sources
You build it: Select and customize protections from the available options to fit your needs

Implementation requires a single step: selecting the appropriate protection pack based on your security needs, as shown in Figure 2. This approach minimizes complex configurations while maintaining security standards.

Figure 2: Choosing a rule package during AWS WAF onboarding

Real-time monitoring, automated recommendations, and protection activity visualization

The new console features an outcome-driven dashboard, shown in Figure 3, that simplifies security monitoring by consolidating threat detection metrics, rule performance analytics, and actionable insights into a single view. Security teams can now respond to threats faster without having to navigate through multiple screens.

Figure 3: Resources and protections dashboard

AWS Threat Intelligence enhances monitoring capabilities by analyzing two weeks of allowed traffic patterns to proactively identify potential vulnerabilities. The service examines critical traffic dimensions including IP reputation, distributed denial of service (DDoS) attacks, bot activities, anonymous IP sources, and vulnerable application traffic. When AWS WAF detects vulnerabilities, it recommends specific AWS Managed Rules to strengthen your security posture, as shown in Figure 4.

Figure 4: Automated recommendations dashboard

The new real-time monitoring dashboard, as shown in Figure 5, offers comprehensive security insights at a glance. It displays request counts for traffic within your specified time range and summarizes protection rules with their corresponding termination actions. A standout feature is the Sankey visualization, which maps protection activity to AWS WAF actions, helping security teams track traffic flow, identify rule interaction patterns, and optimize configurations.

Figure 5: Summary and protection activity visualization dashboard

Outcome driven dashboards for actionable insights

The new dashboard, shown in Figure 6, displays security metrics based on business impact. The left panel contains four main categories: traffic, bot, rule, and CAPTCHA characteristics. The right panel shows metrics within each category. Under Traffic characteristics, you can view metrics by countries, attack types, and device types.

Figure 6: New, outcome-driven dashboard

Bot characteristics include detection ratio, categories, token information, signals, session thresholds, IP token reuse thresholds, and coordinated activity. Rule characteristics highlight the top 10 managed rules, while CAPTCHA characteristics show solved, abandoned, and unsolved puzzle metrics. This organized structure helps you analyze data effectively, starting with category selection and progressing to rule evaluation. You can combine categories and metrics to visualize patterns, investigate incidents, and make informed decisions. Interactive features, such as IP blocking on hover, enable immediate action. The new console experience allows you to quickly identify threats, optimize WAF rules, and maintain effective security controls based on specific needs.

Query logs from the console

The new AWS WAF console features an integrated log explorer. You can access two analysis options from the bottom of the new console page: sampled requests for immediate review or log explorer for detailed analysis. The log explorer, shown in Figure 7, requires Amazon CloudWatch logging and provides pre-built filters for rule actions and time frames, enabling quick pattern identification and trend analysis. If CloudWatch logging isn’t enabled, you can still analyze security events through sampled requests.

Figure 7: New log explorer view for traffic analysis

Availability and pricing

Starting today, these dashboards are available by default in the AWS WAF console at no additional cost to you and require no additional setup to be completed.

Customer success

Early adopters are already seeing significant benefits. Sarah Chen, Security Engineer at National Retail Corporation, reports: “With the new console, we configured protection for our PHP applications in under 10 minutes. The rule recommendations helped us identify and block sophisticated attacks targeting our applications, reducing our security incidents by 60 percent in the first month.”

Conclusion

By combining intelligent monitoring, guided configuration, and straightforward access to specialized solutions, our enhanced console helps organizations to maintain robust security postures without the complexity traditionally associated with advanced security management.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

How AWS improves active defense to empower customers

2025-06-17 Stephen Goodman

Post Syndicated from Stephen Goodman original https://aws.amazon.com/blogs/security/how-aws-improves-active-defense-to-empower-customers/

At AWS, security is the top priority, and today we’re excited to share work we’ve been doing towards our goal to make AWS the safest place to run any workload. In earlier posts on this blog, we shared details of our internal active defense systems, like MadPot (global honeypots), Mithra (domain graph neural network), and Sonaris (network mitigations). We’re still inventing new ways to improve the effectiveness of threat intelligence and automated response to detect and help prevent attacks. Today we’ll share advancements in active defense related to malware, software vulnerabilities, and AWS resource misconfigurations. Like the other posts we linked to, these are constantly improving capabilities that our customers get just for being on the AWS network. We’ll discuss these topics in more depth at re:inforce 2025 during Innovation Talk SEC302.

Stopping malware from spreading

Financially motivated threat actors try to gain access to a wide array of networked assets. The more resources they control, the more places they can hide, and the longer they can profit from their abusive operations. As such, threat actor malware often contains modules to scan for new targets, replicate binaries over the network, and then repeat. If left unchecked, such rapidly spreading behavior can lead to network congestion, service availability loss, and data destruction. We want to help prevent this behavior to the greatest degree possible.

One effective strategy we employ is identifying the threat actor’s key infrastructure where malware is centrally controlled. We use a variety of techniques to identify, verify, track, and disrupt threat infrastructure. Using network traffic logs, honeypot interactions, and malware samples from an array of sensor positions, we mitigate botnets, abusive proxies, and peer-to-peer malware. Over the past 12 months, AWS helped prevent over 4 million malware infection attempts across 315 thousand distinct Amazon Elastic Compute Cloud (Amazon EC2) instances. By protecting workloads from these malware infections, we not only protect our network and our customers, but also the broader internet from further malware expansion.

Advancements in threat hunting and mitigating software vulnerabilities

At Amazon, we’re proud to support software vulnerability research with programs for bug bounty, vulnerability disclosure, and open source contribution. We’ve also become a more active participant in the CVE process by becoming a CVE Numbering Authority (CNA) for the software and services provided by Amazon. Thanks to the public CVE database, we see vulnerability research accelerating as reported CVEs have grown by 21 percent year-over-year since 2013, with over 40 thousand CVEs published in 2024. This virtuous cycle of finding and resolving vulnerabilities improves cyber security over time, but AWS sees threat actors searching for unresolved vulnerabilities to gain unauthorized access to resources.

We’ve expanded MadPot and Sonaris to identify and stop a broader range of malicious vulnerability scanning and exploitation activity, protecting every AWS customer from vulnerability exposure. We’ve added hundreds of new detections and MadPot service emulations to identify real attacks. As we’ve expanded our visibility, we’ve continued blocking hundreds of millions of CVE exploit attempts daily across the AWS network.

As we’ve made these active defense systems better at stopping CVE exploit attacks, the total number of attacks has gone down by over 55 percent in the last 12 months, as shown in Figure 1. There are many factors outside our control in this observation, but we’re happy to see fewer CVE exploit attacks. This trend coincides with the detection, regionalization, latency, and guardrail improvements we’ve made in 2025. No system can block everything, so fewer exploit attempts mean less risk across a wide range of workloads.

Figure 1: Chart showing the decrease in global malicious vulnerability exploit attempts

This work to identify known exploits in the wild directly benefits users of vulnerability intelligence in Amazon Inspector, which provides an Amazon Inspector score for customers to prioritize where to spend security hardening resources. This includes the most recent date of observed exploitation attempts, the MITRE ATT&CK techniques associated with the exploit activity, and the industries targeted.

Protecting architectures built on AWS

AWS actively defends compute and network resources for our customers; we also defend the distinct AWS-native resources that customers rely on. AWS access key credentials are a critical resource that allow access to customer accounts. The AWS Identity and Access Management best practices share proven techniques for customers to keep their credentials from being abused. Through active defense, we do even more to help customers who haven’t yet adopted these best practices.

Each day, AWS helps prevent an average of 167 million malicious scanning connections seeking unintentionally exposed AWS access key pairs. In case access keys are discovered through other means, we’ve expanded our protection of customer-managed IAM credentials. When our threat intelligence analytics show that a customer-managed credential is known by a threat actor, we put mitigations in place to restrict access to highly privileged operations. We also send customized notifications to help customers identify how the credential was exposed. These efforts are paying off for our customers every day; the following response is a good example of what we hear regularly:

This is a key that we already rotated a few weeks ago based on another alarm from you. It turned out that the new rotated key happened to be in your second alarm to us. So it meant that the app that the key was linked to was still leaking it.

So on Monday we sat down with the dev team, found where the app was leaking some secrets from, we patched it, I rotated all the exposed secrets (it was more than the IAM key) and we plugged in the extra security in the app.

So thanks again for those alerts, they are very precious.
– AWS Customer

In a specific case of threat activity in November and December of 2024, customers reported ransomware activity against their objects in Amazon Simple Storage Service (Amazon S3) storage. We saw that these ransom threats were highly correlated with exposed customer-managed IAM keys. We applied quarantines to the exposed keys, taking care to make sure that normal customer operations could continue safely. We re-sent our proactive notifications to customers about keys that were likely exposed, because the risk of an attack was elevated. During this period, we worked together with customers to deactivate over 30 thousand exposed credentials. Since this threat activity began, AWS has helped prevent over 943 million malicious attempts to encrypt customer Amazon S3 objects.

These credential exposure detections flow into Amazon GuardDuty Extended Threat Detection, simplifying threat detection and response operations for modern cloud environments.

Better together

The approach AWS takes to active defense shows how security can be improved by layering protections across the infrastructure stack and using threat intelligence to drive risk reduction. By building active defense into our services at no extra cost, AWS helps our customers stay protected from a wide range of threats.

While we continue to constantly improve our protections for our customers, some of our work is by nature probabilistic, because we never see inside customer workloads. We don’t apply active defense in situations where the detection is ambiguous, because that might impact our customers’ production systems. To stay secure, customers should never let down their own defenses. AWS security services like AWS Identity and Access Management (IAM), AWS Shield Advanced, AWS WAF, AWS Network Firewall, Amazon GuardDuty, and Amazon Inspector provide prevention, detection, and response that customers can configure for their unique needs. The good news is that by working together, we’re making the internet safer for everyone.

If you have feedback about this post, submit comments in the Comments section below.

Reduce time to access your transactional data for analytical processing using the power of Amazon SageMaker Lakehouse and zero-ETL

2025-06-16 Avijit Goswami

Post Syndicated from Avijit Goswami original https://aws.amazon.com/blogs/big-data/reduce-time-to-access-your-transactional-data-for-analytical-processing-using-the-power-of-amazon-sagemaker-lakehouse-and-zero-etl/

As the lines between analytics and AI continue to blur, organizations find themselves dealing with converging workloads and data needs. Historical analytics data is now being used to train machine learning models and power generative AI applications. This shift requires shorter time to value and tighter collaboration among data analysts, data scientists, machine learning (ML) engineers, and application developers. However, the reality of scattered data across various systems—from data lakes to data warehouses and applications—makes it difficult to access and use data efficiently. Moreover, organizations attempting to consolidate disparate data sources into a data lakehouse have historically relied on extract, transform, and load (ETL) processes, which have become a significant bottleneck in their data analytics and machine learning initiatives. Traditional ETL processes are often complex, requiring significant time and resources to build and maintain. As data volumes grow, so do the costs associated with ETL, leading to delayed insights and increased operational overhead. Many organizations find themselves struggling to efficiently onboard transactional data into their data lakes and warehouses, hindering their ability to derive timely insights and make data-driven decisions. In this post, we address these challenges with a two-pronged approach:

Unified data management: Using Amazon SageMaker Lakehouse to get unified access to all your data across multiple sources for analytics and AI initiatives with a single copy of data, regardless of how and where the data is stored. SageMaker Lakehouse is powered by AWS Glue Data Catalog and AWS Lake Formation and brings together your existing data across Amazon Simple Storage Service (Amazon S3) data lakes and Amazon Redshift data warehouses with integrated access controls. In addition, you can ingest data from operational databases and enterprise applications to the lakehouse in near real-time using zero-ETL which is a set of fully-managed integrations by AWS that eliminates or minimizes the need to build ETL data pipelines.
Unified development experience: Using Amazon SageMaker Unified Studio to discover your data and put it to work using familiar AWS tools for complete development workflows, including model development, generative AI application development, data processing, and SQL analytics, in a single governed environment.

In this post, we demonstrate how you can bring transactional data from AWS OLTP data stores like Amazon Relational Database Service (Amazon RDS) and Amazon Aurora flowing into Redshift using zero-ETL integrations to SageMaker Lakehouse Federated Catalog (Bring your own Amazon Redshift into SageMaker Lakehouse). With this integration, you can now seamlessly onboard the changed data from OLTP systems to a unified lakehouse and expose the same to analytical applications for consumptions using Apache Iceberg APIs from new SageMaker Unified Studio. Through this integrated environment, data analysts, data scientists, and ML engineers can use SageMaker Unified Studio to perform advanced SQL analytics on the transactional data.

Architecture patterns for a unified data management and unified development experience

In this architecture pattern, we show you how to use zero-ETL integrations to seamlessly replicate transactional data from Amazon Aurora MySQL-Compatible Edition, an operational database, into the Redshift Managed Storage layer. This zero-ETL approach eliminates the need for complex data extraction, transformation, and loading processes, enabling near real-time access to operational data for analytics. The transferred data is then cataloged using a federated catalog in the SageMaker Lakehouse Catalog and exposed through the Iceberg Rest Catalog API, facilitating comprehensive data analysis by consumer applications.

You then use SageMaker Unified Studio, to perform advanced analytics on the transactional data bridging the gap between operational databases and advanced analytics capabilities.

Prerequisites

Make sure that you have the following prerequisites:

An AWS account with permission to create AWS Identity and Access Management (IAM) roles and policies
Administrator access to one or more of the following sources: Aurora MySQL-Compatible and Amazon Redshift
An AWS Identity and Access Management (IAM) user with an access key and secret key to configure the AWS Command Line Interface (AWS CLI). We recommend that you use temporary credentials using AWS IAM Identity Center as a security best practice for accessing your AWS accounts because these credentials are generated every time you sign in. See Getting IAM Identity Center user credentials for the AWS CLI or AWS SDKs steps to configure IAM Identity Center authentication.

Deployment steps

In this section, we share steps for deploying resources needed for Zero-ETL integration using AWS CloudFormation.

Setup resources with CloudFormation

This post provides a CloudFormation template as a general guide. You can review and customize it to suit your needs. Some of the resources that this stack deploys incur costs when in use. The CloudFormation template provisions the following components:

An Aurora MySQL provisioned cluster (source).
An Amazon Redshift Serverless data warehouse (target).
Zero-ETL integration between the source (Aurora MySQL) and target (Amazon Redshift Serverless). See Aurora zero-ETL integrations with Amazon Redshift for more information.

Create your resources

To create resources using AWS Cloudformation, follow these steps:

Sign in to the AWS Management Console.
Select the us-east-1 AWS Region in which to create the stack.
Open the AWS CloudFormation
Choose Launch Stack
Choose Next.
This automatically launches CloudFormation in your AWS account with a template. It prompts you to sign in as needed. You can view the CloudFormation template from within the console.
For Stack name, enter a stack name, for example UnifiedLHBlogpost.
Keep the default values for the rest of the Parameters and choose Next.
On the next screen, choose Next.
Review the details on the final screen and select I acknowledge that AWS CloudFormation might create IAM resources.
Choose Submit.

Stack creation can take up to 30 minutes.

After the stack creation is complete, go to the Outputs tab of the stack and record the values of the keys for the following components, which you will use in a later step:
- NamespaceName
- PortNumber
- RDSPassword
- RDSUsername
- RedshiftClusterSecurityGroupName
- RedshiftPassword
- RedshiftUsername
- VPC
- Workgroupname
- ZeroETLServicesRoleNameArn

Implementation steps

To implement this solution, follow these steps:

Setting up zero-ETL integration

A zero-ETL integration is already created as a part of CloudFormation template provided. Use the following steps from the Zero-ETL integration post to complete setting up the integration.:

Bring Amazon Redshift metadata to the SageMaker Lakehouse catalog

Now that transactional data from Aurora MySQL is replicating into Redshift tables through zero-ETL integration, you next bring the data into SageMaker Lakehouse, so that operational data can co-exist and be accessed and governed together with other data sources in the data lake. You do this by registering an existing Amazon Redshift Serverless namespace that has Zero-ETL tables as a federated catalog in SageMaker Lakehouse.

Before starting the next steps, you need to configure data lake administrators in AWS Lake Formation.

Go to the Lake Formation console and in the navigation pane, choose Administration roles and then choose Tasks under Administration. Under Data lake administrators, choose Add.
In the Add administrators page, under Access type, select Data Lake administrator.
Under IAM users and roles, select Admin. Choose Confirm.

Add AWS Lake Formation Administrators

On the Add administrators page, for Access type, select Read-only administrators. Under IAM users and roles, select AWSServiceRoleForRedshift and choose Confirm. This step enables Amazon Redshift to discover and access catalog objects in AWS Glue Data Catalog.

Add AWS Lake Formation Administrators 2

With the data lake administrators configured, you’re ready to bring your existing Amazon Redshift metadata to SageMaker Lakehouse catalog:

From the Amazon Redshift Serverless console, choose Namespace configuration in the navigation pane.
Under Actions, choose Register with AWS Glue Data Catalog. You can find more details on registering a federated Amazon Redshift catalog in Registering namespaces to the AWS Glue Data Catalog.

Choose Register. This will register the namespace to AWS Glue Data Catalog

After registration is complete, the Namespace register status will change to Registered to AWS Glue Data Catalog.
Navigate to the Lake Formation console and choose Catalogs New under Data Catalog in the navigation pane. Here you can see a pending catalog invitation is available for the Amazon Redshift namespace registered in Data Catalog.

Select the pending invitation and choose Approve and create catalog. For more information, see Creating Amazon Redshift federated catalogs.

Enter the Name, Description, and IAM role (created by the CloudFormation template). Choose Next.

Grant permissions using a principal that is eligible to provide all permissions (an admin user).
- Select IAM users and rules and choose Admin.
- Under Catalog permissions, select Super user to grant super user permissions.

Assigning super user permissions grants the user unrestricted permissions to the resources (databases, tables, views) within this catalog. Follow the principal of least privilege to grant users only the permissions required to perform a task wherever applicable as a security best practice.

As final step, review all settings and choose Create Catalog

After the catalog is created, you will see two objects under Catalogs. dev refers to the local dev database inside Amazon Redshift, and aurora_zeroetl_integration is the database created for Aurora to Amazon Redshift ZeroETL tables

Fine-grained access control

To set up fine-grained access control, follow these steps:

To grant permission to individual objects, choose Action and then select Grant.

On the Principals page, grant access to individual objects or more than one object to different principals under the federated catalog.

Access lakehouse data using SageMaker Unified Studio

SageMaker Unified Studio provides an integrated experience outside the console to use all your data for analytics and AI applications. In this post, we show you how to use the new experience through the Amazon SageMaker management console to create a SageMaker platform domain using the quick setup method. To do this, you set up IAM Identity Center, a SageMaker Unified Studio domain, and then access data through SageMaker Unified Studio.

Set up IAM Identity Center

Before creating the domain, makes sure that your data admins and data workers are ready to use the Unified Studio experience by enabling IAM Identity Center for single sign-on following the steps in Setting up Amazon SageMaker Unified Studio. You can use Identity Center to set up single sign-on for individual accounts and for accounts managed through AWS Organizations. Add users or groups to the IAM instance as appropriate. The following screenshot shows an example email sent to a user through which they can activate their account in IAM Identity Center.

Set up SageMaker Unified domain

Follow steps in Create a Amazon SageMaker Unified Studio domain – quick setup to set up a SageMaker Unified Studio domain. You need to choose the VPC that was created by the CloudFormation stack earlier.

The quick setup method also has a Create VPC option that sets up a new VPC, subnets, NAT Gateway, VPC endpoints, and so on, and is meant for testing purposes. There are charges associated with this, so delete the domain after testing.

If you see the No models accessible, you can use the Grant model access button to grant access to Amazon Bedrock serverless models for use in SageMaker Unified Studio, for AI/ML use-cases

Fill in the sections for Domain Name. For example, MyOLTPDomain. In the VPC section, select the VPC that was provisioned by the CloudFormation stack, for example UnifiedLHBlogpost-VPC. Select subnets and choose Continue.

In the IAM Identity Center User section, look up the newly created user from (for example, Data User1) and add them to the domain. Choose Create Domain. You should see the new domain along with a link to open Unified Studio.

Access data using SageMaker Unified Studio

To access and analyze your data in SageMaker Unified Studio, follow these steps:

1. Select the URL for SageMaker Unified Studio. Choose Sign in with SSO and sign in using the IAM user, for example datauser1, and you will be prompted to select a multi-factor authentication (MFA) method.
2. Select Authenticator App and proceed with next steps. For more information about SSO setup, see Managing users in Amazon SageMaker Unified Studio.After you have signed in to the Unified Studio domain, you need to set up a new project. For this illustration, we created a new sample project called MyOLTPDataProject using the project profile for SQL Analytics as shown here.A project profile is a template for a project that defines what blueprints are applied to the project, including underlying AWS compute and data resources. Wait for the new project to be set up, and when status is Active, open the project in Unified Studio.By default, the project will have access to the default Data Catalog (AWSDataCatalog). For the federated redshift catalog redshift-consumer-catalog to be visible, you need to grant permissions to the project IAM role using Lake Formation. For this example, using the Lake Formation console, we have granted below access to the demodb database that is part of the Zero-ETL catalog to the Unified Studio project IAM role. Follow steps in Adding existing databases and catalogs using AWS Lake Formation permissions.In your SageMaker Unified Studio Project’s Data section, connect to the Lakehouse Federated catalog that you created and registered earlier (for example redshift-zetl-auroramysql-catalog/aurora_zeroetl_integration). Select the objects that you want to query and execute them using the Redshift Query Editor integrated with SageMaker Unified Studio.If you select Redshift, you will be transferred to the Query editor where you can execute the SQL and see the results as shown in the following figure.

With this integration of Amazon Redshift metadata with SageMaker Lakehouse federated catalog, you have access to your existing Redshift data warehouse objects in your organizations centralized catalog managed by SageMaker Lakehouse catalog and join the existing Redshift data seamlessly with the data stored in your Amazon S3 data lake. This solution helps you avoid unnecessary ETL processes to copy data between the data lake and the data warehouse and minimize data redundancy.

You can further integrate more data sources serving transactional workloads such as Amazon DynamoDB and enterprise applications such as Salesforce and ServiceNow. The architecture shared in this post for accelerated analytical processing using Zero-ETL and SageMaker Lakehouse can be further expanded by adding Zero-ETL integrations for DynamoDB using DynamoDB zero-ETL integration with Amazon SageMaker Lakehouse and for enterprise applications by following the instructions in Simplify data integration with AWS Glue and zero-ETL to Amazon SageMaker Lakehouse

Clean up

When you’re finished, delete the CloudFormation stack to avoid incurring costs for some of the AWS resources used in this walkthrough incur a cost. Complete the following steps:

On the CloudFormation console, choose Stacks.
Choose the stack you launched in this walkthrough. The stack must be currently running.
In the stack details pane, choose Delete.
Choose Delete stack.
On the Sagemaker console, choose Domains and delete the domain created for testing.

Summary

In this post, you’ve learned how to bring data from operational databases and applications into your lake house in near real-time through Zero-ETL integrations. You’ve also learned about a unified development experience to create a project and bring in the operational data to the lakehouse, which is accessible through SageMaker Unified Studio, and query the data using integration with Amazon Redshift Query Editor. You can use the following resources in addition to this post to quickly start your journey to make your transactional data available for analytical processing.

About the authors

Avijit Goswami is a Principal Data Solutions Architect at AWS specialized in data and analytics. He supports AWS strategic customers in building high-performing, secure, and scalable data lake solutions on AWS using AWS managed services and open-source solutions. Outside of his work, Avijit likes to travel, hike in the San Francisco Bay Area trails, watch sports, and listen to music.

Saman Irfan is a Senior Specialist Solutions Architect focusing on Data Analytics at Amazon Web Services. She focuses on helping customers across various industries build scalable and high-performant analytics solutions. Outside of work, she enjoys spending time with her family, watching TV series, and learning new technologies.

Sudarshan Narasimhan is a Principal Solutions Architect at AWS specialized in data, analytics and databases. With over 19 years of experience in Data roles, he is currently helping AWS Partners & customers build modern data architectures. As a specialist & trusted advisor he helps partners build & GTM with scalable, secure and high performing data solutions on AWS. In his spare time, he enjoys spending time with his family, travelling, avidly consuming podcasts and being heartbroken about Man United’s current state.

AWS CIRT announces the launch of the Threat Technique Catalog for AWS

2025-06-13 Steve de Vera

Post Syndicated from Steve de Vera original https://aws.amazon.com/blogs/security/aws-cirt-announces-the-launch-of-the-threat-technique-catalog-for-aws/

Greetings from the AWS Customer Incident Response Team (AWS CIRT). AWS CIRT is a 24/7, specialized global Amazon Web Services (AWS) team that provides support to customers during active security events on the customer side of the AWS Shared Responsibility Model. We’re excited to announce the launch of the Threat Technique Catalog for AWS.

When the AWS CIRT assists customers with incident response during security investigations, we gather AWS service metadata on the types of tactics and techniques that threat actors have used against AWS customers. We use this information to build an internal dataset of indicators of compromise (IOCs) and threat patterns that provides insight into how threat actors are taking advantage of misconfigured AWS resources, overly permissive access, or the methods they use in attempting to achieve their objectives.

We capture this metadata and use it internally to continually improve AWS services to help make them more secure for our customers by making it more difficult for threat actors to perform unauthorized actions. For example, some of the metadata that the AWS CIRT has captured as a result of investigating security events where a threat actor has used the Amazon Bedrock service to consume tokens by invoking large language models (LLMs) has been used to supplement the Amazon GuardDuty IAM Anomalous Behavior finding. Earlier this year, the AWS CIRT identified an increase in data encryption events in Amazon S3 that used an encryption method known as server-side encryption using client-provided keys (SSE-C). AWS CIRT used the Threat Technique Catalog for AWS to classify the new techniques identified in these security events to communicate internally and with other Amazon security teams.

We’ve received feedback from AWS customers that information about the adversarial tactics, techniques, and procedures (TTPs) observed by the AWS CIRT would be valuable and helpful if made available to them, so they could use the information to configure their AWS resources more securely. Over the previous year, we’ve been working with MITRE to make these techniques and sub-techniques available to the global security community. As a result of this collaboration, MITRE has updated and added some of these techniques to MITRE ATT&CK® as part of their October 2024 update cycle (for example, Data Destruction: Lifecycle-Triggered Deletion).

“We greatly appreciated the insight AWS shared with us, and it inspired improvements to a number of techniques in the October release of MITRE ATT&CK. For ATT&CK to keep up with the latest threats, community contributions that benefit the ecosystem are needed, and we value AWS being a part of the ATT&CK community.”
–Adam Pennington, project lead, MITRE ATT&CK, MITRE

Companies, entities, and organizations use ATT&CK to help them understand, prioritize and protect against the threats to their on-premises environments, and we believe that taking advantage of an already existing framework to present these adversarial techniques will provide AWS customers and the global security community with the ability to identify and categorize threats on their AWS infrastructure the same way that the AWS CIRT does.

The Threat Technique Catalog for AWS—based on MITRE ATT&CK Cloud—extends these contributions and includes categories of adversarial techniques that are specific to AWS and have been observed by the AWS CIRT; in addition to information on ways to mitigate those techniques and how to detect them. For example, you can go to the Threat Technique Catalog for AWS, filter by the AWS services in your account, and review the content that will help make your environment more secure. The Getting Started section includes additional ways that you can use the Threat Technique Catalog for AWS. We will continue to update and provide additional changes to the Threat Technique Catalog for AWS to help guide you into making your AWS environment more secure and will continue collaborating with MITRE to advise them of new and trending threat actor techniques.

To get started, visit the Threat Technique Catalog for AWS.

If you have feedback about this post, submit comments in the Comments section below.

Introducing the AWS Security Champion Knowledge Path and digital badge

2025-06-12 Sarah Currey

Post Syndicated from Sarah Currey original https://aws.amazon.com/blogs/security/introducing-the-aws-security-champion-knowledge-path-and-digital-badge/

Today, Amazon Web Service (AWS) introduces the Security Champion Knowledge Path on AWS Skill Builder, featuring training and a digital badge. The Security Champion Knowledge path is a comprehensive educational framework designed to empower developers and software engineers with essential AWS cloud security knowledge and best practices. The structured learning path enables development teams to accelerate their delivery while maintaining robust security standards in cloud environments, addressing customers’ need for a structured curriculum to develop and validate security expertise across their organizations.

AWS Skill Builder logo

A new era of security education

The AWS Security Champion Knowledge Path complements the existing AWS security training offerings, providing a structured, self-paced journey to security expertise. Hart Rossman, Vice President of Global Services Security at AWS, emphasizes the program’s significance: “Security in the cloud isn’t a destination; it’s an ongoing journey. The AWS Security Champion Knowledge Path equips our customers with the tools and knowledge to navigate this journey confidently, fostering a culture where security is woven into every aspect of cloud operations.”

Designed for a diverse audience including software developers, solutions architects, technical leaders, and cloud practitioners, the training plan covers a wide range of topics that are critical for a strong security posture in the cloud. This AWS security learning journey begins with essential fundamentals and progressively builds toward advanced concepts across a well-structured curriculum. Starting with AWS Security Fundamentals and the AWS Shared Responsibility Model, learners establish core principles before diving into AWS Identity and Access Management (IAM), including detailed troubleshooting scenarios. The curriculum advances to critical security elements such as encryption and comprehensive threat modeling through the AWS Builders Workshop. Security governance and auditing form the next tier, followed by practical implementations of monitoring, alerting, and network infrastructure best practices. The learning path then covers specialized areas including web-facing workload protection, network control, and incident response procedures. The knowledge path culminates with deep dives into container security and core security concepts through AWS SimuLearn, providing hands-on experience with real-world scenarios. This carefully orchestrated progression helps facilitate a thorough understanding of AWS security principles while maintaining a practical, implementation-focused approach.

AWS SimuLearn logo

What is a Security Champion?

A Security Champion is a bridge between security teams and development teams, promoting security best practices and making sure that security is embedded into every stage of the development lifecycle. However, Security Champion isn’t just a role—it’s a mindset. In today’s distributed and agile cloud environments, having Security Champions across different teams provides a competitive advantage for releasing products quickly and securely.

This distributed ownership of security brings numerous benefits: faster development cycles because teams can address security requirements proactively, reduced security incidents through early detection, and improved collaboration between security and development teams. Most importantly, it creates a culture of security where every team member understands their role in protecting the organization’s assets and data.

By becoming a Security Champion, you’ll gain valuable expertise, earn recognized credentials, and develop leadership skills that can accelerate your career growth. Most importantly, you’ll be empowered to make meaningful contributions to your organization’s security posture by promoting best practices, identifying potential vulnerabilities early in the development cycle, and fostering collaboration between teams—ultimately helping your organization deliver products that are both innovative and secure.

How can I become an AWS Security Champion?

Security enthusiasts can enroll into the AWS Security Champion Knowledge Badge Readiness Path on AWS Skill Builder and complete the assessment successfully to earn the AWS Security Champion digital badge available on Credly.

AWS Security Champion training is a self-paced, hands-on, and interactive approach to upskilling on security concepts. As a participant, you’re introduced to security best practices, performing basic audits, planning for governance at scale, incident response concepts and more. You can engage with real-world scenarios through hands-on labs, interactive game-based learning, gain access to AWS sandbox environments, and conduct practical security assessments. This applied learning helps make sure that knowledge isn’t just acquired, but truly internalized and ready for immediate application.

“The AWS Security Champion Knowledge Path represents a significant milestone in democratizing security expertise. We’ve designed this program to transform how organizations approach security training, making it accessible, practical, and immediately applicable. This isn’t just about learning security concepts—it’s about creating a culture where security becomes second nature to every team member.”
– Jenni Troutman, Training and Certification Director at AWS

Recognition and community

Upon successfully completing the assessment in this training path, participants earn the prestigious AWS Security Champion knowledge badge in Credly to showcase their accomplishment, such as on LinkedIn, and join a growing community of security professionals. This recognition not only validates individual expertise but also signals an organization’s commitment to security excellence, and helps organizations identify qualified security champions within their team.

Getting started

To begin your journey to becoming an AWS Security Champion, log in or create an account with AWS Skill Builder and enroll in the Security Champion Knowledge Badge Readiness Path. The training plan is available through flexible pricing options, including individual subscriptions at $29 per month and team subscriptions at $449 per month with enterprise volume pricing available.

Rossman concludes, “The AWS Security Champion Knowledge Path represents a paradigm shift in how organizations approach security education. It’s about creating a shared language of security across teams, enabling faster, more secure development cycles, and ultimately, delivering better outcomes for our customers.”

Ready to elevate your organization’s security capabilities? Visit AWS Skill Builder to enroll and start your journey towards becoming an AWS Security Champion. For enterprise inquiries, reach out to your AWS account team.

Stay tuned to the AWS Security Blog for more updates on AWS Security services, features, and best practices. Together, we’re building a more secure cloud for all.

If you have feedback about this post, submit comments in the Comments section below.

How to use on-demand rotation for AWS KMS imported keys

2025-06-06 Jeremy Stieglitz

Post Syndicated from Jeremy Stieglitz original https://aws.amazon.com/blogs/security/how-to-use-on-demand-rotation-for-aws-kms-imported-keys/

Today, we’re announcing support for on-demand rotation of symmetric encryption AWS Key Management Service (AWS KMS) keys with imported key material (EXTERNAL origin). This new capability enables you to rotate the cryptographic key material of these keys without changing the key identifier (key ID or Amazon Resource Name (ARN)). Rotating keys helps you meet compliance requirements and security best practices that mandate periodic key rotation.

AWS KMS has long supported automatic key rotation for AWS KMS keys whose key material is generated by AWS KMS (AWS_KMS origin). Until now, AWS KMS customers who imported their own key material could not rotate that material without creating a new KMS key. This process called manual rotation required updating references to the older key identifiers. With today’s launch, the key ID of the imported key remains unchanged after rotation, so existing workloads are not disrupted. In this post, we tell you how the new capability works, look at key material expiry and deletion features unique to imported keys, and review pricing for this new feature.

How it works

When you create an AWS KMS key with EXTERNAL origin, AWS KMS assigns a fixed identifier to the key called the key ID. However, AWS KMS doesn’t generate key material for the cryptographic operations. You must import your own key material using the ImportKeyMaterial operation.

When you import key material, AWS KMS computes a unique key material identifier based on the key ID and the key material. Even if you import the same key material in different keys, AWS KMS will assign distinct key material identifiers. This computation uses a cryptographic hash so the key material identifier doesn’t reveal information about the key material itself. AWS KMS embeds this key material identifier in the ciphertext blob produced by symmetric encryption.

Until now, after you imported key material into an AWS KMS key, you could not import additional key material into that key to rotate the key. With this new feature launch, you can associate multiple imported key materials with a single, symmetric-encryption key. You can use the RotateKeyOnDemand operation to make the most recently imported key material the current key material. AWS KMS uses the current key material to generate new ciphertext. Unless deleted or expired, the other key materials remain available for decryption. When you present ciphertext for decryption, AWS KMS automatically selects the correct key material using the key material identifier embedded in the ciphertext.

To help improve the auditability that keys have rotated, we’ve added new identifiers in KMS API responses for the specific key material used. The KeyMaterialId is a new field that AWS KMS will return in addition to the KeyId. Similarly, the DescribeKey response for these keys now displays the identifier of the current key material as CurrentKeyMaterialId. The inclusion of the KeyMaterialId and CurrentKeyMaterialId in API responses makes key rotation more transparent.

Before we dive into the details, the following is an outline of the overall process to rotate an imported key:

Create a symmetric encryption KMS key with EXTERNAL origin
Import key material into the key using GetParametersForImport and ImportKeyMaterial APIs. The first key material becomes usable immediately. This part is unchanged and maintains backwards compatibility with the current behavior of AWS KMS.
Use the key to create ciphertext and decrypt it. You’ll notice the key material ID matches the CurrentKeyMaterialId displayed in the DescribeKey response.
When you want to rotate this key, import a second key material into the key. The ImportKeyMaterial API now has a new ImportType input parameter which lets you inform AWS KMS whether you are associating new key material with a key (--import-type NEW_KEY_MATERIAL) or re-importing previously associated key material (--import-type EXISTING_KEY_MATERIAL).
Use ListKeyRotations with --include-key-material ALL_KEY_MATERIAL to view both key materials. The key material state of the second key material will be PENDING_ROTATION.
Use the RotateKeyOnDemand operation to initiate on-demand key rotation.
Optionally, you can use the GetKeyRotationStatus operation to monitor the in-progress rotation. The response will contain OnDemandRotationStartDate only while the rotation is in progress.
Use ListKeyRotations with --include-key-material ALL_KEY_MATERIAL after rotation completes to view key materials associated with this key. The KeyMaterialState of the new key material you imported will change from PENDING_ROTATION to CURRENT. The key material state of the first key material will change from CURRENT to NON_CURRENT.
Use the key to create ciphertext and decrypt it. You’ll notice the CurrentKeyMaterialId is used for creating ciphertext, but the key material used for decryption is automatically determined by AWS KMS.

Using the AWS CLI for rotating an imported key

The following is a sample sequence of AWS KMS commands to exercise the import key rotation functionality using the AWS Command Line Interface (AWS CLI). The specific commands that follow work in Linux or MacOS environments and might need to be modified for use in a Windows environment. This functionality can also be exercised through the AWS SDKs. These operations, except for wrapping a key material for import, generate-data-key, and decrypt can be initiated in the AWS Management Console.

Step 1: Create a key and import key material

This section should be familiar to anyone who has used the existing import key functionality in AWS KMS.

Create a symmetric encryption key with EXTERNAL origin and save the key ARN. The initial state of this key is PendingImport.

export EXTERNAL_KEY1_ARN=$(aws kms create-key --origin EXTERNAL | tee /dev/stderr | jq -r .KeyMetadata.Arn)
{
    "KeyMetadata": {
        "AWSAccountId": "111122223333",
        "KeyId": "97c8e55d-5b04-45a1-8a01-1febde0dd041",
        "Arn": "arn:aws:kms:us-east-1:111122223333:key/97c8e55d-5b04-45a1-8a01-1febde0dd041",
        "CreationDate": "2025-05-08T13:55:04.605000-07:00",
        "Enabled": false,
        "Description": "",
        "KeyUsage": "ENCRYPT_DECRYPT",
        "KeyState": "PendingImport",
        "Origin": "EXTERNAL",
        "KeyManager": "CUSTOMER",
        "CustomerMasterKeySpec": "SYMMETRIC_DEFAULT",
        "KeySpec": "SYMMETRIC_DEFAULT",
        "EncryptionAlgorithms": [
            "SYMMETRIC_DEFAULT"
        ],
        "MultiRegion": false
    }
}

Generate a 256-bit (32-byte) key material to be imported. In the following command, we use OpenSSL to generate the imported key material.
```
openssl rand 32 > "KeyMaterial1.bin"
```

Use the get-parameters-for-import command to create the wrapping key and import token and save them to files. AWS KMS supports multiple wrapping algorithms; we use RSAES_OAEP_SHA_256 with a 4096-bit RSA key in the following example. The value of the ImportToken and PublicKey fields in the response has been trimmed for brevity.

export WRAPPING_ALGORITHM="RSAES_OAEP_SHA_256"
export WRAPPING_KEY_SPEC="RSA_4096"
export KMS_PARAMETERS_FOR_IMPORTED_KEY_MATERIAL1=$(aws kms get-parameters-for-import \
    --key-id "${EXTERNAL_KEY1_ARN}" \
    --wrapping-algorithm "${WRAPPING_ALGORITHM}" \
    --wrapping-key-spec "${WRAPPING_KEY_SPEC}" | tee /dev/stderr)
{
    "KeyId": "arn:aws:kms:us-east-1:111122223333:key/97c8e55d-5b04-45a1-8a01-1febde0dd041",
    "ImportToken": "AQECAHie/gkutEg+qLc1mscyCnHfNPS1aKVSxf6xd3PX05ny2wAADWIwgg1eBgkqhkiG9w0BBwaggg1PMIINSwIBADCCDUQGCSqGSIb3DQEHATAeBglghkgBZQMEAS4wEQQMn8AovXPs/ 
…
O22maj4+9nDAgjVkxZQWhjA2VHEyORnw8Qb29X8gvjwmE8xhlNWbnM1zZlDcClDzfJriLVuoXAO92HK1Vihs5hiE8/9tu6DegtXfp28WVIVTttnGjkjdVmChC7cg7yZhs4xu1LzN39LLyR9Q1O/9EQYHbwYXgp6tpMt2JyGhH/lRt2kJl+BPUEfKNsWkoj0Cq3Y=",
    "PublicKey": "MIICIjANBgkqhkiG9w0BAQEFAAOCAg8AMIICCgKCAgEAvXskfFFVoZVaZrEuN8UK3vckGww8tJHvsIxLmqKEbFt7D3dp/Jh8AoaVm2n/kdcGD5Hm8NwOQC40jsRjJKTqiTSr8/y8XEmpB0qi68EtKM8RyEmz7u1R5Vn6uoIZxKb/WMvVME/7tntyU4/uhVTGUlrfZItAV+
…
nuGzOptwcaprtT2iNthNZ38UQ2orETbfwdG6yZPt1qS8Jk/O0H+KtkcsNLHrDUE9XjQX4WDfkNbf/fyqaCLUpSTdONTLXqE16pVieLgrXr1845mECAwEAAQ==",
    "ParametersValidTo": "2025-05-09T13:56:45.632000-07:00"
}

export PUBLIC_KEY_FOR_IMPORTED_KEY_MATERIAL1=$(echo "${KMS_PARAMETERS_FOR_IMPORTED_KEY_MATERIAL1}" | jq -r .PublicKey) 
export IMPORT_TOKEN_FOR_IMPORTED_KEY_MATERIAL1=$(echo "${KMS_PARAMETERS_FOR_IMPORTED_KEY_MATERIAL1}" | jq -r .ImportToken) 

echo "${PUBLIC_KEY_FOR_IMPORTED_KEY_MATERIAL1}" | base64 -d > "WrappingKeyForKeyMaterial1.bin"
echo "${IMPORT_TOKEN_FOR_IMPORTED_KEY_MATERIAL1}" | base64 -d > "ImportTokenForKeyMaterial1.bin"

Wrap the key material using the wrapping key. We use OpenSSL, a popular open source cryptographic library to illustrate this step. For a detailed explanation of this step, see the AWS KMS Developer Guide.

openssl pkeyutl -encrypt -in "KeyMaterial1.bin" -out "WrappedKeyMaterial1.bin" -inkey "WrappingKeyForKeyMaterial1.bin" -keyform DER -pubin -encrypt -pkeyopt rsa_padding_mode:oaep -pkeyopt rsa_oaep_md:sha256 -pkeyopt rsa_mgf1_md:sha256

Import the key material. Optionally, you can assign a key material description. The description can be used to keep track of where the key material is durably maintained outside AWS KMS. This key material description is displayed alongside other information for this key material in the console and the ListKeyRotations API response. We also capture the key material ID from the response. In this example, the key material doesn’t expire. Optionally, you can set an expiration time.

export KEY_MATERIAL1_ID=$(aws kms import-key-material --key-id "${EXTERNAL_KEY1_ARN}" --encrypted-key-material fileb://WrappedKeyMaterial1.bin --import-token fileb://ImportTokenForKeyMaterial1.bin --expiration-model KEY_MATERIAL_DOES_NOT_EXPIRE --key-material-description "Q1 2025 key from HSM1" | tee /dev/stderr | jq .KeyMaterialId | tr -d \")
{
    "KeyId": "arn:aws:kms:us-east-1:111122223333:key/97c8e55d-5b04-45a1-8a01-1febde0dd041",
    "KeyMaterialId": "04f5cf50ee8f057f974b9820ab8fa8837d28363a85a8f8ef06a46c93adc5b241"
}

The key is now be enabled for use in cryptographic operations and the CurrentKeyMaterialId in the DescribeKey response should match ${KEY_MATERIAL1_ID}

echo "${KEY_MATERIAL1_ID}"
04f5cf50ee8f057f974b9820ab8fa8837d28363a85a8f8ef06a46c93adc5b241

aws kms describe-key --key-id "${EXTERNAL_KEY1_ARN}"
{
    "KeyMetadata": {
        "AWSAccountId": "111122223333",
        "KeyId": "97c8e55d-5b04-45a1-8a01-1febde0dd041",
        "Arn": "arn:aws:kms:us-east-1:111122223333:key/97c8e55d-5b04-45a1-8a01-1febde0dd041",
        "CreationDate": "2025-05-08T13:55:04.605000-07:00",
        "Enabled": true,
        "Description": "",
        "KeyUsage": "ENCRYPT_DECRYPT",
        "KeyState": "Enabled",
        "Origin": "EXTERNAL",
        "ExpirationModel": "KEY_MATERIAL_DOES_NOT_EXPIRE",
        "KeyManager": "CUSTOMER",
        "CustomerMasterKeySpec": "SYMMETRIC_DEFAULT",
        "KeySpec": "SYMMETRIC_DEFAULT",
        "EncryptionAlgorithms": [
            "SYMMETRIC_DEFAULT"
        ],
        "MultiRegion": false,
        "CurrentKeyMaterialId": "04f5cf50ee8f057f974b9820ab8fa8837d28363a85a8f8ef06a46c93adc5b241"
    }
}

Use ListKeyRotations to view key materials associated with the key. There should only be one key material with the same ID as in ${KEY_MATERIAL1_ID} and with a key material state of CURRENT.

aws kms list-key-rotations --key-id "${EXTERNAL_KEY1_ARN}" --include-key-material ALL_KEY_MATERIAL
{
    "Rotations": [
        {
            "KeyId": "arn:aws:kms:us-east-1:111122223333:key/97c8e55d-5b04-45a1-8a01-1febde0dd041",
            "KeyMaterialId": "04f5cf50ee8f057f974b9820ab8fa8837d28363a85a8f8ef06a46c93adc5b241",
            "KeyMaterialDescription": "First key material",
            "ImportState": "IMPORTED",
            "KeyMaterialState": "CURRENT",
            "ExpirationModel": "KEY_MATERIAL_DOES_NOT_EXPIRE"
        }
    ],
    "Truncated": false
}

Step 2: Use the first key material to create and decrypt ciphertext

This step demonstrates how to verify the key material ID of your imported key in cryptographic operations.

Use the GenerateDataKey operation and capture the ciphertext. This operation returns a data key in both plaintext and ciphertext form. The KeyMaterialId in the response matches the identifier for the first key material.

export KEY1_CIPHERTEXT_BLOB1=$(aws kms generate-data-key --key-id "${EXTERNAL_KEY1_ARN}" --number-of-bytes 32 | tee /dev/stderr | jq -r .CiphertextBlob)
{
    "CiphertextBlob": "AQIBAHgE9c9Q7o8Ff5dLmCCrj6iDfSg2OoWo+O8GpGyTrcWyQQFGzFweTmSjHkrflCzUMvY+AAAAfjB8BgkqhkiG9w0BBwagbzBtAgEAMGgGCSqGSIb3DQEHATAeBglghkgBZQMEAS4wEQQMdNV4KMSitkzp9j0pAgEQgDuXnjXR/KvDWjf3KfTAksjAyQJETPoC7zBq6ND2n2c7IUT/EUn+cbmBhXx5P/EI3l9drqsP/6NS5rRYSw==",
    "Plaintext": "3OQAOySTM3/E7QHH3a+GGzEz3HKS30rUcvUep+uueas=",
    "KeyId": "arn:aws:kms:us-east-1:111122223333:key/97c8e55d-5b04-45a1-8a01-1febde0dd041",
    "KeyMaterialId": "04f5cf50ee8f057f974b9820ab8fa8837d28363a85a8f8ef06a46c93adc5b241",
    "KeyOrigin": "EXTERNAL"
}

Decrypt the ciphertext and compare it to the plaintext key. The KeyMaterialId in the response matches the identifier for the first key material. The plaintext in decrypt response matches the plaintext data key in the GenerateDataKey response.

aws kms decrypt --ciphertext-blob "${KEY1_CIPHERTEXT_BLOB1}"
{
    "KeyId": "arn:aws:kms:us-east-1:111122223333:key/97c8e55d-5b04-45a1-8a01-1febde0dd041",
    "Plaintext": "3OQAOySTM3/E7QHH3a+GGzEz3HKS30rUcvUep+uueas=",
    "EncryptionAlgorithm": "SYMMETRIC_DEFAULT",
    "KeyMaterialId": "04f5cf50ee8f057f974b9820ab8fa8837d28363a85a8f8ef06a46c93adc5b241",
    "KeyOrigin": "EXTERNAL"
}

Step 3: Import a second key material into the key for on-demand rotation

Import key rotations start with importing new key material into this key.

Generate a second key material (also 256 bits in length).
```
openssl rand 32 > "KeyMaterial2.bin"
```

Use the get-parameters-for-import command to create the wrapping key and import token for the second key material to be imported. The value of the ImportToken and PublicKey fields in the response has been trimmed for brevity.

export WRAPPING_ALGORITHM="RSAES_OAEP_SHA_256"
export WRAPPING_KEY_SPEC="RSA_4096"
export KMS_PARAMETERS_FOR_IMPORTED_KEY_MATERIAL2=$(aws kms get-parameters-for-import --key-id "${EXTERNAL_KEY1_ARN}" --wrapping-algorithm "${WRAPPING_ALGORITHM}" --wrapping-key-spec "${WRAPPING_KEY_SPEC}" | tee /dev/stderr)
{
    "KeyId": "arn:aws:kms:us-east-1:111122223333:key/97c8e55d-5b04-45a1-8a01-1febde0dd041",
    "ImportToken": "AQECAHie/gkutEg+qLc1mscyCnHfNPS1aKVSxf6xd3PX05ny2wAADWAwgg1cBgkqhkiG9w0BBwagIINE0aPHP5vUwaR6nukZvp8aGKeKcN3EDCsWtTNEF2RJFLazEeHItB84viBqlNIPpf8gNcCad+ODRrCyZeAisqeyIOPRoNn+vS3KMHIpMjhBUsLF6yys7FJts7P+ncF9n1bmTC4qYln5BaZ9FoI1P4NikEGiDakG8rtVJM7jWKJqVipifZvhlY8EKM8hE8e7zqR3C13TQhJXec0rVaqsFxSyjX/hbKkY55mDw36xMJaK6G9F3Fi9Pg5fKc/oOG4gUGXKSSZBoL3Jt3ssr7ACPuzFfhMOaqQ/
… /mfaCPAJN5dH0cmppKiOYqA7RSMsx2/sPMWekVLh6j38RMqZHHIq67jos0p0h8EseWpEmI2mnbJ/yoWLY/DkOEami+6scIPgvyvVzRM9d2hR7BCd0/9kwZxIN6WX0",
    "PublicKey": "MIICIjANBgkqhkiG9w0BAQEFAAOCAg8AMIICCgKCAgEAiGzUXf7XWN5nN5GOo/5McrsIYfiLQrDjVdOen0v7QqQcQ04p+ozI5NDGLADYd0bLNn4cm2xrlxqUjTqrtW0U8k9A/5DurYprmNmF6M+Q+zj+EkR5XoLmcuLALifHXDMI0Af20L46NK6uVKYqDl7gP4xXOP+ka57NfHmWGWDGTOKUQNZSS0emBe0UGS
…
7AC0FuwndrShwNxiKEIWPi3NhJ70jkYCrle1mQqKHpsFe6bGKLvp1gdMWByy0fY9lOH9HfoQB0FonEDZjldWBn3YXtP/eUwrgyyCk/Po7i5pjA6cjf/mBTmwmVD/jBAxW45YHFaLaJGPillaQy0Xeu89Mdwq8uEh8AixwTKbQ9Jwk3RG5A33QFj6Qb67sCAwEAAQ==",
    "ParametersValidTo": "2025-05-09T14:10:08.028000-07:00"
}
export PUBLIC_KEY_FOR_IMPORTED_KEY_MATERIAL2=$(echo "${KMS_PARAMETERS_FOR_IMPORTED_KEY_MATERIAL2}" | jq -r .PublicKey)
export IMPORT_TOKEN_FOR_IMPORTED_KEY_MATERIAL2=$(echo "${KMS_PARAMETERS_FOR_IMPORTED_KEY_MATERIAL2}" | jq -r .ImportToken)

echo "${PUBLIC_KEY_FOR_IMPORTED_KEY_MATERIAL2}" | base64 -d > "WrappingKeyForKeyMaterial2.bin"
echo "${IMPORT_TOKEN_FOR_IMPORTED_KEY_MATERIAL2}" | base64 -d > "ImportTokenForKeyMaterial2.bin"

Wrap the second key material using the wrapping key.

openssl pkeyutl -encrypt -in "KeyMaterial2.bin" -out "WrappedKeyMaterial2.bin" -inkey "WrappingKeyForKeyMaterial2.bin" -keyform DER -pubin -encrypt -pkeyopt rsa_padding_mode:oaep -pkeyopt rsa_oaep_md:sha256 -pkeyopt rsa_mgf1_md:sha256

Import the second key material. Optionally, you can assign a key material description. We also capture the key material ID from the response. In this example, the key material doesn’t expire. Optionally, you can set an expiration time.

Note: This call will fail if you omit the import-type parameter or set it to EXISTING_KEY_MATERIAL. Specifying import-type as NEW_KEY_MATERIAL allows the API caller to associate additional key material with the imported key.

export KEY_MATERIAL2_ID=$(aws kms import-key-material --key-id "${EXTERNAL_KEY1_ARN}" --encrypted-key-material fileb://WrappedKeyMaterial2.bin --import-token fileb://ImportTokenForKeyMaterial2.bin --expiration-model KEY_MATERIAL_DOES_NOT_EXPIRE --key-material-description "Q2 2025 key from HSM1" --import-type NEW_KEY_MATERIAL | tee /dev/stderr | jq .KeyMaterialId | tr -d \")
{
    "KeyId": "arn:aws:kms:us-east-1:111122223333:key/97c8e55d-5b04-45a1-8a01-1febde0dd041",
    "KeyMaterialId": "b4749c3ad817379ca9b2264967fe791fcb3fe92a6a14ca06657c3b1e6eafeb4e"
}

Use ListKeyRotations to view key materials associated with the key. There should now be two key materials. The key material state of the second key material should be PENDING_ROTATION.

aws kms list-key-rotations --key-id "${EXTERNAL_KEY1_ARN}" --include-key-material ALL_KEY_MATERIAL
{
    "Rotations": [
        {
            "KeyId": "arn:aws:kms:us-east-1:111122223333:key/97c8e55d-5b04-45a1-8a01-1febde0dd041",
            "KeyMaterialId": "04f5cf50ee8f057f974b9820ab8fa8837d28363a85a8f8ef06a46c93adc5b241",
            "KeyMaterialDescription": "First key material",
            "ImportState": "IMPORTED",
            "KeyMaterialState": "CURRENT",
            "ExpirationModel": "KEY_MATERIAL_DOES_NOT_EXPIRE"
        },
        {
            "KeyId": "arn:aws:kms:us-east-1:111122223333:key/97c8e55d-5b04-45a1-8a01-1febde0dd041",
            "KeyMaterialId": "b4749c3ad817379ca9b2264967fe791fcb3fe92a6a14ca06657c3b1e6eafeb4e",
            "KeyMaterialDescription": "Second key material",
            "ImportState": "IMPORTED",
            "KeyMaterialState": "PENDING_ROTATION",
            "ExpirationModel": "KEY_MATERIAL_DOES_NOT_EXPIRE"
        }
    ],
    "Truncated": false
}

The CurrentKeyMaterialId in DescribeKey response should still be ${KEY_MATERIAL1_ID}.

echo "${KEY_MATERIAL1_ID}"
04f5cf50ee8f057f974b9820ab8fa8837d28363a85a8f8ef06a46c93adc5b241

aws kms describe-key --key-id "${EXTERNAL_KEY1_ARN}"
{
    "KeyMetadata": {
        "AWSAccountId": "111122223333",
        "KeyId": "97c8e55d-5b04-45a1-8a01-1febde0dd041",
        "Arn": "arn:aws:kms:us-east-1:111122223333:key/97c8e55d-5b04-45a1-8a01-1febde0dd041",
        "CreationDate": "2025-05-08T13:55:04.605000-07:00",
        "Enabled": true,
        "Description": "",
        "KeyUsage": "ENCRYPT_DECRYPT",
        "KeyState": "Enabled",
        "Origin": "EXTERNAL",
        "ExpirationModel": "KEY_MATERIAL_DOES_NOT_EXPIRE",
        "KeyManager": "CUSTOMER",
        "CustomerMasterKeySpec": "SYMMETRIC_DEFAULT",
        "KeySpec": "SYMMETRIC_DEFAULT",
        "EncryptionAlgorithms": [
            "SYMMETRIC_DEFAULT"
        ],
        "MultiRegion": false,
        "CurrentKeyMaterialId": "04f5cf50ee8f057f974b9820ab8fa8837d28363a85a8f8ef06a46c93adc5b241"
    }
}

Step 4: Use on-demand rotation to update the current key material

This step moves the current key material for this key to the newly imported key material.

Use the RotateKeyOnDemand operation to initiate an on-demand key rotation. If the key material in PENDING_ROTATION state is deleted or expires before initiating on-demand rotation, this operation will fail.
```
aws kms rotate-key-on-demand --key-id "${EXTERNAL_KEY1_ARN}"
{
    "KeyId": "arn:aws:kms:us-east-1:111122223333:key/97c8e55d-5b04-45a1-8a01-1febde0dd041"
}
```
AWS KMS uses a background worker to perform the rotation, so there’s a delay between initiating the on-demand key rotation and its completion. Use the GetKeyRotationStatus command to monitor the rotation status. Until the rotation is completed, the GetKeyRotationStatus response will include the OnDemandRotationStartDate field. When this field disappears, the on-demand key rotation is complete.
```
aws kms get-key-rotation-status --key-id "${EXTERNAL_KEY1_ARN}"
{
    "KeyRotationEnabled": false,
    "KeyId": "arn:aws:kms:us-east-1:111122223333:key/97c8e55d-5b04-45a1-8a01-1febde0dd041",
    "OnDemandRotationStartDate": "2025-05-08T14:12:23.096000-07:00"
}
```

Use ListKeyRotations when rotation completes. The second key material changes state from PENDING_ROTATION to CURRENT.

aws kms list-key-rotations --key-id "${EXTERNAL_KEY1_ARN}" --include-key-material ALL_KEY_MATERIAL
{
    "Rotations": [
        {
            "KeyId": "arn:aws:kms:us-east-1:111122223333:key/97c8e55d-5b04-45a1-8a01-1febde0dd041",
            "KeyMaterialId": "04f5cf50ee8f057f974b9820ab8fa8837d28363a85a8f8ef06a46c93adc5b241",
            "KeyMaterialDescription": "First key material",
            "ImportState": "IMPORTED",
            "KeyMaterialState": "NON_CURRENT",
            "ExpirationModel": "KEY_MATERIAL_DOES_NOT_EXPIRE"
        },
        {
            "KeyId": "arn:aws:kms:us-east-1:111122223333:key/97c8e55d-5b04-45a1-8a01-1febde0dd041",
            "KeyMaterialId": "b4749c3ad817379ca9b2264967fe791fcb3fe92a6a14ca06657c3b1e6eafeb4e",
            "KeyMaterialDescription": "Second key material",
            "ImportState": "IMPORTED",
            "KeyMaterialState": "CURRENT",
            "ExpirationModel": "KEY_MATERIAL_DOES_NOT_EXPIRE",
            "RotationDate": "2025-05-08T14:12:30.890000-07:00",
            "RotationType": "ON_DEMAND"
        }
    ],
    "Truncated": false
}

Rotation changes the current key material, which is reflected in the CurrentKeymaterialId field in the DescribeKey response. It should now match ${KEY_MATERIAL2_ID}.

echo "${KEY_MATERIAL1_ID}"
04f5cf50ee8f057f974b9820ab8fa8837d28363a85a8f8ef06a46c93adc5b241

echo "${KEY_MATERIAL2_ID}"
b4749c3ad817379ca9b2264967fe791fcb3fe92a6a14ca06657c3b1e6eafeb4e

aws kms describe-key --key-id "${EXTERNAL_KEY1_ARN}"
{
    "KeyMetadata": {
        "AWSAccountId": "111122223333",
        "KeyId": "97c8e55d-5b04-45a1-8a01-1febde0dd041",
        "Arn": "arn:aws:kms:us-east-1:111122223333:key/97c8e55d-5b04-45a1-8a01-1febde0dd041",
        "CreationDate": "2025-05-08T13:55:04.605000-07:00",
        "Enabled": true,
        "Description": "",
        "KeyUsage": "ENCRYPT_DECRYPT",
        "KeyState": "Enabled",
        "Origin": "EXTERNAL",
        "ExpirationModel": "KEY_MATERIAL_DOES_NOT_EXPIRE",
        "KeyManager": "CUSTOMER",
        "CustomerMasterKeySpec": "SYMMETRIC_DEFAULT",
        "KeySpec": "SYMMETRIC_DEFAULT",
        "EncryptionAlgorithms": [
            "SYMMETRIC_DEFAULT"
        ],
        "MultiRegion": false,
        "CurrentKeyMaterialId": "b4749c3ad817379ca9b2264967fe791fcb3fe92a6a14ca06657c3b1e6eafeb4e"
    }
}

Step 5: Use the second key material to create and decrypt ciphertext

Similar to Step 2, this step demonstrates how to verify that the key material ID of your imported key in cryptographic operations has been rotated.

Use the GenerateDataKey operation and capture the ciphertext. This operation returns a data key in both plaintext and ciphertext forms. Note that the KeyMaterialId returned in the response matches the identifier of the second key material.

echo "${KEY_MATERIAL2_ID}"
b4749c3ad817379ca9b2264967fe791fcb3fe92a6a14ca06657c3b1e6eafeb4e

export KEY1_CIPHERTEXT_BLOB2=$(aws kms generate-data-key --key-id "${EXTERNAL_KEY1_ARN}" --number-of-bytes 32 | tee /dev/stderr | jq -r .CiphertextBlob)
{
    "CiphertextBlob": "AQIBAHi0dJw62Bc3nKmyJkln/nkfyz/pKmoUygZlfDsebq/rTgGjuxQDA9WWHvqF6ZipVLZZAAAAfjB8BgkqhkiG9w0BBwagbzBtAgEAMGgGCSqGSIb3DQEHATAeBglghkgBZQMEAS4wEQQMdXVYuY1Sa/Jb/SvrAgEQgDtQUp2gClGPBOuxFO89oAZmmkDghEKM6rXgCeS5A/NCLyX7UPdcpsJG/cJAFdjPE9sCfWjOUXCv9JTWmQ==",
    "Plaintext": "acSB+a6W2TQ3dsb2c78yDaZLi9HcuHSubeQkFaAdl1Y=",
    "KeyId": "arn:aws:kms:us-east-1:111122223333:key/97c8e55d-5b04-45a1-8a01-1febde0dd041",
    "KeyMaterialId": "b4749c3ad817379ca9b2264967fe791fcb3fe92a6a14ca06657c3b1e6eafeb4e",
    "KeyOrigin": "EXTERNAL"
}

Decrypt the ciphertext and compare it to the plaintext key. The KeyMaterialId returned in the response matches the identifier of the second key material. The plaintext in decrypt response matches the plaintext data key in the previous GenerateDataKey response.

echo "${KEY_MATERIAL2_ID}"
b4749c3ad817379ca9b2264967fe791fcb3fe92a6a14ca06657c3b1e6eafeb4e

aws kms decrypt --ciphertext-blob "${KEY1_CIPHERTEXT_BLOB2}"
{
    "KeyId": "arn:aws:kms:us-east-1:111122223333:key/97c8e55d-5b04-45a1-8a01-1febde0dd041",
    "Plaintext": "acSB+a6W2TQ3dsb2c78yDaZLi9HcuHSubeQkFaAdl1Y=",
    "EncryptionAlgorithm": "SYMMETRIC_DEFAULT",
    "KeyMaterialId": "b4749c3ad817379ca9b2264967fe791fcb3fe92a6a14ca06657c3b1e6eafeb4e",
    "KeyOrigin": "EXTERNAL"
}

AWS KMS automatically uses the correct key material based on the ciphertext. When you decrypt the ciphertext produced in Step 2, AWS KMS uses the first key material.

echo "${KEY_MATERIAL1_ID}"
04f5cf50ee8f057f974b9820ab8fa8837d28363a85a8f8ef06a46c93adc5b241

aws kms decrypt --ciphertext-blob "${KEY1_CIPHERTEXT_BLOB1}"
{
    "KeyId": "arn:aws:kms:us-east-1:111122223333:key/97c8e55d-5b04-45a1-8a01-1febde0dd041",
    "Plaintext": "3OQAOySTM3/E7QHH3a+GGzEz3HKS30rUcvUep+uueas=",
    "EncryptionAlgorithm": "SYMMETRIC_DEFAULT",
    "KeyMaterialId": "04f5cf50ee8f057f974b9820ab8fa8837d28363a85a8f8ef06a46c93adc5b241",
    "KeyOrigin": "EXTERNAL"
}

Step 6: Delete key material, make the key unusable

With the launch of this feature, the DeleteImportedKeyMaterial operation takes an optional KeyMaterialId parameter. If no KeyMaterialId is specified, AWS KMS deletes the current key material. This maintains backwards compatibility with existing behavior.

Delete the first imported key material by specifying its identifier.

aws kms delete-imported-key-material --key-id "${EXTERNAL_KEY1_ARN}" --key-material-id "${KEY_MATERIAL1_ID}"
{
    "KeyId": "arn:aws:kms:us-east-1:111122223333:key/97c8e55d-5b04-45a1-8a01-1febde0dd041",
    "KeyMaterialId": "04f5cf50ee8f057f974b9820ab8fa8837d28363a85a8f8ef06a46c93adc5b241"
}

Deleting the key material causes the key state to change to PendingImport.

aws kms describe-key --key-id "${EXTERNAL_KEY1_ARN}"
{
    "KeyMetadata": {
        "AWSAccountId": "111122223333",
        "KeyId": "97c8e55d-5b04-45a1-8a01-1febde0dd041",
        "Arn": "arn:aws:kms:us-east-1:111122223333:key/97c8e55d-5b04-45a1-8a01-1febde0dd041",
        "CreationDate": "2025-05-08T13:55:04.605000-07:00",
        "Enabled": false,
        "Description": "",
        "KeyUsage": "ENCRYPT_DECRYPT",
        "KeyState": "PendingImport",
        "Origin": "EXTERNAL",
        "KeyManager": "CUSTOMER",
        "CustomerMasterKeySpec": "SYMMETRIC_DEFAULT",
        "KeySpec": "SYMMETRIC_DEFAULT",
        "EncryptionAlgorithms": [
            "SYMMETRIC_DEFAULT"
        ],
        "MultiRegion": false,
        "CurrentKeyMaterialId": "b4749c3ad817379ca9b2264967fe791fcb3fe92a6a14ca06657c3b1e6eafeb4e"
    }
}

ListKeyRotations also reflects the first key material as PendingImport.

aws kms list-key-rotations --key-id "${EXTERNAL_KEY1_ARN}" --include-key-material ALL_KEY_MATERIAL
{
    "Rotations": [
        {
            "KeyId": "arn:aws:kms:us-east-1:111122223333:key/97c8e55d-5b04-45a1-8a01-1febde0dd041",
            "KeyMaterialId": "04f5cf50ee8f057f974b9820ab8fa8837d28363a85a8f8ef06a46c93adc5b241",
            "KeyMaterialDescription": "First key material",
            "ImportState": "PENDING_IMPORT",
            "KeyMaterialState": "NON_CURRENT"
        },
        {
            "KeyId": "arn:aws:kms:us-east-1:111122223333:key/97c8e55d-5b04-45a1-8a01-1febde0dd041",
            "KeyMaterialId": "b4749c3ad817379ca9b2264967fe791fcb3fe92a6a14ca06657c3b1e6eafeb4e",
            "KeyMaterialDescription": "Second key material",
            "ImportState": "IMPORTED",
            "KeyMaterialState": "CURRENT",
            "ExpirationModel": "KEY_MATERIAL_DOES_NOT_EXPIRE",
            "RotationDate": "2025-05-08T14:12:30.890000-07:00",
            "RotationType": "ON_DEMAND"
        }
    ],
    "Truncated": false
}

Cryptographic operations fail with a KMSInvalidStateException when a key is in PendingImport state even though the key material required to decrypt the specific ciphertext blob is imported into AWS KMS.

aws kms decrypt --ciphertext-blob "${KEY1_CIPHERTEXT_BLOB2}"

An error occurred (KMSInvalidStateException) when calling the Decrypt operation: arn:aws:kms:us-east-1:111122223333:key/97c8e55d-5b04-45a1-8a01-1febde0dd041 is pending import.

Step 7: Reimport key material to enable the key

You need to re-import all expired or deleted key materials associated with a key for the key to become usable again.

Re-import the missing key material. For this example, we reused the wrapped key and import token we already have. This is an optimization. Optionally, you can get new parameters for wrapping and importing the key material.

aws kms import-key-material --key-id "${EXTERNAL_KEY1_ARN}" --encrypted-key-material fileb://WrappedKeyMaterial1.bin --import-token fileb://ImportTokenForKeyMaterial1.bin --expiration-model KEY_MATERIAL_DOES_NOT_EXPIRE
{
    "KeyId": "arn:aws:kms:us-east-1:111122223333:key/97c8e55d-5b04-45a1-8a01-1febde0dd041",
    "KeyMaterialId": "04f5cf50ee8f057f974b9820ab8fa8837d28363a85a8f8ef06a46c93adc5b241"
}


aws kms list-key-rotations --key-id "${EXTERNAL_KEY1_ARN}" --include-key-material ALL_KEY_MATERIAL
{
    "Rotations": [
        {
            "KeyId": "arn:aws:kms:us-east-1:111122223333:key/97c8e55d-5b04-45a1-8a01-1febde0dd041",
            "KeyMaterialId": "04f5cf50ee8f057f974b9820ab8fa8837d28363a85a8f8ef06a46c93adc5b241",
            "KeyMaterialDescription": "First key material",
            "ImportState": "IMPORTED",
            "KeyMaterialState": "NON_CURRENT",
            "ExpirationModel": "KEY_MATERIAL_DOES_NOT_EXPIRE"
        },
        {
            "KeyId": "arn:aws:kms:us-east-1:111122223333:key/97c8e55d-5b04-45a1-8a01-1febde0dd041",
            "KeyMaterialId": "b4749c3ad817379ca9b2264967fe791fcb3fe92a6a14ca06657c3b1e6eafeb4e",
            "KeyMaterialDescription": "Second key material",
            "ImportState": "IMPORTED",
            "KeyMaterialState": "CURRENT",
            "ExpirationModel": "KEY_MATERIAL_DOES_NOT_EXPIRE",
            "RotationDate": "2025-05-08T14:12:30.890000-07:00",
            "RotationType": "ON_DEMAND"
        }
    ],
    "Truncated": false
}

Use DescribeKey to validate the key is now enabled.

aws kms describe-key --key-id "${EXTERNAL_KEY1_ARN}"
{
    "KeyMetadata": {
        "AWSAccountId": "111122223333",
        "KeyId": "97c8e55d-5b04-45a1-8a01-1febde0dd041",
        "Arn": "arn:aws:kms:us-east-1:111122223333:key/97c8e55d-5b04-45a1-8a01-1febde0dd041",
        "CreationDate": "2025-05-08T13:55:04.605000-07:00",
        "Enabled": true,
        "Description": "",
        "KeyUsage": "ENCRYPT_DECRYPT",
        "KeyState": "Enabled",
        "Origin": "EXTERNAL",
        "ExpirationModel": "KEY_MATERIAL_DOES_NOT_EXPIRE",
        "KeyManager": "CUSTOMER",
        "CustomerMasterKeySpec": "SYMMETRIC_DEFAULT",
        "KeySpec": "SYMMETRIC_DEFAULT",
        "EncryptionAlgorithms": [
            "SYMMETRIC_DEFAULT"
        ],
        "MultiRegion": false,
        "CurrentKeyMaterialId": "b4749c3ad817379ca9b2264967fe791fcb3fe92a6a14ca06657c3b1e6eafeb4e"
    }
}

You can use the following commands to validate that the key works:

aws kms decrypt --ciphertext-blob "${KEY1_CIPHERTEXT_BLOB1}"
{
    "KeyId": "arn:aws:kms:us-east-1:111122223333:key/97c8e55d-5b04-45a1-8a01-1febde0dd041",
    "Plaintext": "3OQAOySTM3/E7QHH3a+GGzEz3HKS30rUcvUep+uueas=",
    "EncryptionAlgorithm": "SYMMETRIC_DEFAULT",
    "KeyMaterialId": "04f5cf50ee8f057f974b9820ab8fa8837d28363a85a8f8ef06a46c93adc5b241",
    "KeyOrigin": "EXTERNAL"
}

aws kms decrypt --ciphertext-blob "${KEY1_CIPHERTEXT_BLOB2}"
{
    "KeyId": "arn:aws:kms:us-east-1:111122223333:key/97c8e55d-5b04-45a1-8a01-1febde0dd041",
    "Plaintext": "acSB+a6W2TQ3dsb2c78yDaZLi9HcuHSubeQkFaAdl1Y=",
    "EncryptionAlgorithm": "SYMMETRIC_DEFAULT",
    "KeyMaterialId": "b4749c3ad817379ca9b2264967fe791fcb3fe92a6a14ca06657c3b1e6eafeb4e",
    "KeyOrigin": "EXTERNAL"
}

CloudTrail logging

AWS CloudTrail now includes the key material ID in the additionalEventData field for operations using symmetric-encryption keys (both AWS_KMS and EXTERNAL origin). The following is a sample CloudTrail event for the AWS KMS decrypt operation:

{
    "eventVersion": "1.11",
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "AROAEXAMPLE:alice",
        "arn": "arn:aws:sts::111122223333:assumed-role/key-user/alice",
        "accountId": "111122223333",
        "accessKeyId": "ACCESSKEY",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "AROAEXAMPLE",
                "arn": "arn:aws:iam::111122223333:role/key-user",
                "accountId": "111122223333",
                "userName": "key-user"
            },
            "attributes": {
                "creationDate": "2025-05-09T18:12:33Z",
                "mfaAuthenticated": "false"
            }
        }
    },
    "eventTime": "2025-05-09T18:15:02Z",
    "eventSource": "kms.amazonaws.com",
    "eventName": "Decrypt",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "1.2.3.4",
    "userAgent": "aws-cli/2.9.19 Python/3.9.11 Darwin/24.4.0 exe/x86_64 prompt/off command/kms.decrypt",
    "requestParameters": {
        "encryptionAlgorithm": "SYMMETRIC_DEFAULT"
    },
    "responseElements": null,
    "additionalEventData": { "keyMaterialId": "b4749c3ad817379ca9b2264967fe791fcb3fe92a6a14ca06657c3b1e6eafeb4e" },
    "requestID": "68c6013e-24af-4a66-aa88-64cdb27d693d",
    "eventID": "1feca1a6-cbf0-491d-897f-31c52c54c783",
    "readOnly": true,
    "resources": [{
        "accountId": "111122223333",
        "type": "AWS::KMS::Key",
        "ARN": "arn:aws:kms:us-east-1:111122223333:key/97c8e55d-5b04-45a1-8a01-1febde0dd041"
    }],
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "111122223333",
    "eventCategory": "Management",
    "tlsDetails": {
        "tlsVersion": "TLSv1.3",
        "cipherSuite": "TLS_AES_256_GCM_SHA384",
        "clientProvidedHostHeader": "kms.us-east-1.amazonaws.com"
    }
}

Key expiry and key deletion capabilities unique to imported keys

Unlike standard KMS keys that you create within AWS KMS, imported keys offer two unique capabilities for enhanced controls over key material within AWS.

When importing key material into a KMS key, you can optionally set an expiration date and time, up to 365 days from the import date. If you don’t specify an expiration, the key material doesn’t expire. When the expiration time is reached, AWS KMS immediately deletes the key material and the KMS key becomes unusable. This is in contrast to the 7–30 day waiting period required for KMS keys whose key material is generated by AWS KMS. To re-enable the key, you must reimport the key material. With key rotation, you can continue to set expiration periods for new key material that you import.
Unique to KMS imported keys, you can also delete specific key material without deleting the entire KMS key. Deleting the key material of a KMS imported key is temporary and reversible. To restore the key, reimport its key material.

Key expiry and import key material deletion can be useful if you need to demonstrate immediate key suspension in the cloud or when you want to temporarily seed AWS KMS with key material that can be inserted and repeatedly removed from cloud access (hydration and re-hydration of keys for improved digital sovereignty).

Special considerations

AWS KMS is designed to keep imported key material highly available. But AWS KMS doesn’t maintain the durability of imported key material at the same level as key material that AWS KMS generates. You must retain a copy of the imported key material outside of AWS KMS in a system that you control. We recommend that you store an exportable copy of the imported key material in a key management system, such as an HSM. As a best practice, you should store a reference to the KMS key ARN and key material description alongside the exportable copy of the key material.

The deletion or expiration of any key material associated with a KMS key makes that key unusable. You must re-import all the key materials associated with a key before it can be used for cryptographic operations.

The following types of KMS keys with imported key material do not support on-demand key rotation, but you can continue to use manual rotation with these keys.

Asymmetric keys
HMAC keys
Multi-AWS Region keys

Features and benefits

This new capability includes several important features:

Maintain key identifiers: Rotate key material while keeping the same AWS KMS key ID and ARN.
Flexible rotation: Rotate keys on-demand to meet your security requirements or use an external key manager to set a rotation schedule that can drive new key rotations into AWS KMS.
Audit trail: Track key material usage through CloudTrail logs.
Metadata management: Add unique descriptions to each key material version.
Retains support for key expiry and import key deletion (features unique to KMS imported keys)

Pricing

For AWS KMS keys that rotate automatically or on-demand, each key incurs a base cost, and the first two rotations add $1 per month (prorated hourly) in additional charges. The pricing is capped after the second rotation, meaning subsequent rotations beyond the second one aren’t billed. This simplified pricing provides you with more predictable costs while maintaining the flexibility to rotate keys frequently to meet your compliance and security audit requirements.

Getting started

You can start using this feature today in all AWS Regions where AWS KMS is available. To learn more, visit the AWS KMS Developer Guide.

We’re excited to see how you use this new capability to enhance your key management practices. Leave a comment below or on the AWS re:Post community forum to let us know what you think.

Use AI agents and the Model Context Protocol with Amazon SES

2025-06-06 Zip Zieper

Post Syndicated from Zip Zieper original https://aws.amazon.com/blogs/messaging-and-targeting/use-ai-agents-and-the-model-context-protocol-with-amazon-ses/

Amazon Simple Email Service (Amazon SES) delivers a cloud-based email solution that empowers businesses to send emails more efficiently and at a larger scale. Its powerful, scalable platform enables organizations from startups to global brands to send personalized, high-volume email communications while maintaining exceptional deliverability and performance.

Amazon SES caters to a wide range of users, from developers and technical marketing professionals to business communicators. In addition to offering robust programmatic access through APIs and SMTP protocols, Amazon SES provides a comprehensive web console and intuitive dashboards that make email configuration and performance monitoring accessible to users with varying technical backgrounds. Historically, navigating email workflows and configuring advanced email capabilities in Amazon SES has required specialized knowledge, resulting in a learning curve for new users. As seen in many other areas, today’s AI tools can offer more intuitive ways to manage Amazon SES to get the most out of your email communications. We have found, however, that these AI tools occasionally produce inconsistent results, often as a result of the underlying large language model’s (LLM’s) training data.

Recognizing the need for a specialized, service-aware, AI-friendly Amazon SES solution, we are introducing the SESv2 MCP Server, a sample Model Context Protocol (MCP) for Amazon SES. We’ve integrated the SESv2 MCP Server sample with the Amazon SES v2 APIs to provide more precise and reliable AI-assisted use, management, and configuration for Amazon SES.

MCP is an open protocol that enables seamless integration between your AI-powered integrated development environment (IDE) or AI assistant, enriching the capabilities of the AI and enabling you to use Amazon SES using natural language. For more info, see the GitHub repo.

We’ve released the SESv2 MCP Server sample on GitHub and invite current and prospective customers to experiment with it in non-production environments. You can use it with your AI tools to explore ways in which AI can be used with Amazon SES to send emails, check configurations, and review deliverability. We’re interested in learning how you use your AI tools and the SESv2 MCP Server to test out email sending in different services or applications. We’re also curious if new customers find it helpful when configuring and learning about their Amazon SES service. No matter how you use it, we are eager for your feedback, comments, and contributions through the GitHub project’s issues.

Solution overview

You can use the SESv2 MCP Server sample with AI assistant applications like Anthropic’s Claude Desktop. You can also integrate it into MCP-compatible agentic AI coding assistants such as Amazon Q Developer, Amazon Q for command line, Cline, Cursor, and Windsurf. When used as an AI coding assistant, the SESv2 MCP Server sample helps developers add Amazon SES email capabilities to their applications and services using plain, natural language prompting. For recommendations from AWS on how to improve your vibe coding experience, refer to Vibe coding tips and tricks.

After you’ve configured the sample and authenticated with your AWS credentials, you can use natural language in your chosen AI tool. For example, an email marketing manager might want to ask Anthropic’s Claude Desktop “provide me with the status of the verified identities in my SES account, along with any recommendations to improve deliverability.” Someone new to Amazon SES can ask the Amazon Q CLI “create a new Amazon SES configuration set for the octank.com identity, enable it for event publishing for bounces and complaints.” Similarly, the developer of an AI-enabled restaurant booking application might ask the Amazon Q CLI “my application needs to send email confirmation of a customers online booking. Can you walk me thru adding this capability to my app using my SES account?”

As you can see from these examples, although it’s helpful to know a bit about email, and Amazon SES in general, with the help of your AI tool and the SESv2 MCP Server sample, you don’t need to be an email or Amazon SES expert. The combination of your creativity, AI tool, and the SESv2 MCP Server sample empowers even non-developers to create, test, and monitor Amazon SES workflows using natural language.

The SESv2 MCP Server sample release uses the open source Smithy Java project, which is still in development. As such, the SESv2 MCP Server is considered a sample, and we do not recommend employing it for production use. When a stable version is available, we might update this post and the GitHub repository accordingly.

Prerequisites

To follow along with the example use cases, make sure you have the following prerequisites set up:

AWS credentials with appropriate permissions.
An MCP-compatible LLM client (such as Anthropic’s Claude Desktop, Cline, Amazon Q CLI, or Cursor). For this post, we use the Amazon Q Developer CLI. For installation instructions, refer to Installing Amazon Q for command line.
Java 21 (or later) runtime (as required by Smithy Java).
Access to GitHub.
Git installed locally. For instructions, see Getting Started – Installing Git.

Best practices for using MCPs

To maximize the benefits of MCP-assisted development while maintaining security and code quality, we suggest you follow these essential guidelines:

Always review generated code for security implications before deployment
Use MCP servers as accelerators, not replacements for developer judgment and expertise
Keep MCP servers updated with the latest AWS security best practices
Follow the principle of least privilege when configuring AWS credentials
Run security scanning tools on generated infrastructure code

Configure the AWS CLI

Use the following command to configure the AWS Command Line Interface (AWS CLI) with the AWS credentials for your Amazon SES account and AWS Region:

aws configure

Clone and build the GitHub repository locally

To use macOS or Linux, use the following command to clone and build the GitHub repo:

git clone https://github.com/aws-samples/sample-for-amazon-ses-mcp.git
cd sample-for-amazon-ses-mcp
./build.sh

For Windows, use the following command:

git clone https://github.com/aws-samples/sample-for-amazon-ses-mcp.git
cd sample-for-amazon-ses-mcp
.\build.bat

Copy the absolute path to the .jar file (JAR_PATH_FROM_BUILD_OUTPUT). This will be printed at the end of the build script:

/<your path>/sample-for-amazon-ses-mcp/artifacts/sample-for-amazon-ses-mcp-all.jar

Configure your AI tool to use `SESv2 MCP Server`

When the build is complete, add SESv2 MCP Server to your AI tool’s MCP configuration:

{
  "mcpServers": {
    "sesv2-mcp-server": {
      "command": "java",
      "args": [
        "-jar",
        "JAR_PATH_FROM_BUILD_OUTPUT"
      ]
    }
  }
}

See MCP configuration for configuration steps. See the Claude Desktop MCP configuration guide for setup instructions.

After you build the SESv2 MCP Server and configure your AWS credentials, you’re ready to interact with Amazon SES. Keep in mind that effective, thoughtful prompting is crucial for successful AI-assisted development. For more information about vibe coding, see Vibe coding tips and tricks.

Example use cases

In this section, we provide some guided examples using the Amazon Q Developer CLI to interact with Amazon SES. Feel free to experiment on your own use cases, and share your comments and ideas through the GitHub project’s issues. Do not disclose any personal, commercially sensitive, or confidential information.

Get information, recommendations, and configurations your Amazon SES account

Open your AI tool; for these examples, we use a macOS terminal and initiate a chat session with the Amazon Q CLI:

q chat

We’ve found it useful to provide your AI tool with some guidance:

You're connected to the SESv2 MCP Server and have access to the AWS SESv2 APIs.

Ask the Amazon Q CLI about your AWS account’s SES email identities:

Tell me about the identities in my account, and also if the account is in the SES sandbox?

The Amazon Q CLI will request permission to use the SESv2 MCP Server (which provides the Amazon Q CLI with the SESv2 APIs ListEmailIdentities and GetAccount) to query your AWS SES account and reply with a detailed summary.

Ask the Amazon Q CLI if it has any recommendations related to improving deliverability for your Amazon SES account:

Do you have any recommendations to improve email deliverability for my SES account?

The Amazon Q CLI will use the SESv2 MCP Server (which provides the CLI with the SESv2 API ListRecommendations) to query your Amazon SES account and reply with a detailed summary.

Ask the Amazon Q CLI to set up Amazon SES click tracking for one of your domains. We have found it helpful to remind the CLI that it has access to additional knowledge of the AWS service APIs. It’s also a good idea to make sure the AI tool doesn’t invent nonexistent APIs.

You also have access to other AWS service APIs via the AWS CLI and your general knowledge, but you may only use known, documented APIs - do not invent or create any APIs or commands.
Set up Amazon SES click tracking with CloudWatch integration for the domain <my verified identity> to monitor email metrics. Use Amazon's default tracking domain (no SSL or https) for the click tracking to ensure immediate functionality without requiring custom domain setup. Include all necessary configuration steps and verify the setup works correctly. Create a test HTML email to <my email address> from <no-reply@verified domain> with subject "Testing SES click tracking". Create an HTML (with fallback to text) body with links and short descriptions taken from the public AWS webpages for Amazon SES, AWS End User Messaging and Amazon Connect.

Send emails with your Amazon SES account

Using its knowledge of Amazon SES from the SESv2 MCP Server and permissions to use your Amazon SES account (aws configure), you can use your AI tool to create and send emails using Amazon SES.

If your Amazon SES account is in the Amazon SES sandbox, you are limited to sending and receiving email from verified email addresses. You are also limited to 200 messages in 24 hours. For more information about the Amazon SES sandbox, see Request production access (Moving out of the Amazon SES sandbox). If you’re in the sandbox, you can simply ask your AI tool “verify my email address <[email protected]>.”

Ask the Amazon Q CLI to send a test email with a sample HTML body:

Send a test email to <my verified email address> from <verified SES email identity>. Set the from email display name to "MCP testing". Make the email subject "Test sending an email via SES MCP". Use the information found on the Amazon SES website to create an HTML message body with a few sentences and bullet points about SES. Provide a text version of the message body in case of fallback.

Check your email, where you will receive a response.

You can get creative and ask the Amazon Q CLI to create a formatted email template with personalization using a simple table with email recipients, the product they bought, and their postal code:

Use the table below to send each person in the table an html formatted (with fallback) email message. 
-- table --
email,name,product,zipcode
<my verified email address>,Alice,an umbrella,98101
<my verified email address>,Bob,lots of sunscreen,10001
-- end table --
Use the template below. Create a 5-day weather forecast graphic similar to popular weather app graphics based on estimated weather for their ZIP code.
-- template --
"Hi {{name}}, thanks for buying {{product}}; it looks like you'll need it soon based on the 5-day weather forecast for your local area: <5-day weather forecast graphic>.

As we’ve demonstrated, you don’t need to be a seasoned developer to create and test Amazon SES workflows when you have an AI tool and the SESv2 MCP Server sample.

Conclusion

The SESv2 MCP Server sample democratizes the ability to configure, manage, and create sophisticated email automation workflows with Amazon SES.

The examples and guidance in this post demonstrate how even newcomers can use AI tools like the Amazon Q CLI to test out configuring, monitoring, and sending emails with Amazon SES using natural language. More technical users, including developers, can use the SESv2 MCP Server sample to build and test intelligent email applications that use Amazon SES, or to test out building Amazon SES sending into their own application.

We hope you will experiment with the SESv2 MCP Server sample and provide us with your thoughts and feedback, and perhaps contribute to the project through the GitHub project’s issues.

Additional resources

Embracing event driven architecture to enhance resilience of data solutions built on Amazon SageMaker

2025-06-05 Dhrubajyoti Mukherjee

Post Syndicated from Dhrubajyoti Mukherjee original https://aws.amazon.com/blogs/big-data/embracing-event-driven-architecture-to-enhance-resilience-of-data-solutions-built-on-amazon-sagemaker/

Amazon Web Services (AWS) customers value business continuity while building modern data governance solutions. A resilient data solution helps maximize business continuity by minimizing solution downtime and making sure that critical information remains accessible to users. This post provides guidance on how you can use event driven architecture to enhance the resiliency of data solutions built on the next generation of Amazon SageMaker, a unified platform for data, analytics, and AI. SageMaker is a managed service with high availability and durability. If customers want to build a backup and recovery system on their end, we show you how to do this in this blog. It provides three design principles to improve the data solution resiliency of your organization. In addition, it contains guidance to formulate a robust disaster recovery strategy based on event driven architecture. It contains code samples to back up the system metadata of your data solution built on SageMaker, enabling disaster recovery.

The AWS Well-Architected Framework defines resilience as the ability of a system to recover from infrastructure or service disruptions. You can enhance the resiliency of your data solution by adopting three design principles that are highlighted in this post and by establishing a robust disaster recovery strategy. Recovery point objective (RPO) and recovery time objective (RTO) are industry standard metrics to measure the resilience of a system. RPO indicates how much data loss your organization can accept in case of solution failure. RTO refers to the time for the solution to recover after failure. You can measure these metrics in seconds, minutes, hours, or days. The next section discusses how you can align your data solution resiliency strategy to meet the needs of your organization.

Formulating a strategy to enhance data solution resilience

To develop a robust resiliency strategy for your data solution built on SageMaker, start with how users interact with the data solution. The user interaction influences the data solution architecture, the degree of automation, and determines your resiliency strategy. Here are a few aspects you might consider while designing the resiliency of your data solution.

Data solution architecture – The data solution of your organization might follow a centralized, decentralized, or hybrid architecture. This architecture pattern reflects the distribution of responsibilities of the data solution based on the data strategy of your organization. This shift in responsibilities is reflected in the structure of the teams that perform activities in the Amazon DataZone data portal, SageMaker Unified Studio portal, AWS Management Console, and underlying infrastructure. Examples of such activities include configuring and running the data sources, publishing data assets in the data catalog, subscribing to data assets, and assigning members to projects.
User persona – The user persona, their data, and cloud maturity influence their preferences for interacting with the data solution. The users of a data governance solution fall into two categories: business users and technical users. Business users of your organization might include data owners, data stewards, and data analysts. They might find the Amazon DataZone data portal and SageMaker Unified Studio portal more convenient for tasks such as approving or rejecting subscription requests and performing one-time queries. Technical users such as data solution administrators, data engineers, and data scientists might opt for automation when making system changes. Examples of such activities include publishing data assets, managing glossary and metadata forms in the Amazon DataZone data portal or in SageMaker Unified Studio portal. A robust resiliency strategy accounts for tasks performed by both user groups.
Empowerment of self-service – The data strategy of your organization determines autonomy granted to the users. Increased user autonomy demands a high level of abstraction of the cloud infrastructure powering the data solution. SageMaker empowers self-service by enabling users to perform regular data management activities in the Amazon DataZone data portal and in the SageMaker Unified Studio portal. The level of self-service maturity of the data solution depends on the data strategy and user maturity of your organization. At an early stage, you might limit the self-service features to the use cases for onboarding the data solution. As the data solution scales, consider increasing the self-service capabilities. See Data Mesh Strategy Framework to learn about the different phases of a data mesh-based data solution.

Adopt the following design principles to enhance the resiliency of your data solution:

Choose serverless services – Use serverless AWS services to build your data solution. Serverless services scale automatically with increasing system load, provide fault isolation, and have built-in high-availability. Serverless services minimize the need for infrastructure management, reducing the need to design resiliency into the infrastructure. SageMaker seamlessly integrates with several serverless services such Amazon Simple Storage Service (Amazon S3), AWS Glue, AWS Lake Formation, and Amazon Athena.
Document system metadata – Document the system metadata of your data solution using infrastructure-as-code (IaC) and automation. Consider how users interact with the data solution. If the users prefer to perform certain activities through the Amazon DataZone data portal and SageMaker Unified Studio portal, implement automation to capture and store the metadata that’s relevant for disaster recovery. Use Amazon Relational Database Service (Amazon RDS) and Amazon DynamoDB to store the system metadata of your data solution.
Monitor system health – Implement a monitoring and alerting solution for your data solution so that you can respond to service interruptions and initiate the recovery process. Make sure that system activities are logged so that you can troubleshoot the system interruption. Amazon CloudWatch helps you monitor AWS resources and the applications you run on AWS in real time.

The next section presents disaster recovery strategies to recover your data solution built on SageMaker.

Disaster recovery strategies

Disaster recovery focuses on one-time recovery objectives in response to natural disasters, large-scale technical failures, or human threats such as attack or error. Disaster recovery is a crucial part of your business continuity plan. As shown in the following figure, AWS offers the following options for disaster recovery: Backup and restore, pilot light, warm standby, and multi-site active/active.

The business continuity requirements and cost of recovery should guide your organization’s disaster recovery strategy. As a general guideline, the recovery cost of your data solution increases with reduced RPO and RTO requirements. The next section provides architecture patterns to implement a robust backup and recovery solution for a data solution built on SageMaker.

Solution overview

This section provides event-driven architecture patterns following the backup and restore approach to enhance resiliency of your data solution. This active/passive strategy-based solution stores the system metadata in a DynamoDB table. You can use the system metadata to restore your data solution. The following architecture patterns provide regional resilience. You can simplify the architecture of this solution to restore data in a single AWS Region.

Pattern 1: Point-in-time backup

The point-in-time backup captures and stores system metadata of a data solution built on SageMaker when a user or an automation performs an action. In this pattern, a user activity or an automation initiates an event that captures the system metadata. This pattern is suited for low RPO requirements, ranging from seconds to minutes. The following architecture diagram shows the solution for the point-in-time backup process.

Architecture point-in-time-backup

The steps comprise the following.

User or automation performs an activity on an Amazon DataZone domain or Amazon Unified Studio domain.
This activity creates a new event in AWS CloudTrail.
The CloudTrail event is sent to Amazon EventBridge. Alternatively, you can use Amazon DataZone as the event source for the EventBridge rule.
AWS Lambda transforms and stores this event in a DynamoDB global table where the Amazon DataZone domain is hosted.
The information is replicated into the replica DynamoDB table in a secondary Region. The replica DynamoDB table can be used to restore the data solution based on SageMaker in the secondary Region.

Pattern 2: Scheduled backup

The scheduled backup captures and stores system metadata of a data solution built on SageMaker at regular intervals. In this pattern, an event is initiated based on a defined time schedule. This pattern is suited for RPO requirements in the order of hours. The following architecture diagram displays the solution for point-in-time backup process.

The steps comprise the following.

EventBridge triggers an event at regular interval and sends this event to AWS Step Functions.
The Step Functions state machine contains multiple Lambda functions. These Lambda functions get the system metadata from either a SageMaker Unified Studio domain or an Amazon DataZone domain.
The system metadata is stored in an DynamoDB global table in the primary Region where the Amazon DataZone domain is hosted.
The information is replicated into the replica DynamoDB table in a secondary Region. The data solution can be restored in the secondary Region using the replica DynamoDB table.

The next section provides step by step instructions to deploy a code sample that implements the scheduled backup pattern. This code sample stores asset information of a data solution built on a SageMaker Unified Studio domain and an Amazon DataZone domain in an DynamoDB global table. The data in the DynamoDB table is encrypted at rest using a customer managed key stored in AWS Key Management Service (AWS KMS). A multi-Region replica key encrypts the data in the secondary Region. The asset uses the data lake blueprint that contains the definition for launching and configuring a set of services (AWS Glue, Lake Formation, and Athena) to publish and use data lake assets in the business data catalog. The code sample uses the AWS Cloud Development Kit (AWS CDK) to deploy the cloud infrastructure.

Prerequisites

An active AWS account.
AWS administrator credentials for the central governance account in your development environment
AWS Command Line Interface (AWS CLI) installed to manage your AWS services from the command line (recommended)
Node.js and Node Package Manager (npm) installed to manage AWS CDK applications
AWS CDK Toolkit installed globally in your development environment by using npm, to synthesize and deploy AWS CDK applications

npm install -g aws-cdk

TypeScript installed in your development environment or installed globally by using npm compiler:

npm install -g typescript

Docker installed in your development environment (recommended)
An integrated development environment (IDE) or text editor with support for Python and TypeScript (recommended)

Walkthrough for data solutions built on a SageMaker Unified Studio domain

This section provides step by step instructions to deploy a code sample that implements the scheduled backup pattern for data solutions built on a SageMaker Unfied Studio domain.

Set up SageMaker Unified Studio

Sign into the IAM console. Create an IAM role that trusts Lambda with the following policy.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "datazone:Search",
            "Resource": "*"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "dynamodb:PutItem"
            ],
            "Resource": "arn:aws:dynamodb:<AWS_REGION>:<AWS_ACCOUNT>:table/*"
        },
        {
            "Sid": "VisualEditor2",
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt",
                "kms:Encrypt",
                "kms:GenerateDataKey",
                "kms:ReEncrypt*",
                "kms:DescribeKey"
            ],
            "Resource": "arn:aws:kms:<AWS_REGION>:<AWS_ACCOUNT>:key/<KMS_KEY_ID>"
        },
        {
            "Sid": "VisualEditor3",
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:<AWS_REGION>:<AWS_ACCOUNT>:log-group:*:log-stream:*",
                "arn:aws:logs:<AWS_REGION>:<AWS_ACCOUNT>:log-group:*"
            ]
        }
    ]
}

Note down the Amazon Resource Name (ARN) of the Lambda role. Navigate to SageMaker and choose Create a Unified Studio domain.
Select Quick setup and expand the Quick setup settings section. Enter a domain name, for example, CORP-DEV-SMUS. Select the Virtual private cloud (VPC) and Subnets. Choose Continue.
Enter the email address of the SageMaker Unified Studio user in the Create IAM Identity Center user section. Choose Create domain.
After the domain is created, choose Open unified studio in the top right corner.
Sign in to SageMaker Unified Studio using the single sign-on (SSO) credentials of your user. Choose Create project at the top right corner. Enter a project name and description, choose Continue twice, and choose Create project. Wait unti project creation is complete.
After the project is created, go into the project by selecting the project name. Select Query Editor from the Build drop-down menu on the top left. Paste the following create table as select (CTAS) query script in the query editor window and run it to create a new table named mkt_sls_table as described in Produce data for publishing. The script creates a table with sample marketing and sales data.

CREATE TABLE mkt_sls_table AS
SELECT 146776932 AS ord_num, 23 AS sales_qty_sld, 23.4 AS wholesale_cost, 45.0 as lst_pr, 43.0 as sell_pr, 2.0 as disnt, 12 as ship_mode,13 as warehouse_id, 23 as item_id, 34 as ctlg_page, 232 as ship_cust_id, 4556 as bill_cust_id
UNION ALL SELECT 46776931, 24, 24.4, 46, 44, 1, 14, 15, 24, 35, 222, 4551
UNION ALL SELECT 46777394, 42, 43.4, 60, 50, 10, 30, 20, 27, 43, 241, 4565
UNION ALL SELECT 46777831, 33, 40.4, 51, 46, 15, 16, 26, 33, 40, 234, 4563
UNION ALL SELECT 46779160, 29, 26.4, 50, 61, 8, 31, 15, 36, 40, 242, 4562
UNION ALL SELECT 46778595, 43, 28.4, 49, 47, 7, 28, 22, 27, 43, 224, 4555
UNION ALL SELECT 46779482, 34, 33.4, 64, 44, 10, 17, 27, 43, 52, 222, 4556
UNION ALL SELECT 46779650, 39, 37.4, 51, 62, 13, 31, 25, 31, 52, 224, 4551
UNION ALL SELECT 46780524, 33, 40.4, 60, 53, 18, 32, 31, 31, 39, 232, 4563
UNION ALL SELECT 46780634, 39, 35.4, 46, 44, 16, 33, 19, 31, 52, 242, 4557
UNION ALL SELECT 46781887, 24, 30.4, 54, 62, 13, 18, 29, 24, 52, 223, 4561

Navigate to Data sources from the Project. Choose Run in the Actions section next to the project.default_lakehouse connection. Wait until the run is complete.
Navigate to Assets in the left side bar. Select the mkt_sls_table in the Inventory section and review the metadata that was generated. Choose Accept All if you’re satisfied with the metadata.
Choose Publish Asset to publish the mkt_sls_table table to the business data catalog, making it discoverable and understandable across your organization.
Choose Members in the navigation pane. Choose Add members and select the IAM role you created in Step 1. Add the role as a Contributor in the project.

Deployment steps

After setting up SageMaker Unified Studio, use the AWS CDK stack provided on GitHub to deploy the solution to back up the asset metadata that is created in the previous section.

Clone the repository from GitHub to your preferred integrated development environment (IDE) using the following commands.

git clone https://github.com/aws-samples/sample-event-driven-resilience-data-solutions-sagemaker.git
cd sample-event-driven-resilience-data-solutions-sagemaker

Export AWS credentials and the primary Region to your development environment for the IAM role with administrative permissions, use the following format

export AWS_REGION=
export AWS_ACCESS_KEY_ID=
export AWS_SECRET_ACCESS_KEY=
export AWS_SESSION_TOKEN=

In a production environment, use AWS Secrets Manager or AWS Systems Manager Parameter Store to manage credentials. Automate the deployment process using a continuous integration and delivery (CI/CD) pipeline.

Bootstrap the AWS account in the primary and secondary Regions by using AWS CDK and running the following command.

cdk bootstrap aws://<AWS_ACCOUNT_ID>/<AWS_REGION>
cdk bootstrap aws://<AWS_ACCOUNT_ID>/<AWS_SECONDARY_REGION>
cd unified-studio

Modify the following parameters in the config/Config.ts file.

SMUS_APPLICATION_NAME – Name of the application.
SMUS_SECONDARY_REGION – Secondary AWS region for backup.
SMUS_BACKUP_INTERVAL_MINUTES – Minutes before each backup interval. 
SMUS_STAGE_NAME – Name of the stage. 
SMUS_DOMAIN_ID – Domain identifier of the Amazon SageMaker Unified Studio. 
SMUS_PROJECT_ID – Project identifier of the Amazon SageMaker Unified Studio. 
SMUS_ASSETS_REGISTRAR_ROLE_ARN – ARN of the AWS Lambda role created in step 1 of the preceding section.

Install the dependencies by running the following command:

npm install

Synthesize the CloudFormation template by running the following command.

cdk synth

Deploy the solution by running the following command.

cdk deploy –all

After the deployment is complete, sign in to your AWS account and navigate to the CloudFormation console to verify that the infrastructure deployed.

When deployment is complete, wait for the duration of DZ_BACKUP_INTERVAL_MINUTES. Navigate to the <DZ_APPLICATION_NAME >AssetsInfo DynamoDB table. Retrieve the data from the DynamoDB table. The following screenshot shows the data in the Items returned section. Verify the same data in the secondary Region. Screenshot smus-dynamodb

Clean up

Use the following steps to clean up the resources deployed.

Empty the S3 buckets that were created as part of this deployment.
In your local development environment (Linux or macOS):
Navigate to the unified-studio directory of your repository.
Export the AWS credentials for the IAM role that you used to create the AWS CDK stack.
To destroy the cloud resources, run the following command:

cdk destroy --all

Go to the SageMaker Unified Studio and delete the published data assets that were created in the project.
Use the console to delete the SageMaker Unified Studio domain.

Walkthrough for data solutions built on an Amazon DataZone domain

This section provides step by step instructions to deploy a code sample that implements the scheduled backup pattern for data solutions built on an Amazon DataZone domain.

Deployment steps

After completing the prerequisites, use the AWS CDK stack provided on GitHub to deploy the solution to backup system metadata of the data solution built on Amazon DataZone domain

Clone the repository from GitHub to your preferred IDE using the following commands.

git clone https://github.com/aws-samples/sample-event-driven-resilience-data-solutions-sagemaker.git
cd event-driven-resilience-sagemaker

Export AWS credentials and the primary Region information to your development environment for the AWS Identity and Access Management (IAM) role with administrative permissions, use the following format:

export AWS_REGION=
export AWS_ACCESS_KEY_ID=
export AWS_SECRET_ACCESS_KEY=
export AWS_SESSION_TOKEN=

In a production environment, use Secrets Manager or Systems Manager Parameter Store to manage credentials. Automate the deployment process using a CI/CD pipeline.

Bootstrap the AWS account in the primary and secondary Regions by using AWS CDK and running the following command:

cdk bootstrap aws://<AWS_ACCOUNT_ID>/<AWS_REGION>
cdk bootstrap aws://<AWS_ACCOUNT_ID>/<AWS_SECONDARY_REGION>
cd datazone

From the console for IAM, note the Amazon Resource Name (ARN) of the CDK execution role. Update the trust relationship of the IAM role so that Lambda can assume the role.

Modify the following parameters in the config/Config.ts file.

DZ_APPLICATION_NAME – Name of the application.
DZ_SECONDARY_REGION – Secondary Region for backup.
DZ_BACKUP_INTERVAL_MINUTES – Minutes before each backup interval.
DZ_STAGE_NAME – Name of the stage (dev, qa, or prod).
DZ_DOMAIN_NAME – Name of the Amazon DataZone domain
DZ_DOMAIN_DESCRIPTION – Description of the Amazon DataZone domain
DZ_DOMAIN_TAG – Tag of the Amazon DataZone domain
DZ_PROJECT_NAME – Name of the Amazon DataZone project
DZ_PROJECT_DESCRIPTION – Description of the Amazon DataZone project
CDK_EXEC_ROLE_ARN – ARN of the CDK execution role
DZ_ADMIN_ROLE_ARN – ARN of the administrator role

Install the dependencies by running the following command:

npm install

Synthesize the AWS CloudFormation template by running the following command:

cdk synth

Deploy the solution by running the following command:

cdk deploy --all

After the deployment is complete, sign in to your AWS account and navigate to the CloudFormation console to verify that the infrastructure deployed.

Document system metadata

This section provides instructions to create an asset and demonstrates how you can retrive the metadata of the asset. Perform the following steps to retrieve the systems metadata.

Sign in to the Amazon DataZone data portal from the console. Select the project and choose Query data at the upper right.

Screenshot datazone-open-query

Choose Open Athena and make sure that <DZ_PROJECT_NAME>_DataLakeEnvironment is selected in the Amazon DataZone environment dropdown at the upper right and that on the left, and that <DZ_PROJECT_NAME>_datalakeenvironment_pub_db is selected as the Database.
Create a new AWS Glue table for publishing to Amazon DataZone. Paste the following create table as select (CTAS) query script in the Query window and run it to create a new table named mkt_sls_table as described in Produce data for publishing. The script creates a table with sample marketing and sales data.

CREATE TABLE mkt_sls_table AS
SELECT 146776932 AS ord_num, 23 AS sales_qty_sld, 23.4 AS wholesale_cost, 45.0 as lst_pr, 43.0 as sell_pr, 2.0 as disnt, 12 as ship_mode,13 as warehouse_id, 23 as item_id, 34 as ctlg_page, 232 as ship_cust_id, 4556 as bill_cust_id
UNION ALL SELECT 46776931, 24, 24.4, 46, 44, 1, 14, 15, 24, 35, 222, 4551
UNION ALL SELECT 46777394, 42, 43.4, 60, 50, 10, 30, 20, 27, 43, 241, 4565
UNION ALL SELECT 46777831, 33, 40.4, 51, 46, 15, 16, 26, 33, 40, 234, 4563
UNION ALL SELECT 46779160, 29, 26.4, 50, 61, 8, 31, 15, 36, 40, 242, 4562
UNION ALL SELECT 46778595, 43, 28.4, 49, 47, 7, 28, 22, 27, 43, 224, 4555
UNION ALL SELECT 46779482, 34, 33.4, 64, 44, 10, 17, 27, 43, 52, 222, 4556
UNION ALL SELECT 46779650, 39, 37.4, 51, 62, 13, 31, 25, 31, 52, 224, 4551
UNION ALL SELECT 46780524, 33, 40.4, 60, 53, 18, 32, 31, 31, 39, 232, 4563
UNION ALL SELECT 46780634, 39, 35.4, 46, 44, 16, 33, 19, 31, 52, 242, 4557
UNION ALL SELECT 46781887, 24, 30.4, 54, 62, 13, 18, 29, 24, 52, 223, 4561

Go to the Tables and Views section and verify that the mkt_sls_table table was successfully created.
In the Amazon DataZone Data Portal, go to Data sources, select the <DZ_PROJECT_NAME>-DataLakeEnvironment-default-datasource, and choose Run. The mkt_sls_table will be listed in the inventory and available to publish.
Select the mkt_sls_table table and review the metadata that was generated. Choose Accept All if you’re satisfied with the metadata.
Choose Publish Asset and the mkt_sls_table table will be published to the business data catalog, making it discoverable and understandable across your organization.
After the table is published, wait for the duration of DZ_BACKUP_INTERVAL_MINUTES. Navigate to the <DZ_APPLICATION_NAME >AssetsInfo DynamoDB table and retrieve the data from the table. The following screenshot shows the data in the Items returned section. Verify the same data in the secondary Region.

Clean up

Use the following steps to clean up the resources deployed.

Empty the Amazon Simple Storage Service (Amazon S3) buckets that were created as part of this deployment.
Go to the Amazon DataZone domain portal and delete the published data assets that were created in the Amazon DataZone project.
In your local development environment (Linux or macOS):

Navigate to the datazone directory of your repository.
Export the AWS credentials for the IAM role that you used to create the AWS CDK stack.
To destroy the cloud resources, run the following command:

cdk destroy --all

Conclusion

This post explores how to build a resilient data governance solution on Amazon SageMaker. Resilient design principles and a robust disaster recovery strategy are central to the business continuity of AWS customers. The code samples included in this post implement a backup process of the data solution at regular time interval. They store the Amazon SageMaker asset information in Amazon DynamoDB Global tables. You can extend the backup solution by identifying the system metadata that is relevant for the data solution of your organization and by using Amazon SageMaker APIs to capture and store the metadata. The DynamoDB Global table replicates the changes in the DynamoDB table in the primary region to the secondary region in an asynchronous manner. Consider Implementing an additional layer of resiliency by using AWS Backup to back up the DynamoDB table at regular interval. In the next post, we show how you can use the system metadata to restore your data solution in the secondary region.

Adopt the resiliency features offered by Amazon DataZone and Amazon SageMaker Unified Studio. Use AWS Resilience Hub to assess the resilience of your data solution. AWS Resilience Hub helps you to define your resilience goals, assess your resilience posture against those goals, and implement recommendations for improvement based on the AWS Well-Architected Framework.

To build a data mesh based data solution using Amazon DataZone domain, see our GitHub repository. This open source project provides a step-by-step blueprint for constructing a data mesh architecture using the powerful capabilities of Amazon SageMaker, AWS Cloud Development Kit (AWS CDK), and AWS CloudFormation.

About the authors

BDB-4558-Dhruba Dhrubajyoti Mukherjee is a Cloud Infrastructure Architect with a strong focus on data strategy, data governance, and artificial intelligence at Amazon Web Services (AWS). He uses his deep expertise to provide guidance to global enterprise customers across industries, helping them build scalable and secure cloud solutions that drive meaningful business outcomes. Dhrubajyoti is passionate about creating innovative, customer-centric solutions that enable digital transformation, business agility, and performance improvement. Outside of work, Dhrubajyoti enjoys spending quality time with his family and exploring nature through his love of hiking mountains.

Dynamically routing requests with Amazon API Gateway routing rules

2025-06-03 Anton Aleksandrov

Post Syndicated from Anton Aleksandrov original https://aws.amazon.com/blogs/compute/dynamically-routing-requests-with-amazon-api-gateway-routing-rules/

Effective API management and routing capabilities are crucial for organizations managing complex application architectures. Whether you’re a technology company rolling out new API versions to millions of users, or a financial services organization conducting A/B tests to optimize customer experiences, the ability to route API traffic dynamically and efficiently is essential.

Today, Amazon API Gateway announces support for dynamic routing rules for custom domain names in all supported AWS Regions. This new capability enables you to route API requests based on HTTP header values, either independently or in combination with URL paths. In this post, you will learn how to use this new capability to implement routing strategies such as API versioning and gradual rollouts without modifying your API endpoints.

Dynamic Routing Rules Overview

Many organizations require dynamic API routing capabilities to support their evolving business needs. As a line-of-business persona, you want to be able to test new user experiences with specific customer segments, while maintaining their existing flows intact. As an engineer, you want to be able to maintain multiple API versions across different client applications while ensuring regulatory compliance. Prior to this launch, developers using API Gateway implemented dynamic routing by using different URL paths, such as “/v1/products” and “/v2/products”.

With this new launch, you can implement dynamic routing logic with a simple declarative configuration within the custom domain name settings. The new routing rule mechanism allows you to make routing decisions based on HTTP headers, base paths, or a combination of both. Developers are no longer required to create new or alter existing paths to smoothly transition between API versions, they can simply specify the desired value in the request HTTP header. Among other possibilities, you can implement cell-based architecture routing, A/B testing, or dynamic backend selection based on hostname, tenant ID, accepted response media type, or cookie value. By implementing routing logic directly within the API Gateway, you can eliminate proxy layers and complex URL structures while maintaining fine-grained control over your API traffic. This new feature seamlessly integrates with existing API Gateway capabilities and supports both public and private REST APIs. The following diagram shows how you can use routing rules for header and base-path based routing. This example uses a single level resource /products to show path matching, however depending on your use-case you could also use multi-level paths like /products/items.

Figure 1. Using routing rules for header and base-path based routing

In the following section you’ll learn how to implement header-based routing, use the new routing rules construct for common scenarios like API versioning and A/B testing, and configure rules with different routing conditions and priorities to achieve the desired behavior.

What is a routing rule

A routing rule is a new resource type uniquely associated with a single custom domain. It represents a collection of conditions that, when matched, cause the incoming request to be forwarded to a specific API and stage. Routing rules have three configuration properties:

The Conditions property defines the criteria that must be met for actions to be taken. A rule can include up to two header conditions and one base path condition, and all specified conditions must be met to trigger the action. If no conditions are defined for a rule, it serves as a catch-all rule matching all requests.
The Actions property defines what actions will be taken when rule conditions are met. At the time of this launch the supported action is invoking any stage of any REST API within the same account and region boundaries.
The Priority property defines the order that rules are evaluated in, with 1 being highest priority and 1,000,000 the lowest. You cannot reuse same priority value for more than one rule. AWS recommends you leave ample space between sequential rules to make it easy to add new rules in future, for example use 100, 200, 300 instead of 1, 2, 3.

Header conditions, specified via a MatchHeaders property, are used to match HTTP request header values, such as x-version=v1. Conforming to RFC 7230, header names are not case sensitive, while header values are. You can also use wildcards in header values for prefix, suffix, and contains match. See the following examples using AWS CloudFormation templates:

Exact match:

- MatchHeaders: 
	AnyOf: 
		- Header: "x-version" 
		ValueGlob: "alpha-v2-latest"

Will only match x-version=alpha-v2-latest

Prefix match:

- MatchHeaders: 
	AnyOf: 
	- Header: "x-version" 
	ValueGlob: "*latest"

Matches x-version=alpha-v2-latest, but not x-version=alpha-v2

Suffix match:

- MatchHeaders: 
	AnyOf: 
		- Header: "x-version" 
		ValueGlob: "alpha*"

Will match x-version=alpha-v2-latest and x-version=alpha-v1, but not x-version=beta-v1

Prefix and suffix match.

- MatchHeaders: 
	AnyOf: 
		- Header: "x-version" 
		ValueGlob: "*v2*"

Matches x-version=alpha-v2-latest and x-version=beta-v2-test, but not x-version=alpha-v1

Base path condition, specified via MatchBasePaths property, is used to match the incoming request path. The matching is case sensitive.

- MatchBasePaths: 
	AnyOf: 
		- "products"

You can have up to two MatchHeaders and one MatchBasePaths conditions per routing rule. Conditions are evaluated using the AND operator, meaning all conditions must be met for the action to be taken. Both condition types support a single comparison value under AnyOf property. The following snippet illustrates a sample routing rule with two MatchHeaders conditions and a single MatchBasePaths condition.

ProductsV1RoutingRule: 
	Type: 'AWS::ApiGatewayV2::RoutingRule' 
	Properties: 
		DomainNameArn: !Sub "arn:aws:apigateway:${AWS::Region}::/domainnames/${ApiCustomDomain}" 
		Priority: 100 
		Conditions: 
			- MatchHeaders: 
				AnyOf: 
					- Header: "x-version" 
					ValueGlob: "v2" 
			- MatchHeaders: 
				AnyOf: 
					- Header: "x-user-cohort" 
					ValueGlob: "beta-testers" 
			- MatchBasePaths: 
				AnyOf: 
					- "products" 
		Actions: 
				- InvokeApi:
					ApiId: !Ref ProductsV2Api 
					Stage: !Ref ProductsV2Stage

This rule matches requests to https://example.com/products when both header conditions are met – x-version=v2 and x-user-cohort=beta-testers. This rule does not match requests to any other base path, such as https://example.com/orders, or requests that do not match at least one header condition.

For scenarios like API versioning, you can create rules that evaluate headers such as “accept” or “version” to route traffic to different API implementations. For example, to route requests containing “x-version: api-beta” to your beta API, you would create a rule specifying this header condition and set the action to route to your beta API deployment.

Header-based routing also simplifies A/B testing by allowing you to define client cohorts based on custom headers, allowing controlled experiments with different configurations. You can create rules that check for a custom header like “x-test-group” to route specific users to different API implementations. The priority system ensures predictable routing behavior – when multiple rules match a request, the rule with the lowest priority number (highest precedence) determines the routing. Combining header and path conditions within a single rule enables complex routing scenarios such as version-specific routing for specific API resources instead of the entire API, as illustrated in the following diagram.

Figure 2. A routing configuration with two header and one path conditions in API Gateway Console.

Review the API Gateway documentation for detailed guide on creating routing rules.

Configuring Routing Mode

Before you begin creating routing rules, you must first create at least one API, stage, and a custom domain name. You can configure your custom domain name with the new routing mode setting.

API mappings only. This is the default mode. When using this mode, you can continue to use base path mappings to route requests to different APIs, and not use Routing Rules at all. This mode maintains the current behavior, where requests are routed based on base path mappings only.
Routing rules then API mappings. With this mode you can use Routing Rules while continuing to keep base path mappings as a fallback. When you use this mode, the Routing Rules always take precedence, and unmatched requests are evaluated against base path mappings. This mode is useful for gradually transitioning your APIs to Routing Rules.
Routing rules only. This mode gives you the flexibility to use routing rules only, and not rely on the base paths that you may have previously created on the domain using API mappings. This is the recommended routing mode; it is helpful when you are starting off with a new custom domain or finished transitioning from API mappings to Routing Rules for an existing custom domain.

When switching from one routing mode to another, always test your new configuration in non-production environments first. For example, when switching mode from API mappings only to routing rules only, your traffic will only be routed with routing rules; existing API mappings will no longer take effect.

Onboarding to Header-Based Routing

You can adopt the new Header-Based Routing for your existing API Gateway custom domains with zero-downtime, risk-minimized approach. The first step is to configure your custom domain to use the Routing rules then API mappings mode using the API Gateway console, AWS CLI, or your infrastructure-as-code (IaC) tool. This configuration ensures that while you gradually create Routing Rules, your existing base path mappings continue to function as fallback routes. Since Routing Rules are evaluated before base path mappings, and in the absence of any matching rules, requests automatically fall back to your existing base path mappings, your current API traffic remains unaffected during this transition.

Once you’ve configured the routing mode, you can progressively introduce Routing Rules alongside your existing setup. For example, you might start by creating a rule with a specific test header that routes to a new API version, allowing you to validate the routing behavior with controlled test traffic while production traffic continues flowing through your existing base path mappings. As you gain confidence in the new routing configuration, you can gradually expand your rules, adjust priorities, and optionally migrate away from base path mappings entirely. This incremental approach, combined with API Gateway’s observability capabilities described in the next section, enables you to validate each change and ensure your API consumers experience no disruption during the transition.

Observability

API Gateway provides comprehensive visibility into how your routing rules are processing requests through access logging. Each request now includes additional context variables that help you understand the routing decision process. The $context.customDomain.routingRuleIdMatched variable identifies which rule was matched and applied to the request, while existing variables like $context.domainName, $context.apiId, and $context.stage provide the complete routing context. By analyzing these access logs, you can verify routing behavior, troubleshoot unexpected routes, and gather insights about traffic patterns across different API versions or test variants.

End-to-end example

Consider a real-world scenario where a team needs to gradually migrate users to a new API version, such as an e-commerce platform updating its checkout API from v1 to v2. First, the team creates two different REST APIs – one for each version. Then, they set up a Routing Rule with priority 100 that checks for the header x-version=v2 and routes matching requests to the v2 API. They also create another rule with priority 200 that routes all requests with paths starting with /checkout to v1 API as a fallback.

Figure 3. Gradually transitioning clients from v1 to v2 API.

In the application code they add the x-version header for a small percentage of users. They monitor the performance and error rates using API Gateway’s telemetry capabilities by tracking the access and execution logs, along with emitted metrics. As their confidence grows, they gradually increase the percentage of users sending the v2 header. This approach ensures a controlled migration with minimal risk and ability to quickly rollback by simply removing the header from requests or changing a routing rule.

Sample

Follow the instructions in this GitHub repository to provision the sample in your AWS account. The project illustrates using dynamic routing with API Gateway.

Conclusion

Header-based routing brings significant advantages to API Gateway users. The feature’s backward compatibility ensures a smooth transition path – you can maintain existing base path mappings while gradually adopting Routing Rules, or use both mechanisms simultaneously with the fallback option. This flexibility allows you to migrate at your own pace without disrupting existing applications. The solution is cost-effective, with no additional charges for using Routing Rules on REST APIs. It reduces requirements to leverage extra service and infrastructure for dynamic routing. The priority-based evaluation system provides deterministic routing behavior, making it easier to understand and troubleshoot routing decisions.

To learn more about API Gateway header-based routing see the service documentation.

To learn more about Serverless architectures see Serverless Land.

Powering global payout intelligence: How MassPay uses Amazon Redshift Serverless and zero-ETL to drive deeper analytics.

2025-06-02 Yossi Shlomo

Post Syndicated from Yossi Shlomo original https://aws.amazon.com/blogs/big-data/powering-global-payout-intelligence-how-masspay-uses-amazon-redshift-serverless-and-zero-etl-to-drive-deeper-analytics/

Since the company was founded in 2019, MassPay’s singular objective has been to deliver frictionless global payments that power innovation and lift people, businesses, and quality of life worldwide. Today, the MassPay payment orchestration offering empowers companies to move money across borders effortlessly; enabling local payment experiences in over 175 countries and 70 currencies—including digital wallets, locally preferred alternative payment methods, and cryptocurrencies. From hyper-localized checkout experiences to instant global payouts, we orchestrate seamless financial experiences that reflect how people and businesses transact around the world.

As we have expanded globally, so has the complexity of our data. In this blog post we shall cover how understanding real-time payout performance, identifying customer behavior patterns across regions, and optimizing internal operations required more than traditional business intelligence and analytics tools. And how since implementing Amazon Redshift and Zero-ETL, we’ve seen 90% reduction in data availability latency, payments data available for analytics 1.5x faster, leading to 45% reduction in time-to-insight and 37% fewer support tickets related to transaction visibility and payment inquiries.

Unlocking deeper payout intelligence and global insights

To continue our innovation—and to continue to exceed our partners’ and customers’ expectations—we knew we needed to go beyond basic reporting. We know success is dependent upon developing a truly data-driven organization. This means tracking granular KPIs across payout success rates, payment method adoption, transaction velocity, customer onboarding funnel drop-off, and support ticket correlation. We also wanted to better forecast customer payment expectations, monitor foreign exchange cost trends, and understand market-specific nuances such as how payout timing impacts seller satisfaction in social commerce ecosystems.

“We didn’t just want more data. We wanted faster, smarter insights that would shape decisions in real time. Being a data-driven organization means our teams don’t guess. They know. And that gives us, our partners, and our customers real operational and competitive advantages.”

– Yossi Schlomo, Director of Payment Systems Architecture

MySQL databases, CSV exports, and third-party reporting tools wouldn’t support the scale or speed we needed to deliver.

Choosing AWS: A scalable and integrated analytics foundation

We chose Amazon Web Services (AWS) for our data infrastructure and to accelerate our analytics capabilities.

At the core of our stack is Amazon Redshift Serverless with AI-driven scaling and optimizations enabled, which gives us scalable, fast, and cost-efficient analytics without the burden of managing infrastructure. Coupled with Amazon Aurora MySQL-Compatible Edition as our transactional data store and Amazon Redshift zero-ETL integration, we eliminated manual data pipelines altogether. Transactional data flows into Amazon Redshift in near real-time, instantly powering dashboards, alerts, and machine learning (ML) models.

This data feeds interactive dashboards—both internally and embedded within our platform for customers. Now, executives, operations leads, and customer success teams can drill into payout performance by region, merchant, or payment method, while customers get real-time visibility into their own payout analytics as part of our platform experience. The architecture is shown in the following figure.

MassPay Zero-ETL architecture with Amazon Redshift Serverless

Why it’s different and what it unlocked

Without Amazon Redshift Serverless and zero-ETL, we would have had to invest in costly custom data pipelines, maintain separate exchange, transform, and load (ETL) infrastructure, and manually manage data freshness. The integration with Aurora MySQL-Compatible is seamless and reduces our analytics latency from minutes to seconds.

“Our differentiator is simple: We operationalize not just transactions but analytics for global payments. Most platforms can tell you if a transaction went through. For payments and payouts, MassPay can tell you how fast it went, what it cost, what method was most effective, and what that means for your business in real time.”

– Yossi Schlomo, Director of Payment Systems Architecture

Embedded intelligence, built for scale

Every MassPay customer gets access to comprehensive payment analytics. These are accessed using our API or through a white-label dashboard (shown in the following figure). This detail is core to our product and central to our value proposition. As part of our go-to-market strategy, we showcase these capabilities in every demo, and they’ve proven to be key drivers in conversion and upsell conversations, especially with platforms targeting high-growth ecosystems.We use tiered pricing models based on transaction volume, and our embedded intelligence helps our partners and customers optimize usage and scale efficiently.

MassPay Dashboard

What we’ve gained

Since implementing Amazon Redshift and Zero-ETL, we’ve seen measurable results including:

90% reduction in data availability latency and data available for analytics 1.5x faster
45% reduction in time-to-insight across payment and payout intelligence reports
37% fewer support tickets related to transaction visibility and payment inquiries
Real-time Net Promoter Score (NPS) tracking correlates with payout success metrics, driving faster resolution paths

What’s next

We’re now extending our analytics model to include more advanced ML-based payout failure prediction and ML-based payment authorization prediction, FX optimization alerts, partner-level and network-level benchmarking, and much more.

Conclusion

MassPay isn’t just payments. We aren’t just payouts. We are the engine powering modern commerce. With AWS, we’re turning complex global payments infrastructure into a smart, transparent, and scalable platform for insights. For our partners, and for our customers, this means better decisions, faster payment processing, faster payouts, and truly global reach without guesswork.

We encourage you to leverage below resources to explore these features further

About the authors

Yossi Shlomo serves as the Director of Payment Systems Architecture at MassPay. Yossi is an expert in credit card payment systems, PCI compliance, and secure transaction architecture, helping global platforms process payments at scale with confidence. He specializes in building scalable, cloud-based transaction systems and optimizing global payment gateways for performance and reliability.

Milind Oke is a Amazon Redshift and SageMaker Lakehouse specialist Solutions Architect as AWS. He is based out of New York and has been building enterprise data platforms, data warehousing, and analytics solutions for customers across various domains over two decades. In the 5 years with AWS, Milind has been a speaker at worldwide technical conferences and is co-author of Amazon Redshift: The Definitive Guide: Jump-Start Analytics Using Cloud Data Warehousing 1st Edition.

Introducing AWS Serverless MCP Server: AI-powered development for modern applications

2025-05-29 Shridhar Pandey

Post Syndicated from Shridhar Pandey original https://aws.amazon.com/blogs/compute/introducing-aws-serverless-mcp-server-ai-powered-development-for-modern-applications/

Modern application development demands faster, more efficient ways to build and deploy software. Over the past decade, serverless computing has emerged as a transformative approach to software development, enabling developers to focus on building applications without having to manage the underlying infrastructure. As developers build their applications using AWS Serverless Compute, they seek guidance on selecting appropriate services, understanding best practices, and implementation patterns to make the most of this paradigm.

Today, AWS announces the open-source AWS Serverless Model Context Protocol (MCP) Server, a tool that combines the power of AI assistance with serverless expertise to enhance how developers build modern applications. The Serverless MCP Server provides contextual guidance specific to serverless development, helping developers make informed decisions about architecture, implementation, and deployment.

This post describes how the Serverless MCP Server works with AI coding assistants to streamline serverless development. Learn how to use this solution to accelerate your serverless development workflow and build robust, high-performing applications more efficiently.

Overview

Serverless computing enables development teams to significantly reduce time-to-market while improving operational efficiency. Developers can focus on creating business value, while AWS services automatically handle scaling, availability, and infrastructure maintenance. AWS Lambda provides a seamless compute service where code runs in response to events, scaling instantly from a few requests per day to thousands per second. Through integration with over 200 AWS services, Lambda enables developers to build sophisticated applications using triggers from Amazon API Gateway, Amazon S3, Amazon DynamoDB, and many others. Whether you’re building data processing pipelines, real-time stream processing, or web applications, Lambda’s support for popular programming languages and development frameworks helps development teams leverage their existing skills while embracing the serverless paradigm.

MCP Server

MCP is an open protocol for AI agents to interact with external tools and data sources. It defines how AI assistants can discover, understand, and use various capabilities provided by external systems. This protocol allows AI models to extend their functionality beyond their own training data by accessing real-time information and executing specific tasks through standardized interfaces.

MCP servers implement this protocol by providing tools, resources, and contextual information that AI assistants can use via MCP clients. They serve as a knowledge bridge that gives AI assistants, such as Amazon Q Developer, Cline, and Cursor, the additional context needed to make informed decisions about cloud architecture and implementation. This is particularly valuable for serverless applications, where developers navigate multiple services, event patterns, and integration points to build scalable, performant applications.

AWS currently offers the AWS Lambda Tool MCP Server, which allows AI models to directly interact with existing Lambda functions as MCP tools without any code changes. This MCP server acts as a bridge between MCP clients and Lambda functions, allowing AI assistants to access and invoke Lambda functions.

Serverless MCP Server

The open-source AWS Serverless MCP launched today enhances the serverless development experience by providing AI coding assistants with comprehensive knowledge of serverless patterns, best practices, and AWS services. This MCP server acts as an intelligent companion, guiding developers through the entire application development lifecycle, from initial design to deployment, offering contextual assistance at each stage.

The new Serverless MCP server provides tools that cover many areas of serverless development. During the initial planning and setup phase, the MCP server helps developers initialize new projects using AWS Serverless Application Model (AWS SAM) templates, select appropriate Lambda runtimes, and set up project dependencies. This enables developers to quickly bootstrap new serverless applications with the right configuration and structure.

As development progresses, the MCP server assists with building and deploying serverless applications. It provides tools for local testing, building deployment artifacts, and managing deployments. For web applications, the MCP server offers specialized support for deploying backend, frontend, or full-stack applications, and setting up custom domains.

The MCP server also emphasizes operational excellence through comprehensive observability tools, helping developers to effectively monitor application performance and troubleshoot issues. Throughout the development process, the server provides contextual guidance for infrastructure as code (IaC) decisions, Lambda-specific best practices, and event schemas for Lambda event source mappings (ESMs).

Serverless MCP Server in action

To demonstrate the capabilities of the Serverless MCP Server, this example walks you through a scenario of creating, deploying, and troubleshooting a serverless application.

Prerequisites and installation

To get started, download the AWS Serverless MCP Server from GitHub or Python Package Index (PyPi) and follow the installation instructions. You can use this MCP server with any AI coding assistant of your choice, such as Q Developer, Cursor, Cline, etc. The walkthrough example in this post uses Cline.

Add the following code to your MCP client configuration. The Serverless MCP Server uses the default AWS profile by default. Specify a value in AWS_PROFILE if you want to use a different profile. Similarly, adjust the AWS Region and log level values as needed.

{
  "mcpServers": {
    "awslabs.aws-serverless-mcp": {
      "command": "uvx",
      "args": [
        "awslabs.aws_serverless_mcp_server@latest"
      ],
      "env": { 
        "AWS_PROFILE": "your-aws-profile",
        "AWS_REGION": "us-east-1",
        "FASTMCP_LOG_LEVEL": "ERROR"
      }
    }
  }
}

The Serverless MCP Server incorporates built-in guardrails to ensure secure and controlled development. By default, the MCP server operates in a read-only mode, allowing only non-mutating actions. This safety-first approach allows you to explore serverless capabilities and architectural patterns while preventing unintended changes to your applications or infrastructure. The server also restricts access to Amazon CloudWatch Logs by default, protecting sensitive operational data from exposure to AI assistants.

As your development needs evolve, you can selectively override these security defaults. The --allow-write flag enables mutating operations for tasks such as deployments and updates, while --allow-sensitive-data-access provides access to CloudWatch Logs for debugging and troubleshooting. Consider enabling these permissions only when necessary and in appropriate development contexts.

Creating and deploying a serverless application

Imagine that you want to build a to-do list web application. Start by prompting your AI assistant.

I want to build a new to-do list web application in a new workspace. I want to add, list, and delete to-dos. Would AWS Lambda be a good choice for this?

The agent uses the get_lambda_guidance_tool to receive tailored guidance based on the use case and the inferred event source, API Gateway in this case. Then, you want to better understand how to deploy the application to AWS.

I later want to deploy the application to AWS. Which Infrastructure as Code tool would be best for this?

There are several ways to deploy your functions to AWS such as AWS SAM or the AWS Cloud Development Kit (AWS CDK). The model opts to get more information before making a recommendation. It selects the get_iac_guidance_tool from the Serverless MCP Server.

The Serverless MCP Server can also assist you with interacting with the AWS Serverless Application Model Command Line Interface (AWS SAM CLI) as well as deploying application-based blueprints such as backend, frontend, or full-stack applications.

Before the deployment, test the function locally to verify that it works correctly.

I want to test my Lambda function locally before deploying it. Can you invoke the “TodoFunction” from my SAM project with a GET request?

Now that you are confident that the application works, you want to deploy the backend logic to AWS. So, you prompt the agent to deploy your app. It selects the deploy_webapp tool and provides the tool with arguments to deploy the app, including backend deploymentType. This tool uses the AWS Lambda Web Adapter framework, allowing you to easily deploy applications written in standard web frameworks like Express or Next.js, without any additional web adapters.

The agent can also check on the status of a deployment to make sure that it completes successfully.

Your application is now deployed to AWS and ready to serve traffic.

Troubleshooting the serverless application

The Serverless MCP Server also supports you in testing and troubleshooting your Lambda functions. It provides valuable context to AI assistants in identifying and remediating issues. For example, you can use your agent to check for any issues with your application.

Check the logs and metrics of the todo app for any issues

The model can analyze the logs and metrics and offer recommendations such as adjusting memory allocation.

The Serverless MCP Server tools provide the assistant with the additional context it needs to identify issues and generate recommendations.

In this example, the analysis and recommendations are:

Log Analysis

I’ve checked the logs for all three Lambda functions of your todo app, looking specifically for errors:

• ListTodosFunction: No errors detected
• AddTodoFunction: No errors detected
• DeleteTodoFunction: No errors detected

All functions are operating normally with successful executions and no error logs.

Performance Metrics

Lambda Function Performance
• Memory usage is reasonable at 100–106 MB out of 512 MB allocated (about 20% utilization)
• Fast execution durations for warm starts: 85–195 ms range
• No throttling or out-of-memory errors detected
• No function errors or exceptions in the logs

API Gateway Metrics
• Total Requests: 11 requests recorded
• Average Latency: 797 ms
• P95 Latency: 3364 ms (high)
• 4XX Errors: 0 (no client errors)
• 5XX Errors: 1 (one server error detected)

Issues Identified

The log analysis is clean with no errors, but the metrics reveal one potential issue:

One 5XX Error in API Gateway: There was a single server-side error recorded in the API Gateway logs. Since there are no corresponding errors in the Lambda function logs, this might be related to:
• A timeout (API Gateway taking too long to respond)
• An integration issue between API Gateway and Lambda
• A temporary network issue
Relatively High P95 Latency: The 95th percentile latency of 3364 ms is quite high compared to the average. This suggests some outlier requests are taking significantly longer to complete.

Recommendations

Monitor the 5XX error rate as it resurfaces.
Consider implementing more detailed logging at the edge of the error path.
Investigate the cause of the high P95 latency. Check for cold starts, payload size, or network latency as likely culprits.
Consider enabling API Gateway and Lambda performance dashboards for more granular, historical latency and error metrics.
Continue to monitor memory and duration metrics to ensure performance remains healthy as traffic increases.

Features supported by the Serverless MCP Server

The Serverless MCP Server provides several MCP tools, which can be classified into four categories.

Serverless application lifecycle
- sam_init_tool: Initializes a new AWS SAM project with the chosen runtime and dependencies.
- sam_build_tool: Builds a serverless application using the AWS SAM CLI and prepares deployment artifacts.
- sam_deploy_tool: Deploys a serverless application to AWS, managing artifact upload and stack creation.
- sam_local_invoke_tool: Locally invokes a Lambda function for testing with custom events and environments.
Web application deployment and management
- deploy_webapp_tool: Deploys backend, frontend, or fullstack web applications to Lambda using the Lambda Web Adapter.
- update_frontend_tool: Updates the frontend assets and optionally invalidates the Amazon CloudFront cache.
- configure_domain_tool: Configures a custom domain, includes certificate and DNS setup.
Observability
- sam_logs_tool: Retrieves logs, and supports filtering and time range selection.
- get_metrics_tool: Fetches specified metrics.
Guidance, IaC templates, and deployment help
- get_iac_guidance_tool: Provides guidance for selecting IaC tools.
- get_lambda_guidance_tool: Offers advice on when to use Lambda for specific runtimes and use cases.
- get_lambda_event_schemas_tool: Returns event schemas for Lambda integrations.
- get_serverless_templates_tool: Supplies example AWS SAM templates for different serverless application types.
- deployment_help_tool: Provides help and status information about deployments.
- deploy_serverless_app_help_tool: Offers instructions for deploying serverless applications to Lambda.

Visit the Serverless MCP Server documentation for the full list of tools and resources.

Best practices and considerations

When building serverless applications with the AWS Serverless MCP Server, start by using its AI-assisted guidance for architectural decisions. Throughout development, use its guidance tools to make informed decisions about service selection, event patterns, and infrastructure design. Before deploying to AWS, use the Serverless MCP Server’s local testing capabilities to validate your application’s behavior. This approach helps ensure your application aligns with AWS best practices.

Robust monitoring and observability are critical to reliably operate your applications running in production. Use the Serverless MCP Server tools for deployment monitoring and setting up logging and metrics. This helps track application performance and quickly identify potential issues.

Conclusion

The open-source AWS Serverless MCP Server streamlines serverless application development by providing AI-assisted guidance throughout the development lifecycle. By combining AI assistance with serverless expertise, it enables developers to build and deploy applications more efficiently. The Serverless MCP Server’s toolset supports the complete development process, from initialization to observability, while helping developers implement AWS best practices.

As organizations continue to adopt serverless computing, tools that streamline development and accelerate delivery become increasingly valuable. AWS will continue to expand the collection of MCP servers for developers building serverless applications and refine existing tools based on customer feedback and emerging serverless development patterns.

To get started, visit the GitHub repository and explore the documentation. Share your experiences and suggestions through the GitHub repository to improve the MCP server’s capabilities and help shape the future of AI-assisted serverless development.

For more serverless learning resources, visit Serverless Land.

Enhancing multi-account activity monitoring with event-driven architectures

2025-05-28 Anton Aleksandrov

Post Syndicated from Anton Aleksandrov original https://aws.amazon.com/blogs/compute/enhancing-multi-account-activity-monitoring-with-event-driven-architectures/

Enterprise cloud environments are growing increasingly complex as they scale, with organizations managing hundreds to thousands of Amazon Web Services (AWS) accounts across multiple business units and AWS Regions. Organizations need efficient ways to collect, transport, and analyze activity data for threat detection and compliance monitoring. This presents unique challenges for enterprise Application Security (AppSec) teams, cloud security vendors, and DevSecOps professionals, because traditional polling-based monitoring approaches struggle to provide real-time activity insights needed for modern cloud operations.

In this post, you will learn to use AWS CloudTrail and Amazon EventBridge for real-time cloud activity monitoring and automated response.

Overview

As organizations expand their cloud footprint, account activity monitoring that comprehensively tracks user actions and successfully identifies security threats becomes crucial for threat detection and compliance. Although AWS provides native tools—such as CloudTrail for API activity capture, EventBridge for real-time event routing, AWS Organizations for multi-account management, and AWS Config for resource evaluation—many enterprises struggle with the volume of activities while maintaining efficiency and controlling costs. Organizations need to carefully architect solutions to effectively use these tools as their environments scale.

Traditional polling-based techniques, which worked well for smaller environments, can become unsustainable when scaled to enterprise deployments, where the volume of activity data grows exponentially with each new account and service. API polling limitations, growing data volumes, and increasing demand for real-time analysis are pushing teams to rethink their architectural approach.

Figure 1. Poll model, periodically retrieving the latest state.

Adopting push-based event-driven architectures offers a compelling solution for AppSec teams and cloud security vendors facing these challenges. Using AWS services, such as CloudTrail and EventBridge, allows these teams and vendors to build scalable activity monitoring solutions that overcome the limitations of traditional polling-based approaches and provide real-time notifications across thousands of AWS accounts. This approach not only enables security use cases but also supports broader real-time operational monitoring, compliance reporting, and automation requirements.

“By integrating AWS CloudTrail and Amazon EventBridge, we’ve built a scalable architecture to monitor activity across thousands of AWS accounts. This provides the visibility needed to detect threats in real time and protect large, distributed AWS environments.” — Rob Solomon, Senior Cloud Solution Architect, CrowdStrike

Solution components

Enterprise AppSec teams and cloud security vendors share common requirements when building multi-account monitoring solutions. They need to efficiently collect activity data across thousands of accounts, transport it to a centralized location for analysis, and process it in real-time to detect threats and compliance violations. The solution must scale seamlessly from dozens to thousands of accounts while remaining highly-performant and cost-efficient. At its core, a scalable multi-account activity monitoring solution consists of three components: activity data collection, cross-account transport to a centralized location, and processing. In the following sections, you will learn how AppSec teams and cloud security vendors can implement each step efficiently while avoiding common pitfalls.

Figure 2. Push model. Account activity is collected at the source, and pushed to the AppSec or cloud security vendor account for further processing.

Data collection strategies

Many teams begin their cloud activity monitoring journey by polling the resource status through service management APIs. Although this approach works good for retrieving the latest resource state on-demand, its fundamental limitation is inability to detect state changes efficiently, necessitating continuous querying of all resources at fixed intervals. Consider a scenario where you’re monitoring 1,000 accounts, with 100 resources in each account. A single polling cycle would necessitate 100,000 API requests, consuming over 28 million API calls daily if running at five-minute intervals. This inefficiency compounds as environments grow, leading to throttling issues, increased costs, and scaling challenges.

AWS Config improved upon this by offering continuous resource configuration tracking without manual polling. Although this works excellent for configuration compliance and a history of changes for auditing, AWS Config reports changes on a best-effort basis and is not optimal for real-time threat detection.

To overcome this constraint, your solution can use services such as CloudTrail and EventBridge as primary data sources, complemented by intelligent on-demand targeted API polling. CloudTrail records API activity across AWS services, providing a detailed history of actions taken by users, roles, and AWS services in your accounts. Over 250 AWS services automatically report their activity and API calls to CloudTrail and EventBridge in real-time. This allows you to capture this information, providing a detailed history of actions taken in your accounts, and enabling security analysis, resource state change tracking, and compliance audit.

Figure 3. Over 250 AWS services are automatically emitting activity events to CloudTrail.

When a resource state changes, commonly as a result of a management API call, the affected service sends an event to CloudTrail and EventBridge. Your monitoring solution can examine the event payload to determine if polling for supplementary data is necessary, particularly when the initial payload lacks complete information. This provides you with comprehensive service coverage with reduced maintenance effort. This hybrid approach guarantees delivery of activity data to eliminate monitoring blind spots, while significantly reducing AWS management API quota consumption.

Cross-account data transport

Your solution should transport activity data from thousands of tenant accounts into a small number of centralized accounts, such as a regional AppSec account, for further processing and analysis. The solution must be secure, scalable, resilient, and cost-efficient while maintaining real-time delivery.

The most direct way to achieve it is to enable Amazon S3 event notifications for new objects that are created in the CloudTrail trails S3 bucket. When you receive the notification, you can retrieve and process new activities.

Figure 4. Exporting CloudTrail events into an S3 bucket and retrieving after receiving a notification.

This direct way to consume CloudTrail events has one important consideration: typically it can take an average of five minutes to deliver events to Amazon S3. Teams and vendors looking for lower mean-time-to-detect (MTTD) and mean-time-to-respond (MTTR) should evaluate transporting CloudTrail events across accounts with EventBridge, which provides close-to-real-time delivery.

Transporting events with EventBridge

EventBridge is a serverless event router that connects applications. It receives events from various sources, such as CloudTrail, and routes them to multiple targets based on defined rules.

Using EventBridge for cross-account data transport comes with several major benefits:

EventBridge seamlessly delivers events across accounts and AWS Regions, providing support for automatic retries, exponential backoff with jitter, and dead-letter queues.
EventBridge is natively integrated with CloudTrail.
EventBridge delivers events in close to real-time.
EventBridge is a serverless event router managed and operated by AWS. You don’t need to be concerned with its scaling and you always pay per number of events it transports.

There are two approaches you can take for delivering cross-account events with EventBridge: direct service-to-service or service-to-API-endpoint.

The first approach uses the EventBridge direct bus-to-bus and bus-to-service delivery capabilities. This method is most suitable when you want AWS to handle data ingestion on the receiving end. The delivery target is always either an EventBridge bus, or another AWS service, such as an Amazon Simple Queue Service (Amazon SQS) queue, Amazon Kinesis Data Streams stream, or an AWS Lambda function. With support for up to 18,750 target invocations per second and native AWS Identity and Access Management (IAM) integration, this method is particularly suitable for large multi-account deployments.

The second approach uses the EventBridge API destinations feature. This method is most suitable when you have existing HTTP-based ingestion endpoints in place. Although it offers lower throughput, it provides greater flexibility for ingestion endpoint and authentication methods implementation, making it attractive for AppSec teams and cloud security vendors integrating with existing ingestion infrastructure.

Figure 5. Emitting CloudTrail events in real-time through EventBridge.

The following table summarizes two approaches for transporting events across accounts with EventBridge.

	Direct bus-to-bus or bus-to-service	API destinations
Data ingestion implementation effort	Minimal	Needed
Default target invocations per second (TPS) quotas	Up to 18,750 (region dependent)	Up to 300 (region dependent)
Can the TPS quota be increased	Yes	Yes
Authorization support	Native AWS IAM, fully handled by AWS	Basic, OAuth2, API Key. You’re responsible for implementing credentials validation during ingestion.
Cross-account delivery costs	$1 per million events	$0.20 per million events

Go to the EventBridge quotas and pricing pages for more details.

Processing architecture

Processing would commonly be done by existing products and services the AppSec team or cloud security vendor provides for activity analysis. The architecture for event processing pipeline operating at enterprise scale must consider design decisions to handle large and potentially irregular event volumes while maintaining high performance, as shown in the following figure.

Figure 6. An activity event processing pipeline, with priority-based processing.

Use the following best practices for a robust processing architecture:

Buffer ingested events Use services such as Amazon SQS, Amazon Kinesis Data Streams, or Amazon Managed Streaming for Apache Kafka to buffer incoming events, handle traffic surges, and make sure of reliable processing.
Use serverless services that scale automatically, or invest in automated scaling mechanisms that adjust processing capacity based on event volume
Minimize polling: Resort to intelligent on-demand polling, only poll when you need additional data that is not available in the CloudTrail event payload.
Routing and classification: Rather than processing all events equally, implement intelligent classification and routing early in your pipeline. Security-related events such as IAM changes or security group modifications should take priority over routine activities or data events. This approach helps to control costs while maintaining rapid detection of important security events.
Cost optimization: At the enterprise scale, cost optimization becomes crucial. Use EventBridge rules in source accounts to filter out irrelevant events before they enter your processing pipeline. Consider implementing regional collection points to optimize data transfer costs. When using Lambda functions for data processing, use batch processing to reduce invocation costs. Evaluate which event types must be delivered in real-time through EventBridge, which event types can be delayed and collected through S3 bucket export, and which events should be discarded.
Observability: Monitor the ingestion and processing throughput to react to potential slowdowns early. Detect when source accounts are approaching EventBridge quotas. Consider using AWS Service Quotas to request quota increases automatically through APIs.
Cross-Region considerations: Design your architecture to support efficient cross-Region event collection while respecting data sovereignty requirements. Consider implementing regional processing nodes with centralized aggregation for global security analysis.
Integration patterns: Modern security solutions must integrate with existing security tools and workflows. Implement standardized output formats that allow seamless integration with SIEM systems, ticketing platforms, and automation frameworks. Consider publishing security findings back to EventBridge buses to enable automated remediation workflows. If you’re a cloud security vendor, then consider integrating with EventBridge as an SaaS partner.

Conclusion

Event-driven architectures present a powerful opportunity for building scalable multi-account activity monitoring solutions. Using services such as AWS CloudTrail and Amazon EventBridge allows teams to overcome the limitations of traditional polling-based approaches while achieving close to real-time delivery.The shift to event-driven security monitoring isn’t just an architectural choice—it’s becoming a necessity for teams operating at enterprise scale. This approach enables security teams to achieve the real-time threat detection capabilities needed in today’s cloud environments while maintaining operational efficiency and cost control.

Optimizing fleet operations using Amazon SageMaker AI and Amazon Bedrock

2025-05-28 Manny Sidhu

Post Syndicated from Manny Sidhu original https://aws.amazon.com/blogs/architecture/optimizing-fleet-operations-using-amazon-sagemaker-ai-and-amazon-bedrock/

Every year in the United States, distracted driving claims thousands of lives and causes immense financial damage. More than 1.6 million accidents annually are caused by cell phone use while driving, and another 1.5 million result from drowsy drivers falling asleep at the wheel. These devastating—and preventable—accidents have sparked a major push for enhanced driver safety.

This initiative is particularly crucial in the commercial fleet industry, as accidents involving a large truck are often more dangerous and can cost hundreds of thousands of dollars. This post explores an innovative solution that leverages Amazon SageMaker AI and Amazon Bedrock to revolutionize driver coaching and enhance fleet efficiency. By harnessing the power of machine learning and artificial intelligence, we demonstrate how fleet operators can transform raw dashcam footage into actionable insights, empowering real-time driver monitoring and proactive safety measures – reducing costly accidents. Our approach combines AWS Artificial Intelligence (AI) and Internet of Things (IoT) services to create a comprehensive solution that not only detects distracted driving but also continuously improves its performance over time. Through this solution, we aim to show how fleet managers can significantly reduce distracted driving incidents, improve operational efficiency, and ultimately drive down costs in their commercial vehicle operations.

The Challenge: Effectively managing multiple dashcam feeds from commercial vehicle fleet

Today’s commercial vehicles are equipped with multi-camera systems that provide comprehensive coverage: inward-facing cameras monitor driver behavior, outward-facing cameras track oncoming traffic, and side/rear cameras detect cross-traffic and potential rear-end collisions. The sheer volume of video data generated by thousands of vehicles daily creates significant management and analysis challenges. While fleet operators traditionally use this dashcam footage for reactive purposes – such as law enforcement reporting, insurance claims, and driver exoneration – many organizations are missing a significant opportunity to leverage this data. As commercial fleets accumulate more miles, they generate rich datasets that can be used to train AI models capable of facilitating proactive safety improvements.

In this post, we’ll explore how to maximize the value of dashcam footage through best practices for implementing and managing Computer Vision systems in commercial fleet operations. We’ll demonstrate how to build and deploy edge-based machine learning models that provide real-time alerts for distracted driving behaviors, while effectively collecting, processing, and analyzing footage to train these AI models. This approach transforms fleet operations from reactive incident management to proactive safety enhancement, helping organizations convert raw video data into actionable insights that reduce safety incidents and improve overall fleet operational efficiency and cost-effectiveness.

Solution overview

A Distracted Driving Incident can occur when drivers engage in unsafe behaviors such as speeding, rolling stops, harsh braking, and aggressive acceleration. Fleet managers need to understand not just what happened during these incidents, but also the driver’s state of attention – whether they were focused on the road or distracted by activities like using a cellphone, eating, drinking, or experiencing fatigue common in long-haul driving.

Our solution leverages AWS services to create an end-to-end workflow capable of detecting and mitigating distracted driving. The steps involved include:

Incident capture, ingestion, and labeling
Model training, optimization, and deployment
Continuous testing and improvement

Solution deep dive

This solution relies on a mix of AWS IoT, AI and generative AI services to build a scalable and cost-effective solution. Let’s start by looking at high level solution architecture and build the solution step-by-step.

Incident capture, ingestion, and labeling

To start the process of ingesting videos from a driver’s dashboard camera into the cloud, we capture the dashcam’s feed using the IoT Greengrass Kinesis Video Streamer Component. The video is streamed into the AWS Cloud using Kinesis Video Streams and stored in Amazon S3 by leveraging Kinesis Firehose. The videos are then converted into individual frames, analyzed by the Amazon Bedrock Nova Pro model to determine driver distraction, and sorted by an AWS Lambda function into an S3 bucket based on the analysis results. The sorted frames will next be used to train an AI model for edge deployment to detect distracted driving.

From a security perspective, it’s good practice to encrypt data in Amazon S3 buckets using AWS Key Management Service (KMS). You can enforce this by setting up SSE-KMS as the default encryption method to automatically encrypt uploaded objects. We also recommend implementing fine-grained AWS Identity & Access Management (IAM) roles to grant scoped access to images and videos. For data in transit between the edge and the cloud, you can use AWS IoT Greengrass certificates to encrypt your data and enforce identity verification. These measures can help protect against unauthorized access.

With this process in place, we are continually collecting data from our fleet of commercial vehicles (while keeping security in mind). This data is automatically categorized and labeled based on the analysis from our Nova Pro model, and conveniently stored in S3, enabling us to seamlessly train an AI model – a process which we will describe next.

Model training, optimization, and deployment

The following diagram illustrates the process of training and deploying a distracted driver detection model. The process runs inside of an Amazon SageMaker Pipelines Workflow, which allows for seamless orchestration of other Amazon SageMaker AI services. This workflow begins with labeled driver images stored in Amazon S3, generated from the previously described workflow. This labeled dataset – consisting of driver images labeled as “distracted” or “not distracted’ – is used to train a ResNet50 model using Amazon SageMaker Training Jobs running on a Trn1 instance for price performance. As we train, the model learns how to identify distracted drivers. Once complete, the trained model is then quantized to INT8 using SageMaker Processing Jobs, and optimized for our specific type of edge hardware using SageMaker Neo. The optimized model is then stored in the SageMaker Model Registry for version control and governance (this will be helpful later when we iterate on our model with new training data). Finally, the model is pushed to S3 where AWS IoT Greengrass can initiate a deployment to the fleet of edge devices.

Running on the edge, the model performs inference multiple times a second on frames from the inward facing dashcam. (Inference speed calculated assuming edge compute has specs comparable to a Raspberry-Pi class of device.) If the driver is found to be distracted, the system alerts the driver by means of a noise. (ex. driver was falling asleep, and alert awakens them).

With this process in place, we have successfully leveraged the dataset we generated in the first diagram to train, optimize, and deploy our custom model to the ‘edge’ – in this case, to each vehicle in our fleet. Our model is now alerting drivers of dangerous behavior and helping to proactively prevent collisions. But our model likely isn’t perfect – perhaps it misses a dangerous behavior that wasn’t in the training dataset, or alerts unnecessarily. To validate our model is working well and further improve it to reduce errors, we implement continuous testing and improvement procedures.

Continuous testing and improvement

We need to continue to ingest driver dashcam data and compare our edge model’s predictions with our original source of ‘ground truth’ – Nova Pro.

The system collects frames for model validation in two scenarios: when vehicle telemetry detects incidents (hard braking, crashes) or when the edge model identifies distracted driving. These frames are sent to Amazon Bedrock for a ‘fact check’ to see if the edge model performed optimally. The comparative results between Amazon Bedrock and the edge model are stored in a dedicated S3 bucket for model evaluation. When sufficient new validated data is collected, or when the model’s agreement with Amazon Bedrock falls below a threshold, Amazon EventBridge triggers the previously described SageMaker Pipelines Workflow to fine tune, optimize, and re-deploy the improved model to the edge, now powered by our newly collected ‘disagreement data’.

We should also perform comparative analysis of our new model against our historical models stored in the Amazon SageMaker Model Registry to validate that our latest model actually performs better than historical models, verifying we don’t see a regression. If our latest model doesn’t outperform historical models, we should not deploy it, and instead investigate if we are suffering from overfitting or bad training data. In summary, we now have a model running inside fleet vehicles capable of alerting drivers to unsafe behavior. This could effectively reduce drowsy driving accidents by keeping drivers awake and alert, while also warning drivers about unsafe decisions like eating or using a cell phone while driving. This system is also self-training and self-improving, so it will continue to get better over time. Additionally, fleet management companies could aggregate safety data and reward top drivers to further incentivize safe driving habits.

Conclusion

In this post, we’ve explored an innovative solution that leverages AWS services to revolutionize driver coaching and fleet operations. By combining the power of Amazon SageMaker and Amazon Bedrock with AWS IoT and edge computing capabilities, we’ve demonstrated how to create a comprehensive, scalable solution for monitoring and improving driver behavior in real-time. This solution addresses the challenges of managing vast amounts of dashcam footage from commercial vehicle fleets, transforming raw video data into actionable insights. By implementing an end-to-end workflow that includes incident capture, categorization, model training, deployment, and continuous improvement, fleet operators can shift from reactive incident management to proactive safety enhancement. The benefits of this approach include:

Enhanced safety: Real-time detection of distracted driving behaviors allows for immediate intervention and coaching.
Improved efficiency: Automated analysis of dashcam footage reduces manual review time and costs.
Scalability: The solution can handle large fleets and growing datasets with ease.
Continuous improvement: The system learns and adapts over time, becoming more accurate and effective.
Cost-effectiveness: By leveraging edge computing and optimized models, the solution minimizes compute costs.

As the transportation industry continues to evolve, solutions like this will play a crucial role in improving road safety, reducing operational costs, and enhancing overall fleet performance. By harnessing the power of AI and cloud computing, fleet operators can create safer, more efficient driving environments that benefit not only their businesses but also society as a whole. The future of fleet operations is here, and it’s driven by intelligent, data-driven systems that turn every mile driven into an opportunity for improvement and innovation.

Learn more by exploring AWS code samples to build hands-on SageMaker expertise. See the service in action through practical examples that demonstrate how to optimize model training and deployment across various use cases. Understand the financial advantages by conducting a cloud economics TCO analysis comparing traditional infrastructure against SageMaker’s managed services. This exercise reveals how SageMaker alleviates hidden costs while accelerating your ML development cycle.

Ready to take the next step? Connect with your AWS Solutions Architect to arrange a SageMaker AI Immersion Day tailored to your team’s specific challenges. These expert-led sessions provide personalized guidance that will help you implement SageMaker effectively within your organization’s unique context. For deeper dive into other relevant services Amazon Kinesis Video Streams, AWS IoT Greengrass, Amazon Bedrock