Tag Archives: Technical How-to

Load test your applications in a CI/CD pipeline using CDK pipelines and AWS Distributed Load Testing Solution

Post Syndicated from Krishnakumar Rengarajan original https://aws.amazon.com/blogs/devops/load-test-applications-in-cicd-pipeline/

Load testing is a foundational pillar of building resilient applications. Today, load testing practices across many organizations are often based on desktop tools, where someone must manually run the performance tests and validate the results before a software release can be promoted to production. This leads to increased time to market for new features and products. Load testing applications in automated CI/CD pipelines provides the following benefits:

  • Early and automated feedback on performance thresholds based on clearly defined benchmarks.
  • Consistent and reliable load testing process for every feature release.
  • Reduced overall time to market due to eliminated manual load testing effort.
  • Improved overall resiliency of the production environment.
  • The ability to rapidly identify and document bottlenecks and scaling limits of the production environment.

In this blog post, we demonstrate how to automatically load test your applications in a CI/CD pipeline using the AWS Distributed Load Testing solution and AWS CDK Pipelines.

The AWS Cloud Development Kit (AWS CDK) is an open-source software development framework to define cloud infrastructure in code and provision it through AWS CloudFormation. AWS CDK Pipelines is a construct library module for continuous delivery of AWS CDK applications, powered by AWS CodePipeline. AWS CDK Pipelines can automatically build, test, and deploy the new version of your CDK app whenever the new source code is checked in.

Distributed Load Testing is an AWS Solution that automates software applications testing at scale to help you identify potential performance issues before their release. It creates and simulates thousands of users generating transactional records at a constant pace without the need to provision servers or instances.

Prerequisites

To deploy and test this solution, you will need:

  • AWS Command Line Interface (AWS CLI): This tutorial assumes that you have configured the AWS CLI on your workstation. Alternatively, you can also use AWS CloudShell.
  • AWS CDK V2: This tutorial assumes that you have installed AWS CDK V2 on your workstation or in the CloudShell environment.

Solution Overview

In this solution, we create a CI/CD pipeline using AWS CDK Pipelines and use it to deploy a sample RESTful CDK application in two environments: development and production. We load test the application using the AWS Distributed Load Testing solution in the development environment. Based on the load test result, we either fail the pipeline or proceed to production deployment. You may consider running the load test in a dedicated testing environment that mimics the production environment.

For demonstration purposes, we use the following metrics to validate the load test results.

  • Average Response Time – the average response time, in seconds, for all the requests generated by the test. In this blog post, we set the threshold for average response time to 1 second.
  • Error Count – the total number of errors. In this blog post, we set the threshold for the total number of errors to 1.

For your application, you may consider using additional metrics from the Distributed Load Testing solution documentation to validate your load test.
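
To make these checks concrete, the following is a minimal Python sketch of how such a validation might be expressed. The result keys (avg_response_time, error_count) are placeholders for illustration and are not the exact field names returned by the solution's API.

# Minimal sketch of a threshold check for load test results.
# The result keys below are illustrative placeholders, not the exact
# field names returned by the Distributed Load Testing solution API.
AVG_RT_THRESHOLD = 1.0     # seconds
ERROR_COUNT_THRESHOLD = 1  # total errors

def validate_results(results: dict) -> None:
    if results["avg_response_time"] > AVG_RT_THRESHOLD:
        raise RuntimeError(
            f"Average response time {results['avg_response_time']}s exceeds {AVG_RT_THRESHOLD}s"
        )
    if results["error_count"] >= ERROR_COUNT_THRESHOLD:
        raise RuntimeError(
            f"Error count {results['error_count']} breaches the threshold of {ERROR_COUNT_THRESHOLD}"
        )

validate_results({"avg_response_time": 0.52, "error_count": 0})  # passes for the sample application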

Architecture diagram

Architecture diagram of the solution to execute load tests in CI/CD pipeline

Solution Components

  • AWS CDK code for the CI/CD pipeline, including AWS Identity and Access Management (IAM) roles and policies. The pipeline has the following stages:
    • Source: fetches the source code for the sample application from the AWS CodeCommit repository.
    • Build: compiles the code and executes cdk synth to generate CloudFormation template for the sample application.
    • UpdatePipeline: updates the pipeline if there are any changes to our code or the pipeline configuration.
    • Assets: prepares and publishes all file assets to Amazon S3 (S3).
    • Development Deployment: deploys application to the development environment and runs a load test.
    • Production Deployment: deploys application to the production environment.
  • AWS CDK code for a sample serverless RESTful application.
    Architecture diagram of the sample RESTful application
    • The AWS Lambda (Lambda) function in the architecture contains a 500 millisecond sleep statement to add latency to the API response.
  • TypeScript code for starting the load test and validating the test results. This code is executed in the ‘Load Test’ step of the ‘Development Deployment’ stage. It starts a load test against the sample RESTful application endpoint and waits for the test to finish. For demonstration purposes, the load test is started with the following parameters (a sketch of this step follows the list):
    • Concurrency: 1
    • Task Count: 1
    • Ramp up time: 0 secs
    • Hold for: 30 sec
    • End point to test: endpoint for the sample RESTful application.
    • HTTP method: GET
  • Load Testing service deployed via the AWS Distributed Load Testing Solution. For costs related to the AWS Distributed Load Testing Solution, see the solution documentation.
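
The repository implements the ‘Load Test’ step in TypeScript; the following Python sketch only illustrates the general shape of that call for readers who want to script it themselves. The /scenarios path and the payload field names are assumptions based on the solution’s API and should be verified against the Distributed Load Testing API documentation.

import json
import urllib.request

import boto3
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest

# DLTApiEndpoint output from the Distributed Load Testing stack (placeholder value).
DLT_API_ENDPOINT = "https://abc123.execute-api.us-east-1.amazonaws.com/prod"

# Payload field names are assumptions for illustration; verify them against
# the Distributed Load Testing solution's API documentation before use.
scenario = {
    "testName": "sampleScenario",
    "testDescription": "Load test started from the CI/CD pipeline",
    "taskCount": 1,
    "testType": "simple",
    "testScenario": {
        "execution": [{"concurrency": 1, "ramp-up": "0s", "hold-for": "30s"}],
        "scenarios": {
            "sampleScenario": {"requests": [{"url": "https://<sample-app-endpoint>", "method": "GET"}]}
        },
    },
}

# The DLT API is IAM-authorized, so the request must be SigV4 signed.
session = boto3.Session()
region = session.region_name or "us-east-1"
url = f"{DLT_API_ENDPOINT}/scenarios"
aws_request = AWSRequest(
    method="POST",
    url=url,
    data=json.dumps(scenario).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
SigV4Auth(session.get_credentials(), "execute-api", region).add_auth(aws_request)

http_request = urllib.request.Request(
    url, data=aws_request.body, headers=dict(aws_request.headers.items()), method="POST"
)
with urllib.request.urlopen(http_request) as response:
    print(response.status, response.read().decode("utf-8"))

The same signed-request pattern can be used to poll the test status until it completes and then apply the threshold checks shown earlier.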

Implementation Details

For the purposes of this blog, we deploy the CI/CD pipeline, the RESTful application and the AWS Distributed Load Testing solution into the same AWS account. In your environment, you may consider deploying these stacks into separate AWS accounts based on your security and governance requirements.

To deploy the solution components

  1. Follow the instructions in the AWS Distributed Load Testing solution Automated Deployment guide to deploy the solution. Note down the value of the CloudFormation output parameter ‘DLTApiEndpoint’. We will need this in the next steps. Proceed to the next step once you are able to log in to the user interface of the solution.
  2. Clone the blog Git repository
    git clone https://github.com/aws-samples/aws-automatically-load-test-applications-cicd-pipeline-blog

  3. Update the Distributed Load Testing Solution endpoint URL in loadTestEnvVariables.json.
  4. Deploy the CloudFormation stack for the CI/CD pipeline. This step will also commit the AWS CDK code for the sample RESTful application stack and start the application deployment.
    cd pipeline && cdk bootstrap && cdk deploy --require-approval never
  5. Follow these steps to view the load test results:
      1. Open the AWS CodePipeline console.
      2. Click on the pipeline named “blog-pipeline”.
      3. Observe that one of the stages (named ‘LoadTest’) in the CI/CD pipeline (that was provisioned by the CloudFormation stack in the previous step) executes a load test against the application Development environment.
        Diagram representing CodePipeline highlighting the LoadTest stage passing successfully
      4. Click on the details of the ‘LoadTest’ step to view the test results. Notice that the load test succeeded.
        Diagram showing sample logs when load tests pass successfully

Change the response time threshold

In this step, we will modify the response time threshold from 1 second to 200 milliseconds in order to introduce a load test failure. Remember from the steps earlier that the Lambda function code has a 500 millisecond sleep statement to add latency to the API response time.
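
The sample application's handler ships with the repository; purely as an illustration (not the repository's actual code), a handler with that artificial delay could look like the following.

import json
import time

def handler(event, context):
    # Simulate processing latency so the load test has measurable response times;
    # the 500 ms sleep mirrors the delay described for the sample application.
    time.sleep(0.5)
    return {
        "statusCode": 200,
        "body": json.dumps({"message": "Hello from the sample RESTful application"}),
    }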

  1. From the AWS Console, go to CodeCommit. The source for the pipeline is a CodeCommit repository named “blog-repo”.
  2. Click on the “blog-repo” repository, and then browse to the “pipeline” folder. Click on file ‘loadTestEnvVariables.json’ and then ‘Edit’.
  3. Set the response time threshold to 200 milliseconds by changing the ‘AVG_RT_THRESHOLD’ attribute value to ‘.2’. Click on the commit button. This will start the CI/CD pipeline.
  4. Go to CodePipeline from the AWS console and click on the ‘blog-pipeline’.
  5. Observe that the ‘LoadTest’ step in the ‘Development-Deploy’ stage fails in about five minutes, and the pipeline does not proceed to the ‘Production-Deploy’ stage.
    Diagram representing CodePipeline highlighting the LoadTest stage failing
  6. Click on the details of the ‘LoadTest’ step to view the test results. Notice that the load test failed.
    Diagram showing sample logs when load tests fail
  7. Log into the Distributed Load Testing Service console. You will see two tests named ‘sampleScenario’. Click on each of them to see the test result details.

Cleanup

  1. Delete the CloudFormation stack that deployed the sample application.
    1. From the AWS Console, go to CloudFormation and delete the stacks ‘Production-Deploy-Application’ and ‘Development-Deploy-Application’.
  2. Delete the CI/CD pipeline.
    cd pipeline && cdk destroy
  3. Delete the Distributed Load Testing Service CloudFormation stack.
    1. From CloudFormation console, delete the stack for Distributed Load Testing service that you created earlier.

Conclusion

In this post, we demonstrated how to automatically load test your applications in a CI/CD pipeline using AWS CDK Pipelines and the AWS Distributed Load Testing solution. We defined the performance benchmarks for our application as configuration. We then used these benchmarks to automatically validate the application performance prior to production deployment. Based on the load test results, we either proceeded to production deployment or failed the pipeline.

About the Authors

Usman Umar

Usman Umar

Usman Umar is a Sr. Applications Architect at AWS Professional Services. He is passionate about developing innovative ways to solve hard technical problems for the customers. In his free time, he likes going on biking trails, doing car modifications, and spending time with his family.

Krishnakumar Rengarajan

Krishnakumar Rengarajan

Krishnakumar Rengarajan is a Senior DevOps Consultant with AWS Professional Services. He enjoys working with customers and focuses on building and delivering automated solutions that enable customers on their AWS cloud journey.

How to use AWS Verified Access logs to write and troubleshoot access policies

Post Syndicated from Ankush Goyal original https://aws.amazon.com/blogs/security/how-to-use-aws-verified-access-logs-to-write-and-troubleshoot-access-policies/

On June 19, 2023, AWS Verified Access introduced improved logging functionality; Verified Access now logs more extensive user context information received from the trust providers. This improved logging feature simplifies administration and troubleshooting of application access policies while adhering to zero-trust principles.

In this blog post, we will show you how to manage the Verified Access logging configuration and how to use Verified Access logs to write and troubleshoot access policies faster. We provide an example showing the user context information that was logged before and after the improved logging functionality and how you can use that information to transform a high-level policy into a fine-grained policy.

Overview of AWS Verified Access

AWS Verified Access helps enterprises to provide secure access to their corporate applications without using a virtual private network (VPN). Using Verified Access, you can configure fine-grained access policies to help limit application access only to users who meet the specified security requirements (for example, user identity and device security status). These policies are written in Cedar, a new policy language developed and open-sourced by AWS.

Verified Access validates each request based on access policies that you set. You can use user context—such as user, group, and device risk score—from your existing third-party identity and device security services to define access policies. In addition, Verified Access provides you an option to log every access attempt to help you respond quickly to security incidents and audit requests. These logs also contain user context sent from your identity and device security services and can help you to match the expected outcomes with the actual outcomes of your policies. To capture these logs, you need to enable logging from the Verified Access console.

Figure 1: Overview of AWS Verified Access architecture showing Verified Access connected to an application

Figure 1: Overview of AWS Verified Access architecture showing Verified Access connected to an application

After a Verified Access administrator attaches a trust provider to a Verified Access instance, they can write policies using the user context information from the trust provider. This user context information is custom to an organization, and you need to gather it from different sources when writing or troubleshooting policies that require more extensive user context.

Now, with the improved logging functionality, the Verified Access logs record more extensive user context information from the trust providers. This eliminates the need to gather information from different sources. With the detailed context available in the logs, you have more information to help validate and troubleshoot your policies.

Let’s walk through an example of how this detailed context can help you improve your Verified Access policies. For this example, we set up a Verified Access instance using AWS IAM Identity Center (successor to AWS Single Sign-on) and CrowdStrike as trust providers. To learn more about how to set up a Verified Access instance, see Getting started with Verified Access. To learn how to integrate Verified Access with CrowdStrike, see Integrating AWS Verified Access with device trust providers.

Then we wrote the following simple policy, where users are allowed only if their email matches the corporate domain.

permit(principal,action,resource)
when {
    context.sso.user.email.address like "*@example.com"
};

Before improved logging, Verified Access logged basic information only, as shown in the following example log.

    "identity": {
        "authorizations": [
            {
                "decision": "Allow",
                "policy": {
                    "name": "inline"
                }
            }
        ],
        "idp": {
            "name": "user",
            "uid": "vatp-09bc4cbce2EXAMPLE"
        },
        "user": {
            "email_addr": "[email protected]",
            "name": "Test User Display",
            "uid": "[email protected]",
            "uuid": "00u6wj48lbxTAEXAMPLE"
        }
    }

Modify an existing Verified Access instance

To improve the preceding policy and make it more granular, you can include checks for various user and device details. For example, you can check if the user belongs to a particular group, has a verified email, should be logging in from a device with an OS that has an assessment score greater than 50, and has an overall device score greater than 15.

Modify the Verified Access instance logging configuration

You can modify the instance logging configuration of an existing Verified Access instance by using either the AWS Management Console or AWS Command Line Interface (AWS CLI).

  1. Open the Verified Access console and select Verified Access instances.
  2. Select the instance that you want to modify, and then, on the Verified Access instance logging configuration tab, select Modify Verified Access instance logging configuration.
    Figure 2: Modify Verified Access logging configuration

    Figure 2: Modify Verified Access logging configuration

  3. Under Update log version, select ocsf-1.0.0-rc.2, turn on Include trust context, and select where the logs should be delivered.
    Figure 3: Verified Access log version and trust context

    Figure 3: Verified Access log version and trust context

After you’ve completed the preceding steps, Verified Access will start logging more extensive user context information from the trust providers for every request that Verified Access receives. This context can include sensitive information. To learn more about how to protect it, see Protect Sensitive Data with Amazon CloudWatch Logs.
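
If you prefer to script this change instead of using the console, the following is a minimal sketch using the AWS SDK for Python (boto3). The instance ID and log group are placeholders, and the AccessLogs field names should be verified against the EC2 API reference for your SDK version.

import boto3

ec2 = boto3.client("ec2")

# Placeholder identifiers; replace with your Verified Access instance ID and log group.
ec2.modify_verified_access_instance_logging_configuration(
    VerifiedAccessInstanceId="vai-0123456789abcdef0",
    AccessLogs={
        "LogVersion": "ocsf-1.0.0-rc.2",
        "IncludeTrustContext": True,
        "CloudWatchLogs": {
            "Enabled": True,
            "LogGroup": "verified-access-logs",
        },
    },
)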

The following example log shows information received from the IAM Identity Center identity provider (IdP) and the device provider CrowdStrike.

"data": {
    "context": {
        "crowdstrike": {
            "assessment": {
                "overall": 21,
                "os": 53,
                "sensor_config": 4,
                "version": "3.6.1"
            },
            "cid": "7545bXXXXXXXXXXXXXXX93cf01a19b",
            "exp": 1692046783,
            "iat": 1690837183,
            "jwk_url": "https://assets-public.falcon.crowdstrike.com/zta/jwk.json",
            "platform": "Windows 11",
            "serial_number": "ec2dXXXXb-XXXX-XXXX-XXXX-XXXXXX059f05",
            "sub": "99c185e69XXXXXXXXXX4c34XXXXXX65a",
            "typ": "crowdstrike-zta+jwt"
        },
        "sso": {
            "user": {
                "user_id": "24a80468-XXXX-XXXX-XXXX-6db32c9f68fc",
                "user_name": "XXXX",
                "email": {
                    "address": "[email protected]",
                    "verified": false
                }
            },
            "groups": {
                "04c8d4d8-e0a1-XXXX-383543e07f11": {
                    "group_name": "XXXX"
                }
            }
        },
        "http_request": {
            "hostname": "sales.example.com",
            "http_method": "GET",
            "x_forwarded_for": "52.XX.XX.XXXX",
            "port": 80,
            "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
            "client_ip": "52.XX.XX.XXXX"
        }
    }
}

The following example log shows the user context information received from the OpenID Connect (OIDC) trust provider Okta. You can see the difference in the information provided by the two different trust providers: IAM Identity Center and Okta.

"data": {
    "context": {
        "http_request": {
            "hostname": "sales.example.com",
            "http_method": "GET",
            "x_forwarded_for": "99.X.XX.XXX",
            "port": 80,
            "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Safari/605.1.15",
            "client_ip": "99.X.XX.XXX"
        },
        "okta": {
            "sub": "00uXXXXXXXJNbWyRI5d7",
            "name": "XXXXXX",
            "locale": "en_US",
            "preferred_username": "[email protected]",
            "given_name": "XXXX",
            "family_name": "XXXX",
            "zoneinfo": "America/Los_Angeles",
            "groups": [
                "Everyone",
                "Sales",
                "Finance",
                "HR"
            ],
            "exp": 1690835175,
            "iss": "https://example.okta.com"
        }
    }
}

The following is a sample policy written using the information received from the trust providers.

permit(principal,action,resource)
when {
  context.idcpolicy.groups has "<hr-group-id>" &&
  context.idcpolicy.user.email.address like "*@example.com" &&
  context.idcpolicy.user.email.verified == true &&
  context has "crdstrikepolicy" &&
  context.crdstrikepolicy.assessment.os > 50 &&
  context.crdstrikepolicy.assessment.overall > 15
};

This policy only grants access to users who belong to a particular group, have a verified email address, and have a corporate email domain. Also, users can only access the application from a device with an OS that has an assessment score greater than 50, and has an overall device score greater than 15.

Conclusion

In this post, you learned how to manage Verified Access logging configuration from the Verified Access console and how to use improved logging information to write AWS Verified Access policies. To get started with Verified Access, see the Amazon VPC console.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Ankush Goyal

Ankush Goyal

Ankush is an Enterprise Support Lead in AWS Enterprise Support who helps Enterprise Support customers streamline their cloud operations on AWS. He enjoys working with customers to help them design, implement, and support cloud infrastructure. He is a results-driven IT professional with over 18 years of experience.

Anbu Kumar Krishnamurthy

Anbu Kumar Krishnamurthy

Anbu is a Technical Account Manager who specializes in helping clients integrate their business processes with the AWS Cloud to achieve operational excellence and efficient resource utilization. Anbu helps customers design and implement solutions, troubleshoot issues, and optimize their AWS environments. He works with customers to architect solutions aimed at achieving their desired business outcomes.

Perform Amazon Kinesis load testing with Locust

Post Syndicated from Luis Morales original https://aws.amazon.com/blogs/big-data/perform-amazon-kinesis-load-testing-with-locust/

Building a streaming data solution requires thorough testing at the scale it will operate in a production environment. Streaming applications operating at scale often handle data volumes of up to gigabytes per second, and it’s challenging for developers to easily generate such load when simulating high-traffic Amazon Kinesis-based applications.

Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose are capable of capturing and storing terabytes of data per hour from numerous sources. Creating Kinesis data streams or Firehose delivery streams is straightforward through the AWS Management Console, AWS Command Line Interface (AWS CLI), or Kinesis API. However, generating a continuous stream of test data requires a custom process or script to run continuously. Although the Amazon Kinesis Data Generator (KDG) provides a user-friendly UI for this purpose, it has some limitations, such as bandwidth constraints and increased round trip latency. (For more information on the KDG, refer to Test Your Streaming Data Solution with the New Amazon Kinesis Data Generator.)

To overcome these limitations, this post describes how to use Locust, a modern load testing framework, to conduct large-scale load testing for a more comprehensive evaluation of the streaming data solution.

Overview

This project emits temperature sensor readings via Locust to Kinesis. We set up the Amazon Elastic Compute Cloud (Amazon EC2) Locust instance via the AWS Cloud Development Kit (AWS CDK) to load test Kinesis-based applications. You can access the Locust dashboard to perform and observe the load test and connect via Session Manager, a capability of AWS Systems Manager, for configuration changes. The following diagram illustrates this architecture.

Architecture overview

In our testing with the largest recommended instance (c7g.16xlarge), the setup was capable of emitting over 1 million events per second to Kinesis data streams in on-demand capacity mode, with a batch size (simulated users per Locust user) of 500. You can find more details on what this means and how to configure the load test later in this post.

Locust overview

Locust is an open-source, scriptable, and scalable performance testing tool that allows you to define user behavior using Python code. It offers an easy-to-use interface, making it developer-friendly and highly expandable. With its distributed and scalable design, Locust can simulate millions of simultaneous users to mimic real user behavior during a performance test.

Each Locust user represents a scenario or a specific set of actions that a real user might perform on your system. When you run a performance test with Locust, you can specify the number of concurrent Locust users you want to simulate, and Locust will create an instance for each user, allowing you to assess the performance and behavior of your system under different user loads.

For more information on Locust, refer to the Locust documentation.
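
As a general illustration of these concepts (separate from the Kinesis-specific user class that this project ships), a minimal Locust user might look like the following; the host and path are placeholders.

from locust import HttpUser, task, between

class SensorUser(HttpUser):
    # Placeholder target; normally supplied via the Host field in the Locust UI.
    host = "https://example.com"

    # Each simulated user waits 1-2 seconds between task executions.
    wait_time = between(1, 2)

    @task
    def get_readings(self):
        # A single action that this simulated user performs repeatedly during the test.
        self.client.get("/readings")

Running locust -f against a file containing this class would simulate as many of these users as you configure in the dashboard.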

Prerequisites

To get started, clone or download the code from the GitHub repository.

Test locally

To test Locust locally before deploying it to the cloud, you have to install the necessary Python dependencies. If you’re new to Python, refer to the README for more information on getting started.

Navigate to the load-test directory and run the following code:

pip install -r requirements.txt

To send events to a Kinesis data stream from your local machine, you will need to have AWS credentials. For more information, refer to Configuration and credential file settings.

To perform the test locally, stay in the load-test directory and run the following code:

locust -f locust-load-test.py

You can now access the Locust dashboard via http://0.0.0.0:8089/. Enter the number of Locust users, the spawn rate (users added per second), and the target Amazon Kinesis data stream name for Host. By default, it deploys the Kinesis data stream DemoStream that you can use for testing.

Locust Dashboard - Enter details

To see the generated events logged, run the following command, which filters only Locust and root logs (for example, no Botocore logs):

locust -f locust-load-test.py --loglevel DEBUG 2>&1 | grep -E "(locust|root)"

Set up resources with the AWS CDK

The GitHub repository contains the AWS CDK code to create all the necessary resources for the load test. This removes opportunities for manual error, increases efficiency, and ensures consistent configurations over time. To deploy the resources, complete the following steps:

  1. If not already downloaded, clone the GitHub repository to your local computer using the following command:
git clone https://github.com/aws-samples/amazon-kinesis-load-testing-with-locust
  2. Download and install the latest Node.js.
  3. Navigate to the root folder of the project and run the following command to install the latest version of AWS CDK:
npm install -g aws-cdk
  4. Install the necessary dependencies:
npm install
  5. Run cdk bootstrap to initialize the AWS CDK environment in your AWS account. Replace your AWS account ID and Region before running the following command:
cdk bootstrap

To learn more about the bootstrapping process, refer to Bootstrapping.

  6. After the dependencies are installed, you can run the following command to deploy the stack of the AWS CDK template, which sets up the infrastructure within 5 minutes:
cdk deploy

The template sets up the Locust EC2 test instance, which is by default a c7g.xlarge instance, which at the time of publishing costs approximately $0.145 per hour in us-east-1. To find the most accurate pricing information, see Amazon EC2 On-Demand Pricing. You can find more details on how to change your instance size according to your scale of load testing later in this post.

It’s crucial to consider that the expenses incurred during load testing are not solely attributed to EC2 instance costs, but also heavily influenced by data transfer costs.

Accessing the Locust dashboard

You can access the dashboard by using the AWS CDK output KinesisLocustLoadTestingStack.locustdashboardurl to open the dashboard, for example http://1.2.3.4:8089.

The Locust dashboard is password protected. By default, it’s set to user name locust-user and password locust-dashboard-pwd.

With the default configuration, you can achieve up to 15,000 emitted events per second. Enter the number of Locust users (times the batch size), the spawn rate (users added per second), and the target Kinesis data stream name for Host.

Locust Dashboard - Enter details

After you have started the load test, you can look at the load test on the Charts tab.

Locust Dashboard - Charts

You can also monitor the load test on the Kinesis Data Streams console by navigating to the stream that you are load testing. If you used the default settings, navigate to DemoStream. On the detail page, choose the Monitoring tab to see the ingested load.

Kinesis Data Streams - Monitoring

Adapt workloads

By default, this project generates random temperature sensor readings for every sensor with the following format:

{
    "sensorId": "bfbae19c-2f0f-41c2-952b-5d5bc6e001f1_1",
    "temperature": 147.24,
    "status": "OK",
    "timestamp": 1675686126310
}

The project comes packaged with Faker, which you can use to adapt the payload to your needs. You just have to update the generate_sensor_reading function in the locust-load-test.py file:

class SensorAPIUser(KinesisBotoUser):
    # ...

    def generate_sensor_reading(self, sensor_id, sensor_reading):
        current_temperature = round(10 + random.random() * 170, 2)

        if current_temperature > 160:
            status = "ERROR"
        elif current_temperature > 140 or random.randrange(1, 100) > 80:
            status = random.choice(["WARNING", "ERROR"])
        else:
            status = "OK"

        return {
            'sensorId': f"{sensor_id}_{sensor_reading}",
            'temperature': current_temperature,
            'status': status,
            'timestamp': round(time.time()*1000)
        }

    # ...
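
For example, a possible variation that uses Faker to enrich the payload might look like the following. It is shown as a standalone function for readability (in the project it is a method on the Locust user class), and the extra deviceIp and site fields are illustrative additions, not part of the project's schema.

import random
import time

from faker import Faker

fake = Faker()

def generate_sensor_reading(sensor_id, sensor_reading):
    # Same shape as the project's payload, with two illustrative Faker-generated fields added.
    return {
        'sensorId': f"{sensor_id}_{sensor_reading}",
        'temperature': round(10 + random.random() * 170, 2),
        'status': random.choice(["OK", "WARNING", "ERROR"]),
        'deviceIp': fake.ipv4_private(),  # illustrative extra field
        'site': fake.city(),              # illustrative extra field
        'timestamp': round(time.time() * 1000)
    }

print(generate_sensor_reading("bfbae19c", 1))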

Change configurations

After the initial deployment of the load testing tool, you can change configuration in two ways:

  1. Connect to the EC2 instance, make any configuration and code changes, and restart the Locust process
  2. Change the configuration and load testing code locally and redeploy it via cdk deploy

The first option helps you iterate more quickly on the remote instance without a need to redeploy. The latter uses the infrastructure as code (IaC) approach and makes sure that your configuration changes can be committed to your source control system. For a fast development cycle, it’s recommended to test your load test configuration locally first, connect to your instance to apply the changes, and after successful implementation, codify it as part of your IaC repository and then redeploy.

Locust is created on the EC2 instance as a systemd service and can therefore be controlled with systemctl. If you want to change the configuration of Locust as needed without redeploying the stack, you can connect to the instance via Systems Manager, navigate to the project directory on /usr/local/load-test, change the locust.env file, and restart the service by running sudo systemctl restart locust.

Large-scale load testing


To achieve peak performance with Locust and Kinesis, keep the following in mind:

  • Instance size – Your performance is bound by the underlying EC2 instance, so refer to EC2 instance type for more information about scaling. To set the correct instance size, you can configure the instance size in the file kinesis-locust-load-testing.ts.
  • Number of secondaries – Locust benefits from a distributed setup. Therefore, the setup spins up a primary, which does the coordination, and multiple secondaries, which do the actual work. To fully take advantage of the cores, you should specify one secondary per core. You can configure the number in the locust.env file.
  • Batch size – The number of Kinesis data stream events you can send per Locust user is limited due to the resource overhead of switching Locust users and threads. To overcome this, you can configure a batch size to define how many simulated users are represented by each Locust user; their records are sent in a single Kinesis put_records call (see the sketch following this list). You can configure the number in the locust.env file.
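
The following is a minimal boto3 sketch of what such a batched call looks like, separate from the project's own Locust user implementation. The stream name matches the default DemoStream; the record contents are illustrative.

import json

import boto3

kinesis = boto3.client("kinesis")

# PutRecords accepts at most 500 records per call, which is why a batch size of 500
# maps neatly onto a single request. Record contents below are illustrative.
BATCH_SIZE = 500
records = [
    {
        "Data": json.dumps({"sensorId": f"sensor_{i}", "temperature": 20.0, "status": "OK"}).encode("utf-8"),
        "PartitionKey": f"sensor_{i}",
    }
    for i in range(BATCH_SIZE)
]

response = kinesis.put_records(StreamName="DemoStream", Records=records)
print(f"Failed records: {response['FailedRecordCount']}")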

This setup is capable of emitting over 1 million events per second to the Kinesis data stream, with a batch size of 500 and 64 secondaries on a c7g.16xlarge instance.

Locust Dashboard - Large Scale Load Test Charts

You can observe this on the Monitoring tab for the Kinesis data stream as well.

Kinesis Data Stream - Large Scale Load Test Monitoring

Clean up

In order to not incur any unnecessary costs, delete the stack by running the following code:

cdk destroy

Summary

Kinesis is already popular for its ease of use among users building streaming applications. With this load testing capability using Locust, you can now test your workloads in a more straightforward and faster way. Visit the GitHub repo to embark on your testing journey.

The project is licensed under the Apache 2.0 license, providing the freedom to clone and modify it according to your needs. Furthermore, you can contribute to the project by submitting issues or pull requests via GitHub, fostering collaboration and improvement in the testing ecosystem.


About the author

Luis Morales works as a Senior Solutions Architect with digital native businesses to support them in constantly reinventing themselves in the cloud. He is passionate about software engineering, cloud-native distributed systems, test-driven development, and all things code and security.

Monitor data pipelines in a serverless data lake

Post Syndicated from Virendhar Sivaraman original https://aws.amazon.com/blogs/big-data/monitor-data-pipelines-in-a-serverless-data-lake/

AWS serverless services, including but not limited to AWS Lambda, AWS Glue, AWS Fargate, Amazon EventBridge, Amazon Athena, Amazon Simple Notification Service (Amazon SNS), Amazon Simple Queue Service (Amazon SQS), and Amazon Simple Storage Service (Amazon S3), have become the building blocks for any serverless data lake, providing key mechanisms to ingest and transform data without fixed provisioning and the persistent need to patch the underlying servers. Building a data lake in a serverless paradigm brings significant cost and performance benefits. However, the rapid adoption of serverless data lake architectures—with ever-growing datasets that need to be ingested from a variety of sources, followed by complex data transformation and machine learning (ML) pipelines—can present challenges. Similarly, in a serverless paradigm, application logs in Amazon CloudWatch are sourced from a variety of participating services, and traversing the lineage across logs can also present challenges. To successfully manage a serverless data lake, you require mechanisms to perform the following actions:

  • Reinforce data accuracy with every data ingestion
  • Holistically measure and analyze ETL (extract, transform, and load) performance at the individual processing component level
  • Proactively capture log messages and notify failures as they occur in near-real time

In this post, we will walk you through a solution to efficiently track and analyze ETL jobs in a serverless data lake environment. By monitoring application logs, you can gain insights into job execution and troubleshoot issues promptly to ensure the overall health and reliability of your data pipelines.

Overview of solution

The serverless monitoring solution focuses on achieving the following goals:

  • Capture state changes across all steps and tasks in the data lake
  • Measure service reliability across a data lake
  • Quickly notify operations of failures as they happen

To illustrate the solution, we create a serverless data lake with a monitoring solution. For simplicity, we create a serverless data lake with the following components:

  • Storage layer – Amazon S3 is the natural choice, in this case with the following buckets:
    • Landing – Where raw data is stored
    • Processed – Where transformed data is stored
  • Ingestion layer – For this post, we use Lambda and AWS Glue for data ingestion, with the following resources:
    • Lambda functions – Two Lambda functions that run to simulate a success state and failure state, respectively
    • AWS Glue crawlers – Two AWS Glue crawlers that run to simulate a success state and failure state, respectively
    • AWS Glue jobs – Two AWS Glue jobs that run to simulate a success state and failure state, respectively
  • Reporting layer – An Athena database to persist the tables created via the AWS Glue crawlers and AWS Glue jobs
  • Alerting layer – Slack is used to notify stakeholders

The serverless monitoring solution is devised to be loosely coupled as plug-and-play components that complement an existing data lake. State changes for the Lambda-based ETL tasks are tracked using AWS Lambda Destinations. We have used an SNS topic for routing both success and failure states for the Lambda-based tasks. In the case of AWS Glue-based tasks, we have configured EventBridge rules to capture state changes. These events are also routed to the same SNS topic. For demonstration purposes, this post only provides state monitoring for Lambda and AWS Glue, but you can extend the solution to other AWS services.

The following figure illustrates the architecture of the solution.

The architecture contains the following components:

  • EventBridge rules – EventBridge rules that capture the state change for the ETL tasks—in this case AWS Glue tasks. This can be extended to other supported services as the data lake grows.
  • SNS topic – An SNS topic that serves to catch all state events from the data lake.
  • Lambda function – The Lambda function is the subscriber to the SNS topic. It’s responsible for analyzing the state of the task run to do the following:
    • Persist the status of the task run.
    • Notify any failures to a Slack channel (a simplified sketch of this handler follows the list).
  • Athena database – The database where the monitoring metrics are persisted for analysis.
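
The following is a simplified Python sketch of such a subscriber function, not the repository's actual datalake-monitoring-lambda code. The S3 key layout, event fields, and Slack payload are assumptions for illustration; the webhook URL is read from the datalake-monitoring secret described later in this post.

import json
import os
import urllib.request

import boto3

s3 = boto3.client("s3")
secrets = boto3.client("secretsmanager")

# Placeholder bucket name; the repository derives this from the stack configuration.
MONITOR_BUCKET = os.environ.get("MONITOR_BUCKET", "my-monitor-bucket")

def handler(event, context):
    # Triggered by the SNS topic that receives ETL state-change events.
    for record in event["Records"]:
        message = json.loads(record["Sns"]["Message"])

        # Persist the raw state-change event for later analysis in Athena.
        key = f"monitor/{record['Sns']['MessageId']}.json"
        s3.put_object(Bucket=MONITOR_BUCKET, Key=key, Body=json.dumps(message))

        # Notify Slack only when a task run has failed.
        state = str(message.get("detail", {}).get("state", "")).upper()
        if state in ("FAILED", "ERROR", "TIMEOUT"):
            notify_slack(message)

def notify_slack(message):
    # The webhook URL is stored under the slack_webhook key of the datalake-monitoring secret.
    secret = secrets.get_secret_value(SecretId="datalake-monitoring")
    webhook_url = json.loads(secret["SecretString"])["slack_webhook"]

    payload = {"text": f"Data lake task failed: {json.dumps(message.get('detail', {}))}"}
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)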

Deploy the solution

The source code to implement this solution uses AWS Cloud Development Kit (AWS CDK) and is available on the GitHub repo monitor-serverless-datalake. This AWS CDK stack provisions required network components and the following:

  • Three S3 buckets (the bucket names are prefixed with the AWS account name and Regions, for example, the landing bucket is <aws-account-number>-<aws-region>-landing):
    • Landing
    • Processed
    • Monitor
  • Three Lambda functions:
    • datalake-monitoring-lambda
    • lambda-success
    • lambda-fail
  • Two AWS Glue crawlers:
    • glue-crawler-success
    • glue-crawler-fail
  • Two AWS Glue jobs:
    • glue-job-success
    • glue-job-fail
  • An SNS topic named datalake-monitor-sns
  • Three EventBridge rules:
    • glue-monitor-rule
    • event-rule-lambda-fail
    • event-rule-lambda-success
  • An AWS Secrets Manager secret named datalake-monitoring
  • Athena artifacts:
    • monitor database
    • monitor-table table

You can also follow the instructions in the GitHub repo to deploy the serverless monitoring solution. It takes about 10 minutes to deploy this solution.

Connect to a Slack channel

We still need a Slack channel to which the alerts are delivered. Complete the following steps:

  1. Set up a workflow automation to route messages to the Slack channel using webhooks.
  2. Note the webhook URL.

The following screenshot shows the field names to use.

The following is a sample message for the preceding template.

  1. On the Secrets Manager console, navigate to the datalake-monitoring secret.
  2. Add the webhook URL to the slack_webhook secret.

Load sample data

The next step is to load some sample data. Copy the sample data files to the landing bucket using the following command:

aws s3 cp --recursive s3://awsglue-datasets/examples/us-legislators s3://<AWS_ACCCOUNT>-<AWS_REGION>-landing/legislators

In the next sections, we show how Lambda functions, AWS Glue crawlers, and AWS Glue jobs work for data ingestion.

Test the Lambda functions

On the EventBridge console, enable the rules that trigger the lambda-success and lambda-fail functions every 5 minutes:

  • event-rule-lambda-fail
  • event-rule-lambda-success

After a few minutes, the failure events are relayed to the Slack channel. The following screenshot shows an example message.

Disable the rules after testing to avoid repeated messages.

Test the AWS Glue crawlers

On the AWS Glue console, navigate to the Crawlers page. Here you can start the following crawlers:

  • glue-crawler-success
  • glue-crawler-fail

In a minute, the glue-crawler-fail crawler’s status changes to Failed, which triggers a notification in Slack in near-real time.

Test the AWS Glue jobs

On the AWS Glue console, navigate to the Jobs page, where you can start the following jobs:

  • glue-job-success
  • glue-job-fail

In a few minutes, the glue-job-fail job status changes to Failed, which triggers a notification in Slack in near-real time.

Analyze the monitoring data

The monitoring metrics are persisted in Amazon S3 and can be used for historical analysis.

On the Athena console, navigate to the monitor database and run the following query to find the service that failed the most often:

SELECT service_type, count(*) as "fail_count"
FROM "monitor"."monitor"
WHERE event_type = 'failed'
group by service_type
order by fail_count desc;
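
If you prefer to run the same analysis programmatically rather than from the Athena console, the following boto3 sketch submits the query and prints the result rows; the output location is a placeholder bucket.

import time

import boto3

athena = boto3.client("athena")

QUERY = """
SELECT service_type, count(*) AS fail_count
FROM "monitor"."monitor"
WHERE event_type = 'failed'
GROUP BY service_type
ORDER BY fail_count DESC
"""

# OutputLocation is a placeholder; point it at a bucket you own.
execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "monitor"},
    ResultConfiguration={"OutputLocation": "s3://<monitor-bucket>/athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then print the result rows (header row included).
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])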

Over time, as rich observability data accumulates, time series analysis of the monitoring data will yield interesting findings.

Clean up

The overall cost of the solution is less than one dollar, but to avoid future costs, make sure to clean up the resources created as part of this post.

Summary

The post provided an overview of a serverless data lake monitoring solution that you can configure and deploy to integrate with enterprise serverless data lakes in just a few hours. With this solution, you can monitor a serverless data lake, send alerts in near-real time, and analyze performance metrics for all ETL tasks operating in the data lake. The design was intentionally kept simple to demonstrate the idea; you can further extend this solution with Athena and Amazon QuickSight to generate custom visuals and reporting. Check out the GitHub repo for a sample solution and further customize it for your monitoring needs.


About the Authors

Virendhar (Viru) Sivaraman is a strategic Senior Big Data & Analytics Architect with Amazon Web Services. He is passionate about building scalable big data and analytics solutions in the cloud. Besides work, he enjoys spending time with family, hiking & mountain biking.

Vivek Shrivastava is a Principal Data Architect, Data Lake in AWS Professional Services. He is a big data enthusiast and holds 14 AWS Certifications. He is passionate about helping customers build scalable and high-performance data analytics solutions in the cloud. In his spare time, he loves reading and finds areas for home automation.

Configure SAML federation for Amazon OpenSearch Serverless with Okta

Post Syndicated from Aish Gunasekar original https://aws.amazon.com/blogs/big-data/configure-saml-federation-for-amazon-opensearch-serverless-with-okta/

Modern applications apply security controls across many systems and their subsystems. Keeping all of these systems in sync would be a major undertaking if you tried to implement it separately. Centralized identity management is the way to maintain a single identity provider (IdP) that can authenticate actors and manage and distribute their rights.

OpenSearch is an open-source search and analytics suite that enables you to ingest, store, analyze, and visualize full text and log data. Amazon OpenSearch Serverless makes it simple to deploy, scale, and operate OpenSearch in the AWS Cloud, freeing you from the undifferentiated heavy lifting of sizing, scaling, and operating an OpenSearch cluster. When you use OpenSearch Serverless, you can integrate with your existing Security Assertion Markup Language 2.0 (SAML)-compliant IdP to provide granular access control for your OpenSearch Serverless collections. Our customers use a variety of IdPs, including AWS IAM Identity Center (successor to AWS SSO), Okta, Keycloak, Active Directory Federation Services (AD FS), and Auth0.

In this post, you will learn how to use Okta as your IdP and integrate it with OpenSearch Serverless to securely manage your users and groups for secure access to your data.

Solution overview

The flow of access requests is depicted in the following figure.

When you navigate to OpenSearch Dashboards, the workflow steps are as follows:

  1. OpenSearch Serverless generates a SAML authentication request.
  2. OpenSearch Serverless redirects your request back to the browser.
  3. The browser redirects to the Okta URL via the Okta application setup.
  4. Okta parses the SAML request, authenticates the user, and generates a SAML response.
  5. Okta returns the encoded SAML response to the browser.
  6. The browser sends the SAML response back to the OpenSearch Serverless Assertion Consumer Services (ACS) URL.
  7. ACS verifies the SAML response and logs in the user with the permissions defined in the data access policy.

Prerequisites

Complete the following prerequisite steps:

  1. Create an OpenSearch Serverless collection. For instructions, refer to Preview: Amazon OpenSearch Serverless – Run Search and Analytics Workloads without Managing Clusters.
  2. Make a note of your AWS account ID to use while configuring your application in Okta.
  3. Create an Okta account, which you will use as an IdP.
  4. Create users and a group in Okta:
    1. Log in to your Okta account, and in the navigation pane, choose Directory, then choose Groups.
    2. Choose Add Group and name it opensearch-serverless, then choose Save.
    3. Choose Assign People to add users.
    4. You can add users to the opensearch-serverless group by choosing the plus sign next to the user name, or you can choose Add All.
    5. Add your users, then choose Save.
    6. To create new users, choose People in the navigation pane under Directory, then choose Add Person.
    7. Provide your first name, last name, user name (email ID), and primary email address.
    8. For Password, choose Set by admin and First-time password.
    9. To create your user, choose Save.
    10. In the navigation pane, choose Groups, then choose the opensearch-serverless group you created earlier.

The following graphic gives a quick demonstration of setting up a user and group.

Configure an application in Okta

To configure an application in Okta, complete the following steps:

  1. Navigate to the Applications page on the Okta console.
  2. Choose App Integration, select SAML 2.0 web application, then choose Next.
  3. For Name, enter a name for the app (for example, myweblogs), then choose Next.
  4. Under Application ACS URL, enter the URL using the format https://collection.<REGION>.aoss.amazonaws.com/_saml/acs (replace <REGION> with the corresponding Region) to generate the IdP metadata.
  5. Select Use this for Recipient URL and Destination URL to use the same ACS URL as the recipient and destination.
  6. Specify aws:opensearch:<AWS-Account-ID> under Audience URI (SP Entity ID). This specifies who the assertion is intended for within the SAML assertion.
  7. Under Group Attribute Statements, enter a name that is relevant to your application, such as mygroup, and select unspecified as the name format. (Don’t forget this name, you’ll need it later.)
  8. Select equals as the filter and enter opensearch-serverless.
  9. Select I’m a software vendor. I’d like to integrate my app with Okta and choose Finish.
  10. After an app is created, choose the sign-on tab, scroll down to the metadata details, and copy the value for Metadata URL.

The following graphic gives a quick demonstration of setting up an application in Okta via the preceding steps.

Next, you associate the users and groups to the application that you created in the previous step.

  1. On the Applications page, choose the app you created earlier.
  2. On the Assignments tab, choose Assign.
  3. Select Assign To Groups and choose the group you wish to assign to (opensearch-serverless in this case).
  4. Choose Done.

The following graphic gives a quick demonstration of assigning groups to the application via the preceding steps.

Set up SAML on OpenSearch Serverless

In this section, you create a SAML provider that you’ll use for your OpenSearch Serverless collection. Complete the following steps:

  1. Open the OpenSearch Serverless console on a new tab.
  2. In the navigation pane, under Serverless, choose SAML authentication.
  3. Select Add SAML provider.
  4. Provide a recognizable name (for example, okta) and a description.
  5. Open a new tab and enter the copied metadata URL into your browser.

You should see the metadata for the Okta application.

  1. Take note of this metadata and copy it to your clipboard.
  2. On the OpenSearch Service console tab, enter this metadata in the Provide metadata from your IdP section.
  3. Under Additional settings, enter mygroup or the group attribute provided in the Okta configuration.
  4. Choose Create a SAML provider.

The SAML provider has now been created.

The following graphic gives a quick demonstration of setting up the SAML provider in OpenSearch Serverless via the preceding steps.
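
If you want to script this step instead of using the console, the SAML provider can also be created with the AWS SDK for Python (boto3). This is a sketch only; verify the samlOptions field names against the OpenSearch Serverless CreateSecurityConfig API reference.

import boto3

aoss = boto3.client("opensearchserverless")

# Metadata XML previously copied from the Okta application's metadata URL.
with open("okta-metadata.xml") as f:
    idp_metadata = f.read()

response = aoss.create_security_config(
    name="okta",
    type="saml",
    description="Okta SAML provider for OpenSearch Serverless",
    samlOptions={
        "metadata": idp_metadata,
        "groupAttribute": "mygroup",  # the group attribute name configured in Okta
        "sessionTimeout": 60,         # minutes
    },
)
print(response)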

Update the data access policy

You need to configure the right permissions in the data access policies associated with your OpenSearch collection so your Okta group members can access the OpenSearch Dashboards endpoint.

  1. On the OpenSearch Serverless console, open your collection.
  2. Choose the data access policy associated with the collection in the Data Access section.
  3. Choose Edit.
  4. Choose Principals and Add a SAML principal.
  5. Select the SAML provider you created earlier and enter group/opensearch-serverless next to it.
  6. The OpenSearch Dashboards endpoint can be accessed by all group members. You can grant access to collections, indexes, or both.
  7. Choose Save.

Log in to OpenSearch Dashboards

Now that you have set permissions to access the dashboards, choose the Dashboards URL under the general information for the OpenSearch Serverless collection. This should take you to the website
https://collection-endpoint/_dashboards/

You will see a list with all the access options. Choose the SAML provider that you created (okta in this case) and log in using your Okta credentials. You will now be logged into OpenSearch Dashboards with the permissions that are part of the data access policy. You can perform searches or create visualizations from the dashboard.

Clean up

To avoid unwanted charges, delete the OpenSearch Serverless collection, data access policy, and SAML provider created as part of this demonstration.

Summary

In this post, you learned how to set up Okta as an IdP to access OpenSearch Dashboards using SAML. You also learned how to set up users and groups within Okta and configure their access to OpenSearch Dashboards. For more details, refer to SAML authentication for Amazon OpenSearch Serverless.

You can also refer to the Getting started with Amazon OpenSearch Serverless workshop to know more about OpenSearch Serverless.

If you have feedback about this post, submit it in the comments section. If you have questions about this post, start a new thread on the OpenSearch Service forum or contact AWS Support.


About the Authors

Aish Gunasekar is a Specialist Solutions architect with a focus on Amazon OpenSearch Service. Her passion at AWS is to help customers design highly scalable architectures and help them in their cloud adoption journey. Outside of work, she enjoys hiking and baking.

Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.

Developing with Java and Spring Boot using Amazon CodeWhisperer

Post Syndicated from Rajdeep Banerjee original https://aws.amazon.com/blogs/devops/developing-with-java-and-spring-boot-using-amazon-codewhisperer/

Developers often have to work with multiple programming languages depending on the task at hand. Sometimes, this is a result of choosing the right tool for a specific problem, or it is mandated by adhering to a specific technology adopted by a team. Within a specific programming language, developers may have to work with frameworks, software libraries, and popular cloud services from providers such as Amazon Web Services (AWS). This must be done while adhering to secure programming best practices. Despite these challenges, developers must continue to release code at a sufficiently high velocity.

Amazon CodeWhisperer is a real-time, AI coding companion that provides code suggestions in your IDE code editor. Developers can simply write a comment that outlines a specific task in plain English, such as “method to upload a file to S3.” Based on this, CodeWhisperer automatically determines which cloud services and public libraries are best suited to accomplish the task and recommends multiple code snippets directly in the IDE. The code is generated based on the context of your file, such as comments as well as surrounding source code and import statements. CodeWhisperer is available as part of the AWS Toolkit for Visual Studio Code and the JetBrains family of IDEs. CodeWhisperer is also available for AWS Cloud9, the AWS Lambda console, JupyterLab, Amazon SageMaker Studio, and AWS Glue Studio. CodeWhisperer supports popular programming languages like Java, Python, C#, TypeScript, Go, JavaScript, Rust, PHP, Kotlin, C, C++, shell scripting, SQL, and Scala.

In this post, we will explore how to leverage CodeWhisperer in Java applications, specifically using the Spring Boot framework. Spring Boot is an extension of the Spring framework that makes it easier to develop Java applications and microservices. Using CodeWhisperer, you will spend less time creating boilerplate and repetitive code and more time focusing on business logic. You can generate entire Java Spring Boot functions and logical code blocks without having to search for code snippets from the web and customize them according to your requirements. CodeWhisperer enables you to responsibly use AI to create syntactically correct and secure Java Spring Boot applications. To enable CodeWhisperer in your IDE, please see Setting up CodeWhisperer for VS Code or Setting up Amazon CodeWhisperer for JetBrains, depending on which IDE you are using.

Note: CodeWhisperer uses artificial intelligence to provide code recommendations, and this is non-deterministic. The code you get from Amazon CodeWhisperer might differ from what is shown here.

Creating Data Transfer Objects (DTO)

Amazon CodeWhisperer makes it easier to develop the classes as you include import statements and provide brief comments on the purpose of the class.  Let’s start with the basics and develop a simple DTO or Plain Old Java Object (POJO).  This class will contain properties representing a product.  This DTO will be referenced later as part of a REST controller we generate to serialize the output to JSON.  CodeWhisperer will create a DTO class by using the class name and comments provided in plain language. Detailed and contextual comments will enable CodeWhisperer to generate code suggestions ranging from snippets to full functions in real time. For this use case, you are going to create a product class with id, name, price, description and rating properties.

Type the following or a similar comment in the class:

package com.amazonws.demo.cart.dto;

//create a Product class with id, name, price, description and rating properties. 

Quickly develop a Java class using Amazon Codewhisperer

After entering the comment and pressing ENTER, CodeWhisperer will start providing code suggestions. You can use the Tab key to accept a suggestion based on the context or use the left/right arrow keys to see more suggestions. As shown below, the product class is auto-generated with five properties (id, name, price, rating, and description), default getter/setter methods, and two constructors. If you need more properties, you can either update the comment to include them or manually add them in the file:

package com.amazonws.demo.cart.dto;

//create a Product class with id, name, price, description and rating properties. 

public class Product {
    private String id;
    private String name;
    private Double price;
    private String description;
    private Integer rating;
    
    public Product() {
    }
  
    public Product(String id, String name, Double price) {
      this.id = id;
      this.name = name;
      setPrice(this.price = price);
    }
  
    public String getId() {
      return id;
    }
  
    public void setId(String id) {
      this.id = id;
    }
  
    public String getName() {
      return name;
    }
  
    public void setName(String name) {
      this.name = name;
    }
  
    public Double getPrice() {
      return price;
    }
  
    public void setPrice(Double price) {
      this.price = price;
    }
    
    public String getDescription(){
      return description;
    }

    public void setDescription(String description){
      this.description = description;
    }

    public Integer getRating(){
      return rating;
    }
    
    public void setRating(Integer rating){
      this.rating = rating;
    }
}

Implementing the Data Access Object (DAO) pattern

Next, we implement the DAO pattern, in this case for Amazon DynamoDB. The DAO pattern allows you to decouple the application and business layers from the persistence layer. It contains all the implementation logic for interacting with the persistence layer. We will create an entity class that represents the data to persist in DynamoDB, along with a DAO class that contains the persistence logic.

First, create a ProductDaoEntity class that maps to the Amazon DynamoDB table. Create a blank ProductDaoEntity class and import the DynamoDB packages for annotations, attributes, and partition key, as shown below. Notice that the class has a comment about the class structure and the use of the DynamoDB enhanced client, so that CodeWhisperer can provide meaningful suggestions. The enhanced client allows you to map client-side classes to DynamoDB tables.

package com.amazonws.demo.cart.dao.entity;

import software.amazon.awssdk.enhanced.dynamodb.mapper.annotations.DynamoDbAttribute;
import software.amazon.awssdk.enhanced.dynamodb.mapper.annotations.DynamoDbBean;
import software.amazon.awssdk.enhanced.dynamodb.mapper.annotations.DynamoDbPartitionKey;
import software.amazon.awssdk.enhanced.dynamodb.mapper.annotations.DynamoDbSortKey;

//create a dynamo db ProductDaoEntity class with partition Key as id, name, price, description, rating attributes using dynamo db enhanced mapper annotations

CodeWhisperer can now infer from the comment, the context, and the import statements, and start to generate the class implementation. You can accept or reject suggestions based on your requirements. Below, you can see the complete class generated by CodeWhisperer.

@DynamoDbBean
public class ProductDaoEntity {

    private String id;
    private String name;
    private double price;
    private String description;
    private int rating;
    
    public ProductDaoEntity() {
    }

    public ProductDaoEntity(String id, String name, double price, String description, int rating) {
        this.id = id;
        this.name = name;
        this.price = price;
        this.description = description;
        this.rating = rating;
    }

    @DynamoDbPartitionKey
    @DynamoDbAttribute("id")
    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    @DynamoDbSortKey
    @DynamoDbAttribute("name")
    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    @DynamoDbAttribute("price")
    public double getPrice() {
        return price;
    }

    public void setPrice(double price) {
        this.price = price;
    }

    @DynamoDbAttribute("description")
    public String getDescription() {
        return description;
    }

    public void setDescription(String description) {
        this.description = description;
    }

    @DynamoDbAttribute("rating")
    public int getRating() {
        return rating;
    }

    public void setRating(int rating) {
        this.rating = rating;
    }
    
    @Override
    public String toString() {
        return "ProductDaoEntity [id=" + id + ", name=" + name + ", price=" + price + ", description=" + description
                + ", rating=" + rating + "]";
    }

}

Notice how CodeWhisperer includes the appropriate DynamoDB-related annotations such as @DynamoDbBean, @DynamoDbPartitionKey, @DynamoDbSortKey, and @DynamoDbAttribute. These annotations are used to generate a TableSchema for mapping classes to tables.

Now that you have the mapper methods completed, you can create the actual persistence logic that is specific to DynamoDB. Create a class named ProductDaoImpl that uses a DynamoDbEnhancedClient object, as shown below. (Note: it’s a best practice for a DAO implementation class to implement a DAO interface; we left that out for brevity.) Using the import statements and comments, CodeWhisperer can auto-generate most of the DynamoDB persistence logic for you.

package com.amazonws.demo.cart.dao;

import javax.annotation.PostConstruct;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import com.amazonws.demo.cart.dao.Mapper.ProductMapper;
import com.amazonws.demo.cart.dao.entity.ProductDaoEntity;
import com.amazonws.demo.cart.dto.Product;

import software.amazon.awssdk.core.internal.waiters.ResponseOrException;
import software.amazon.awssdk.enhanced.dynamodb.DynamoDbEnhancedClient;
import software.amazon.awssdk.enhanced.dynamodb.DynamoDbTable;
import software.amazon.awssdk.enhanced.dynamodb.Key;
import software.amazon.awssdk.enhanced.dynamodb.TableSchema;


@Component
public class ProductDaoImpl{
    private static final Logger logger = LoggerFactory.getLogger(ProductDaoImpl.class);
    private static final String PRODUCT_TABLE_NAME = "Products";
    private final DynamoDbEnhancedClient enhancedClient;

    @Autowired
    public ProductDaoImpl(DynamoDbEnhancedClient enhancedClient){
        this.enhancedClient = enhancedClient;

    }

Rather than providing comments that describe the functionality of the entire class, you can provide comments for each specific method here. You will use CodeWhisperer to generate the implementation details for interacting with DynamoDB. If the Products table doesn’t already exist, you will need to create it. Based on the comment, CodeWhisperer will generate a method to create a Products table if one does not exist. As you can see, you don’t have to memorize or search through the DynamoDB API documentation to implement this logic. CodeWhisperer will save you time and effort by giving contextualized suggestions.

//Create the DynamoDB table through enhancedClient object from ProductDaoEntity. If the table already exists, log the error.
    @PostConstruct
    public void createTable() {
        try {
            DynamoDbTable<ProductDaoEntity> productTable = enhancedClient.table(PRODUCT_TABLE_NAME, TableSchema.fromBean(ProductDaoEntity.class));
            productTable.createTable();
        } catch (Exception e) {
            logger.error("Error creating table: ", e);
        }
    }

Now, you can create the CRUD operations for the Product object. You can start with the createProduct operation to insert a new product entity to the DynamoDB table. Provide a comment about the purpose of the method along with relevant implementation details.

    // Create the createProduct() method 
    // Insert the ProductDaoEntity object into the DynamoDB table
    // Return the Product object

CodeWhisperer will start auto-generating the Create operation as shown below. You can accept or reject the suggestions as needed, or select from alternate suggestions, if available, using the left/right arrow keys.

   // Create the createProduct() method
   // Insert the ProductDaoEntity object into the DynamoDB table
   // Return the Product object
    public ProductDaoEntity createProduct(ProductDaoEntity productDaoEntity) {
        DynamoDbTable<ProductDaoEntity> productTable = enhancedClient.table(PRODUCT_TABLE_NAME, TableSchema.fromBean(ProductDaoEntity.class));
        productTable.putItem(productDaoEntity);
        return productDaoEntity;
    }  

Similarly, you can generate a method to return a specific product by id. Provide a contextual comment, as shown below.

// Get a particular ProductDaoEntity object from the DynamoDB table using the
 // product id and return the Product object

Below is the auto-generated code. CodeWhisperer has correctly analyzed the comments and generated the method to get a Product by its id.

    //Get a particular ProductDaoEntity object from the DynamoDB table using the
    // product id and return the Product object
    
    public ProductDaoEntity getProduct(String productId) {
        DynamoDbTable<ProductDaoEntity> productTable = enhancedClient.table(PRODUCT_TABLE_NAME, TableSchema.fromBean(ProductDaoEntity.class));
        ProductDaoEntity productDaoEntity = productTable.getItem(Key.builder().partitionValue(productId).build());
        return productDaoEntity;
    }

Similarly, you can implement the DAO layer to update and delete products in the DynamoDB table.
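
For reference, a minimal sketch of what those methods might look like in ProductDaoImpl is shown below, following the same enhanced client pattern as the generated code above. The method names and signatures here are assumptions for illustration; the suggestions you receive from CodeWhisperer may differ.

    // Update an existing ProductDaoEntity item in the DynamoDB table and return the updated entity
    public ProductDaoEntity updateProduct(ProductDaoEntity productDaoEntity) {
        DynamoDbTable<ProductDaoEntity> productTable = enhancedClient.table(PRODUCT_TABLE_NAME, TableSchema.fromBean(ProductDaoEntity.class));
        return productTable.updateItem(productDaoEntity);
    }

    // Delete a ProductDaoEntity item from the DynamoDB table using the product id, mirroring the key usage in getProduct()
    public void deleteProduct(String productId) {
        DynamoDbTable<ProductDaoEntity> productTable = enhancedClient.table(PRODUCT_TABLE_NAME, TableSchema.fromBean(ProductDaoEntity.class));
        productTable.deleteItem(Key.builder().partitionValue(productId).build());
    }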

Creating a Service Object

Next, you will generate the ProductService class, which retrieves products using the ProductDao. In Spring Boot, annotating a class with @Service allows it to be detected through classpath scanning.

Let’s provide a comment to generate the ProductService class:

package com.amazonws.demo.cart.service;

import java.util.List;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import com.amazonws.demo.cart.dto.Product;
import com.amazonws.demo.cart.dao.ProductDao;

//Create a class called ProductService with methods: getProductById(string id),
//getAllProducts(), updateProduct(Product product), 
//deleteProduct(string id), createProduct(Product product)

CodeWhisperer will create the following class implementation.  Note, you may have to adjust return types or method parameter types as needed.  Notice the @Service annotation for this class along with the productDao property being @Autowired.

@Service
public class ProductService {

   @Autowired
   ProductDao productDao;

   public Product getProductById(String id) {
      return productDao.getProductById(id);
   }

   public List<Product> getAllProducts() {
      return productDao.getAllProducts();
   }

   public Product updateProduct(Product product) {
      return productDao.updateProduct(product);
   }

   public void deleteProduct(String id) {
      productDao.deleteProduct(id);
   }

   public Product createProduct(Product product) {
      return productDao.createProduct(product);
   }

}

Creating a REST Controller

The REST controller handles incoming client HTTP requests, and its output is typically serialized into JSON or XML format. Using annotations, Spring Boot maps HTTP methods such as GET, PUT, POST, and DELETE to the appropriate methods within the controller. It also binds HTTP request data to parameters defined within the controller methods.

Provide a comment as shown below specifying that the class is a REST controller that should support CORS along with the required methods.

package com.amazonws.demo.product.controller;

import java.util.List;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.CrossOrigin;
import org.springframework.web.bind.annotation.DeleteMapping;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.PutMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

import com.amazonws.demo.product.dto.Product;
import com.amazonws.demo.product.service.ProductService;

//create a RestController called ProductController to get all
//products, get a product by id, create a product, update a product,
//and delete a product. support cross origin requests from all origins.
 

Notice how the appropriate annotations are added to support CORS along with the mapping annotations that correspond with the GET, PUT, POST and DELETE HTTP methods. The @RestController annotation is used to specify that this controller returns an object serialized as XML or JSON rather than a view.

@RestController
@RequestMapping("/product")
@CrossOrigin(origins = "*")
public class ProductController {

    @Autowired
    private ProductService productService;
    
    @GetMapping("/getAllProducts")
    public List<Product> getAllProducts() {
        return productService.getAllProducts();
    }

    @GetMapping("/getProductById/{id}")
    public Product getProductById(@PathVariable String id) {
        return productService.getProductById(id);
    }

    @PostMapping("/createProduct")
    public Product createProduct(@RequestBody Product product) {
        return productService.createProduct(product);
    }

    @PutMapping("/updateProduct")
    public Product updateProduct(@RequestBody Product product) {
        return productService.updateProduct(product);
    }

    @DeleteMapping("/deleteProduct/{id}")
    public void deleteProduct(@PathVariable String id) {
        productService.deleteProduct(id);
    }

}
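
One piece that the generated code above depends on, but that is not shown in this post, is the DynamoDbEnhancedClient bean that ProductDaoImpl autowires. A minimal configuration sketch that provides this bean is shown below; the class name and Region are assumptions, so adjust them for your environment.

package com.amazonws.demo.cart.config;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

import software.amazon.awssdk.enhanced.dynamodb.DynamoDbEnhancedClient;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;

@Configuration
public class DynamoDbConfig {

    // Standard DynamoDB client; credentials are resolved from the default provider chain
    @Bean
    public DynamoDbClient dynamoDbClient() {
        return DynamoDbClient.builder()
                .region(Region.US_EAST_1) // assumption: use the Region where your table lives
                .build();
    }

    // Enhanced client that wraps the standard client and is injected into ProductDaoImpl
    @Bean
    public DynamoDbEnhancedClient dynamoDbEnhancedClient(DynamoDbClient dynamoDbClient) {
        return DynamoDbEnhancedClient.builder()
                .dynamoDbClient(dynamoDbClient)
                .build();
    }
}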

Conclusion

In this post, you have used CodeWhisperer to generate DTOs, controllers, service objects, and persistence classes. By inferring intent from your natural-language comments, CodeWhisperer provides contextual code snippets to accelerate your development. In addition, CodeWhisperer has features like the reference tracker, which detects whether a code suggestion might resemble open-source training data and can flag such suggestions with the open-source project’s repository URL, file reference, and license information for your review before you decide whether to incorporate the suggested code.

Try out Amazon CodeWhisperer today to get a head start on your coding projects.

Rajdeep Banerjee

Rajdeep Banerjee is a Senior Partner Solutions Architect at AWS helping strategic partners and clients in the AWS cloud migration and digital transformation journey. Rajdeep focuses on working with partners to provide technical guidance on AWS, collaborate with them to understand their technical requirements, and design solutions to meet their specific needs. He is a member of the Serverless technical field community. Rajdeep is based out of Richmond, Virginia.

Jason Varghese

Jason is a Senior Solutions Architect at AWS guiding enterprise customers on their cloud migration and modernization journeys. He has served in multiple engineering leadership roles and has over 20 years of experience architecting, designing and building scalable software solutions. Jason holds a bachelor’s degree in computer engineering from the University of Oklahoma and an MBA from the University of Central Oklahoma.

Configure fine-grained access to your resources shared using AWS Resource Access Manager

Post Syndicated from Fabian Labat original https://aws.amazon.com/blogs/security/configure-fine-grained-access-to-your-resources-shared-using-aws-resource-access-manager/

You can use AWS Resource Access Manager (AWS RAM) to securely, simply, and consistently share supported resource types within your organization or organizational units (OUs) and across AWS accounts. This means you can provision your resources once and use AWS RAM to share them with accounts. With AWS RAM, the accounts that receive the shared resources can list those resources alongside the resources they own.

When you share your resources by using AWS RAM, you can specify the actions that an account can perform and the access conditions on the shared resource. AWS RAM provides AWS managed permissions, which are created and maintained by AWS and which grant permissions for common customer scenarios. Now, you can further tailor resource access by authoring and applying fine-grained customer managed permissions in AWS RAM. A customer managed permission is a managed permission that you create to precisely specify who can do what under which conditions for the resource types included in your resource share.

This blog post walks you through how to use customer managed permissions to tailor your resource access to meet your business and security needs. Customer managed permissions help you follow the best practice of least privilege for your resources that are shared using AWS RAM.

Considerations

Before you start, review the considerations for using customer managed permissions for supported resource types in the AWS RAM User Guide.

Solution overview

Many AWS customers share infrastructure services to accounts in an organization from a centralized infrastructure OU. The networking account in the infrastructure OU follows the best practice of least privilege and grants only the permissions that accounts receiving these resources, such as development accounts, require to perform a specific task. The solution in this post demonstrates how you can share an Amazon Virtual Private Cloud (Amazon VPC) IP Address Manager (IPAM) pool with the accounts in a Development OU. IPAM makes it simpler for you to plan, track, and monitor IP addresses for your AWS workloads.

You’ll use a networking account that owns an IPAM pool to share the pool with the accounts in a Development OU. You’ll do this by creating a resource share and a customer managed permission through AWS RAM. In this example, shown in Figure 1, both the networking account and the Development OU are in the same organization. The accounts in the Development OU only need the permissions that are required to allocate a classless inter-domain routing (CIDR) range and not to view the IPAM pool details. You’ll further refine access to the shared IPAM pool so that only AWS Identity and Access Management (IAM) users or roles tagged with team = networking can perform actions on the IPAM pool that’s shared using AWS RAM.

Figure 1: Multi-account diagram for sharing your IPAM pool from a networking account in the Infrastructure OU to accounts in the Development OU

Prerequisites

For this walkthrough, you must have the following prerequisites:

  • An AWS account (the networking account) with an IPAM pool already provisioned. For this example, create an IPAM pool in a networking account named ipam-vpc-pool-use1-dev. Because you share resources across accounts in the same AWS Region using AWS RAM, provision the IPAM pool in the same Region where your development accounts will access the pool.
  • An AWS OU with the associated development accounts to share the IPAM pool with. In this example, these accounts are in your Development OU.
  • An IAM role or user with permissions to perform IPAM and AWS RAM operations in the networking account and the development accounts.

Share your IPAM pool with your Development OU with least privilege permissions

In this section, you share an IPAM pool from your networking account to the accounts in your Development OU and grant least-privilege permissions. To do that, you create a resource share that contains your IPAM pool, your customer managed permission for the IPAM pool, and the OU principal you want to share the IPAM pool with. A resource share contains resources you want to share, the principals you want to share the resources with, and the managed permissions that grant resource access to the account receiving the resources. You can add the IPAM pool to an existing resource share, or you can create a new resource share. Depending on your workflow, you can start creating a resource share either in the Amazon VPC IPAM or in the AWS RAM console.

To initiate a new resource share from the Amazon VPC IPAM console

  1. Sign in to the AWS Management Console as your networking account. For Features, select Amazon VPC IP Address Manager console.
  2. Select ipam-vpc-pool-use1-dev, which was provisioned as part of the prerequisites.
  3. On the IPAM pool detail page, choose the Resource sharing tab.
  4. Choose Create resource share.
     
Figure 2: Create resource share to share your IPAM pool

Alternatively, you can initiate a new resource share from the AWS RAM console.

To initiate a new resource share from the AWS RAM console

  1. Sign in to the AWS Management Console as your networking account. For Services, select Resource Access Manager console.
  2. Choose Create resource share.

Next, specify the resource share details, including the name, the resource type, and the specific resource you want to share. Note that the steps of the resource share creation process are located on the left side of the AWS RAM console.

To specify the resource share details

  1. For Name, enter ipam-shared-dev-pool.
  2. For Select resource type, choose IPAM pools.
  3. For Resources, select the Amazon Resource Name (ARN) of the IPAM pool you want to share from a list of the IPAM pool ARNs you own.
  4. Choose Next.
     
Figure 3: Specify the resources to share in your resource share

Configure customer managed permissions

In this example, the accounts in the Development OU need the permissions required to allocate a CIDR range, but not the permissions to view the IPAM pool details. The existing AWS managed permission grants both read and write permissions. Therefore, you need to create a customer managed permission to refine the resource access permissions for your accounts in the Development OU. With a customer managed permission, you can select and tailor the actions that the development accounts can perform on the IPAM pool, such as write-only actions.

In this section, you create a customer managed permission, configure the managed permission name, select the resource type, and choose the actions that are allowed with the shared resource.

To create and author a customer managed permission

  1. On the Associate managed permissions page, choose Create customer managed permission. This will bring up a new browser tab with a Create a customer managed permission page.
  2. On the Create a customer managed permission page, enter my-ipam-cmp for the Customer managed permission name.
  3. Confirm the Resource type as ec2:IpamPool.
  4. On the Visual editor tab of the Policy template section, select the Write checkbox only. This will automatically check all the available write actions.
  5. Choose Create customer managed permission.
     
Figure 4: Create a customer managed permission with only write actions

Now that you’ve created your customer managed permission, you must associate it to your resource share.

To associate your customer managed permission

  1. Go back to the previous Associate managed permissions page. This is most likely located in a separate browser tab.
  2. Choose the refresh icon.
  3. Select my-ipam-cmp from the dropdown menu.
  4. Review the policy template, and then choose Next.

Next, select the IAM roles, IAM users, AWS accounts, AWS OUs, or organization you want to share your IPAM pool with. In this example, you share the IPAM pool with an OU in your account.

To grant access to principals

  1. On the Grant access to principals page, select Allow sharing only with your organization.
  2. For Select principal type, choose Organizational unit (OU).
  3. Enter the Development OU’s ID.
  4. Select Add, and then choose Next.
  5. Choose Create resource share to complete creation of your resource share.
     
Figure 5: Grant access to principals in your resource share

Verify the customer managed permissions

Now let’s verify that the customer managed permission is working as expected. In this section, you verify that the development account cannot view the details of the IPAM pool and that you can use that same account to create a VPC with the IPAM pool.

To verify that an account in your Development OU can’t view the IPAM pool details

  1. Sign in to the AWS Management Console as an account in your Development OU. For Features, select Amazon VPC IP Address Manager console.
  2. In the left navigation pane, choose Pools.
  3. Select ipam-shared-dev-pool. You won’t be able to view the IPAM pool details.

To verify that an account in your Development OU can create a new VPC with the IPAM pool

  1. Sign in to the AWS Management Console as an account in your Development OU. For Services, select VPC console.
  2. On the VPC dashboard, choose Create VPC.
  3. On the Create VPC page, select VPC only.
  4. For name, enter my-dev-vpc.
  5. Select IPAM-allocated IPv4 CIDR block.
  6. Choose the ARN of the IPAM pool that’s shared with your development account.
  7. For Netmask, select /24 (256 IPs).
  8. Choose Create VPC. You’ve successfully created a VPC with the IPAM pool shared with your account in your Development OU.
     
Figure 6: Create a VPC
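
The console steps above can also be performed programmatically from the development account. The following is a minimal sketch using the AWS SDK for Java 2.x; the pool ID is a placeholder, and the builder fields mirror the Ipv4IpamPoolId and Ipv4NetmaskLength parameters of the EC2 CreateVpc API.

import software.amazon.awssdk.services.ec2.Ec2Client;
import software.amazon.awssdk.services.ec2.model.CreateVpcRequest;
import software.amazon.awssdk.services.ec2.model.CreateVpcResponse;

public class CreateVpcFromSharedIpamPool {
    public static void main(String[] args) {
        // Credentials and Region are resolved from the development account's environment
        try (Ec2Client ec2 = Ec2Client.create()) {
            CreateVpcRequest request = CreateVpcRequest.builder()
                    .ipv4IpamPoolId("ipam-pool-0123456789abcdef0") // placeholder: ID of the shared IPAM pool
                    .ipv4NetmaskLength(24)                         // /24, matching the console example
                    .build();

            CreateVpcResponse response = ec2.createVpc(request);
            System.out.println("Created VPC: " + response.vpc().vpcId());
        }
    }
}

If the calling IAM role or user is not allowed by the customer managed permission (for example, after the tag condition is added in the next section), this call fails with an authorization error instead of creating the VPC.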

Update customer managed permissions

You can create a new version of your customer managed permission to rescope and update the access granularity of your resources that are shared using AWS RAM. For example, you can add a condition in your customer managed permission so that only IAM users or roles tagged with a particular principal tag can access and perform actions on resources shared using AWS RAM. If you need to update your customer managed permission, for example after testing or as your business and security needs evolve, you can create and save a new version of the same customer managed permission rather than creating an entirely new customer managed permission. For example, you might want to adjust your access configuration to read-only actions for your development accounts and rescope to read-write actions for your testing accounts. The new version of the permission won't apply automatically to your existing resource shares; you must explicitly apply it to those shares for it to take effect.

To create a version of your customer managed permission

  1. Sign in to the AWS Management Console as your networking account. For Services, select Resource Access Manager console.
  2. In the left navigation pane, choose Managed permissions library.
  3. For Filter by text, enter my-ipam-cmp and select my-ipam-cmp. You can also select the Any type dropdown menu and then select Customer managed to narrow the list of managed permissions to only your customer managed permissions.
  4. On the my-ipam-cmp page, choose Create version.
  5. You can make the customer managed permission more fine-grained by adding a condition. On the Create a customer managed permission for my-ipam-cmp page, under the Policy template section, choose JSON editor.
  6. Add a condition with aws:PrincipalTag that allows only the users or roles tagged with team = networking to access the shared IPAM pool.
    "Condition": {
                    "StringEquals": {
                        "aws:PrincipalTag/team": "networking"
                    }
                }

  7. Choose Create version. This new version will be automatically set as the default version of your customer managed permission. As a result, new resource shares that use the customer managed permission will use the new version.
     
Figure 7: Update your customer managed permissions and add a condition statement with aws:PrincipalTag

Note: Now that you have the new version of your customer managed permission, you must explicitly apply it to your existing resource shares for it to take effect.

To apply the new version of the customer managed permission to existing resource shares

  1. On the my-ipam-cmp page, under the Managed permission versions, select Version 1.
  2. Choose the Associated resource shares tab.
  3. Find ipam-shared-dev-pool and next to the current version number, select Update to default version. This will update your ipam-shared-dev-pool resource share with the new version of your my-ipam-cmp customer managed permission.

To verify your updated customer managed permission, see the Verify the customer managed permissions section earlier in this post. Make sure that you sign in with an IAM role or user tagged with team = networking, and then repeat the steps of that section to verify your updated customer managed permission. If you use an IAM role or user that is not tagged with team = networking, you won’t be able to allocate a CIDR from the IPAM pool and you won’t be able to create the VPC.

Cleanup

To remove the resources created by the preceding example:

  1. Delete the resource share from the AWS RAM console.
  2. Deprovision the CIDR from the IPAM pool.
  3. Delete the IPAM pool you created.

Summary

This blog post presented an example of using customer managed permissions in AWS RAM. AWS RAM brings simplicity, consistency, and confidence when sharing your resources across accounts. In the example, you used AWS RAM to share an IPAM pool to accounts in a Development OU, configured fine-grained resource access controls, and followed the best practice of least privilege by granting only the permissions required for the accounts in the Development OU to perform a specific task with the shared IPAM pool. In the example, you also created a new version of your customer managed permission to rescope the access granularity of your resources that are shared using AWS RAM.

To learn more about AWS RAM and customer managed permissions, see the AWS RAM documentation and watch the AWS RAM Introduces Customer Managed Permissions demo.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Fabian Labat

Fabian is a principal solutions architect based in New York, where he guides global financial services customers to build highly secure, scalable, reliable, and cost-efficient applications on the cloud. He brings over 25 years of technology experience in system design and IT infrastructure.

Nini Ren

Nini is the product manager for AWS Resource Access Manager (RAM). He enjoys working closely with customers to develop solutions that not only meet their needs, but also create value for their businesses. Nini holds an MBA from The Wharton School, a masters of computer and information technology from the University of Pennsylvania, and an AB in chemistry and physics from Harvard College.

Create an Apache Hudi-based near-real-time transactional data lake using AWS DMS, Amazon Kinesis, AWS Glue streaming ETL, and data visualization using Amazon QuickSight

Post Syndicated from Raj Ramasubbu original https://aws.amazon.com/blogs/big-data/create-an-apache-hudi-based-near-real-time-transactional-data-lake-using-aws-dms-amazon-kinesis-aws-glue-streaming-etl-and-data-visualization-using-amazon-quicksight/

With the rapid growth of technology, more and more data is coming in many different formats: structured, semi-structured, and unstructured. Data analytics on operational data in near-real time is becoming a common need. Due to the exponential growth of data volume, it has become common practice to replace read replicas with data lakes to achieve better scalability and performance. In most real-world use cases, it's important to replicate the data from the relational database source to the target in real time. Change data capture (CDC) is one of the most common design patterns to capture the changes made in the source database and reflect them in other data stores.

We recently announced support for streaming extract, transform, and load (ETL) jobs in AWS Glue version 4.0, a new version of AWS Glue that accelerates data integration workloads in AWS. AWS Glue streaming ETL jobs continuously consume data from streaming sources, clean and transform the data in-flight, and make it available for analysis in seconds. AWS also offers a broad selection of services to support your needs. A database replication service such as AWS Database Migration Service (AWS DMS) can replicate the data from your source systems to Amazon Simple Storage Service (Amazon S3), which commonly hosts the storage layer of the data lake. Although it’s straightforward to apply updates on a relational database management system (RDBMS) that backs an online source application, it’s difficult to apply this CDC process on your data lakes. Apache Hudi, an open-source data management framework used to simplify incremental data processing and data pipeline development, is a good option to solve this problem.

This post demonstrates how to apply CDC changes from Amazon Relational Database Service (Amazon RDS) or other relational databases to an S3 data lake, with flexibility to denormalize, transform, and enrich the data in near-real time.

Solution overview

We use an AWS DMS task to capture near-real-time changes in the source RDS instance, and use Amazon Kinesis Data Streams as a destination of the AWS DMS task CDC replication. An AWS Glue streaming job reads and enriches changed records from Kinesis Data Streams and performs an upsert into the S3 data lake in Apache Hudi format. Then we can query the data with Amazon Athena and visualize it in Amazon QuickSight. AWS Glue natively supports continuous write operations for streaming data to Apache Hudi-based tables.

The following diagram illustrates the architecture used for this post, which is deployed through an AWS CloudFormation template.

Prerequisites

Before you get started, make sure you have the following prerequisites:

Source data overview

To illustrate our use case, we assume a data analyst persona who is interested in analyzing near-real-time data for sport events using the table ticket_activity. An example of this table is shown in the following screenshot.

Apache Hudi connector for AWS Glue

For this post, we use AWS Glue 4.0, which already has native support for the Hudi framework. Hudi, an open-source data lake framework, simplifies incremental data processing in data lakes built on Amazon S3. It enables capabilities including time travel queries, ACID (Atomicity, Consistency, Isolation, Durability) transactions, streaming ingestion, CDC, upserts, and deletes.

Set up resources with AWS CloudFormation

This post includes a CloudFormation template for a quick setup. You can review and customize it to suit your needs.

The CloudFormation template generates the following resources:

  • An RDS database instance (source).
  • An AWS DMS replication instance, used to replicate the data from the source table to Kinesis Data Streams.
  • A Kinesis data stream.
  • Four AWS Glue Python shell jobs:
    • rds-ingest-rds-setup-<CloudFormation Stack name> – Creates one source table called ticket_activity on Amazon RDS.
    • rds-ingest-data-initial-<CloudFormation Stack name> – Generates sample data at random using the Faker library and loads it into the ticket_activity table.
    • rds-ingest-data-incremental-<CloudFormation Stack name> – Ingests new ticket activity data into the source table ticket_activity continuously. This job simulates customer activity.
    • rds-upsert-data-<CloudFormation Stack name> – Upserts specific records in the source table ticket_activity. This job simulates administrator activity.
  • AWS Identity and Access Management (IAM) users and policies.
  • An Amazon VPC, a public subnet, two private subnets, internet gateway, NAT gateway, and route tables.
    • We use private subnets for the RDS database instance and AWS DMS replication instance.
    • We use the NAT gateway to have reachability to pypi.org to use the MySQL connector for Python from the AWS Glue Python shell jobs. It also provides reachability to Kinesis Data Streams and an Amazon S3 API endpoint.

To set up these resources, you must have the following prerequisites:

The following diagram illustrates the architecture of our provisioned resources.

To launch the CloudFormation stack, complete the following steps:

  1. Sign in to the AWS CloudFormation console.
  2. Choose Launch Stack
  3. Choose Next.
  4. For S3BucketName, enter the name of your new S3 bucket.
  5. For VPCCIDR, enter a CIDR IP address range that doesn’t conflict with your existing networks.
  6. For PublicSubnetCIDR, enter the CIDR IP address range within the CIDR you gave for VPCCIDR.
  7. For PrivateSubnetACIDR and PrivateSubnetBCIDR, enter the CIDR IP address range within the CIDR you gave for VPCCIDR.
  8. For SubnetAzA and SubnetAzB, choose the subnets you want to use.
  9. For DatabaseUserName, enter your database user name.
  10. For DatabaseUserPassword, enter your database user password.
  11. Choose Next.
  12. On the next page, choose Next.
  13. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  14. Choose Create stack.

Stack creation can take about 20 minutes.

Set up an initial source table

The AWS Glue job rds-ingest-rds-setup-<CloudFormation stack name> creates a source table called ticket_activity on the RDS database instance. To set up the initial source table in Amazon RDS, complete the following steps:

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose rds-ingest-rds-setup-<CloudFormation stack name> to open the job.
  3. Choose Run.
  4. Navigate to the Runs tab and wait for Run status to show as SUCCEEDED.

This job creates only one table, ticket_activity, in the MySQL instance. See the following DDL code:

CREATE TABLE ticket_activity (
ticketactivity_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
sport_type VARCHAR(256) NOT NULL,
start_date DATETIME NOT NULL,
location VARCHAR(256) NOT NULL,
seat_level VARCHAR(256) NOT NULL,
seat_location VARCHAR(256) NOT NULL,
ticket_price INT NOT NULL,
customer_name VARCHAR(256) NOT NULL,
email_address VARCHAR(256) NOT NULL,
created_at DATETIME NOT NULL,
updated_at DATETIME NOT NULL )

Ingest new records

In this section, we detail the steps to ingest new records. Implement the following steps to start the execution of the jobs.

Start data ingestion to Kinesis Data Streams using AWS DMS

To start data ingestion from Amazon RDS to Kinesis Data Streams, complete the following steps:

  1. On the AWS DMS console, choose Database migration tasks in the navigation pane.
  2. Select the task rds-to-kinesis-<CloudFormation stack name>.
  3. On the Actions menu, choose Restart/Resume.
  4. Wait for the status to show as Load complete and Replication ongoing.

The AWS DMS replication task ingests data from Amazon RDS to Kinesis Data Streams continuously.

Start data ingestion to Amazon S3

Next, to start data ingestion from Kinesis Data Streams to Amazon S3, complete the following steps:

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose streaming-cdc-kinesis2hudi-<CloudFormation stack name> to open the job.
  3. Choose Run.

Do not stop this job; you can check the run status on the Runs tab and wait for it to show as Running.

Start the data load to the source table on Amazon RDS

To start data ingestion to the source table on Amazon RDS, complete the following steps:

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose rds-ingest-data-initial-<CloudFormation stack name> to open the job.
  3. Choose Run.
  4. Navigate to the Runs tab and wait for Run status to show as SUCCEEDED.

Validate the ingested data

About 2 minutes after starting the job, the data should be ingested into Amazon S3. To validate the ingested data in Athena, complete the following steps:

  1. On the Athena console, complete the following steps if you’re running an Athena query for the first time:
    • On the Settings tab, choose Manage.
    • Specify the stage directory and the S3 path where Athena saves the query results.
    • Choose Save.

  2. On the Editor tab, run the following query against the table to check the data:
SELECT * FROM "database_<account_number>_hudi_cdc_demo"."ticket_activity" limit 10;

Note that AWS CloudFormation creates the database with your account number in the name, as database_<your-account-number>_hudi_cdc_demo.

Update existing records

Before you update the existing records, note down the ticketactivity_id value of a record from the ticket_activity table. Run the following SQL using Athena. For this post, we use ticketactivity_id = 46 as an example:

SELECT * FROM "database_<account_number>_hudi_cdc_demo"."ticket_activity" limit 10;

To simulate a real-time use case, update the data in the source table ticket_activity on the RDS database instance to see that the updated records are replicated to Amazon S3. Complete the following steps:

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose rds-ingest-data-incremental-<CloudFormation stack name> to open the job.
  3. Choose Run.
  4. Choose the Runs tab and wait for Run status to show as SUCCEEDED.

To upsert the records in the source table, complete the following steps:

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose the job rds-upsert-data-<CloudFormation stack name>.
  3. On the Job details tab, under Advanced properties, for Job parameters, update the following parameters:
    • For Key, enter --ticketactivity_id.
    • For Value, replace 1 with one of the ticket IDs you noted above (for this post, 46).

  4. Choose Save.
  5. Choose Run and wait for the Run status to show as SUCCEEDED.

This AWS Glue Python shell job simulates a customer activity to buy a ticket. It updates a record in the source table ticket_activity on the RDS database instance using the ticket ID passed in the job argument --ticketactivity_id. It will update ticket_price=500 and updated_at with the current timestamp.

To validate the ingested data in Amazon S3, run the same query from Athena and check the record with the ticketactivity_id you noted earlier to observe the ticket_price and updated_at fields:

SELECT * FROM "database_<account_number>_hudi_cdc_demo"."ticket_activity" where ticketactivity_id = 46 ;

Visualize the data in QuickSight

After you have the output file generated by the AWS Glue streaming job in the S3 bucket, you can use QuickSight to visualize the Hudi data files. QuickSight is a scalable, serverless, embeddable, ML-powered business intelligence (BI) service built for the cloud. QuickSight lets you easily create and publish interactive BI dashboards that include ML-powered insights. QuickSight dashboards can be accessed from any device and seamlessly embedded into your applications, portals, and websites.

Build a QuickSight dashboard

To build a QuickSight dashboard, complete the following steps:

  1. Open the QuickSight console.

You’re presented with the QuickSight welcome page. If you haven’t signed up for QuickSight, you may have to complete the signup wizard. For more information, refer to Signing up for an Amazon QuickSight subscription.

After you have signed up, QuickSight presents a “Welcome wizard.” You can view the short tutorial, or you can close it.

  1. On the QuickSight console, choose your user name and choose Manage QuickSight.
  2. Choose Security & permissions, then choose Manage.
  3. Select Amazon S3 and select the buckets that you created earlier with AWS CloudFormation.
  4. Select Amazon Athena.
  5. Choose Save.
  6. If you changed your Region during the first step of this process, change it back to the Region that you used earlier during the AWS Glue jobs.

Create a dataset

Now that you have QuickSight up and running, you can create your dataset. Complete the following steps:

  1. On the QuickSight console, choose Datasets in the navigation pane.
  2. Choose New dataset.
  3. Choose Athena.
  4. For Data source name, enter a name (for example, hudi-blog).
  5. Choose Validate.
  6. After the validation is successful, choose Create data source.
  7. For Database, choose database_<your-account-number>_hudi_cdc_demo.
  8. For Tables, select ticket_activity.
  9. Choose Select.
  10. Choose Visualize.
  11. Choose hour and then ticketactivity_id to get the count of ticketactivity_id by hour.

Clean up

To clean up your resources, complete the following steps:

  1. Stop the AWS DMS replication task rds-to-kinesis-<CloudFormation stack name>.
  2. Navigate to the RDS database and choose Modify.
  3. Deselect Enable deletion protection, then choose Continue.
  4. Stop the AWS Glue streaming job streaming-cdc-kinesis2hudi-<CloudFormation stack name>.
  5. Delete the CloudFormation stack.
  6. On the QuickSight dashboard, choose your user name, then choose Manage QuickSight.
  7. Choose Account settings, then choose Delete account.
  8. Choose Delete account to confirm.
  9. Enter confirm and choose Delete account.

Conclusion

In this post, we demonstrated how you can stream data—not only new records, but also updated records from relational databases—to Amazon S3 using an AWS Glue streaming job to create an Apache Hudi-based near-real-time transactional data lake. With this approach, you can easily achieve upsert use cases on Amazon S3. We also showcased how to visualize the Apache Hudi table using QuickSight and Athena. As a next step, refer to the Apache Hudi performance tuning guide for a high-volume dataset. To learn more about authoring dashboards in QuickSight, check out the QuickSight Author Workshop.


About the Authors

Raj Ramasubbu is a Sr. Analytics Specialist Solutions Architect focused on big data and analytics and AI/ML with Amazon Web Services. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS. Raj provided technical expertise and leadership in building data engineering, big data analytics, business intelligence, and data science solutions for over 18 years prior to joining AWS. He helped customers in various industry verticals like healthcare, medical devices, life science, retail, asset management, car insurance, residential REIT, agriculture, title insurance, supply chain, document management, and real estate.

Rahul Sonawane is a Principal Analytics Solutions Architect at AWS with AI/ML and Analytics as his area of specialty.

Sundeep Kumar is a Sr. Data Architect, Data Lake at AWS, helping customers build data lakes and analytics platforms and solutions. When not building and designing data lakes, Sundeep enjoys listening to music and playing guitar.

Estimating Scope 1 Carbon Footprint with Amazon Athena

Post Syndicated from Thomas Burns original https://aws.amazon.com/blogs/big-data/estimating-scope-1-carbon-footprint-with-amazon-athena/

Today, more than 400 organizations have signed The Climate Pledge, a commitment to reach net-zero carbon by 2040. Some of the drivers that lead to setting explicit climate goals include customer demand, current and anticipated government relations, employee demand, investor demand, and sustainability as a competitive advantage. AWS customers are increasingly interested in ways to drive sustainability actions. In this blog, we will walk through how we can apply existing enterprise data to better understand and estimate Scope 1 carbon footprint using Amazon Simple Storage Service (S3) and Amazon Athena, a serverless interactive analytics service that makes it easy to analyze data using standard SQL.

The Greenhouse Gas Protocol

The Greenhouse Gas Protocol (GHGP) provides standards for measuring and managing global warming impacts from an organization’s operations and value chain.

The greenhouse gases covered by the GHGP are the seven gases required by the UNFCCC/Kyoto Protocol (which is often called the “Kyoto Basket”). These gases are carbon dioxide (CO2), methane (CH4), nitrous oxide (N2O), the so-called F-gases (hydrofluorocarbons and perfluorocarbons), sulfur hexafluoride (SF6), and nitrogen trifluoride (NF3). Each greenhouse gas is characterized by its global warming potential (GWP), which is determined by the gas’s greenhouse effect and its lifetime in the atmosphere. Since carbon dioxide (CO2) accounts for about 76 percent of total man-made greenhouse gas emissions, the global warming potentials of greenhouse gases are measured relative to CO2, and are thus expressed as CO2-equivalent (CO2e).

The GHGP divides an organization’s emissions into three primary scopes:

  • Scope 1 – Direct greenhouse gas emissions (for example from burning fossil fuels)
  • Scope 2 – Indirect emissions from purchased energy (typically electricity)
  • Scope 3 – Indirect emissions from the value chain, including suppliers and customers

How do we estimate greenhouse gas emissions?

There are different methods for estimating GHG emissions, including the Continuous Emissions Monitoring System (CEMS) Method, the Spend-Based Method, and the Consumption-Based Method.

Direct Measurement – CEMS Method

An organization can estimate its carbon footprint from stationary combustion sources by performing a direct measurement of carbon emissions using the CEMS method. This method requires continuously measuring the pollutants emitted in exhaust gases from each emissions source using equipment such as gas analyzers, gas samplers, gas conditioning equipment (to remove particulate matter, water vapor and other contaminants), plumbing, actuated valves, Programmable Logic Controllers (PLCs) and other controlling software and hardware. Although this approach may yield useful results, CEMS requires specific sensing equipment for each greenhouse gas to be measured, requires supporting hardware and software, and is typically more suitable for Environment Health and Safety applications of centralized emission sources. More information on CEMS is available here.

Spend-Based Method

Because the financial accounting function is mature and often already audited, many organizations choose to use financial controls as a foundation for their carbon footprint accounting. The Economic Input-Output Life Cycle Assessment (EIO LCA) method is a spend-based method that combines expenditure data with monetary-based emission factors to estimate the emissions produced. The emission factors are published by the U.S. Environment Protection Agency (EPA) and other peer-reviewed academic and government sources. With this method, you can multiply the amount of money spent on a business activity by the emission factor to produce the estimated carbon footprint of the activity.

For example, you can convert the amount your company spends on truck transport to estimated kilograms (KG) of carbon dioxide equivalent (CO₂e) emitted as shown below.

Estimated Carbon Footprint = Amount of money spent on truck transport * Emission Factor [1]

Although these computations are very easy to make from general ledgers or other financial records, they are most valuable for initial estimates or for reporting minor sources of greenhouse gases. As the only user-provided input is the amount spent on an activity, EIO LCA methods aren’t useful for modeling improved efficiency. This is because the only way to reduce EIO-calculated emissions is to reduce spending. Therefore, as a company continues to improve its carbon footprint efficiency, other methods of estimating carbon footprint are often more desirable.

Consumption-Based Method

From either Enterprise Resource Planning (ERP) systems or electronic copies of fuel bills, it’s straightforward to determine the amount of fuel an organization procures during a reporting period. Fuel-based emission factors are available from a variety of sources such as the US Environmental Protection Agency and commercially-licensed databases. Multiplying the amount of fuel procured by the emission factor yields an estimate of the CO2e emitted through combustion. This method is often used for estimating the carbon footprint of stationary emissions (for instance backup generators for data centers or fossil fuel ovens for industrial processes).

If for a particular month an enterprise consumed a known amount of motor gasoline for stationary combustion, the Scope 1 CO2e footprint of the stationary gasoline combustion can be estimated in the following manner:

Estimated Carbon Footprint = Amount of Fuel Consumed * Stationary Combustion Emission Factor[2]

Organizations may estimate their carbon emissions by using existing data found in fuel and electricity bills, ERP data, and relevant emission factors, which are then consolidated into a data lake. Using existing analytics tools such as Amazon Athena and Amazon QuickSight, an organization can gain insight into its estimated carbon footprint.

The data architecture diagram below shows an example of how you could use AWS services to calculate and visualize an organization’s estimated carbon footprint.

Analytics Architecture

Customers have the flexibility to choose the services in each stage of the data pipeline based on their use case. For example, in the data ingestion phase, depending on the existing data requirements, there are many options to ingest data into the data lake such as using the AWS Command Line Interface (CLI), AWS DataSync, or AWS Database Migration Service.

Example of calculating a Scope 1 stationary emissions footprint with AWS services

Let’s assume you burned 100 standard cubic feet (scf) of natural gas in an oven. Using the US EPA emission factors for stationary emissions, we can estimate the carbon footprint associated with the burning. In this case, the emission factor is 0.05449555 kg CO2e/scf.[3]
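
Applying the consumption-based formula above to this example:

Estimated Carbon Footprint = 100 scf * 0.05449555 kg CO2e/scf ≈ 5.45 kg CO2e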

Amazon S3 is ideal for building a data lake on AWS to store disparate data sources in a single repository, due to its virtually unlimited scalability and high durability. Athena, a serverless interactive query service, allows the analysis of data directly from Amazon S3 using standard SQL without having to load the data into Athena or run complex extract, transform, and load (ETL) processes. Amazon QuickSight supports creating visualizations of different data sources, including Amazon S3 and Athena, and the flexibility to use custom SQL to extract a subset of the data. QuickSight dashboards can provide you with insights (such as your company’s estimated carbon footprint) quickly, and also provide the ability to generate standardized reports for your business and sustainability users.

In this example, the sample data is stored in a file system and uploaded to Amazon S3 using the AWS Command Line Interface (CLI) as shown in the following architecture diagram. AWS recommends creating AWS resources and managing CLI access in accordance with the Best Practices for Security, Identity, & Compliance guidance.

The AWS CLI command below demonstrates how to upload the sample data folders into the S3 target location; run the command for each folder, using the --recursive flag to copy the folder contents.

aws s3 cp /path/to/local/folder s3://bucket-name/path/to/destination --recursive

The snapshot of the S3 console shows two newly added folders that contain the files.

S3 Bucket Overview of Files

To create new table schemas, we start by running the following script for the gas utilization table in the Athena query editor using Hive DDL. The script defines the data format, column details, table properties, and the location of the data in S3.

CREATE EXTERNAL TABLE `gasutilization`(
`fuel_id` int,
`month` string,
`year` int,
`usage_therms` float,
`usage_scf` float,
`g-nr1_schedule_charge` float,
`accountfee` float,
`gas_ppps` float,
`netcharge` float,
`taxpercentage` float,
`totalcharge` float)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://<bucketname>/Scope 1 Sample Data/gasutilization'
TBLPROPERTIES (
'classification'='csv',
'skip.header.line.count'='1')

Athena Hive DDL

The script below shows another example of using Hive DDL to generate the table schema for the gas emission factor data.

CREATE EXTERNAL TABLE `gas_emission_factor`(
`fuel_id` int,
`gas_name` string,
`emission_factor` float)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://<bucketname>/Scope 1 Sample Data/gas_emission_factor'
TBLPROPERTIES (
'classification'='csv',
'skip.header.line.count'='1')

After creating the table schema in Athena, we run the following query against the gas utilization table, which includes details from gas bills, to show the gas utilization and the associated charges, such as the gas public purpose program surcharge (PPPS) and total charges after taxes, for the year 2020:

SELECT * FROM "gasutilization" where year = 2020;

Athena gas utilization overview by month

We are also able to analyze the emission factor data showing the different fuel types and their corresponding CO2e emission as shown in the screenshot.

athena co2e emission factor

With the emission factor and the gas utilization data, we can run the following query to get an estimated Scope 1 carbon footprint alongside other details. The query joins the gas utilization table and the gas emission factor table on fuel ID and multiplies the gas usage in standard cubic feet (scf) by the emission factor to get the estimated CO2e impact. It also selects the month, year, total charge, and gas usage measured in therms and scf, because these are attributes that are often of interest to customers.

SELECT "gasutilization"."usage_scf" * "gas_emission_factor"."emission_factor" 
AS "estimated_CO2e_impact", 
"gasutilization"."month", 
"gasutilization"."year", 
"gasutilization"."totalcharge", 
"gasutilization"."usage_therms", 
"gasutilization"."usage_scf" 
FROM "gasutilization" 
JOIN "gas_emission_factor" 
on "gasutilization"."fuel_id"="gas_emission_factor"."fuel_id";

athena join

Lastly, Amazon QuickSight allows visualization of different data sources, including Amazon S3 and Athena, and the flexibility to use custom SQL to get a subset of the data. The following is an example of a QuickSight dashboard showing the gas utilization, gas charges, and estimated carbon footprint across different years.

QuickSight sample dashboard

We have just estimated the Scope 1 carbon footprint for one source of stationary combustion. If we were to do the same process for all sources of stationary and mobile emissions (with different emissions factors) and add the results together, we could roll up an accurate estimate of our Scope 1 carbon emissions for the entire business by only utilizing native AWS services and our own data. A similar process will yield an estimate of Scope 2 emissions, with grid carbon intensity in the place of Scope 1 emission factors.

Summary

This blog discusses how organizations can use existing data in disparate sources to build a data architecture to gain better visibility into Scope 1 greenhouse gas emissions. With Athena, S3, and QuickSight, organizations can now estimate their stationary emissions carbon footprint in a repeatable way by applying the consumption-based method to convert fuel utilization into an estimated carbon footprint.

Other approaches available on AWS include Carbon Accounting on AWS, Sustainability Insights Framework, Carbon Data Lake on AWS, and general guidance detailed at the AWS Carbon Accounting Page.

If you are interested in information on estimating your organization’s carbon footprint with AWS, please reach out to your AWS account team and check out AWS Sustainability Solutions.

References

  1. An example from page four of Amazon’s Carbon Methodology document illustrates this concept.
    Amount spent on truck transport: $100,000
    EPA Emission Factor: 1.556 kg CO2e/dollar of truck transport
    Estimated CO2e emission: $100,000 * 1.556 kg CO2e/dollar of truck transport = 155,600 kg of CO2e
  2. For example,
    Gasoline consumed: 1,000 US gallons
    EPA Emission Factor: 8.81 kg of CO2e/gallon of gasoline combusted
    Estimated CO2e emission = 1,000 US gallons * 8.81 kg of CO2e per gallon of gasoline combusted = 8,810 kg of CO2e.
    The EPA emission factor for stationary emissions of motor gasoline is 8.78 kg of CO2 plus 0.38 g of CH4 plus 0.08 g of N2O.
    Combining these emission factors using the 100-year global warming potential of each gas (CH4: 25 and N2O: 298) gives us a combined emission factor of 8.78 kg + 25 * 0.00038 kg + 298 * 0.00008 kg = 8.81 kg of CO2e per gallon.
  3. The emission factor per scf is 0.05444 kg of CO2 plus 0.00103 g of CH4 plus 0.0001 g of N2O. To get this in terms of CO2e, we need to multiply the emission factors of the other two gases by their global warming potentials (GWP). The 100-year GWPs for CH4 and N2O are 25 and 298, respectively. Emission factors and GWPs come from the US EPA website.


About the Authors


Thomas Burns, SCR, CISSP, is a Principal Sustainability Strategist and Principal Solutions Architect at Amazon Web Services. Thomas supports manufacturing and industrial customers worldwide. Thomas’s focus is using the cloud to help companies reduce their environmental impact both inside and outside of IT.

Aileen Zheng is a Solutions Architect supporting US Federal Civilian Sciences customers at Amazon Web Services (AWS). She partners with customers to provide technical guidance on enterprise cloud adoption and strategy and helps with building well-architected solutions. She is also very passionate about data analytics and machine learning. In her free time, you’ll find Aileen doing pilates, taking her dog Mumu out for a hike, or hunting down another good spot for food! You’ll also see her contributing to projects to support diversity and women in technology.

How FIS ingests and searches vector data for quick ticket resolution with Amazon OpenSearch Service

Post Syndicated from Rupesh Tiwari original https://aws.amazon.com/blogs/big-data/how-fis-ingests-and-searches-vector-data-for-quick-ticket-resolution-with-amazon-opensearch-service/

This post was co-written by Sheel Saket, Senior Data Science Manager at FIS, and Rupesh Tiwari, Senior Architect at Amazon Web Services.

Do you ever find yourself grappling with multiple defect logging mechanisms, scattered project management tools, and fragmented software development platforms? Have you experienced the frustration of lacking a unified view, hindering your ability to efficiently manage and identify common trending issues within your enterprise? Are you constantly facing challenges when it comes to addressing defects and their impact, causing disruptions in your production cycles?

If these questions resonate with you, then you’re not alone. FIS, a leading technology and services provider, has encountered these very challenges. In their quest for a solution, they teamed up with AWS to tackle these obstacles head-on. In this post, we take you on a journey through their collaborative project, exploring how they used Amazon OpenSearch Service to transform their operations, enhance efficiency, and gain valuable insights.

This post shares FIS’s journey in overcoming challenges and provides step-by-step instructions for provisioning the solution architecture in your AWS account. You’ll learn how to implement a transformative solution that empowers your organization with near-real-time data indexing and visualization capabilities.

In the following sections, we dive into the details of FIS’s journey and discover how they overcame these challenges, revolutionizing their approach to defect management and software development.

Challenges for near-real-time ticket visualization and search

FIS faced several challenges in achieving near-real-time ticket visualization and search capabilities, including the following:

  • Integrating ticket data from tens of different third-party systems
  • Overcoming API call thresholds and limitations from various systems
  • Implementing an efficient KNN vector search algorithm for resolving issues and performing trend analysis
  • Establishing a robust data ingestion and indexing process for real-time updates from 15,000 tickets per day
  • Ensuring unified access to ticket information across 20 development teams
  • Providing secure and scalable access to ticket data for up to 250 teams

Despite these challenges, FIS successfully enhanced their operational efficiency, enabled quick ticket resolution, and gained valuable insights through the integration of OpenSearch Service.

Let’s delve into the technical walkthrough of the architecture diagram and mechanisms. The following section provides step-by-step instructions for provisioning and implementing the solution on your AWS Management Console, along with a helpful video tutorial.

Solution overview

The architecture diagram of FIS’s near-real-time data indexing and visualization solution incorporates various AWS services for specific functions. The solution uses GitHub as the data source, employs Amazon Simple Storage Service (Amazon S3) for scalable storage, manages APIs with Amazon API Gateway, performs serverless computing using AWS Lambda, and facilitates data streaming and ETL (extract, transform, and load) processes through Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose. OpenSearch Service is employed for analytics and application monitoring. This architecture ensures a robust and scalable solution, enabling FIS to efficiently index and visualize data in near-real time. With these AWS services, FIS effectively manages their data pipeline and gains valuable insights for their business processes.

The following diagram illustrates the solution architecture.

Architecture Diagram

The workflow includes the following steps:

  1. GitHub webhook events stream data to both Amazon S3 and OpenSearch
    Service, facilitating real-time data analysis.
  2. A Lambda function connects to an API Gateway REST API, processing and structuring the received payloads.
  3. The Lambda function adds the structured data to a Kinesis data stream, enabling immediate data streaming and quick ticket insights.
  4. Kinesis Data Firehose streams the records from the Kinesis data stream to an S3 bucket, simultaneously creating an index in OpenSearch Service.
  5. OpenSearch Service uses the indexed data to provide near-real-time visualization and enable efficient ticket analysis through K-Nearest Neighbor (KNN) search, enhancing productivity and optimizing data operations.
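
The KNN lookup described in step 5 is not published by FIS, but a minimal Python sketch using the opensearch-py client might look like the following. The domain endpoint, credentials, index name (git-tickets), vector field (ticket_embedding), and query embedding are all placeholders, and the index is assumed to already have a knn_vector mapping.

from opensearchpy import OpenSearch

# Connect to the OpenSearch Service domain (endpoint and credentials are placeholders).
client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=("master-user", "master-password"),
    use_ssl=True,
    verify_certs=True,
)

# Embedding of the new ticket text, produced by whatever model generates the vectors.
query_vector = [0.12, 0.48, 0.33]  # hypothetical and truncated for brevity

# k-NN query: return the 5 historical tickets most similar to the new one.
response = client.search(
    index="git-tickets",  # hypothetical index name
    body={
        "size": 5,
        "query": {
            "knn": {
                "ticket_embedding": {  # hypothetical vector field
                    "vector": query_vector,
                    "k": 5,
                }
            }
        },
    },
)

for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))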

The following sections provide step-by-step instructions for setting up the solution. Additionally, we have created a video guide that demonstrates each step in detail. You are welcome to watch the video and follow along with this post if you prefer.

Prerequisites

You should have the following prerequisites:

Implement the solution

Complete the following steps to implement the solution:

  1. Create an OpenSearch Service domain.
  2. Create an S3 bucket named git-data.
  3. Create a Kinesis data stream named git-data-stream.
  4. Create a Firehose delivery stream named git-data-delivery-stream with
    git-data-stream as the source and git-data as the destination, and a buffer interval of 60 seconds.
  5. Create a Lambda function named git-webhook-handler with a timeout of 5 minutes. Add code that writes the incoming webhook payload to the Kinesis data stream (a minimal handler sketch follows this list).
  6. Grant the Lambda function’s execution role permission to put_record on the Kinesis data stream.
  7. Create a REST API in API Gateway named git-webhook-handler-api. Create a resource named
    git-data with a POST method, integrate it with the Lambda function git-webhook-handler created in the previous step, and deploy the REST API.
  8. Create a delivery stream with the Kinesis data stream as the source and OpenSearch Service as the destination. Grant the AWS Identity and Access Management (IAM) role for Kinesis Data Firehose the necessary permissions to create an index in OpenSearch Service. Finally, add the IAM role as a backend role in OpenSearch Service.
  9. Navigate to your GitHub repository and create a webhook to enable seamless integration with the solution. Copy the REST API URL and enter this newly created webhook.
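
The code for step 5 isn’t published in this post, but a minimal sketch of the git-webhook-handler function might look like the following Python handler. It assumes the API Gateway proxy integration passes the GitHub webhook payload in event["body"], and it uses the git-data-stream stream created in step 3; the fields pulled out of the payload are only examples.

import json

import boto3

kinesis = boto3.client("kinesis")
STREAM_NAME = "git-data-stream"  # the stream created in step 3

def lambda_handler(event, context):
    # The API Gateway proxy integration delivers the GitHub webhook payload as a JSON string.
    payload = json.loads(event.get("body") or "{}")

    # Structure the payload before streaming it; the selected fields are illustrative.
    record = {
        "repository": payload.get("repository", {}).get("full_name"),
        "action": payload.get("action"),
        "event": payload,
    }

    # Add the structured record to the Kinesis data stream.
    response = kinesis.put_record(
        StreamName=STREAM_NAME,
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=record.get("repository") or "unknown",
    )

    # Returning ShardId and SequenceNumber makes them visible in the webhook's recent deliveries.
    return {
        "statusCode": 200,
        "body": json.dumps({
            "ShardId": response["ShardId"],
            "SequenceNumber": response["SequenceNumber"],
        }),
    }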

Test the solution

To test the solution, complete the following steps:

  1. Go to your GitHub repository and choose the Star button, and verify that you receive a response with a status code of 200.
  2. Also, check for the ShardId and SequenceNumber in the recent deliveries to confirm successful event addition to the Kinesis data stream.

Kinesis data stream

  3. On the Kinesis console, use the Data Viewer to confirm the arrival of data records.

kinesis record data

  4. Navigate to OpenSearch Dashboards and open Dev Tools.
  5. Search for the records and observe that all the Git events are displayed in the result pane.

opensearch devtool

  6. On the Amazon S3 console, open the bucket and view the data records.

s3 bucket records

Security

We adhere to IAM best practices to uphold security:

  1. Craft a Lambda execution role for read/write operations on the Kinesis data stream.
  2. Generate an IAM role for Kinesis Data Firehose to manage Amazon S3 and OpenSearch
    Service access.
  3. Link this IAM role in OpenSearch Service security to confer backend user privileges.

Clean up

To avoid incurring future charges, delete all the resources you created.

Benefits of near-real-time ticket visualization and search

During our demonstration, we showcased the utilization of GitHub as the streaming data source. However, it’s important to note that the solution we presented has the flexibility to scale and incorporate multiple data sources from various services. This allows for the consolidation and visualization of diverse data in near-real time, using the capabilities of OpenSearch Service.

With the implementation of the solution described in this post, FIS effectively overcame all the challenges they faced.

In this section, we delve into the details of the challenges and benefits they achieved:

  • Integrating ticket data from multiple third-party systems – Near-real-time data streaming ensures an up-to-date information flow from third-party providers for timely insights
  • Overcoming API call thresholds and limitations imposed by different systems – Unrestricted data flow with no threshold or rate limiting enables seamless integration and continuous updates
  • Accommodating scalability requirements for up to 250 teams – The asynchronous, serverless architecture effortlessly scales more than 250 times larger without infrastructure modifications
  • Efficiently resolving tickets and performing trend analysis – OpenSearch Service semantic KNN search identifies duplicates and defects, and optimizes operations for improved efficiency
  • Gaining valuable insights for business processes – Artificial intelligence (AI) and machine
    learning (ML) analytics use the data stored in the S3 bucket, empowering deeper insights and informed decision-making
  • Ensuring secure access to ticket data and regulatory compliance – Secure data access and compliance with data protection regulations ensure data privacy and regulatory compliance

Conclusion

FIS, in collaboration with AWS, successfully addressed several challenges to achieve near-real-time ticket visualization and search capabilities. With OpenSearch Service, FIS enhanced operational efficiency by resolving tickets quickly and performing trend analysis. With their data ingestion and indexing process, FIS processed 15,000 tickets per day in near real time. The solution provided secure and scalable access to ticket data for more than 250 teams, enabling unified collaboration. FIS experienced a remarkable 30% reduction in ticket resolution time, empowering teams to quickly address issues.

As Sheel Saket, Senior Data Science Manager at FIS, states, “Our near-real-time solution transformed how we identify and resolve tickets, improving our overall productivity.”

Furthermore, organizations can further improve the solution by adopting Amazon OpenSearch Ingestion for data ingestion, which offers cost savings and out-of-the-box data processing capabilities. By embracing this transformative solution, organizations can optimize their ticket management, drive productivity, and deliver exceptional experiences to customers.

Want to know more? You can reach out to FIS from their official FIS contact page, follow FIS Twitter, and visit the FIS LinkedIn page.


About the Author

Rupesh Tiwari is a Senior Solutions Architect at AWS in New York City, with a focus on Financial Services. He has over 18 years of IT experience in the finance, insurance, and education domains, and specializes in architecting large-scale applications and cloud-native big data workloads. In his spare time, Rupesh enjoys singing karaoke, watching comedy TV series, and creating joyful moments with his family.

Sheel Saket is a Senior Data Science Manager at FIS in Chicago, Illinois. He has over 11 years of IT experience in the finance, insurance, and e-commerce domains, and specializes in architecting large-scale AI solutions and cloud MLOps. In his spare time, Sheel enjoys listening to audiobooks, podcasts, and watching movies with his family.

Amazon Kinesis Data Streams on-demand capacity mode now scales up to 1 GB/second ingest capacity

Post Syndicated from Nihar Sheth original https://aws.amazon.com/blogs/big-data/amazon-kinesis-data-streams-on-demand-capacity-mode-now-scales-up-to-1-gb-second-ingest-capacity/

Amazon Kinesis Data Streams is a serverless data streaming service that makes it easy to capture, process, and store streaming data at any scale. As customers collect and stream more types of data, they have asked for simpler, elastic data streams that can handle variable and unpredictable data traffic. In November 2021, Amazon Web Services launched the on-demand capacity mode for Kinesis Data Streams, which is capable of serving gigabytes of write and read throughput per minute and helps reduce the operational pain point of manually updating data stream capacity. You can create a new on-demand data stream or convert an existing data stream to on-demand mode with a single click and never have to provision and manage servers, storage, or throughput. By default, on-demand capacity mode can automatically scale up to 200 MB/s of write throughput.

We were encouraged by customers’ adoption of on-demand capacity mode, but as customers scaled their workloads, some ran into the 200 MB/s data ingestion limit and asked for a solution. The team worked backward from customer feedback to raise that limit. As of March 2023, Kinesis Data Streams supports an increased on-demand write throughput limit of 1 GB/s, a five-times increase from the previous limit of 200 MB/s. The result is a truly serverless and elastic data streaming service that works for all your use cases. If you require an increase in capacity, you can contact AWS Support to enable on-demand streams to scale up to 1 GB/s write throughput for each requested account. You pay for throughput consumed rather than for provisioned resources, making it easier to balance costs and performance. Overall, if your data volume can spike unpredictably or you don’t want to manage the number of shards, use on-demand streams.

In this post, we explore how to use Kinesis Data Streams on-demand scaling and best practices to build an efficient data-streaming solution. We discuss different scenarios to avoid write throughput exceptions and scale ingest capacity of Kinesis Data Streams to 1 GB/s in on-demand capacity mode.

Kinesis Data Streams on-demand scaling

A shard serves as a base throughput unit of Kinesis Data Streams. A shard supports 1 MB/s and 1,000 records/s for writes and 2 MB/s for reads. The shard limits ensure predictable performance, making it easy to design and operate a highly reliable data streaming workflow. In on-demand capacity mode, scaling happens at the individual shard level. When the average ingest shard utilization reaches 50% (0.5 MB/s or 500 records/s) in 1 minute, then a shard is split into two shards. If you use random values as a partition key, all shards of the stream will have even traffic, and they will be scaled at the same time. If you use a business-specific key as a partition key, the shards will have uneven traffic. In that scenario, only the shards exceeding an average of 50% utilization will be scaled. Depending upon the number of shards being scaled, it will take up to 15 minutes to split the shards.

When we create a new Kinesis data stream in on-demand capacity mode, by default, Kinesis Data Streams provisions four shards, which provides 4 MB/s write and 8 MB/s read throughput. As the workload ramps up, Kinesis Data Streams increases the number of shards in the stream by monitoring ingest throughput at the shard level. The 4 MB/s default ingest throughput and scaling at shard level in on-demand capacity mode works for most use cases. However, in some specific scenarios, producers may face WriteThroughputExceeded and Rate Exceeded errors, even in on-demand capacity mode. We discuss a few of these scenarios in the following sections and strategies to avoid these errors.

You can create and save record templates and easily send data to Kinesis Data Streams using the Amazon Kinesis Data Generator (KDG) to test the streaming data solution. Alternatively, you can also use the modern load testing framework Locust to run large-scale Kinesis Data Streams load testing. For this post, we use the Locust tool to produce and ingest messages in Kinesis Data Streams for our different use cases.
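
The exact Locust scripts used for these tests aren’t shown in this post, but a minimal Locust (2.x API assumed) user that writes records to a data stream could look like the following sketch. The stream name, record size, and wait time are placeholders that you would tune, together with the number of simulated users, to reach the target MB/s.

import json
import os
import time
import uuid

import boto3
from locust import User, constant, task

STREAM_NAME = os.environ.get("STREAM_NAME", "kds-od-default-shards")  # placeholder
RECORD_BYTES = 1024  # ~1 KB per record; tune with the user count to hit the target MB/s

class KinesisProducer(User):
    wait_time = constant(0)  # send as fast as each simulated user can

    def on_start(self):
        self.kinesis = boto3.client("kinesis")
        self.payload = json.dumps({"data": "x" * RECORD_BYTES}).encode("utf-8")

    @task
    def put_record(self):
        start = time.time()
        exception = None
        try:
            self.kinesis.put_record(
                StreamName=STREAM_NAME,
                Data=self.payload,
                PartitionKey=str(uuid.uuid4()),  # random key spreads traffic across shards
            )
        except Exception as err:  # throttles surface here as write throughput exceptions
            exception = err
        # Report the call to Locust so throughput and failures show up in its statistics.
        self.environment.events.request.fire(
            request_type="kinesis",
            name="put_record",
            response_time=(time.time() - start) * 1000,
            response_length=len(self.payload),
            exception=exception,
            context={},
        )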

Scenario 1: A baseline ingest throughput greater than 4 MB/s is needed

To simulate this scenario, run the following AWS Command Line Interface (AWS CLI) command to create the kds-od-default-shards data stream in on-demand capacity mode:

aws kinesis create-stream --stream-name kds-od-default-shards --stream-mode-details StreamMode=ON_DEMAND --region us-east-1

When the kds-od-default-shards data stream is active, run the following AWS CLI command to check the number of shards in the data stream:

aws kinesis describe-stream-summary --stream-name kds-od-default-shards --region us-east-1

You can observe that the OpenShardCount value is 4, which means the kds-od-default-shards data stream has an ingest capacity of 4 MB/s.

Next, we use the Locust tool to set the baseline to approximately 25 MB/s of incoming records. As displayed in the following Amazon CloudWatch metrics graph, records are getting throttled for the first couple of minutes. Then the kds-od-default-shards data stream scales the number of shards to support 25 MB/s ingest throughput, and records stop getting throttled. You can also rerun the describe-stream-summary AWS CLI command to check the increased number of shards in the data stream.

BDB-3047-scenario-1-incoming-data

BDB-3047-scenario-1-record-throttle

In a scenario where we know our ingest throughput baseline (25 MB/s) ahead of time and we don’t want to observe any write throttles, we can create a stream in provisioned mode by specifying the number of shards (30), as shown in the following AWS CLI command (make sure to delete the kds-od-default-shards stream manually from the Kinesis Data Streams console before running this command):

aws kinesis create-stream --stream-name kds-od-default-shards --stream-mode-details StreamMode=PROVISIONED --shard-count 30 --region us-east-1

When the kds-od-default-shards data stream is active, run the following AWS CLI command to convert the data stream’s capacity mode to on-demand:

aws kinesis update-stream-mode --stream-arn arn:aws:kinesis:us-east-1:<AccountId>:stream/kds-od-default-shards --stream-mode-details StreamMode=ON_DEMAND --region us-east-1

Next, we send 25 MB/s records to the kds-od-default-shards data stream. As displayed in the following CloudWatch metrics graph, we can observe no write throttles, and the kds-od-default-shards data stream scales the number of shards to handle the increase in ingest volume.

BDB-3047-scenario-1-incoming-data1

BDB-3047-scenario-1-record-throttle1

After we send 25 MB/s traffic to the data stream for some time, we can run the following AWS CLI command to see that the OpenShardCount value has now increased to more than 30:

aws kinesis describe-stream-summary --stream-name kds-od-default-shards --region us-east-1

Scenario 2: A significant ingestion spike is expected, which needs ingest throughput greater than the number of shards in the stream

To simulate the scenario, run the following AWS CLI command to create the kds-od-significant-spike data stream in on-demand capacity mode:

aws kinesis create-stream --stream-name kds-od-significant-spike --stream-mode-details StreamMode=ON_DEMAND --region us-east-1

As mentioned earlier, by default, the kds-od-significant-spike data stream will have four shards initially because this stream is created in on-demand mode. When the data stream is active, we send 4 MB/s ingest throughput initially and grow the ingest throughput by 30–50% every 5–10 minutes. As displayed in the following CloudWatch metrics graph, the kds-od-significant-spike data stream scales the number of shards to handle the increase in ingest volume.

After approximately 15 minutes, run the following AWS CLI command to find the OpenShardCount value (x) of the kds-od-significant-spike data stream. Then send (x * 2) MB/s ingest throughput to the data stream for 2–3 minutes and reduce the ingest throughput to the prior level:

aws kinesis describe-stream-summary --stream-name kds-od-significant-spike --region us-east-1

As displayed in the following CloudWatch metrics graph, the records are getting throttled for a few minutes, and then the throttling goes away.

BDB-3047-scenario-2-incoming-data

BDB-3047-scenario-2-record-throttle

Typically, we face a significant spike scenario when running planned events, such as shopping holidays and product launches. To handle such scenarios, we can proactively change capacity mode from on-demand to provisioned. We can configure the number of shards and pick the ingest capacity we anticipate. After we successfully scale the number of shards to our desired peak capacity in provisioned capacity mode, we can change the capacity mode back to on-demand mode.
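
If you want to script that switch rather than use the console, a boto3 sketch along these lines could work; the stream name, ARN, and peak shard count are placeholders. Note that UpdateShardCount can at most double the current shard count per call, and that you can switch capacity modes only twice per 24-hour rolling period.

import boto3

kinesis = boto3.client("kinesis")
STREAM_NAME = "kds-od-significant-spike"  # placeholder
STREAM_ARN = "arn:aws:kinesis:us-east-1:111122223333:stream/kds-od-significant-spike"  # placeholder
PEAK_SHARDS = 60  # anticipated peak, roughly 60 MB/s of write capacity

waiter = kinesis.get_waiter("stream_exists")  # waits until the stream is ACTIVE again

# Before the planned event: switch to provisioned mode.
kinesis.update_stream_mode(
    StreamARN=STREAM_ARN,
    StreamModeDetails={"StreamMode": "PROVISIONED"},
)
waiter.wait(StreamName=STREAM_NAME)

# Scale toward the anticipated peak; each UpdateShardCount call can at most double the count.
current = kinesis.describe_stream_summary(StreamName=STREAM_NAME)[
    "StreamDescriptionSummary"]["OpenShardCount"]
while current < PEAK_SHARDS:
    target = min(current * 2, PEAK_SHARDS)
    kinesis.update_shard_count(
        StreamName=STREAM_NAME,
        TargetShardCount=target,
        ScalingType="UNIFORM_SCALING",
    )
    waiter.wait(StreamName=STREAM_NAME)
    current = target

# After reaching the desired shard count: switch back to on-demand mode.
kinesis.update_stream_mode(
    StreamARN=STREAM_ARN,
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)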

Scenario 3: A single partition key starts pushing more than 1 MB/s

Partition keys are used to segregate and route records to different shards of a stream. A partition key is specified by the data producer while adding data to the data stream. For example, let’s assume we have a stream with two shards (shard 1 and shard 2). We can configure the data producer to use two partition keys (key A and key B) so that all records with key A are added to shard 1 and all records with key B are added to shard 2. Choosing a partition key is a very important decision, and we should carefully pick the partition key to ensure equal distribution of records across all the shards of the stream. Messages tied to a single partition key A will be sent to a single shard (shard 1), and at any given instance, messages tied to a single partition key A cannot be distributed across different shards. As mentioned earlier, by default, one shard supports 1 MB/s and 1,000 records/s for writes, and we may end up with an edge case scenario where we are trying to push more than 1 MB/s for a specific partition key. In this scenario, producers will continue to experience throttles and keep retrying indefinitely.

To simulate the scenario, run the following AWS CLI command to create the kds-od-partition-key-throttle data stream in on-demand capacity mode:

aws kinesis create-stream --stream-name kds-od-partition-key-throttle --stream-mode-details StreamMode=ON_DEMAND --region us-east-1

As mentioned earlier, by default, the data stream will have four shards initially because this stream is created in on-demand mode. When the data stream is active, we send 1.5 MB/s ingest throughput continuously for the specific partition key A. As displayed in the following CloudWatch metrics graph, we can observe that throttling continues on a single shard even though we are sending only 1.5 MB/s of ingest throughput and the kds-od-partition-key-throttle data stream has an overall ingest capacity of 4 MB/s.

BDB-3047-scenario-3-incoming-data

BDB-3047-scenario-3-record-throttle

To avoid this scenario, we should carefully pick our partition key and ensure that this specific partition key won’t be continuously sending more than 1 MB/s ingest throughput in the data stream.
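
As an illustration of that guidance, the producer sketch below contrasts a fixed "hot" key with a high-cardinality key such as a random UUID; the stream name matches the one created above, and the event payload is hypothetical.

import json
import uuid

import boto3

kinesis = boto3.client("kinesis")

def send_event(event, hot_key=False):
    # A fixed key such as "key-A" pins every record to one shard, capping that traffic
    # at 1 MB/s and 1,000 records/s regardless of how many shards the stream has.
    # A random UUID spreads records evenly across all shards of the stream.
    partition_key = "key-A" if hot_key else str(uuid.uuid4())
    return kinesis.put_record(
        StreamName="kds-od-partition-key-throttle",
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=partition_key,
    )

# Evenly distributed writes let on-demand scaling absorb the load by adding shards.
send_event({"order_id": 1001, "amount": 25.0})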

Scale the ingest capacity of Kinesis Data Streams to 1 GB/s in on-demand capacity mode

To test, we start with approximately 100 MB/s baseline ingest throughput to Kinesis Data Streams in on-demand capacity mode, then we increase the ingest throughput by 30–50% every 5–10 minutes using the Locust load testing tool.

To set up the scenario, first create the kds-od-1gb-stream data stream in provisioned capacity mode and provide a value of 120 for the provisioned shards field:

aws kinesis create-stream --stream-name kds-od-1gb-stream --stream-mode-details StreamMode=PROVISIONED --shard-count 120 --region us-east-1

When the kds-od-1gb-stream data stream is active, switch its capacity mode to on-demand, as shown in the following code. When we change capacity mode from provisioned to on-demand, the shard count (120) remains the same for the data stream even in on-demand capacity mode.

aws kinesis update-stream-mode --stream-arn arn:aws:kinesis:us-east-1:<AccountId>:stream/kds-od-1gb-stream --stream-mode-details StreamMode=ON_DEMAND --region us-east-1

When the kds-od-1gb-stream data stream is in on-demand mode, start the experiment. We send approximately 100 MB/s baseline ingest throughput using the Locust tool and increase 30–50% ingest throughput every 5–10 minutes. As displayed in the following CloudWatch metrics graph, the kds-od-1gb-stream data stream seamlessly scaled to 1 GB/s in on-demand capacity mode. We can also observe that the producers didn’t encounter any write throttles while the data stream was scaling in on-demand capacity mode.

BDB-3047-scale-to-1-GB

Clean up

To avoid ongoing costs, delete all the data streams that you created as part of this post using the Kinesis Data Streams console.

Conclusion

This post demonstrated the on-demand scaling policy of Kinesis Data Streams with a few scenarios using best practices and showed how to scale ingest capacity to 1 GB/s in on-demand capacity mode. You can have an on-demand write throughput limit that is five times larger than the previous limit of 200 MB/s. Choose on-demand mode if you create new data streams with unknown workloads, have unpredictable application traffic, or prefer not to manage capacity. You can switch between on-demand and provisioned capacity modes two times per 24-hour rolling period. Please leave any feedback in the comments section.


About the Authors

Nihar Sheth is a Senior Product Manager on the Amazon Kinesis Data Streams team at Amazon Web Services. He is passionate about developing intuitive product experiences that solve complex customer problems and enable customers to achieve their business goals.

Pratik Patel is Sr. Technical Account Manager and streaming analytics specialist. He works with AWS customers and provides ongoing support and technical guidance to help plan and build solutions using best practices and proactively keep customers’ AWS environments operationally healthy.

Nisha Dekhtawala is a Partner Solutions Architect and data analytics specialist. She works with global consulting partners as their trusted advisor, providing technical guidance and support in building Well-Architected innovative industry solutions.

Empower your Jira data in a data lake with Amazon AppFlow and AWS Glue

Post Syndicated from Tom Romano original https://aws.amazon.com/blogs/big-data/empower-your-jira-data-in-a-data-lake-with-amazon-appflow-and-aws-glue/

In the world of software engineering and development, organizations use project management tools like Atlassian Jira Cloud. Managing projects with Jira leads to rich datasets, which can provide historical and predictive insights about project and development efforts.

Although Jira Cloud provides reporting capability, loading this data into a data lake will facilitate enrichment with other business data, as well as support the use of business intelligence (BI) tools and artificial intelligence (AI) and machine learning (ML) applications. Companies often take a data lake approach to their analytics, bringing data from many different systems into one place to simplify how the analytics are done.

This post shows you how to use Amazon AppFlow and AWS Glue to create a fully automated data ingestion pipeline that will synchronize your Jira data into your data lake. Amazon AppFlow provides software as a service (SaaS) integration with Jira Cloud to load the data into your AWS account. AWS Glue is a serverless data discovery, load, and transformation service that will prepare data for consumption in BI and AI/ML activities. Additionally, this post strives for a low-code, serverless solution for operational efficiency, and the solution supports incremental loading for cost optimization.

Solution overview

This solution uses Amazon AppFlow to retrieve data from the Jira Cloud. The data is synchronized to an Amazon Simple Storage Service (Amazon S3) bucket using an initial full download and subsequent incremental downloads of changes. When new data arrives in the S3 bucket, an AWS Step Functions workflow is triggered that orchestrates extract, transform, and load (ETL) activities using AWS Glue crawlers and AWS Glue DataBrew. The data is then available in the AWS Glue Data Catalog and can be queried by services such as Amazon Athena, Amazon QuickSight, and Amazon Redshift Spectrum. The solution is completely automated and serverless, resulting in low operational overhead. When this setup is complete, your Jira data will be automatically ingested and kept up to date in your data lake!

The following diagram illustrates the solution architecture.

The Jira AppFlow architecture is shown. The Jira Cloud data is retrieved by Amazon AppFlow and stored in Amazon S3. This triggers an Amazon EventBridge event that runs an AWS Step Functions workflow. The workflow uses AWS Glue to catalog and transform the data, and the data is then queried with QuickSight.

The Step Functions workflow orchestrates the following ETL activities, resulting in two tables:

  • An AWS Glue crawler collects all downloads into a single AWS Glue table named jira_raw. This table is composed of a mix of full and incremental downloads from Jira, with many versions of the same records representing changes over time.
  • A DataBrew job prepares the data for reporting by unpacking key-value pairs in the fields, as well as removing superseded records as they are updated in subsequent change data captures. This reporting-ready data will be available in an AWS Glue table named jira_data.

The following figure shows the Step Functions workflow.

A diagram represents the AWS Step Functions workflow. It contains the steps to run an AWS Glue crawler, wait for its completion, and then run an AWS Glue DataBrew data transformation job.

Prerequisites

This solution requires the following:

  • Administrative access to your Jira Cloud instance, and an associated Jira Cloud developer account.
  • An AWS account and a login with access to the AWS Management Console. Your login will need AWS Identity and Access Management (IAM) permissions to create and access the resources in your AWS account.
  • Basic knowledge of AWS and working knowledge of Jira administration.

Configure the Jira Instance

After logging in to your Jira Cloud instance, you establish a Jira project with associated epics and issues to download into a data lake. If you’re starting with a new Jira instance, it helps to have at least one project with a sampling of epics and issues for the initial data download, because it allows you to create an initial dataset without errors or missing fields. Note that you may have multiple projects as well.

An image shows a Jira Cloud example, with several issues arranged in a Kanban board.

After you have established your Jira project and populated it with epics and issues, ensure you also have access to the Jira developer portal. In later steps, you use this developer portal to establish authentication and permissions for the Amazon AppFlow connection.

Provision resources with AWS CloudFormation

For the initial setup, you launch an AWS CloudFormation stack to create an S3 bucket to store data, IAM roles for data access, and the AWS Glue crawler and Data Catalog components. Complete the following steps:

  1. Sign in to your AWS account.
  2. Click Launch Stack:
  3. For Stack name, enter a name for the stack (the default is aws-blog-jira-datalake-with-AppFlow).
  4. For GlueDatabaseName, enter a unique name for the Data Catalog database to hold the Jira data table metadata (the default is jiralake).
  5. For InitialRunFlag, choose Setup. This mode will scan all data and disable the change data capture (CDC) features of the stack. (Because this is the initial load, the stack needs an initial data load before you configure CDC in later steps.)
  6. Under Capabilities and transforms, select the acknowledgement check boxes to allow IAM resources to be created within your AWS account.
  7. Review the parameters and choose Create stack to deploy the CloudFormation stack. This process will take around 5–10 minutes to complete.
    An image depicts the Amazon CloudFormation configuration steps, including setting a stack name, setting parameters to "jiralake" and "Setup" mode, and checking all IAM capabilities requested.
  8. After the stack is deployed, review the Outputs tab for the stack and collect the following values to use when you set up Amazon AppFlow:
    • Amazon AppFlow destination bucket (o01AppFlowBucket)
    • Amazon AppFlow destination bucket path (o02AppFlowPath)
    • Role for Amazon AppFlow Jira connector (o03AppFlowRole)
      An image demonstrating the Amazon Cloudformation "Outputs" tab, highlighting the values to add to the Amazon AppFlow configuration.

Configure Jira Cloud

Next, you configure your Jira Cloud instance for access by Amazon AppFlow. For full instructions, refer to Jira Cloud connector for Amazon AppFlow. The following steps summarize these instructions and discuss the specific configuration to enable OAuth in the Jira Cloud:

  1. Open the Jira developer portal.
  2. Create the OAuth 2 integration from the developer application console by choosing Create an OAuth 2.0 Integration. This will provide a login mechanism for AppFlow.
  3. Enable fine-grained permissions. See Recommended scopes for the permission settings to grant AppFlow appropriate access to your Jira instance.
  4. Add the following permission scopes to your OAuth app:
    1. manage:jira-configuration
    2. read:field-configuration:jira
  5. Under Authorization, set the Call Back URL to return to Amazon AppFlow with the URL https://us-east-1.console.aws.amazon.com/AppFlow/oauth.
  6. Under Settings, note the client ID and secret to use in later steps to set up authentication from Amazon AppFlow.

Create the Amazon AppFlow Jira Cloud connection

In this step, you configure Amazon AppFlow to run a one-time full data fetch of all your data, establishing the initial data lake:

  1. On the Amazon AppFlow console, choose Connectors in the navigation pane.
  2. Search for the Jira Cloud connector.
  3. Choose Create flow on the connector tile to create the connection to your Jira instance.
An image of Amazon AppFlow, showing the search for the "Jira Cloud" connector.
  4. For Flow name, enter a name for the flow (for example, JiraLakeFlow).
  5. Leave the Data encryption setting as the default.
  6. Choose Next.
    The Amazon AppFlow Jira connector configuration, showing the Flow name set to "JiraLakeFlow" and clicking the "next" button.
  7. For Source name, keep the default of Jira Cloud.
  8. Choose Create new connection under Jira Cloud connection.
  9. In the Connect to Jira Cloud section, enter the values for Client ID, Client secret, and Jira Cloud Site that you collected earlier. This provides the authentication from AppFlow to Jira Cloud.
  10. For Connection Name, enter a connection name (for example, JiraLakeCloudConnection).
  11. Choose Connect. You will be prompted to allow your OAuth app to access your Atlassian account to verify authentication.
An image of the Amazon AppFlow configuration, reflecting the completion of the prior steps.
  12. In the Authorize App window that pops up, choose Accept.
  13. With the connection created, return to the Configure flow section on the Amazon AppFlow console.
  14. For API version, choose V2 to use the latest Jira query API.
  15. For Jira Cloud object, choose Issue to query and download all issues and associated details.
    An image of the Amazon AppFlow configuration, reflecting the completion of the prior steps.
  16. For Destination Name in the Destination Details section, choose Amazon S3.
  17. For Bucket details, choose the S3 bucket name that matches the Amazon AppFlow destination bucket value that you collected from the outputs of the CloudFormation stack.
  18. Enter the Amazon AppFlow destination bucket path to complete the full S3 path. This will send the Jira data to the S3 bucket created by the CloudFormation script.
  19. Leave Catalog your data in the AWS Glue Data Catalog unselected. The CloudFormation script uses an AWS Glue crawler to update the Data Catalog in a different manner, grouping all the downloads into a common table, so we disable the update here.
  20. For File format settings, select Parquet format and select Preserve source data types in Parquet output. Parquet is a columnar format to optimize subsequent querying.
  21. Select Add a timestamp to the file name for Filename preference. This will allow you to easily find data files downloaded at a specific date and time.
    An image of the Amazon AppFlow configuration, reflecting the completion of the prior steps.
  22. For now, select Run on Demand for the Flow trigger to run the full load flow manually. You will schedule downloads in a later step when implementing CDC.
  23. Choose Next.
    An image of the Amazon AppFlow Flow Trigger configuration, reflecting the completion of the prior steps.
  24. On the Map data fields page, select Manually map fields.
  25. For Source to destination field mapping, choose the drop-down box under Source field name and select Map all fields directly. This will bring down all fields as they are received, because we will instead implement data preparation in later steps.
    An image of the Amazon AppFlow configuration, reflecting the completion of steps 24 & 25.
  26. Under Partition and aggregation settings, you can set up the partitions in a way that works for your use case. For this example, we use a daily partition, so select Date and time and choose Daily.
  27. For Aggregation settings, leave it as the default of Don’t aggregate.
  28. Choose Next.
    An image of the Amazon AppFlow configuration, reflecting the completion of steps 26-28.
  29. On the Add filters page, you can create filters to only download specific data. For this example, you download all the data, so choose Next.
  30. Review and choose Create flow.
  31. When the flow is created, choose Run flow to start the initial data seeding. After some time, you should receive a banner indicating the run finished successfully.
    An image of the Amazon AppFlow configuration, reflecting the completion of step 31.

Review seed data

At this stage in the process, you now have data in your S3 environment. When new data files are created in the S3 bucket, it will automatically run an AWS Glue crawler to catalog the new data. You can see if it’s complete by reviewing the Step Functions state machine for a Succeeded run status. There is a link to the state machine on the CloudFormation stack’s Resources tab, which will redirect you to the Step Functions state machine.

An image showing the CloudFormation Resources tab of the stack, with a link to the AWS Step Functions workflow.

When the state machine is complete, it’s time to review the raw Jira data with Athena. The database is as you specified in the CloudFormation stack (jiralake by default), and the table name is jira_raw. If you kept the default AWS Glue database name of jiralake, the Athena SQL is as follows:

SELECT * FROM "jiralake"."jira_raw" limit 10;

If you explore the data, you’ll notice that most of the data you would want to work with is actually packed into a column called fields. This means the data is not available as columns in your Athena queries, making it harder to select, filter, and sort individual fields within an Athena SQL query. This will be addressed in the next steps.

An image demonstrating the Amazon Athena query SELECT * FROM "jiralake"."jira_raw" limit 10;

Set up CDC and unpack the fields columns

To add the ongoing CDC and reformat the data for analytics, we introduce a DataBrew job to transform the data and filter to the most recent version of each record as changes come in. You can do this by updating the CloudFormation stack with a flag that includes the CDC and data transformation steps.

  1. On the AWS CloudFormation console, return to the stack.
  2. Choose Update.
  3. Select Use current template and choose Next.
    An image showing Amazon CloudFormation, with steps 1-3 complete.
  4. For SetupOrCDC, choose CDC, then choose Next. This will enable both the CDC steps and the data transformation steps for the Jira data.
    An image showing Amazon CloudFormation, with step 4 complete.
  5. Continue choosing Next until you reach the Review section.
  6. Select I acknowledge that AWS CloudFormation might create IAM resources, then choose Submit.
    An image showing Amazon CloudFormation, with step 5-6 complete.
  7. Return to the Amazon AppFlow console and open your flow.
  8. On the Actions menu, choose Edit flow. We will now edit the flow trigger to run an incremental load on a periodic basis.
  9. Select Run flow on schedule.
  10. Configure the desired repeats, as well as start time and date. For this example, we choose Daily for Repeats and enter 1 so that the flow runs every day. For Starting at, enter 01:00.
  11. Select Incremental transfer for Transfer mode.
  12. Choose Updated on the drop-down menu so that changes will be captured based on when the records were updated.
  13. Choose Save. With these settings in our example, the run will happen nightly at 1:00 AM.
    An image showing the Flow Trigger, with incremental transfer selected.

Review the analytics data

When the next incremental load occurs that results in new data, the Step Functions workflow will start the DataBrew job and populate a new staged analytical data table named jira_data in your Data Catalog database. If you don’t want to wait, you can trigger the Step Functions workflow manually.
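
One way to start the workflow manually outside of the Step Functions console is a short boto3 call; the state machine ARN below is a placeholder, and you can copy the real value from the link on the CloudFormation stack’s Resources tab.

import boto3

sfn = boto3.client("stepfunctions")

# Placeholder ARN: copy the real value from the CloudFormation stack's Resources tab.
response = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:111122223333:stateMachine:jira-datalake-workflow",
    input="{}",
)
print(response["executionArn"])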

The DataBrew job performs data transformation and filtering tasks. The job unpacks the key-values from the Jira JSON data and the raw Jira data, resulting in a tabular data schema that facilitates use with BI and AI/ML tools. As Jira items are changed, the changed item’s data is resent, resulting in multiple versions of an item in the raw data feed. The DataBrew job filters the raw data feed so that the resulting data table only contains the most recent version of each item. You could enhance this DataBrew job to further customize the data for your needs, such as renaming the generic Jira custom field names to reflect their business meaning.
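
Conceptually, the keep-only-the-latest-version filtering that the DataBrew job applies is similar to the following pandas sketch; the column names (issue key and updated timestamp) are hypothetical, and the real recipe operates on the unpacked Jira fields rather than a hand-built DataFrame.

import pandas as pd

# Hypothetical raw feed: multiple versions of the same Jira issue across data loads.
raw = pd.DataFrame(
    [
        {"issue_key": "PROJ-1", "status": "To Do", "updated": "2023-05-01T10:00:00"},
        {"issue_key": "PROJ-1", "status": "Done", "updated": "2023-05-03T16:30:00"},
        {"issue_key": "PROJ-2", "status": "In Progress", "updated": "2023-05-02T09:15:00"},
    ]
)

# Keep only the most recent version of each issue, mirroring the DataBrew filtering step.
latest = raw.sort_values("updated").drop_duplicates(subset="issue_key", keep="last")
print(latest)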

When the Step Functions workflow is complete, we can query the data in Athena again using the following query:

SELECT * FROM "jiralake"."jira_data" limit 10;

You can see that in our transformed jira_data table, the nested JSON fields are broken out into their own columns for each field. You will also notice that we’ve filtered out obsolete records that have been superseded by more recent record updates in later data loads so the data is fresh. If you want to rename custom fields, remove columns, or restructure what comes out of the nested JSON, you can modify the DataBrew recipe to accomplish this. At this point, the data is ready to be used by your analytics tools, such as Amazon QuickSight.

An image demonstrating the Amazon Athena query SELECT * FROM "jiralake"."jira_data" limit 10;

Clean up

If you would like to discontinue this solution, you can remove it with the following steps:

  1. On the Amazon AppFlow console, deactivate the flow for Jira, and optionally delete it.
  2. On the Amazon S3 console, select the S3 bucket for the stack, and empty the bucket to delete the existing data.
  3. On the AWS CloudFormation console, delete the CloudFormation stack that you deployed.

Conclusion

In this post, we created a serverless incremental data load process for Jira that will synchronize data while handling custom fields using Amazon AppFlow, AWS Glue, and Step Functions. The approach uses Amazon AppFlow to incrementally load the data into Amazon S3. We then use AWS Glue and Step Functions to manage the extraction of the Jira custom fields and load them in a format to be queried by analytics services such as Athena, QuickSight, or Redshift Spectrum, or AI/ML services like Amazon SageMaker.

To learn more about AWS Glue and DataBrew, refer to Getting started with AWS Glue DataBrew. With DataBrew, you can take the sample data transformation in this project and customize the output to meet your specific needs. This could include renaming columns, creating additional fields, and more.

To learn more about Amazon AppFlow, refer to Getting started with Amazon AppFlow. Note that Amazon AppFlow supports integrations with many SaaS applications in addition to the Jira Cloud.

To learn more about orchestrating flows with Step Functions, see Create a Serverless Workflow with AWS Step Functions and AWS Lambda. The workflow could be enhanced to load the data into a data warehouse, such as Amazon Redshift, or trigger a refresh of a QuickSight dataset for analytics and reporting.

In future posts, we will cover how to unnest parent-child relationships within the Jira data using Athena and how to visualize the data using QuickSight.


About the Authors

Tom Romano is a Sr. Solutions Architect for AWS World Wide Public Sector from Tampa, FL, and assists GovTech and EdTech customers as they create new solutions that are cloud native, event driven, and serverless. He is an enthusiastic Python programmer for both application development and data analytics, and is an Analytics Specialist. In his free time, Tom flies remote control model airplanes and enjoys vacationing with his family around Florida and the Caribbean.

Shane Thompson is a Sr. Solutions Architect based out of San Luis Obispo, California, working with AWS Startups. He works with customers who use AI/ML in their business model and is passionate about democratizing AI/ML so that all customers can benefit from it. In his free time, Shane loves to spend time with his family and travel around the world.

Perform continuous vulnerability scanning of AWS Lambda functions with Amazon Inspector

Post Syndicated from Manjunath Arakere original https://aws.amazon.com/blogs/security/perform-continuous-vulnerability-scanning-of-aws-lambda-functions-with-amazon-inspector/

This blog post demonstrates how you can activate Amazon Inspector within one or more AWS accounts and be notified when a vulnerability is detected in an AWS Lambda function.

Amazon Inspector is an automated vulnerability management service that continually scans workloads for software vulnerabilities and unintended network exposure. Amazon Inspector scans mixed workloads like Amazon Elastic Compute Cloud (Amazon EC2) instances and container images located in Amazon Elastic Container Registry (Amazon ECR). At re:Invent 2022, we announced Amazon Inspector support for Lambda functions and Lambda layers to provide a consolidated solution for compute types.

Only scanning your functions for vulnerabilities before deployment might not be enough since vulnerabilities can appear at any time, like the widespread Apache Log4j vulnerability. So it’s essential that workloads are continuously monitored and rescanned in near real time as new vulnerabilities are published or workloads are changed.

Amazon Inspector scans are intelligently initiated based on the updates to Lambda functions or when new Common Vulnerabilities and Exposures (CVEs) are published that are relevant to your function. No agents are needed for Amazon Inspector to work, which means you don’t need to install a library or agent in your Lambda functions or layers. When Amazon Inspector discovers a software vulnerability or network configuration issue, it creates a finding which describes the vulnerability, identifies the affected resource, rates the severity of the vulnerability, and provides remediation guidance.

In addition, Amazon Inspector integrates with several AWS services, such as Amazon EventBridge and AWS Security Hub. You can use EventBridge to build automation workflows like getting notified for a specific vulnerability finding or performing an automatic remediation with the help of Lambda or AWS Systems Manager.

In this blog post, you will learn how to do the following:

  1. Activate Amazon Inspector in a single AWS account and AWS Region.
  2. See how Amazon Inspector automated discovery and continuous vulnerability scanning works by deploying a new Lambda function with a vulnerable package dependency.
  3. Receive a near real-time notification when a vulnerability with a specific severity is detected in a Lambda function with the help of EventBridge and Amazon Simple Notification Service (Amazon SNS).
  4. Remediate the vulnerability by using the recommendation provided in the Amazon Inspector dashboard.
  5. Activate Amazon Inspector in multiple accounts or Regions through AWS Organizations.

Solution architecture

Figure 1 shows the AWS services used in the solution and how they are integrated.

Figure 1: Solution architecture overview

The workflow for the solution is as follows:

  1. Deploy a new Lambda function by using the AWS Serverless Application Model (AWS SAM); a minimal example of such a function is sketched after this list.
  2. Amazon Inspector scans when a new vulnerability is published or when an update to an existing Lambda function or a new Lambda function is deployed. Vulnerabilities are identified in the deployed Lambda function.
  3. Amazon EventBridge receives the events from Amazon Inspector and checks against the rules for specific events or filter conditions.
  4. In this case, an EventBridge rule exists for the Amazon Inspector findings, and the target is defined as an SNS topic to send an email to the system operations team.
  5. The EventBridge rule invokes the target SNS topic with the event data, and an email is sent to the confirmed subscribers in the SNS topic.
  6. The system operations team receives an email with detailed information on the vulnerability, the fixed package versions, the Amazon Inspector score to prioritize, and the impacted Lambda functions. By using the remediation information from Amazon Inspector, the team can now prioritize actions and remediate.
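
For step 1, the function deployed with AWS SAM can be any small Python function whose requirements.txt pins an intentionally outdated third-party package so that Amazon Inspector has something to find. The handler below is only a sketch; the library (requests) and the idea of pinning an old release are assumptions, and you would pick a package version with published CVEs from a CVE database.

# app.py: a minimal handler whose only purpose is to pull in a third-party dependency.
# requirements.txt (placeholder): pin an intentionally outdated release of the library,
# for example an old version with published CVEs, so Amazon Inspector reports findings.
import requests

def lambda_handler(event, context):
    response = requests.get("https://example.com", timeout=5)
    return {"statusCode": response.status_code}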

Prerequisites

To follow along with this demo, we recommend that you have the following in place:

  • An AWS account.
  • A command line interface: AWS CloudShell or AWS CLI. In this post, we recommend the use of CloudShell because it already has Python and AWS SAM. However, you can also use your CLI with AWS CLI, SAM, and Python.
  • An AWS Region where Amazon Inspector Lambda code scanning is available.
  • An IAM role in that account with administrator privileges.

The solution in this post includes the following AWS services: Amazon Inspector, AWS Lambda, Amazon EventBridge, AWS Identity and Access Management (IAM), Amazon SNS, AWS CloudShell, and AWS Organizations for activating Amazon Inspector at scale (across multiple accounts).

Step 1: Activate Amazon Inspector in a single account in the Region

The first step is to activate Amazon Inspector in your account in the Region you are using.

To activate Amazon Inspector

  1. Sign in to the AWS Management Console.
  2. Open AWS CloudShell. CloudShell inherits the credentials and permissions of the IAM principal who is signed in to the AWS Management Console. CloudShell comes with the CLIs and runtimes that are needed for this demo (AWS CLI, AWS SAM, and Python).
  3. Use the following command in CloudShell to get the status of the Amazon Inspector activation.
    aws inspector2 batch-get-account-status

  4. Use the following command to activate Inspector in the default Region for resource type LAMBDA. Other allowed values for resource types are EC2, ECR, and LAMBDA_CODE.
    aws inspector2 enable --resource-types '["LAMBDA"]'

  5. Use the following command to verify the status of the Amazon Inspector activation.
    aws inspector2 batch-get-account-status

You should see a response that shows that Amazon Inspector is enabled for Lambda resources, as shown in Figure 2.

Figure 2: Amazon Inspector status after you enable Lambda scanning

Step 2: Create an SNS topic and subscription for notification

Next, create the SNS topic and the subscription so that you will be notified of each new Amazon Inspector finding.

To create the SNS topic and subscription

  1. Use the following command in CloudShell to create the SNS topic and its subscription, replacing <REGION_NAME>, <AWS_ACCOUNTID>, and <EMAIL_ADDRESS> with the relevant values.
    aws sns create-topic --name amazon-inspector-findings-notifier; 
    
    aws sns subscribe \
    --topic-arn arn:aws:sns:<REGION_NAME>:<AWS_ACCOUNTID>:amazon-inspector-findings-notifier \
    --protocol email --notification-endpoint <[email protected]>

  2. Check the email inbox you entered for <[email protected]>, and in the email from Amazon SNS, choose Confirm subscription.
  3. In the CloudShell console, use the following command to list the subscriptions, to verify the topic and email subscription.
    aws sns list-subscriptions

    You should see a response that shows subscription details like the email address and ARN, as shown in Figure 3.

    Figure 3: Subscribed email address and SNS topic

  4. Use the following command to send a test message to your subscribed email address and verify that you receive it, replacing <REGION_NAME> and <AWS_ACCOUNTID> with the relevant values.
    aws sns publish \
        --topic-arn "arn:aws:sns:<REGION_NAME>:<AWS_ACCOUNTID>:amazon-inspector-findings-notifier" \
        --message "Hello from Amazon Inspector2"

Step 3: Set up Amazon EventBridge with a custom rule and the SNS topic as target

Create an EventBridge rule that will invoke your previously created SNS topic whenever Amazon Inspector finds a new vulnerability with a critical severity.

To set up the EventBridge custom rule

  1. In the CloudShell console, use the following command to create an EventBridge rule named amazon-inspector-findings with filters InspectorScore greater than 8 and severity state set to CRITICAL.
    aws events put-rule \
        --name "amazon-inspector-findings" \
        --event-pattern "{\"source\": [\"aws.inspector2\"],\"detail-type\": [\"Inspector2 Finding\"],\"detail\": {\"inspectorScore\": [ { \"numeric\": [ \">\", 8] } ],\"severity\": [\"CRITICAL\"]}}"

    Refer to the topic Amazon EventBridge event schema for Amazon Inspector events to customize the event pattern for your application needs.

  2. To verify the rule creation, go to the EventBridge console and in the left navigation bar, choose Rules.
  3. Choose the rule with the name amazon-inspector-findings. You should see the event pattern as shown in Figure 4.
    Figure 4: Event pattern for the EventBridge rule to filter on CRITICAL vulnerabilities.

  4. Add the SNS topic you previously created as the target to the EventBridge rule. Replace <REGION_NAME>, <AWS_ACCOUNTID>, and <RANDOM-UNIQUE-IDENTIFIER-VALUE> with the relevant values. For RANDOM-UNIQUE-IDENTIFIER-VALUE, create a memorable and unique string.
    aws events put-targets \
        --rule amazon-inspector-findings \
        --targets "Id"="<RANDOM-UNIQUE-IDENTIFIER-VALUE>","Arn"="arn:aws:sns:<REGION_NAME>:<AWS_ACCOUNTID>:amazon-inspector-findings-notifier"

    Important: Save the target ID. You will need this in order to delete the target in the last step.

  5. Provide permission for Amazon EventBridge to publish to the SNS topic amazon-inspector-findings-notifier. (Optional verification commands are sketched after this procedure.)
    aws sns set-topic-attributes --topic-arn "arn:aws:sns:<REGION_NAME>:<AWS_ACCOUNTID>:amazon-inspector-findings-notifier" \
    --attribute-name Policy \
    --attribute-value "{\"Version\":\"2012-10-17\",\"Id\":\"__default_policy_ID\",\"Statement\":[{\"Sid\":\"PublishEventsToMyTopic\",\"Effect\":\"Allow\",\"Principal\":{\"Service\":\"events.amazonaws.com\"},\"Action\":\"sns:Publish\",\"Resource\":\"arn:aws:sns:<REGION_NAME>:<AWS_ACCOUNTID>:amazon-inspector-findings-notifier\"}]}"
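
Optionally, you can confirm the wiring with two read-only checks before moving on: the first lists the targets attached to the rule, and the second returns the topic attributes (including the access policy you just set). These are sketches; replace the placeholders as before.

aws events list-targets-by-rule --rule amazon-inspector-findings

aws sns get-topic-attributes \
    --topic-arn arn:aws:sns:<REGION_NAME>:<AWS_ACCOUNTID>:amazon-inspector-findings-notifier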

Step 4: Deploy the Lambda function to the AWS account by using AWS SAM

In this step, you will use AWS Serverless Application Model (AWS SAM) quick start templates to build and deploy a Lambda function with a vulnerable library, in order to generate findings. Learn more about AWS SAM.

To deploy the Lambda function with a vulnerable library

  1. In the CloudShell console, use a prebuilt “hello-world” AWS SAM template to deploy the Lambda function.
    sam init --runtime python3.7 --dependency-manager pip --app-template hello-world --name sam-app

  2. Use the following command to add the vulnerable package python-jwt==3.3.3 to the Lambda function.
    cd sam-app;
    echo -e 'requests\npython-jwt==3.3.3' > hello_world/requirements.txt

  3. Use the following command to build the application.
    sam build

  4. Use the following command to deploy the application with the guided option.
    sam deploy --guided

    This command packages and deploys the application to your AWS account. It provides a series of prompts. Respond to the prompts as follows:

    1. Enter the stack name you want.
    2. Accept the default options, except for the following two prompts:
      1. At the HelloWorldFunction may not have authorization defined, Is this okay? [y/N]: prompt, enter y and press Enter.
      2. At the Deploy this changeset? [y/N]: prompt, enter y and press Enter.

    A non-interactive alternative is sketched after this list.
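
If you prefer to skip the interactive prompts (for example, when scripting this walkthrough), the following is a sketch of an equivalent non-guided deployment. The stack name sam-app is an assumption, and the flags shown are standard AWS SAM CLI options; adjust them to your environment.

sam deploy \
    --stack-name sam-app \
    --resolve-s3 \
    --capabilities CAPABILITY_IAM \
    --no-confirm-changeset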

Step 5: View Amazon Inspector findings

Amazon Inspector will automatically generate findings when scanning the Lambda function previously deployed. To view those findings, follow the steps below.

To view Amazon Inspector findings for the vulnerability

  1. Navigate to the Amazon Inspector console.
  2. In the left navigation menu, choose All findings to see all of the Active findings, as shown in Figure 5.

    Due to the custom event pattern rule in Amazon EventBridge, even though there are multiple findings for the vulnerable package python-jwt==3.3.3, you will be notified only for the finding that has InspectorScore greater than 8 and severity CRITICAL.

  3. Choose the title of each finding to see detailed information about the vulnerability. (A CLI alternative for listing these findings is sketched after this procedure.)
    Figure 5: Example of findings from the Amazon Inspector console
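
You can also pull the same findings from the command line. The following sketch assumes the Amazon Inspector filter-criteria format of comparison/value string filters, and lists active CRITICAL findings with their titles and scores.

aws inspector2 list-findings \
    --filter-criteria '{"findingStatus":[{"comparison":"EQUALS","value":"ACTIVE"}],"severity":[{"comparison":"EQUALS","value":"CRITICAL"}]}' \
    --query 'findings[].{title:title,score:inspectorScore}' \
    --output table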

Step 6: Remediate the vulnerability by applying the fixed package version

Now you can remediate the vulnerability by updating the package version as suggested by Amazon Inspector.

To remediate the vulnerability

  1. In the Amazon Inspector console, in the left navigation menu, choose All Findings.
  2. Choose the title of the vulnerability to see the finding details and the remediation recommendations.
    Figure 6: Amazon Inspector finding for python-jwt, with the associated remediation

  3. To remediate, use the following command to update the package version to the fixed version as suggested by Amazon Inspector.
    cd /home/cloudshell-user/sam-app;
    echo -e "requests\npython-jwt==3.3.4" > hello_world/requirements.txt

  4. Use the following command to build the application.
    sam build

  5. Use the following command to deploy the application with the guided option.
    sam deploy --guided

    This command packages and deploys the application to your AWS account. Respond to the prompts as you did in Step 4: enter the stack name you used before, accept the default options, and enter y at both the HelloWorldFunction may not have authorization defined, Is this okay? [y/N]: prompt and the Deploy this changeset? [y/N]: prompt.
  6. Amazon Inspector automatically rescans the function after its deployment and reevaluates the findings. At this point, you can navigate back to the Amazon Inspector console, and in the left navigation menu, choose All findings. In the Findings area, you can see that the vulnerabilities are moved from Active to Closed status.

    Due to the custom event pattern rule in Amazon EventBridge, you will be notified by email with finding status as CLOSED.

    Figure 7: Inspector rescan results, showing no open findings after remediation

(Optional) Step 7: Activate Amazon Inspector in multiple accounts and Regions

To benefit from Amazon Inspector scanning capabilities across the accounts that you have in AWS Organizations and in your selected Regions, use the following steps:

To activate Amazon Inspector in multiple accounts and Regions

  1. In the CloudShell console, use the following command to clone the code from the aws-samples inspector2-enablement-with-cli GitHub repo.
    cd /home/cloudshell-user;
    git clone https://github.com/aws-samples/inspector2-enablement-with-cli.git;
    cd inspector2-enablement-with-cli

  2. Follow the instructions from the README.md file.
  3. Configure the file param_inspector2.json with the relevant values, as follows:
    • inspector2_da: The delegated administrator account ID for Amazon Inspector to manage member accounts.
    • scanning_type: The resource types (EC2, ECR, LAMBDA) to be enabled by Amazon Inspector.
    • auto_enable: The resource types to be enabled on every account that is newly attached to the delegated administrator.
    • regions: Because Amazon Inspector is a regional service, provide the list of AWS Regions to enable.
  4. Select the AWS account that would be used as the delegated administrator account (<DA_ACCOUNT_ID>).
  5. Delegate an account as the admin for Amazon Inspector by using the following command.
    ./inspector2_enablement_with_awscli.sh -a delegate_admin -da <DA_ACCOUNT_ID>

  6. Activate the delegated admin by using the following command:
    ./inspector2_enablement_with_awscli.sh -a activate -t <DA_ACCOUNT_ID> -s all

  7. Associate the member accounts by using the following command:
    ./inspector2_enablement_with_awscli.sh -a associate -t members

  8. Wait five minutes.
  9. Enable the resource types (EC2, ECR, LAMBDA) on your member accounts by using the following command:
    ./inspector2_enablement_with_awscli.sh -a activate -t members

  10. Enable Amazon Inspector on the new member accounts that are associated with the organization by using the following command:
    ./inspector2_enablement_with_awscli.sh -auto_enable

  11. Check the Amazon Inspector status in your accounts and in multiple selected Regions by using the following command:
    ./inspector2_enablement_with_awscli.sh -a get_status

There are other options you can use to enable Amazon Inspector in multiple accounts, like AWS Control Tower and Terraform. For the reference architecture for Control Tower, see the AWS Security Reference Architecture Examples on GitHub. For more information on the Terraform option, see the Terraform aws_inspector2_enabler resource page.

Step 8: Delete the resources created in the previous steps

AWS offers a 15-day free trial for Amazon Inspector so that you can evaluate the service and estimate its cost.

To avoid potential charges, delete the AWS resources that you created in the previous steps of this solution (Lambda function, EventBridge target, EventBridge rule, and SNS topic), and deactivate Amazon Inspector.

To delete resources

  1. In the CloudShell console, enter the sam-app folder.
    cd /home/cloudshell-user/sam-app

  2. Delete the Lambda function and confirm by typing “y” when prompted for confirmation.
    sam delete

  3. Remove the SNS target from the Amazon EventBridge rule.
    aws events remove-targets --rule "amazon-inspector-findings" --ids <RANDOM-UNIQUE-IDENTIFIER-VALUE>

    Note: If you don’t remember the target ID, navigate to the Amazon EventBridge console, and in the left navigation menu, choose Rules. Select the rule that you want to delete. Choose CloudFormation, and copy the ID.

  4. Delete the EventBridge rule.
    aws events delete-rule --name amazon-inspector-findings

  5. Delete the SNS topic.
    aws sns delete-topic --topic-arn arn:aws:sns:<REGION_NAME>:<AWS_ACCOUNTID>:amazon-inspector-findings-notifier

  6. Disable Amazon Inspector.
    aws inspector2 disable --resource-types '["LAMBDA"]'

    Follow the next few steps to roll back changes only if you have performed the activities listed in Step 7: Activate Amazon Inspector in multiple accounts and Regions.

  7. In the CloudShell console, enter the folder inspector2-enablement-with-cli.
    cd /home/cloudshell-user/inspector2-enablement-with-cli

  8. Deactivate the resource types (EC2, ECR, LAMBDA) on your member accounts.
    ./inspector2_enablement_with_awscli.sh -a deactivate -t members -s all

  9. Disassociate the member accounts.
    ./inspector2_enablement_with_awscli.sh -a disassociate -t members

  10. Deactivate the delegated admin account.
    ./inspector2_enablement_with_awscli.sh -a deactivate -t <DA_ACCOUNT_ID> -s all

  11. Remove the delegated account as the admin for Amazon Inspector.
    ./inspector2_enablement_with_awscli.sh -a remove_admin -da <DA_ACCOUNT_ID>

Conclusion

In this blog post, we discussed how you can use Amazon Inspector to continuously scan your Lambda functions, and how to configure an Amazon EventBridge rule and SNS to send out notification of Lambda function vulnerabilities in near real time. You can then perform remediation activities by using AWS Lambda or AWS Systems Manager. We also showed how to enable Amazon Inspector at scale, activating in both single and multiple accounts, in default and multiple Regions.

As of the writing of this post, a new feature to perform code scans for Lambda functions is available. Amazon Inspector can now also scan the custom application code within a Lambda function for code security vulnerabilities such as injection flaws, data leaks, weak cryptography, or missing encryption, based on AWS security best practices. You can use this additional scanning functionality to further protect your workloads.

If you have feedback about this blog post, submit comments in the Comments section below. If you have questions about this blog post, start a new thread on the Amazon Inspector forum or contact AWS Support.

 
Want more AWS Security news? Follow us on Twitter.

Manjunath Arakere

Manjunath is a Senior Solutions Architect in the Worldwide Public Sector team at AWS. He works with Public Sector partners to design and scale well-architected solutions, and he supports their cloud migrations and application modernization initiatives. Manjunath specializes in migration, modernization and serverless technology.

Stéphanie Mbappe

Stéphanie is a Security Consultant with Amazon Web Services. She delights in assisting her customers at every step of their security journey. Stéphanie enjoys learning, designing new solutions, and sharing her knowledge with others.

Migrate your existing SQL-based ETL workload to an AWS serverless ETL infrastructure using AWS Glue

Post Syndicated from Mitesh Patel original https://aws.amazon.com/blogs/big-data/migrate-your-existing-sql-based-etl-workload-to-an-aws-serverless-etl-infrastructure-using-aws-glue/

Data has become an integral part of most companies, and the complexity of data processing is increasing rapidly with the exponential growth in the amount and variety of data. Data engineering teams are faced with the following challenges:

  • Manipulating data to make it consumable by business users
  • Building and improving extract, transform, and load (ETL) pipelines
  • Scaling their ETL infrastructure

Many customers migrating data to the cloud are looking for ways to modernize by using native AWS services to further scale and efficiently handle ETL tasks. In the early stages of their cloud journey, customers may need guidance on modernizing their ETL workload with minimal effort and time. Customers often use many SQL scripts to select and transform the data in relational databases hosted either in an on-premises environment or on AWS and use custom workflows to manage their ETL.

AWS Glue is a serverless data integration and ETL service with the ability to scale on demand. In this post, we show how you can migrate your existing SQL-based ETL workload to AWS Glue using Spark SQL, which minimizes the refactoring effort.

Solution overview

The following diagram describes the high-level architecture for our solution. This solution decouples the ETL and analytics workloads from our transactional data source Amazon Aurora, and uses Amazon Redshift as the data warehouse solution to build a data mart. In this solution, we employ AWS Database Migration Service (AWS DMS) for both full load and continuous replication of changes from Aurora. AWS DMS enables us to capture deltas, including deletes from the source database, through its Change Data Capture (CDC) configuration, without writing code and without missing any changes, which is critical for the integrity of the data. Refer to CDC support in AWS DMS to extend the solution for ongoing CDC.

The workflow includes the following steps:

  1. AWS Database Migration Service (AWS DMS) connects to the Aurora data source.
  2. AWS DMS replicates data from Aurora and migrates to the target destination Amazon Simple Storage Service (Amazon S3) bucket.
  3. AWS Glue crawlers automatically infer schema information of the S3 data and integrate into the AWS Glue Data Catalog.
  4. AWS Glue jobs run ETL code to transform and load the data to Amazon Redshift.

For this post, we use the TPCH dataset for sample transactional data. The components of TPCH consist of eight tables. The relationships between columns in these tables are illustrated in the following diagram.

We use Amazon Redshift as the data warehouse to implement the data mart solution. The data mart fact and dimension tables are created in the Amazon Redshift database. The following diagram illustrates the relationships between the fact (ORDER) and dimension tables (DATE, PARTS, and REGION).

Set up the environment

To get started, we set up the environment using AWS CloudFormation. Complete the following steps:

  1. Sign in to the AWS Management Console with your AWS Identity and Access Management (IAM) user name and password.
  2. Choose Launch Stack and open the page on a new tab:
  3. Choose Next.
  4. For Stack name, enter a name.
  5. In the Parameters section, enter the required parameters.
  6. Choose Next.

  1. On the Configure stack options page, leave all values as default and choose Next.
  2. On the Review stack page, select the check boxes to acknowledge the creation of IAM resources.
  3. Choose Submit.

Wait for the stack creation to complete. You can examine various events from the stack creation process on the Events tab. When the stack creation is complete, you will see the status CREATE_COMPLETE. The stack takes approximately 25–30 minutes to complete.

This template configures the following resources:

  • The Aurora MySQL instance sales-db.
  • The AWS DMS task dmsreplicationtask-* for full load of data and replicating changes from Aurora (source) to Amazon S3 (destination).
  • AWS Glue crawlers s3-crawler and redshift_crawler.
  • The AWS Glue database salesdb.
  • AWS Glue jobs insert_region_dim_tbl, insert_parts_dim_tbl, and insert_date_dim_tbl. We use these jobs for the use cases covered in this post. We create the insert_orders_fact_tbl AWS Glue job manually using AWS Glue Studio.
  • The Redshift cluster blog_cluster with database sales and fact and dimension tables.
  • An S3 bucket to store the output of the AWS Glue job runs.
  • IAM roles and policies with appropriate permissions.

Replicate data from Aurora to Amazon S3

Now let’s look at the steps to replicate data from Aurora to Amazon S3 using AWS DMS:

  1. On the AWS DMS console, choose Database migration tasks in the navigation pane.
  2. Select the task dmsreplicationtask-* and on the Action menu, choose Restart/Resume.

This will start the replication task to replicate the data from Aurora to the S3 bucket. Wait for the task status to change to Full Load Complete. The data from the Aurora tables is now copied to the S3 bucket under a new folder, sales.
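
If you would rather poll the task status from the command line than watch the console, the following sketch (assuming the default response shape of describe-replication-tasks) filters on the task name prefix created by the CloudFormation template and prints each matching task with its status.

aws dms describe-replication-tasks \
    --query 'ReplicationTasks[?starts_with(ReplicationTaskIdentifier, `dmsreplicationtask`)].{task:ReplicationTaskIdentifier,status:Status}' \
    --output table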

Create AWS Glue Data Catalog tables

Now let’s create AWS Glue Data Catalog tables for the S3 data and Amazon Redshift tables:

  1. On the AWS Glue console, under Data Catalog in the navigation pane, choose Connections.
  2. Select RedshiftConnection and on the Actions menu, choose Edit.
  3. Choose Save changes.
  4. Select the connection again and on the Actions menu, choose Test connection.
  5. For IAM role, choose GlueBlogRole.
  6. Choose Confirm.

Testing the connection can take approximately 1 minute. You will see the message “Successfully connected to the data store with connection blog-redshift-connection.” If you have trouble connecting successfully, refer to Troubleshooting connection issues in AWS Glue.

  1. Under Data Catalog in the navigation pane, choose Crawlers.
  2. Select s3_crawler and choose Run.

This will generate eight tables in the AWS Glue Data Catalog. To view the tables created, in the navigation pane, choose Databases under Data Catalog, then choose salesdb.

  1. Repeat the steps to run redshift_crawler and generate four additional tables.

If the crawler fails, refer to Error: Running crawler failed.
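
The crawlers can also be run and monitored from the command line. The following is a sketch; the crawler name s3_crawler and database name salesdb come from the walkthrough above (the CloudFormation description also refers to s3-crawler, so use the name that appears in your account).

aws glue start-crawler --name s3_crawler

aws glue get-crawler --name s3_crawler --query 'Crawler.State' --output text

aws glue get-tables --database-name salesdb --query 'TableList[].Name' --output table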

Create SQL-based AWS Glue jobs

Now let’s look at how the SQL statements are used to create ETL jobs using AWS Glue. AWS Glue runs your ETL jobs in an Apache Spark serverless environment. AWS Glue runs these jobs on virtual resources that it provisions and manages in its own service account. AWS Glue Studio is a graphical interface that makes it simple to create, run, and monitor ETL jobs in AWS Glue. You can use AWS Glue Studio to create jobs that extract structured or semi-structured data from a data source, perform a transformation of that data, and save the result set in a data target.

Let’s go through the steps of creating an AWS Glue job for loading the orders fact table using AWS Glue Studio.

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose Create job.
  3. Select Visual with a blank canvas, then choose Create.

  1. Navigate to the Job details tab.
  2. For Name, enter insert_orders_fact_tbl.
  3. For IAM Role, choose GlueBlogRole.
  4. For Job bookmark, choose Enable.
  5. Leave all other parameters as default and choose Save.

  1. Navigate to the Visual tab.
  2. Choose the plus sign.
  3. Under Add nodes, enter Glue in the search bar and choose AWS Glue Data Catalog (Source) to add the Data Catalog as the source.

  1. In the right pane, on the Data source properties – Data Catalog tab, choose salesdb for Database and customer for Table.

  1. On the Node properties tab, for Name, enter Customers.

  1. Repeat these steps for the Orders and LineItem tables.

This concludes creating data sources on the AWS Glue job canvas. Next, we add transformations by combining data from these different tables.

Transform the data

Complete the following steps to add data transformations:

  1. On the AWS Glue job canvas, choose the plus sign.
  2. Under Transforms, choose SQL Query.
  3. On the Transform tab, for Node parents, select all the three data sources.
  4. On the Transform tab, under SQL query, enter the following query:
SELECT orders.o_orderkey        AS ORDERKEY,
orders.o_orderdate       AS ORDERDATE,
lineitem.l_linenumber    AS LINENUMBER,
lineitem.l_partkey       AS PARTKEY,
lineitem.l_receiptdate   AS RECEIPTDATE,
lineitem.l_quantity      AS QUANTITY,
lineitem.l_extendedprice AS EXTENDEDPRICE,
orders.o_custkey         AS CUSTKEY,
customer.c_nationkey     AS NATIONKEY,
CURRENT_TIMESTAMP        AS UPDATEDATE
FROM   orders orders,
lineitem lineitem,
customer customer
WHERE  orders.o_orderkey = lineitem.l_orderkey
AND orders.o_custkey = customer.c_custkey
  1. Update the SQL alias values as shown in the following screenshot.

  1. On the Data preview tab, choose Start data preview session.
  2. When prompted, choose GlueBlogRole for IAM role and choose Confirm.

The data preview process will take a minute to complete.

  1. On the Output schema tab, choose Use data preview schema.

You will see the output schema similar to the following screenshot.

Now that we have previewed the data, we change a few data types.

  1. On the AWS Glue job canvas, choose the plus sign.
  2. Under Transforms, choose Change Schema.
  3. Select the node.
  4. On the Transform tab, update the Data type values as shown in the following screenshot.

Now let’s add the target node.

  1. Choose the Change Schema node and choose the plus sign.
  2. In the search bar, enter target.
  3. Choose Amazon Redshift as the target.

  1. Choose the Amazon Redshift node, and on the Data target properties – Amazon Redshift tab, for Redshift access type, select Direct data connection.
  2. Choose RedshiftConnection for Redshift Connection, public for Schema, and order_table for Table.
  3. Select Merge data into target table under Handling of data and target table.
  4. Choose orderkey for Matching keys.

  1. Choose Save.

AWS Glue Studio automatically generates the Spark code for you. You can view it on the Script tab. If you would like to perform transformations beyond the built-in ones, you can modify the Spark code. The AWS Glue job uses an Apache Spark SQL query for the SQL query transformation. To find the available Spark SQL transformations, refer to the Spark SQL documentation.

  1. Choose Run to run the job.

As part of the CloudFormation stack, three other jobs are created to load the dimension tables.

  1. Navigate back to the Jobs page on the AWS Glue console, select the job insert_parts_dim_tbl, and choose Run.

This job uses the following SQL to populate the parts dimension table:

SELECT part.p_partkey,
part.p_type,
part.p_brand
FROM   part part
  1. Select the job insert_region_dim_tbl and choose Run.

This job uses the following SQL to populate the region dimension table:

SELECT nation.n_nationkey,
nation.n_name,
region.r_name
FROM   nation,
region
WHERE  nation.n_regionkey = region.r_regionkey
  1. Select the job insert_date_dim_tbl and choose Run.

This job uses the following SQL to populate the date dimension table:

SELECT DISTINCT( l_receiptdate )        AS DATEKEY,
Dayofweek(l_receiptdate) AS DAYOFWEEK,
Month(l_receiptdate)     AS MONTH,
Year(l_receiptdate)      AS YEAR,
Day(l_receiptdate)       AS DATE
FROM   lineitem lineitem

You can view the status of the running jobs by navigating to the Job run monitoring section on the Jobs page. Wait for all the jobs to complete. These jobs will load the data into the facts and dimension tables in Amazon Redshift.
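
The jobs can also be started and monitored from the command line. The following sketch starts the insert_parts_dim_tbl job created by the CloudFormation template and then polls its run state; the same pattern applies to the other jobs.

RUN_ID=$(aws glue start-job-run --job-name insert_parts_dim_tbl --query 'JobRunId' --output text)

aws glue get-job-run --job-name insert_parts_dim_tbl --run-id "$RUN_ID" \
    --query 'JobRun.JobRunState' --output text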

To help optimize the resources and cost, you can use the AWS Glue Auto Scaling feature.

Verify the Amazon Redshift data load

To verify the data load, complete the following steps:

  1. On the Amazon Redshift console, select the cluster blog-cluster and on the Query Data menu, choose Query in query editor 2.
  2. For Authentication, select Temporary credentials.
  3. For Database, enter sales.
  4. For User name, enter admin.
  5. Choose Save.

  1. Run the following commands in the query editor to verify that the data is loaded into the Amazon Redshift tables:
SELECT *
FROM   sales.PUBLIC.order_table;

SELECT *
FROM   sales.PUBLIC.date_table;

SELECT *
FROM   sales.PUBLIC.parts_table;

SELECT *
FROM   sales.PUBLIC.region_table;

The following screenshot shows the results from one of the SELECT queries.

Now, to demonstrate CDC, update the quantity of a line item for order number 1 in the Aurora database using the following query. (To connect to your Aurora cluster, use AWS Cloud9 or any SQL client tool, such as the MySQL command-line client.)

UPDATE lineitem SET l_quantity = 100 WHERE l_orderkey = 1 AND l_linenumber = 4;

AWS DMS replicates the change to the S3 bucket, as shown in the following screenshot.

Re-running the AWS Glue job insert_orders_fact_tbl applies the change to the ORDER fact table, as shown in the following screenshot.

Clean up

To avoid incurring future charges, delete the resources created for the solution:

  1. On the Amazon S3 console, select the S3 bucket created as part of the CloudFormation stack, then choose Empty.
  2. On the AWS CloudFormation console, select the stack that you created initially and choose Delete to delete all the resources created by the stack.

Conclusion

In this post, we showed how you can migrate existing SQL-based ETL to an AWS serverless ETL infrastructure using AWS Glue jobs. We used AWS DMS to migrate data from Aurora to an S3 bucket, then SQL-based AWS Glue jobs to move the data to fact and dimension tables in Amazon Redshift.

This solution demonstrates a one-time data load from Aurora to Amazon Redshift using AWS Glue jobs. You can extend this solution for moving the data on a scheduled basis by orchestrating and scheduling jobs using AWS Glue workflows. To learn more about the capabilities of AWS Glue, refer to AWS Glue.


About the Authors

Mitesh Patel is a Principal Solutions Architect at AWS with specialization in data analytics and machine learning. He is passionate about helping customers build scalable, secure, and cost-effective cloud-native solutions in AWS to drive business growth. He lives in the DC Metro area with his wife and two kids.

Sumitha AP is a Sr. Solutions Architect at AWS. She works with customers and helps them attain their business objectives by designing secure, scalable, reliable, and cost-effective solutions in the AWS Cloud. She has a focus on data and analytics and provides guidance on building analytics solutions on AWS.

Deepti Venuturumilli is a Sr. Solutions Architect in AWS. She works with commercial segment customers and AWS partners to accelerate customers’ business outcomes by providing expertise in AWS services and modernizing their workloads. She focuses on data analytics workloads and setting up modern data strategy on AWS.

Deepthi Paruchuri is an AWS Solutions Architect based in NYC. She works closely with customers to build cloud adoption strategy and solve their business needs by designing secure, scalable, and cost-effective solutions in the AWS cloud.

Deploy serverless applications in a multicloud environment using Amazon CodeCatalyst

Post Syndicated from Deepak Kovvuri original https://aws.amazon.com/blogs/devops/deploy-serverless-applications-in-a-multicloud-environment-using-amazon-codecatalyst/

Amazon CodeCatalyst is an integrated service for software development teams adopting continuous integration and deployment practices into their software development process. CodeCatalyst puts the tools you need all in one place. You can plan work, collaborate on code, and build, test, and deploy applications by leveraging CodeCatalyst Workflows.

Introduction

In the first post of the blog series, we showed you how organizations can deploy workloads to instances and virtual machines (VMs) across hybrid and multicloud environments. The second post of the series covered deploying containerized applications in a multicloud environment. Finally, in this post, we explore how organizations can deploy modern, cloud-native, serverless applications across multiple cloud platforms. Figure 1 shows the solution that we walk through in this post.

Figure 1 – Architecture diagram

The post walks through how to develop, deploy, and test an HTTP RESTful API on Azure Functions using Amazon CodeCatalyst. The solution covers the following steps:

  • Set up CodeCatalyst development environment and develop your application using the Serverless Framework.
  • Build a CodeCatalyst workflow to test and then deploy to Azure Functions using GitHub Actions in Amazon CodeCatalyst.

An Amazon CodeCatalyst workflow is an automated procedure that describes how to build, test, and deploy your code as part of a continuous integration and continuous delivery (CI/CD) system. You can use GitHub Actions alongside native CodeCatalyst actions in a CodeCatalyst workflow.

Pre-requisites

Walkthrough

In this post, we will create a hello world RESTful API using the Serverless Framework. As we progress through the solution, we will focus on building a CodeCatalyst workflow that deploys and tests the functionality of the application. At the end of the post, the workflow will look similar to the one shown in Figure 2.


Figure 2 – CodeCatalyst CI/CD workflow

Environment Setup

Before we start developing the application, we need to set up a CodeCatalyst project and then link a code repository to the project. The code repository can be a CodeCatalyst repository or GitHub. In this scenario, we use a GitHub repository. Once the solution is developed, the repository should look as shown in Figure 3.


Figure 3 – Files in GitHub repository

In Amazon CodeCatalyst, there’s an option to create Dev Environments, which can be used to work on the code stored in the source repositories of a project. In this post, we create a Dev Environment, associate it with the source repository created above, and work from it. Alternatively, you can skip the Dev Environment, run the following commands locally, and commit the results to the repository. The /projects directory of a Dev Environment stores the files that are pulled from the source repository. In the Dev Environment, install the Serverless Framework using this command:

npm install -g serverless

and then initialize a Serverless Framework project in the source repository folder so that its structure looks like the following (a possible scaffolding command is sketched after the listing):

├── README.md
├── host.json
├── package.json
├── serverless.yml
└── src
    └── handlers
        ├── goodbye.js
        └── hello.js
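
The command used to scaffold this project isn't shown above. One possible scaffolding step, sketched under the assumption that the Serverless Framework's Node.js Azure template is used, is the following; you would still adjust the generated serverless.yml and handlers to match the structure shown.

# Hypothetical scaffolding step (run inside the repository folder);
# the azure-nodejs template is an assumption -- adjust serverless.yml and the handlers afterward.
serverless create --template azure-nodejs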

We can push the code to the CodeCatalyst project using git. Now that we have the code in CodeCatalyst, we can turn our focus to building the workflow using the CodeCatalyst console.

CI/CD Setup in CodeCatalyst

Configure access to the Azure Environment

We’ll use the GitHub action for Serverless to create and manage the Azure Function. For the action to be able to access the Azure environment, it requires credentials associated with a Service Principal, passed to the action as environment variables.

Service Principals in Azure are identified by the CLIENT_ID, CLIENT_SECRET, SUBSCRIPTION_ID, and TENANT_ID properties. Storing these values in plaintext anywhere in your repository should be avoided, because anyone with access to the repository can see them. Similarly, these values shouldn’t be used directly in any workflow definitions, because they will be visible as files in your repository. With CodeCatalyst, we can protect these values by storing them as secrets within the project, and then reference the secrets in the CI/CD workflow.

We can create a secret by choosing Secrets (1) under CI/CD and then selecting Create Secret (2), as shown in Figure 4. Now, we can enter the secret name and value for each of the identifiers described above.

Figure 4 – CodeCatalyst Secrets

Building the workflow

To create a new workflow, select CI/CD from navigation on the left and then select Workflows (1). Then, select Create workflow (2), leave the default options, and select Create (3) as shown in Figure 5.


Figure 5 – Create CI/CD workflow

If the workflow editor opens in YAML mode, select Visual to open the visual designer. Now, we can start adding actions to the workflow.

Configure the Deploy action

We’ll begin by adding a GitHub action for deploying to Azure. Select “+ Actions” to open the actions list and choose GitHub from the dropdown menu. Find the Build action and click “+” to add a new GitHub action to the workflow.

Next, configure the GitHub action from the configurations tab by adding the following snippet to the GitHub Actions YAML property:

- name: Deploy to Azure Functions
  uses: serverless/[email protected]
  with:
    args: -c "serverless plugin install --name serverless-azure-functions && serverless deploy"
    entrypoint: /bin/sh
  env:
    AZURE_SUBSCRIPTION_ID: ${Secrets.SUBSCRIPTION_ID}
    AZURE_TENANT_ID: ${Secrets.TENANT_ID}
    AZURE_CLIENT_ID: ${Secrets.CLIENT_ID}
    AZURE_CLIENT_SECRET: ${Secrets.CLIENT_SECRET}

The above workflow configuration makes use of the Serverless GitHub Action, which wraps the Serverless Framework to run serverless commands. The action is configured to package and deploy the source code to Azure Functions using the serverless deploy command.

Please note how we were able to pass the secrets to GitHub action by referencing the secret identifiers in the above configuration.

Configure the Test action

Similar to the previous step, we add another GitHub action which will use the serverless framework’s serverless invoke command to test the API deployed on to Azure Functions.

- name: Test Function
  uses: serverless/[email protected]
  with:
    args: |
      -c "serverless plugin install --name serverless-azure-functions && \
          serverless invoke -f hello -d '{\"name\": \"CodeCatalyst\"}' && \
          serverless invoke -f goodbye -d '{\"name\": \"CodeCatalyst\"}'"
    entrypoint: /bin/sh
  env:
    AZURE_SUBSCRIPTION_ID: ${Secrets.SUBSCRIPTION_ID}
    AZURE_TENANT_ID: ${Secrets.TENANT_ID}
    AZURE_CLIENT_ID: ${Secrets.CLIENT_ID}
    AZURE_CLIENT_SECRET: ${Secrets.CLIENT_SECRET}

The workflow is now ready and can be validated by choosing ‘Validate’ and then saved to the repository by choosing ‘Commit’. The workflow should automatically kick off after the commit, and the application is automatically deployed to Azure Functions.

The functionality of the API can now be verified from the logs of the test action of the workflow as shown in Figure 6.


Figure 6 – CI/CD workflow Test action

Cleanup

If you have been following along with this workflow, you should delete the resources you deployed so you do not continue to incur charges. First, delete the Azure Function App (usually prefixed ‘sls’) using the Azure console. Second, delete the project from CodeCatalyst by navigating to Project settings and choosing Delete project. There’s no cost associated with the CodeCatalyst project and you can continue using it.

Conclusion

In summary, this post highlighted how Amazon CodeCatalyst can help organizations deploy cloud-native, serverless workloads into a multicloud environment. The post also walked through the solution in detail, covering the process of setting up Amazon CodeCatalyst to deploy a serverless application to Azure Functions by leveraging GitHub Actions. Though we showed an application deployment to Azure Functions, you can follow a similar process and leverage CodeCatalyst to deploy any type of application to almost any cloud platform. Learn more and get started with your Amazon CodeCatalyst journey!

We would love to hear your thoughts and experiences on deploying serverless applications to multiple cloud platforms. Reach out to us if you have any questions, or provide your feedback in the comments section.

About Authors

Picture of Deepak

Deepak Kovvuri

Deepak Kovvuri is a Senior Solutions Architect supporting Enterprise Customers at AWS in the US East area. He has over 6 years of experience in helping customers architect a DevOps strategy for their cloud workloads. Deepak specializes in CI/CD, Systems Administration, Infrastructure as Code, and Container Services. He holds a Master's in Computer Engineering from University of Illinois at Chicago.

Picture of Amandeep

Amandeep Bajwa

Amandeep Bajwa is a Senior Solutions Architect at AWS supporting Financial Services enterprises. He helps organizations achieve their business outcomes by identifying the appropriate cloud transformation strategy based on industry trends, and organizational priorities. Some of the areas Amandeep consults on are cloud migration, cloud strategy (including hybrid & multicloud), digital transformation, data & analytics, and technology in general.

Picture of Brian

Brian Beach

Brian Beach has over 20 years of experience as a Developer and Architect. He is currently a Principal Solutions Architect at Amazon Web Services. He holds a Computer Engineering degree from NYU Poly and an MBA from Rutgers Business School. He is the author of “Pro PowerShell for Amazon Web Services” from Apress. He is a regular author and has spoken at numerous events. Brian lives in North Carolina with his wife and three kids.

Picture of Pawan

Pawan Shrivastava

Pawan Shrivastava is a Partner Solution Architect at AWS in the WWPS team. He focusses on working with partners to provide technical guidance on AWS, collaborate with them to understand their technical requirements, and designing solutions to meet their specific needs. Pawan is passionate about DevOps, automation and CI CD pipelines. He enjoys watching MMA, playing cricket and working out in the Gym.

Deploy container applications in a multicloud environment using Amazon CodeCatalyst

Post Syndicated from Pawan Shrivastava original https://aws.amazon.com/blogs/devops/deploy-container-applications-in-a-multicloud-environment-using-amazon-codecatalyst/

In the previous post of this blog series, we saw how organizations can deploy workloads to virtual machines (VMs) in a hybrid and multicloud environment. This post shows how organizations can address the requirement of deploying containers, and containerized applications to hybrid and multicloud platforms using Amazon CodeCatalyst. CodeCatalyst is an integrated DevOps service which enables development teams to collaborate on code, and build, test, and deploy applications with continuous integration and continuous delivery (CI/CD) tools.

One prominent scenario where multicloud container deployment is useful is when organizations want to leverage AWS’ broadest and deepest set of Artificial Intelligence (AI) and Machine Learning (ML) capabilities by developing and training AI/ML models in AWS using Amazon SageMaker, and deploying the model package to a Kubernetes platform on other cloud platforms, such as Azure Kubernetes Service (AKS) for inference. As shown in this workshop for operationalizing the machine learning pipeline, we can train an AI/ML model, push it to Amazon Elastic Container Registry (ECR) as an image, and later deploy the model as a container application.

Scenario description

The solution described in the post covers the following steps:

  • Set up the Amazon CodeCatalyst environment.
  • Create a Dockerfile along with a manifest for the application, and a repository in Amazon ECR.
  • Create an Azure service principal that has permissions to deploy resources to Azure Kubernetes Service (AKS), and store the credentials securely in an Amazon CodeCatalyst secret.
  • Create a CodeCatalyst workflow to build, test, and deploy the containerized application to the AKS cluster using GitHub Actions.

The architecture diagram for the scenario is shown in Figure 1.


Figure 1 – Solution Architecture

Solution Walkthrough

This section shows how to set up the environment, and deploy a HTML application to an AKS cluster.

Set up Amazon ECR and the GitHub code repository

Create a new Amazon ECR repository and a code repository. In this case we’re using GitHub as the repository, but you can create a source repository in CodeCatalyst, or you can link an existing source repository hosted by another service if that service is supported by an installed extension. Then follow the application and Docker image creation steps outlined in Step 1 of the environment creation process in Exposing Multiple Applications on Amazon EKS. Create a file named manifest.yaml as shown, and map the image parameter to the URL of the Amazon ECR repository created above.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: multicloud-container-deployment-app
  labels:
    app: multicloud-container-deployment-app
spec:
  selector:
    matchLabels:
      app: multicloud-container-deployment-app
  replicas: 2
  template:
    metadata:
      labels:
        app: multicloud-container-deployment-app
    spec:
      nodeSelector:
        "beta.kubernetes.io/os": linux
      containers:
      - name: ecs-web-page-container
        image: <aws_account_id>.dkr.ecr.us-west-2.amazonaws.com/<my_repository>
        imagePullPolicy: Always
        ports:
            - containerPort: 80
        resources:
          limits:
            memory: "100Mi"
            cpu: "200m"
      imagePullSecrets:
          - name: ecrsecret
---
apiVersion: v1
kind: Service
metadata:
  name: multicloud-container-deployment-service
spec:
  type: LoadBalancer
  ports:
  - port: 80
    targetPort: 80
  selector:
    app: multicloud-container-deployment-app

Push the files to the GitHub code repository. The multicloud-container-app GitHub repository should look similar to Figure 2 below.


Figure 2 – Files in Github repository

Configure Azure Kubernetes Service (AKS) cluster to pull private images from ECR repository

Pull the Docker images from a private ECR repository to your AKS cluster by running the following command. This setup is required by the azure/k8s-deploy GitHub Action in the CI/CD workflow. Authenticate Docker to the Amazon ECR registry by using aws ecr get-login-password. Run the following command in a shell where the AWS CLI is configured and which is used to connect to the AKS cluster. This creates a secret called ecrsecret, which is used to pull an image from the private ECR repository.

kubectl create secret docker-registry ecrsecret \
 --docker-server=<aws_account_id>.dkr.ecr.us-west-2.amazonaws.com/<my_repository> \
 --docker-username=AWS \
 --docker-password=$(aws ecr get-login-password --region us-west-2)

Provide the ECR URI in the --docker-server parameter.
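
To confirm that the secret was created as expected, you can inspect it with kubectl (a read-only sketch). Note that ECR authorization tokens expire after 12 hours, so if the cluster needs to pull images later than that, refresh the secret by re-running the create command.

kubectl get secret ecrsecret -o jsonpath='{.type}'   # should print kubernetes.io/dockerconfigjson
kubectl describe secret ecrsecret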

CodeCatalyst setup

Follow these steps to set up the CodeCatalyst environment:

Configure access to the AKS cluster

In this solution, we use three GitHub Actions – azure/login, azure/aks-set-context, and azure/k8s-deploy – to log in, set the AKS cluster context, and deploy the manifest file to the AKS cluster, respectively. For the GitHub Actions to access the Azure environment, they require credentials associated with an Azure Service Principal.

Service Principals in Azure are identified by the CLIENT_ID, CLIENT_SECRET, SUBSCRIPTION_ID, and TENANT_ID properties. Create the Service Principal by running the following command in the Azure Cloud Shell:

az ad sp create-for-rbac \
    --name "ghActionHTMLapplication" \
    --scope /subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP> \
    --role Contributor \
    --sdk-auth

The command generates a JSON output (shown in Figure 3), which is stored in a CodeCatalyst secret called AZURE_CREDENTIALS. This credential is used by the azure/login GitHub Action.


Figure 3 – JSON output

Configure secrets inside CodeCatalyst Project

Create three secrets, CLUSTER_NAME (the name of the AKS cluster), RESOURCE_GROUP (the name of the Azure resource group), and AZURE_CREDENTIALS (described in the previous step), as described in the working with secrets documentation. The secrets are shown in Figure 4.


Figure 4 – CodeCatalyst Secrets

CodeCatalyst CI/CD Workflow

To create a new CodeCatalyst workflow, select CI/CD from the navigation on the left and select Workflows (1). Then, select Create workflow (2), leave the default options, and select Create (3) as shown in Figure 5.


Figure 5 – Create CodeCatalyst CI/CD workflow

Add “Push to Amazon ECR” Action

Add the Push to Amazon ECR action, and configure the environment where you created the ECR repository as shown in Figure 6. Refer to adding an action to learn how to add CodeCatalyst action.


Figure 6 – Create ‘Push to ECR’ Action

Select the Configuration tab and specify the configurations as shown in Figure 7.


Figure 7 – Configure ‘Push to ECR’ Action

Configure the Deploy action

1. Add a GitHub action for deploying to AKS as shown in Figure 8.


Figure 8 – Github action to deploy to AKS

2. Configure the GitHub action from the configurations tab by adding the following snippet to the GitHub Actions YAML property:

- name: Install Azure CLI
  run: pip install azure-cli
- name: Azure login
  id: login
  uses: azure/[email protected]
  with:
    creds: ${Secrets.AZURE_CREDENTIALS}
- name: Set AKS context
  id: set-context
  uses: azure/aks-set-context@v3
  with:
    resource-group: ${Secrets.RESOURCE_GROUP}
    cluster-name: ${Secrets.CLUSTER_NAME}
- name: Setup kubectl
  id: install-kubectl
  uses: azure/setup-kubectl@v3
- name: Deploy to AKS
  id: deploy-aks
  uses: Azure/k8s-deploy@v4
  with:
    namespace: default
    manifests: manifest.yaml
    pull-images: true


Figure 9 – Github action configuration

3. The workflow is now ready and can be validated by choosing ‘Validate’ and then saved to the repository by choosing ‘Commit’.
We have now implemented an automated CI/CD workflow that builds the container image of the application (refer to Figure 10), pushes the image to ECR, and deploys the application to the AKS cluster. This CI/CD workflow is triggered as application code is pushed to the repository.


Figure 10 – Automated CI/CD workflow

Test the deployment

When the HTML application runs, Kubernetes exposes the application using a public facing load balancer. To find the external IP of the load balancer, connect to the AKS cluster and run the following command:

kubectl get service multicloud-container-deployment-service

The output of the above command should look like the image in Figure 11.


Figure 11 – Output of kubectl get service
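
If you prefer to capture the external IP directly from the command line, the following sketch uses a standard kubectl JSONPath expression; the field path assumes the load balancer reports an IP address rather than a hostname.

kubectl get service multicloud-container-deployment-service \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}'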

Paste the External IP into a browser to see the running HTML application as shown in Figure 12.


Figure 12 – Application running in AKS

Cleanup

If you have been following along with the workflow described in the post, you should delete the resources you deployed so you do not continue to incur charges. First, delete the Amazon ECR repository using the AWS console. Second, delete the project from CodeCatalyst by navigating to Project settings and choosing Delete project. There’s no cost associated with the CodeCatalyst project and you can continue using it. Finally, if you deployed the application on a new AKS cluster, delete the cluster from the Azure console. If you deployed the application to an existing AKS cluster, run the following commands to delete the application resources.

kubectl delete deployment multicloud-container-deployment-app
kubectl delete services multicloud-container-deployment-service

Conclusion

In summary, this post showed how Amazon CodeCatalyst can help organizations deploy containerized workloads in a hybrid and multicloud environment. It demonstrated in detail how to set up and configure Amazon CodeCatalyst to deploy a containerized application to Azure Kubernetes Service, leveraging a CodeCatalyst workflow and GitHub Actions. Learn more and get started with your Amazon CodeCatalyst journey!

If you have any questions or feedback, leave them in the comments section.

About Authors

Picture of Pawan

Pawan Shrivastava

Pawan Shrivastava is a Partner Solution Architect at AWS in the WWPS team. He focusses on working with partners to provide technical guidance on AWS, collaborate with them to understand their technical requirements, and designing solutions to meet their specific needs. Pawan is passionate about DevOps, automation and CI CD pipelines. He enjoys watching MMA, playing cricket and working out in the gym.

Picture of Brent

Brent Van Wynsberge

Brent Van Wynsberge is a Solutions Architect at AWS supporting enterprise customers. He accelerates the cloud adoption journey for organizations by aligning technical objectives to business outcomes and strategic goals, and defining them where needed. Brent is an IoT enthusiast, specifically in the application of IoT in manufacturing, he is also interested in DevOps, data analytics and containers.

Picture of Amandeep

Amandeep Bajwa

Amandeep Bajwa is a Senior Solutions Architect at AWS supporting Financial Services enterprises. He helps organizations achieve their business outcomes by identifying the appropriate cloud transformation strategy based on industry trends, and organizational priorities. Some of the areas Amandeep consults on are cloud migration, cloud strategy (including hybrid & multicloud), digital transformation, data & analytics, and technology in general.

Picture of Brian

Brian Beach

Brian Beach has over 20 years of experience as a Developer and Architect. He is currently a Principal Solutions Architect at Amazon Web Services. He holds a Computer Engineering degree from NYU Poly and an MBA from Rutgers Business School. He is the author of “Pro PowerShell for Amazon Web Services” from Apress. He is a regular author and has spoken at numerous events. Brian lives in North Carolina with his wife and three kids.

Extend your data mesh with Amazon Athena and federated views

Post Syndicated from Saurabh Bhutyani original https://aws.amazon.com/blogs/big-data/extend-your-data-mesh-with-amazon-athena-and-federated-views/

Amazon Athena is a serverless, interactive analytics service built on the Trino, PrestoDB, and Apache Spark open-source frameworks. You can use Athena to run SQL queries on petabytes of data stored on Amazon Simple Storage Service (Amazon S3) in widely used formats such as Parquet and open-table formats like Apache Iceberg, Apache Hudi, and Delta Lake. However, Athena also allows you to query data stored in 30 different data sources—in addition to Amazon S3—including relational, non-relational, and object stores running on premises or in other cloud environments.

In Athena, we refer to queries on non-Amazon S3 data sources as federated queries. These queries run on the underlying database, which means you can analyze the data without learning a new query language and without the need for separate extract, transform, and load (ETL) scripts to extract, duplicate, and prepare data for analysis.

Recently, Athena added support for creating and querying views on federated data sources to bring greater flexibility and ease of use to use cases such as interactive analysis and business intelligence reporting. Athena also updated its data connectors with optimizations that improve performance and reduce cost when querying federated data sources. The updated connectors use dynamic filtering and an expanded set of predicate pushdown optimizations to perform more operations in the underlying data source rather than in Athena. As a result, you get faster queries with less data scanned, especially on tables with millions to billions of rows of data.

In this post, we show how to create and query views on federated data sources in a data mesh architecture featuring data producers and consumers.

The term data mesh refers to a data architecture with decentralized data ownership. A data mesh enables domain-oriented teams with the data they need, emphasizes self-service, and promotes the notion of purpose-built data products. In a data mesh, data producers expose datasets to the organization and data consumers subscribe to and consume the data products created by producers. By distributing data ownership to cross-functional teams, a data mesh can foster a culture of collaboration, invention, and agility around data.

Let’s dive into the solution.

Solution overview

For this post, imagine a hypothetical ecommerce company that uses multiple data sources, each playing a different role:

  • In an S3 data lake, ecommerce records are stored in a table named Lineitems
  • Amazon ElastiCache for Redis stores Nations and ActiveOrders data, ensuring ultra-fast reads of operational data by downstream ecommerce systems
  • On Amazon Relational Database Service (Amazon RDS), MySQL is used to store data like email addresses and shipping addresses in the Orders, Customer, and Suppliers tables
  • For flexibility and low-latency reads and writes, an Amazon DynamoDB table holds Part and Partsupp data

We want to query these data sources in a data mesh design. In the following sections, we set up Athena data source connectors for MySQL, DynamoDB, and Redis, and then run queries that perform complex joins across these data sources. The following diagram depicts our data architecture.

Architecture diagram

As you proceed with this solution, note that you will create AWS resources in your account. We have provided you with an AWS CloudFormation template that defines and configures the required resources, including the sample MySQL database, S3 tables, Redis store, and DynamoDB table. The template also creates the AWS Glue database and tables, S3 bucket, Amazon S3 VPC endpoint, AWS Glue VPC endpoint, and other AWS Identity and Access Management (IAM) resources that are used in the solution.

The template is designed to demonstrate how to use federated views in Athena, and is not intended for production use without modification. Additionally, the template uses the us-east-1 Region and will not work in other Regions without modification. The template creates resources that incur costs while they are in use. Follow the cleanup steps at the end of this post to delete the resources and avoid unnecessary charges.

Prerequisites

Before you launch the CloudFormation stack, ensure you have the following prerequisites:

  • An AWS account that provides access to AWS services
  • An IAM user with an access key and secret key to configure the AWS Command Line Interface (AWS CLI), and permissions to create an IAM role, IAM policies, and stacks in AWS CloudFormation

Create resources with AWS CloudFormation

To get started, complete the following steps:

  1. Choose Launch Stack: Cloudformation Launch Stack
  2. Select I acknowledge that this template may create IAM resources.

The CloudFormation stack takes approximately 20–30 minutes to complete. You can monitor its progress on the AWS CloudFormation console. When the status reads CREATE_COMPLETE, your AWS account will have the resources necessary to implement this solution.

Deploy connectors and connect to data sources

With our resources provisioned, we can begin to connect the dots in our data mesh. Let’s start by connecting the data sources created by the CloudFormation stack with Athena.

  1. On the Athena console, choose Data sources in the navigation pane.
  2. Choose Create data source.
  3. For Data sources, select MySQL, then choose Next.
  4. For Data source name, enter a name, such as mysql. The Athena connector for MySQL is an AWS Lambda function that was created for you by the CloudFormation template.
  5. For Connection details, choose Select or enter a Lambda function.
  6. Choose mysql, then choose Next.
  7. Review the information and choose Create data source.
  8. Return to the Data sources page and choose mysql.
  9. On the connector details page, choose the link under Lambda function to access the Lambda console and inspect the function associated with this connector.
    mysql Data Source details
  10. Return to the Athena query editor.
  11. For Data source, choose mysql.
  12. For Database, choose the sales database.
  13. For Tables, you should see a listing of MySQL tables that are ready for you to query.
  14. Repeat these steps to set up the connectors for DynamoDB and Redis.

After all four data sources are configured, we can see the data sources on the Data source drop-down menu. All other databases and tables, like the lineitem table, which is stored on Amazon S3, are defined in the AWS Glue Data Catalog and can be accessed by choosing AwsDataCatalog as the data source.

This image shows AwsDataCatalog being selected as the data source
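If you prefer to register the connectors programmatically instead of through the console, the Athena CreateDataCatalog API can do the same thing. The following is a minimal Python (boto3) sketch; the Lambda function ARN and Region are placeholders for the connector function that the CloudFormation template created in your account.

import boto3

athena = boto3.client("athena")

# Register the MySQL connector Lambda function as an Athena data source named "mysql".
# The function ARN below is a placeholder; use the ARN of the connector created by the template.
athena.create_data_catalog(
    Name="mysql",
    Type="LAMBDA",
    Description="MySQL federated data source",
    Parameters={"function": "arn:aws:lambda:us-east-1:111122223333:function:mysql-connector"},
)

# Repeat for the dynamo and redis connectors, then confirm what is registered.
print([c["CatalogName"] for c in athena.list_data_catalogs()["DataCatalogsSummary"]])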

Analyze data with Athena

With our data sources configured, we are ready to start running queries and using federated views in a data mesh architecture. Let’s start by trying to find out how much profit was made on a given line of parts, broken out by supplier nation and year.

For such a query, we need to calculate, for each nation and year, the profit for parts ordered in each year that were filled by a supplier in each nation. Profit is defined as the sum of [(l_extendedprice*(1-l_discount)) - (ps_supplycost * l_quantity)] for all line items describing parts in the specified line.

Answering this question requires querying all four data sources—MySQL, DynamoDB, Redis, and Amazon S3—and is accomplished with the following SQL:

SELECT
    n_name nation,
    year(CAST(o_orderdate AS date)) as o_year,
    ((l_extendedprice * (1 - l_discount)) - (CAST(ps_supplycost AS double) * l_quantity)) as amount
FROM
    awsdatacatalog.data_lake.lineitem,
    dynamo.default.part,
    dynamo.default.partsupp,
    mysql.sales.supplier,
    mysql.sales.orders,
    redis.redis.nation
WHERE
    ((s_suppkey = l_suppkey)
    AND (ps_suppkey = l_suppkey)
    AND (ps_partkey = l_partkey)
    AND (p_partkey = l_partkey)
    AND (o_orderkey = l_orderkey)
    AND (s_nationkey = CAST(Regexp_extract(_key_, '.*-(.*)', 1) AS int)))

Running this query on the Athena console produces the following result.

Result of above query

This query is fairly complex: it involves multiple joins and requires special knowledge of the correct way to calculate profit metrics that other end-users may not possess.

To simplify the analysis experience for those users, we can hide this complexity behind a view. For more information on using views with federated data sources, see Querying federated views.

Use the following query to create the view in the data_lake database under the AwsDataCatalog data source:

CREATE OR REPLACE VIEW "data_lake"."federated_view" AS
SELECT
    n_name nation,
    year(CAST(o_orderdate AS date)) as o_year,
    ((l_extendedprice * (1 - l_discount)) - (CAST(ps_supplycost AS double) * l_quantity)) as amount
FROM
    awsdatacatalog.data_lake.lineitem,
    dynamo.default.part,
    dynamo.default.partsupp,
    mysql.sales.supplier,
    mysql.sales.orders,
    redis.redis.nation
WHERE
    ((s_suppkey = l_suppkey)
    AND (ps_suppkey = l_suppkey)
    AND (ps_partkey = l_partkey)
    AND (p_partkey = l_partkey)
    AND (o_orderkey = l_orderkey)
    AND (s_nationkey = CAST(Regexp_extract(_key_, '.*-(.*)', 1) AS int)))

Next, run a simple SELECT query to validate that the view was created successfully:

SELECT * FROM federated_view LIMIT 10

The result should be similar to our previous query.

With our view in place, we can perform new analyses to answer questions that would be challenging without the view due to the complex query syntax that would be required. For example, we can find the total profit by nation:

SELECT nation, sum(amount) AS total
from federated_view
GROUP BY nation 
ORDER BY nation ASC

Your results should resemble the following screenshot.

Result of above query

As you can see, the federated view makes it simpler for end-users to run queries on this data. Users are free to query a view of the data, defined by a knowledgeable data producer, rather than having to first acquire expertise in each underlying data source. Because Athena federated queries are processed where the data is stored, we avoid duplicating data from the source system, which saves time and cost.
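You can also run the same query against the view programmatically, which is useful for scheduled reporting. The following Python (boto3) sketch assumes a workgroup on Athena engine version 3 and a hypothetical results bucket; adjust both to your environment.

import time
import boto3

athena = boto3.client("athena")

query = "SELECT nation, sum(amount) AS total FROM federated_view GROUP BY nation ORDER BY nation ASC"

# The output location is a placeholder; Athena writes query results there.
execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Catalog": "AwsDataCatalog", "Database": "data_lake"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/federated-views/"},
)

query_id = execution["QueryExecutionId"]
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows[1:]:  # skip the header row
        print([col.get("VarCharValue") for col in row["Data"]])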

Use federated views in a multi-user model

So far, we have satisfied one of the principles of a data mesh: we created a data product (federated view) that is decoupled from its originating source and is available for on-demand analysis by consumers.

Next, we take our data mesh a step further by using federated views in a multi-user model. To keep it simple, assume we have one producer account, the account we used to create our four data sources and federated view, and one consumer account. Using the producer account, we give the consumer account permission to query the federated view from the consumer account.

The following figure depicts this setup and our simplified data mesh architecture.

Multi-user model setup

Follow these steps to share the connectors and AWS Glue Data Catalog resources from the producer, which includes our federated view, with the consumer account:

  1. Share the data sources mysql, redis, dynamo, and data_lake with the consumer account. For instructions, refer to Sharing a data source in Account A with Account B. Note that Account A represents the producer and Account B represents the consumer. Make sure you use the same data source names from earlier when sharing data. This is necessary for the federated view to work in a cross-account model.
  2. Next, share the producer account’s AWS Glue Data Catalog with the consumer account by following the steps in Cross-account access to AWS Glue data catalogs. For the data source name, use shared_federated_catalog.
  3. Switch to the consumer account, navigate to the Athena console, and verify that you see federated_view listed under Views in the shared_federated_catalog Data Catalog and data_lake database.
  4. Next, run a sample query on the shared view to see the query results.

Result of sample query

Clean up

To clean up the resources created for this post, complete the following steps:

  1. On the Amazon S3 console, empty the bucket athena-federation-workshop-<account-id>.
  2. If you’re using the AWS CLI, delete the objects in the athena-federation-workshop-<account-id> bucket with the following code. Make sure you run this command on the correct bucket.
    aws s3 rm s3://athena-federation-workshop-<account-id> --recursive
  3. On the AWS CloudFormation console or the AWS CLI, delete the stack athena-federated-view-blog.

Summary

In this post, we demonstrated the functionality of Athena federated views. We created a view spanning four different federated data sources and ran queries against it. We also saw how federated views could be extended to a multi-user data mesh and ran queries from a consumer account.

To take advantage of federated views, ensure you are using Athena engine version 3 and upgrade your data source connectors to the latest version available. For information on how to upgrade a connector, see Updating a data source connector.


About the Authors

Saurabh Bhutyani is a Principal Big Data Specialist Solutions Architect at AWS. He is passionate about new technologies. He joined AWS in 2019 and works with customers to provide architectural guidance for running scalable analytics solutions and data mesh architectures using AWS analytics services like Amazon EMR, Amazon Athena, AWS Glue, AWS Lake Formation, and Amazon DataZone.

Pathik Shah is a Sr. Big Data Architect on Amazon Athena. He joined AWS in 2015 and has been focusing in the big data analytics space since then, helping customers build scalable and robust solutions using AWS analytics services.

Use AWS Glue DataBrew recipes in your AWS Glue Studio visual ETL jobs

Post Syndicated from Gonzalo Herreros original https://aws.amazon.com/blogs/big-data/use-aws-glue-databrew-recipes-in-your-aws-glue-studio-visual-etl-jobs/

AWS Glue Studio is now integrated with AWS Glue DataBrew. AWS Glue Studio is a graphical interface that makes it easy to create, run, and monitor extract, transform, and load (ETL) jobs in AWS Glue. DataBrew is a visual data preparation tool that enables you to clean and normalize data without writing any code. The more than 200 transformations it provides are now available for use in AWS Glue Studio visual jobs.

In DataBrew, a recipe is a set of data transformation steps that you can author interactively in its intuitive visual interface. In this post, you’ll see how to build a recipe in DataBrew and then apply it as part of an AWS Glue Studio visual ETL job.

Existing DataBrew users will also benefit from this integration—you can now run your recipes as part of a larger visual workflow with all the other components AWS Glue Studio provides, in addition to being able to use advanced job configuration and the latest AWS Glue engine version.

This integration brings distinct benefits to the existing users of both tools:

  • You have a centralized view in AWS Glue Studio of the overall ETL diagram, end to end
  • You can interactively define a recipe, seeing values, statistics, and distribution on the DataBrew console, then reuse that tested and versioned processing logic in AWS Glue Studio visual jobs
  • You can orchestrate multiple DataBrew recipes in an AWS Glue ETL job or even multiple jobs using AWS Glue workflows
  • DataBrew recipes can now use AWS Glue job features such as bookmarks for incremental data processing, automatic retries, auto scale, or grouping small files for greater efficiency

Solution overview

In our fictitious use case, the requirement is to clean up a synthetic medical claims dataset created for this post, which has some data quality issues introduced on purpose to demonstrate the DataBrew capabilities on data preparation. Then the claims data is ingested into the catalog (so it’s visible to analysts), after enriching it with some relevant details about the corresponding medical providers coming from a separate source.

The solution consists of an AWS Glue Studio visual job that reads two CSV files with claims and providers, respectively. The job applies a recipe to the first dataset to address the quality issues, selects columns from the second one, joins both datasets, and finally stores the result on Amazon Simple Storage Service (Amazon S3), creating a table in the catalog so the output data can be used by other tools like Amazon Athena.

Create a DataBrew recipe

Start by registering the data store for the claims file. This will allow you to build the recipe in its interactive editor using the actual data so you can evaluate the result of the transformations as you define them.

  1. Download the claims CSV file using the following link: alabama_claims_data_Jun2023.csv.
  2. On the DataBrew console, choose Datasets in the navigation pane, then choose Connect new dataset.
  3. Choose the option File upload.
  4. For Dataset name, enter Alabama claims.
  5. For Select a file to upload, choose the file you just downloaded on your computer.
    Add dataset
  6. For Enter S3 destination, enter or browse to a bucket in your account and Region.
  7. Leave the rest of the options by default (CSV separated with comma and with header) and complete the dataset creation.
  8. Choose Project in the navigation pane, then choose Create project.
  9. For Project name, name it ClaimsCleanup.
  10. Under Recipe details, for Attached recipe, choose Create new recipe, name it ClaimsCleanup-recipe, and choose the Alabama claims dataset you just created.Add project
  11. Select a role suitable for DataBrew or create a new one, and complete the project creation.

This will create a session using a configurable subset of the data. After the session has initialized, you will notice that some of the cells have invalid or missing values.

Loaded project

In addition to the missing values in the columns Diagnosis Code, Claim Amount, and Claim Date, some values in the data have some extra characters: Diagnosis Code values are sometimes prefixed with “code ” (space included), and Procedure Code values are sometimes followed by single quotes.
Claim Amount values will likely be used for calculations, so they should be converted to a numeric type, and Claim Date should be converted to a date type.

Now that we identified the data quality issues to address, we need to decide how to deal with each case.
There are multiple ways you can add recipe steps, including using the column context menu, the toolbar on the top, or from the recipe summary. Using the last method, you can search for the indicated step type to replicate the recipe created in this post.

Add step searchbox

Claim Amount is essential for this use case, so the decision is to remove rows with a missing amount.

  1. Add the step Remove missing values.
  2. For Source column, choose Claim Amount.
  3. Leave the default action Delete rows with missing values and choose Apply to save it.
    Preview missing values

The view is now updated to reflect the step application and the rows with missing amounts are no longer there.

Diagnosis Code can be empty so this is accepted, but in the case of Claim Date, we want to have a reasonable estimation. The rows in the data are sorted in chronological order, so you can impute missing dates using the previous valid value from the preceding rows. Assuming every day has claims, the largest error would be assigning a claim to the previous day if the first claim of a day were the one missing the date; for illustration purposes, let’s consider that potential error acceptable.

First, convert the column from string to date type.

  1. Add the step Change type.
  2. Choose Claim Date as the column and date as the type, then choose Apply.
    Change type to date
  3. Now to do the imputation of missing dates, add the step Fill or impute missing values.
  4. Select Fill with last valid value as the action and choose Claim Date as the source.
  5. Choose Preview changes to validate it, then choose Apply to save the step.
    Preview imputation

So far, your recipe should have three steps, as shown in the following screenshot.

Steps so far

  1. Next, add the step Remove quotation marks.
  2. Choose the Procedure Code column and select Leading and trailing quotation marks.
  3. Preview to verify it has the desired effect and apply the new step.
    Preview remove quotes
  4. Add the step Remove special characters.
  5. Choose the Claim Amount column and to be more specific, select Custom special characters and enter $ for Enter custom special characters.
    Preview remove dollar sign
  6. Add a Change type step on the column Claim Amount and choose double as the type.
    Change type to double
  7. As the last step, to remove the superfluous “code ” prefix, add a Replace value or pattern step.
  8. Choose the column Diagnosis Code, and for Enter custom value, enter code (with a space at the end).
    Preview remove code

Now that you have addressed all data quality issues identified on the sample, publish the project as a recipe.

  1. Choose Publish in the Recipe pane, enter an optional description, and complete the publication.
    Recipe steps

Each time you publish, it will create a different version of the recipe. Later, you will be able to choose which version of the recipe to use.
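As a quick illustration, you can list the published versions of a recipe with the DataBrew API. The following Python (boto3) sketch assumes the recipe name used earlier in this post.

import boto3

databrew = boto3.client("databrew")

# List every published version of the recipe created above.
response = databrew.list_recipe_versions(Name="ClaimsCleanup-recipe")
for recipe in response["Recipes"]:
    print(recipe["RecipeVersion"], recipe.get("Description", ""))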

Create a visual ETL job in AWS Glue Studio

Next, you create the job that uses the recipe. Complete the following steps:

  1. On the AWS Glue Studio console, choose Visual ETL in the navigation pane.
  2. Choose Visual with a blank canvas and create the visual job.
  3. At the top of the job, replace “Untitled job” with a name of your choice.
  4. On the Job Details tab, specify a role that the job will use.
    This needs to be an AWS Identity and Access Management (IAM) role suitable for AWS Glue with permissions to Amazon S3 and the AWS Glue Data Catalog. Note that the role used earlier for DataBrew is not usable to run jobs here, so it won’t be listed on the IAM Role drop-down menu.
    Job details
    If you used only DataBrew jobs before, notice that in AWS Glue Studio, you can choose performance and cost settings, including worker size, auto scaling, and Flexible Execution, as well as use the latest AWS Glue 4.0 runtime and benefit from the significant performance improvements it brings. For this job, you can use the default settings, but reduce the requested number of workers in the interest of frugality. For this example, two workers will do.
  5. On the Visual tab, add an S3 source and name it Providers.
  6. For S3 URL, enter s3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv.
    S3 Source
  7. Select the format as CSV and choose Infer schema.
    Now the schema is listed on the Output schema tab using the file header.
    Input schema

In this use case, the decision is that not all columns in the providers dataset are needed, so we can discard the rest.

  1. With the Providers node selected, add a Drop Fields transform (if you didn’t select the parent node, it won’t have one; in that case, assign the node parent manually).
  2. Select all the fields after Provider Zip Code.
    Drop fields

Later, this data will be joined with the claims for the state of Alabama using the provider ID; however, that second dataset doesn’t have the state specified. We can use our knowledge of the data to optimize the join by filtering for only the data we really need.

  1. Add a Filter transform as a child of Drop Fields.
  2. Name it Alabama providers and add a condition that the state must match AL.
    Filter providers
  3. Add the second source (a new S3 source) and name it Alabama claims.
  4. To enter the S3 URL, open DataBrew in a separate browser tab, choose Datasets in the navigation pane, and copy the location shown in the table for Alabama claims (copy the text starting with s3://, not the associated http link). Back in the visual job, paste it as the S3 URL; if it is correct, you will see the data fields listed on the Output schema tab.
  5. Select CSV format and infer the schema like you did with the other source.
  6. As a child of this source, search in the Add nodes menu for recipe and choose Data Preparation Recipe.
    Add recipe
  7. In this new node’s properties, give it the name Claim cleanup recipe and choose the recipe and version you published before.
  8. You can review the recipe steps here and use the link to DataBrew to make changes if needed.
    Recipe details
  9. Add a Join node and select both Alabama providers and Claim cleanup recipe as parents.
  10. Add a join condition equaling the provider ID from both sources.
  11. As the last step, add an S3 node as a target (note the first one listed when you search is the source; make sure you select the version that is listed as the target).
  12. In the node configuration, leave the default format JSON and enter an S3 URL on which the job role has permission to write.

In addition, make the data output available as a table in the catalog.

  1. In the Data Catalog update options section, select the second option Create a table in the Data Catalog and on subsequent runs, update the schema and add new partitions, then select a database on which you have permission to create tables.
  2. Assign alabama_claims as the name and choose Claim Date as the partition key (this is for illustration purposes; a tiny table like this doesn’t really need partitions if further data won’t be added later).
    Join
  3. Now you can save and run the job.
  4. On the Runs tab, you can keep track of the process and see detailed job metrics using the job ID link.

The job should take a few minutes to complete.
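If you prefer to start and monitor the run programmatically rather than from the Runs tab, you can use the AWS Glue API. The following Python (boto3) sketch uses a hypothetical job name; replace it with the name you gave the visual job.

import boto3

glue = boto3.client("glue")

# "claims-cleanup-job" is a placeholder for the name of your visual job.
run = glue.start_job_run(JobName="claims-cleanup-job")
status = glue.get_job_run(JobName="claims-cleanup-job", RunId=run["JobRunId"])
print(status["JobRun"]["JobRunState"])  # for example RUNNING, then SUCCEEDED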

  1. When the job is complete, navigate to the Athena console.
  2. Search for the table alabama_claims in the database you selected and, using the context menu, choose Preview Table, which will run a simple SELECT * SQL statement on the table.

Athena results

You can see in the result of the job that the data was cleaned by the DataBrew recipe and enriched by the AWS Glue Studio join.

Apache Spark is the engine that runs the jobs created in AWS Glue Studio. Using the Spark UI with the event logs the job produces, you can view insights about the job plan and run, which can help you understand how your job is performing and identify potential performance bottlenecks. For instance, for this job on a large dataset, you could use it to compare the impact of explicitly filtering the provider state before doing the join, or identify whether you can benefit from adding an Autobalance transform to improve parallelism.

By default, the job will store the Apache Spark event logs under the path s3://aws-glue-assets-<your account id>-<your region name>/sparkHistoryLogs/. To view the logs, you have to set up a Spark history server using one of the available methods.

SparkUI

Clean up

If you no longer need this solution, you can delete the files generated on Amazon S3, the table created by the job, the DataBrew recipe, and the AWS Glue job.

Conclusion

In this post, we showed how you can use AWS Glue DataBrew to build a recipe using its interactive editor and then use the published recipe as part of an AWS Glue Studio visual ETL job. We included examples of common tasks that are required when preparing data and ingesting it into AWS Glue Data Catalog tables.

This example used a single recipe in the visual job, but it’s possible to use multiple recipes at different stages of the ETL process, as well as to reuse the same recipe in multiple jobs.

These AWS Glue solutions allow you to effectively create advanced ETL pipelines that are straightforward to build and maintain, all without writing any code. You can start creating solutions that combine both tools today.


About the authors

Mikhail Smirnov is a Sr. Software Dev Engineer on the AWS Glue team and part of the AWS Glue DataBrew development team. Outside of work, his interests include learning to play guitar and traveling with his family.

Gonzalo Herreros is a Sr. Big Data Architect on the AWS Glue team. Based in Dublin, Ireland, he helps customers succeed with big data solutions based on AWS Glue. In his spare time, he enjoys board games and cycling.

Content Repository for Unstructured Data with Multilingual Semantic Search: Part 2

Post Syndicated from Patrik Nagel original https://aws.amazon.com/blogs/architecture/content-repository-for-unstructured-data-with-multilingual-semantic-search-part-2/

Leveraging vast unstructured data poses challenges, particularly for global businesses needing cross-language data search. In Part 1 of this blog series, we built the architectural foundation for the content repository. The key component of Part 1 was the dynamic access control-based logic with a web UI to upload documents.

In Part 2, we extend the content repository with multilingual semantic search capabilities while maintaining the access control logic from Part 1. This allows users to ingest documents into the content repository in multiple languages and then run search queries that return references to semantically similar documents.

Solution overview

Building on the architectural foundation from Part 1, we introduce four new building blocks to extend the search functionality.

Optical character recognition (OCR) workflow: To automatically identify, understand, and extract text from ingested documents, we use Amazon Textract and a sample review dataset of .png format documents (Figure 1). We use Amazon Textract synchronous application programming interfaces (APIs) to capture key-value pairs for the reviewid and reviewBody attributes. Based on your specific requirements, you can choose to capture either the complete extracted text or parts of the text.

Figure 1. Sample document for ingestion
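To make the OCR step concrete, the following Python (boto3) sketch shows one way the document transformation function could call the synchronous Amazon Textract API and pull out form key-value pairs. The bucket and object names are placeholders, and the parsing is simplified compared to the solution’s actual Lambda code.

import boto3

textract = boto3.client("textract", region_name="us-east-1")  # Region is an assumption

# Synchronous call; the bucket and key are placeholders for an ingested review document.
response = textract.analyze_document(
    Document={"S3Object": {"Bucket": "my-source-bucket", "Name": "reviews/review-0001.png"}},
    FeatureTypes=["FORMS"],
)

blocks = {b["Id"]: b for b in response["Blocks"]}

def block_text(block):
    # Concatenate the WORD children of a block.
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = blocks[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)

# Collect KEY blocks and resolve their VALUE blocks into key-value pairs.
pairs = {}
for block in blocks.values():
    if block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", []):
        key_text = block_text(block)
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                for value_id in rel["Ids"]:
                    pairs[key_text] = block_text(blocks[value_id])

print({k: v for k, v in pairs.items() if k in ("reviewid", "reviewBody")})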

Embedding generation: To capture the semantic relationship between the text, we use a machine learning (ML) model that maps words and sentences to high-dimensional vector embeddings. You can use Amazon SageMaker, a fully managed ML service, to build, train, and deploy your ML models to production-ready hosted environments. You can also deploy ready-to-use pre-trained models from multiple avenues such as SageMaker JumpStart. For this blog post, we use the open-source pre-trained universal-sentence-encoder-multilingual model from TensorFlow Hub. The model inference endpoint deployed to a SageMaker endpoint generates embeddings for the document text and the search query. Figure 2 shows an example of the n-dimensional vector that is generated when the reviewBody attribute text is provided to the embeddings model.

Figure 2. Sample embedding representation of the value of reviewBody
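The following Python (boto3) sketch illustrates how an embedding could be requested from the SageMaker endpoint. The endpoint name is hypothetical (in this solution, the real name is stored in an AWS Systems Manager parameter), and the request and response shapes assume the standard TensorFlow Serving JSON format used by the TensorFlow inference container.

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# Hypothetical endpoint name; the solution stores the real name in the sagemaker-endpoint SSM parameter.
endpoint_name = "tensorflow-inference-content-repo"

payload = {"instances": ["Die Lieferung kam schnell und das Produkt funktioniert gut."]}

response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(payload),
)

embedding = json.loads(response["Body"].read())["predictions"][0]
print(len(embedding))  # dimensionality of the sentence embedding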

Embedding ingestion: To make the embeddings searchable for the content repository users, you can use the k-Nearest Neighbor (k-NN) search feature of Amazon OpenSearch Service. The OpenSearch k-NN plugin provides different methods. For this blog post, we use the Approximate k-NN search approach, based on the Hierarchical Navigable Small World (HNSW) algorithm. HNSW uses a hierarchical set of proximity graphs in multiple layers to improve performance when searching large datasets to find the “nearest neighbors” for the search query text embeddings.
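As an illustration of what the embedding ingestion target could look like, the following Python sketch (using the opensearch-py client) creates a k-NN enabled index with an HNSW vector field on the Lucene engine. The index name, field names, domain endpoint, credentials, and the 512 dimension are assumptions; the dimension must match the output size of your encoder model.

from opensearchpy import OpenSearch  # pip install opensearch-py

# Hypothetical domain endpoint and credentials; in the solution these are provisioned by the CDK stack.
client = OpenSearch(
    hosts=[{"host": "search-content-repo-domain.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=("admin", "example-password"),
    use_ssl=True,
)

index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "reviewid": {"type": "keyword"},
            "reviewBody": {"type": "text"},
            "department": {"type": "keyword"},  # used later for access control filtering
            "reviewBody_embeddings": {
                "type": "knn_vector",
                "dimension": 512,  # must match the encoder output size
                "method": {"name": "hnsw", "engine": "lucene", "space_type": "cosinesimil"},
            },
        }
    },
}

client.indices.create(index="documents", body=index_body)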

Semantic search: We make the search service accessible as additional backend logic on Amazon API Gateway. Authenticated content repository users send their search query through the frontend and receive the matching documents. The solution maintains end-to-end access control by using the department attribute from the user’s enriched Amazon Cognito provided identity (ID) token claim and comparing it with the corresponding attribute on the ingested documents.

Technical architecture

The technical architecture includes two parts:

  1. Implementing multilingual semantic search functionality: Describes the processing workflow for the document that the user uploads; makes the document searchable.
  2. Running input search query: Covers the search workflow for the input query; finds and returns the nearest neighbors of the input text query to the user.

Part 1. Implementing multilingual semantic search functionality

Our previous blog post discussed blocks A through D (Figure 3), including user authentication, ID token enrichment, Amazon Simple Storage Service (Amazon S3) object tags for dynamic access control, and document upload to the source S3 bucket. In the following section, we cover blocks E through H. The overall workflow describes how an unstructured document is ingested into the content repository, run through the backend OCR and embedding generation process, and finally how the resulting vector embeddings are stored in OpenSearch Service.

Figure 3. Technical architecture for implementing multilingual semantic search functionality

  1. The OCR workflow extracts text from your uploaded documents.
    • The source S3 bucket sends an event notification to Amazon Simple Queue Service (Amazon SQS).
    • The document transformation AWS Lambda function subscribed to the Amazon SQS queue invokes an Amazon Textract API call to extract the text.
  2. The document transformation Lambda function makes an inference request to the encoder model hosted on SageMaker. In this example, the Lambda function submits the reviewBody attribute to the encoder model to generate the embedding.
  3. The document transformation Lambda function writes an output file in the transformed S3 bucket. The text file consists of:
    • The reviewid and reviewBody attributes extracted from Step 1
    • An additional reviewBody_embeddings attribute from Step 2
      Note: The workflow tags the output file with the same S3 object tags as the source document for downstream access control.
  4. The transformed S3 bucket sends an event notification to invoke the indexing Lambda function.
  5. The indexing Lambda function reads the text file content. Then the indexing Lambda function makes an OpenSearch index API call that includes the source document tag as one of the indexing attributes for access control.

Part 2. Running user-initiated search query

Next, we describe how the user’s request produces query results (Figure 4).

Figure 4. Search query lifecycle

  1. The user enters a search string in the web UI to retrieve relevant documents.
  2. Based on the active sign-in session, the UI passes the user’s ID token to the search endpoint of the API Gateway.
  3. The API Gateway uses Amazon Cognito integration to authorize the search API request.
  4. Once validated, the search API endpoint request invokes the search document Lambda function.
  5. The search document function sends the search query string as the inference request to the encoder model to receive the embedding as the inference response.
  6. The search document function uses the embedding response to build an OpenSearch k-NN search query (a sample query body is sketched after this list). The HNSW algorithm is configured with the Lucene engine and its filter option to maintain the access control logic based on the custom department claim from the user’s ID token. For the query embedding, the OpenSearch query returns the following:
    • The top three approximate k-NN matches
    • Other attributes, such as reviewid and reviewBody
  7. The workflow sends the relevant query result attributes back to the UI.
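The following Python sketch shows what the filtered k-NN query body built by the search document function could look like. The index name, field names, and department value are assumptions carried over from the index sketch earlier in this post, and query_embedding is the vector returned by the encoder for the search string.

def build_knn_query(query_embedding, department, k=3):
    # Hypothetical helper: approximate k-NN with a Lucene engine filter on the department attribute.
    return {
        "size": k,
        "_source": ["reviewid", "reviewBody"],
        "query": {
            "knn": {
                "reviewBody_embeddings": {
                    "vector": query_embedding,
                    "k": k,
                    "filter": {"term": {"department": department}},
                }
            }
        },
    }

# Using the opensearch-py client from the earlier sketch:
# results = client.search(index="documents", body=build_knn_query(query_embedding, "sales"))
# for hit in results["hits"]["hits"]:
#     print(hit["_score"], hit["_source"]["reviewid"])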

Prerequisites

You must have the following prerequisites for this solution:

Walkthrough

Setup

The following steps deploy two AWS CDK stacks into your AWS account:

  • content-repo-search-stack (blog-content-repo-search-stack.ts) creates the environment detailed in Figure 3, except for the SageMaker endpoint, which you create in a separate step.
  • demo-data-stack (userpool-demo-data-stack.ts) deploys sample users, groups, and role mappings.

To continue setup, use the following commands:

  1. Clone the project Git repository:
    git clone https://github.com/aws-samples/content-repository-with-multilingual-search content-repository
  2. Install the necessary dependencies:
    cd content-repository/backend-cdk 
    npm install
  3. Configure environment variables:
    export CDK_DEFAULT_ACCOUNT=$(aws sts get-caller-identity --query 'Account' --output text)
    export CDK_DEFAULT_REGION=$(aws configure get region)
  4. Bootstrap your account for AWS CDK usage:
    cdk bootstrap aws://$CDK_DEFAULT_ACCOUNT/$CDK_DEFAULT_REGION
  5. Deploy the code to your AWS account:
    cdk deploy --all

The complete stack set-up may take up to 20 minutes.

Creation of SageMaker endpoint

Follow these steps to create the SageMaker endpoint in the same AWS Region where you deployed the AWS CDK stack.

    1. Sign in to the SageMaker console.
    2. In the navigation menu, select Notebook, then Notebook instances.
    3. Choose Create notebook instance.
    4. Under the Notebook instance settings, enter content-repo-notebook as the notebook instance name, and leave other defaults as-is.
    5. Under the Permissions and encryption section (Figure 5), you need to set the IAM role section to the role with the prefix content-repo-search-stack. In case you don’t see this role automatically populated, select it from the drop-down. Leave the rest of the defaults, and choose Create notebook instance.

      Figure 5. Notebook permissions

    6. The notebook instance status shows Pending at first; it becomes available for use within 3-4 minutes.
    7. Once the notebook is in the Available status, choose Open Jupyter.
    8. Choose the Upload button and upload the create-sagemaker-endpoint.ipynb file in the backend-cdk folder of the root of the blog repository.
    9. Open the create-sagemaker-endpoint.ipynb notebook. Select the option Run All from the Cell menu (Figure 6). This might take up to 10 minutes.

      Figure 6. Run create-sagemaker-endpoint notebook cells

    10. After all the cells have successfully run, verify that the AWS Systems Manager parameter sagemaker-endpoint is updated with the value of the SageMaker endpoint name. An example value, as the output of the cell, is shown in Figure 7. If you don’t see the output, check whether the preceding steps ran correctly.

      Figure 7. SSM parameter updated with SageMaker endpoint

    11. Verify in the SageMaker console that the inference endpoint with the prefix tensorflow-inference has been deployed and is set to status InService.
    12. Upload sample data to the content repository:
      • Update the S3_BUCKET_NAME variable in the upload_documents_to_S3.sh script in the root folder of the blog repository with the s3SourceBucketName from the AWS CDK output of the content-repo-search-stack.
      • Run the upload_documents_to_S3.sh script to upload 150 sample documents to the content repository. This takes 5-6 minutes. During this process, each uploaded document triggers the workflow described in Implementing multilingual semantic search functionality.

Using the search service

At this stage, you have deployed all the building blocks for the content repository in your AWS account. Next, as part of uploading the sample data to the content repository, you pushed a limited corpus of 150 sample documents (.png format). Each document is in one of four languages: English, German, Spanish, or French. With the added multilingual search capability, you can query in one language and receive semantically similar results across different languages while maintaining the access control logic.

  1. Access the frontend application:
    • Copy the amplifyHostedAppUrl value of the AWS CDK output from the content-repo-search-stack shown in the terminal.
    • Enter the URL in your web browser to access the frontend application.
    • A temporary page displays until the automated build and deployment of the React application completes after 4-5 minutes.
  2. Sign into the application:
    • The content repository provides two demo users with credentials as part of the demo-data-stack in the AWS CDK output. Copy the password from the terminal associated with the sales-user, which belongs to the sales department.
    • Follow the prompts from the React webpage to sign in with the sales-user and change the temporary password.
  3. Enter search queries and verify results. The search action invokes the workflow described in Running input search query. For example:
    • Enter works well as the search query. Note the multilingual output and the semantically similar results (Figure 8).

        Figure 8. Positive sentiment multilingual search result for the sales-user

    • Enter bad quality as the search query. Note the multilingual output and the semantically similar results (Figure 9).

      Figure 9. Negative sentiment multi-lingual search result for the sales-user

  4. Sign out as the sales-user with the Log Out button on the webpage.
  5. Sign in using the marketing-user credentials to verify access control:
    • Follow the sign in procedure in step 2 but with the marketing-user.
    • This time, with works well as the search query, you get different output. This is because the access control only allows the marketing-user to search for documents that belong to the marketing department (Figure 10).

      Figure 10. Positive sentiment multilingual search result for the marketing-user

Cleanup

In the backend-cdk subdirectory of the cloned repository, delete the deployed resources: cdk destroy --all.

Additionally, you need to access the Amazon SageMaker console to delete the SageMaker endpoint and notebook instance created as part of the Walkthrough setup section.

Conclusion

In this blog, we enriched the content repository with multilingual semantic search features while maintaining the access control fundamentals that we implemented in Part 1. The building blocks of semantic search for unstructured documents—Amazon Textract, Amazon SageMaker, and Amazon OpenSearch Service—set a foundation for you to customize and enhance the search capabilities for your specific use case. For example, you can leverage the fast developments in large language models (LLMs) to enhance the semantic search experience. You can replace the encoder model with an LLM capable of generating multilingual embeddings while still using the OpenSearch Service to store and index data and perform vector search.

Migrating your secrets to AWS Secrets Manager, Part 2: Implementation

Post Syndicated from Adesh Gairola original https://aws.amazon.com/blogs/security/migrating-your-secrets-to-aws-secrets-manager-part-2-implementation/

In Part 1 of this series, we provided guidance on how to discover and classify secrets and design a migration solution for customers who plan to migrate secrets to AWS Secrets Manager. We also mentioned steps that you can take to enable preventative and detective controls for Secrets Manager. In this post, we discuss how teams should approach the next phase, which is implementing the migration of secrets to Secrets Manager. We also provide a sample solution to demonstrate migration.

Implement secrets migration

Application teams lead the effort to design the migration strategy for their application secrets. Once you’ve made the decision to migrate your secrets to Secrets Manager, there are two potential options for migration implementation. One option is to move the application to AWS in its current state and then modify the application source code to retrieve secrets from Secrets Manager. Another option is to update the on-premises application to use Secrets Manager for retrieving secrets. You can use features such as AWS Identity and Access Management (IAM) Roles Anywhere to make the application communicate with Secrets Manager even before the migration, which can simplify the migration phase.

If the application code contains hardcoded secrets, the code should be updated so that it references Secrets Manager. A good interim state would be to pass these secrets as environment variables to your application. Using environment variables helps in decoupling the secrets retrieval logic from the application code and allows for a smooth cutover and rollback (if required).
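As a minimal sketch of this interim state, the following Python code prefers Secrets Manager and falls back to environment variables, which keeps the retrieval logic decoupled from the application code. The secret name and variable names are placeholders.

import json
import os
import boto3

def get_database_credentials(secret_id="app/prod/db-credentials"):
    """Prefer Secrets Manager; fall back to environment variables during the interim state."""
    try:
        client = boto3.client("secretsmanager")
        secret = client.get_secret_value(SecretId=secret_id)
        return json.loads(secret["SecretString"])
    except Exception:
        # Interim fallback: credentials injected as environment variables at deploy time
        return {"username": os.environ["DB_USERNAME"], "password": os.environ["DB_PASSWORD"]}

credentials = get_database_credentials()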

Cutover to Secrets Manager should be done in a maintenance window. This minimizes downtime and impacts to production.

Before you perform the cutover procedure, verify the following:

  • Application components can access Secrets Manager APIs. Based on your environment, this connectivity might be provisioned through interface virtual private cloud (VPC) endpoints or over the internet.
  • Secrets exist in Secrets Manager and have the correct tags. This is important if you are using attribute-based access control (ABAC).
  • Applications that integrate with Secrets Manager have the required IAM permissions.
  • Have a well-documented cutover and rollback plan that contains the changes that will be made to the application during cutover. These would include steps like updating the code to use environment variables and updating the application to use IAM roles or instance profiles (for apps that are being migrated to Amazon Elastic Compute Cloud (Amazon EC2)).

After the cutover, verify that Secrets Manager integration was successful. You can use AWS CloudTrail to confirm that application components are using Secrets Manager.
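For example, the following Python (boto3) sketch looks up recent GetSecretValue events in CloudTrail to confirm that the application is now reading from Secrets Manager; the 24-hour window is arbitrary.

from datetime import datetime, timedelta
import boto3

cloudtrail = boto3.client("cloudtrail")

# Look for recent GetSecretValue calls recorded by CloudTrail.
events = cloudtrail.lookup_events(
    LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": "GetSecretValue"}],
    StartTime=datetime.utcnow() - timedelta(hours=24),
    EndTime=datetime.utcnow(),
)

for event in events["Events"]:
    print(event["EventTime"], event.get("Username"), [r.get("ResourceName") for r in event.get("Resources", [])])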

We recommend that you further optimize your integration by enabling automatic secrets rotation. If your secrets were previously widely accessible (for example, they were stored in your Git repositories), we recommend rotating them as soon as possible after migrating.
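If you want to configure rotation through the API rather than the console, the following Python (boto3) sketch enables a 30-day rotation schedule. The secret name and rotation Lambda ARN are placeholders; the rotation function must already exist (for example, one created from an AWS-provided rotation template).

import boto3

secretsmanager = boto3.client("secretsmanager")

# Placeholders: point these at your secret and your rotation Lambda function.
secretsmanager.rotate_secret(
    SecretId="app/prod/db-credentials",
    RotationLambdaARN="arn:aws:lambda:us-east-1:111122223333:function:SecretsManagerRotationFunction",
    RotationRules={"AutomaticallyAfterDays": 30},
)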

Sample application to demo integration with Secrets Manager

In the next sections, we present a sample AWS Cloud Development Kit (AWS CDK) solution that demonstrates the implementation of the previously discussed guardrails, design, and migration strategy. You can use the sample solution as a starting point and expand upon it. It includes components that environment teams may deploy to help provide secure access for application teams as they migrate their secrets to Secrets Manager. The solution uses ABAC, a tagging scheme, and IAM Roles Anywhere to demonstrate regulated access to secrets for application teams. Additionally, the solution contains client-side utilities to assist application and migration teams in updating secrets. Teams with on-premises applications that are seeking integration with Secrets Manager before migration can use the client-side utility for access through IAM Roles Anywhere.
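To give a feel for how ABAC restricts access, here is an illustrative policy document expressed as a Python dict. It is a sketch only; the tag keys app and environment are assumptions, and the actual managed policy in the sample repository may use different keys and actions.

# Illustrative ABAC policy: access is allowed only when the secret's tags match the calling principal's tags.
abac_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["secretsmanager:GetSecretValue", "secretsmanager:PutSecretValue"],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "secretsmanager:ResourceTag/app": "${aws:PrincipalTag/app}",
                    "secretsmanager:ResourceTag/environment": "${aws:PrincipalTag/environment}",
                }
            },
        }
    ],
}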

The sample solution is hosted on the aws-secrets-manager-abac-authorization-samples GitHub repository and is made up of the following components:

  • A common environment infrastructure stack (created and owned by environment teams). This stack provisions the following resources:
    • A sample VPC created with Amazon Virtual Private Cloud (Amazon VPC), with PUBLIC, PRIVATE_WITH_NAT, and PRIVATE_ISOLATED subnet types.
    • VPC endpoints for the AWS Key Management Service (AWS KMS) and Secrets Manager services to the sample VPC. The use of VPC endpoints means that calls to AWS KMS and Secrets Manager are not made over the internet and remain internal to the AWS backbone network.
    • An empty shell secret, tagged with the supplied attributes and an IAM managed policy that uses attribute-based access control conditions. This means that the secret is managed in code, but the actual secret value is not visible in version control systems like GitHub or in AWS CloudFormation parameter inputs. 
  • An IAM Roles Anywhere infrastructure stack (created and owned by environment teams). This stack provisions the following resources:
    • An AWS Certificate Manager Private Certificate Authority (AWS Private CA).
    • An IAM Roles Anywhere public key infrastructure (PKI) trust anchor that uses AWS Private CA.
    • An IAM role for the on-premises application that uses the common environment infrastructure stack.
    • An IAM Roles Anywhere profile.

    Note: You can choose to use your existing CAs as trust anchors. If you do not have a CA, the stack described here provisions a PKI for you. IAM Roles Anywhere allows migration teams to use Secrets Manager before the application is moved to the cloud. Post migration, you could consider updating the applications to use native IAM integration (like instance profiles for EC2 instances) and revoking IAM Roles Anywhere credentials.

  • A client-side utility (primarily used by application or migration teams). This is a shell script that does the following:
    • Assists in provisioning a certificate by using OpenSSL.
    • Uses aws_signing_helper (Credential Helper) to set up AWS CLI profiles by using the credential_process for IAM Roles Anywhere.
    • Assists application teams to access and update their application secrets after assuming an IAM role by using IAM Roles Anywhere.
  • A sample application stack (created and owned by the application/migration team). This is a sample serverless application that demonstrates the use of the solution. It deploys the following components, which indicate that your ABAC-based IAM strategy is working as expected and is effectively restricting access to secrets:
    • The sample application stack uses a VPC-deployed common environment infrastructure stack.
    • It deploys an Amazon Aurora MySQL serverless cluster in the PRIVATE_ISOLATED subnet and uses the secret that is created through a common environment infrastructure stack.
    • It deploys a sample Lambda function in the PRIVATE_WITH_NAT subnet.
    • It deploys two IAM roles for testing:
      • allowedRole (default role): When the application uses this role, it is able to use the GET action to get the secret and open a connection to the Aurora MySQL database.
      • Not allowedRole: When the application uses this role, it is unable to use the GET action to get the secret and open a connection to the Aurora MySQL database.

Prerequisites to deploy the sample solution

The following software packages need to be installed in your development environment before you deploy this solution:

Note: In this section, we provide examples of AWS CLI commands and configuration for Linux or macOS operating systems. For instructions on using AWS CLI on Windows, refer to the AWS CLI documentation.

Before deployment, make sure that the correct AWS credentials are configured in your terminal session. The credentials can be either in the environment variables or in ~/.aws. For more details, see Configuring the AWS CLI.

Next, use the following commands to set your AWS credentials to deploy the stack:

export AWS_ACCESS_KEY_ID=<>
export AWS_SECRET_ACCESS_KEY=<>
export AWS_REGION=<>

You can view the IAM credentials that are being used by your session by running the command aws sts get-caller-identity. If you are running the cdk command for the first time in your AWS account, you will need to run the following cdk bootstrap command to provision a CDK Toolkit stack that will manage the resources necessary to enable deployment of cloud applications with the AWS CDK.

cdk bootstrap aws://<AWS account number>/<Region> # Bootstrap CDK in the specified account and AWS Region

Select the applicable archetype and deploy the solution

This section outlines the design and deployment steps for two archetypes:

Archetype 1: Application is currently on premises

Archetype 1 has the following requirements:

  • The application is currently hosted on premises.
  • The application would consume API keys, stored credentials, and other secrets in Secrets Manager.

The application, environment and security teams work together to define a tagging strategy that will be used to restrict access to secrets. After this, the proposed workflow for each persona is as follows:

  1. The environment engineer deploys a common environment infrastructure stack (as described earlier in this post) to bootstrap the AWS account with secrets and IAM policy by using the supplied tagging requirement.
  2. Additionally, the environment engineer deploys the IAM Roles Anywhere infrastructure stack.
  3. The application developer updates the secrets required by the application by using the client-side utility (helper.sh).
  4. The application developer uses the client-side utility to update the AWS CLI profile to consume the IAM Roles Anywhere role from the on-premises servers.

    Figure 1 shows the workflow for Archetype 1.

    Figure 1: Application on premises connecting to Secrets Manager

To deploy Archetype 1

  1. (Actions by the application team persona) Clone the repository and update the tagging details at configs/tagconfig.json.

    Note: Do not modify the tag/attributes name/key, only modify value.

  2. (Actions by the environment team persona) Run the following command to deploy the common environment infrastructure stack.
    ./helper.sh prepare
    Then, run the following command to deploy the IAM Roles Anywhere infrastructure stack.
    ./helper.sh on-prem
  3. (Actions by the application team persona) Update the secret value of the dummy secrets provided by the environment team, by using the following command.
    ./helper.sh update-secret

    Note: This command will only update the secret if it’s still using the dummy value.

    Then, run the following command to set up the client and server on premises.
    ./helper.sh client-profile-setup

    Follow the command prompt. It will help you request a client certificate and update the AWS CLI profile.

    Important: When you request a client certificate, make sure to supply at least one distinguished name, like CommonName.

The sample output should look like the following.


--> This role can be used by the application by using the AWS CLI profile 'developer'.
--> For instance, the following output illustrates how to access secret values by using the AWS CLI profile 'developer'.
--> Sample AWS CLI: aws secretsmanager get-secret-value --secret-id $SECRET_ARN --profile developer

At this point, the client-side utility (helper.sh client-profile-setup) should have updated the AWS CLI configuration file with the following profile.

[profile developer]
region = <aws-region>
credential_process = /Users/<local-laptop-user>/.aws/aws_signing_helper credential-process
    --certificate /Users/<local-laptop-user>/.aws/client_cert.pem
    --private-key /Users/<local-laptop-user>/.aws/my_private_key.clear.key
    --trust-anchor-arn arn:aws:rolesanywhere:<aws-region>:444455556666:trust-anchor/a1b2c3d4-5678-90ab-cdef-EXAMPLE11111
    --profile-arn arn:aws:rolesanywhere:<aws-region>:444455556666:profile/a1b2c3d4-5678-90ab-cdef-EXAMPLE22222
    --role-arn arn:aws:iam::444455556666:role/RolesanywhereabacStack-onPremAppRole-1234567890ABC

To test Archetype 1 deployment

  • The application team can verify that the AWS CLI profile has been properly set up and is capable of retrieving secrets from Secrets Manager by running the following client-side utility command.
    ./helper.sh on-prem-test

This client-side utility (helper.sh) command verifies that the AWS CLI profile (for example, developer) has been set up for IAM Roles Anywhere and can run the GetSecretValue API action to retrieve the value of the secret stored in Secrets Manager.

The sample output should look like the following.

--> Checking credentials ...
{
    "UserId": "AKIAIOSFODNN7EXAMPLE:EXAMPLE11111EXAMPLEEXAMPLE111111",
    "Account": "444455556666",
    "Arn": "arn:aws:sts::444455556666:assumed-role/RolesanywhereabacStack-onPremAppRole-1234567890ABC"
}
--> Assume role worked for:
arn:aws:sts::444455556666:assumed-role/RolesanywhereabacStack-onPremAppRole-1234567890ABC
--> This role can be used by the application by using the AWS CLI profile 'developer'.
--> For instance, the following output illustrates how to access secret values by using the AWS CLI profile 'developer'.
--> Sample AWS CLI: aws secretsmanager get-secret-value --secret-id $SECRET_ARN --profile $PROFILE_NAME
-------Output-------
{
  "password": "randomuniquepassword",
  "servertype": "testserver1",
  "username": "testuser1"
}
-------Output-------

Archetype 2: Application has migrated to AWS

Archetype 2 has the following requirement:

  • Deploy a sample application to demonstrate how ABAC authorization works for Secrets Manager APIs.

The application, environment, and security teams work together to define a tagging strategy that will be used to restrict access to secrets. After this, the proposed workflow for each persona is as follows:

  1. The environment engineer deploys a common environment infrastructure stack to bootstrap the AWS account with secrets and an IAM policy by using the supplied tagging requirement.
  2. The application developer updates the secrets required by the application by using the client-side utility (helper.sh).
  3. The application developer tests the sample application to confirm operability of ABAC.

Figure 2 shows the workflow for Archetype 2.

Figure 2: Sample migrated application connecting to Secrets Manager

To deploy Archetype 2

  1. (Actions by the application team persona) Clone the repository and update the tagging details at configs/tagconfig.json.

    Note: Don’t modify the tag/attributes name/key, only modify value.

  2. (Actions by the environment team persona) Run the following command to deploy the common platform infrastructure stack.
    ./helper.sh prepare
  3. (Actions by the application team persona) Update the secret value of the dummy secrets provided by the environment team, using the following command.
    ./helper.sh update-secret

    Note: This command will only update the secret if it is still using the dummy value.

    Then, run the following command to deploy a sample app stack.
    ./helper.sh on-aws

    Note: If your secrets were migrated from a system that did not have the correct access controls, as a best security practice, you should rotate them at least once manually.

At this point, the client-side utility should have deployed a sample application Lambda function. This function connects to a MySQL database by using credentials stored in Secrets Manager. It retrieves the secret values, validates them, and establishes a connection to the database. The function returns a message that indicates whether the connection to the database is working or not.

To test Archetype 2 deployment

  • The application team can use the following client-side utility (helper.sh) to invoke the Lambda function and verify whether the connection is functional or not.
    ./helper.sh on-aws-test

The sample output should look like the following.

--> Check if AWS CLI is installed
--> AWS CLI found
--> Using tags to create Lambda function name and invoking a test
--> Checking the Lambda invoke response.....
--> The status code is 200
--> Reading response from test function:
"Connection to the DB is working."
--> Response shows database connection is working from Lambda function using secret.

Conclusion

Building an effective secrets management solution requires careful planning and implementation. AWS Secrets Manager can help you effectively manage the lifecycle of your secrets at scale. We encourage you to take an iterative approach to building your secrets management solution, starting by focusing on core functional requirements like managing access, defining audit requirements, and building preventative and detective controls for secrets management. In future iterations, you can improve your solution by implementing more advanced functionalities like automatic rotation or resource policies for secrets.

To read Part 1 of this series, go to Migrating your secrets to AWS, Part I: Discovery and design.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Secrets Manager re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Adesh Gairola

Adesh Gairola is a Senior Security Consultant at Amazon Web Services in Sydney, Australia. Adesh is eager to help customers build robust defenses, and design and implement security solutions that enable business transformations. He is always looking for new ways to help customers improve their security posture.

Eric Swamy

Eric is a Senior Security Consultant working in the Professional Services team in Sydney, Australia. He is passionate about helping customers build the confidence and technical capability to move their most sensitive workloads to cloud. When not at work, he loves to spend time with his family and friends outdoors, listen to music, and go on long walks.